
Monitoring

This page walks you through checking the resource status of the entire cluster and the state of GPU/NPU devices.

Click the Monitoring menu in the left sidebar to see system monitoring information.

Cluster Overview

Shows the cluster-wide average resource usage as cards and time-series graphs.

Use the time buttons (1h / 6h / 24h / 7d) in the upper right to adjust the time-series graph range.

[Image: Monitoring Cluster]

| Card | Description |
| --- | --- |
| CPU | Average CPU usage across cluster nodes |
| Memory | Average memory usage across cluster nodes |
| Disk | Disk usage across cluster nodes |
| Time-series graphs | Time series for the items above |

Device Overview

Shows the status of GPU/NPU accelerators.

[Image: Monitoring Device]

| Item | Description |
| --- | --- |
| Per-accelerator-type overview | Total / Allocated / Unallocated counts and average utilization |
| Accelerator type filter | Filter by accelerator type |
| Accelerator allocation status filter | Filter by accelerator allocation status |
| Node card | Per-node CPU / Memory / Disk usage and information on installed accelerators |
| Device card | Per-device Usage / Temp / Power / VRAM usage, plus assigned Pod information |
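
To make the per-accelerator-type overview concrete, the following is a minimal sketch of how Total / Allocated / Unallocated counts and average utilization could be derived from a list of devices. The `Device` record, its field names, and the choice to average utilization over all devices of a type (rather than only allocated ones) are assumptions for illustration; they are not the product's actual data model or API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    # Hypothetical record; field names are illustrative, not the product's schema.
    accel_type: str
    utilization: float                  # percent, 0-100
    assigned_pod: Optional[str] = None  # None means the device is unallocated


def summarize(devices):
    """Group devices by accelerator type and compute the overview numbers."""
    groups = defaultdict(list)
    for d in devices:
        groups[d.accel_type].append(d)

    summary = {}
    for accel_type, items in groups.items():
        allocated = sum(1 for d in items if d.assigned_pod is not None)
        summary[accel_type] = {
            "total": len(items),
            "allocated": allocated,
            "unallocated": len(items) - allocated,
            # Assumption: the average is taken over every device of this type.
            "avg_utilization": sum(d.utilization for d in items) / len(items),
        }
    return summary
```

Filtering by accelerator type or allocation status then amounts to selecting a subset of `devices` before calling `summarize`.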
Tip: Click a node card or a device card to see the time-series graph for that item.

The node detail view shows the CPU / Memory / Disk time-series graphs for the selected node.

[Image: Node detail graphs]


Responding to Anomalies

If you observe any of the following symptoms during monitoring, take the corresponding action.

| Symptom | Action |
| --- | --- |
| Device overheating | Check the Temperature Policy in Advanced Deployment Settings, and verify that auto scale-down or traffic restriction kicks in when the threshold is exceeded. |
| Accelerator utilization stuck at 100% | Increase Replicas in Deploy a Model, or adjust the Auto Scaling settings. |
| Node memory/disk exhaustion | Stop unnecessary Servings, or clean up unused Volumes in Storage. |
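
The symptom-to-action mapping above can be sketched as a small triage helper. This is only an illustration: the `Snapshot` record, the threshold constants, and the idea of treating "stuck at 100%" as every recent sample being at the limit are assumptions, not values or behavior defined by the product.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds chosen for illustration only.
TEMP_LIMIT_C = 85.0
UTIL_LIMIT_PCT = 100.0
NODE_LIMIT_PCT = 90.0


@dataclass
class Snapshot:
    # Hypothetical record of readings taken from the monitoring screens.
    temperature_c: float
    utilization_samples: List[float]  # recent accelerator utilization, percent
    node_memory_pct: float
    node_disk_pct: float


def suggest_actions(s: Snapshot) -> List[str]:
    """Return the actions from the table that match the observed symptoms."""
    actions = []
    if s.temperature_c >= TEMP_LIMIT_C:
        actions.append("Check the Temperature Policy in Advanced Deployment Settings")
    # "Stuck" is modeled as: there are samples, and all of them are at the limit.
    samples = s.utilization_samples
    if samples and all(u >= UTIL_LIMIT_PCT for u in samples):
        actions.append("Increase Replicas or adjust the Auto Scaling settings")
    if s.node_memory_pct >= NODE_LIMIT_PCT or s.node_disk_pct >= NODE_LIMIT_PCT:
        actions.append("Stop unnecessary Servings or clean up unused Volumes")
    return actions
```

For example, a snapshot with a hot device and a nearly full disk, but fluctuating utilization, would yield the first and third actions only.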