Cluster GPU Monitoring

1. Introduction

Uk8s uses the open-source component Dcgm-Exporter to collect GPU monitoring metrics, mainly:

  • GPU card utilization
  • Container GPU resource utilization

2. Deployment

2.1. If the Monitoring Center is Not Enabled

Enable the Monitoring Center first. Once it is enabled, the NVIDIA/DCGM/Exporter/Node and NVIDIA/DCGM/Exporter/Container dashboards are available on the Grafana page.

2.2. If the Monitoring Center is Already Enabled

⚠️ If the Monitoring Center version satisfies 1.0.5-3 <= version < 1.0.6, or version > 1.0.6, the deployment files below are installed by default and you can skip the rest of this section. Otherwise, complete the following deployment steps.
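
The version condition above can be written as a small check. This is a minimal sketch, not an official tool: the function names are hypothetical, and it assumes versions look like `1.0.5-3`, that the part after `-` is a numeric build number where higher means newer, and that a missing build number counts as 0. Read literally, the condition does not cover exactly 1.0.6, and the sketch follows that literal reading.

```python
def parse_version(version: str) -> tuple:
    """Split '1.0.5-3' into ((1, 0, 5), 3); a missing build suffix counts as 0 (assumption)."""
    core, _, build = version.partition("-")
    return tuple(int(part) for part in core.split(".")), int(build or 0)


def needs_manual_deployment(version: str) -> bool:
    """Return True when the deployment steps in section 2.2 still have to be applied."""
    v = parse_version(version)
    lower = parse_version("1.0.5-3")
    excluded = parse_version("1.0.6")
    # Pre-installed when 1.0.5-3 <= version < 1.0.6, or version > 1.0.6.
    preinstalled = (lower <= v < excluded) or (v > excluded)
    return not preinstalled


if __name__ == "__main__":
    for candidate in ("1.0.4", "1.0.5-2", "1.0.5-3", "1.0.6", "1.0.7"):
        print(candidate, needs_manual_deployment(candidate))
```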

2.2.1. Deployment of Dcgm-Exporter

kubectl apply -f https://docs.ucloud-global.com/uk8s/yaml/gpu-share/dcgm-exporter.yaml

2.2.2. Deployment of NVIDIA/DCGM/Exporter/Node Dashboard

Download the JSON file first, then log in to Grafana. Select '+' in the left navigation bar, choose Import, paste the downloaded JSON content into the second input box, and click Load.

2.2.3. Deployment of NVIDIA/DCGM/Exporter/Container Dashboard

⚠️ The official dashboard does not include container-level information. To view GPU metrics per container, import the dashboard provided by Uk8s.

Download the JSON file first, then log in to Grafana. Select '+' in the left navigation bar, choose Import, paste the downloaded JSON content into the second input box, and click Load.

3. Test

You can quickly start a GPU Pod with the following command. The Pod runs for a while and then exits. You can then check its GPU usage on the NVIDIA/DCGM/Exporter/Container dashboard in Grafana.

cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: dcgmproftester
spec:
  restartPolicy: OnFailure
  containers:
  - name: dcgmproftester11
    image: uhub.ucloud-global.com/uk8s/dcgmproftester
    args: ["--no-dcgm-validation", "-t 1004", "-d 120"]
    resources:
      limits:
        nvidia.com/gpu: 1
    securityContext:
      capabilities:
        add: ["SYS_ADMIN"]
EOF
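
To confirm outside Grafana that the test Pod's load is being reported, the exporter's Prometheus text output can be filtered by Pod name. The sketch below parses a hand-written sample of that output. The sample text, the helper names, and the `gpu`, `pod`, and `container` labels are assumptions for illustration; the labels actually exposed depend on the exporter version and configuration.

```python
import re

# Assumed sample of exporter output; real label sets vary by version and configuration.
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",pod="dcgmproftester",container="dcgmproftester11"} 98
DCGM_FI_DEV_GPU_UTIL{gpu="1",pod="",container=""} 0
DCGM_FI_DEV_FB_USED{gpu="0",pod="dcgmproftester",container="dcgmproftester11"} 1024
'''

LINE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Yield (metric_name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


def gpu_util_by_pod(text, pod):
    """Map GPU index to DCGM_FI_DEV_GPU_UTIL for samples attributed to the given Pod."""
    return {
        labels.get("gpu", "?"): value
        for name, labels, value in parse_samples(text)
        if name == "DCGM_FI_DEV_GPU_UTIL" and labels.get("pod") == pod
    }


if __name__ == "__main__":
    print(gpu_util_by_pod(SAMPLE, "dcgmproftester"))
```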

4. Dashboard Chart

| Dashboard | Grafana Chart | Function |
| --- | --- | --- |
| NVIDIA/DCGM/Exporter/Node | GPU Temperature | GPU card temperature |
| NVIDIA/DCGM/Exporter/Node | GPU Power Usage | GPU power consumption |
| NVIDIA/DCGM/Exporter/Node | GPU SM Clocks | GPU clock frequency |
| NVIDIA/DCGM/Exporter/Node | GPU Utilization | GPU utilization |
| NVIDIA/DCGM/Exporter/Node | Tensor Core Utilization | Fraction of cycles that Tensor Pipes are active |
| NVIDIA/DCGM/Exporter/Node | GPU Framebuffer Mem Used | GPU memory usage |
| NVIDIA/DCGM/Exporter/Node | GPU XID Error | GPU card dropping |
| NVIDIA/DCGM/Exporter/Container | GPU Utilization | Container GPU utilization |
| NVIDIA/DCGM/Exporter/Container | GPU Framebuffer Mem | Container GPU memory used and remaining |
| NVIDIA/DCGM/Exporter/Container | GPU Memory Usage | Container GPU memory usage rate |

5. Monitoring Rules

An alarm rule for GPU card dropping is configured by default. To add or change alarm rules, edit the rule resource with the following command.

kubectl -n uk8s-monitor edit prometheusrule uk8s-gpu
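
A new entry for the rule resource could be drafted as follows. This is an illustrative sketch only: the helper name, the alert name, the 85 °C threshold, and the 5-minute duration are hypothetical, and the field layout (`alert`, `expr`, `for`, `labels`, `annotations`) follows the usual Prometheus alerting rule structure rather than anything specific to the default `uk8s-gpu` rule.

```python
import json


def build_alert_rule(name, expr, duration, severity, summary):
    """Build one alerting rule entry in the usual Prometheus rule layout."""
    return {
        "alert": name,
        "expr": expr,
        "for": duration,
        "labels": {"severity": severity},
        "annotations": {"summary": summary},
    }


# Hypothetical example: alert when a GPU stays above 85 degrees for 5 minutes.
rule = build_alert_rule(
    name="GPUHighTemperature",
    expr="DCGM_FI_DEV_GPU_TEMP > 85",
    duration="5m",
    severity="warning",
    summary="GPU temperature above 85 degrees for 5 minutes",
)

if __name__ == "__main__":
    print(json.dumps(rule, indent=2))
```

Because JSON is a subset of YAML 1.2, the printed entry can be adapted into an item under a group's `rules:` list while editing the resource.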

6. Common DCGM Metrics

6.1. Utilization

| Metric Name | Metric Type | Unit | Meaning |
| --- | --- | --- | --- |
| DCGM_FI_DEV_GPU_UTIL | Gauge | % | GPU utilization |
| DCGM_FI_DEV_MEM_COPY_UTIL | Gauge | % | GPU memory bandwidth utilization |
| DCGM_FI_DEV_ENC_UTIL | Gauge | % | GPU encoder utilization |
| DCGM_FI_DEV_DEC_UTIL | Gauge | % | GPU decoder utilization |

6.2. Memory

On a GPU, the video memory is also called the frame buffer.

| Metric Name | Metric Type | Unit | Meaning |
| --- | --- | --- | --- |
| DCGM_FI_DEV_FB_FREE | Gauge | MiB | GPU frame buffer remaining |
| DCGM_FI_DEV_FB_USED | Gauge | MiB | GPU frame buffer used |
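
A memory usage percentage can be derived from these two gauges. The sketch below is an approximation under stated assumptions: it treats used plus free as the total, ignores any memory reserved by the driver, and is not necessarily the formula used by the Grafana dashboards. The function name is hypothetical.

```python
def fb_usage_percent(fb_used_mib: float, fb_free_mib: float) -> float:
    """Approximate frame buffer usage in percent from the used and free gauges (MiB)."""
    total = fb_used_mib + fb_free_mib
    if total <= 0:
        # No data reported for this GPU; avoid dividing by zero.
        return 0.0
    return 100.0 * fb_used_mib / total


if __name__ == "__main__":
    # 4096 MiB used and 12288 MiB free on a GPU with 16384 MiB in total.
    print(fb_usage_percent(4096, 12288))
```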

6.3. Frequency

| Metric Name | Metric Type | Unit | Meaning |
| --- | --- | --- | --- |
| DCGM_FI_DEV_SM_CLOCK | Gauge | MHz | GPU SM clock frequency |
| DCGM_FI_DEV_MEM_CLOCK | Gauge | MHz | GPU memory clock frequency |

6.4. Profiling

| Metric Name | Metric Type | Unit | Meaning |
| --- | --- | --- | --- |
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | Gauge | % | Fraction of time the graphics or compute engine is active within a time interval. |
| DCGM_FI_PROF_SM_ACTIVE | Gauge | % | Fraction of time at least one warp is active on an SM (Streaming Multiprocessor) within a time interval, averaged over all SMs. |
| DCGM_FI_PROF_SM_OCCUPANCY | Gauge | % | Ratio of warps resident on an SM to the maximum number of warps the SM can hold within a time interval, averaged over all SMs. |
| DCGM_FI_PROF_PIPE_TENSOR_ACTIVE | Gauge | % | Fraction of cycles the Tensor Pipes are active per unit time. |
| DCGM_FI_PROF_DRAM_ACTIVE | Gauge | % | Fraction of cycles with active memory copies (a cycle with a DRAM instruction counts as 100% for that cycle). |
| DCGM_FI_PROF_PIPE_FP64_ACTIVE | Gauge | % | Fraction of cycles the FP64 Pipes are active per unit time. |
| DCGM_FI_PROF_PIPE_FP32_ACTIVE | Gauge | % | Fraction of cycles the FP32 Pipes are active per unit time. |
| DCGM_FI_PROF_PIPE_FP16_ACTIVE | Gauge | % | Fraction of cycles the FP16 Pipes are active per unit time. |
| DCGM_FI_PROF_NVLINK_RX_BYTES | Counter | B/s | Data received via NVLink. |
| DCGM_FI_PROF_NVLINK_TX_BYTES | Counter | B/s | Data transmitted via NVLink. |
| DCGM_FI_PROF_PCIE_RX_BYTES | Counter | B/s | Bytes received via the PCIe bus. |
| DCGM_FI_PROF_PCIE_TX_BYTES | Counter | B/s | Bytes transmitted via the PCIe bus. |
| DCGM_FI_DEV_PCIE_REPLAY_COUNTER | Counter | Count | Number of retries on the GPU PCIe bus. |
| DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL | Counter | - | Total of the NVLink bandwidth counters across all lanes of the GPU. |

6.5. Temperature and Power

| Metric Name | Metric Type | Unit | Meaning |
| --- | --- | --- | --- |
| DCGM_FI_DEV_GPU_TEMP | Gauge | °C | Current GPU temperature |
| DCGM_FI_DEV_MEMORY_TEMP | Gauge | °C | Current GPU memory temperature |
| DCGM_FI_DEV_POWER_USAGE | Gauge | W | Current GPU power consumption |
| DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION | Counter | mJ | Total energy consumption since GPU startup |
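
Because the energy counter is cumulative and reported in millijoules, the average power over an interval follows from two readings: divide the difference by 1000 to get joules, then by the elapsed seconds to get watts. A minimal sketch with a hypothetical function name, assuming the counter did not reset between the two readings:

```python
def average_power_watts(energy_start_mj: float, energy_end_mj: float,
                        interval_seconds: float) -> float:
    """Average power in watts between two readings of a cumulative energy counter (mJ)."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta_joules = (energy_end_mj - energy_start_mj) / 1000.0
    return delta_joules / interval_seconds


if __name__ == "__main__":
    # 15,000,000 mJ consumed over 60 seconds.
    print(average_power_watts(1_000_000, 16_000_000, 60))
```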

6.6. XID Errors and Violations

| Metric Name | Metric Type | Unit | Meaning |
| --- | --- | --- | --- |
| DCGM_FI_DEV_XID_ERRORS | Gauge | - | Most recent XID error code |
| DCGM_CUSTOM_XID_ERRORS_TOTAL_COUNTER | Counter | - | Total number of error codes |
| DCGM_FI_DEV_POWER_VIOLATION | Counter | μs | Cumulative duration of violations due to power limits |
| DCGM_FI_DEV_THERMAL_VIOLATION | Counter | μs | Cumulative duration of violations due to thermal limits |
| DCGM_FI_DEV_SYNC_BOOST_VIOLATION | Counter | μs | Cumulative duration of violations due to sync boost limits |
| DCGM_FI_DEV_BOARD_LIMIT_VIOLATION | Counter | μs | Cumulative duration of violations due to board limits |
| DCGM_FI_DEV_LOW_UTIL_VIOLATION | Counter | μs | Cumulative duration of violations due to low utilization limits |
| DCGM_FI_DEV_RELIABILITY_VIOLATION | Counter | μs | Cumulative duration of violations due to board reliability limits |
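
The violation counters accumulate microseconds, so the share of an interval spent under a given limit is the counter increase divided by the interval length in microseconds. The sketch below uses a hypothetical function name and assumes that a reading lower than the previous one means the counter was reset and restarted from zero.

```python
def throttled_fraction(prev_us: float, curr_us: float, interval_seconds: float) -> float:
    """Fraction of the interval (0.0 to 1.0) spent in violation, from two counter readings in μs."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # A lower current reading is treated as a counter reset (assumption).
    delta_us = curr_us - prev_us if curr_us >= prev_us else curr_us
    return min(1.0, delta_us / (interval_seconds * 1_000_000))


if __name__ == "__main__":
    # The counter grew by 3 seconds' worth of microseconds during a 30-second interval.
    print(throttled_fraction(2_000_000, 5_000_000, 30))
```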

6.7. Disabled Memory Pages

| Metric Name | Metric Type | Unit | Meaning |
| --- | --- | --- | --- |
| DCGM_FI_DEV_RETIRED_SBE | Counter | Count | Memory pages disabled (retired) due to single-bit errors |
| DCGM_FI_DEV_RETIRED_DBE | Counter | Count | Memory pages disabled (retired) due to double-bit errors |

6.8. Others

| Metric Name | Metric Type | Unit | Meaning |
| --- | --- | --- | --- |
| DCGM_FI_DEV_VGPU_LICENSE_STATUS | Gauge | - | vGPU license status |
| DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS | Counter | - | Number of rows remapped due to uncorrectable errors |
| DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS | Counter | - | Number of rows remapped due to correctable errors |
| DCGM_FI_DEV_ROW_REMAP_FAILURE | Gauge | - | Whether row remapping failed |