20.4 GPU Metrics

GPU metrics can be collected for nodes that:

GPU metric tracking must be enabled in moab.cfg:

RMCFG[torque]  flags=RECORDGPUMETRICS

Each node reports one GPU metric that is shared by all of its GPU devices (gpu_timestamp) plus nine GPU metrics for each GPU device. Before enabling GPU metrics recording, increase the MAXGMETRIC value in moab.cfg by (maxgpudevices x gpumetrics) + 1, where maxgpudevices is the largest number of GPU devices within any one node and gpumetrics is 9. For example, if the maximum number of GPU devices within a node is 4, the formula is (4 x 9) + 1 = 37, so whatever the current MAXGMETRIC value is, it must be increased by 37. This ensures that Moab has enough GMETRIC types to accommodate the GPU metrics.
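The sizing rule above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function names are hypothetical helpers and are not part of Moab, and the starting MAXGMETRIC value in the usage line is an arbitrary placeholder rather than a documented default.

```python
# Hypothetical helpers (not part of Moab) that reproduce the sizing formula
# (maxgpudevices x gpumetrics) + 1 described in this section.

GPU_METRICS_PER_DEVICE = 9  # per-device metrics (fan, mem, usedmem, ...)
NODE_WIDE_GPU_METRICS = 1   # gpu_timestamp, shared by all GPUs on a node


def gmetric_increase(max_gpu_devices: int) -> int:
    """Return how much MAXGMETRIC must grow for the given GPU count per node."""
    if max_gpu_devices < 0:
        raise ValueError("max_gpu_devices must be non-negative")
    return max_gpu_devices * GPU_METRICS_PER_DEVICE + NODE_WIDE_GPU_METRICS


def new_maxgmetric(current_maxgmetric: int, max_gpu_devices: int) -> int:
    """Return the MAXGMETRIC value to set after accounting for GPU metrics."""
    return current_maxgmetric + gmetric_increase(max_gpu_devices)


if __name__ == "__main__":
    # 4 GPUs per node, as in the example above: (4 x 9) + 1 = 37
    print(gmetric_increase(4))
    # 10 is an arbitrary placeholder for the site's current MAXGMETRIC value
    print(new_maxgmetric(10, 4))
```

Use the largest GPU count found on any single node, not the cluster-wide total, when calling the helper.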

20.4-A GPU Metrics Map

The GPU metric names map is as follows (where X is the GPU number):

Metric name as returned by pbsnodes   GMETRIC name as stored in Moab   Metric output
-----------------------------------   ------------------------------   -------------
timestamp                             gpu_timestamp                    The time the data was collected, in epoch time
gpu_fan_speed                         gpuX_fan                         The current fan speed as a percentage
gpu_memory_total                      gpuX_mem                         The total GPU memory in megabytes
gpu_memory_used                       gpuX_usedmem                     The total used GPU memory in megabytes
gpu_utilization                       gpuX_util                        The GPU capability currently in use as a percentage
gpu_memory_utilization                gpuX_memutil                     The GPU memory currently in use as a percentage
gpu_ecc_mode                          gpuX_ecc                         Whether ECC is enabled or disabled
gpu_single_bit_ecc_errors             gpuX_ecc1err                     The total number of ECC single-bit errors since the last counter reset
gpu_double_bit_ecc_errors             gpuX_ecc2err                     The total number of ECC double-bit errors since the last counter reset
gpu_temperature                       gpuX_temp                        The current GPU temperature in Celsius

The gpu_timestamp metric is global to all GPUs on the node and indicates the last time the driver collected information on the GPUs.
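The name mapping above can also be written as a small lookup, which may help when cross-checking pbsnodes output against Moab GMETRIC names. This is a sketch only: the dictionary and function names are hypothetical, and the code simply encodes the table in this section rather than any Moab or TORQUE API.

```python
# Hypothetical lookup (not a Moab or TORQUE API) that encodes the metrics map:
# pbsnodes metric name -> suffix used in the Moab GMETRIC name gpuX_<suffix>.
PBSNODES_TO_GMETRIC_SUFFIX = {
    "gpu_fan_speed": "fan",
    "gpu_memory_total": "mem",
    "gpu_memory_used": "usedmem",
    "gpu_utilization": "util",
    "gpu_memory_utilization": "memutil",
    "gpu_ecc_mode": "ecc",
    "gpu_single_bit_ecc_errors": "ecc1err",
    "gpu_double_bit_ecc_errors": "ecc2err",
    "gpu_temperature": "temp",
}


def moab_gmetric_name(pbsnodes_name, gpu_index=None):
    """Return the Moab GMETRIC name for a pbsnodes metric name.

    gpu_index is the GPU number (the X in gpuX_...). It is ignored for
    "timestamp", which maps to the node-wide gpu_timestamp metric.
    """
    if pbsnodes_name == "timestamp":
        return "gpu_timestamp"
    if gpu_index is None or gpu_index < 0:
        raise ValueError("a non-negative gpu_index is required for per-device metrics")
    suffix = PBSNODES_TO_GMETRIC_SUFFIX[pbsnodes_name]  # KeyError if unknown
    return f"gpu{gpu_index}_{suffix}"


if __name__ == "__main__":
    print(moab_gmetric_name("gpu_fan_speed", 0))   # gpu0_fan
    print(moab_gmetric_name("gpu_temperature", 1)) # gpu1_temp
    print(moab_gmetric_name("timestamp"))          # gpu_timestamp
```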

Example 20-1: GPU example

$ mdiag -n -v --xml

<Data>
<node AGRES="GPUS=2;"
AVLCLASS="[test 8][batch 8]"
CFGCLASS="[test 8][batch 8]"
GMETRIC="gpu1_fan:59.00,gpu1_mem:2687.00,gpu1_usedmem:74.00,gpu1_util:94.00,gpu1_memutil:9.00,gpu1_ecc:0.00,gpu1_ecc1err:0.00,gpu1_ecc2err:0.00,gpu1_temp:89.00,gpu0_fan:70.00,gpu0_mem:2687.00,gpu0_usedmem:136.00,gpu0_util:94.00,gpu0_memutil:9.00,gpu0_ecc:0.00,gpu0_ecc1err:0.00,gpu0_ecc2err:0.00,gpu0_temp:89.00,gpu_timestamp:1304526680.00"
GRES="GPUS=2;"
LASTUPDATETIME="1304526518" LOAD="1.050000"
MAXJOB="0" MAXJOBPERUSER="0" MAXLOAD="0.000000" NODEID="gpu"
NODEINDEX="0" NODESTATE="Idle" OS="linux" OSLIST="linux"
PARTITION="makai" PRIORITY="0" PROCSPEED="0" RADISK="1"
RAMEM="5978" RAPROC="7" RASWAP="22722" RCDISK="1" RCMEM="5978"
RCPROC="8" RCSWAP="23493" RMACCESSLIST="makai" SPEED="1.000000"
STATMODIFYTIME="1304525679" STATTOTALTIME="315649"
STATUPTIME="315649"></node>
</Data>
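Output in this form can be post-processed with a short script. The following sketch, which assumes the XML has the same shape as the example above (a Data element containing node elements with NODEID and GMETRIC attributes), splits the GMETRIC attribute into name/value pairs. The function name and the abbreviated SAMPLE string are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET


def parse_gmetrics(xml_text):
    """Return {NODEID: {gmetric_name: float_value}} for each node element.

    Assumes GMETRIC is a comma-separated list of name:value pairs, as in
    the mdiag example output shown in this section.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for node in root.iter("node"):
        metrics = {}
        for pair in node.get("GMETRIC", "").split(","):
            if not pair:
                continue  # node reported no generic metrics
            name, _, value = pair.partition(":")
            metrics[name] = float(value)
        result[node.get("NODEID")] = metrics
    return result


# Abbreviated sample built from values in the example output above.
SAMPLE = (
    '<Data><node NODEID="gpu" '
    'GMETRIC="gpu0_temp:89.00,gpu0_util:94.00,gpu_timestamp:1304526680.00">'
    "</node></Data>"
)

if __name__ == "__main__":
    for node_id, metrics in parse_gmetrics(SAMPLE).items():
        for name, value in sorted(metrics.items()):
            print(node_id, name, value)
```

In practice the input text would be the captured output of mdiag -n -v --xml rather than the inline sample.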