GPU metrics can be collected for nodes that:
GPU metric tracking must be enabled in moab.cfg:
RMCFG[torque] flags=RECORDGPUMETRICS
There is one GPU metric for all GPU devices within a node (gpu_timestamp) and nine GPU metrics for each GPU device within a node. You must therefore increase the MAXGMETRIC value in moab.cfg by (maxgpudevices x gpumetrics) + 1, where maxgpudevices is the maximum number of GPU devices within any one node and gpumetrics is 9. For example, if the maximum number of GPU devices within a node is 4, the formula is (4 x 9) + 1 = 37, so whatever the current MAXGMETRIC value is, it must be increased by 37. This way, when GPU metrics recording is enabled, Moab has enough GMETRIC types to accommodate the GPU metrics.
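The sizing rule above can be sketched as a small calculation. This is only an illustration of the arithmetic; the function names and the starting MAXGMETRIC value of 10 are assumptions made for the example, not values taken from Moab.

```python
# Nine per-device metrics plus one node-wide metric (gpu_timestamp),
# as described in this topic.
GPU_METRICS_PER_DEVICE = 9
GLOBAL_GPU_METRICS = 1


def gmetric_increase(max_gpu_devices: int) -> int:
    """Return how much MAXGMETRIC must grow for the given GPU count per node."""
    if max_gpu_devices < 0:
        raise ValueError("max_gpu_devices must not be negative")
    return max_gpu_devices * GPU_METRICS_PER_DEVICE + GLOBAL_GPU_METRICS


def new_maxgmetric(current_maxgmetric: int, max_gpu_devices: int) -> int:
    """Return the MAXGMETRIC value to configure after enabling GPU metrics."""
    return current_maxgmetric + gmetric_increase(max_gpu_devices)


# Hypothetical site: MAXGMETRIC is currently 10, and no node has more than 4 GPUs.
print(gmetric_increase(4))    # (4 x 9) + 1 = 37
print(new_maxgmetric(10, 4))  # 10 + 37 = 47
```

The result of `new_maxgmetric` is the value you would then set for MAXGMETRIC in moab.cfg.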
The GPU metric names map is as follows (where X is the GPU number):
Metric name as returned by pbsnodes | GMETRIC name as stored in Moab | Metric output |
---|---|---|
timestamp | gpu_timestamp (global to all GPUs on the node; indicates the last time the driver collected information on the GPUs) | The time the data was collected, in epoch time |
gpu_fan_speed | gpuX_fan | The current fan speed as a percentage |
gpu_memory_total | gpuX_mem | The total GPU memory in megabytes |
gpu_memory_used | gpuX_usedmem | The total used GPU memory in megabytes |
gpu_utilization | gpuX_util | The GPU capability currently in use as a percentage |
gpu_memory_utilization | gpuX_memutil | The GPU memory currently in use as a percentage |
gpu_ecc_mode | gpuX_ecc | Whether ECC is enabled or disabled |
gpu_single_bit_ecc_errors | gpuX_ecc1err | The total number of ECC single-bit errors since the last counter reset |
gpu_double_bit_ecc_errors | gpuX_ecc2err | The total number of ECC double-bit errors since the last counter reset |
gpu_temperature | gpuX_temp | The current GPU temperature in degrees Celsius |
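
For scripts that need to translate between the two naming schemes, the table above can be expressed as a lookup. This is a minimal sketch: the dictionary and the `gmetric_name` helper are hypothetical names introduced for illustration and are not part of Moab or Torque.

```python
from typing import Optional

# pbsnodes metric name -> suffix of the per-device GMETRIC name (gpuX_<suffix>),
# copied from the mapping table in this topic.
PBSNODES_TO_GMETRIC_SUFFIX = {
    "gpu_fan_speed": "fan",
    "gpu_memory_total": "mem",
    "gpu_memory_used": "usedmem",
    "gpu_utilization": "util",
    "gpu_memory_utilization": "memutil",
    "gpu_ecc_mode": "ecc",
    "gpu_single_bit_ecc_errors": "ecc1err",
    "gpu_double_bit_ecc_errors": "ecc2err",
    "gpu_temperature": "temp",
}


def gmetric_name(pbsnodes_metric: str, gpu_index: Optional[int] = None) -> str:
    """Return the GMETRIC name for a pbsnodes metric on GPU number gpu_index."""
    if pbsnodes_metric == "timestamp":
        # gpu_timestamp is node-wide, so it carries no GPU number.
        return "gpu_timestamp"
    if gpu_index is None:
        raise ValueError("per-device metrics need a GPU number")
    suffix = PBSNODES_TO_GMETRIC_SUFFIX[pbsnodes_metric]
    return f"gpu{gpu_index}_{suffix}"


print(gmetric_name("gpu_fan_speed", 0))
print(gmetric_name("gpu_temperature", 1))
print(gmetric_name("timestamp"))
```

An unknown metric name raises `KeyError`, which is deliberate here so that a typo is not silently mapped to a wrong GMETRIC.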

$ mdiag -n -v --xml
<Data>
<node AGRES="GPUS=2;" AVLCLASS="[test 8][batch 8]" CFGCLASS="[test 8][batch 8]"
GMETRIC="gpu1_fan:59.00,gpu1_mem:2687.00,gpu1_usedmem:74.00,gpu1_util:94.00,gpu1_memutil:9.00,gpu1_ecc:0.00,gpu1_ecc1err:0.00,gpu1_ecc2err:0.00,gpu1_temp:89.00,gpu0_fan:70.00,gpu0_mem:2687.00,gpu0_usedmem:136.00,gpu0_util:94.00,gpu0_memutil:9.00,gpu0_ecc:0.00,gpu0_ecc1err:0.00,gpu0_ecc2err:0.00,gpu0_temp:89.00,gpu_timestamp:1304526680.00"
GRES="GPUS=2;" LASTUPDATETIME="1304526518" LOAD="1.050000" MAXJOB="0" MAXJOBPERUSER="0" MAXLOAD="0.000000"
NODEID="gpu" NODEINDEX="0" NODESTATE="Idle" OS="linux" OSLIST="linux" PARTITION="makai" PRIORITY="0" PROCSPEED="0"
RADISK="1" RAMEM="5978" RAPROC="7" RASWAP="22722" RCDISK="1" RCMEM="5978" RCPROC="8" RCSWAP="23493"
RMACCESSLIST="makai" SPEED="1.000000" STATMODIFYTIME="1304525679" STATTOTALTIME="315649" STATUPTIME="315649"></node>
</Data>
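
The GMETRIC attribute in the output above can be pulled apart with a short script. The sketch below assumes that GMETRIC is always a comma-separated list of name:value pairs with numeric values, as it appears in the sample output; `parse_gmetrics` is a hypothetical helper, and `SAMPLE` is an abbreviated copy of that output rather than a complete node record.

```python
import xml.etree.ElementTree as ET


def parse_gmetrics(xml_text: str) -> dict:
    """Map each NODEID to a {gmetric_name: float_value} dictionary."""
    root = ET.fromstring(xml_text)
    result = {}
    for node in root.iter("node"):
        metrics = {}
        # Assumed format: "name:value,name:value,..." (taken from the sample).
        for pair in node.get("GMETRIC", "").split(","):
            if not pair:
                continue  # node reports no generic metrics
            name, _, value = pair.partition(":")
            metrics[name] = float(value)
        result[node.get("NODEID")] = metrics
    return result


# Abbreviated from the mdiag output shown in this topic.
SAMPLE = (
    '<Data><node NODEID="gpu" '
    'GMETRIC="gpu1_fan:59.00,gpu0_temp:89.00,gpu_timestamp:1304526680.00">'
    "</node></Data>"
)

metrics = parse_gmetrics(SAMPLE)
print(metrics["gpu"]["gpu0_temp"])  # 89.0
print(sorted(metrics["gpu"]))
```

In practice the XML text would come from capturing the output of `mdiag -n -v --xml` rather than from a literal string.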