GPGPU metrics can be collected for nodes that:
GPU metric tracking must be enabled in moab.cfg:
RMCFG[torque] flags=RECORDGPUMETRICS
There is one GPU metric for all GPU devices within a node (gpu_timestamp) and nine GPU metrics for each GPU device within a node. You must therefore increase the MAXGMETRIC value in moab.cfg by (maxgpudevices x gpumetrics) + 1. For example, if the maximum number of GPU devices within a node is 4, the formula is (4 x 9) + 1 = 37, so whatever the current MAXGMETRIC value is, it must be increased by 37. This ensures that, when GPU metrics recording is enabled, Moab has enough GMETRIC types to accommodate the GPU metrics.
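The sizing rule above can be checked with a short calculation. The following is a minimal sketch, not part of Moab: the function names and the starting MAXGMETRIC value of 10 are assumptions chosen only for illustration.

```python
# Illustrative helper (not a Moab tool): computes how much MAXGMETRIC
# must grow to hold the GPU metrics described above.

PER_GPU_METRICS = 9    # nine metrics are recorded for each GPU device
NODE_WIDE_METRICS = 1  # gpu_timestamp, shared by all GPUs in a node


def gmetric_increase(max_gpu_devices: int) -> int:
    """Return (maxgpudevices x gpumetrics) + 1."""
    if max_gpu_devices < 0:
        raise ValueError("max_gpu_devices must not be negative")
    return max_gpu_devices * PER_GPU_METRICS + NODE_WIDE_METRICS


def new_maxgmetric(current_maxgmetric: int, max_gpu_devices: int) -> int:
    """Return the MAXGMETRIC value to set after adding the GPU metrics."""
    return current_maxgmetric + gmetric_increase(max_gpu_devices)


if __name__ == "__main__":
    current = 10  # hypothetical existing MAXGMETRIC value
    print(gmetric_increase(4))         # increase needed for 4 GPUs per node
    print(new_maxgmetric(current, 4))  # value to write to moab.cfg
```

With 4 GPU devices per node the increase is 37, matching the worked example in the text.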
The GPU metric names map is as follows (where X is the GPU number):
Metric name as returned by pbsnodes | GMETRIC name as stored in Moab | Metric output
---|---|---
timestamp | gpu_timestamp | The time the data was collected, in epoch time
gpu_fan_speed | gpuX_fan | The current fan speed as a percentage
gpu_memory_total | gpuX_mem | The total GPU memory in megabytes
gpu_memory_used | gpuX_usedmem | The total used GPU memory in megabytes
gpu_utilization | gpuX_util | The GPU capability currently in use as a percentage
gpu_memory_utilization | gpuX_memutil | The GPU memory currently in use as a percentage
gpu_ecc_mode | gpuX_ecc | Whether ECC is enabled or disabled
gpu_single_bit_ecc_errors | gpuX_ecc1err | The total number of ECC single-bit errors since the last counter reset
gpu_double_bit_ecc_errors | gpuX_ecc2err | The total number of ECC double-bit errors since the last counter reset
gpu_temperature | gpuX_temp | The current GPU temperature in Celsius
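The name mapping in the table can be expressed as a small lookup. This is a sketch for illustration only: the dictionary and the `gmetric_name` function are hypothetical helpers, not part of Moab or TORQUE, and they encode only the names listed in the table.

```python
from typing import Optional

# pbsnodes metric name -> suffix used in the Moab GMETRIC name (gpuX_<suffix>)
PBSNODES_TO_GMETRIC_SUFFIX = {
    "gpu_fan_speed": "fan",
    "gpu_memory_total": "mem",
    "gpu_memory_used": "usedmem",
    "gpu_utilization": "util",
    "gpu_memory_utilization": "memutil",
    "gpu_ecc_mode": "ecc",
    "gpu_single_bit_ecc_errors": "ecc1err",
    "gpu_double_bit_ecc_errors": "ecc2err",
    "gpu_temperature": "temp",
}


def gmetric_name(pbsnodes_name: str, gpu_index: Optional[int] = None) -> str:
    """Translate a pbsnodes metric name into its Moab GMETRIC name."""
    if pbsnodes_name == "timestamp":
        # One timestamp covers every GPU in the node, so it has no index.
        return "gpu_timestamp"
    if gpu_index is None:
        raise ValueError("per-GPU metrics need a GPU index")
    suffix = PBSNODES_TO_GMETRIC_SUFFIX[pbsnodes_name]  # KeyError if unknown
    return "gpu{}_{}".format(gpu_index, suffix)


if __name__ == "__main__":
    print(gmetric_name("gpu_temperature", 1))
    print(gmetric_name("timestamp"))
```

The dictionary holds exactly nine entries, one for each per-GPU metric, with the node-wide timestamp handled separately.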
$ mdiag -n -v --xml
<Data>
<node AGRES="GPUS=2;" AVLCLASS="[test 8][batch 8]" CFGCLASS="[test 8][batch 8]" GMETRIC="gpu1_fan:59.00,gpu1_mem:2687.00,gpu1_usedmem:74.00,gpu1_util:94.00,gpu1_memutil:9.00,gpu1_ecc:0.00,gpu1_ecc1err:0.00,gpu1_ecc2err:0.00,gpu1_temp:89.00,gpu0_fan:70.00,gpu0_mem:2687.00,gpu0_usedmem:136.00,gpu0_util:94.00,gpu0_memutil:9.00,gpu0_ecc:0.00,gpu0_ecc1err:0.00,gpu0_ecc2err:0.00,gpu0_temp:89.00,gpu_timestamp:1304526680.00" GRES="GPUS=2;" LASTUPDATETIME="1304526518" LOAD="1.050000" MAXJOB="0" MAXJOBPERUSER="0" MAXLOAD="0.000000" NODEID="gpu" NODEINDEX="0" NODESTATE="Idle" OS="linux" OSLIST="linux" PARTITION="makai" PRIORITY="0" PROCSPEED="0" RADISK="1" RAMEM="5978" RAPROC="7" RASWAP="22722" RCDISK="1" RCMEM="5978" RCPROC="8" RCSWAP="23493" RMACCESSLIST="makai" SPEED="1.000000" STATMODIFYTIME="1304525679" STATTOTALTIME="315649" STATUPTIME="315649"></node>
</Data>
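The GMETRIC attribute in the output above is a comma-separated list of name:value pairs, which makes it straightforward to post-process. The sketch below is one possible way to do that with the Python standard library. It assumes the XML has the shape shown in the example (a Data element containing node elements with NODEID and GMETRIC attributes), and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches per-GPU names such as gpu0_temp; gpu_timestamp has no index
# and therefore does not match.
PER_GPU_NAME = re.compile(r"gpu(\d+)_(\w+)")


def parse_gmetric(value):
    """Split a GMETRIC string into ({gpu_index: {metric: value}}, {node_metric: value})."""
    per_gpu = {}
    node_wide = {}
    for item in value.split(","):
        if not item:
            continue  # tolerate an empty attribute or trailing comma
        name, _, raw = item.partition(":")
        match = PER_GPU_NAME.fullmatch(name)
        if match:
            index, metric = int(match.group(1)), match.group(2)
            per_gpu.setdefault(index, {})[metric] = float(raw)
        else:
            node_wide[name] = float(raw)
    return per_gpu, node_wide


def gpu_metrics_by_node(xml_text):
    """Map each NODEID in the XML text to its parsed GPU metrics."""
    root = ET.fromstring(xml_text)
    return {
        node.get("NODEID"): parse_gmetric(node.get("GMETRIC", ""))
        for node in root.iter("node")
    }


if __name__ == "__main__":
    # Shortened sample in the same shape as the mdiag output above.
    sample = (
        '<Data><node GMETRIC="gpu1_temp:89.00,gpu0_temp:89.00,'
        'gpu_timestamp:1304526680.00" NODEID="gpu"></node></Data>'
    )
    per_gpu, node_wide = gpu_metrics_by_node(sample)["gpu"]
    for index in sorted(per_gpu):
        print(index, per_gpu[index])
    print(node_wide)
```

Grouping by GPU index keeps the node-wide gpu_timestamp separate from the per-device values, mirroring the one-plus-nine layout of the metrics.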
Copyright © 2012 Adaptive Computing Enterprises, Inc.®