23.1 GPGPU Metrics

GPGPU metrics can be collected for nodes that:

GPU metric tracking must be enabled in moab.cfg:

RMCFG[torque]  flags=RECORDGPUMETRICS
Note: Each node has one node-wide GPU metric (gpu_timestamp) plus nine metrics for each GPU device. To record them, increase the MAXGMETRIC value in moab.cfg by (maxgpudevices x gpumetrics) + 1, where maxgpudevices is the largest number of GPU devices on any single node. For example, if no node has more than 4 GPU devices, the increase is (4 x 9) + 1 = 37, so add 37 to the existing MAXGMETRIC value. This ensures that Moab has enough GMETRIC types to accommodate the GPU metrics once recording is enabled.
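The arithmetic in the note can be sketched in a few lines of Python. The function name and the starting MAXGMETRIC value of 10 are assumptions for illustration only; substitute the value from your own moab.cfg.

```python
# Metrics reported for each GPU device (gpuX_fan ... gpuX_temp).
METRICS_PER_GPU = 9
# One node-wide metric shared by all devices (gpu_timestamp).
NODE_WIDE_METRICS = 1


def required_gmetric_increase(max_gpu_devices: int) -> int:
    """Return how many extra GMETRIC types GPU metric recording needs."""
    if max_gpu_devices < 0:
        raise ValueError("max_gpu_devices must be non-negative")
    return max_gpu_devices * METRICS_PER_GPU + NODE_WIDE_METRICS


if __name__ == "__main__":
    current_maxgmetric = 10  # hypothetical existing value; check your moab.cfg
    increase = required_gmetric_increase(4)
    print(increase)                       # 37
    print(current_maxgmetric + increase)  # 47
```

With the hypothetical starting value of 10, a site whose largest node has 4 GPU devices would need MAXGMETRIC set to at least 47.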

23.1.1 GPU Metrics Map

The GPU metric names map is as follows (where X is the GPU number):

GMETRIC name as stored in Moab   Metric output
gpu_timestamp                    The time the data was collected, in epoch time
gpuX_fan                         The current fan speed as a percentage
gpuX_mem                         The total GPU memory in megabytes
gpuX_usedmem                     The total used GPU memory in megabytes
gpuX_util                        The GPU capability currently in use as a percentage
gpuX_memutil                     The GPU memory currently in use as a percentage
gpuX_ecc                         Whether ECC is enabled or disabled
gpuX_ecc1err                     The total number of ECC single-bit errors since the last counter reset
gpuX_ecc2err                     The total number of ECC double-bit errors since the last counter reset
gpuX_temp                        The current GPU temperature in Celsius

Note: The gpu_timestamp metric is global to all GPUs on the node and indicates the last time the driver collected information on the GPUs.
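As a rough illustration of how the X placeholder expands, the following Python sketch lists the GMETRIC names one would expect for a node with a given number of GPU devices. The function name is hypothetical, and zero-based device numbering is an assumption based on the gpu0 and gpu1 names that appear in Moab's output.

```python
from typing import List

# Per-device suffixes from the map above; X is replaced by the GPU index.
PER_GPU_SUFFIXES = [
    "fan", "mem", "usedmem", "util", "memutil",
    "ecc", "ecc1err", "ecc2err", "temp",
]


def expected_gmetric_names(gpu_count: int) -> List[str]:
    """Return the node-wide name plus nine names per GPU device."""
    names = ["gpu_timestamp"]
    for index in range(gpu_count):  # assumes devices are numbered from 0
        names.extend(f"gpu{index}_{suffix}" for suffix in PER_GPU_SUFFIXES)
    return names


if __name__ == "__main__":
    for name in expected_gmetric_names(2):
        print(name)
```

For two devices this yields 19 names, which matches the (maxgpudevices x gpumetrics) + 1 count.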

23.1.2 Example

$ mdiag -n -v --xml

<Data>
<node AGRES="GPUS=2;"
AVLCLASS="[test 8][batch 8]"
CFGCLASS="[test 8][batch 8]"
GMETRIC="gpu1_fan:59.00,gpu1_mem:2687.00,gpu1_usedmem:74.00,gpu1_util:94.00,gpu1_memutil:9.00,gpu1_ecc:0.00,gpu1_ecc1err:0.00,gpu1_ecc2err:0.00,gpu1_temp:89.00,gpu0_fan:70.00,gpu0_mem:2687.00,gpu0_usedmem:136.00,gpu0_util:94.00,gpu0_memutil:9.00,gpu0_ecc:0.00,gpu0_ecc1err:0.00,gpu0_ecc2err:0.00,gpu0_temp:89.00,gpu_timestamp:1304526680.00"
GRES="GPUS=2;"
LASTUPDATETIME="1304526518" LOAD="1.050000"
MAXJOB="0" MAXJOBPERUSER="0" MAXLOAD="0.000000" NODEID="gpu"
NODEINDEX="0" NODESTATE="Idle" OS="linux" OSLIST="linux"
PARTITION="makai" PRIORITY="0" PROCSPEED="0" RADISK="1"
RAMEM="5978" RAPROC="7" RASWAP="22722" RCDISK="1" RCMEM="5978"
RCPROC="8" RCSWAP="23493" RMACCESSLIST="makai" SPEED="1.000000"
STATMODIFYTIME="1304525679" STATTOTALTIME="315649"
STATUPTIME="315649"></node>
</Data>
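The GMETRIC attribute packs every metric into a single comma-separated string, so a script consuming this output typically has to split it up. The following is a minimal Python sketch of one way to do that, assuming the XML layout shown above. The function name and the "node" grouping key are hypothetical, and SAMPLE is a trimmed stand-in for real mdiag output.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def parse_gmetrics(xml_text):
    """Map node ID -> {"node": {...}, "gpu0": {...}, ...} from mdiag XML."""
    result = {}
    root = ET.fromstring(xml_text)
    for node in root.iter("node"):
        grouped = defaultdict(dict)
        for pair in node.get("GMETRIC", "").split(","):
            name, _, value = pair.partition(":")
            if not name.startswith("gpu"):
                continue  # skip empty entries and non-GPU generic metrics
            prefix, _, metric = name.partition("_")
            # gpu_timestamp has no device index, so file it under "node".
            device = "node" if prefix == "gpu" else prefix
            grouped[device][metric] = float(value)
        result[node.get("NODEID")] = dict(grouped)
    return result


# Trimmed stand-in for the output above (illustrative only).
SAMPLE = (
    '<Data><node NODEID="gpu" '
    'GMETRIC="gpu0_temp:89.00,gpu0_util:94.00,gpu_timestamp:1304526680.00">'
    '</node></Data>'
)

if __name__ == "__main__":
    print(parse_gmetrics(SAMPLE))
```

Grouping by device prefix keeps per-GPU values together, which makes it straightforward to check, for example, each device's temperature in turn.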

Copyright © 2012 Adaptive Computing Enterprises, Inc.®