The pbs_mom daemon can now query GPU hardware information and report its status to the pbs_server. A gpu_status attribute appears in pbsnodes output. New commands allow setting GPU modes and resetting GPU ECC error counts.
This feature is only available in TORQUE 2.5.6, 3.0.2, and later.
This document assumes that you have installed the NVIDIA CUDA ToolKit and the NVIDIA development drivers on a compute node with an NVIDIA GPU. (Both can be downloaded from http://developer.nvidia.com/category/zone/cuda-zone).
You will want to download the latest version if you run into problems compiling.
If the pbs_server does not have GPUs, it only needs to be configured with --enable-nvidia-gpus. All other systems that have NVIDIA GPUs will need:
nvml.h is only found in the NVIDIA CUDA ToolKit.
Systems that have NVIDIA GPUs require the following:
Server
./configure --with-debug --enable-nvidia-gpus
Compute nodes (with NVIDIA GPUs)
./configure --with-debug --enable-nvidia-gpus --with-nvml-lib=/usr/lib64 --with-nvml-include=/cuda/NVML
If all of the compute nodes have the same hardware and software configuration, you can choose to compile on one compute node and then run make packages.
> make packages
Building ./torque-package-clients-linux-x86_64.sh ...
Building ./torque-package-mom-linux-x86_64.sh ...
Building ./torque-package-server-linux-x86_64.sh ...
Building ./torque-package-gui-linux-x86_64.sh ...
Building ./torque-package-devel-linux-x86_64.sh ...
Done.
The package files are self-extracting packages that can be copied and executed on your production machines. (Use --help for options.)
When updating, it is good practice to stop the pbs_server and make a backup of the TORQUE home directory. You will also want to back up the output of qmgr -c 'p s'. The update will only overwrite the binaries.
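The backup steps above could be scripted roughly as follows. This is a minimal sketch, not part of TORQUE: the function names save_server_config and archive_torque_home are hypothetical, and the paths in the usage comments (including /var/spool/torque as the TORQUE home) are assumptions to adjust for your site.

```shell
#!/bin/sh
# Sketch of a pre-update backup. Function names and paths are illustrative.

# Save the server configuration as qmgr commands.
# Run this while pbs_server is still up, because qmgr needs a running server.
save_server_config() {
    out_file=$1
    qmgr -c 'p s' > "$out_file"
}

# Archive the whole TORQUE home directory into a gzip-compressed tarball.
# $1 = TORQUE home directory, $2 = tarball to create.
archive_torque_home() {
    torque_home=$1
    tarball=$2
    tar -czf "$tarball" -C "$(dirname "$torque_home")" "$(basename "$torque_home")"
}

# Typical order (paths are assumptions):
#   save_server_config /root/qmgr-settings.txt
#   qterm -t quick                      # stop pbs_server
#   archive_torque_home /var/spool/torque /root/torque-home.tar.gz
```

Dumping the qmgr settings first and archiving only after the server has stopped keeps the saved configuration and the archived files consistent with each other.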
If you move GPU cards to different slots, you must restart pbs_server so that TORQUE recognizes the cards as the same ones in different locations rather than as two new, additional cards.
There are three configuration (./configure) options available for use with NVIDIA GPGPUs:
--enable-nvidia-gpus enables the new features for NVIDIA GPGPUs. By default, the pbs_moms use the nvidia-smi command to interface with the NVIDIA GPUs.
./configure --enable-nvidia-gpus
To use the NVML (NVIDIA Management Library) API instead of nvidia-smi, configure TORQUE using --with-nvml-lib=DIR and --with-nvml-include=DIR. These options specify the location of the libnvidia-ml library and the location of the nvml.h include file.
./configure --with-nvml-lib=/usr/lib --with-nvml-include=/usr/local/cuda/Tools/NVML
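Because configure needs both the libnvidia-ml library and nvml.h, it can help to verify the two directories before building. The following is a minimal sketch under stated assumptions: check_nvml is a hypothetical helper, not a TORQUE command, and it assumes the library file name begins with libnvidia-ml.so.

```shell
#!/bin/sh
# Sketch: confirm the NVML library and header exist before running configure.
# check_nvml is an illustrative helper; $1 = library dir, $2 = include dir.
check_nvml() {
    lib_dir=$1
    inc_dir=$2
    status=0

    # The glob matches libnvidia-ml.so as well as versioned names such as
    # libnvidia-ml.so.1 (file naming is an assumption about the driver install).
    if ! ls "$lib_dir"/libnvidia-ml.so* >/dev/null 2>&1; then
        echo "missing: libnvidia-ml in $lib_dir" >&2
        status=1
    fi

    if [ ! -f "$inc_dir/nvml.h" ]; then
        echo "missing: nvml.h in $inc_dir" >&2
        status=1
    fi

    return $status
}

# Usage, with the compute-node paths shown earlier on this page:
#   check_nvml /usr/lib64 /cuda/NVML &&
#       ./configure --enable-nvidia-gpus \
#           --with-nvml-lib=/usr/lib64 --with-nvml-include=/cuda/NVML
```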
server_priv/nodes:
node001 gpus=1
node002 gpus=4
…
pbsnodes -a
node001
…
gpus = 1
...
By default, when TORQUE is configured with --enable-nvidia-gpus, the $TORQUE_HOME/server_priv/nodes file is automatically updated with the correct GPU count for each MOM node.
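To check which GPU counts the server currently reports, the pbsnodes -a output can be reduced to one line per node. This is only a sketch: gpu_counts is a hypothetical helper, and it assumes the usual pbsnodes layout in which the node name stands alone on an unindented line and attributes appear as "name = value".

```shell
#!/bin/sh
# Sketch: print "<node> <gpu count>" for each node in pbsnodes -a output.
# gpu_counts is an illustrative helper that reads pbsnodes output on stdin.
gpu_counts() {
    awk '
        # A node name is a single word on an unindented line.
        NF == 1 && /^[^ \t]/ { node = $1; next }
        # Attribute lines look like: gpus = 4
        $1 == "gpus" && $2 == "=" { print node, $3 }
    '
}

# Usage:
#   pbsnodes -a | gpu_counts
```

Nodes that report no gpus attribute are simply omitted from the result.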
GPU modes for NVIDIA 260.x driver
GPU modes for NVIDIA 270.x driver
root@gpu:~# pbsnodes gpu
gpu
...
gpus = 2
gpu_status = gpu[1]=gpu_id=0:6:0;gpu_product_name=Tesla
C2050;gpu_display=Disabled;gpu_pci_device_id=6D110DE;gpu_pci_location_id=0:6:0;
gpu_fan_speed=54 %;gpu_memory_total=2687 Mb;gpu_memory_used=74
Mb;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=96
%;gpu_memory_utilization=10
%;gpu_ecc_mode=Enabled;gpu_single_bit_ecc_errors=0;gpu_double_bit_ecc_errors=
0;gpu_temperature=88 C,gpu[0]=gpu_id=0:5:0;gpu_product_name=Tesla
C2050;gpu_display=Enabled;gpu_pci_device_id=6D110DE;gpu_pci_location_id=0:5:0;
gpu_fan_speed=66 %;gpu_memory_total=2687 Mb;gpu_memory_used=136
Mb;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=96
%;gpu_memory_utilization=10
%;gpu_ecc_mode=Enabled;gpu_single_bit_ecc_errors=0;
gpu_double_bit_ecc_errors=0;gpu_temperature=86 C,driver_ver=270.41.06,timestamp=Wed May 4 13:00:35
2011
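The gpu_status value packs every GPU into a single string: entries are separated by commas, and each gpu[N] entry holds key=value pairs separated by semicolons. The sketch below pulls one field out per GPU. It is illustrative only: gpu_field is a hypothetical helper, and it assumes the value arrives as one unwrapped line and that no field value contains a comma or semicolon.

```shell
#!/bin/sh
# Sketch: extract one field per GPU from a gpu_status value read on stdin.
# gpu_field is an illustrative helper; $1 = field name, e.g. gpu_temperature.
gpu_field() {
    tr ',' '\n' | awk -v want="$1" '
        /^gpu\[[0-9]+\]=/ {
            # Pull the index N out of the gpu[N]= prefix.
            idx = $0
            sub(/\]=.*/, "", idx)
            sub(/^gpu\[/, "", idx)

            # Drop the prefix, then walk the key=value pairs.
            rest = $0
            sub(/^gpu\[[0-9]+\]=/, "", rest)
            n = split(rest, kv, ";")
            for (i = 1; i <= n; i++) {
                eq = index(kv[i], "=")
                if (eq && substr(kv[i], 1, eq - 1) == want)
                    print idx, substr(kv[i], eq + 1)
            }
        }'
}

# Usage (the sed expression assumes the "gpu_status = ..." line layout):
#   pbsnodes gpu | sed -n 's/^ *gpu_status = //p' | gpu_field gpu_temperature
```

Entries that do not start with gpu[N]=, such as driver_ver and timestamp, are skipped.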
qsub lets you specify the required compute mode when requesting GPUs. If no GPU mode is requested, the mode defaults to "exclusive" for NVIDIA driver version 260 or "exclusive_thread" for NVIDIA driver version 270 and above.
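The default rule above can be written out as a small helper that maps a driver version string, such as the driver_ver value in gpu_status, to the mode used when a job requests none. This is a sketch, not TORQUE code: default_gpu_mode is a hypothetical name, and versions the text does not cover are reported as unknown rather than guessed.

```shell
#!/bin/sh
# Sketch: map an NVIDIA driver version (e.g. 270.41.06) to the default
# GPU mode described above. default_gpu_mode is an illustrative helper.
default_gpu_mode() {
    major=${1%%.*}   # keep the part before the first dot

    # Reject empty or non-numeric major versions.
    case $major in
        ''|*[!0-9]*) echo unknown; return 1 ;;
    esac

    if [ "$major" -ge 270 ]; then
        echo exclusive_thread
    elif [ "$major" -eq 260 ]; then
        echo exclusive
    else
        # Not covered by the documented rule.
        echo unknown
        return 1
    fi
}

# Usage:
#   default_gpu_mode 270.41.06
#
# Requesting a mode explicitly (resource syntax as I understand it from
# TORQUE's GPU request options; job.sh is a hypothetical script name, so
# verify against your TORQUE version):
#   qsub -l nodes=1:gpus=1:exclusive_thread job.sh
```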