pbs_mom can now query GPU hardware information and report its status to pbs_server. gpu_status will appear in the pbsnodes output. New commands allow setting GPU modes and resetting GPU ECC error counts.
This feature is only available in TORQUE 2.5.6, 3.0.2, and later.
This document assumes that you have installed the NVIDIA CUDA Toolkit and the NVIDIA development drivers on a compute node with an NVIDIA GPU. (Both can be downloaded from http://developer.nvidia.com/category/zone/cuda-zone).
If you run into problems compiling, download the latest version of both.
If the pbs_server does not have GPUs, it only needs to be configured with --enable-nvidia-gpus. All other systems that have NVIDIA GPUs will need:
For example, you would configure a pbs_server that does not have GPUs, but that will manage compute nodes with NVIDIA GPUs, in this way:
Server
./configure --with-debug --enable-nvidia-gpus
Compute nodes (with NVIDIA GPUs)
./configure --with-debug --enable-nvidia-gpus --with-nvml-lib=/usr/lib64 --with-nvml-include=/cuda/NVML
If all of the compute nodes have the same hardware and software configuration, you can choose to compile on one compute node and then run make packages.
> make packages
Building ./torque-package-clients-linux-x86_64.sh ...
Building ./torque-package-mom-linux-x86_64.sh ...
Building ./torque-package-server-linux-x86_64.sh ...
Building ./torque-package-gui-linux-x86_64.sh ...
Building ./torque-package-devel-linux-x86_64.sh ...
Done.
The package files are self-extracting packages that can be copied and executed on your production machines. (Use --help for options.)
For more information, see Compute nodes.
When updating, it is good practice to stop pbs_server and make a backup of the TORQUE home directory. You will also want to back up the output of qmgr -c 'p s'. The update only overwrites the binaries.
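The backup step described above could be scripted roughly as follows. This is a minimal sketch, not part of TORQUE: backup_torque is a hypothetical helper name, and the file names it writes are illustrative. Because qmgr talks to a running server, the configuration dump would normally be taken before pbs_server is stopped.

```shell
# backup_torque is a hypothetical helper: it archives the TORQUE home
# directory and, when qmgr is available, saves the server configuration.
backup_torque() {
    torque_home=$1   # TORQUE home directory to archive
    backup_dir=$2    # directory that receives the backup files

    mkdir -p "$backup_dir" || return 1

    # Archive the whole TORQUE home directory, relative to its parent.
    tar -czf "$backup_dir/torque-home.tar.gz" \
        -C "$(dirname "$torque_home")" "$(basename "$torque_home")" || return 1

    # Save the server configuration if qmgr is installed on this machine.
    if command -v qmgr >/dev/null 2>&1; then
        qmgr -c 'p s' > "$backup_dir/qmgr-server.txt" \
            || echo "warning: qmgr dump failed (is pbs_server running?)" >&2
    else
        echo "qmgr not found; skipping server configuration dump" >&2
    fi
}
```

A call might look like backup_torque /var/spool/torque /root/torque-backup, assuming TORQUE home is /var/spool/torque on your system; both paths are examples only.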
For further details, see these topics:
There are three configuration (./configure) options available for use with NVIDIA GPGPUs:
--enable-nvidia-gpus is used to enable the new features for the NVIDIA GPGPUs. By default, the pbs_moms use the nvidia-smi command to interface with the NVIDIA GPUs.
./configure --enable-nvidia-gpus
To use the NVML (NVIDIA Management Library) API instead of nvidia-smi, configure TORQUE using --with-nvml-lib=DIR and --with-nvml-include=DIR. These options specify the location of the libnvidia-ml library and the location of the nvml.h include file.
./configure --with-nvml-lib=/usr/lib --with-nvml-include=/usr/local/cuda/Tools/NVML

server_priv/nodes:
node001 gpus=1
node002 gpus=4
…

pbsnodes -a
node001
…
gpus = 1
...
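On a mixed cluster it can help to decide per machine whether the NVML options apply. The sketch below is one way to do that; gpu_configure_flags is a hypothetical helper, and the check for a file named libnvidia-ml.so* is an assumption about how the library is installed on your system.

```shell
# gpu_configure_flags is a hypothetical helper: it prints ./configure options,
# adding the NVML options only when both the library and the header are found.
gpu_configure_flags() {
    nvml_lib_dir=$1       # directory expected to contain libnvidia-ml.so*
    nvml_include_dir=$2   # directory expected to contain nvml.h

    flags="--enable-nvidia-gpus"
    if ls "$nvml_lib_dir"/libnvidia-ml.so* >/dev/null 2>&1 \
        && [ -f "$nvml_include_dir/nvml.h" ]; then
        flags="$flags --with-nvml-lib=$nvml_lib_dir --with-nvml-include=$nvml_include_dir"
    fi
    printf '%s\n' "$flags"
}
```

It could then be used as ./configure $(gpu_configure_flags /usr/lib64 /cuda/NVML), left unquoted on purpose so the shell splits the result into separate options. When NVML is not found, only --enable-nvidia-gpus is passed and pbs_mom falls back to nvidia-smi.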
By default, when TORQUE is configured with --enable-nvidia-gpus, the $TORQUE_HOME/server_priv/nodes file is automatically updated with the correct GPU count for each MOM node.
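To check the GPU counts that were written, the nodes file can be summarized with a short script. This is only a sketch: list_node_gpus is a hypothetical helper, and it assumes each entry is a node name followed by space-separated attributes such as gpus=N.

```shell
# list_node_gpus is a hypothetical helper: it reads a nodes file and prints
# "<node> <gpu count>" for every entry that carries a gpus=N attribute.
list_node_gpus() {
    awk '
        /^[ \t]*(#|$)/ { next }            # skip comments and blank lines
        {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^gpus=[0-9]+$/) {
                    sub(/^gpus=/, "", $i)  # keep only the number
                    print $1, $i
                }
        }
    ' "$1"
}
```

For example, list_node_gpus "$TORQUE_HOME/server_priv/nodes" would list one line per GPU node, provided TORQUE_HOME is set in your environment.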
GPU modes for NVIDIA 260.x driver
GPU modes for NVIDIA 270.x driver
root@gpu:~# pbsnodes gpu
gpu
...
gpus = 2
gpu_status = gpu[1]=gpu_id=0:6:0;gpu_product_name=Tesla C2050;gpu_display=Disabled;gpu_pci_device_id=6D110DE;gpu_pci_location_id=0:6:0;gpu_fan_speed=54 %;gpu_memory_total=2687 Mb;gpu_memory_used=74 Mb;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=96 %;gpu_memory_utilization=10 %;gpu_ecc_mode=Enabled;gpu_single_bit_ecc_errors=0;gpu_double_bit_ecc_errors=0;gpu_temperature=88 C,gpu[0]=gpu_id=0:5:0;gpu_product_name=Tesla C2050;gpu_display=Enabled;gpu_pci_device_id=6D110DE;gpu_pci_location_id=0:5:0;gpu_fan_speed=66 %;gpu_memory_total=2687 Mb;gpu_memory_used=136 Mb;gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=96 %;gpu_memory_utilization=10 %;gpu_ecc_mode=Enabled;gpu_single_bit_ecc_errors=0;gpu_double_bit_ecc_errors=0;gpu_temperature=86 C,driver_ver=270.41.06,timestamp=Wed May 4 13:00:35 2011
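The gpu_status value is dense, so a small script can make it easier to read. The sketch below assumes the layout seen in the sample output: comma-separated entries, with each gpu[N] entry holding semicolon-separated key=value fields. gpu_status_fields is a hypothetical helper, not a TORQUE command, and it would mis-split a value that itself contains a comma.

```shell
# gpu_status_fields is a hypothetical helper: it takes the text that follows
# "gpu_status = " and prints one field per line, prefixed by its GPU index.
gpu_status_fields() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r entry; do
        case $entry in
            gpu\[*\]=*)
                index=${entry%%=*}   # for example gpu[1]
                rest=${entry#*=}     # the semicolon-separated fields
                printf '%s\n' "$rest" | tr ';' '\n' | sed "s/^/$index /"
                ;;
            *)
                # Node-level entries such as driver_ver and timestamp.
                printf 'node %s\n' "$entry"
                ;;
        esac
    done
}
```

A possible use, assuming the gpu_status line in the pbsnodes output is indented with spaces, is gpu_status_fields "$(pbsnodes gpu | sed -n 's/^ *gpu_status = //p')" | grep gpu_temperature, which would show one temperature line per GPU.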
qsub allows you to specify the required compute mode when requesting GPUs. If no GPU mode is requested, it defaults to "exclusive" for NVIDIA driver version 260 or "exclusive_thread" for NVIDIA driver version 270 and above.
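As an illustration, a GPU request with a compute mode could be assembled like this. The sketch assumes the mode is appended to the node specification as nodes=N:gpus=M:mode in a qsub -l request; check this form against the qsub documentation for your TORQUE version. gpu_resource_request is a hypothetical helper and does not validate the mode name.

```shell
# gpu_resource_request is a hypothetical helper: it builds the resource string
# for "qsub -l". The nodes=N:gpus=M:mode form is an assumption, see the text.
gpu_resource_request() {
    nodes=$1       # number of nodes
    gpus=$2        # GPUs per node
    mode=${3:-}    # optional compute mode, for example exclusive_thread

    request="nodes=$nodes:gpus=$gpus"
    if [ -n "$mode" ]; then
        request="$request:$mode"
    fi
    printf '%s\n' "$request"
}
```

A job could then be submitted with qsub -l "$(gpu_resource_request 1 1 exclusive_thread)" job.sh, where job.sh is a placeholder for your job script. Leaving the mode out lets TORQUE apply the driver-dependent default described above.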
© 2012 Adaptive Computing