
20.3 NVIDIA GPUs

The pbs_mom daemon can now query for GPU hardware information and report status to pbs_server. A gpustatus attribute appears in pbsnodes output. New commands allow setting GPU modes and resetting GPU ECC error counts.

This feature is only available in TORQUE 2.5.6, 3.0.2, and later.

This document assumes that you have installed the NVIDIA CUDA Toolkit and the NVIDIA development drivers on a compute node with an NVIDIA GPU. (Both can be downloaded from http://developer.nvidia.com/category/zone/cuda-zone). If you run into problems compiling, download the latest version.
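As a quick sanity check before building TORQUE, verify that the driver can see the GPU (assuming nvidia-smi is on the PATH; this is the same command the pbs_mom uses by default):

nvidia-smi -q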

If the pbs_server host does not have GPUs, it only needs to be configured with --enable-nvidia-gpus. Systems that have NVIDIA GPUs require the following configure options:

Server

./configure --with-debug --enable-nvidia-gpus

Compute nodes (with NVIDIA GPUs)

./configure --with-debug --enable-nvidia-gpus --with-nvml-lib=/usr/lib64 --with-nvml-include=/cuda/NVML

If all of the compute nodes have the same hardware and software configuration, you can choose to compile on one compute node and then run make packages.

> make packages

Building ./torque-package-clients-linux-x86_64.sh ...

Building ./torque-package-mom-linux-x86_64.sh ...

Building ./torque-package-server-linux-x86_64.sh ...

Building ./torque-package-gui-linux-x86_64.sh ...

Building ./torque-package-devel-linux-x86_64.sh ...

Done.

The package files are self-extracting packages that can be copied and executed on your production machines. (Use --help for options.)
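For example, to deploy the MOM package to a compute node (the hostname and destination path are illustrative):

scp torque-package-mom-linux-x86_64.sh node001:/tmp
ssh node001 /tmp/torque-package-mom-linux-x86_64.sh --install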

When updating, it is good practice to stop the pbs_server and make a backup of the TORQUE home directory. You will also want to back up the output of qmgr -c 'p s'. The update will only overwrite the binaries.
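A minimal backup sequence might look like the following (output paths are illustrative; $TORQUE_HOME is commonly /var/spool/torque):

qmgr -c 'p s' > /root/torque-server-settings.txt
qterm
tar -czf /root/torque-home-backup.tar.gz /var/spool/torque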

If you move GPU cards to different slots, you must restart pbs_server so that TORQUE recognizes them as the same cards in new locations rather than as two new, additional cards.

For further details, see these topics:

TORQUE configuration
GPU modes for NVIDIA 260.x driver
GPU modes for NVIDIA 270.x driver
gpu_status
New NVIDIA GPU support

TORQUE configuration

There are three configuration (./configure) options available for use with NVIDIA GPUs: --enable-nvidia-gpus, --with-nvml-lib=DIR, and --with-nvml-include=DIR.

--enable-nvidia-gpus enables the new features for NVIDIA GPUs. By default, the pbs_mom daemons use the nvidia-smi command to interface with the GPUs.

./configure --enable-nvidia-gpus

To use the NVML (NVIDIA Management Library) API instead of nvidia-smi, configure TORQUE using --with-nvml-lib=DIR and --with-nvml-include=DIR. These options specify the location of the libnvidia-ml library and the location of the nvml.h include file.

./configure --with-nvml-lib=/usr/lib --with-nvml-include=/usr/local/cuda/Tools/NVML

Specify the GPU count for each node in the server nodes file; pbsnodes then reports the count. For example:

server_priv/nodes:

node001 gpus=1

node002 gpus=4

pbsnodes -a

node001

    …

    gpus = 1

...

By default, when TORQUE is configured with --enable-nvidia-gpus, the $TORQUE_HOME/server_priv/nodes file is automatically updated with the correct GPU count for each MOM node.
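You can confirm the recorded counts once the MOMs have reported in (the path assumes the default $TORQUE_HOME of /var/spool/torque):

cat /var/spool/torque/server_priv/nodes
pbsnodes -a | grep gpus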

GPU modes for NVIDIA 260.x driver

0 (Default) - shared mode available for multiple processes
1 (Exclusive) - only one COMPUTE thread is allowed to run on the GPU
2 (Prohibited) - no COMPUTE contexts are allowed to run on the GPU

GPU modes for NVIDIA 270.x driver

0 (Default) - shared mode available for multiple processes
1 (Exclusive Thread) - only one COMPUTE thread is allowed to run on the GPU (the 260.x exclusive mode)
2 (Prohibited) - no COMPUTE contexts are allowed to run on the GPU
3 (Exclusive Process) - only one COMPUTE process is allowed to run on the GPU
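Independent of TORQUE, the compute mode can also be set directly with nvidia-smi. Flag syntax varies across driver versions; on 260.x/270.x-era tools, -g selects the GPU and the numeric argument to -c is the mode number from the lists above, so the following would put GPU 0 into exclusive mode:

nvidia-smi -g 0 -c 1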

gpu_status

root@gpu:~# pbsnodes gpu

gpu

...

    gpus = 2

    gpu_status = gpu[1]=gpu_id=0:6:0;gpu_product_name=Tesla C2050;
        gpu_display=Disabled;gpu_pci_device_id=6D110DE;gpu_pci_location_id=0:6:0;
        gpu_fan_speed=54 %;gpu_memory_total=2687 Mb;gpu_memory_used=74 Mb;
        gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=96 %;
        gpu_memory_utilization=10 %;gpu_ecc_mode=Enabled;
        gpu_single_bit_ecc_errors=0;gpu_double_bit_ecc_errors=0;gpu_temperature=88 C,
    gpu[0]=gpu_id=0:5:0;gpu_product_name=Tesla C2050;
        gpu_display=Enabled;gpu_pci_device_id=6D110DE;gpu_pci_location_id=0:5:0;
        gpu_fan_speed=66 %;gpu_memory_total=2687 Mb;gpu_memory_used=136 Mb;
        gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=96 %;
        gpu_memory_utilization=10 %;gpu_ecc_mode=Enabled;
        gpu_single_bit_ecc_errors=0;gpu_double_bit_ecc_errors=0;gpu_temperature=86 C,
    driver_ver=270.41.06,timestamp=Wed May 4 13:00:35 2011

New NVIDIA GPU support

qsub allows specifying the required compute mode when requesting GPUs, as shown below. If no GPU mode is requested, it defaults to "exclusive" for NVIDIA driver version 260 or "exclusive_thread" for NVIDIA driver version 270 and above.
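For example (the job script name is illustrative; the mode suffix on the gpus= request is optional):

qsub -l nodes=1:ppn=1:gpus=1 job.sh
qsub -l nodes=1:ppn=1:gpus=1:exclusive_process job.sh

A reseterr suffix (for example, gpus=1:reseterr) can likewise be added to the request to reset the GPU ECC error counts mentioned above.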
