TORQUE Resource Manager

3.7 Scheduling GPUs

In TORQUE 2.5.4 and later, users can request GPUs on a node at job submission by specifying a nodes= resource request using the qsub -l option. The number of GPUs on each node must be specified in the nodes file. The GPU count is then reported in the output of pbsnodes:

napali
state = free
np = 2
ntype = cluster
status = rectime=1288888871,varattr=,jobs=,state=free,netload=1606207294,gres=tom:!/home/dbeer/dev/scripts/dynamic_resc.sh,loadave=0.10,ncpus=2,physmem=3091140kb,availmem=32788032348kb,totmem=34653576492kb,idletime=4983,nusers=3,nsessions=14,sessions=3136 1805 2380 2428 1161 3174 3184 3191 3209 3228 3272 3333 20560 32371,uname=Linux napali 2.6.32-25-generic #45-Ubuntu SMP Sat Oct 16 19:52:42 UTC 2010 x86_64,opsys=linux
mom_service_port = 15002
mom_manager_port = 15003
gpus = 1

The $PBS_GPUFILE has been created to include GPU awareness. The GPU appears as a separate line in $PBS_GPUFILE and follows this syntax:

<hostname>-gpu<index>

For example, if a job is submitted to run on a server called napali with a command such as qsub test.sh -l nodes=1:ppn=2:gpus=1, the $PBS_GPUFILE would contain:

napali
napali
napali-gpu0

The first two lines signify that the job has 2 ppn on napali, and the last line indicates that the job can also execute on GPU index 0 (the first GPU) on napali. It is left up to the job's owner to make sure that the job executes properly on the GPU. By default, TORQUE treats GPUs exactly the same as ppn (which corresponds to CPUs).
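
As an illustration of the <hostname>-gpu<index> syntax, the following minimal Python sketch extracts the GPU indices assigned on a given host from the lines of a $PBS_GPUFILE. The function names gpu_indices and read_gpufile are hypothetical helpers and are not part of TORQUE; the sketch assumes the file format is exactly as described above.

```python
def gpu_indices(gpufile_lines, hostname):
    """Return the sorted GPU indices listed for `hostname`.

    Assumes GPU entries follow the <hostname>-gpu<index> syntax;
    lines without a "-gpu" suffix (plain host entries) are skipped.
    """
    indices = []
    for line in gpufile_lines:
        entry = line.strip()
        # Split on the last "-gpu" so host names containing dashes still work.
        host, sep, index = entry.rpartition("-gpu")
        if sep and host == hostname and index.isdigit():
            indices.append(int(index))
    return sorted(indices)


def read_gpufile(path):
    """Read a GPU file (for example, the path held in $PBS_GPUFILE)."""
    with open(path) as handle:
        return handle.read().splitlines()


# File contents from the napali example above.
lines = ["napali", "napali", "napali-gpu0"]
print(gpu_indices(lines, "napali"))  # → [0]
```

Inside a job script, the result could be used, for example, to set CUDA_VISIBLE_DEVICES before starting a CUDA application. Whether that is appropriate depends on the site and the application.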

Using GPUs with NUMA

The pbs_server requires awareness of how the MOM is reporting nodes since there is only one MOM daemon and multiple MOM nodes. Configure the server_priv/nodes file with the num_numa_nodes and numa_gpu_node_str attributes. The attribute num_numa_nodes tells pbs_server how many NUMA nodes are reported by the MOM. If each NUMA node has the same number of GPUs, add the total number of GPUs to the nodes file. Following is an example of how to configure the nodes file with num_numa_nodes:

	numahost gpus=12 num_numa_nodes=6

This line in the nodes file tells pbs_server there is a host named numahost and that it has 12 GPUs and 6 NUMA nodes. The pbs_server divides the value of gpus (12) by the value of num_numa_nodes (6) and determines there are 2 GPUs per NUMA node.

In this example, the NUMA system is uniform in its configuration of GPUs per node board, but a system does not have to be configured with the same number of GPUs per node board. For systems with non-uniform GPU distributions, use the attribute numa_gpu_node_str to let pbs_server know where GPUs are located in the cluster.

If the number of GPUs is not the same on each NUMA node, you can specify the per-node counts with a comma-separated string. For example, if there are 3 NUMA nodes and the first has 0 GPUs, the second has 3, and the third has 5, you would add this to the nodes file entry:

	numa_gpu_node_str=0,3,5

In this configuration, pbs_server knows it has three MOM nodes and the nodes have 0, 3, and 5 GPUs respectively. Note that the gpus attribute is not used here; it is ignored because the number of GPUs per node is given explicitly.
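
The two cases above (uniform division and an explicit numa_gpu_node_str) can be sketched in Python as follows. This is only an illustration of the arithmetic, not pbs_server's actual code. The function name gpus_per_numa_node is hypothetical, the second sample entry is an assumed full nodes-file line, and raising an error on an uneven division is an assumption, since the behavior in that case is not described here.

```python
def gpus_per_numa_node(entry):
    """Return a list of GPU counts, one per NUMA node, for a nodes-file entry."""
    tokens = entry.split()
    # Everything after the host name is treated as attribute=value pairs.
    attrs = dict(token.split("=", 1) for token in tokens[1:] if "=" in token)

    # An explicit per-node string takes precedence; gpus is then ignored.
    if "numa_gpu_node_str" in attrs:
        return [int(count) for count in attrs["numa_gpu_node_str"].split(",")]

    total = int(attrs["gpus"])
    nodes = int(attrs["num_numa_nodes"])
    if total % nodes != 0:
        # Assumption: this sketch rejects totals that do not divide evenly.
        raise ValueError("gpus is not a multiple of num_numa_nodes")
    return [total // nodes] * nodes


print(gpus_per_numa_node("numahost gpus=12 num_numa_nodes=6"))
# → [2, 2, 2, 2, 2, 2]
print(gpus_per_numa_node("numahost numa_gpu_node_str=0,3,5"))
# → [0, 3, 5]
```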

In TORQUE 3.0.2 or later, qsub supports the mapping of -l gpus=X to -l gres=gpus:X. This allows users who are using NUMA systems to make requests such as -l ncpus=20,gpus=5 indicating they are not concerned with the GPUs in relation to the NUMA nodes they request; they only want a total of 20 cores and 5 GPUs.

Torque NVidia GPGPUs

The pbs_mom daemon can now query GPU hardware information and report status to pbs_server. The gpu_status attribute appears in pbsnodes output. New commands allow for setting GPU modes and for resetting GPU ECC error counts. This feature is available only in TORQUE 2.5.6, 3.0.2, and later.

Torque Configuration

There are three configuration (./configure) options available for use with Nvidia GPGPUs:

  • --enable-nvidia-gpus
  • --with-nvml-lib=DIR
  • --with-nvml-include=DIR
    --enable-nvidia-gpus enables the new features for Nvidia GPGPUs. By default, the pbs_mom daemons use the nvidia-smi command to interface with the Nvidia GPUs.

    ./configure --enable-nvidia-gpus

    To use the NVML (NVIDIA Management Library) API instead of nvidia-smi, configure TORQUE using --with-nvml-lib=DIR and --with-nvml-include=DIR. These options specify the location of the libnvidia-ml library and the location of the nvml.h include file.

    ./configure --with-nvml-lib=/usr/lib \
        --with-nvml-include=/usr/local/cuda/Tools/NVML

    server_priv/nodes:

        node001  gpus=1
        node002  gpus=4
        …

    pbsnodes -a

        node001
            …
            gpus = 1
        ...

    Note: By default, when TORQUE is configured with --enable-nvidia-gpus, the $TORQUE_HOME/nodes file is automatically updated with the correct GPU count for each MOM node.

    GPU Modes for Nvidia 260.x driver

    • 0 – Default - Shared mode available for multiple processes
    • 1 – Exclusive - Only one COMPUTE thread is allowed to run on the GPU
    • 2 – Prohibited - No COMPUTE contexts are allowed to run on the GPU

    GPU Modes for Nvidia 270.x driver

    • 0 – Default - Shared mode available for multiple processes
    • 1 – Exclusive Thread - Only one COMPUTE thread is allowed to run on the GPU (v260 exclusive)
    • 2 – Prohibited - No COMPUTE contexts are allowed to run on the GPU
    • 3 – Exclusive Process - Only one COMPUTE process is allowed to run on the GPU
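
The two mode tables above can be expressed as a small lookup, shown here as a Python sketch. The names GPU_MODES and mode_name are hypothetical and not part of TORQUE or the Nvidia tools. Only the 260.x and 270.x driver series listed above are covered, and keying on the first component of the driver version is an assumption.

```python
# Mode numbers and names copied from the lists above, keyed by driver series.
GPU_MODES = {
    "260": {0: "Default", 1: "Exclusive", 2: "Prohibited"},
    "270": {
        0: "Default",
        1: "Exclusive Thread",
        2: "Prohibited",
        3: "Exclusive Process",
    },
}


def mode_name(driver_ver, mode):
    """Return the mode name for a driver version string such as "270.41.06"."""
    series = driver_ver.split(".", 1)[0]
    try:
        return GPU_MODES[series][mode]
    except KeyError:
        raise ValueError(
            f"mode {mode} is not listed for driver {driver_ver}"
        ) from None


print(mode_name("270.41.06", 3))  # → Exclusive Process
print(mode_name("260.19", 1))     # → Exclusive
```

Note that mode 3 exists only in the 270.x list, so the sketch rejects it for a 260.x driver.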

    gpu_status

    root@gpu:~# pbsnodes gpu
    gpu
    ...
         gpus = 2
         gpu_status = gpu[1]=gpu_id=0:6:0;gpu_product_name=Tesla C2050;
             gpu_display=Disabled;gpu_pci_device_id=6D110DE;gpu_pci_location_id=0:6:0;
             gpu_fan_speed=54 %;gpu_memory_total=2687 Mb;gpu_memory_used=74 Mb;
             gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=96 %;
             gpu_memory_utilization=10 %;gpu_ecc_mode=Enabled;
             gpu_single_bit_ecc_errors=0;gpu_double_bit_ecc_errors=0;
             gpu_temperature=88 C,
             gpu[0]=gpu_id=0:5:0;gpu_product_name=Tesla C2050;
             gpu_display=Enabled;gpu_pci_device_id=6D110DE;gpu_pci_location_id=0:5:0;
             gpu_fan_speed=66 %;gpu_memory_total=2687 Mb;gpu_memory_used=136 Mb;
             gpu_mode=Default;gpu_state=Unallocated;gpu_utilization=96 %;
             gpu_memory_utilization=10 %;gpu_ecc_mode=Enabled;
             gpu_single_bit_ecc_errors=0;gpu_double_bit_ecc_errors=0;
             gpu_temperature=86 C,
             driver_ver=270.41.06,timestamp=Wed May  4 13:00:35 2011
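
For scripts that need individual fields, the gpu_status value can be split mechanically. The Python sketch below assumes the value arrives as a single string (it is wrapped above only for display), that commas separate the gpu[N] groups and node-level fields, and that semicolons separate the fields within a GPU, as the sample output suggests. The function name parse_gpu_status is hypothetical, and the sample string is shortened from the output above.

```python
import re


def parse_gpu_status(status):
    """Split a gpu_status value into per-GPU dicts and node-level fields."""
    gpus = {}
    node_fields = {}
    for part in status.split(","):
        part = part.strip()
        if not part:
            continue
        match = re.match(r"gpu\[(\d+)\]=(.*)", part, re.DOTALL)
        if match:
            # A gpu[N]=... group: semicolon-separated key=value fields.
            fields = {}
            for field in match.group(2).split(";"):
                key, sep, value = field.strip().partition("=")
                if sep:
                    fields[key] = value.strip()
            gpus[int(match.group(1))] = fields
        else:
            # A node-level field such as driver_ver or timestamp.
            key, sep, value = part.partition("=")
            if sep:
                node_fields[key] = value
    return gpus, node_fields


sample = (
    "gpu[1]=gpu_id=0:6:0;gpu_product_name=Tesla C2050;gpu_mode=Default;"
    "gpu_state=Unallocated;gpu_temperature=88 C,"
    "gpu[0]=gpu_id=0:5:0;gpu_product_name=Tesla C2050;gpu_mode=Default;"
    "gpu_state=Unallocated;gpu_temperature=86 C,"
    "driver_ver=270.41.06,timestamp=Wed May  4 13:00:35 2011"
)
gpus, node_fields = parse_gpu_status(sample)
print(gpus[0]["gpu_temperature"])  # → 86 C
print(node_fields["driver_ver"])   # → 270.41.06
```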

    New Nvidia GPU Support

    qsub allows specifying the required compute mode when requesting GPUs:

    • qsub -l nodes=1:ppn=1:gpus=1
    • qsub -l nodes=1:gpus=1
    • qsub -l nodes=1:gpus=1:exclusive_thread
    • qsub -l nodes=1:gpus=1:exclusive_process
    • qsub -l nodes=1:gpus=1:reseterr
    • qsub -l nodes=1:gpus=1:reseterr:exclusive_thread (the order exclusive_thread:reseterr is also accepted)
    • qsub -l nodes=1:gpus=1:reseterr:exclusive_process
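
The request strings listed above follow a regular pattern, which the following Python sketch assembles from keyword arguments. The function gpu_resource_request is a hypothetical helper, not part of TORQUE, and it accepts only the two compute modes that appear in the examples above.

```python
# Compute modes that appear in the examples above.
VALID_MODES = ("exclusive_thread", "exclusive_process")


def gpu_resource_request(nodes=1, gpus=1, ppn=None, mode=None, reseterr=False):
    """Build the value passed to qsub -l for a GPU request."""
    if mode is not None and mode not in VALID_MODES:
        raise ValueError(f"unsupported compute mode: {mode!r}")
    parts = [f"nodes={nodes}"]
    if ppn is not None:
        parts.append(f"ppn={ppn}")
    parts.append(f"gpus={gpus}")
    if reseterr:
        parts.append("reseterr")
    if mode is not None:
        parts.append(mode)
    return ":".join(parts)


print(gpu_resource_request(ppn=1))
# → nodes=1:ppn=1:gpus=1
print(gpu_resource_request(mode="exclusive_process", reseterr=True))
# → nodes=1:gpus=1:reseterr:exclusive_process
```

The returned string would be passed to qsub after the -l option, for example qsub -l nodes=1:gpus=1:reseterr:exclusive_process.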