1.7 TORQUE on NUMA Systems

Starting in TORQUE version 3.0, TORQUE can be configured to take full advantage of Non-Uniform Memory Architecture (NUMA) systems. The following instructions are a result of development on SGI Altix and UV hardware.

1.7.1 TORQUE NUMA Configuration

There are three steps to configure TORQUE to take advantage of NUMA architectures:

1. Configure TORQUE with --enable-numa-support.
2. Create the mom_priv/mom.layout file.
3. Configure server_priv/nodes.

1.7.2 Building TORQUE with NUMA Support

To turn on NUMA support for TORQUE, the --enable-numa-support option must be used during the configure portion of the installation. In addition to any other configuration options, add the --enable-numa-support option as indicated in the following example:

$ ./configure --enable-numa-support

1.7.2.1 Creating mom.layout

When TORQUE is built with NUMA support, only a single instance of pbs_mom (MOM) runs on the system. However, TORQUE reports that there are multiple nodes running in the cluster. While pbs_mom and pbs_server both know there is only one instance of pbs_mom, they manage the cluster as if there were multiple separate MOM nodes.

The mom.layout file is a virtual mapping between the system hardware configuration and how the administrator wants TORQUE to view the system. Each line in the mom.layout file equates to a node in the cluster and is referred to as a NUMA node. To properly set up the mom.layout file, it is important to know how the hardware is configured. Use the topology command line utility and inspect the contents of /sys/devices/system/node. The hwloc library can also be used to create a custom discovery tool.

Typing topology on the command line of a NUMA system produces something similar to the following:

Partition number: 0 
6 Blades 
72 CPUs 
378.43 Gb Memory Total 

Blade         ID       asic  NASID         Memory 
------------------------------------------------- 
    0 r001i01b00  UVHub 1.0      0    67089152 kB 
    1 r001i01b01  UVHub 1.0      2    67092480 kB 
    2 r001i01b02  UVHub 1.0      4    67092480 kB 
    3 r001i01b03  UVHub 1.0      6    67092480 kB 
    4 r001i01b04  UVHub 1.0      8    67092480 kB 
    5 r001i01b05  UVHub 1.0     10    67092480 kB 

CPU      Blade PhysID CoreID APIC-ID Family Model Speed L1(KiB) L2(KiB) L3(KiB) 
------------------------------------------------------------------------------- 
  0 r001i01b00     00     00       0      6    46  2666 32d/32i     256   18432 
  1 r001i01b00     00     02       4      6    46  2666 32d/32i     256   18432 
  2 r001i01b00     00     03       6      6    46  2666 32d/32i     256   18432 
  3 r001i01b00     00     08      16      6    46  2666 32d/32i     256   18432 
  4 r001i01b00     00     09      18      6    46  2666 32d/32i     256   18432 
  5 r001i01b00     00     11      22      6    46  2666 32d/32i     256   18432 
  6 r001i01b00     01     00      32      6    46  2666 32d/32i     256   18432 
  7 r001i01b00     01     02      36      6    46  2666 32d/32i     256   18432 
  8 r001i01b00     01     03      38      6    46  2666 32d/32i     256   18432 
  9 r001i01b00     01     08      48      6    46  2666 32d/32i     256   18432 
 10 r001i01b00     01     09      50      6    46  2666 32d/32i     256   18432 
 11 r001i01b00     01     11      54      6    46  2666 32d/32i     256   18432 
 12 r001i01b01     02     00      64      6    46  2666 32d/32i     256   18432 
 13 r001i01b01     02     02      68      6    46  2666 32d/32i     256   18432 
 14 r001i01b01     02     03      70      6    46  2666 32d/32i     256   18432

From this partial output, note that this system has 72 CPUs on 6 blades. Each blade has 12 CPUs grouped into clusters of 6 CPUs. If the entire content of this command were printed you would see each Blade ID and the CPU ID assigned to each blade.

The topology command shows how the CPUs are distributed, but you likely also need to know where memory is located relative to CPUs, so go to /sys/devices/system/node. If you list the node directory you will see something similar to the following:

# ls -al 
total 0 
drwxr-xr-x 14 root root    0 Dec  3 12:14 . 
drwxr-xr-x 14 root root    0 Dec  3 12:13 .. 
-r--r--r--  1 root root 4096 Dec  3 14:58 has_cpu 
-r--r--r--  1 root root 4096 Dec  3 14:58 has_normal_memory 
drwxr-xr-x  2 root root    0 Dec  3 12:14 node0 
drwxr-xr-x  2 root root    0 Dec  3 12:14 node1 
drwxr-xr-x  2 root root    0 Dec  3 12:14 node10 
drwxr-xr-x  2 root root    0 Dec  3 12:14 node11 
drwxr-xr-x  2 root root    0 Dec  3 12:14 node2 
drwxr-xr-x  2 root root    0 Dec  3 12:14 node3 
drwxr-xr-x  2 root root    0 Dec  3 12:14 node4 
drwxr-xr-x  2 root root    0 Dec  3 12:14 node5 
drwxr-xr-x  2 root root    0 Dec  3 12:14 node6 
drwxr-xr-x  2 root root    0 Dec  3 12:14 node7 
drwxr-xr-x  2 root root    0 Dec  3 12:14 node8 
drwxr-xr-x  2 root root    0 Dec  3 12:14 node9 
-r--r--r--  1 root root 4096 Dec  3 14:58 online 
-r--r--r--  1 root root 4096 Dec  3 14:58 possible

The directory entries node0, node1, ... node11 represent groups of memory and CPUs local to each other. Each group is a node board, a grouping of resources that are close together. In most cases, a node board is made up of memory and processor cores. Each bank of memory is called a memory node by the operating system, and there are certain CPUs that can access that memory very rapidly. Note that the directory for node board node0 contains an entry called cpulist. This entry lists the CPU IDs of all CPUs local to the memory in node board 0.

Now create the mom.layout file. The cpulist for node board 0 contains 0-5, indicating that CPUs 0-5 are local to the memory of node board 0. The cpulist for node board 1 contains 6-11, indicating that CPUs 6-11 are local to the memory of node board 1. Repeat this for all twelve node boards to create the following mom.layout file for the 72 CPU system.

cpus=0-5	mem=0
cpus=6-11	mem=1
cpus=12-17	mem=2
cpus=18-23	mem=3
cpus=24-29	mem=4
cpus=30-35	mem=5
cpus=36-41	mem=6
cpus=42-47	mem=7
cpus=48-53	mem=8
cpus=54-59	mem=9
cpus=60-65	mem=10
cpus=66-71	mem=11

cpus= should list the indexes of the CPUs for that node board or entity. These are the CPUs that will be considered part of that NUMA node.
mem= should list the indexes of the memory nodes associated with that node board or entity. The memory from these nodes will be considered part of that NUMA node.

Each line in the mom.layout file is reported as a node to pbs_server by the pbs_mom daemon.
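
The manual procedure above can also be scripted. The following Python sketch is not part of TORQUE; it assumes that each nodeN directory under /sys/devices/system/node contains a cpulist file, as described earlier, and that one mom.layout line per memory node is wanted. The function names read_node_cpulists and layout_lines are hypothetical, and memory nodes with an empty cpulist are simply skipped, which may not suit every site.

```python
import os
import re


def read_node_cpulists(sysfs_node_dir="/sys/devices/system/node"):
    """Return {memory_node_index: cpulist_string} for every nodeN directory."""
    cpulists = {}
    for entry in os.listdir(sysfs_node_dir):
        match = re.fullmatch(r"node(\d+)", entry)
        if not match:
            continue  # skip has_cpu, online, possible, and similar entries
        with open(os.path.join(sysfs_node_dir, entry, "cpulist")) as handle:
            cpulist = handle.read().strip()
        if cpulist:  # assumption: ignore memory nodes that have no local CPUs
            cpulists[int(match.group(1))] = cpulist
    return cpulists


def layout_lines(cpulists):
    """Build one mom.layout line per memory node, in numeric order."""
    # Sorting on the integer index places node10 after node9, unlike ls.
    return [f"cpus={cpulists[n]}\tmem={n}" for n in sorted(cpulists)]
```

Printing "\n".join(layout_lines(read_node_cpulists())) and comparing the result with the topology output before saving it as mom_priv/mom.layout is a reasonable check.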

The mom.layout file does not need to match the hardware layout exactly. It is possible to combine node boards and create larger NUMA nodes. The following example shows how to do this:

cpus=0-11	mem=0-1

Memory nodes can be combined in the same way as CPUs, but the memory nodes being combined must be contiguous. For example, you cannot combine mem 0 and 2.
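
As an illustration of combining node boards, the following Python sketch groups adjacent boards into larger NUMA nodes. It is a sketch under stated assumptions, not a TORQUE tool: the function name combine_boards is hypothetical, board i is assumed to correspond to memory node i, and every group is assumed to hold the same number of boards. Because only neighboring boards are grouped, the resulting mem= ranges are always contiguous.

```python
def combine_boards(boards, group_size):
    """Return mom.layout lines that merge group_size adjacent node boards.

    boards is a list of (first_cpu, last_cpu) tuples, one per node board,
    ordered by memory node index (assumption: board i uses memory node i).
    """
    if group_size < 1 or len(boards) % group_size != 0:
        raise ValueError("group_size must evenly divide the number of boards")
    lines = []
    for start in range(0, len(boards), group_size):
        group = boards[start:start + group_size]
        # A single cpus= range is only valid if the boards' CPUs are adjacent.
        for (_, prev_last), (next_first, _) in zip(group, group[1:]):
            if next_first != prev_last + 1:
                raise ValueError("CPU ranges in a group must be contiguous")
        first_mem, last_mem = start, start + group_size - 1
        mem = str(first_mem) if group_size == 1 else f"{first_mem}-{last_mem}"
        lines.append(f"cpus={group[0][0]}-{group[-1][1]}\tmem={mem}")
    return lines


# The 72 CPU example system: 12 boards with 6 CPUs each.
example_boards = [(i * 6, i * 6 + 5) for i in range(12)]
```

With these inputs, combine_boards(example_boards, 2) yields six lines, the first of which is the cpus=0-11 mem=0-1 line shown above.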

1.7.2.2 Configuring server_priv/nodes

The pbs_server daemon must know how the MOM is reporting nodes, since there is only one MOM daemon but multiple MOM nodes. To provide this information, configure the server_priv/nodes file with the num_numa_nodes and numa_node_str attributes. The attribute num_numa_nodes tells pbs_server how many NUMA nodes are reported by the MOM. The following is an example of how to configure the nodes file with num_numa_nodes:

	numa-10 np=72 num_numa_nodes=12

This line in the nodes file tells pbs_server there is a host named numa-10 and that it has 72 processors and 12 nodes. The pbs_server divides the value of np (72) by the value for num_numa_nodes (12) and determines there are 6 CPUs per NUMA node.
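
The division described above can be checked before editing the nodes file. This Python sketch only mirrors the arithmetic; the helper names parse_nodes_line and cpus_per_numa_node are hypothetical, and treating an uneven division as an error is an assumption of the sketch, since this section does not say how pbs_server handles that case.

```python
def parse_nodes_line(line):
    """Split a server_priv/nodes line into (hostname, {attribute: value})."""
    host, *fields = line.split()
    # Keep only key=value fields; bare property names are ignored here.
    attrs = dict(field.split("=", 1) for field in fields if "=" in field)
    return host, attrs


def cpus_per_numa_node(np, num_numa_nodes):
    """Mirror the np / num_numa_nodes division described in the text."""
    if num_numa_nodes <= 0:
        raise ValueError("num_numa_nodes must be positive")
    cpus, remainder = divmod(np, num_numa_nodes)
    if remainder:
        # Assumption: uneven layouts are better described with numa_node_str.
        raise ValueError("np is not evenly divisible by num_numa_nodes")
    return cpus


host, attrs = parse_nodes_line("numa-10 np=72 num_numa_nodes=12")
per_node = cpus_per_numa_node(int(attrs["np"]), int(attrs["num_numa_nodes"]))
print(host, per_node)  # numa-10 6
```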

In this example, the NUMA system is uniform in its configuration of CPUs per node board, but a system does not need to be configured with the same number of CPUs per node board. For systems with non-uniform CPU distributions, use the attribute numa_node_str to let pbs_server know where CPUs are located in the cluster.

The following is an example of how to configure the server_priv/nodes file for non-uniformly distributed CPUs:

	Numa-11 numa_node_str=6,8,12

In this configuration, pbs_server knows it has three MOM nodes and the nodes have 6, 8, and 12 CPUs respectively. Note that the attribute np is not used. The np attribute is ignored because the number of CPUs per node is expressly given.
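
To make the meaning of numa_node_str concrete, the following Python sketch derives the node count and total CPU count from the attribute value. The function names are hypothetical. The cpu_ranges helper additionally assumes that CPUs are numbered consecutively from 0 across the NUMA nodes; that numbering is an assumption for illustration and should be checked against the actual mom.layout file.

```python
def numa_node_cpu_counts(numa_node_str):
    """Parse a numa_node_str value such as "6,8,12" into a list of counts."""
    counts = [int(part) for part in numa_node_str.split(",")]
    if any(count <= 0 for count in counts):
        raise ValueError("every NUMA node needs at least one CPU")
    return counts


def cpu_ranges(counts):
    """Return (first_cpu, last_cpu) per node, assuming consecutive numbering."""
    ranges, first = [], 0
    for count in counts:
        ranges.append((first, first + count - 1))
        first += count
    return ranges


counts = numa_node_cpu_counts("6,8,12")
print(len(counts), sum(counts))  # 3 26
```

Under the consecutive-numbering assumption, cpu_ranges(counts) gives (0, 5), (6, 13), and (14, 25) for the three nodes.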

1.7.2.2.1 Enforcement of memory resource limits

TORQUE can better enforce memory limits with the use of the memacctd utility. The memacctd utility is provided by SGI on SUSE Linux Enterprise Server (SLES). It is a daemon that caches memory footprints when it is queried. When configured to use the memory monitor, TORQUE queries memacctd. It is up to the user to make sure memacctd is installed. See the SGI memacctd man page for more information.

To configure TORQUE to use memacctd for memory enforcement, do the following:

1. Start memacctd as instructed by SGI.
2. Reconfigure TORQUE with --enable-numa-mem-monitor. This will link in the necessary library when TORQUE is recompiled.
3. Recompile and reinstall TORQUE.
4. Restart all MOM nodes.
5. (Optional) Alter the qsub filter to include a default memory limit for all jobs that are not submitted with a memory limit.
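
The optional qsub filter change might look like the following Python sketch. It assumes the filter receives the job script on standard input and writes the possibly modified script to standard output. The default of 2gb, the names DEFAULT_MEM and add_default_mem, and the simple pattern used to detect an existing request are all hypothetical. The sketch inspects only #PBS directives inside the script, so a limit given on the qsub command line is not detected.

```python
import re
import sys

DEFAULT_MEM = "2gb"  # hypothetical site default, adjust as needed

# Matches a "#PBS -l ..." directive that requests mem, pmem, vmem, or pvmem.
MEM_DIRECTIVE = re.compile(
    r"^#PBS[ \t]+-l[ \t]+.*\b(?:p?v?mem)=", re.MULTILINE
)


def add_default_mem(script, default_mem=DEFAULT_MEM):
    """Return script with a default mem directive if none is present."""
    if MEM_DIRECTIVE.search(script):
        return script  # the job already states a memory limit
    directive = f"#PBS -l mem={default_mem}\n"
    lines = script.splitlines(keepends=True)
    # Keep an interpreter line first so the script still runs as intended.
    if lines and lines[0].startswith("#!"):
        if not lines[0].endswith("\n"):
            lines[0] += "\n"
        return lines[0] + directive + "".join(lines[1:])
    return directive + script


def main():
    # Filter mode: job script in on stdin, result out on stdout.
    sys.stdout.write(add_default_mem(sys.stdin.read()))

# A deployed filter would end with a call to main().
```

For example, a script consisting of "#!/bin/sh" followed by "echo hi" gains a "#PBS -l mem=2gb" line directly after the interpreter line, while a script that already contains "#PBS -l nodes=1,mem=4gb" passes through unchanged.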