Starting in TORQUE version 3.0, TORQUE can be configured to take full advantage of Non-Uniform Memory Architecture (NUMA) systems. The following instructions are a result of development on SGI Altix and UV hardware.
There are three steps to configure TORQUE to take advantage of NUMA architectures: build TORQUE with NUMA support, create the mom.layout file, and configure the server_priv/nodes file.
To turn on NUMA support, pass the --enable-numa-support option to configure during installation, in addition to any other configuration options you use, as in the following example:
$ ./configure --enable-numa-support
When TORQUE is enabled to run with NUMA support, there is only a single instance of pbs_mom (MOM) that is run on the system. However, TORQUE will report that there are multiple nodes running in the cluster. While pbs_mom and pbs_server both know there is only one instantiation of pbs_mom, they manage the cluster as if there were multiple separate MOM nodes.
The mom.layout file is a virtual mapping between the system hardware configuration and how the administrator wants TORQUE to view the system. Each line in the mom.layout file equates to a node in the cluster and is referred to as a NUMA node. To properly set up the mom.layout file, it is important to know how the hardware is configured. Use the topology command line utility and inspect the contents of /sys/devices/system/node. The hwloc library can also be used to create a custom discovery tool.
Typing topology on the command line of a NUMA system produces something similar to the following:
Partition number: 0
6 Blades
72 CPUs
378.43 Gb Memory Total

Blade  ID          asic       NASID  Memory
-------------------------------------------------
0      r001i01b00  UVHub 1.0  0      67089152 kB
1      r001i01b01  UVHub 1.0  2      67092480 kB
2      r001i01b02  UVHub 1.0  4      67092480 kB
3      r001i01b03  UVHub 1.0  6      67092480 kB
4      r001i01b04  UVHub 1.0  8      67092480 kB
5      r001i01b05  UVHub 1.0  10     67092480 kB

CPU  Blade       PhysID  CoreID  APIC-ID  Family  Model  Speed  L1(KiB)  L2(KiB)  L3(KiB)
-------------------------------------------------------------------------------
0    r001i01b00  00      00      0        6       46     2666   32d/32i  256      18432
1    r001i01b00  00      02      4        6       46     2666   32d/32i  256      18432
2    r001i01b00  00      03      6        6       46     2666   32d/32i  256      18432
3    r001i01b00  00      08      16       6       46     2666   32d/32i  256      18432
4    r001i01b00  00      09      18       6       46     2666   32d/32i  256      18432
5    r001i01b00  00      11      22       6       46     2666   32d/32i  256      18432
6    r001i01b00  01      00      32       6       46     2666   32d/32i  256      18432
7    r001i01b00  01      02      36       6       46     2666   32d/32i  256      18432
8    r001i01b00  01      03      38       6       46     2666   32d/32i  256      18432
9    r001i01b00  01      08      48       6       46     2666   32d/32i  256      18432
10   r001i01b00  01      09      50       6       46     2666   32d/32i  256      18432
11   r001i01b00  01      11      54       6       46     2666   32d/32i  256      18432
12   r001i01b01  02      00      64       6       46     2666   32d/32i  256      18432
13   r001i01b01  02      02      68       6       46     2666   32d/32i  256      18432
14   r001i01b01  02      03      70       6       46     2666   32d/32i  256      18432
From this partial output, note that this system has 72 CPUs on 6 blades. Each blade has 12 CPUs grouped into clusters of 6 CPUs. If the entire content of this command were printed you would see each Blade ID and the CPU ID assigned to each blade.
The topology command shows how the CPUs are distributed, but you likely also need to know where memory is located relative to CPUs, so go to /sys/devices/system/node. If you list the node directory you will see something similar to the following:
# ls -al
total 0
drwxr-xr-x 14 root root    0 Dec  3 12:14 .
drwxr-xr-x 14 root root    0 Dec  3 12:13 ..
-r--r--r--  1 root root 4096 Dec  3 14:58 has_cpu
-r--r--r--  1 root root 4096 Dec  3 14:58 has_normal_memory
drwxr-xr-x  2 root root    0 Dec  3 12:14 node0
drwxr-xr-x  2 root root    0 Dec  3 12:14 node1
drwxr-xr-x  2 root root    0 Dec  3 12:14 node10
drwxr-xr-x  2 root root    0 Dec  3 12:14 node11
drwxr-xr-x  2 root root    0 Dec  3 12:14 node2
drwxr-xr-x  2 root root    0 Dec  3 12:14 node3
drwxr-xr-x  2 root root    0 Dec  3 12:14 node4
drwxr-xr-x  2 root root    0 Dec  3 12:14 node5
drwxr-xr-x  2 root root    0 Dec  3 12:14 node6
drwxr-xr-x  2 root root    0 Dec  3 12:14 node7
drwxr-xr-x  2 root root    0 Dec  3 12:14 node8
drwxr-xr-x  2 root root    0 Dec  3 12:14 node9
-r--r--r--  1 root root 4096 Dec  3 14:58 online
-r--r--r--  1 root root 4096 Dec  3 14:58 possible
The directory entries node0, node1, ... node11 represent groups of memory and CPUs that are local to each other. Each of these groups is a node board, a grouping of resources that are close together. In most cases, a node board is made up of memory and processor cores. Each bank of memory is called a memory node by the operating system, and there are certain CPUs that can access that memory very rapidly. Note that the directory for node board node0 contains an entry called cpulist. This file contains the CPU IDs of all CPUs local to the memory in node board 0.
Now create the mom.layout file. The cpulist for node board 0 contains 0-5, indicating that CPUs 0-5 are local to the memory of node board 0. The cpulist for node board 1 contains 6-11, indicating that CPUs 6-11 are local to the memory of node board 1. Repeat this for all twelve node boards to create the following mom.layout file for the 72-CPU system.
cpus=0-5 mem=0
cpus=6-11 mem=1
cpus=12-17 mem=2
cpus=18-23 mem=3
cpus=24-29 mem=4
cpus=30-35 mem=5
cpus=36-41 mem=6
cpus=42-47 mem=7
cpus=48-53 mem=8
cpus=54-59 mem=9
cpus=60-65 mem=10
cpus=66-71 mem=11
Each line in the mom.layout file is reported as a node to pbs_server by the pbs_mom daemon.
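The per-board procedure described above can be automated. The following is a minimal Python sketch and is not part of TORQUE: the function name build_mom_layout is hypothetical, and the sketch assumes that every nodeN directory contains a cpulist file whose content can be copied verbatim into a cpus= entry, as in the 72-CPU example. Review the generated lines before saving them as mom.layout.

```python
import os
import re


def build_mom_layout(node_dir="/sys/devices/system/node"):
    """Return one 'cpus=... mem=N' line per node board found under node_dir.

    Hypothetical helper: it only mirrors the manual steps of reading each
    nodeN/cpulist file and pairing its content with the board number N.
    """
    boards = []
    for entry in os.listdir(node_dir):
        match = re.fullmatch(r"node(\d+)", entry)
        if match is None:
            continue  # skip has_cpu, has_normal_memory, online, possible
        with open(os.path.join(node_dir, entry, "cpulist")) as handle:
            cpulist = handle.read().strip()
        if cpulist:  # ignore boards that list no CPUs
            boards.append((int(match.group(1)), cpulist))
    # Sort numerically so that node10 follows node9 instead of node1.
    return [f"cpus={cpus} mem={index}" for index, cpus in sorted(boards)]
```

Calling build_mom_layout() with no arguments reads the live system. Passing a different directory makes it possible to try the function against a copy of the node tree.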
The mom.layout file does not need to match the hardware layout exactly. It is possible to combine node boards and create larger NUMA nodes. The following example shows how to do this:
cpus=0-11 mem=0-1
Memory nodes can be combined in the same way as CPUs, but the combined memory nodes must be contiguous. For example, you cannot combine mem 0 and mem 2.
The pbs_server requires awareness of how the MOM is reporting nodes, since there is only one MOM daemon but multiple MOM nodes. Configure the server_priv/nodes file with the num_numa_nodes or numa_node_str attribute. The attribute num_numa_nodes tells pbs_server how many NUMA nodes are reported by the MOM. The following is an example of how to configure the nodes file with num_numa_nodes:
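The contiguity rule can be checked mechanically when merging node boards. The sketch below is an illustration only: combine_boards is a hypothetical helper, not a TORQUE tool, and it assumes each input line has the single-board form cpus=A-B mem=N shown earlier. Because it writes one CPU range per merged node, it also requires the CPU ranges of merged boards to be adjacent, which is a limitation of the sketch rather than a documented TORQUE rule.

```python
def combine_boards(layout_lines, boards_per_node):
    """Merge every run of boards_per_node single-board mom.layout lines."""
    parsed = []
    for line in layout_lines:
        fields = dict(item.split("=", 1) for item in line.split())
        cpu_parts = fields["cpus"].split("-")
        # (first CPU, last CPU, memory node index)
        parsed.append((int(cpu_parts[0]), int(cpu_parts[-1]), int(fields["mem"])))
    parsed.sort(key=lambda board: board[2])

    combined = []
    for start in range(0, len(parsed), boards_per_node):
        group = parsed[start:start + boards_per_node]
        for prev, cur in zip(group, group[1:]):
            if cur[2] != prev[2] + 1:
                raise ValueError(f"memory nodes {prev[2]} and {cur[2]} are not contiguous")
            if cur[0] != prev[1] + 1:
                # Sketch limitation: only a single CPU range is emitted.
                raise ValueError(f"CPU ranges ending at {prev[1]} and starting at {cur[0]} are not adjacent")
        cpus = f"{group[0][0]}-{group[-1][1]}"
        first_mem, last_mem = group[0][2], group[-1][2]
        mem = str(first_mem) if first_mem == last_mem else f"{first_mem}-{last_mem}"
        combined.append(f"cpus={cpus} mem={mem}")
    return combined


# Merging the first two boards of the 72-CPU example:
print(combine_boards(["cpus=0-5 mem=0", "cpus=6-11 mem=1"], 2))
# prints ['cpus=0-11 mem=0-1']
```

Attempting to merge boards with mem 0 and mem 2 raises ValueError, which mirrors the restriction stated above.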
numa-10 np=72 num_numa_nodes=12
This line in the nodes file tells pbs_server there is a host named numa-10 and that it has 72 processors and 12 nodes. The pbs_server divides the value of np (72) by the value for num_numa_nodes (12) and determines there are 6 CPUs per NUMA node.
In this example, the NUMA system is uniform in its configuration of CPUs per node board, but a system does not need to be configured with the same number of CPUs per node board. For systems with non-uniform CPU distributions, use the attribute numa_node_str to let pbs_server know where CPUs are located in the cluster.
The following is an example of how to configure the server_priv/nodes file for non-uniformly distributed CPUs:
Numa-11 numa_node_str=6,8,12
In this configuration, pbs_server knows it has three MOM nodes and the nodes have 6, 8, and 12 CPUs respectively. Note that the attribute np is not used. The np attribute is ignored because the number of CPUs per node is expressly given.
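The two nodes-file forms can be compared with a small calculation. This Python sketch only reproduces the arithmetic described in this section and is not taken from the pbs_server source: the function name cpus_per_numa_node is hypothetical, and rejecting an np value that does not divide evenly is a choice made for the sketch, since this section does not say how pbs_server handles that case.

```python
def cpus_per_numa_node(nodes_line):
    """Return (hostname, [CPU count for each NUMA node]) for one nodes-file line."""
    host, *attrs = nodes_line.split()
    fields = dict(attr.split("=", 1) for attr in attrs if "=" in attr)

    if "numa_node_str" in fields:
        # Explicit per-node CPU counts; np is not consulted.
        return host, [int(count) for count in fields["numa_node_str"].split(",")]

    total_cpus = int(fields["np"])
    node_count = int(fields["num_numa_nodes"])
    if total_cpus % node_count != 0:
        # Assumption of this sketch: refuse rather than guess a rounding rule.
        raise ValueError("np is not a multiple of num_numa_nodes")
    return host, [total_cpus // node_count] * node_count


print(cpus_per_numa_node("numa-10 np=72 num_numa_nodes=12"))
# prints ('numa-10', [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6])
print(cpus_per_numa_node("Numa-11 numa_node_str=6,8,12"))
# prints ('Numa-11', [6, 8, 12])
```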
TORQUE can better enforce memory limits with the use of the memacctd utility. The memacctd utility is provided by SGI on SUSE Linux Enterprise Server (SLES). It is a daemon that caches memory footprints when it is queried. When configured to use the memory monitor, TORQUE queries memacctd. It is up to the user to make sure memacctd is installed. See the SGI memacctd man page for more information.
To configure TORQUE to use memacctd for memory enforcement do the following: