To turn on NUMA support for TORQUE, use the --enable-numa-support option during the configure portion of the installation. Add it alongside any other configuration options, as in the following example:
$ ./configure --enable-numa-support
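For example, if you are also installing TORQUE under a specific prefix (the path below is illustrative only), the options can be combined on a single configure line:
$ ./configure --prefix=/usr/local --enable-numa-support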
Don't use MOM hierarchy with NUMA.
When TORQUE is enabled to run with NUMA support, there is only a single instance of pbs_mom (MOM) that is run on the system. However, TORQUE will report that there are multiple nodes running in the cluster. While pbs_mom and pbs_server both know there is only one instance of pbs_mom, they manage the cluster as if there were multiple separate MOM nodes.
The mom.layout file is a virtual mapping between the system hardware configuration and how the administrator wants TORQUE to view the system. Each line in mom.layout equates to a node in the cluster and is referred to as a NUMA node.
Automatically Creating mom.layout (Recommended)
A Perl script named mom_gencfg is provided in the contrib/ directory that generates the mom.layout file for you. The script can be customized by setting a few variables in it; instructions for customizing and running it are included in the script itself. After setting the variables, run the script to generate mom.layout:
$ ./mom_gencfg
Manually Creating mom.layout
To properly set up the mom.layout file, it is important to know how the hardware is configured. Use the topology command line utility and inspect the contents of /sys/devices/system/node. The hwloc library can also be used to create a custom discovery tool.
Typing topology on the command line of a NUMA system produces something similar to the following:
Partition number: 0
6 Blades
72 CPUs
378.43 Gb Memory Total
Blade  ID          asic       NASID  Memory
-------------------------------------------------
    0  r001i01b00  UVHub 1.0      0  67089152 kB
    1  r001i01b01  UVHub 1.0      2  67092480 kB
    2  r001i01b02  UVHub 1.0      4  67092480 kB
    3  r001i01b03  UVHub 1.0      6  67092480 kB
    4  r001i01b04  UVHub 1.0      8  67092480 kB
    5  r001i01b05  UVHub 1.0     10  67092480 kB

CPU  Blade       PhysID  CoreID  APIC-ID  Family  Model  Speed  L1(KiB)  L2(KiB)  L3(KiB)
-------------------------------------------------------------------------------
  0  r001i01b00      00      00        0       6     46   2666  32d/32i      256    18432
  1  r001i01b00      00      02        4       6     46   2666  32d/32i      256    18432
  2  r001i01b00      00      03        6       6     46   2666  32d/32i      256    18432
  3  r001i01b00      00      08       16       6     46   2666  32d/32i      256    18432
  4  r001i01b00      00      09       18       6     46   2666  32d/32i      256    18432
  5  r001i01b00      00      11       22       6     46   2666  32d/32i      256    18432
  6  r001i01b00      01      00       32       6     46   2666  32d/32i      256    18432
  7  r001i01b00      01      02       36       6     46   2666  32d/32i      256    18432
  8  r001i01b00      01      03       38       6     46   2666  32d/32i      256    18432
  9  r001i01b00      01      08       48       6     46   2666  32d/32i      256    18432
 10  r001i01b00      01      09       50       6     46   2666  32d/32i      256    18432
 11  r001i01b00      01      11       54       6     46   2666  32d/32i      256    18432
 12  r001i01b01      02      00       64       6     46   2666  32d/32i      256    18432
 13  r001i01b01      02      02       68       6     46   2666  32d/32i      256    18432
 14  r001i01b01      02      03       70       6     46   2666  32d/32i      256    18432
From this partial output, note that this system has 72 CPUs on 6 blades. Each blade has 12 CPUs grouped into clusters of 6 CPUs. If the entire output of this command were printed, you would see each Blade ID and the CPU IDs assigned to each blade.
The topology command shows how the CPUs are distributed, but you likely also need to know where memory is located relative to CPUs, so look in /sys/devices/system/node. Listing that directory produces something similar to the following:
# ls -al
total 0
drwxr-xr-x 14 root root 0 Dec 3 12:14 .
drwxr-xr-x 14 root root 0 Dec 3 12:13 ..
-r--r--r-- 1 root root 4096 Dec 3 14:58 has_cpu
-r--r--r-- 1 root root 4096 Dec 3 14:58 has_normal_memory
drwxr-xr-x 2 root root 0 Dec 3 12:14 node0
drwxr-xr-x 2 root root 0 Dec 3 12:14 node1
drwxr-xr-x 2 root root 0 Dec 3 12:14 node10
drwxr-xr-x 2 root root 0 Dec 3 12:14 node11
drwxr-xr-x 2 root root 0 Dec 3 12:14 node2
drwxr-xr-x 2 root root 0 Dec 3 12:14 node3
drwxr-xr-x 2 root root 0 Dec 3 12:14 node4
drwxr-xr-x 2 root root 0 Dec 3 12:14 node5
drwxr-xr-x 2 root root 0 Dec 3 12:14 node6
drwxr-xr-x 2 root root 0 Dec 3 12:14 node7
drwxr-xr-x 2 root root 0 Dec 3 12:14 node8
drwxr-xr-x 2 root root 0 Dec 3 12:14 node9
-r--r--r-- 1 root root 4096 Dec 3 14:58 online
-r--r--r-- 1 root root 4096 Dec 3 14:58 possible
The directory entries node0, node1, ... node11 represent groups of memory and CPUs that are local to each other. Each of these groups is called a node board, a grouping of resources that are physically close together; in most cases, a node board is made up of memory and processor cores. The operating system calls each bank of memory a memory node, and certain CPUs can access that memory very rapidly. Under the directory for node board node0 there is an entry called cpulist, which contains the CPU IDs of all CPUs local to the memory in node board 0.
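For example, on the 72-CPU system described above, reading the cpulist entries for the first two node boards should return ranges like the following (the exact values depend on your hardware):
# cat /sys/devices/system/node/node0/cpulist
0-5
# cat /sys/devices/system/node/node1/cpulist
6-11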
Now create the mom.layout file. The CPUs in the cpulist for node board 0 (0-5) are local to the memory of node board 0, and that node board's memory and CPUs are specified in the layout file with the line nodes=0. The cpulist for node board 1 shows CPUs 6-11 and memory node index 1; to specify this, write nodes=1. Repeat this for all twelve node boards to create the following mom.layout file for the 72-CPU system:
nodes=0
nodes=1
nodes=2
nodes=3
nodes=4
nodes=5
nodes=6
nodes=7
nodes=8
nodes=9
nodes=10
nodes=11
Each line in the mom.layout file is reported as a node to pbs_server by the pbs_mom daemon.
The mom.layout file does not need to match the hardware layout exactly. It is possible to combine node boards and create larger NUMA nodes. The following example shows how to do this:
nodes=0-1
Memory nodes can be combined in the same way as CPUs, but the combined memory nodes must be contiguous; you cannot combine mem 0 and 2.
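For illustration only, a mom.layout that pairs adjacent node boards on the 72-CPU example system would contain six lines, each presenting two node boards to pbs_server as one larger NUMA node:
nodes=0-1
nodes=2-3
nodes=4-5
nodes=6-7
nodes=8-9
nodes=10-11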
Configuring server_priv/nodes
Since there is only one MOM daemon but multiple MOM nodes, pbs_server must be aware of how the MOM is reporting nodes. To provide this, configure the server_priv/nodes file with the num_node_boards and numa_board_str attributes. The num_node_boards attribute tells pbs_server how many NUMA nodes are reported by the MOM. The following is an example of how to configure the nodes file with num_node_boards:
numa-10 np=72 num_node_boards=12
This line in the nodes file tells pbs_server there is a host named numa-10 and that it has 72 processors and 12 nodes. The pbs_server divides the value of np (72) by the value for num_node_boards (12) and determines there are 6 CPUs per NUMA node.
In this example, the NUMA system is uniform in its configuration of CPUs per node board, but a system does not need to be configured with the same number of CPUs per node board. For systems with non-uniform CPU distributions, use the attribute numa_board_str to let pbs_server know where CPUs are located in the cluster.
The following is an example of how to configure the server_priv/nodes file for non-uniformly distributed CPUs:
Numa-11 numa_board_str=6,8,12
In this configuration, pbs_server knows it has three MOM nodes and that the nodes have 6, 8, and 12 CPUs respectively. Note that the np attribute is not used; it is ignored because the number of CPUs per node is given explicitly.
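As an illustration of the file format only, a server_priv/nodes file describing both of the example hosts above would contain one line per host:
numa-10 np=72 num_node_boards=12
Numa-11 numa_board_str=6,8,12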
Enforcement of memory resource limits
TORQUE can better enforce memory limits with the use of the memacctd utility, which is provided by SGI on SUSE Linux Enterprise Server (SLES). It is a daemon that caches memory footprints when queried. When configured to use the memory monitor, TORQUE queries memacctd. It is up to the user to make sure memacctd is installed. See the SGI memacctd man page for more information.
To configure TORQUE to use memacctd for memory enforcement