Building TORQUE with NUMA Support

To enable NUMA support, pass the --enable-numa-support option to configure when building TORQUE. Add it alongside any other configuration options, as in the following example:

$ ./configure --enable-numa-support

Don't use MOM hierarchy with NUMA.

When TORQUE runs with NUMA support enabled, only a single instance of pbs_mom (MOM) runs on the system. However, TORQUE reports that there are multiple nodes running in the cluster. While pbs_mom and pbs_server both know there is only one instance of pbs_mom, they manage the cluster as if there were multiple separate MOM nodes.

The mom.layout file is a virtual mapping between the system hardware configuration and how the administrator wants TORQUE to view the system. Each line in mom.layout equates to a node in the cluster and is referred to as a NUMA node.

Automatically Creating mom.layout (Recommended)

A Perl script named mom_gencfg, provided in the contrib/ directory, generates the mom.layout file for you. The script can be customized by setting a few variables in it. To create the mom.layout file automatically, follow these instructions (they are also included in the script):

  1. Verify that hwloc version 1.1 or higher is installed (see contrib/hwloc_install.sh).
  2. Install Sys::Hwloc from CPAN.
  3. Verify that $PBS_HOME is set to the proper value.
  4. Update the variables in the 'Config Definitions' section of the script, in particular firstNodeId and nodesPerBoard if desired. Set firstNodeId above 0 if you have a root cpuset that you wish to exclude; nodesPerBoard is the number of NUMA nodes per board. Each node is defined in /sys/devices/system/node, in a subdirectory node<node index>.
  5. Back up your current mom.layout file in case a variable is set incorrectly or neglected.
  6. Run the script:
     $ ./mom_gencfg
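
Steps 3 and 5 above can be scripted before running mom_gencfg. The following Python sketch is an illustration only and is not part of TORQUE: it checks that a PBS_HOME value was supplied and copies an existing layout file to a backup. The helper name backup_layout, the .bak suffix, and the mom_priv/mom.layout location under $PBS_HOME are assumptions; adjust the path to wherever your installation keeps mom.layout.

```python
import shutil
from pathlib import Path


def backup_layout(pbs_home, relative_path="mom_priv/mom.layout"):
    """Copy an existing layout file to <name>.bak and return the backup path.

    Returns None when there is no layout file to back up.
    The default relative_path is an assumption about the install layout.
    """
    if not pbs_home:
        raise ValueError("PBS_HOME is not set")
    layout = Path(pbs_home) / relative_path
    if not layout.is_file():
        return None
    backup = layout.with_name(layout.name + ".bak")
    shutil.copy2(layout, backup)  # copy2 also preserves timestamps
    return backup
```

A typical call would be backup_layout(os.environ.get("PBS_HOME", "")) after importing os; an unset variable then raises ValueError instead of silently skipping the backup.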

Manually Creating mom.layout

To properly set up the mom.layout file, it is important to know how the hardware is configured. Use the topology command line utility and inspect the contents of /sys/devices/system/node. The hwloc library can also be used to create a custom discovery tool.

Typing topology on the command line of a NUMA system produces something similar to the following:

Partition number: 0
6 Blades
72 CPUs
378.43 Gb Memory Total

Blade         ID       asic  NASID         Memory
-------------------------------------------------
    0 r001i01b00  UVHub 1.0      0    67089152 kB
    1 r001i01b01  UVHub 1.0      2    67092480 kB
    2 r001i01b02  UVHub 1.0      4    67092480 kB
    3 r001i01b03  UVHub 1.0      6    67092480 kB
    4 r001i01b04  UVHub 1.0      8    67092480 kB
    5 r001i01b05  UVHub 1.0     10    67092480 kB

CPU      Blade PhysID CoreID APIC-ID Family Model Speed L1(KiB) L2(KiB) L3(KiB)
-------------------------------------------------------------------------------
  0 r001i01b00     00     00       0      6    46  2666 32d/32i     256   18432
  1 r001i01b00     00     02       4      6    46  2666 32d/32i     256   18432
  2 r001i01b00     00     03       6      6    46  2666 32d/32i     256   18432
  3 r001i01b00     00     08      16      6    46  2666 32d/32i     256   18432
  4 r001i01b00     00     09      18      6    46  2666 32d/32i     256   18432
  5 r001i01b00     00     11      22      6    46  2666 32d/32i     256   18432
  6 r001i01b00     01     00      32      6    46  2666 32d/32i     256   18432
  7 r001i01b00     01     02      36      6    46  2666 32d/32i     256   18432
  8 r001i01b00     01     03      38      6    46  2666 32d/32i     256   18432
  9 r001i01b00     01     08      48      6    46  2666 32d/32i     256   18432
 10 r001i01b00     01     09      50      6    46  2666 32d/32i     256   18432
 11 r001i01b00     01     11      54      6    46  2666 32d/32i     256   18432
 12 r001i01b01     02     00      64      6    46  2666 32d/32i     256   18432
 13 r001i01b01     02     02      68      6    46  2666 32d/32i     256   18432
 14 r001i01b01     02     03      70      6    46  2666 32d/32i     256   18432

From this partial output, note that this system has 72 CPUs on 6 blades. Each blade has 12 CPUs grouped into clusters of 6 CPUs. If the entire content of this command were printed you would see each Blade ID and the CPU ID assigned to each blade.
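
To see which CPU IDs belong to each blade without reading the full listing by eye, the CPU table can be grouped by its Blade column. The Python sketch below is a hedged illustration that assumes the output has the same shape as the sample above: a header row whose first field is CPU, followed by rows that begin with an integer CPU ID and a blade ID. The function name cpus_by_blade is hypothetical.

```python
from collections import defaultdict


def cpus_by_blade(topology_text):
    """Group CPU IDs by blade ID from captured topology output.

    Rows are only read after the header line whose first field is "CPU",
    so the blade summary table above it is ignored.
    """
    blades = defaultdict(list)
    in_cpu_table = False
    for line in topology_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "CPU":
            in_cpu_table = True
            continue
        # Data rows start with a numeric CPU ID; the dashed rule does not.
        if in_cpu_table and len(fields) >= 2 and fields[0].isdigit():
            blades[fields[1]].append(int(fields[0]))
    return dict(blades)
```

Applied to the partial output above, the result would map r001i01b00 to CPUs 0 through 11 and r001i01b01 to CPUs 12 through 14.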

The topology command shows how the CPUs are distributed, but you likely also need to know where memory is located relative to CPUs, so go to /sys/devices/system/node. If you list the node directory you will see something similar to the following:

# ls -al
total 0
drwxr-xr-x 14 root root    0 Dec 3 12:14 .
drwxr-xr-x 14 root root    0 Dec 3 12:13 ..
-r--r--r--  1 root root 4096 Dec 3 14:58 has_cpu
-r--r--r--  1 root root 4096 Dec 3 14:58 has_normal_memory
drwxr-xr-x  2 root root    0 Dec 3 12:14 node0
drwxr-xr-x  2 root root    0 Dec 3 12:14 node1
drwxr-xr-x  2 root root    0 Dec 3 12:14 node10
drwxr-xr-x  2 root root    0 Dec 3 12:14 node11
drwxr-xr-x  2 root root    0 Dec 3 12:14 node2
drwxr-xr-x  2 root root    0 Dec 3 12:14 node3
drwxr-xr-x  2 root root    0 Dec 3 12:14 node4
drwxr-xr-x  2 root root    0 Dec 3 12:14 node5
drwxr-xr-x  2 root root    0 Dec 3 12:14 node6
drwxr-xr-x  2 root root    0 Dec 3 12:14 node7
drwxr-xr-x  2 root root    0 Dec 3 12:14 node8
drwxr-xr-x  2 root root    0 Dec 3 12:14 node9
-r--r--r--  1 root root 4096 Dec 3 14:58 online
-r--r--r--  1 root root 4096 Dec 3 14:58 possible

The directory entries node0, node1, ..., node11 represent groups of memory and CPUs that are local to each other. Each group is a node board, a grouping of resources that are close together. In most cases, a node board is made up of memory and processor cores. The operating system calls each bank of memory a memory node, and certain CPUs can access that memory very rapidly. Under the directory for node board node0 there is an entry called cpulist, which contains the CPU IDs of all CPUs local to the memory in node board 0.
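
Reading every cpulist by hand is tedious on a larger system. The following Python sketch, offered as an illustration under stated assumptions, collects the CPU IDs for each node<N> directory and returns them in numeric order (a plain directory listing sorts node10 before node2). It assumes each cpulist holds comma-separated IDs or ranges such as 0-5; the function names are hypothetical.

```python
import re
from pathlib import Path


def parse_cpulist(text):
    """Expand a cpulist string such as "0-5" or "0-2,8" into a list of CPU IDs."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty cpulist yields no CPUs
        first, _, last = part.partition("-")
        cpus.extend(range(int(first), int(last or first) + 1))
    return cpus


def read_node_cpulists(node_root="/sys/devices/system/node"):
    """Return {node index: [CPU IDs]} for every node<N> directory under node_root."""
    result = {}
    for entry in Path(node_root).iterdir():
        match = re.fullmatch(r"node(\d+)", entry.name)
        cpulist = entry / "cpulist"
        if match and cpulist.is_file():
            result[int(match.group(1))] = parse_cpulist(cpulist.read_text())
    # Sort by numeric node index rather than by directory name.
    return dict(sorted(result.items()))
```

On the example system, node 0 would map to CPUs 0 through 5 and node 1 to CPUs 6 through 11.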

Now create the mom.layout file. The cpulist for node board 0 contains 0-5; these CPUs are local to the memory of node board 0, and the memory and CPUs for that node are specified in the layout file with nodes=0. The cpulist for node board 1 shows 6-11, with memory node index 1; to specify this, write nodes=1. Repeat this for all twelve node boards to create the following mom.layout file for the 72-CPU system:

nodes=0
nodes=1
nodes=2
nodes=3
nodes=4
nodes=5
nodes=6
nodes=7
nodes=8
nodes=9
nodes=10
nodes=11

Each line in the mom.layout file is reported as a node to pbs_server by the pbs_mom daemon.

The mom.layout file does not need to match the hardware layout exactly. It is possible to combine node boards and create larger NUMA nodes. The following example shows how to do this:

nodes=0-1

Memory nodes are combined in the same way as CPUs. The memory nodes being combined must be contiguous; for example, you cannot combine memory nodes 0 and 2.
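
The layout lines shown in this section follow a regular pattern, so they can be generated rather than typed. The Python sketch below is an illustration, not a TORQUE tool: it emits one nodes= line per group of consecutive node boards, which keeps every combined group contiguous by construction. The function name layout_lines and its parameters are hypothetical, and applying the nodes=0-1 range form to every line of a file is an extrapolation from the single example above.

```python
def layout_lines(num_node_boards, boards_per_line=1):
    """Return mom.layout lines covering node boards 0..num_node_boards-1.

    Each line combines boards_per_line consecutive boards, so every
    combined group is contiguous. A trailing group may be smaller.
    """
    if num_node_boards < 1 or boards_per_line < 1:
        raise ValueError("counts must be positive")
    lines = []
    for first in range(0, num_node_boards, boards_per_line):
        last = min(first + boards_per_line, num_node_boards) - 1
        lines.append(f"nodes={first}" if first == last else f"nodes={first}-{last}")
    return lines
```

With layout_lines(12) the result matches the twelve-line file shown earlier; layout_lines(12, 2) would give six lines, starting with nodes=0-1.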

Configuring server_priv/nodes

The pbs_server must be made aware of how the MOM reports nodes, because there is only one MOM daemon but multiple MOM nodes. To do this, configure the server_priv/nodes file with the num_node_boards and numa_board_str attributes. The num_node_boards attribute tells pbs_server how many NUMA nodes are reported by the MOM. The following is an example of how to configure the nodes file with num_node_boards:

numa-10 np=72 num_node_boards=12

This line in the nodes file tells pbs_server there is a host named numa-10 and that it has 72 processors and 12 nodes. The pbs_server divides the value of np (72) by the value for num_node_boards (12) and determines there are 6 CPUs per NUMA node.

In this example, the NUMA system is uniform in its configuration of CPUs per node board, but a system does not need to be configured with the same number of CPUs per node board. For systems with non-uniform CPU distributions, use the attribute numa_board_str to let pbs_server know where CPUs are located in the cluster.

The following is an example of how to configure the server_priv/nodes file for non-uniformly distributed CPUs:

Numa-11 numa_board_str=6,8,12

In this configuration, pbs_server knows it has three MOM nodes and the nodes have 6, 8, and 12 CPUs respectively. Note that the attribute np is not used. The np attribute is ignored because the number of CPUs per node is expressly given.
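
The two forms of nodes-file entry can be checked with a few lines of arithmetic. The Python sketch below is a hedged illustration of the rules described in this section, not pbs_server's actual parser: with numa_board_str it returns the listed counts, and otherwise it divides np by num_node_boards. The function name is hypothetical, and rejecting an uneven split is this sketch's own choice; how pbs_server treats that case is not covered here.

```python
def cpus_per_numa_node(nodes_line):
    """Return the CPU count of each NUMA node described by one nodes-file line."""
    host, *attrs = nodes_line.split()
    values = dict(attr.split("=", 1) for attr in attrs if "=" in attr)
    if "numa_board_str" in values:
        # Explicit per-node counts; np is not consulted.
        return [int(n) for n in values["numa_board_str"].split(",")]
    boards = int(values["num_node_boards"])
    per_node, remainder = divmod(int(values["np"]), boards)
    if remainder:
        # Sketch-only policy: refuse splits that do not divide evenly.
        raise ValueError(f"{host}: np is not a multiple of num_node_boards")
    return [per_node] * boards
```

For the two example lines above, this gives twelve nodes of 6 CPUs for numa-10 and nodes of 6, 8, and 12 CPUs for Numa-11.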

Enforcement of memory resource limits

TORQUE can better enforce memory limits with the use of the memacctd utility. The memacctd utility is provided by SGI on SUSE Linux Enterprise Server (SLES). It is a daemon that caches memory footprints when it is queried. When configured to use the memory monitor, TORQUE queries memacctd. It is up to the user to make sure memacctd is installed. See the SGI memacctd man page for more information.

To configure TORQUE to use memacctd for memory enforcement:

  1. Start memacctd as instructed by SGI.
  2. Reconfigure TORQUE with --enable-memacct. This will link in the necessary library when TORQUE is recompiled.
  3. Recompile and reinstall TORQUE.
  4. Restart all MOM nodes.
  5. (Optional) Alter the qsub filter to include a default memory limit for all jobs that are not submitted with a memory limit.
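
Step 5 can be approached with a submit filter that rewrites the job script before submission. The Python sketch below is an illustration only. It assumes the filter receives the job script on standard input and that whatever it writes to standard output is what gets submitted. The function name, the 2gb default, and the matching rules are hypothetical. It inspects only #PBS directives in the script text, so limits given on the qsub command line would need separate handling.

```python
import re

# Matches a #PBS directive whose -l resource list names a memory resource.
MEM_DIRECTIVE = re.compile(r"^#PBS\s+.*-l\s*.*\b(mem|pmem|vmem|pvmem)=")


def add_default_mem(script_text, default="2gb"):
    """Return script_text with a '#PBS -l mem=<default>' line added if absent."""
    lines = script_text.splitlines()
    if any(MEM_DIRECTIVE.match(line) for line in lines):
        return script_text  # the job already requests a memory limit
    # Keep a shebang first; place the directive directly after it.
    insert_at = 1 if lines and lines[0].startswith("#!") else 0
    lines.insert(insert_at, f"#PBS -l mem={default}")
    return "\n".join(lines) + "\n"
```

A filter script built on this would read sys.stdin, pass the text through add_default_mem, and write the result to sys.stdout.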

© 2015 Adaptive Computing