G.10 Moab-NUMA-Support Integration Guide

This topic applies only to large-scale SLES systems with NUMA support using SGI Altix and UV hardware, and it requires Torque 3.0 or later.

Scheduling a shared-memory NUMA-type system (not the same as a cluster of modern SMP-based individual compute nodes, which cannot share memory between compute nodes) requires some special configuration. Additionally, Moab can use NODESETs to guarantee feasibility of large memory jobs and to enforce node allocation based on the system's interconnect network topology.

G.10.1 Configuration

To integrate Moab and NUMA

  1. Configure Moab to schedule large memory jobs. Because Moab creates a partition for each resource manager by default, you must configure the cluster controlled by the resource manager to be a shared-memory system to support jobs spanning multiple nodes/blades. To do so, use the PARCFG parameter.

    RMCFG[sys-uv]  TYPE=Torque
    PARCFG[sys-uv] FLAGS=SharedMem

    Cluster sys-uv is now configured as a shared-memory system to Moab.
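
    Once Moab has connected to the resource manager, you can list the partitions it has created (including sys-uv) with Moab's partition diagnostics command; the exact output varies by Moab version:

    mdiag -t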

  2. Configure NODESETs as shown below.

    NODESETISOPTIONAL FALSE
    NODESETATTRIBUTE FEATURE
    NODESETPOLICY ONEOF
    NODESETPRIORITYTYPE FIRSTFIT

    The NODESET parameters tell Moab that performing node allocation using node sets is required, that the node set name is a feature name assigned to compute nodes, that a job must fit within the available nodes of one node set, and that Moab should use the first node set that contains sufficient available nodes to satisfy the job's request.
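
    After restarting or reconfiguring Moab, you can confirm the node set values it has in effect with the showconfig command, for example by filtering its output for the node set parameters; the exact formatting varies by Moab version:

    showconfig | grep NODESET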

  3. To configure Moab to perform topology-aware node allocation using node sets, you must create a node set definition for each set of nodes that has the same maximum number of network "hops" from any node to every other node within the node set. For an example, see the following sample scenario:

    Use case

    The SGI UV 1000 has a two-socket blade with a physical organization of 16 blades within a blade chassis (SGI term is Intra-Rack Unit, or IRU), two blade chassis (IRUs) within a rack, and up to four racks within a single UV system. The UV 1000 interconnect network has a topology that requires:

      • zero hops between the two sockets on the same physical blade
      • one hop between an even-odd blade pair (e.g., blades 0 and 1, 2 and 3, etc.)
      • two hops between all even-numbered or all odd-numbered blades within an IRU
      • three hops maximum between all blades within an IRU
      • four hops maximum between all even-numbered blades or all odd-numbered blades within a UV system
      • five hops maximum between all blades within a UV system

    1. Define topology-aware node set definitions that parallel the compute nodes reachable within a specific hop count. For the UV 1000, this means the sockets of each blade will belong to six separate node set definitions (i.e., one each for 0, 1, 2, 3, 4, and 5 hops).
    2. Define multiple node sets for the different nodes reachable within a specific hop count, based on where they are in the network topology; that is, you must create a separate and distinct node set definition for each pair of blades reachable with one hop, for each IRU's nodes reachable in three hops, etc.
    3. Moab node sets are usually defined as compute node features; that is, each node set defined to Moab should appear as a "feature" name on one or more compute nodes. Which node set/feature names appear on each compute node depends on where the compute node is in the interconnect network topology.

      Since the SGI UV operating system identifies each blade socket as a separate NUMA node, each NUMA node within a UV system is traditionally an individual compute node to Moab (although Torque has the ability to redefine a compute node definition by grouping OS NUMA nodes, which some UV installations do to define a blade as a compute node).

      For the sake of illustration, this example assumes each OS NUMA node, which is a UV blade socket, is also a compute node in Moab. This means each compute node (blade socket) will have six feature names assigned, where each feature name must reflect both the compute node's location in the network topology and the hop count the name represents. A feature name is constructed by using the same root name for a hop count and a number for the topology location at the hop-count level.

      For example, the root feature name "blade" represents the zero-hop count and the numbers "0", "1", etc., represent the consecutively numbered blades throughout the entire UV system, which yields feature names of "blade0" for the first blade in the system, "blade1" for the second blade, and so on up to "blade127" for the last blade in a fully populated 4-rack UV system. To illustrate further, the root feature name "iru" represents the 3-hop count and the numbers "0" through "7" represent the eight IRUs within a full 4-rack UV system.

    4. For each compute node, configure the correct feature name for each of the hop counts possible and its location within the topology at the hop-count level (e.g., blade (0 hops), blade pair (1 hop), odd- or even-numbered nodes within an IRU (2 hops), IRU (3 hops), odd- or even-numbered nodes within the UV (4 hops), and UV system (5 hops)). The following example illustrates the feature names assigned to the compute nodes for an SGI UV 1000 system using the following root feature names.
      • blade (0 hops)
      • pair (1 hop)
      • eiru (2 hops for even-numbered blades within an IRU)
      • oiru (2 hops for odd-numbered blades within an IRU)
      • iru (3 hops)
      • esys (4 hops for even-numbered blades within a UV system)
      • osys (4 hops for odd-numbered blades within a UV system)
      • sys (5 hops)

      Note that nodes 0 and 1 are not given any feature names. This is because the operating system instance for the UV system runs on the first blade and, in order not to adversely affect OS performance, no jobs should run on the same compute resources as the operating system; hence, these nodes have no node set feature names and therefore will never be chosen to run jobs. In addition, some of the first feature names at a specific hop-count level are omitted (such as pair0), since it makes no sense to define them when the first blade is a substantial part of the nodes making up a node set.

      Each compute node name is the UV system's host name followed by the NUMA node's relative socket number (e.g., sys-uv4 for socket 4).

    /var/spool/torque/server_priv/nodes:
    sys-uv0
    sys-uv1
    sys-uv2   blade1          oiru0 iru0 osys sys
    sys-uv3   blade1          oiru0 iru0 osys sys
    sys-uv4   blade2   pair1  eiru0 iru0 esys sys
    sys-uv5   blade2   pair1  eiru0 iru0 esys sys
    sys-uv6   blade3   pair1  oiru0 iru0 osys sys
    sys-uv7   blade3   pair1  oiru0 iru0 osys sys
    sys-uv8   blade4   pair2  eiru0 iru0 esys sys
    sys-uv9   blade4   pair2  eiru0 iru0 esys sys
    sys-uv10  blade5   pair2  oiru0 iru0 osys sys
    sys-uv11  blade5   pair2  oiru0 iru0 osys sys
    sys-uv12  blade6   pair3  eiru0 iru0 esys sys
    sys-uv13  blade6   pair3  eiru0 iru0 esys sys
    sys-uv14  blade7   pair3  oiru0 iru0 osys sys
    sys-uv15  blade7   pair3  oiru0 iru0 osys sys
    sys-uv16  blade8   pair4  eiru0 iru0 esys sys
    sys-uv17  blade8   pair4  eiru0 iru0 esys sys
    sys-uv18  blade9   pair4  oiru0 iru0 osys sys
    sys-uv19  blade9   pair4  oiru0 iru0 osys sys
    sys-uv20  blade10  pair5  eiru0 iru0 esys sys
    sys-uv21  blade10  pair5  eiru0 iru0 esys sys
    sys-uv22  blade11  pair5  oiru0 iru0 osys sys
    sys-uv23  blade11  pair5  oiru0 iru0 osys sys
    sys-uv24  blade12  pair6  eiru0 iru0 esys sys
    sys-uv25  blade12  pair6  eiru0 iru0 esys sys
    sys-uv26  blade13  pair6  oiru0 iru0 osys sys
    sys-uv27  blade13  pair6  oiru0 iru0 osys sys
    sys-uv28  blade14  pair7  eiru0 iru0 esys sys
    sys-uv29  blade14  pair7  eiru0 iru0 esys sys
    sys-uv30  blade15  pair7  oiru0 iru0 osys sys
    sys-uv31  blade15  pair7  oiru0 iru0 osys sys
    sys-uv32  blade16  pair8  eiru1 iru1 esys sys
    sys-uv33  blade16  pair8  eiru1 iru1 esys sys
    sys-uv34  blade17  pair9  oiru1 iru1 osys sys
    sys-uv35  blade17  pair9  oiru1 iru1 osys sys
    ...
    sys-uv62  blade31  pair15 oiru1 iru1 osys sys
    sys-uv63  blade31  pair15 oiru1 iru1 osys sys
    sys-uv64  blade32  pair16 eiru2 iru2 esys sys
    sys-uv65  blade32  pair16 eiru2 iru2 esys sys
    ...
    sys-uv126 blade63  pair31 oiru3 iru3 osys sys
    sys-uv127 blade63  pair31 oiru3 iru3 osys sys
    sys-uv128 blade64  pair32 eiru4 iru4 esys sys
    sys-uv129 blade64  pair32 eiru4 iru4 esys sys
    ...
    sys-uv190 blade95  pair47 oiru5 iru5 osys sys
    sys-uv191 blade95  pair47 oiru5 iru5 osys sys
    sys-uv192 blade96  pair48 eiru6 iru6 esys sys
    sys-uv193 blade96  pair48 eiru6 iru6 esys sys
    ...
    sys-uv252 blade126 pair63 eiru7 iru7 esys sys
    sys-uv253 blade126 pair63 eiru7 iru7 esys sys
    sys-uv254 blade127 pair63 oiru7 iru7 osys sys
    sys-uv255 blade127 pair63 oiru7 iru7 osys sys
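
      After Torque reports these nodes to Moab, you can spot-check that the feature names were picked up, for example with Moab's checknode command for a single node or mdiag -n for all nodes; the exact output varies by Moab version:

    checknode sys-uv4
    mdiag -n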

  4. Define the order in which Moab should check node sets for available nodes. Because NODESETPRIORITYTYPE has a value of FIRSTFIT, the node sets must be ordered from smallest to largest so that Moab always chooses the node set with the fewest nodes required to satisfy the job's request. This means listing all blades, then blade pairs, then even and odd IRU sets, then IRUs, then the even and odd system sets, and finally the full system.

    moab.cfg:
    NODESETLIST blade1,blade2,blade3,…,blade127,pair1,pair2,pair3,…,pair63,eiru0,oiru0,eiru1,oiru1,…,eiru7,oiru7,iru0,iru1,…,iru7,esys,osys,sys
  5. Configure Moab to use the PRIORITY NODEALLOCATIONPOLICY. This allocation policy causes Moab to allocate enough nodes to fulfill a job's processor and memory requirement.

    NODEALLOCATIONPOLICY PRIORITY
  6. Set NODEACCESSPOLICY to SINGLEJOB to ensure that Moab schedules large memory requests correctly and efficiently. This is necessary even when a job uses only the memory of a single NUMA node.

    NODEACCESSPOLICY SINGLEJOB

    The policy SINGLEJOB tells Moab not to allow jobs to share NUMA resources (cores and memory), which for a shared-memory system is very important for fast job execution. For example, if Moab scheduled a job to use the cores of a NUMA node where memory is used by another job, both jobs would execute slowly (up to 10 times more slowly).
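
For reference, here is a consolidated view of the moab.cfg entries produced by the steps above; everything shown is taken from those steps, with the NODESETLIST abbreviated exactly as in step 4:

    RMCFG[sys-uv]        TYPE=Torque
    PARCFG[sys-uv]       FLAGS=SharedMem

    NODESETISOPTIONAL    FALSE
    NODESETATTRIBUTE     FEATURE
    NODESETPOLICY        ONEOF
    NODESETPRIORITYTYPE  FIRSTFIT
    NODESETLIST          blade1,blade2,blade3,…,blade127,pair1,pair2,pair3,…,pair63,eiru0,oiru0,eiru1,oiru1,…,eiru7,oiru7,iru0,iru1,…,iru7,esys,osys,sys

    NODEALLOCATIONPOLICY PRIORITY
    NODEACCESSPOLICY     SINGLEJOB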

G.10.2 Job Submission

Jobs can request processors and memory using the -l nodes=<number of cpus> and -l mem=<amount of memory> syntax. You should not have JOBNODEMATCHPOLICY EXACTNODE configured on a NUMA system. You must use the sharedmem job flag on submission to force the job to run only on a sharedmem partition or cluster and to indicate that the job can span multiple nodes. For example:

qsub -l nodes=3,mem=640gb,flags=sharedmem
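
The same request can also be expressed with Torque directives in a job script rather than on the qsub command line; the job name, resource values, and application name below are illustrative only:

#!/bin/sh
#PBS -N sharedmem-example
#PBS -l nodes=3,mem=640gb,flags=sharedmem

# Launch the application on the resources Moab allocated within the
# shared-memory partition (my_large_memory_app is a hypothetical binary).
./my_large_memory_app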
