Scheduling a shared-memory NUMA-type system (not the same as a modern SMP-based individual compute node, which cannot share memory between compute nodes) requires some special configuration. Additionally, Moab can use NODESETs to guarantee feasibility of large-memory jobs and to enforce node allocation based on the system's interconnect network topology.
To integrate Moab and NUMA
RMCFG[sys-uv] TYPE=TORQUE
PARCFG[sys-uv] FLAGS=SharedMem
Cluster sys-uv is now configured as a shared-memory system to Moab.
NODESETPOLICY ONEOF
NODESETPRIORITYTYPE FIRSTFIT
The NODESET parameters tell Moab that performing node allocation using node sets is required, that the node set name is a feature name assigned to compute nodes, that a job must fit within the available nodes of one node set, and that Moab should use the first node set that contains sufficient available nodes to satisfy the job's request.
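As a rough illustration of what ONEOF combined with FIRSTFIT means, the following Python sketch picks the first node set that has enough available nodes for a job. It is not Moab's implementation; the function name `first_fit_node_set`, the dictionary layout, the order in which the sets are considered, and the sample availability data are all assumptions made for this example.

```python
from typing import Dict, List, Optional


def first_fit_node_set(
    node_sets: Dict[str, List[str]], nodes_needed: int
) -> Optional[str]:
    """Return the name of the first node set that can hold the whole job.

    node_sets maps a node set (feature) name to its currently available
    nodes, in the order the sets should be considered (an assumption).
    """
    for name, available in node_sets.items():
        # ONEOF: the job must fit entirely inside a single node set.
        if len(available) >= nodes_needed:
            # FIRSTFIT: stop at the first set that is large enough.
            return name
    return None  # no single node set can satisfy the request


# Hypothetical availability data, listed from smallest to largest set.
available = {
    "blade1": ["sys-uv2", "sys-uv3"],
    "pair1": ["sys-uv4", "sys-uv5", "sys-uv6", "sys-uv7"],
    "iru0": ["sys-uv%d" % i for i in range(2, 32)],
}
```

With this sample data, a two-node request lands in `blade1`, a three-node request falls through to `pair1`, and a request larger than any single set returns `None`, which corresponds to the job not being feasible within one node set.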
The SGI UV 1000 has a two-socket blade with a physical organization of 16 blades within a blade chassis (SGI term is Intra-Rack Unit or IRU), two blade chassis (IRUs) within a rack, and up to four racks within a single UV system. The UV 1000 interconnect network has a topology that requires zero hops between the two sockets on the same physical blade, one hop between an even-odd blade pair (e.g. blades 0 and 1, 2 and 3, etc.), two hops between all even-numbered or all odd-numbered blades within an IRU, three hops maximum between all blades within an IRU, four hops maximum between all even-numbered blades or all odd-numbered blades within a UV system, and five hops maximum between all blades within a UV system.
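The hop counts described above can be summarized in a short Python sketch. It assumes blades are numbered consecutively across the whole UV system, starting at 0, with 16 blades per IRU; the function name `max_hops` is hypothetical and exists only for this illustration.

```python
BLADES_PER_IRU = 16  # per the UV 1000 layout described above


def max_hops(blade_a: int, blade_b: int) -> int:
    """Upper bound on interconnect hops between sockets on two blades.

    Blades are identified by their system-wide number (an assumption).
    """
    if blade_a == blade_b:
        return 0  # both sockets sit on the same physical blade
    if blade_a // 2 == blade_b // 2:
        return 1  # even-odd blade pair, e.g. blades 2 and 3
    same_iru = blade_a // BLADES_PER_IRU == blade_b // BLADES_PER_IRU
    same_parity = blade_a % 2 == blade_b % 2
    if same_iru:
        # Same chassis: 2 hops for even-even or odd-odd, otherwise 3.
        return 2 if same_parity else 3
    # Different chassis: 4 hops for matching parity, otherwise 5.
    return 4 if same_parity else 5
```

For example, `max_hops(2, 4)` gives 2 (two even-numbered blades in the same IRU), while `max_hops(1, 16)` gives 5 (an odd and an even blade in different IRUs).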
Since the SGI UV operating system identifies each blade socket as a separate NUMA node, each NUMA node within a UV system is traditionally an individual compute node to Moab (although TORQUE has the ability to redefine a compute node definition by grouping OS NUMA nodes, which some UV installations do to define a blade as a compute node).
For the sake of illustration, this example assumes each OS NUMA node, which is a UV blade socket, is also a compute node in Moab. This means each compute node (blade socket) will have six feature names assigned, where each feature name must reflect both the compute node's location in the network topology and the hop count the name represents. A feature name is constructed by using the same root name for a hop count and a number for the topology location at the hop-count level.
For example, the root feature name "blade" represents the zero-hop count and the numbers "0", "1", etc., represent the consecutively numbered blades throughout the entire UV system, which yields feature names of "blade0" for the first blade in the system, "blade1" for the second blade, etc., to "blade127" for the last blade in a fully populated 4-rack UV system. To illustrate further, the root feature name "iru" represents the three-hop count and the numbers "0" through "7" represent the eight IRUs within a full 4-rack UV system.
Note that nodes 0 and 1 are not given any feature names. This is because the operating system instance for the UV system runs on the first blade and, in order not to adversely affect OS performance, no jobs should run on the same compute resources as the operating system; hence, these nodes have no node set feature names and therefore will never be chosen to run jobs. In addition, some of the first feature names at a specific hop-count level are omitted (such as pair0) since it makes no sense to define them when the first blade is a substantial part of the nodes making up a node set.
The node name of a UV system has the same name as the UV system's host name plus the NUMA node's relative socket number.
sys-uv2 blade1 oiru0 iru0 osys sys
sys-uv3 blade1 oiru0 iru0 osys sys
sys-uv4 blade2 pair1 eiru0 iru0 esys sys
sys-uv5 blade2 pair1 eiru0 iru0 esys sys
sys-uv6 blade3 pair1 oiru0 iru0 osys sys
sys-uv7 blade3 pair1 oiru0 iru0 osys sys
sys-uv8 blade4 pair2 eiru0 iru0 esys sys
sys-uv9 blade4 pair2 eiru0 iru0 esys sys
sys-uv10 blade5 pair2 oiru0 iru0 osys sys
sys-uv11 blade5 pair2 oiru0 iru0 osys sys
sys-uv12 blade6 pair3 eiru0 iru0 esys sys
sys-uv13 blade6 pair3 eiru0 iru0 esys sys
sys-uv14 blade7 pair3 oiru0 iru0 osys sys
sys-uv15 blade7 pair3 oiru0 iru0 osys sys
sys-uv16 blade8 pair4 eiru0 iru0 esys sys
sys-uv17 blade8 pair4 eiru0 iru0 esys sys
sys-uv18 blade9 pair4 oiru0 iru0 osys sys
sys-uv19 blade9 pair4 oiru0 iru0 osys sys
sys-uv20 blade10 pair5 eiru0 iru0 esys sys
sys-uv21 blade10 pair5 eiru0 iru0 esys sys
sys-uv22 blade11 pair5 oiru0 iru0 osys sys
sys-uv23 blade11 pair5 oiru0 iru0 osys sys
sys-uv24 blade12 pair6 eiru0 iru0 esys sys
sys-uv25 blade12 pair6 eiru0 iru0 esys sys
sys-uv26 blade13 pair6 oiru0 iru0 osys sys
sys-uv27 blade13 pair6 oiru0 iru0 osys sys
sys-uv28 blade14 pair7 eiru0 iru0 esys sys
sys-uv29 blade14 pair7 eiru0 iru0 esys sys
sys-uv30 blade15 pair7 oiru0 iru0 osys sys
sys-uv31 blade15 pair7 oiru0 iru0 osys sys
sys-uv32 blade16 pair8 eiru1 iru1 esys sys
sys-uv33 blade16 pair8 eiru1 iru1 esys sys
sys-uv34 blade17 pair8 oiru1 iru1 osys sys
sys-uv35 blade17 pair8 oiru1 iru1 osys sys
sys-uv62 blade31 pair15 oiru1 iru1 osys sys
sys-uv63 blade31 pair15 oiru1 iru1 osys sys
sys-uv64 blade32 pair16 eiru2 iru2 esys sys
sys-uv65 blade32 pair16 eiru2 iru2 esys sys
sys-uv126 blade63 pair31 oiru3 iru3 osys sys
sys-uv127 blade63 pair31 oiru3 iru3 osys sys
sys-uv128 blade64 pair32 eiru4 iru4 esys sys
sys-uv129 blade64 pair32 eiru4 iru4 esys sys
sys-uv190 blade95 pair47 oiru5 iru5 osys sys
sys-uv191 blade95 pair47 oiru5 iru5 osys sys
sys-uv192 blade96 pair48 eiru6 iru6 esys sys
sys-uv193 blade96 pair48 eiru6 iru6 esys sys
sys-uv252 blade126 pair63 eiru7 iru7 esys sys
sys-uv253 blade126 pair63 eiru7 iru7 esys sys
sys-uv254 blade127 pair63 oiru7 iru7 osys sys
sys-uv255 blade127 pair63 oiru7 iru7 osys sys
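The naming scheme used in the listing above can be generated rather than typed by hand. The following Python sketch derives the feature names from a node's socket number, assuming two sockets per blade, 16 blades per IRU, no feature names on the first blade, and no pair0 feature. The function names `feature_names` and `node_line` are hypothetical helpers for this illustration, not part of Moab or TORQUE.

```python
from typing import List

SOCKETS_PER_BLADE = 2
BLADES_PER_IRU = 16


def feature_names(node_index: int) -> List[str]:
    """Return the node set feature names for one OS NUMA node (socket)."""
    blade = node_index // SOCKETS_PER_BLADE
    if blade == 0:
        return []  # the first blade hosts the OS and runs no jobs
    parity = "e" if blade % 2 == 0 else "o"  # even or odd blade
    pair = blade // 2
    iru = blade // BLADES_PER_IRU
    names = ["blade%d" % blade]  # zero hops
    if pair != 0:
        names.append("pair%d" % pair)  # one hop; pair0 is omitted
    names += [
        "%siru%d" % (parity, iru),  # two hops
        "iru%d" % iru,              # three hops
        "%ssys" % parity,           # four hops
        "sys",                      # five hops
    ]
    return names


def node_line(hostname: str, node_index: int) -> str:
    """Format one line: node name followed by its feature names."""
    return " ".join(["%s%d" % (hostname, node_index)] + feature_names(node_index))
```

Calling `node_line("sys-uv", 4)` produces `sys-uv4 blade2 pair1 eiru0 iru0 esys sys`, which matches the corresponding line in the listing.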
The policy SINGLEJOB tells Moab not to allow jobs to share NUMA resources (cores and memory), which for a shared-memory system is very important for fast job execution. For example, if Moab scheduled a job to use the cores of a NUMA node where memory is used by another job, both jobs would execute slowly (up to 10 times more slowly).
Jobs can request processors and memory using the -l nodes=<number of cpus> and -l mem=<amount of memory> syntaxes. You should not have JOBNODEMATCHPOLICY EXACTNODE configured on a NUMA system. You must use the sharedmem job flag on submission to force the job to run only on a sharedmem partition or cluster and to indicate that the job can span multiple nodes. For example:
qsub -l nodes=3,mem=640gb,flags=sharedmem