7.3 Node Set Overview

7.3.1 Node Set Usage Overview

While backfill improves the scheduler's performance, this is only half the battle. The efficiency of a cluster, in terms of actual work accomplished, is a function of both scheduling performance and individual job efficiency. In many clusters, job efficiency can vary from node to node as well as with the node mix allocated. Most parallel jobs written with popular message-passing libraries such as MPI or PVM do not internally load balance their workload and thus run only as fast as the slowest node allocated. Consequently, these jobs run most effectively on homogeneous sets of nodes. However, while many clusters start out as homogeneous, they quickly evolve as new generations of compute nodes are integrated into the system. Research has shown that this integration, while improving scheduling performance by giving the scheduler more resources to select from, can actually decrease average job efficiency.

A feature called node sets allows jobs to request sets of common resources without specifying exactly what resources are required. Node set policy can be specified globally or on a per-job basis. In addition to their use in forcing jobs onto homogeneous nodes, these policies may also be used to guide jobs to one or more types of nodes on which a particular job performs best, similar to job preferences available in other systems. For example, an I/O-intensive job may run best on a certain range of processor speeds, running slower on slower nodes while wasting cycles on faster nodes.

A job may specify ANYOF:FEATURE:bigmem,fastos to request nodes with the bigmem or fastos feature. Alternatively, if a simple feature-homogeneous node set is desired, ONEOF:FEATURE may be specified. A job may also request a feature-based node set with the configuration ONEOF:FEATURE:bigmem,fastos, in which case Moab will first attempt to locate adequate nodes where all nodes contain the bigmem feature. If such a set cannot be found, Moab will look for sets of nodes containing the other specified features. In highly heterogeneous clusters, the use of node sets improves job throughput by 10 to 15%.

Node sets can be requested on a system-wide or per-job basis. System-wide configuration is accomplished via the NODESET* parameters, while per-job specification occurs via the resource manager extensions.

The GLOBAL node is included in all feature node sets.

When creating node sets, you have the option of using a fixed configuration or of creating node sets dynamically (by using the msub command). This topic explains how to set up both node set use cases.

7.3.2 Node Set Configuration Examples

Global node sets are defined using the NODESETPOLICY, NODESETATTRIBUTE, NODESETLIST, and NODESETISOPTIONAL parameters. As stated before, you can create node sets dynamically (see Dynamic example) or with a fixed configuration (see Fixed configuration example). The use of these parameters can be best highlighted with two examples.

7.3.2.A Fixed configuration example

In this example, a large site possesses a Myrinet-based interconnect and wishes to allocate nodes within Myrinet switch boundaries whenever possible. To accomplish this, the site could assign a node attribute to each node indicating which switch it is associated with (switchA, switchB, and so forth) and then use the following system-wide node set configuration:

NODESETPOLICY     ONEOF
NODESETATTRIBUTE  FEATURE
NODESETISOPTIONAL TRUE
NODESETLIST       switchA,switchB,switchC,switchD
...

Node Set Policy

In the preceding example, the NODESETPOLICY parameter is set to the policy ONEOF and tells Moab to allocate nodes within a single attribute set. Other node set policies are listed in the following table:

Policy   Description
ANYOF    Select resources from all sets contained in node set list. The job could span multiple node sets.
FIRSTOF  Select resources from first set to match specified constraints.
ONEOF    Select a single set that contains adequate resources to support job.
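
The difference between spanning sets and staying inside one set can be illustrated with a small Python sketch. This is not Moab's implementation: the function select_nodes, its arguments, and the node names are hypothetical, and the sketch deliberately treats ONEOF and FIRSTOF the same way (the first listed set with enough idle nodes), which is a simplification of the table above.

```python
from typing import Dict, List, Optional


def select_nodes(policy: str,
                 node_set_list: List[str],
                 idle_nodes: Dict[str, List[str]],
                 needed: int) -> Optional[List[str]]:
    """Pick `needed` idle nodes under a simplified model of the policy.

    Returns None when the request cannot be satisfied right now.
    """
    if policy == "ANYOF":
        # Pool idle nodes from every listed set; the job may span sets.
        # dict.fromkeys drops duplicates (a node can carry several
        # features) while keeping first-seen order.
        pool = list(dict.fromkeys(
            node for name in node_set_list for node in idle_nodes.get(name, [])
        ))
        return pool[:needed] if len(pool) >= needed else None

    if policy in ("ONEOF", "FIRSTOF"):
        # Stay inside a single set. Assumption: both policies take the
        # first listed set that has enough idle nodes.
        for name in node_set_list:
            nodes = idle_nodes.get(name, [])
            if len(nodes) >= needed:
                return nodes[:needed]
        return None

    raise ValueError(f"unknown node set policy: {policy}")


idle = {"switchA": ["node01", "node02"],
        "switchB": ["node03", "node04", "node05"]}
sets = ["switchA", "switchB"]
print(select_nodes("ONEOF", sets, idle, 3))  # ['node03', 'node04', 'node05']
print(select_nodes("ANYOF", sets, idle, 4))  # ['node01', 'node02', 'node03', 'node04']
print(select_nodes("ONEOF", sets, idle, 4))  # None
```

In this model a four-node job fits only under ANYOF, because no single switch has four idle nodes.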

Node Set Attribute

The example's NODESETATTRIBUTE parameter is set to FEATURE, specifying that the node sets are to be constructed along node feature boundaries.

You could also set the NODESETATTRIBUTE to VARATTR, specifying that node sets are to be constructed according to VARATTR values on the job.

Node Set Constraint Handling

The next parameter, NODESETISOPTIONAL, indicates that Moab should not delay the start time of a job if the desired node set is not available but adequate idle resources exist outside of the set. Setting this parameter to TRUE tells Moab to attempt to use a node set if it is available, but if not, to run the job as soon as possible anyway.

Setting NODESETISOPTIONAL to FALSE will force the job to always run in a complete node set regardless of any start delay this imposes.
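
The effect of the two settings can be summarized as a small decision function. The following Python sketch is only an illustration of the behavior described above; the function name placement and its boolean inputs are hypothetical and do not correspond to any Moab interface.

```python
def placement(set_available: bool, idle_outside: bool, is_optional: bool) -> str:
    """Simplified NODESETISOPTIONAL decision for one scheduling pass."""
    if set_available:
        # The desired node set can hold the job, so use it.
        return "run in node set"
    if is_optional and idle_outside:
        # TRUE: do not delay the job just to honor the node set.
        return "run outside node set"
    # FALSE, or nothing idle anywhere: the job waits.
    return "wait"


print(placement(set_available=False, idle_outside=True, is_optional=True))
print(placement(set_available=False, idle_outside=True, is_optional=False))
```

The first call models NODESETISOPTIONAL TRUE and starts the job outside the set; the second models FALSE and leaves the job waiting for a complete node set.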

Node Set List

Finally, the NODESETLIST value of switchA,switchB,... tells Moab to only use node sets based on the listed feature values. This is necessary because sites often use node features for many purposes, and the resulting node sets would be of little use for switch proximity if they were generated from irrelevant node features indicating things such as processor speed or node architecture.

To add nodes to the NODESETLIST, you must configure features on your nodes using the NODECFG FEATURES attribute.

NODECFG[node01] FEATURES=switchA
NODECFG[node02] FEATURES=switchA
NODECFG[node03] FEATURES=switchB

Nodes node01 and node02 contain the switchA feature, and node node03 contains the switchB feature.
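
To see how the NODECFG lines and NODESETLIST combine into node sets, the following Python sketch groups nodes by the listed features. It is an illustration only: build_feature_sets is a hypothetical helper, the regular expression covers just the line shape shown above, and splitting FEATURES on commas is an assumption about how multiple features would be written.

```python
import re
from typing import Dict, List

# Matches lines shaped like: NODECFG[node01] FEATURES=switchA
NODECFG_RE = re.compile(r"^NODECFG\[(?P<node>[^\]]+)\]\s+FEATURES=(?P<features>\S+)")


def build_feature_sets(cfg_text: str, node_set_list: List[str]) -> Dict[str, List[str]]:
    """Map each feature in node_set_list to the nodes that carry it."""
    sets: Dict[str, List[str]] = {name: [] for name in node_set_list}
    for line in cfg_text.splitlines():
        match = NODECFG_RE.match(line.strip())
        if not match:
            continue
        # Assumption: several features would be comma-separated.
        for feature in match.group("features").split(","):
            # Features not named in NODESETLIST do not form node sets.
            if feature in sets:
                sets[feature].append(match.group("node"))
    return sets


cfg = """
NODECFG[node01] FEATURES=switchA
NODECFG[node02] FEATURES=switchA
NODECFG[node03] FEATURES=switchB
"""
print(build_feature_sets(cfg, ["switchA", "switchB", "switchC", "switchD"]))
```

With the three lines above, switchA holds node01 and node02, switchB holds node03, and switchC and switchD remain empty until nodes are given those features.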

Node Set Priority

When resources are available in more than one resource set, the NODESETPRIORITYTYPE parameter allows control over how the best resource set is selected. Legal values for this parameter are described in the following table:

AFFINITY
Description: Avoid a resource set with negative affinity.
Details: Choosing this type causes Moab to select a node set with no negative affinity nodes (nodes that have a reservation with negative affinity). If all node sets have negative affinity, then Moab will select the first matching node set.

BESTFIT
Description: Select the smallest resource set possible.
Details: Choosing this type causes Moab, when selecting a node set, to eliminate sets that do not have all the required resources. From the remaining sets, Moab chooses the set with the least amount of resources. This priority type most closely matches the job requirements in order to waste the least amount of resources. This type minimizes fragmentation of larger resource sets.

FIRSTFIT
Description: Select the first set with enough resources.
Details: Moab will select the first node set with enough resources to satisfy the job. This is the fastest of the priority types.

MINLOSS
Description: Select the resource set that results in the minimal wasted resources assuming no internal job load balancing is available. (Assumes parallel jobs only run as fast as the slowest allocated node.)
Details: Choosing this type works only when using the following configuration:

NODESETATTRIBUTE FEATURE

In a SHAREDMEM environment (see Moab-NUMA-Support Integration Guide for more information), Moab will select the node set based on NUMA properties (the smallest feasible node set).

WORSTFIT
Description: Select the largest resource set possible.
Details: This type causes Moab, when choosing a node set, to eliminate sets that do not have all the required resources. From the remaining sets, Moab chooses the set with the greatest amount of resources. This type minimizes fragmentation of smaller resource sets, but increases fragmentation of larger resource sets.
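
The size-based priority types can be compared with a short Python sketch. It models only FIRSTFIT, BESTFIT, and WORSTFIT, measures a set purely by its node count, and leaves out AFFINITY and MINLOSS; the function pick_set and the set sizes are hypothetical and not taken from Moab.

```python
from typing import Dict, Optional


def pick_set(priority_type: str, set_sizes: Dict[str, int], needed: int) -> Optional[str]:
    """Choose a node set name, or None if no set is large enough."""
    # Eliminate sets that do not have all the required resources.
    feasible = [(name, size) for name, size in set_sizes.items() if size >= needed]
    if not feasible:
        return None
    if priority_type == "FIRSTFIT":
        return feasible[0][0]                               # first set that fits
    if priority_type == "BESTFIT":
        return min(feasible, key=lambda item: item[1])[0]   # smallest set that fits
    if priority_type == "WORSTFIT":
        return max(feasible, key=lambda item: item[1])[0]   # largest set that fits
    raise ValueError(f"priority type not modeled here: {priority_type}")


sizes = {"switchA": 8, "switchB": 4, "switchC": 16}  # available nodes per set
for ptype in ("FIRSTFIT", "BESTFIT", "WORSTFIT"):
    print(ptype, pick_set(ptype, sizes, needed=3))
# FIRSTFIT switchA
# BESTFIT switchB
# WORSTFIT switchC
```

For a three-node job, BESTFIT leaves the 8-node and 16-node sets untouched for larger jobs, while WORSTFIT consumes part of the largest set and keeps the small sets whole.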

7.3.2.B Dynamic example

In this example, a site wants to be able to dynamically specify which VARATTR values the node set will be based on. To accomplish this, they could use the following configuration in the moab.cfg file:

NODESETISOPTIONAL FALSE
NODESETPOLICY     FIRSTOF
NODESETATTRIBUTE  VARATTR

Node Set Attribute

The example's NODESETATTRIBUTE parameter is set to VARATTR, specifying that the node sets are to be constructed from job VARATTR values that are specified dynamically in the msub command.

Node Set Policy

In the preceding example, the NODESETPOLICY parameter is set to the policy FIRSTOF and tells Moab to allocate nodes from the first set that matches specified constraints.

Node Set Constraint Handling

The NODESETISOPTIONAL parameter controls whether Moab may start a job outside the desired node set when that set is not available but adequate idle resources exist elsewhere. In this example it is set to FALSE, which forces the job to always run in a complete node set regardless of any start delay this imposes.

msub example

With the preceding configuration in moab.cfg, Moab is configured for dynamic node sets. You can create node sets dynamically by using the msub -l command. (For more information, see Resource Manager Extensions.) Use the following format:

msub -l nodeset=FIRSTOF:VARATTR:<var>[=<value>],...

For example, if you wanted to create a dynamic node set for the Provo datacenter:

msub -l nodeset=FIRSTOF:VARATTR:datacenter=Provo

This command causes Moab to set datacenter=Provo as the node set.

You can specify more than one VARATTR in the command. For example, if you want to create a dynamic node set for the Provo datacenter and the SaltLake datacenter:

msub -l nodeset=FIRSTOF:VARATTR:datacenter=Provo:datacenter=SaltLake

If you specify only datacenter (without specifying a value, such as =Provo), Moab will look up all possible values (values reported on the node for that VARATTR), and then choose one. So if, for example, you have nodes that have VARATTRs datacenter=Provo, datacenter=SaltLake, and datacenter=StGeorge, then specifying msub -l nodeset=FIRSTOF:VARATTR:datacenter will cause the job to run in Provo or SaltLake or StGeorge.

Note that Moab also adds the VARATTR (whether you specify it or Moab chooses it) to the required attributes (REQATTR) of the job. For example, if you specify datacenter=Provo as the VARATTR, datacenter=Provo will also be added to the job REQATTR. Likewise, if you specify only datacenter, and Moab chooses datacenter=SaltLake, then datacenter=SaltLake will be added to the job REQATTR.

If you do not request a VARATTR in the nodeset of the msub -l command, the job will run as if it did not use node sets at all, and nothing will be added to its REQATTR.

If you manually specify a different REQATTR on a job (for example, datacenter=SaltLake) from the node set VARATTR (for example, datacenter=Provo), the job will never run.
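
The interaction between the node set VARATTR and the job REQATTR can be modeled in a few lines of Python. This is a sketch under stated assumptions, not Moab's algorithm: resolve_nodeset_varattr and its data structures are hypothetical, and "choose one" is modeled as taking the first value reported by any node.

```python
from typing import Dict, List, Optional


def resolve_nodeset_varattr(var: str,
                            requested_values: List[str],
                            node_varattrs: Dict[str, Dict[str, str]],
                            manual_reqattr: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Return the job's resulting REQATTR, or None if the job can never run."""
    # Collect the values that nodes report for this VARATTR, in first-seen order.
    reported: List[str] = []
    for attrs in node_varattrs.values():
        value = attrs.get(var)
        if value is not None and value not in reported:
            reported.append(value)

    # No value given (for example nodeset=FIRSTOF:VARATTR:datacenter):
    # any reported value is a candidate.
    if requested_values:
        candidates = [v for v in requested_values if v in reported]
    else:
        candidates = reported
    if not candidates:
        return None

    chosen = candidates[0]  # assumption: take the first candidate
    # A manually specified REQATTR that disagrees can never be satisfied.
    if manual_reqattr.get(var, chosen) != chosen:
        return None

    reqattr = dict(manual_reqattr)
    reqattr[var] = chosen  # the chosen VARATTR is added to REQATTR
    return reqattr


nodes = {"n1": {"datacenter": "Provo"},
         "n2": {"datacenter": "SaltLake"},
         "n3": {"datacenter": "StGeorge"}}
print(resolve_nodeset_varattr("datacenter", ["Provo"], nodes, {}))
print(resolve_nodeset_varattr("datacenter", ["Provo"], nodes, {"datacenter": "SaltLake"}))
```

The first call yields a REQATTR containing datacenter=Provo; the second returns None, matching the case where a conflicting manual REQATTR keeps the job from ever running.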

7.3.2.C NODESETPLUS

Moab supports additional node set behavior through the NODESETPLUS parameter. Possible values for this parameter are SPANEVENLY and DELAY.

Neither SPANEVENLY nor DELAY will work with multi-req jobs or preemption.

Value       Description
SPANEVENLY  Moab attempts to fit all jobs within one node set, or it spans any number of node sets evenly. When a job specifies a NODESETDELAY, Moab attempts to contain the job within a single node set; if unable to do so, it spans node sets evenly, unless doing so would delay the job beyond the requested NODESETDELAY.
DELAY       Moab attempts to schedule the job within a node set for the configured NODESETDELAY. If Moab cannot find space for the job to start within NODESETDELAY (Moab considers future workload to determine if space will open up in time and might create a future reservation), then Moab schedules the job and ignores the node set requirement.
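
The DELAY behavior amounts to a comparison between the earliest start the job could get inside a node set and the configured NODESETDELAY. The Python sketch below illustrates that comparison only; delay_decision is a hypothetical function, and expressing all times in seconds is an assumption made for the example.

```python
from typing import Optional


def delay_decision(now: float,
                   earliest_start_in_set: Optional[float],
                   nodeset_delay: float) -> str:
    """Simplified model of NODESETPLUS DELAY (all times in seconds)."""
    # earliest_start_in_set is None when no node set is expected to free up.
    if earliest_start_in_set is not None and earliest_start_in_set - now <= nodeset_delay:
        # Space opens up in time: keep the job inside a node set,
        # possibly by way of a future reservation.
        return "schedule within node set"
    # Otherwise the node set requirement is ignored.
    return "schedule ignoring node set"


print(delay_decision(now=0, earliest_start_in_set=600, nodeset_delay=3600))
print(delay_decision(now=0, earliest_start_in_set=7200, nodeset_delay=3600))
```

With a one-hour delay, a set that frees up in ten minutes is worth waiting for, while a set that frees up in two hours is not.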

7.3.2.D Nested Node Sets

Moab attempts to fit jobs on node sets in the order they are specified in the NODESETLIST. You can create nested node sets by listing your node sets in a specific order. Here is an example of a "smallest to largest" nested node set:

NODESETPOLICY ONEOF
NODESETATTRIBUTE FEATURE
NODESETISOPTIONAL FALSE
NODESETLIST blade1a,blade1b,blade2a,blade2b,blade3a,blade3b,blade4a,blade4b,quad1a,quad1b,quad2a,quad2b,octet1,octet2,sixteen

The accompanying cluster would look like this:

Image 7-3: Octet, quad, and blade node sets on a cluster

In this example, Moab tries to fit the job on the nodes in the blade sets first. If that doesn't work, it moves up to the nodes in the quad sets (a set of four blade sets). If the quads are insufficient, it tries the nodes in the octet sets (a set of four quad node sets).

7.3.3 Requesting Node Sets for Job Submission

On a per-job basis, each user can specify the equivalent of all parameters except NODESETDELAY. As mentioned previously, this is accomplished using the resource manager extensions.

7.3.4 Configuring Node Sets for Classes

Classes can be configured with a default node set. In the configuration file, specify DEFAULT.NODESET with the following syntax: DEFAULT.NODESET=<SETTYPE>:<SETATTR>[:<SETLIST>[,<SETLIST>]...]. For example, in a heterogeneous cluster with two different types of processors, the following configuration confines jobs assigned to the amd class to run on either ATHLON or OPTERON processors:

CLASSCFG[amd] DEFAULT.NODESET=ONEOF:FEATURE:ATHLON,OPTERON
...
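
The DEFAULT.NODESET value follows the <SETTYPE>:<SETATTR>[:<SETLIST>] shape given above, which can be taken apart with a short Python sketch. The names NodeSetSpec and parse_nodeset_spec are hypothetical, the accepted set types are the three policies listed earlier in this topic, and the sketch covers only the comma-separated class syntax, not the VARATTR form used with msub.

```python
from typing import List, NamedTuple

POLICIES = ("ANYOF", "FIRSTOF", "ONEOF")


class NodeSetSpec(NamedTuple):
    set_type: str
    set_attr: str
    set_list: List[str]


def parse_nodeset_spec(spec: str) -> NodeSetSpec:
    """Split '<SETTYPE>:<SETATTR>[:<SETLIST>[,<SETLIST>]...]' into its parts."""
    parts = spec.split(":", 2)
    if len(parts) < 2:
        raise ValueError(f"expected <SETTYPE>:<SETATTR>, got {spec!r}")
    set_type, set_attr = parts[0], parts[1]
    if set_type not in POLICIES:
        raise ValueError(f"unknown set type: {set_type}")
    # The set list is optional; an absent list means no explicit sets.
    set_list = parts[2].split(",") if len(parts) == 3 and parts[2] else []
    return NodeSetSpec(set_type, set_attr, set_list)


print(parse_nodeset_spec("ONEOF:FEATURE:ATHLON,OPTERON"))
print(parse_nodeset_spec("ONEOF:FEATURE"))
```

For the amd class example, the parsed result has set type ONEOF, attribute FEATURE, and the two feature names ATHLON and OPTERON as its set list.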

© 2016 Adaptive Computing