While backfill improves the scheduler's performance, this is only half the battle. The efficiency of a cluster, in terms of actual work accomplished, is a function of both scheduling performance and individual job efficiency. In many clusters, job efficiency can vary from node to node as well as with the node mix allocated. Most parallel jobs written with popular message-passing libraries such as MPI or PVM do not internally load balance their workload and thus run only as fast as the slowest node allocated. Consequently, these jobs run most effectively on homogeneous sets of nodes. However, while many clusters start out as homogeneous, they quickly evolve as new generations of compute nodes are integrated into the system. Research has shown that this integration, while improving scheduling performance by giving the scheduler more nodes to select from, can actually decrease average job efficiency.
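To make the slowest-node effect concrete, here is a minimal Python sketch. It assumes a simple model that is not taken from Moab: each node has a relative speed, and a job without internal load balancing gives every node an equal share of work, so it finishes only when the slowest node does. The function name `job_efficiency` and the speed values are hypothetical.

```python
def job_efficiency(node_speeds):
    """Fraction of allocated compute capacity a non-load-balanced job uses.

    Assumed model (not Moab's): every node receives the same share of work,
    so all nodes effectively run at the speed of the slowest one.
    """
    if not node_speeds:
        raise ValueError("at least one node is required")
    usable = min(node_speeds) * len(node_speeds)
    return usable / sum(node_speeds)


# Hypothetical relative processor speeds.
homogeneous = job_efficiency([1.0, 1.0, 1.0, 1.0])
mixed = job_efficiency([1.0, 1.0, 2.0, 2.0])

print(homogeneous)      # 1.0
print(round(mixed, 3))  # 0.667
```

Under this model, mixing two faster nodes into the allocation leaves a third of the allocated capacity idle, which is the loss that homogeneous node sets are meant to avoid.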
A feature called node sets allows jobs to request sets of common resources without specifying exactly what resources are required. Node set policy can be specified globally or on a per-job basis and can be based on node processor speed, memory, network interfaces, or locally defined node attributes. In addition to forcing jobs onto homogeneous nodes, these policies may also be used to guide jobs to one or more types of nodes on which a particular job performs best, similar to the job preferences available in other systems. For example, an I/O-intensive job may run best on a certain range of processor speeds, running slower on slower nodes while wasting cycles on faster nodes.
A job may specify ANYOF:FEATURE:bigmem,fastos to request nodes with the bigmem or fastos feature. Alternatively, if a simple feature-homogeneous node set is desired, ONEOF:FEATURE may be specified. A job may also request a feature-based node set with the configuration ONEOF:FEATURE:bigmem,fastos, in which case Moab first attempts to locate adequate nodes where all nodes contain the bigmem feature. If such a set cannot be found, Moab looks for sets of nodes containing the other specified features. In highly heterogeneous clusters, the use of node sets improves job throughput by 10 to 15%.
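The difference between the ANYOF and ONEOF requests described above can be sketched in Python. This is an illustrative model only, not Moab's implementation: the helper names `candidate_sets` and `pick_nodes`, the node names, and the simplification that every node is idle are all assumptions.

```python
def candidate_sets(nodes, policy, features):
    """Return candidate node sets in the order they would be tried.

    nodes:    dict mapping node name -> set of feature strings
    policy:   "ANYOF" or "ONEOF"
    features: requested feature names, in order
    """
    matching = {
        f: sorted(n for n, feats in nodes.items() if f in feats)
        for f in features
    }
    if policy == "ANYOF":
        # One pool: any node carrying at least one requested feature.
        pool = sorted({n for members in matching.values() for n in members})
        return [pool]
    if policy == "ONEOF":
        # One set per feature; the job must fit entirely inside a single set.
        return [matching[f] for f in features]
    raise ValueError(f"unsupported policy: {policy}")


def pick_nodes(nodes, policy, features, count):
    """Return `count` nodes from the first set large enough, else None."""
    for members in candidate_sets(nodes, policy, features):
        if len(members) >= count:
            return members[:count]
    return None


# Hypothetical cluster: two bigmem nodes and three fastos nodes.
cluster = {
    "n1": {"bigmem"},
    "n2": {"bigmem"},
    "n3": {"fastos"},
    "n4": {"fastos"},
    "n5": {"fastos"},
}

print(pick_nodes(cluster, "ANYOF", ["bigmem", "fastos"], 4))  # ['n1', 'n2', 'n3', 'n4']
print(pick_nodes(cluster, "ONEOF", ["bigmem", "fastos"], 3))  # ['n3', 'n4', 'n5']
```

In the ONEOF call, the bigmem set is tried first but holds only two nodes, so the sketch falls through to the fastos set, mirroring the fallback order described in the text.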
Node sets can be requested on a system-wide or per-job basis. System-wide configuration is accomplished via the NODESET* parameters, while per-job specification occurs via the resource manager extensions. In all cases, node sets are a dynamic construct, created on a per-job basis and built only of nodes that meet all of a job's requirements.
The GLOBAL node is included in all feature node sets.
Global node sets are defined using the NODESETPOLICY, NODESETATTRIBUTE, NODESETLIST, and NODESETISOPTIONAL parameters.
The use of these parameters is best illustrated with an example. In this example, a large site possesses a Myrinet-based interconnect and wishes to allocate nodes within Myrinet switch boundaries whenever possible. To accomplish this, the site could assign a node attribute to each node indicating which switch it is associated with (switchA, switchB, and so forth) and then use the following system-wide node set configuration:
NODESETPOLICY ONEOF
NODESETATTRIBUTE FEATURE
NODESETISOPTIONAL TRUE
NODESETLIST switchA,switchB,switchC,switchD
...
In the preceding example, the NODESETPOLICY parameter is set to the policy ONEOF and tells Moab to allocate nodes within a single attribute set. Other nodeset policies are listed in the following table:
The example's NODESETATTRIBUTE parameter is set to FEATURE specifying that the node sets are to be constructed along node feature boundaries.
The next parameter, NODESETISOPTIONAL, indicates that Moab should not delay the start time of a job if the desired node set is not available but adequate idle resources exist outside of the set. Setting this parameter to TRUE tells Moab to use a node set if one is available and otherwise to run the job as soon as possible anyway.
Setting NODESETISOPTIONAL to FALSE will force the job to always run in a complete nodeset regardless of any start delay this imposes.
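The effect of NODESETISOPTIONAL on job start can be sketched as a small decision function. This is a simplified model under stated assumptions, not Moab's scheduling logic: resources are reduced to idle node counts, and the function name `start_decision` and its return strings are hypothetical.

```python
def start_decision(idle_in_set, idle_total, needed, nodeset_is_optional):
    """Decide how a job starts under the assumed model.

    idle_in_set: idle nodes inside the best candidate node set
    idle_total:  idle nodes in the whole cluster (including that set)
    needed:      number of nodes the job requires
    """
    if idle_in_set >= needed:
        # The node set can hold the whole job, so it is used either way.
        return "start in node set"
    if nodeset_is_optional and idle_total >= needed:
        # TRUE: do not delay the job just to honor the node set.
        return "start outside node set"
    # FALSE, or not enough idle nodes anywhere: the job is delayed.
    return "wait"


print(start_decision(4, 10, 4, nodeset_is_optional=False))  # start in node set
print(start_decision(2, 10, 4, nodeset_is_optional=True))   # start outside node set
print(start_decision(2, 10, 4, nodeset_is_optional=False))  # wait
```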
Finally, the NODESETLIST value of switchA,switchB,... tells Moab to use only node sets based on the listed feature values. This is necessary because sites often use node features for many purposes, and node sets generated from irrelevant features, such as those indicating processor speed or node architecture, would be of little use for switch proximity.
To add nodes to the NODESETLIST, you must configure features on your nodes using the NODECFG FEATURES attribute.
NODECFG[node01] FEATURES+=switchA
NODECFG[node02] FEATURES+=switchA
NODECFG[node03] FEATURES+=switchB
Nodes node01 and node02 contain the switchA feature, and node node03 contains the switchB feature.
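As a rough illustration of how feature values and NODESETLIST combine, the following Python sketch groups nodes into sets by feature and ignores any feature not named in the list. It is a model for explanation only: the function `build_node_sets` and the extra `fastcpu` feature are hypothetical and are not part of Moab.

```python
def build_node_sets(node_features, nodeset_list):
    """Group nodes by feature, keeping only features named in nodeset_list.

    node_features: dict mapping node name -> list of feature strings
    nodeset_list:  feature names that are allowed to define node sets
    """
    sets = {name: [] for name in nodeset_list}
    for node, feats in node_features.items():
        for feat in feats:
            if feat in sets:
                sets[feat].append(node)
    return sets


# Mirrors the NODECFG lines above; "fastcpu" is a made-up unrelated feature.
node_features = {
    "node01": ["switchA", "fastcpu"],
    "node02": ["switchA"],
    "node03": ["switchB"],
}

node_sets = build_node_sets(node_features, ["switchA", "switchB"])
print(node_sets)  # {'switchA': ['node01', 'node02'], 'switchB': ['node03']}
```

Because `fastcpu` is not in the list, it never produces a node set, which is the filtering role the text attributes to NODESETLIST.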
When resources are available in more than one resource set, the NODESETPRIORITYTYPE parameter allows control over how the best resource set is selected. Legal values for this parameter are described in the following table:
Priority Type | Description | Details |
---|---|---|
AFFINITY | Avoid a resource set with negative affinity. | Choosing this type causes Moab to select a node set with no negative affinity nodes (nodes that have a reservation with negative affinity). If all node sets have negative affinity, Moab selects the first matching node set. |
BESTFIT | Select the smallest resource set possible. | Choosing this type causes Moab, when selecting a node set, to eliminate sets that do not have all the required resources. From the remaining sets, Moab chooses the set with the least amount of resources. This priority type most closely matches the job requirements in order to waste the least amount of resources. This type minimizes fragmentation of larger resource sets. |
MINLOSS | Select the resource set that results in the minimal wasted resources assuming no internal job load balancing is available. (Assumes parallel jobs only run as fast as the slowest allocated node.) | Choosing this type works only when using the following configuration: NODESETATTRIBUTE FEATURE. In a SHAREDMEM environment (see the Moab-NUMA Integration Guide for more information), Moab selects the node set based on NUMA properties (the smallest feasible node set). |
WORSTFIT | Select the largest resource set possible. | This type causes Moab, when choosing a node set, to eliminate sets that do not have all the required resources. From the remaining sets, Moab chooses the set with the greatest amount of resources. This type minimizes fragmentation of smaller resource sets, but increases fragmentation of larger resource sets. |
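The BESTFIT and WORSTFIT rules in the table can be sketched in Python as follows. This is a simplified illustration, not Moab's code: it assumes the "amount of resources" in a set can be represented by a single count of available nodes, and the function name `select_node_set` and the set sizes are hypothetical.

```python
def select_node_set(available, needed, priority_type):
    """Pick a node set name under the assumed BESTFIT/WORSTFIT model.

    available:     dict mapping set name -> number of available nodes
    needed:        number of nodes the job requires
    priority_type: "BESTFIT" or "WORSTFIT"
    """
    # Eliminate sets that cannot hold the whole job.
    feasible = {name: size for name, size in available.items() if size >= needed}
    if not feasible:
        return None
    if priority_type == "BESTFIT":
        # Smallest feasible set: least waste, larger sets stay unfragmented.
        return min(feasible, key=feasible.get)
    if priority_type == "WORSTFIT":
        # Largest feasible set: smaller sets stay unfragmented.
        return max(feasible, key=feasible.get)
    raise ValueError(f"unsupported priority type: {priority_type}")


# Hypothetical available node counts per switch-based node set.
available = {"switchA": 4, "switchB": 8, "switchC": 16}

print(select_node_set(available, 6, "BESTFIT"))   # switchB
print(select_node_set(available, 6, "WORSTFIT"))  # switchC
```

AFFINITY and MINLOSS are left out of the sketch because they depend on reservation affinity and per-node speed information that this simple model does not represent.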
Moab supports additional node set behavior by specifying the NODESETPLUS parameter. Possible values when specifying this parameter are SPANEVENLY and DELAY.
Neither SPANEVENLY nor DELAY will work with multi-req jobs or preemption.
SPANEVENLY: Moab attempts to fit all jobs within one node set, or it spans any number of node sets evenly. When a job specifies a NODESETDELAY, Moab attempts to contain the job within a single node set; if unable to do so, it spans node sets evenly, unless doing so would delay the job beyond the requested NODESETDELAY.
DELAY: Moab attempts to fit all jobs within the best possible SMP machine (when scheduling nodeboards in an Altix environment) unless doing so delays the job beyond the requested NODESETDELAY.
Moab attempts to fit jobs on node sets in the order they are specified in the NODESETLIST. You can create nested node sets by listing your node sets in a specific order. Here is an example of a "smallest to largest" nested node set:
NODESETPOLICY ONEOF
NODESETATTRIBUTE FEATURE
NODESETISOPTIONAL FALSE
NODESETLIST blade1a,blade1b,blade2a,blade2b,blade3a,blade3b,blade4a,blade4b,quad1a,quad1b,quad2a,quad2b,octet1,octet2,sixteen
The accompanying cluster would look like this:
Image 8-2: Octet, quad, and blade node sets on a cluster
In this example, Moab tries to fit the job on the nodes in the blade sets first. If that doesn't work, it moves up to the nodes in the quad sets (a set of four blade sets). If the quads are insufficient, it tries the nodes in the octet sets (a set of four quad node sets).
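The smallest-to-largest search order can be illustrated with a short Python sketch. It is a model under stated assumptions, not Moab's algorithm: each node set is reduced to a count of free nodes, the list is a shortened hypothetical version of the NODESETLIST above, and the counts are invented for the example.

```python
def first_fitting_set(nodeset_list, free_nodes, needed):
    """Return the first set, in list order, with enough free nodes, else None.

    nodeset_list: node set names in the order they should be tried
    free_nodes:   dict mapping set name -> number of free nodes (assumed)
    needed:       number of nodes the job requires
    """
    for name in nodeset_list:
        if free_nodes.get(name, 0) >= needed:
            return name
    return None


# Shortened, hypothetical list ordered from smallest to largest sets.
nodeset_list = ["blade1a", "blade1b", "quad1a", "octet1"]
free_nodes = {"blade1a": 2, "blade1b": 2, "quad1a": 4, "octet1": 8}

print(first_fitting_set(nodeset_list, free_nodes, 2))  # blade1a
print(first_fitting_set(nodeset_list, free_nodes, 3))  # quad1a
print(first_fitting_set(nodeset_list, free_nodes, 6))  # octet1
```

Because the list is ordered from smallest to largest, a job lands in the smallest set that can hold it, and larger sets are used only when the smaller ones are insufficient.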
On a per job basis, each user can specify the equivalent of all parameters except NODESETDELAY. As mentioned previously, this is accomplished using the resource manager extensions.
Classes can be configured with a default node set. In the configuration file, specify DEFAULT.NODESET with the following syntax: DEFAULT.NODESET=<SETTYPE>:<SETATTR>[:<SETLIST>[,<SETLIST>]...]. For example, in a heterogeneous cluster with two different types of processors, the following configuration confines jobs assigned to the amd class to run on either ATHLON or OPTERON processors:
CLASSCFG[amd] DEFAULT.NODESET=ONEOF:FEATURE:ATHLON,OPTERON ...
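The `<SETTYPE>:<SETATTR>[:<SETLIST>[,<SETLIST>]...]` syntax shown above can be split into its parts with a few lines of Python. This is a hypothetical helper for illustration, for example to sanity-check a value before placing it in a configuration file. It is not part of Moab, and it deliberately does not validate which set types or attributes Moab accepts.

```python
from typing import NamedTuple, Tuple


class NodeSetSpec(NamedTuple):
    set_type: str              # for example ONEOF
    set_attr: str              # for example FEATURE
    set_list: Tuple[str, ...]  # optional list entries; empty if omitted


def parse_nodeset(spec):
    """Split a node set string into type, attribute, and list entries."""
    parts = spec.split(":", 2)
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError(f"expected <SETTYPE>:<SETATTR>[:<SETLIST>], got {spec!r}")
    if len(parts) == 3 and parts[2]:
        set_list = tuple(parts[2].split(","))
    else:
        set_list = ()
    return NodeSetSpec(parts[0], parts[1], set_list)


print(parse_nodeset("ONEOF:FEATURE:ATHLON,OPTERON"))
# NodeSetSpec(set_type='ONEOF', set_attr='FEATURE', set_list=('ATHLON', 'OPTERON'))
print(parse_nodeset("ONEOF:FEATURE").set_list)  # ()
```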