While backfill improves the scheduler's performance, this is only half the battle. The efficiency of a cluster, in terms of actual work accomplished, is a function of both scheduling performance and individual job efficiency. In many clusters, job efficiency can vary from node to node as well as with the node mix allocated. Most parallel jobs written in popular languages such as MPI or PVM do not internally load balance their workload and thus run only as fast as the slowest node allocated. Consequently, these jobs run most effectively on homogeneous sets of nodes. However, while many clusters start out as homogeneous, they quickly evolve as new generations of compute nodes are integrated into the system. Research has shown that this integration, while improving scheduling performance due to increased scheduler selection, can actually decrease average job efficiency.
A feature called node sets allows jobs to request sets of common resources without specifying exactly what resources are required. Node set policy can be specified globally or on a per-job basis and can be based on node processor speed, memory, network interfaces, or locally defined node attributes. In addition to their use in forcing jobs onto homogeneous nodes, these policies may also be used to guide jobs to one or more types of nodes on which a particular job performs best, similar to job preferences available in other systems. For example, an I/O intensive job may run best on a certain range of processor speeds, running slower on slower nodes, while wasting cycles on faster nodes. A job may specify ANYOF:PROCSPEED:450,500,650 to request nodes in the range of 450 to 650 MHz. Alternatively, if a simple procspeed-homogeneous node set is desired, ONEOF:PROCSPEED may be specified. On the other hand, a communication sensitive job may request a network based node set with the configuration ONEOF:NETWORK:via,myrinet,ethernet, in which case Moab will first attempt to locate adequate nodes where all nodes contain via network interfaces. If such a set cannot be found, Moab will look for sets of nodes containing the other specified network interfaces. In highly heterogeneous clusters, the use of node sets improves job throughput by 10 to 15%.
Node sets can be requested on a system wide or per job basis. System wide configuration is accomplished via the NODESET* parameters while per job specification occurs via the resource manager extensions. In all cases, node sets are a dynamic construct, created on a per job basis and built only of nodes that meet all of a job's requirements.
|The GLOBAL node is included in all feature node sets.|
The use of these parameters may be best highlighted with an example. In this example, a large site possesses a Myrinet based interconnect and wishes to, whenever possible, allocate nodes within Myrinet switch boundaries. To accomplish this, they could assign node attributes to each node indicating which switch it was associated with (switchA, switchB, and so forth) and then use the following system wide node set configuration:
NODESETPOLICY ONEOF NODESETATTRIBUTE FEATURE NODESETISOPTIONAL TRUE NODESETLIST switchA,switchB,switchC,switchD ...
In the preceding example, the NODESETPOLICY parameter is set to the policy ONEOF and tells Moab to allocate nodes within a single attribute set. Other nodeset policies are listed in the following table:
|ANYOF||Select resources from all sets contained in node set list.|
|ONEOF||Select all sets that contain adequate resources to support job.|
The example's NODESETATTRIBUTE parameter is set to FEATURE specifying that the node sets are to be constructed along node feature boundaries.
|ARCH||Base node set boundaries on node's architecture attribute.|
|CLASS||Base node set boundaries on node's classes.|
|FEATURE||Base node set boundaries on node's node feature attribute.|
|MEMORY||Base node set boundaries on node's configured real memory.|
|PROCSPEED||Base node set boundaries on node's reported processor speed.|
The next parameter, NODESETISOPTIONAL, indicates that Moab should not delay the start time of a job if the desired node set is not available but adequate idle resources exist outside of the set. Setting this parameter to TRUE basically tells Moab to attempt to use a node set if it is available, but if not, run the job as soon as possible anyway.
|Setting NODESETISOPTIONAL to FALSE will force the job to always run in a complete nodeset regardless of any start delay this imposes.|
Finally, the NODESETLIST value of switchA switchB... tells Moab to only use node sets based on the listed feature values. This is necessary since sites will often use node features for many purposes and the resulting node sets would be of little use for switch proximity if they were generated based on irrelevant node features indicating things such as processor speed or node architecture.
On occasion, site administrators may want to allow a less strict interpretation of nodes sets. In particular, many sites seek to enforce a more liberal PROCSPEED based node set policy, where almost balanced node allocations are allowed but wildly varying node allocations are not. In such cases, the parameter NODESETTOLERANCE may be used. This parameter allows specification of the percentage difference between the fastest and slowest node that can be within a nodeset using the following calculation:
(Speed.Max - Speed.Min) / Speed.Min <= NODESETTOLERANCE
Thus setting NODESETTOLERANCE to 0.5 would allow the fastest node in a particular node set to be up to 50% faster than the slowest node in that set. With a 0.5 setting, a job may allocate a mix of 500 and 750 MHz nodes but not a mix of 500 and 900 MHz nodes. Currently, tolerances are only supported when the NODESETATTRIBUTE parameter is set to PROCSPEED. The MAXBALANCE node allocation algorithm is often used in conjunction with tolerance based node sets.
When resources are available in more than one resource set, the NODESETPRIORITYTYPE parameter allows control over how the best resource set is selected. Legal values for this parameter are described in the following table:
|AFFINITY||Select the resource set that does not have a negative affinity reservation.||Jobs will avoid resource sets that have a negative affinity.|
|BESTFIT||Select the smallest resource set possible.||Minimizes fragmentation of larger resource sets.|
|BESTRESOURCE||Select the resource set with the best nodes.||Only supported when NODESETATTRIBUTE is set to PROCSPEED. Selects the fastest possible nodes for the job.|
|MINLOSS||Select the resource set that results in the minimal wasted resources assuming no internal job load balancing is available. (Assumes parallel jobs only run as fast as the slowest allocated node.)||
MINLOSS works only when using one of the following configurations:
|WORSTFIT||Select the largest resource set possible.||Minimizes the creation of small resource set fragments but fragments larger resource sets.|
Moab supports additional NodeSet behavior by specifying the NODESETPLUS parameter. Possible values when specifying this parameter are SPANEVENLY and DELAY.
Moab attempts to fit all jobs within one node set, or it spans any number of node sets evenly. When a job specifies a NODESETDELAY, Moab attempts to contain the job within a single node set; if unable to do so, it spans node sets evenly, unless doing so would delay the job beyond the requested NODESETDELAY.
Moab attempts to fit all jobs within the best possible SMP machine (when scheduling nodeboards in an Altix environment) unless doing so delays the job beyond the requested NODESETDELAY.
Moab attempts to fit jobs on node sets in the order they are specified in the NODESETLIST. You can create nested node sets by listing your node sets in a specific order. Here is an example of a "smallest to largest" nested node set:
NODESETPOLICY ONEOF NODESETATTRIBUTE FEATURE NODESETISOPTIONAL FALSE NODESETLIST blade1a,blade1b,blade2a,blade2b,blade3a,blade3b,blade4a,blade4b,quad1a,quad1b,quad2a,quad2b,octet1,octet2,sixteen
The accompanying cluster would look like this:
|Click to enlarge|
In this example, Moab tries to fit the job on the nodes in the blade sets first. If that doesn't work, it moves up to the nodes in the quad sets (a set of four blade sets). If the quads are insufficient, it tries the nodes in the octet sets (a set of four quad node sets).
On a per job basis, each user can specify the equivalent of all parameters except NODESETDELAY. As mentioned previously, this is accomplished using the resource manager extensions.
Classes can be configured with a default node set. In the configuration file, specify DEFAULT.NODESET with the following syntax: DEFAULT.NODESET=<SETTYPE>:<SETATTR>[:<SETLIST>[,<SETLIST>]...]. For example, in a heterogeneous cluster with two different types of processors, the following configuration confines jobs assigned to the amd class to run on either ATHLON or OPTERON processors:
CLASSCFG[amd] DEFAULT.NODESET=ONEOF:FEATURE:ATHLON,OPTERON ...