Maui Scheduler

3.2 Scheduling Environment

3.2.1 Scheduling Objects

Maui functions by manipulating five primary, elementary objects. These are jobs, nodes, reservations, QOS structures, and policies. In addition to these, multiple minor elementary objects and composite objects are also utilized. These objects are also defined in the scheduling dictionary.

3.2.1.1 Jobs

Job information is provided to the Maui scheduler from a resource manager such as Loadleveler, PBS, Wiki, or LSF. Job attributes include ownership of the job, job state, amount and type of resources required by the job, and a wallclock limit, indicating how long the resources are required. A job consists of one or more requirements each of which requests a number of resources of a given type. For example, a job may consist of two requirements, the first asking for '1 IBM SP node with at least 512 MB of RAM' and the second asking for '24 IBM SP nodes with at least 128 MB of RAM'. Each requirements consists of one or more tasks where a task is defined as the minimal independent unit of resources. By default, each task is equivalent to one processor. In SMP environments, however, users may wish to tie one or more processors together with a certain amount of memory and/or other resources.

3.2.1.1.1 Requirement (or Req)

A job requirement (or req) consists of a request for a single type of resources. Each requirement consists of the following components:

- Task Definition

A specification of the elementary resources which compose an individual task.

- Resource Constraints

A specification of conditions which must be met in order for resource matching to occur. Only resources from nodes which meet all resource constraints may be allocated to the job req.

- Task Count

The number of task instances required by the req.

- Task List

The list of nodes on which the task instances have been located.

- Req Statistics

Statistics tracking resource utilization

3.2.1.2 Nodes

As far as Maui is concerned, a node is a collection of resources with a particular set of associated attributes. In most cases, it fits nicely with the canonical world view of a node such as a PC cluster node or an SP node. In these cases, a node is defined as one or more CPU's, memory, and possibly other compute resources such as local disk, swap, network adapters, software licenses, etc. Additionally, this node will described by various attributes such as an architecture type or operating system. Nodes range in size from small uniprocessor PC's to large SMP systems where a single node may consist of hundreds of CPU's and massive amounts of memory.

Information about nodes is provided to the scheduler chiefly by the resource manager. Attributes include node state, configured and available resources (i.e., processors, memory, swap, etc.), run classes supported, etc.

3.2.1.3 Advance Reservations

An advance reservation is an object which dedicates a block of specific resources for a particular use. Each reservation consists of a list of resources, an access control list, and a time range for which this access control list will be enforced. The reservation prevents the listed resources from being used in a way not described by the access control list during the time range specified. For example, a reservation could reserve 20 processors and 10 GB of memory for users Bob and John from Friday 6:00 AM to Saturday 10:00 PM'. Maui uses advance reservations extensively to manage backfill, guarantee resource availability for active jobs, allow service guarantees, support deadlines, and enable metascheduling. Maui also supports both regularly recurring reservations and the creation of dynamic one time reservations for special needs. Advance reservations are described in detail in the advance reservation overview.

3.2.1.4 Policies

Policies are generally specified via a config file and serve to control how and when jobs start. Policies include job prioritization, fairness policies, fairshare configuration policies, and scheduling policies.

3.2.1.5 Resources

Jobs, nodes, and reservations all deal with the abstract concept of a resource. A resource in the Maui world is one of the following:

processors

Processors are specified with a simple count value.

memory

Real memory or 'RAM' is specified in megabytes (MB).

swap

Virtual memory or 'swap' is specified in megabytes (MB).

disk

Local disk is specified in megabytes (MB).

In addition to these elementary resource types, there are two higher level resource concepts used within Maui. These are the task and the processor equivalent, or PE.

3.2.1.6 Task

A task is a collection of elementary resources which must be allocated together within a single node. For example, a task may consist of one processor, 512MB or memory, and 2 GB of local disk. A key aspect of a task is that the resources associated with the task must be allocated as an atomic unit, without spanning node boundaries. A task requesting 2 processors cannot be satisfied by allocating 2 uniprocessor nodes, nor can a task requesting 1 processor and 1 GB of memory be satisfied by allocating 1 processor on one node and memory on another.

In Maui, when jobs or reservations request resources, they do so in terms of tasks typically using a task count and a task definition. By default, a task maps directly to a single processor within a job and maps to a full node within reservations. In all cases, this default definition can be overridden by specifying a new task definition.

Within both jobs and reservations, depending on task definition, it is possible to have multiple tasks from the same job mapped to the same node. For example, a job requesting 4 tasks using the default task definition of 1 processor per task, can be satisfied by two dual processor nodes.

3.2.1.7 PE

The concept of the processor equivalent, or PE, arose out of the need to translate multi-resource consumption requests into a scalar value. It is not an elementary resource, but rather, a derived resource metric. It is a measure of the actual impact of a set of requested resources by a job on the total resources available system wide. It is calculated as:

PE = MAX(ProcsRequestedByJob / TotalConfiguredProcs,
MemoryRequestedByJob / TotalConfiguredMemory,
DiskRequestedByJob / TotalConfiguredDisk,
SwapRequestedByJob / TotalConfiguredSwap) * TotalConfiguredProcs

For example, say a job requested 20% of the total processors and 50% of the total memory of a 128 processor MPP system. Only two such jobs could be supported by this system. The job is essentially using 50% of all available resources since the system can only be scheduled to its most constrained resource, in this case memory. The processor equivalents for this job should be 50% of the processors, or PE = 64.

Let's make the calculation concrete with one further example. Assume a homogeneous 100 node system with 4 processors and 1 GB of memory per node. A job is submitted requesting 2 processors and 768 MB of memory. The PE for this job would be calculated as:

PE = MAX(2/(100*4), 768/(100*1024)) * (100*4) = 3.

This result makes sense since the job would be consuming 3/4 of the memory on a 4 processor node.

The calculation works equally well on homogeneous or heterogeneous systems, uniprocessor or large way SMP systems.

3.2.1.8 Class (or Queue)

A class (or queue) is a logical container object which can be used to implicitly or explicitly apply policies to jobs. In most cases, a class is defined and configured within the resource manager and associated with one or more of the following attributes or constraints:
Attribute Description
Default Job Attributes A queue may be associated with a default job duration, default size, or default resource requirements
Host Constraints A queue may constrain job execution to a particular set of hosts
Job Constraints A queue may constrain the attributes of jobs which may submitted including setting limits such as max wallclock time, minimum number of processors, etc.
Access List A queue may constrain who may submit jobs into it based on user lists, group lists, etc.
Special Access A queue may associate special privileges with jobs including adjusted job priority.

As stated previously, most resource managers allow full class configuration within the resource manager. Where additional class configuration is required, the CLASSCFG parameter may be used.

Maui tracks class usage as a consumable resource allowing sites to limit the number of jobs using a particular class. This is done by monitoring class initiators which may be considered to be a ticket to run in a particular class. Any compute node may simultaneously support several types of classes and any number of initiators of each type. By default, nodes will have a one-to-one mapping between class initiators and configured processors. For every job task run on the node, one class initiator of the appropriate type is consumed. For example, a 3 processor job submitted to the class batch will consume three batch class initiators on the nodes where it is run.

Using queues as consumable resources allows sites to specify various policies by adjusting the class initiator to node mapping. For example, a site running serial jobs may want to allow a particular 8 processor node to run any combination of batch and special jobs subject to the following constraints:

- only 8 jobs of any type allowed simultaneously
- no more than 4 special jobs allowed simultaneously

To enable this policy, the site may set the node's MAXJOB policy to 8 and configure the node with 4 special class initiators and 8 batch class initiators.

Note that in virtually all cases jobs have a one-to-one correspondence between processors requested and class initiators required. However, this is not a requirement and, with special configuration sites may choose to associate job tasks with arbitrary combinations of class initiator requirements.

In displaying class initiator status, Maui signifies the type and number of class initiators available using the format [<CLASSNAME>:<CLASSCOUNT>]. This is most commonly seen in the output of node status commands indicating the number of configured and available class initiators, or in job status commands when displaying class initiator requirements.

Arbitrary Resource

Node can also be configured to support various 'arbitrary resources'. Information about such resources can be specified using the NODECFG parameter. For example, a node may be configured to have '256 MB RAM, 4 processors, 1 GB Swap, and 2 tape drives'.