Moab allows jobs to be prioritized based on a range of job related factors. These factors are broken down into a two-tier hierarchy of priority factors and subfactors, each of which can be independently assigned a weight. This approach provides the administrator with detailed yet straightforward control of the job selection process.
Each factor and subfactor can be configured with independent priority weight and priority cap values (described later). In addition, per credential and per QoS priority weight adjustments may be specified for a subset of the priority factors. For example, QoS credentials can adjust the queuetime subfactor weight and group credentials can adjust fairshare subfactor weight.
The following table highlights the factors and subfactors that make up a job's total priority.
Factor | SubFactor | Metric |
---|---|---|
CRED (job credentials) | USER | user-specific priority (See USERCFG) |
CRED | GROUP | group-specific priority (See GROUPCFG) |
CRED | ACCOUNT | account-specific priority (See ACCOUNTCFG) |
CRED | QOS | QoS-specific priority (See QOSCFG) |
CRED | CLASS | class/queue-specific priority (See CLASSCFG) |
FS (fairshare usage) | FSUSER | user-based historical usage (See Fairshare Overview) |
FS | FSGROUP | group-based historical usage (See Fairshare Overview) |
FS | FSACCOUNT | account-based historical usage (See Fairshare Overview) |
FS | FSQOS | QoS-based historical usage (See Fairshare Overview) |
FS | FSCLASS | class/queue-based historical usage (See Fairshare Overview) |
FS | FSGUSER | imported global user-based historical usage (See ID Manager and Fairshare Overview) |
FS | FSGGROUP | imported global group-based historical usage (See ID Manager and Fairshare Overview) |
FS | FSGACCOUNT | imported global account-based historical usage (See ID Manager and Fairshare Overview) |
FS | FSJPU | current active jobs associated with job user |
FS | FSPPU | current number of processors allocated to active jobs associated with job user |
FS | FSPSPU | current number of processor-seconds allocated to active jobs associated with job user |
FS | WCACCURACY | user's current historical job wallclock accuracy, calculated as total processor-seconds dedicated / total processor-seconds requested. Factor values are in the range of 0.0 to 1.0. |
RES (requested job resources) | NODE | number of nodes requested |
RES | PROC | number of processors requested |
RES | MEM | total real memory requested (in MB) |
RES | SWAP | total virtual memory requested (in MB) |
RES | DISK | total local disk requested (in MB) |
RES | PS | total processor-seconds requested |
RES | PE | total processor-equivalent requested |
RES | WALLTIME | total walltime requested (in seconds) |
SERV (current service levels) | QUEUETIME | time job has been queued (in minutes) |
SERV | XFACTOR | minimum job expansion factor |
SERV | BYPASS | number of times job has been bypassed by backfill |
SERV | STARTCOUNT | number of times job has been restarted |
SERV | DEADLINE | proximity to job deadline |
SERV | SPVIOLATION | Boolean indicating whether the active job violates a soft usage limit |
SERV | USERPRIO | user-specified job priority |
TARGET (target service levels) | TARGETQUEUETIME | time until queuetime target is reached (exponential) |
TARGET | TARGETXFACTOR | distance to target expansion factor (exponential) |
USAGE (consumed resources -- active jobs only) | CONSUMED | processor-seconds dedicated to date |
USAGE | REMAINING | processor-seconds outstanding |
USAGE | PERCENT | percent of required walltime consumed |
USAGE | EXECUTIONTIME | seconds since job started |
ATTR (job attribute-based prioritization) | ATTRATTR | Attribute priority if specified job attribute is set (attributes may be user-defined or one of preemptor or preemptee). Default is 0. |
ATTR | ATTRSTATE | Attribute priority if job is in specified state (see Job States). Default is 0. |
ATTR | ATTRGRES | Attribute priority if a generic resource is requested. Default is 0. |
*CAP parameters (FSCAP, for example) are available to limit the maximum absolute value of each priority component and subcomponent. If set to a positive value, a priority cap will bound priority component values in both the positive and negative directions.
All *CAP and *WEIGHT parameters are specified as positive or negative integers. Non-integer values are not supported.
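The capping behavior described above can be sketched in a few lines of Python. This is only an illustration, not Moab's implementation: the helper name `apply_cap` is hypothetical, and treating a non-positive cap as "no cap configured" is an assumption that this section does not state.

```python
def apply_cap(value: int, cap: int) -> int:
    """Clamp value to the range [-cap, +cap] when cap is positive."""
    if cap <= 0:
        # Assumption: a non-positive cap means no cap is configured.
        return value
    return max(-cap, min(cap, value))


print(apply_cap(15000, 10000))   # 10000
print(apply_cap(-15000, 10000))  # -10000
print(apply_cap(500, 10000))     # 500
```

A positive cap bounds the component symmetrically, so a large negative component is limited in the same way as a large positive one.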
The credential component allows a site to prioritize jobs based on political issues such as the relative importance of certain groups or accounts. This allows direct political priorities to be applied to jobs.
The priority calculation for the credential component is as follows:
Priority += CREDWEIGHT * (
USERWEIGHT * Job.User.Priority +
GROUPWEIGHT * Job.Group.Priority +
ACCOUNTWEIGHT * Job.Account.Priority +
QOSWEIGHT * Job.Qos.Priority +
CLASSWEIGHT * Job.Class.Priority)
All user, group, account, QoS, and class priorities are specified by setting the PRIORITY attribute of the respective *CFG parameter (namely, USERCFG, GROUPCFG, ACCOUNTCFG, QOSCFG, and CLASSCFG).
For example, to set user and group priorities, you might use the following:
CREDWEIGHT 1
USERWEIGHT 1
GROUPWEIGHT 1
USERCFG[john] PRIORITY=2000
USERCFG[paul] PRIORITY=-1000
GROUPCFG[staff] PRIORITY=10000
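To see what this configuration yields, the credential formula can be evaluated by hand or with a short Python sketch. The function name `cred_priority` is hypothetical. The sketch assumes that credentials without a configured PRIORITY contribute 0, and that the account, QoS, and class terms are 0 because the example does not set them.

```python
# Weights and priorities copied from the moab.cfg example above.
CREDWEIGHT = 1
USERWEIGHT = 1
GROUPWEIGHT = 1
USER_PRIORITY = {"john": 2000, "paul": -1000}
GROUP_PRIORITY = {"staff": 10000}


def cred_priority(user: str, group: str) -> int:
    """Credential contribution; unconfigured credentials are assumed to be 0."""
    return CREDWEIGHT * (
        USERWEIGHT * USER_PRIORITY.get(user, 0)
        + GROUPWEIGHT * GROUP_PRIORITY.get(group, 0)
    )


print(cred_priority("john", "staff"))  # 12000
print(cred_priority("paul", "staff"))  # 9000
print(cred_priority("paul", "other"))  # -1000
```

Under these assumptions a job from john in group staff gains 12000 from the credential component, while a job from paul outside staff loses 1000.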
Class (or queue) priority may also be specified via the resource manager where supported (as in PBS queue priorities). However, if Moab class priority values are also specified, the resource manager priority values will be overwritten.
All priorities may be positive or negative.
Fairshare components allow a site to favor jobs based on short-term historical usage. The Fairshare Overview describes the configuration and use of fairshare in detail.
The fairshare factor is used to adjust a job's priority based on current and historical percentage system utilization of the job's user, group, account, class, or QoS. This allows sites to steer workload toward a particular usage mix across user, group, account, class, and QoS dimensions.
The fairshare priority factor calculation is as follows:
Priority += FSWEIGHT * MIN(FSCAP, (
FSUSERWEIGHT * DeltaUserFSUsage +
FSGROUPWEIGHT * DeltaGroupFSUsage +
FSACCOUNTWEIGHT * DeltaAccountFSUsage +
FSQOSWEIGHT * DeltaQOSFSUsage +
FSCLASSWEIGHT * DeltaClassFSUsage +
FSJPUWEIGHT * ActiveUserJobs +
FSPPUWEIGHT * ActiveUserProcs +
FSPSPUWEIGHT * ActiveUserPS +
WCACCURACYWEIGHT * UserWCAccuracy ))
All *WEIGHT parameters just listed are specified on a per partition basis in the moab.cfg file. The Delta*Usage components represent the difference in actual fairshare usage from the corresponding fairshare usage target. Actual fairshare usage is determined based on historical usage over the time frame specified in the fairshare configuration. The target usage can be a target, floor, or ceiling value as specified in the fairshare configuration file. See the Fairshare Overview for further information on configuring and tuning fairshare. Additional insight may be available in the fairshare usage example. The ActiveUser* components represent current usage by the job's user credential.
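The fairshare formula is a capped weighted sum, which the following Python sketch evaluates exactly as written (using MIN against FSCAP). The function name, the dictionary layout, and every number are hypothetical; the delta values are passed in directly, so the sketch makes no claim about how Moab derives them or which sign they carry.

```python
def fairshare_factor(fs_weight, fs_cap, sub_weights, sub_values):
    """Weighted sum of fairshare subfactors, capped by fs_cap, scaled by fs_weight."""
    weighted_sum = sum(
        weight * sub_values.get(name, 0.0) for name, weight in sub_weights.items()
    )
    return fs_weight * min(fs_cap, weighted_sum)


# Hypothetical subfactor weights and per-job values, for illustration only.
weights = {"FSUSER": 10, "FSGROUP": 5, "FSJPU": -1}
values = {"FSUSER": 4.0, "FSGROUP": -2.0, "FSJPU": 3}

# Weighted sum is 10*4.0 + 5*(-2.0) + (-1)*3 = 27.
print(fairshare_factor(100, 20, weights, values))    # sum is capped at 20
print(fairshare_factor(100, 1000, weights, values))  # cap is not reached
```

With a cap of 20 the factor is 100 * 20; with a cap of 1000 the full weighted sum of 27 is used.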
How violated ceilings and floors affect fairshare-based priority
Moab determines FSUsageWeight in the previous section. In order to account for violated ceilings and floors, Moab multiplies that number by the FSUsagePriority as demonstrated in the following formula:
FSPriority = FSUsagePriority * FSUsageWeight
When a ceiling or floor is violated, FSUsagePriority = 0, so FSPriority = 0. This means the job will gain no priority because of fairshare. If fairshare is the only component of priority, then violation takes the priority to 0. For more information, see Priority-Based Fairshare and Fairshare Targets.
Weighting jobs by the amount of resources requested allows a site to favor particular types of jobs. Such prioritization may allow a site to better meet site mission objectives, improve fairness, or even improve overall system utilization.
Resource based prioritization is valuable when you want to favor jobs based on the resources requested. This is good in three main scenarios: (1) when you need to favor large resource jobs because it's part of your site's mission statement, (2) when you want to level the response time distribution across large and small jobs (small jobs are more easily backfilled and thus generally have better turnaround time), and (3) when you want to improve system utilization. While this may be surprising, system utilization actually increases as large resource jobs are pushed to the front of the queue. This keeps the smaller jobs in the back where they can be selected for backfill and thus increase overall system utilization. The situation is like the story about filling a cup with golf balls and sand. If you put the sand in first, it gets in the way and you are unable to put in as many golf balls. However, if you put in the golf balls first, the sand can easily be poured in around them completely filling the cup.
The calculation for determining the total resource priority factor is as follows:
Priority += RESWEIGHT* MIN(RESCAP, (
NODEWEIGHT * TotalNodesRequested +
PROCWEIGHT * TotalProcessorsRequested +
MEMWEIGHT * TotalMemoryRequested +
SWAPWEIGHT * TotalSwapRequested +
DISKWEIGHT * TotalDiskRequested +
WALLTIMEWEIGHT* TotalWalltimeRequested +
PEWEIGHT * TotalPERequested))
The sum of all weighted resources components is then multiplied by the RESWEIGHT parameter and capped by the RESCAP parameter. Memory, Swap, and Disk are all measured in megabytes (MB). The final resource component, PE, represents Processor Equivalents. This component can be viewed as a processor-weighted maximum percentage of total resources factor.
For example, if a job requested 25% of the processors and 50% of the total memory on a 128-processor system, it would have a PE value of MAX(25%, 50%) * 128, or 64. The concept of PEs is a highly effective metric in shared resource systems.
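The PE example can be checked with a small Python sketch. The function name `processor_equivalents` and the total memory figure of 512000 MB are assumptions made for illustration; only the 25%, 50%, and 128-processor figures come from the text.

```python
def processor_equivalents(requested: dict, total: dict, total_procs: int) -> float:
    """PE = (largest requested fraction of any resource) * total processors."""
    max_fraction = max(requested[name] / total[name] for name in requested)
    return max_fraction * total_procs


# 32 of 128 processors is 25%; 256000 of 512000 MB of memory is 50%.
requested = {"procs": 32, "mem": 256000}
total = {"procs": 128, "mem": 512000}

print(processor_equivalents(requested, total, 128))  # 64.0
```

Memory is the most constrained resource here, so the job counts as if it occupied half the machine even though it requests only a quarter of the processors.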
Ideal values for requested job processor count and walltime can be specified using PRIORITYTARGETPROCCOUNT and PRIORITYTARGETDURATION.
The Service component specifies which service metrics are of greatest value to the site. Favoring one service subcomponent over another generally improves that service metric.
The priority calculation for the service priority factor is as follows:
Priority += SERVICEWEIGHT * (
QUEUETIMEWEIGHT * <QUEUETIME> +
XFACTORWEIGHT * <XFACTOR> +
BYPASSWEIGHT * <BYPASSCOUNT> +
STARTCOUNTWEIGHT * <STARTCOUNT> +
DEADLINEWEIGHT * <DEADLINE> +
SPVIOLATIONWEIGHT * <SPBOOLEAN> +
USERPRIOWEIGHT * <USERPRIO> )
QueueTime (QUEUETIME) Subcomponent
In the priority calculation, a job's queue time is a duration measured in minutes. Using this subcomponent tends to prioritize jobs in a FIFO order. Favoring queue time improves queue time based fairness metrics and is probably the most widely used single job priority metric. In fact, under the initial default configuration, this is the only priority subcomponent enabled within Moab. It is important to note that within Moab, a job's queue time is not necessarily the amount of time since the job was submitted. The parameter JOBPRIOACCRUALPOLICY allows a site to select how a job will accrue queue time based on meeting various throttling policies. Regardless of the policy used to determine a job's queue time, this effective queue time is used in the calculation of the QUEUETIME, XFACTOR, TARGETQUEUETIME, and TARGETXFACTOR priority subcomponent values.
A distinct effective queue time is needed because many sites have users who like to work the system, whatever system it happens to be. A common practice at some long-established sites is for users to submit a large number of jobs and then place them on hold. These jobs remain on hold for an extended period of time; when the user is ready to run a job, the needed executable and data files are linked into place and the hold is released on one of the pre-submitted jobs. The extended hold time guarantees that this job is now the highest priority job and will be the next to run. The JOBPRIOACCRUALPOLICY parameter can prevent this practice, and it can also prevent "queue stuffers" from doing similar things on a shorter time scale. These "queue stuffer" users submit hundreds of jobs at once to swamp the machine and consume the available compute resources. The parameter removes any advantage from stuffing the queue by not allowing these jobs to accumulate queue time based priority until they meet certain idle and active Moab fairness policies (such as max job per user and max idle job per user).
As a final note, you can adjust the QUEUETIMEWEIGHT parameter on a per QoS basis using the QOSCFG parameter and the QTWEIGHT attribute. For example, the line QOSCFG[special] QTWEIGHT=5000 causes jobs using the QoS special to have their queue time subcomponent weight increased by 5000.
Expansion Factor (XFACTOR) Subcomponent
The expansion factor subcomponent has an effect similar to the queue time factor but favors shorter jobs based on their requested wallclock run time. In its traditional form, the expansion factor (XFactor) metric is calculated as follows:
XFACTOR = 1 + <QUEUETIME> / <EXECUTIONTIME>
However, a couple of aspects of this calculation make its use more difficult. First, the length of time the job will actually run—<EXECUTIONTIME>—is not actually known until the job completes. All that is known is how much time the job requests. Secondly, as described in the Queue Time Subcomponent section, Moab does not necessarily use the raw time since job submission to determine <QUEUETIME> to prevent various scheduler abuses. Consequently, Moab uses the following modified equation:
XFACTOR = 1 + <EFFQUEUETIME> / <WALLCLOCKLIMIT>
In the equation Moab uses, <EFFQUEUETIME> is the effective queue time subject to the JOBPRIOACCRUALPOLICY parameter and <WALLCLOCKLIMIT> is the user—or system—specified job wallclock limit.
With this equation, the XFactor of a short running job grows much faster over time than the XFactor of a long running job. The following table demonstrates this favoring of short running jobs:
Job Queue Time | 1 hour | 2 hours | 4 hours | 8 hours | 16 hours |
---|---|---|---|---|---|
XFactor for 1 hour job | 1 + (1 / 1) = 2.00 | 1 + (2 / 1) = 3.00 | 1 + (4 / 1) = 5.00 | 1 + (8 / 1) = 9.00 | 1 + (16 / 1) = 17.00 |
XFactor for 4 hour job | 1 + (1 / 4) = 1.25 | 1 + (2 / 4) = 1.50 | 1 + (4 / 4) = 2.00 | 1 + (8 / 4) = 3.00 | 1 + (16 / 4) = 5.00 |
Since XFactor is calculated as a ratio of two values, it is possible for this subcomponent to be almost arbitrarily large, potentially swamping the value of other priority subcomponents. This can be addressed either by using the subcomponent cap XFACTORCAP, or by using the XFMINWCLIMIT parameter. If the latter is used, the calculation for the XFactor subcomponent value becomes:
XFACTOR = 1 + <EFFQUEUETIME> / MAX(<XFMINWCLIMIT>,<WALLCLOCKLIMIT>)
Using the XFMINWCLIMIT parameter allows a site to prevent very short jobs from causing the XFactor subcomponent to grow inordinately.
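The effect of XFMINWCLIMIT is easiest to see numerically. The Python sketch below implements the modified equation directly; the function name `xfactor` is hypothetical, all times are assumed to be in the same unit (minutes here), and the wallclock limit is assumed to be positive.

```python
def xfactor(eff_queue_time: float, wallclock_limit: float,
            xf_min_wclimit: float = 0.0) -> float:
    """XFACTOR = 1 + EFFQUEUETIME / MAX(XFMINWCLIMIT, WALLCLOCKLIMIT)."""
    return 1 + eff_queue_time / max(xf_min_wclimit, wallclock_limit)


# Rows from the table above: jobs queued for 8 hours (480 minutes).
print(xfactor(480, 60))    # 1-hour job -> 9.0
print(xfactor(480, 240))   # 4-hour job -> 3.0

# A 1-minute job queued for 60 minutes, without and with a 15-minute floor.
print(xfactor(60, 1))                      # 61.0
print(xfactor(60, 1, xf_min_wclimit=15))   # 5.0
```

Without the floor, the one-minute job reaches an XFactor of 61 after an hour in the queue; with a 15-minute floor it is treated like a 15-minute job and reaches only 5.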
Some sites consider XFactor to be a fairer scheduling performance metric than queue time. At these sites, job XFactor is given far more weight than job queue time when calculating job priority, and job XFactor distribution consequently tends to be fairly level across a wide range of job durations. (That is, with the equation above, a flat XFactor distribution of 2.0 would result in a one-minute job being queued on average one minute, while a 24-hour job would be queued an average of 24 hours.)
Like queue time, the effective XFactor subcomponent weight is the sum of two weights, the XFACTORWEIGHT parameter and the QoS-specific XFWEIGHT setting. For example, the line QOSCFG[special] XFWEIGHT=5000 causes jobs using the QoS special to increase their expansion factor subcomponent weight by 5000.
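The per-QoS adjustment amounts to adding the QoS-specific value to the global weight. A minimal Python sketch follows; the global weight values and the names `QOS_OVERRIDES` and `effective_weights` are hypothetical, and only the `special` QoS settings of 5000 come from the examples in the text.

```python
# Hypothetical global weights; the QoS values mirror
# QOSCFG[special] QTWEIGHT=5000 and QOSCFG[special] XFWEIGHT=5000.
QUEUETIMEWEIGHT = 1
XFACTORWEIGHT = 10
QOS_OVERRIDES = {"special": {"QTWEIGHT": 5000, "XFWEIGHT": 5000}}


def effective_weights(qos: str) -> dict:
    """Effective subcomponent weight = global weight + QoS-specific weight."""
    extra = QOS_OVERRIDES.get(qos, {})
    return {
        "QUEUETIME": QUEUETIMEWEIGHT + extra.get("QTWEIGHT", 0),
        "XFACTOR": XFACTORWEIGHT + extra.get("XFWEIGHT", 0),
    }


print(effective_weights("special"))  # {'QUEUETIME': 5001, 'XFACTOR': 5010}
print(effective_weights("normal"))   # {'QUEUETIME': 1, 'XFACTOR': 10}
```

Jobs in a QoS without overrides simply use the global weights.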
Bypass (BYPASS) Subcomponent
The bypass factor is based on the bypass count of a job, where the bypass count is increased by one every time the job is bypassed by a lower priority job via backfill. Backfill starvation has never been reported, but if encountered, use the BYPASS subcomponent.
StartCount (STARTCOUNT) Subcomponent
Apply the startcount factor at sites where jobs have trouble starting or completing due to policies or failures. The primary causes of an idle job having a startcount greater than zero are resource manager level job start failure, administrator based requeue, or requeue based preemption.
Deadline (DEADLINE) Subcomponent
The deadline factor allows sites to take into consideration the proximity of a job to its DEADLINE. As a job moves closer to its deadline, its priority increases linearly. This is an alternative to the strict deadline discussed in QOS SERVICE.
Soft Policy Violation (SPVIOLATION) Subcomponent
The soft policy violation factor allows sites to favor jobs which do not violate their associated soft resource limit policies.
User Priority (USERPRIO) Subcomponent
The user priority subcomponent allows sites to consider end-user specified job priority in making the overall job priority calculation. By default under Moab, end-user specified priorities may not be positive and are bounded in the range 0 to -1024. See Manual Priority Usage and Enabling End-user Priorities for more information.
User priorities can be positive, ranging from -1024 to 1023, if ENABLEPOSUSERPRIORITY TRUE is specified in moab.cfg.
The target factor component of priority takes into account job scheduling performance targets. Currently, this is limited to target expansion factor and target queue time. Unlike the expansion factor and queue time factors described earlier which increase gradually over time, the target factor component is designed to grow exponentially as the target metric is approached. This behavior causes the scheduler to do essentially all in its power to make certain the scheduling targets are met.
The priority calculation for the target factor is as follows:
Priority += TARGETWEIGHT* (
TARGETQUEUETIMEWEIGHT * QueueTimeComponent +
TARGETXFACTORWEIGHT * XFactorComponent)
The queue time and expansion factor targets are specified on a per QoS basis using the XFTARGET and QTTARGET attributes with the QOSCFG parameter. The QueueTime and XFactor component calculations are designed to produce small values until the target value is approached, at which point these components grow very rapidly. If the target is missed, this component remains high and continues to grow, but it does not grow exponentially.
The Usage component applies to active jobs only. The priority calculation for the usage priority factor is as follows:
Priority += USAGEWEIGHT * (
USAGECONSUMEDWEIGHT * ProcSecondsConsumed +
USAGEHUNGERWEIGHT * ProcNeededToBalanceDynamicJob +
USAGEREMAININGWEIGHT * ProcSecRemaining +
USAGEEXECUTIONTIMEWEIGHT * SecondsSinceStart +
USAGEPERCENTWEIGHT * WalltimePercent )
The Attribute component allows the incorporation of job attributes into a job's priority. The most common usage for this capability is to do one of the following:
To use job attribute based prioritization, the JOBPRIOF parameter must be specified to set corresponding attribute priorities. To favor jobs based on node feature requirements, the parameter NODETOJOBATTRMAP must be set to map node feature requests to job attributes.
The priority calculation for the attribute priority factor is as follows:
Priority += ATTRWEIGHT * (
ATTRATTRWEIGHT * <ATTRPRIORITY> +
ATTRSTATEWEIGHT * <STATEPRIORITY> +
ATTRGRESWEIGHT * <GRESPRIORITY> +
JOBIDWEIGHT * <JOBID> +
JOBNAMEWEIGHT * <JOBNAME_INTEGER> )
Example 5-1:
ATTRWEIGHT 100
ATTRATTRWEIGHT 1
ATTRSTATEWEIGHT 1
ATTRGRESWEIGHT 5

# favor suspended jobs
# disfavor preemptible jobs
# favor jobs requesting 'matlab'
JOBPRIOF STATE[Running]=100 STATE[Suspended]=1000 ATTR[PREEMPTEE]=-200 ATTR[gpfs]=30 GRES[matlab]=400

# map node features to job features
NODETOJOBATTRMAP gpfs,pvfs
...
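As a rough check of what Example 5-1 produces, the attribute formula can be evaluated with the Python sketch below. The function name `attr_priority` is hypothetical. Summing the priorities when a job matches several attributes or generic resources is an assumption, and the JOBID and JOBNAME terms are left out because the example sets no weights for them.

```python
# Values copied from Example 5-1.
ATTRWEIGHT = 100
ATTRATTRWEIGHT = 1
ATTRSTATEWEIGHT = 1
ATTRGRESWEIGHT = 5
STATE_PRIO = {"Running": 100, "Suspended": 1000}
ATTR_PRIO = {"PREEMPTEE": -200, "gpfs": 30}
GRES_PRIO = {"matlab": 400}


def attr_priority(state: str, attrs: list, gres: list) -> int:
    """Attribute contribution; multiple matches are summed (an assumption)."""
    attr_sum = sum(ATTR_PRIO.get(a, 0) for a in attrs)
    gres_sum = sum(GRES_PRIO.get(g, 0) for g in gres)
    return ATTRWEIGHT * (
        ATTRATTRWEIGHT * attr_sum
        + ATTRSTATEWEIGHT * STATE_PRIO.get(state, 0)
        + ATTRGRESWEIGHT * gres_sum
    )


# Suspended preemptee requesting matlab: 100 * (-200 + 1000 + 5 * 400)
print(attr_priority("Suspended", ["PREEMPTEE"], ["matlab"]))  # 280000
# Running job that maps to the gpfs attribute: 100 * (30 + 100)
print(attr_priority("Running", ["gpfs"], []))                 # 13000
```

Under these assumptions the suspended job is strongly favored even though its preemptee attribute carries a negative priority.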