
5.1.2 Job Priority Factors

Moab allows jobs to be prioritized based on a range of job-related factors. These factors are broken down into a two-tier hierarchy of priority factors and subfactors, each of which can be independently assigned a weight. This approach provides the administrator with detailed yet straightforward control of the job selection process.

Each factor and subfactor can be configured with independent priority weight and priority cap values (described later). In addition, per credential and per QoS priority weight adjustments may be specified for a subset of the priority factors. For example, QoS credentials can adjust the queuetime subfactor weight and group credentials can adjust fairshare subfactor weight.

The following table highlights the factors and subfactors that make up a job's total priority.

Factor
  SubFactor        Metric

CRED (job credentials)
  USER             user-specific priority (See USERCFG)
  GROUP            group-specific priority (See GROUPCFG)
  ACCOUNT          account-specific priority (See ACCOUNTCFG)
  QOS              QoS-specific priority (See QOSCFG)
  CLASS            class/queue-specific priority (See CLASSCFG)

FS (fairshare usage)
  FSUSER           user-based historical usage (See Fairshare Overview)
  FSGROUP          group-based historical usage (See Fairshare Overview)
  FSACCOUNT        account-based historical usage (See Fairshare Overview)
  FSQOS            QoS-based historical usage (See Fairshare Overview)
  FSCLASS          class/queue-based historical usage (See Fairshare Overview)
  FSGUSER          imported global user-based historical usage (See ID Manager and Fairshare Overview)
  FSGGROUP         imported global group-based historical usage (See ID Manager and Fairshare Overview)
  FSGACCOUNT       imported global account-based historical usage (See ID Manager and Fairshare Overview)
  FSJPU            current active jobs associated with job user
  FSPPU            current number of processors allocated to active jobs associated with job user
  FSPSPU           current number of processor-seconds allocated to active jobs associated with job user
  WCACCURACY       user's current historical job wallclock accuracy calculated as total processor-seconds dedicated / total processor-seconds requested. Factor values are in the range of 0.0 to 1.0.

RES (requested job resources)
  NODE             number of nodes requested
  PROC             number of processors requested
  MEM              total real memory requested (in MB)
  SWAP             total virtual memory requested (in MB)
  DISK             total local disk requested (in MB)
  PS               total processor-seconds requested
  PE               total processor-equivalent requested
  WALLTIME         total walltime requested (in seconds)

SERV (current service levels)
  QUEUETIME        time job has been queued (in minutes)
  XFACTOR          minimum job expansion factor
  BYPASS           number of times job has been bypassed by backfill
  STARTCOUNT       number of times job has been restarted
  DEADLINE         proximity to job deadline
  SPVIOLATION      Boolean indicating whether the active job violates a soft usage limit
  USERPRIO         user-specified job priority

TARGET (target service levels)
  TARGETQUEUETIME  time until queuetime target is reached (exponential)
  TARGETXFACTOR    distance to target expansion factor (exponential)

USAGE (consumed resources -- active jobs only)
  CONSUMED         processor-seconds dedicated to date
  REMAINING        processor-seconds outstanding
  PERCENT          percent of required walltime consumed
  EXECUTIONTIME    seconds since job started

ATTR (job attribute-based prioritization)
  ATTRATTR         Attribute priority if specified job attribute is set (attributes may be user-defined or one of preemptor or preemptee). Default is 0.
  ATTRSTATE        Attribute priority if job is in specified state (see Job States). Default is 0.
  ATTRGRES         Attribute priority if a generic resource is requested. Default is 0.

*CAP parameters (FSCAP, for example) are available to limit the maximum absolute value of each priority component and subcomponent. If set to a positive value, a priority cap will bound priority component values in both the positive and negative directions.

All *CAP and *WEIGHT parameters are specified as positive or negative integers. Non-integer values are not supported.
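The two-sided capping described above can be sketched in Python. This is only an illustration of the stated rule, not Moab's implementation: the function name apply_cap is hypothetical, and treating a cap that is not positive as "no cap" is an assumption the text does not confirm.

```python
def apply_cap(value: float, cap: int) -> float:
    """Bound value to the range [-cap, +cap] when cap is positive.

    Assumption: a cap that is zero or negative means no cap is applied.
    """
    if cap > 0:
        return max(-cap, min(cap, value))
    return value


print(apply_cap(15000, 10000))   # 10000
print(apply_cap(-15000, 10000))  # -10000
print(apply_cap(2500, 10000))    # 2500
```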

5.1.2-A Credential (CRED) Component

The credential component allows a site to prioritize jobs based on political issues such as the relative importance of certain groups or accounts. This allows direct political priorities to be applied to jobs.

The priority calculation for the credential component is as follows:

Priority += CREDWEIGHT * (
   USERWEIGHT * Job.User.Priority +
   GROUPWEIGHT * Job.Group.Priority +
   ACCOUNTWEIGHT * Job.Account.Priority +
   QOSWEIGHT * Job.Qos.Priority +
   CLASSWEIGHT * Job.Class.Priority)

All user, group, account, QoS, and class priorities are specified by setting the PRIORITY attribute using the respective *CFG parameter (namely, USERCFG, GROUPCFG, ACCOUNTCFG, QOSCFG, and CLASSCFG).

For example, to set user and group priorities, you might use the following:

CREDWEIGHT      1
USERWEIGHT      1
GROUPWEIGHT     1
USERCFG[john]   PRIORITY=2000
USERCFG[paul]   PRIORITY=-1000
GROUPCFG[staff] PRIORITY=10000

Class (or queue) priority may also be specified via the resource manager where supported (as in PBS queue priorities). However, if Moab class priority values are also specified, the resource manager priority values will be overwritten.

All priorities may be positive or negative.
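As a rough worked example, the credential formula can be applied to the sample user and group settings shown earlier in this section. The Python sketch below is illustrative only: the function name cred_component is hypothetical, credentials with no configured PRIORITY are assumed to contribute 0, and the account, QoS, and class terms are left out because the sample configuration does not set them.

```python
# Values taken from the sample configuration in this section.
CREDWEIGHT = 1
USERWEIGHT = 1
GROUPWEIGHT = 1
USER_PRIORITY = {"john": 2000, "paul": -1000}
GROUP_PRIORITY = {"staff": 10000}


def cred_component(user: str, group: str) -> int:
    """Credential contribution for a job owned by user in group.

    Assumption: a credential with no configured PRIORITY contributes 0.
    """
    return CREDWEIGHT * (
        USERWEIGHT * USER_PRIORITY.get(user, 0)
        + GROUPWEIGHT * GROUP_PRIORITY.get(group, 0)
    )


print(cred_component("john", "staff"))  # 12000
print(cred_component("paul", "staff"))  # 9000
```

Under these assumptions, a job from john in group staff would outrank an otherwise identical job from paul by 3000 priority points.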

5.1.2-B Fairshare (FS) Component

Fairshare components allow a site to favor jobs based on short-term historical usage. The Fairshare Overview describes the configuration and use of fairshare in detail.

The fairshare factor is used to adjust a job's priority based on current and historical percentage system utilization of the job's user, group, account, class, or QoS. This allows sites to steer workload toward a particular usage mix across user, group, account, class, and QoS dimensions.

The fairshare priority factor calculation is as follows:

Priority += FSWEIGHT * MIN(FSCAP, (
   FSUSERWEIGHT * DeltaUserFSUsage +
   FSGROUPWEIGHT * DeltaGroupFSUsage +
   FSACCOUNTWEIGHT * DeltaAccountFSUsage +
   FSQOSWEIGHT * DeltaQOSFSUsage +
   FSCLASSWEIGHT * DeltaClassFSUsage +
   FSJPUWEIGHT * ActiveUserJobs +
   FSPPUWEIGHT * ActiveUserProcs +
   FSPSPUWEIGHT * ActiveUserPS +
   WCACCURACYWEIGHT * UserWCAccuracy ))

All *WEIGHT parameters just listed are specified on a per-partition basis in the moab.cfg file. The Delta*FSUsage components represent the difference between actual fairshare usage and the corresponding fairshare usage target. Actual fairshare usage is determined from historical usage over the time frame specified in the fairshare configuration. The target usage can be a target, floor, or ceiling value as specified in the fairshare configuration file. See the Fairshare Overview for further information on configuring and tuning fairshare. Additional insight may be available in the fairshare usage example. The ActiveUser* components represent current usage by the job's user credential.
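The fairshare formula can be transcribed literally into a short Python sketch. The function name fs_component, the dictionary keys, and all numeric inputs are hypothetical and chosen only to show how the weighted sum, FSCAP, and FSWEIGHT interact. The sketch takes the delta and active-usage values as given and does not model how Moab derives them.

```python
def fs_component(fs_weight: int, fs_cap: int, weights: dict, values: dict) -> float:
    """FSWEIGHT * MIN(FSCAP, sum of subfactor weight * subfactor value).

    weights and values are keyed by subfactor name; a subfactor missing
    from values is assumed to contribute 0.
    """
    total = sum(w * values.get(name, 0.0) for name, w in weights.items())
    return fs_weight * min(fs_cap, total)


# Hypothetical inputs for illustration only.
weights = {"FSUSER": 10, "FSGROUP": 5, "FSJPU": -1}
values = {"FSUSER": 4.0, "FSGROUP": -2.0, "FSJPU": 3}

# Weighted sum is 10*4.0 + 5*(-2.0) + (-1)*3 = 27.0
print(fs_component(100, 1000, weights, values))  # 2700.0 (cap not reached)
print(fs_component(100, 20, weights, values))    # 2000 (sum capped at 20)
```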

How violated ceilings and floors affect fairshare-based priority

Moab determines FSUsageWeight as described in the previous section. To account for violated ceilings and floors, Moab multiplies that number by FSUsagePriority, as shown in the following formula:

FSPriority = FSUsagePriority * FSUsageWeight

When a ceiling or floor is violated, FSUsagePriority = 0, so FSPriority = 0. This means the job will gain no priority because of fairshare. If fairshare is the only component of priority, then violation takes the priority to 0. For more information, see Priority-Based Fairshare and Fairshare Targets.

5.1.2-C Resource (RES) Component

Weighting jobs by the amount of resources requested allows a site to favor particular types of jobs. Such prioritization may allow a site to better meet site mission objectives, improve fairness, or even improve overall system utilization.

Resource-based prioritization is valuable when you want to favor jobs based on the resources requested. This is useful in three main scenarios: (1) when you need to favor large resource jobs because it is part of your site's mission statement, (2) when you want to level the response time distribution across large and small jobs (small jobs are more easily backfilled and thus generally have better turnaround time), and (3) when you want to improve system utilization. While this may be surprising, system utilization actually increases as large resource jobs are pushed to the front of the queue. This keeps the smaller jobs in the back where they can be selected for backfill and thus increase overall system utilization. The situation is like the story about filling a cup with golf balls and sand. If you put the sand in first, it gets in the way and you are unable to put in as many golf balls. However, if you put in the golf balls first, the sand can easily be poured in around them, completely filling the cup.

The calculation for determining the total resource priority factor is as follows:

Priority += RESWEIGHT * MIN(RESCAP, (
   NODEWEIGHT * TotalNodesRequested +
   PROCWEIGHT * TotalProcessorsRequested +
   MEMWEIGHT * TotalMemoryRequested +
   SWAPWEIGHT * TotalSwapRequested +
   DISKWEIGHT * TotalDiskRequested +
   WALLTIMEWEIGHT * TotalWalltimeRequested +
   PEWEIGHT * TotalPERequested))

The sum of all weighted resource components is capped by the RESCAP parameter and then multiplied by the RESWEIGHT parameter. Memory, Swap, and Disk are all measured in megabytes (MB). The final resource component, PE, represents Processor Equivalents. This component can be viewed as a processor-weighted maximum percentage of total resources factor.

For example, if a job requested 25% of the processors and 50% of the total memory on a 128-processor system, it would have a PE value of MAX(25%, 50%) * 128, or 64. The concept of PEs is a highly effective metric in shared resource systems.
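The processor-equivalent arithmetic can be sketched as follows. This is a minimal Python illustration of the "maximum fraction times processor count" idea described above, not Moab's code: the function name, the dictionary keys, and the memory figures are hypothetical, and only the resources passed in are considered.

```python
def processor_equivalents(requested: dict, configured: dict) -> float:
    """Largest requested fraction of any resource, scaled by total processors.

    requested and configured map resource names to amounts; configured
    must contain a "procs" entry giving the system processor count.
    """
    largest_fraction = max(requested[r] / configured[r] for r in requested)
    return largest_fraction * configured["procs"]


# 25% of 128 processors and 50% of memory (memory sizes are hypothetical).
system = {"procs": 128, "mem_mb": 262144}
job = {"procs": 32, "mem_mb": 131072}

print(processor_equivalents(job, system))  # 64.0
```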

Ideal values for requested job processor count and walltime can be specified using PRIORITYTARGETPROCCOUNT and PRIORITYTARGETDURATION.

5.1.2-D Service (SERVICE) Component

The Service component specifies which service metrics are of greatest value to the site. Favoring one service subcomponent over another generally improves that service metric.

The priority calculation for the service priority factor is as follows:

Priority += SERVICEWEIGHT * (
   QUEUETIMEWEIGHT * <QUEUETIME> +
   XFACTORWEIGHT * <XFACTOR> +
   BYPASSWEIGHT * <BYPASSCOUNT> +
   STARTCOUNTWEIGHT * <STARTCOUNT> +
   DEADLINEWEIGHT * <DEADLINE> +
   SPVIOLATIONWEIGHT * <SPBOOLEAN> +
   USERPRIOWEIGHT * <USERPRIO> )

QueueTime (QUEUETIME) Subcomponent

In the priority calculation, a job's queue time is a duration measured in minutes. Using this subcomponent tends to prioritize jobs in a FIFO order. Favoring queue time improves queue time based fairness metrics and is probably the most widely used single job priority metric. In fact, under the initial default configuration, this is the only priority subcomponent enabled within Moab. It is important to note that within Moab, a job's queue time is not necessarily the amount of time since the job was submitted. The parameter JOBPRIOACCRUALPOLICY allows a site to select how a job will accrue queue time based on meeting various throttling policies. Regardless of the policy used to determine a job's queue time, this effective queue time is used in the calculation of the QUEUETIME, XFACTOR, TARGETQUEUETIME, and TARGETXFACTOR priority subcomponent values.

The need for a distinct effective queue time arises because many sites have users who like to work the system, whatever system it happens to be. A common practice at some long-established sites is for some users to submit a large number of jobs and then place them on hold. These jobs remain on hold for an extended period of time, and when the user is ready to run a job, the needed executable and data files are linked into place and the hold is released on one of these pre-submitted jobs. The extended hold time guarantees that this job is now the highest priority job and will be the next to run. The JOBPRIOACCRUALPOLICY parameter can prevent this practice and can also prevent "queue stuffers" from doing similar things on a shorter time scale. These "queue stuffer" users submit hundreds of jobs at once to swamp the machine and consume the available compute resources. This parameter prevents the user from gaining any advantage from stuffing the queue by not allowing these jobs to accumulate any queue time based priority until they meet certain idle and active Moab fairness policies (such as max job per user and max idle job per user).

As a final note, you can adjust the QUEUETIMEWEIGHT parameter on a per QoS basis using the QOSCFG parameter and the QTWEIGHT attribute. For example, the line QOSCFG[special] QTWEIGHT=5000 causes jobs using the QoS special to have their queue time subcomponent weight increased by 5000.

Expansion Factor (XFACTOR) Subcomponent

The expansion factor subcomponent has an effect similar to the queue time factor but favors shorter jobs based on their requested wallclock run time. In its traditional form, the expansion factor (XFactor) metric is calculated as follows:

XFACTOR = 1 + <QUEUETIME> / <EXECUTIONTIME>

However, a couple of aspects of this calculation make its use more difficult. First, the length of time the job will actually run, <EXECUTIONTIME>, is not known until the job completes. All that is known is how much time the job requests. Second, as described in the Queue Time Subcomponent section, Moab does not necessarily use the raw time since job submission to determine <QUEUETIME>; this prevents various scheduler abuses. Consequently, Moab uses the following modified equation:

XFACTOR = 1 + <EFFQUEUETIME> / <WALLCLOCKLIMIT>

In the equation Moab uses, <EFFQUEUETIME> is the effective queue time subject to the JOBPRIOACCRUALPOLICY parameter and <WALLCLOCKLIMIT> is the user- or system-specified job wallclock limit.

Using this equation, it can be seen that short running jobs have an XFactor that grows much faster over time than the XFactor of long running jobs. The following table demonstrates this favoring of short running jobs:

Job Queue Time            1 hour               2 hours              4 hours              8 hours              16 hours
XFactor for 1 hour job    1 + (1 / 1) = 2.00   1 + (2 / 1) = 3.00   1 + (4 / 1) = 5.00   1 + (8 / 1) = 9.00   1 + (16 / 1) = 17.00
XFactor for 4 hour job    1 + (1 / 4) = 1.25   1 + (2 / 4) = 1.50   1 + (4 / 4) = 2.00   1 + (8 / 4) = 3.00   1 + (16 / 4) = 5.00

Since XFactor is calculated as a ratio of two values, it is possible for this subcomponent to be almost arbitrarily large, potentially swamping the value of other priority subcomponents. This can be addressed either by using the subcomponent cap XFACTORCAP, or by using the XFMINWCLIMIT parameter. If the latter is used, the calculation for the XFactor subcomponent value becomes:

XFACTOR = 1 + <EFFQUEUETIME> / MAX(<XFMINWCLIMIT>,<WALLCLOCKLIMIT>)

Using the XFMINWCLIMIT parameter allows a site to prevent very short jobs from causing the XFactor subcomponent to grow inordinately.
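The modified XFactor equation, including the XFMINWCLIMIT floor, can be checked with a few lines of Python. The function and argument names below are hypothetical. The only requirement is that all three arguments use the same time unit.

```python
def xfactor(eff_queue_time: float, wallclock_limit: float,
            xf_min_wclimit: float = 0.0) -> float:
    """XFACTOR = 1 + EFFQUEUETIME / MAX(XFMINWCLIMIT, WALLCLOCKLIMIT).

    All arguments must be expressed in the same time unit.
    """
    return 1 + eff_queue_time / max(xf_min_wclimit, wallclock_limit)


# Reproduce two table entries (times in hours).
print(xfactor(16, 1))  # 17.0
print(xfactor(8, 4))   # 3.0

# A 1-minute job queued for 60 minutes, without and with a 10-minute floor.
print(xfactor(60, 1))      # 61.0
print(xfactor(60, 1, 10))  # 7.0
```

The last two calls show why the floor matters: without it, a very short job accumulates a large XFactor after only an hour in the queue.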

Some sites consider XFactor to be a fairer scheduling performance metric than queue time. At these sites, job XFactor is given far more weight than job queue time when calculating job priority, and the job XFactor distribution consequently tends to be fairly level across a wide range of job durations. (That is, under the equation above, a flat XFactor distribution of 2.0 would result in a one-minute job being queued on average one minute, while a 24-hour job would be queued an average of 24 hours.)

Like queue time, the effective XFactor subcomponent weight is the sum of two weights, the XFACTORWEIGHT parameter and the QoS-specific XFWEIGHT setting. For example, the line QOSCFG[special] XFWEIGHT=5000 causes jobs using the QoS special to increase their expansion factor subcomponent weight by 5000.

Bypass (BYPASS) Subcomponent

The bypass factor is based on the bypass count of a job where the bypass count is increased by one every time the job is bypassed by a lower priority job via backfill. Backfill starvation has never been reported, but if encountered, use the BYPASS subcomponent.

StartCount (STARTCOUNT) Subcomponent

Apply the startcount factor at sites where jobs have trouble starting or completing due to policies or failures. The primary causes of an idle job having a startcount greater than zero are resource manager level job start failure, administrator based requeue, or requeue based preemption.

Deadline (DEADLINE) Subcomponent

The deadline factor allows sites to take into consideration the proximity of a job to its DEADLINE. As a job moves closer to its deadline, its priority increases linearly. This is an alternative to the strict deadline discussed in QOS SERVICE.

Soft Policy Violation (SPVIOLATION) Subcomponent

The soft policy violation factor allows sites to favor jobs which do not violate their associated soft resource limit policies.

User Priority (USERPRIO) Subcomponent

The user priority subcomponent allows sites to consider end-user specified job priority in making the overall job priority calculation. By default, end-user specified priorities under Moab may not be positive and are bounded in the range 0 to -1024. See Manual Priority Usage and Enabling End-user Priorities for more information.

User priorities can be positive, ranging from -1024 to 1023, if ENABLEPOSUSERPRIORITY TRUE is specified in moab.cfg.
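The two permitted ranges can be summarized in a small Python sketch. The function names are hypothetical, and the sketch only reports whether a value falls inside the documented range. It makes no claim about how Moab itself handles an out-of-range request.

```python
def user_priority_range(enable_pos_user_priority: bool = False) -> tuple:
    """Return the (low, high) bounds for end-user specified priority."""
    if enable_pos_user_priority:
        return (-1024, 1023)
    return (-1024, 0)


def is_valid_user_priority(prio: int, enable_pos_user_priority: bool = False) -> bool:
    """True if prio lies within the documented range for the given setting."""
    low, high = user_priority_range(enable_pos_user_priority)
    return low <= prio <= high


print(is_valid_user_priority(-500))       # True
print(is_valid_user_priority(100))        # False
print(is_valid_user_priority(100, True))  # True
```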

5.1.2-E Target Service (TARG) Component

The target factor component of priority takes into account job scheduling performance targets. Currently, this is limited to target expansion factor and target queue time. Unlike the expansion factor and queue time factors described earlier which increase gradually over time, the target factor component is designed to grow exponentially as the target metric is approached. This behavior causes the scheduler to do essentially all in its power to make certain the scheduling targets are met.

The priority calculation for the target factor is as follows:

Priority += TARGETWEIGHT * (
   TARGETQUEUETIMEWEIGHT * QueueTimeComponent +
   TARGETXFACTORWEIGHT   * XFactorComponent)

The queue time and expansion factor targets are specified on a per QoS basis using the QTTARGET and XFTARGET attributes with the QOSCFG parameter. The QueueTime and XFactor component calculations are designed to produce small values until the target value is approached, at which point these components grow very rapidly. If the target is missed, this component remains high and continues to grow, but it does not grow exponentially.

5.1.2-F Usage (USAGE) Component

The Usage component applies to active jobs only. The priority calculation for the usage priority factor is as follows:

Priority += USAGEWEIGHT * (
   USAGECONSUMEDWEIGHT      * ProcSecondsConsumed +
   USAGEHUNGERWEIGHT        * ProcNeededToBalanceDynamicJob +
   USAGEREMAININGWEIGHT     * ProcSecRemaining +
   USAGEEXECUTIONTIMEWEIGHT * SecondsSinceStart +
   USAGEPERCENTWEIGHT       * WalltimePercent )

5.1.2-G Job Attribute (ATTR) Component

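The Attribute component allows the incorporation of job attributes into a job's priority. Common uses of this capability include favoring jobs in a particular state (such as suspended jobs), disfavoring preemptible jobs, and favoring jobs that request a particular generic resource or node feature.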

To use job attribute based prioritization, the JOBPRIOF parameter must be specified to set corresponding attribute priorities. To favor jobs based on node feature requirements, the parameter NODETOJOBATTRMAP must be set to map node feature requests to job attributes.

The priority calculation for the attribute priority factor is as follows:

Priority += ATTRWEIGHT * (
   ATTRATTRWEIGHT * <ATTRPRIORITY> +
   ATTRSTATEWEIGHT * <STATEPRIORITY> +
   ATTRGRESWEIGHT * <GRESPRIORITY> +
   JOBIDWEIGHT * <JOBID> +
   JOBNAMEWEIGHT * <JOBNAME_INTEGER> )

Example 5-1:

ATTRWEIGHT      100
ATTRATTRWEIGHT    1
ATTRSTATEWEIGHT   1
ATTRGRESWEIGHT    5

# favor suspended jobs
# disfavor preemptible jobs
# favor jobs requesting 'matlab'
JOBPRIOF STATE[Running]=100  STATE[Suspended]=1000  ATTR[PREEMPTEE]=-200  ATTR[gpfs]=30  GRES[matlab]=400

# map node features to job features
NODETOJOBATTRMAP  gpfs,pvfs
...
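To see what the settings in Example 5-1 would produce, the attribute formula can be evaluated in a short Python sketch. Several things here are assumptions rather than documented behavior: the function name is hypothetical, the JOBID and JOBNAME terms are treated as contributing 0, and when a job matches more than one attribute or generic resource the matching priorities are summed.

```python
# Weights and JOBPRIOF values copied from Example 5-1.
ATTRWEIGHT = 100
ATTRATTRWEIGHT = 1
ATTRSTATEWEIGHT = 1
ATTRGRESWEIGHT = 5
STATE_PRIO = {"Running": 100, "Suspended": 1000}
ATTR_PRIO = {"PREEMPTEE": -200, "gpfs": 30}
GRES_PRIO = {"matlab": 400}


def attr_component(state: str, attrs: list, gres: list) -> int:
    """Attribute contribution for a job.

    Assumptions: unmatched names contribute 0 (the documented default),
    multiple matches are summed, and the JOBID/JOBNAME terms are ignored.
    """
    attr_p = sum(ATTR_PRIO.get(a, 0) for a in attrs)
    state_p = STATE_PRIO.get(state, 0)
    gres_p = sum(GRES_PRIO.get(g, 0) for g in gres)
    return ATTRWEIGHT * (
        ATTRATTRWEIGHT * attr_p
        + ATTRSTATEWEIGHT * state_p
        + ATTRGRESWEIGHT * gres_p
    )


print(attr_component("Suspended", [], ["matlab"]))  # 300000
print(attr_component("Running", ["PREEMPTEE"], []))  # -10000
print(attr_component("Running", ["gpfs"], []))       # 13000
```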
