8.4 Preemption Management

Enabling Preemption
Types of Preemption
Testing and Troubleshooting Preemption

Many sites run workloads of varying importance. Some jobs must obtain resources immediately, while others are less sensitive to turnaround time but have an insatiable appetite for compute cycles. These latter jobs often have turnaround times on the order of weeks or months. Cycle stealing handles such situations well: it lets a system run low-priority, preemptible jobs whenever nothing more pressing is running. Such systems are often deployed on compute farms of desktops, where jobs must vacate whenever interactive use is detected.

Preemption (requeueing) does not work with dynamic provisioning.

8.4.1 Enabling Preemption

Preemption can be enabled in one of three ways: manual intervention, QoS-based configuration, or use of the preemption-based backfill algorithm. In all of these cases, a single preemptor is limited to 32 preemptees.

Before enabling preemption, verify that BACKFILLPOLICY is set to FIRSTFIT and that JOBNODEMATCHPOLICY is not set to EXACTNODE.

8.4.1.1 Admin Preemption Commands

The mjobctl command can be used to preempt jobs. Specifically, the command can be used to modify a job's execution state in the following ways:

Action	Flag	Details
Cancel	-c	Terminate and remove job from queue.
Checkpoint	-C	Terminate and checkpoint job leaving job in queue.
Requeue	-R	Terminate job leaving job in queue.
Resume	-r	Resume suspended job.
Start (execute)	-x	Start idle job.
Suspend	-s	Suspend active job.

In general, users are allowed to suspend or terminate jobs they own. Administrators are allowed to suspend, terminate, resume, and execute any queued jobs.

8.4.1.2 QoS-Based Preemption

Moab's QoS-based preemption system lets a site specify preemption rules and control access to preemption privileges. These abilities can be used to increase system throughput, improve job response time for specific classes of jobs, or enable various political policies. All policies are enabled by marking some QoSs with the PREEMPTOR flag and others with the PREEMPTEE flag. For example, to enable a cycle-stealing, high-throughput cluster, create one QoS for high-priority jobs and mark it with the PREEMPTOR flag, then create another QoS for low-priority jobs and mark it with the PREEMPTEE flag.

If desired, the RESERVATIONPOLICY parameter can be set to NEVER. With this configuration, low priority, preemptee jobs can be started whenever idle resources are available. These jobs are allowed to run until a high priority job arrives, at which point the necessary low priority jobs are preempted and the needed resources freed. This allows near immediate resource access for the high priority jobs. Using this approach, a cluster can maintain near 100% system utilization while still delivering excellent turnaround time to the jobs of greatest value.

To specify the desired type of preemption, use the PREEMPTPOLICY parameter.

QoS-based preemption occurs only when all three of the following conditions are satisfied:

The preemptor job has the PREEMPTOR attribute set.
The preemptee job has the PREEMPTEE attribute set.
The preemptor job has a higher priority than the preemptee job.

Use of the preemption system need not be limited to controlling low priority jobs. Other uses include optimistic scheduling and development job support.

Example:

In the following example, high-priority jobs can always preempt low-priority jobs but cannot preempt med or other high-priority jobs.

PREEMPTPOLICY REQUEUE
# enable qos priority to make preemptors higher priority than preemptees
QOSWEIGHT 1   
QOSCFG[high] QFLAGS=PREEMPTOR  PRIORITY=1000
QOSCFG[med]
QOSCFG[low]  QFLAGS=PREEMPTEE
# associate class 'special' with QOS high
CLASSCFG[special] QDEF=high&

As in the previous example, any class can be bound to a particular QoS using the QDEF attribute of the CLASSCFG parameter with the & marker.

Preventing Thrashing

In environments where job checkpointing or job suspension incurs significant overhead, it may be desirable to constrain the rate at which job preemption is allowed. The JOBPREEMPTMINACTIVETIME parameter throttles job preemption: it prevents a newly started or newly resumed job from being eligible for preemption until it has executed for the specified time. Conversely, the JOBPREEMPTMAXACTIVETIME parameter excludes jobs from preemption once they have run for a specified amount of time.
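
As a minimal sketch, the two limits might be combined in moab.cfg as follows. The values are hypothetical, and the HH:MM:SS time format is an assumption that should be checked against the parameter reference.

```
# Illustrative values only; tune to the site's checkpoint or suspend overhead
# A job is not eligible for preemption during its first 15 minutes of execution
JOBPREEMPTMINACTIVETIME 00:15:00
# A job that has executed for 12 hours is no longer eligible for preemption
JOBPREEMPTMAXACTIVETIME 12:00:00
```

Under these assumed settings, a preemptee would be open to preemption only in the window between roughly 15 minutes and 12 hours of execution.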

8.4.1.3 Preemption Based Backfill

The PREEMPT backfill policy allows a site to take advantage of optimistic scheduling. By default, backfill only allows jobs to run if they are guaranteed to have adequate time to run to completion. However, statistically, most jobs do not use their full requested wallclock limit. The PREEMPT backfill policy allows the scheduler to start backfill jobs even if required walltime is not available. If the job runs too long and interferes with another job that was guaranteed a particular timeslot, the backfill job is preempted and the priority job is allowed to run. When another potential timeslot becomes available, the preempted backfill job will again be optimistically executed. In environments with checkpointing or with poor wallclock accuracies, this algorithm has potential for significant savings. See the backfill section for more information.
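
A minimal configuration sketch, assuming the policy is selected through the BACKFILLPOLICY parameter and that requeue is the desired preemption method. Because the note under Enabling Preemption asks for BACKFILLPOLICY FIRSTFIT, confirm against the backfill section which setting applies before using it.

```
# Assumption: PREEMPT is supplied as the BACKFILLPOLICY value
BACKFILLPOLICY PREEMPT
# How a backfill job that overruns its timeslot is preempted
PREEMPTPOLICY  REQUEUE
```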

8.4.1.4 Trigger and Context Based Preemption Policies

Rules regarding which jobs can be preemptors and which are preemptees can be configured to take into account aspects of the compute environment. Some of these context sensitive rules are listed here:

Mark a job a preemptor if its delivered or expected response time exceeds a specified threshold.
Mark a job preemptible if it violates soft policy usage limits or fairshare targets.
Mark a job a preemptor if it is running in a reservation it owns.
Preempt a job as the result of a specific user, node, job, reservation, or other object event using object triggers.
Preempt a job as the result of an external generic event or generic metric.

8.4.2 Types of Preemption

How the scheduler preempts a job is controlled by the PREEMPTPOLICY parameter. This parameter allows preemption to be enforced using one of the following methods: suspend, checkpoint, requeue, or cancel.

Moab uses preemption escalation to free up resources. If PREEMPTPOLICY is set to suspend, Moab uses that method when it is available but escalates to a potentially more disruptive method if necessary to preempt the job and free up resources. The precedence of preemption methods, from least to most disruptive, is suspend, checkpoint, requeue, and cancel.

8.4.2.1 Job Requeue

Under this policy, active jobs are terminated and returned to the job queue in an idle state.

For a job to be requeued, it must be marked as restartable. If not, it will be canceled. If supported by the resource manager, the job restartable flag can be set when the job is submitted by using the msub -r option. Otherwise, this can be accomplished using the JOBFLAGS attribute of the associated class or QoS credential, as in the following example:

CLASSCFG[low] JOBFLAGS=RESTARTABLE

8.4.2.2 Job Suspend

Suspend causes active jobs to stop executing but to remain in memory on the allocated compute nodes. While a suspended job frees up processor resources, it may continue to consume swap and other resources. For more information on suspending jobs, see Suspend/Resume Handling.

If suspend based preemption is selected, then the signal used to initiate the job suspend may be specified by setting the SUSPENDSIG attribute of the RMCFG parameter.

By default, an active job takes priority over a suspended job regardless of the suspended job's priority. Set the CHECKSUSPENDEDJOBPRIORITY parameter to TRUE to prevent jobs from starting on nodes which contain suspended jobs of higher priority.
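
The suspend-related settings described above might appear together as in the following sketch. The resource manager name base and the signal number are placeholders chosen for illustration, not recommendations.

```
PREEMPTPOLICY SUSPEND
# 'base' is a placeholder resource manager name; 19 is an illustrative signal number
RMCFG[base] SUSPENDSIG=19
# Do not start jobs on nodes that hold suspended jobs of higher priority
CHECKSUSPENDEDJOBPRIORITY TRUE
```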

For a job to be suspended, it must be marked as suspendable. If not, it will be requeued or canceled. The job suspendable flag can be set when the job is submitted. Otherwise, this can be accomplished using the JOBFLAGS attribute of the associated class credential as in the following example:

CLASSCFG[low] JOBFLAGS=SUSPENDABLE

8.4.2.3 Job Checkpoint

Systems that support job checkpointing allow a job to save off its current state and either terminate or continue running. A checkpointed job may be restarted at any time and resume execution from its most recent checkpoint.

Checkpointing behavior can be tuned on a per resource manager basis by setting the CHECKPOINTSIG and CHECKPOINTTIMEOUT attributes of the RMCFG parameter.
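
A sketch of how these two attributes might be set for one resource manager. The resource manager name base, the signal name, and the timeout value are all assumptions made for illustration; the accepted value formats should be confirmed in the RMCFG reference.

```
# 'base' is a placeholder resource manager name
# Signal and timeout values are illustrative assumptions
RMCFG[base] CHECKPOINTSIG=SIGTERM CHECKPOINTTIMEOUT=00:05:00
```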

See Checkpoint/Restart Facilities for more information.

8.4.2.4 Job Cancel

Under this policy, active jobs are canceled.

8.4.2.5 RM Preemption Constraints

Moab is only able to use preemption if the underlying resource manager/OS combination supports this capability. The following table displays current preemption limitations:

Table 8.4.2.5 Resource Manager Preemption Constraints

Resource Manager	TORQUE 1.2+/OpenPBS 2.3+	PBSPro (5.2)	Loadleveler (3.1)	LSF (5.2)	SGE (5.3)
Cancel	yes	yes	yes	yes	???
Requeue	yes	yes	yes	yes	???
Suspend	yes	yes	yes	yes	???
Checkpoint	(yes on IRIX)	(yes on IRIX)	yes	(OS dependent)	???

8.4.3 Testing and Troubleshooting Preemption

Setting up a working preemption policy involves multiple steps. Most problems show up as Moab apparently not allowing preemptors to preempt preemptees as expected. To diagnose this, use the following checklist:

Are preemptor jobs marked with the PREEMPTOR flag (verify with checkjob <JOBID> | grep Flags)?
Are preemptee jobs marked with the PREEMPTEE flag (verify with checkjob <JOBID> | grep Flags)?
Is the start priority of the preemptor higher than the priority of the preemptee (verify with checkjob <JOBID> | grep Priority)?
Do the resources allocated to the preemptee match those requested by the preemptor?
Is the preemptor within the 32-preemptee limit?
Are any policies preventing preemption from occurring (verify with checkjob -v -n <NODEID> <JOBID>)?
Is the PREEMPTPOLICY parameter properly set?
Is the preemptee properly marked as restartable, suspendable, or checkpointable (verify with checkjob <JOBID> | grep Flags)?
Is the resource manager properly responding to preemption requests (use mdiag -R)?
If there is a resource manager level race condition, is Moab properly holding target resources (verify with mdiag -S and set RESERVATIONRETRYTIME if needed)?
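
If the last check in the list points to a race condition, the parameter it names is set in moab.cfg. A sketch with a hypothetical value follows; the HH:MM:SS format is an assumption.

```
# Illustrative value only; adjust to how long the resource manager needs to release resources
RESERVATIONRETRYTIME 00:01:00
```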
