SUSPEND is one of the PREEMPTPOLICY types (for more information, see PREEMPTPOLICY types). The SUSPEND attribute causes active jobs to stop executing, but to remain in memory on the allocated compute nodes.
For information about the PREEMPTEE and PREEMPTOR flags, see Preemption flags.
The following outlines some benefits of using SUSPEND, and also lists some things you should be aware of if you choose to use it.
Advantages:
Cautions:
You must mark a job as SUSPENDABLE if you want it to suspend. If you do not, the job will be requeued or canceled when it is preempted.
If supported by the resource manager, you can set the job SUSPENDABLE flag when submitting the job by using the msub -r option. Otherwise, use the JOBFLAGS attribute of the associated class or QoS credential, as in this example:
CLASSCFG[low] JOBFLAGS=SUSPENDABLE
For more information, see Job Flags.
To preempt jobs using SUSPEND
When you use SUSPEND, you must increase your JOBRETRYTIME. By default, JOBRETRYTIME is set to 60 seconds, but when you use SUSPEND, it is recommended that you increase the time to 300 seconds (5 minutes).
For example:
GUARANTEEDPREEMPTION TRUE
PREEMPTPOLICY SUSPEND

QOSCFG[test1] QFLAGS=PREEMPTEE JOBFLAGS=RESTARTABLE,SUSPENDABLE MEMBERULIST=john PRIORITY=100
QOSCFG[test2] QFLAGS=PREEMPTOR MEMBERULIST=john PRIORITY=10000
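The configuration above does not include the recommended JOBRETRYTIME increase. As a minimal sketch, assuming the value is given in seconds (as the 60-second default and 300-second recommendation suggest), the extra moab.cfg line might look like this:

```
# moab.cfg fragment (sketch): lengthen the job retry window when using SUSPEND.
# Assumption: the value is in seconds, so 300 corresponds to the recommended 5 minutes.
JOBRETRYTIME 300
```

Check the accepted time format against the parameter reference for your Moab version before relying on this.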
[john@g06]$ echo sleep 120 | msub -l procs=128,walltime=120 -l qos=test1
(Optional) Examine the output for showq:
Moab.7
[john@g06]# showq

active jobs------------------------
JOBID USERNAME STATE PROCS REMAINING STARTTIME

Moab.7 john Running 128 00:01:59 Thu Nov 10 12:28:44

1 active job 128 of 128 processors in use by local jobs (100.00%)
2 of 2 nodes active (100.00%)

eligible jobs----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME

0 eligible jobs

blocked jobs-----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME

0 blocked jobs

Total job: 1
[john@g06]$ echo sleep 120 | msub -l procs=128,walltime=120 -l qos=test2
(Optional) Examine the output for showq:
Moab.8
[john@g06]# showq

active jobs------------------------
JOBID USERNAME STATE PROCS REMAINING STARTTIME

Moab.7 john Suspended 128 00:01:56 Thu Nov 10 12:28:44
Moab.8 john Running 128 00:02:00 Thu Nov 10 12:28:48

2 active jobs 128 of 128 processors in use by local jobs (100.00%)
2 of 2 nodes active (100.00%)

eligible jobs----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME

0 eligible jobs

blocked jobs-----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME

0 blocked jobs

Total jobs: 2
Note that when a job is suspended, it stays in the output of showq. This is normal behavior for a suspended job. Moab should only suspend a job once.
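Because suspended jobs remain in the showq listing, you can pick them out by their STATE column. The following is a minimal sketch, not a Moab feature: list_suspended_jobs is a hypothetical helper name, and it assumes the column layout shown in the showq output above (JOBID first, STATE third).

```shell
#!/bin/sh
# Hypothetical helper: reads showq-style text on stdin and prints the
# JOBID of every row whose third column (STATE) is "Suspended".
# Assumes the column order JOBID USERNAME STATE ... shown in this topic.
list_suspended_jobs() {
    awk '$3 == "Suspended" { print $1 }'
}
```

Typical usage would be `showq | list_suspended_jobs`, which for the listing above would print the ID of the suspended job.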
[john@g06]$ checkjob Moab.9
job Moab.9

State: Suspended
Creds: user:john group:john qos:test1
WallTime: 00:00:02 of 00:02:00
SubmitTime: Thu Nov 10 12:36:29
(Time Queued Total: 00:00:07 Eligible: 00:00:00)

Total Requested Tasks: 128

Req[0] TaskCount: 128 Partition: licenses
NodeCount: 2

Allocated Nodes:
node[01-02]*64

IWD: /opt/native
SubmitDir: /opt/native
Executable: /opt/native/spool/moab.job.UFe8sQ

StartCount: 1
Flags: RESTARTABLE,SUSPENDABLE,PREEMPTEE,GLOBALQUEUE,PROCSPECIFIED
Attr: PREEMPTEE
StartPriority: 100
job cannot be resumed: preemption required but job is conditional preemptor with no targets
BLOCK MSG: non-idle state 'Running' (recorded at last scheduling iteration)
[john@g06]$ checkjob Moab.10
job Moab.10

State: Running
Creds: user:john group:john qos:test2
WallTime: 00:00:00 of 00:02:00
SubmitTime: Thu Nov 10 12:36:31
(Time Queued Total: 00:00:00 Eligible: 00:00:00)

StartTime: Thu Nov 10 12:36:31
Total Requested Tasks: 128

Req[0] TaskCount: 128 Partition: licenses

Allocated Nodes:
node[01-02]*64

IWD: /opt/native
SubmitDir: /opt/native
Executable: /opt/native/spool/moab.job.CZavjU

StartCount: 1
Flags: HASPREEMPTED,PREEMPTOR,GLOBALQUEUE,PROCSPECIFIED
StartPriority: 10000
Reservation 'Moab.10' (-00:00:07 -> 00:01:53 Duration: 00:02:00)
Occasionally, Moab keeps a job in a suspended state for a long period of time because it determines that the job cannot restart. For example, if a job was able to perform I/O before it was suspended but can no longer do so, Moab recognizes that the job is unable to start and leaves it suspended.