8.4.3.1 Using SUSPEND

The SUSPEND attribute for PREEMPTPOLICY causes active jobs to stop executing but to remain in memory on the allocated compute nodes.

Note

You must mark a job as SUSPENDABLE if you want Moab to suspend it. Otherwise, the job is requeued or canceled.

If supported by the resource manager, you can set the job SUSPENDABLE flag when submitting the job by using the msub -r option. Otherwise, use the JOBFLAGS attribute of the associated class or QoS credential, as in this example:

CLASSCFG[low] JOBFLAGS=SUSPENDABLE
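The same flag can be attached through a QoS credential instead of a class. The line below is a minimal sketch of that variant. The QoS name `low` is a hypothetical placeholder, and the syntax mirrors the QOSCFG lines used later in this section.

```
# Hypothetical QoS named "low"; jobs that run under it are marked SUSPENDABLE
QOSCFG[low] JOBFLAGS=SUSPENDABLE
```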

The following outlines some benefits of using SUSPEND and also lists some things you should be aware of if you choose to use it.

Advantages:

Cautions:

Note

When using SUSPEND, you must increase your JOBRETRYTIME. By default, JOBRETRYTIME is set to 60 seconds, but when you use SUSPEND, it is recommended that you increase the time to 300 seconds (5 minutes).
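As a sketch of the recommended setting, the moab.cfg line below raises the retry time to 5 minutes. It assumes the parameter accepts the [[[DD:]HH:]MM:]SS time format, so check the parameter reference for your Moab version before relying on it.

```
# Assumed time format HH:MM:SS; 00:05:00 equals the recommended 300 seconds
JOBRETRYTIME 00:05:00
```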

To use SUSPEND

The following steps explain and illustrate how to set up preemption with SUSPEND.

  1. Make the following configurations in the moab.cfg file:

    GUARANTEEDPREEMPTION TRUE
    PREEMPTPOLICY SUSPEND
    
    QOSCFG[test1] QFLAGS=PREEMPTEE JOBFLAGS=RESTARTABLE,SUSPENDABLE MEMBERULIST=john PRIORITY=100
    QOSCFG[test2] QFLAGS=PREEMPTOR MEMBERULIST=john PRIORITY=10000

  2. Submit a job to the PREEMPTEE QoS (test1). For example:

    [john@g06]$ echo sleep 120 | msub -l procs=128,walltime=120 -l qos=test1

    Examine the following output for showq:

    Moab.7
    [john@g06]# showq
     
    active jobs------------------------ 
    JOBID     USERNAME    STATE      PROCS    REMAINING    STARTTIME 
    Moab.7    john        Running    128      00:01:59     Thu Nov 10 12:28:44
     
    1 active job     128 of 128 processors in use by local jobs (100.00%) 
    2 of 2 nodes active (100.00%)
     
    eligible jobs---------------------- 
    JOBID     USERNAME    STATE      PROCS     WCLIMIT      QUEUETIME
     
    0 eligible jobs
     
    blocked jobs----------------------- 
    JOBID     USERNAME    STATE      PROCS     WCLIMIT      QUEUETIME
     
    0 blocked jobs
     
    Total job: 1
  3. Now submit a job to the PREEMPTOR QoS (test2). For example:

    [john@g06]$ echo sleep 120 | msub -l procs=128,walltime=120 -l qos=test2

    Examine the following output for showq:

    Moab.8
    [john@g06]# showq
     
    active jobs------------------------
    JOBID     USERNAME    STATE      PROCS    REMAINING    STARTTIME
    Moab.7    john        Suspended  128      00:01:56     Thu Nov 10 12:28:44
    Moab.8    john        Running    128      00:02:00     Thu Nov 10 12:28:48
     
    2 active jobs 128 of 128 processors in use by local jobs (100.00%)
    2 of 2 nodes active (100.00%)
     
    eligible jobs---------------------- 
    JOBID     USERNAME    STATE     PROCS      WCLIMIT       QUEUETIME
     
    0 eligible jobs
     
    blocked jobs----------------------- 
    JOBID     USERNAME    STATE     PROCS      WCLIMIT       QUEUETIME
     
    0 blocked jobs
     
    Total jobs: 2

    A suspended job remains in the showq output, as the example above shows. This is normal behavior for a suspended job. Moab should suspend a given job only once.

  4. Examine the checkjob output for these two jobs.

    checkjob test1:

    [john@g06]$ checkjob Moab.9 
    job Moab.9
     
    State: Suspended 
    Creds: user:john group:john qos:test1 
    WallTime: 00:00:02 of 00:02:00 
    SubmitTime: Thu Nov 10 12:36:29 
    (Time Queued Total: 00:00:07 Eligible: 00:00:00)
     
    Total Requested Tasks: 128
     
    Req[0] TaskCount: 128 Partition: licenses 
    NodeCount: 2
     
    Allocated Nodes: 
    node[01-02]*64
     
     
    IWD: /opt/native 
    SubmitDir: /opt/native 
    Executable: /opt/native/spool/moab.job.UFe8sQ
     
    StartCount: 1 
    Flags: RESTARTABLE,SUSPENDABLE,PREEMPTEE,GLOBALQUEUE,PROCSPECIFIED 
    Attr: PREEMPTEE 
    StartPriority: 100 
    job cannot be resumed: preemption required but job is conditional preemptor with no targets 
    BLOCK MSG: non-idle state 'Running' (recorded at last scheduling iteration)

    checkjob test2:

    [john@g06]$ checkjob Moab.10 
    job Moab.10
     
    State: Running 
    Creds: user:john group:john qos:test2 
    WallTime: 00:00:00 of 00:02:00 
    SubmitTime: Thu Nov 10 12:36:31 
    (Time Queued Total: 00:00:00 Eligible: 00:00:00)
     
    StartTime: Thu Nov 10 12:36:31 
    Total Requested Tasks: 128
     
    Req[0] TaskCount: 128 Partition: licenses
     
    Allocated Nodes: 
    node[01-02]*64
     
     
    IWD: /opt/native 
    SubmitDir: /opt/native 
    Executable: /opt/native/spool/moab.job.CZavjU
     
    StartCount: 1 
    Flags: HASPREEMPTED,PREEMPTOR,GLOBALQUEUE,PROCSPECIFIED 
    StartPriority: 10000 
    Reservation 'Moab.10' (-00:00:07 -> 00:01:53 Duration: 00:02:00)
Note

Very rarely, Moab keeps a job in a suspended state for a long period instead of resuming it, because it determines that the job cannot restart. For example, if a job could perform I/O before it was suspended but can no longer do so, Moab recognizes that the job is unable to start and leaves it suspended.
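To spot jobs that remain suspended, you can filter the active-jobs rows of showq by their STATE column. The following is a minimal shell sketch, not a Moab feature. The `list_suspended` function name is hypothetical, and the sample rows are copied from the showq output shown earlier in this section so that the snippet runs without a scheduler. It assumes the column layout matches that output (job ID first, state third).

```shell
# Print the job ID (column 1) of every row whose STATE column (3) is "Suspended".
list_suspended() {
  awk '$3 == "Suspended" { print $1 }'
}

# Sample rows taken from the showq example in this section.
sample_output='JOBID     USERNAME    STATE      PROCS    REMAINING    STARTTIME
Moab.7    john        Suspended  128      00:01:56     Thu Nov 10 12:28:44
Moab.8    john        Running    128      00:02:00     Thu Nov 10 12:28:48'

suspended_jobs=$(printf '%s\n' "$sample_output" | list_suspended)
echo "$suspended_jobs"
```

On a live system, the real command output would replace the sample text, for example `showq | list_suspended`. You can then pass each reported job ID to checkjob to see why the job has not resumed.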

Copyright © 2012 Adaptive Computing Enterprises, Inc.®