Moab Workload Manager

13.3 Resource Manager Extensions

Not all resource managers are created equal. Capabilities vary widely from system to system, and there is a large body of functionality that many, if not all, resource managers have no concept of. A good example is job QoS: because most resource managers have no concept of quality of service, they provide no mechanism for users to specify it. In many cases, Moab can add such capabilities at a global level. However, a number of features require a per-job specification. Resource manager extensions allow this information to be associated with the job.

13.3.1 Resource Manager Extension Specification

Specifying resource manager extensions varies by resource manager. TORQUE, OpenPBS, PBSPro, Loadleveler, LSF, S3, and Wiki each allow the specification of an extension field as described in the following table:

Resource Manager Specification Method

-l

> qsub -l nodes=3,qos=high sleepy.cmd

-W x=

> qsub -l nodes=3 -W x=qos:high sleepy.cmd
Note OpenPBS does not support this ability by default but can be patched as described in the PBS Resource Manager Extension Overview.

#@comment

#@nodes = 3
#@comment = qos:high

-ext

> bsub -ext advres:system.2

-l

> qsub -l advres=system.2
Note Use of PBSPro resources requires configuring the server_priv/resourcedef file to define the needed extensions as in the following example:
advres type=string
qos    type=string
sid    type=string
sjid   type=string

comment

comment=qos:high

13.3.2 Resource Manager Extension Values

Using the resource manager specific method, the following job extensions are currently available:

ADVRES
[<RSVID>]
---

Specifies that reserved resources are required to run the job. If <RSVID> is specified, then only resources within the specified reservation may be allocated (see Job to Reservation Binding).

You can request to not use a specific reservation by using !advres.

> qsub -l advres=grid.3
Resources for the job must come from grid.3
> qsub -l !advres=grid.5
Resources for the job must not come from grid.5
   
BANDWIDTH
<DOUBLE> (in MB/s)
---
Minimum available network bandwidth across allocated resources. (See Network Management.)
> bsub -ext bandwidth=120 chemjob.txt
   
DDISK
<INTEGER>
0
Dedicated disk per task in MB.
qsub -l ddisk=2000
   
DEADLINE
[[[DD:]HH:]MM:]SS
---
Relative completion deadline of job (from job submission time).
> qsub -l deadline=2:00:00,nodes=4 /tmp/bio3.cmd
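
The [[[DD:]HH:]MM:]SS format used by DEADLINE (and by the other time-valued extensions in this section) can be converted to seconds with a few lines of code. The Python sketch below is illustrative only: parse_duration is a hypothetical helper name, not part of Moab, and it assumes that omitted leading fields count as zero.

```python
def parse_duration(spec: str) -> int:
    """Convert a [[[DD:]HH:]MM:]SS string to a number of seconds."""
    fields = spec.split(":")
    if not 1 <= len(fields) <= 4:
        raise ValueError(f"expected 1 to 4 colon-separated fields: {spec!r}")
    # Fields are read right to left: seconds, minutes, hours, days.
    multipliers = (1, 60, 3600, 86400)
    total = 0
    for value, multiplier in zip(reversed(fields), multipliers):
        total += int(value) * multiplier
    return total


# The deadline from the example above, expressed in seconds.
print(parse_duration("2:00:00"))  # 7200
```

Under these assumptions, deadline=2:00:00 corresponds to a completion deadline two hours after job submission.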
   
DEPEND
[<DEPENDTYPE>:][{jobname|jobid}.]<ID>[:[{jobname|jobid}.]<ID>]...
---
Allows specification of job dependencies for compute or system jobs. If no ID prefix (jobname or jobid) is specified, the ID value is interpreted as a job ID.
# submit job which will run after job 1301 and 1304 complete
> msub -l depend=orion.1301:orion.1304 test.cmd

orion.1322

# submit jobname-based dependency job
> msub -l depend=jobname.data1005 dataetl.cmd

orion.1428
   
DMEM
<INTEGER>
0
Dedicated memory per task in bytes.
msub -l DMEM=20480
Moab will dedicate 20 MB of memory to the task.
   
EPILOGUE
<STRING>
---
Specifies a user owned epilogue script which is run before the system epilogue and epilogue.user scripts at the completion of a job. The syntax is epilogue=<file>. The file can be designated with an absolute or relative path.
Note This parameter works only with TORQUE.
msub -l epilogue=epilogue_script.sh job.sh  
   
EXCLUDENODES
{<nodeid>|<node_range>}[:...]
---
Specifies nodes that should not be considered for the given job.
msub -l excludenodes=k1:k2:k[5-8]
  
# Comma separated ranges work only with SLURM
msub -l excludenodes=k[1-2,5-8]
   
FEATURE
<FEATURE>[{:||}<FEATURE>]...
---
Required list of node attribute/node features.
Note If the pipe (|) character is used as a delimiter, the features are logically OR'd together and the associated job may use resources that match any of the specified features.
> qsub -l feature='fastos:bigio' testjob.cmd
   
GATTR
<STRING>
---
Generic job attribute associated with job.
> qsub -l gattr=bigjob
   
GEOMETRY
{(<TASKID>[,<TASKID>[,...]])[(<TASKID>[,...])...]}
---
Explicitly specified task geometry.
> qsub -l nodes=2:ppn=4 -W x=geometry:'{(0,1,4,5)(2,3,6,7)}' quanta2.cmd
The job quanta2.cmd runs tasks 0, 1, 4, and 5 on one node, while tasks 2, 3, 6, and 7 run on another node.
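
To make the geometry syntax concrete, the Python sketch below splits a geometry string into one list of task IDs per parenthesized group. It is an illustration rather than Moab's own parser; parse_geometry is a hypothetical name, and the sketch assumes that task IDs are plain integers.

```python
import re


def parse_geometry(spec: str) -> list[list[int]]:
    """Split '{(0,1,4,5)(2,3,6,7)}' into one list of task IDs per group."""
    spec = spec.strip()
    if not (spec.startswith("{") and spec.endswith("}")):
        raise ValueError(f"geometry must be wrapped in braces: {spec!r}")
    # Each parenthesized group holds the task IDs placed together.
    groups = re.findall(r"\(([^()]*)\)", spec[1:-1])
    return [[int(task) for task in group.split(",")] for group in groups]


for index, tasks in enumerate(parse_geometry("{(0,1,4,5)(2,3,6,7)}")):
    print(f"group {index}: tasks {tasks}")
```

For the example above, this yields two groups, matching the two nodes on which quanta2.cmd runs.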
   
GMETRIC
Generic metric requirement for allocated nodes, specified using the format <GMNAME>[:{lt:,le:,eq:,ge:,gt:,ne:}<VALUE>]
---
Indicates generic constraints that must be found on all allocated nodes. If a <VALUE> is not specified, the node must simply possess the generic metric. (See Generic Metrics for more information.)
> qsub -l gmetric=bioversion:ge:133244 testj.txt
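
The comparison semantics of a gmetric requirement can be sketched as follows. This Python example is a hedged illustration, not Moab code: node_satisfies and the node_metrics dictionary are hypothetical, and the sketch assumes that metric values compare numerically.

```python
import operator

# Comparison keywords from the GMETRIC format, mapped to Python operators.
COMPARATORS = {
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ge": operator.ge,
    "gt": operator.gt,
    "ne": operator.ne,
}


def node_satisfies(requirement: str, node_metrics: dict) -> bool:
    """Check one <GMNAME>[:<CMP>:<VALUE>] requirement against a node."""
    parts = requirement.split(":")
    name = parts[0]
    if name not in node_metrics:
        return False
    if len(parts) == 1:
        return True  # no value given: the node only has to possess the metric
    if len(parts) != 3:
        raise ValueError(f"expected <GMNAME>:<CMP>:<VALUE>: {requirement!r}")
    _, comparator, value = parts
    # Assumption: metric values are numeric.
    return COMPARATORS[comparator](float(node_metrics[name]), float(value))


print(node_satisfies("bioversion:ge:133244", {"bioversion": 133300}))  # True
```

Because the constraint must hold on all allocated nodes, a scheduler-side check would apply this test to each candidate node.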
   
GPUs
msub -l nodes=<VALUE>:ppn=<VALUE>:gpus=<VALUE>
Note Moab does not support requesting GPUs as a GRES. Submitting msub -l gres=gpus:x does not work.
---
Moab schedules GPUs as a special type of node-locked generic resource. When TORQUE reports GPUs to Moab, Moab can schedule jobs and assign GPUs correctly so that jobs are scheduled efficiently. To have Moab schedule GPUs, configure them in TORQUE and then submit jobs using the gpus attribute. Moab automatically parses the gpus attribute and assigns the GPUs accordingly.
> msub -l nodes=2:ppn=2:gpus=1
Submits a job that requests 2 tasks, 2 processors and 1 gpu per task (2 gpus total).
> msub -l nodes=2:ppn=2:gpus=1
Submits a job that requests 4 tasks, 2 tasks per node and 1 gpu per task (4 gpus total).
> msub -l nodes=4:gpus=1
Submits a job that requests 4 tasks, 1 processor and 1 gpu per task (4 gpus total).
> msub -l nodes=4:gpus=2+1:ppn=2,walltime=600
Submits a job that requests 2 different types of tasks: 4 tasks, each with 1 processor and 2 gpus, and 1 task with 2 processors.
   
GRES and SOFTWARE
Percent sign (%) delimited list of generic resources where each resource is specified using the format <RESTYPE>[{+|:}<COUNT>]
---
Indicates generic resources required by the job. If the generic resource is node-locked, it is a per-task count. If a <COUNT> is not specified, the resource count defaults to 1.
> qsub -W x=GRES:tape+2%matlab+3 testj.txt
Note When specifying more than one generic resource with -l, use the percent (%) character to delimit them.
> qsub -l gres=tape+2%matlab+3 testj.txt

> qsub -l software=matlab:2 testj.txt
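
The generic resource format, including the default count of 1, can be illustrated with a short Python sketch. The helper name parse_gres is hypothetical and the code is not taken from Moab; it only mirrors the <RESTYPE>[{+|:}<COUNT>] format described above.

```python
import re


def parse_gres(spec: str) -> dict[str, int]:
    """Parse a percent-delimited list such as 'tape+2%matlab+3'."""
    resources = {}
    for item in spec.split("%"):
        # The count may follow either '+' or ':'; it defaults to 1.
        parts = re.split(r"[+:]", item, maxsplit=1)
        name = parts[0]
        count = int(parts[1]) if len(parts) == 2 else 1
        resources[name] = count
    return resources


print(parse_gres("tape+2%matlab+3"))
print(parse_gres("matlab:2"))
print(parse_gres("tape"))
```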
   
HOSTLIST
'+' delimited list of hostnames; also, ranges and regular expressions
---
Indicates an exact set, superset, or subset of nodes on which the job must run.
Note Use the caret (^) or asterisk (*) characters to specify a host list as superset or subset respectively.
> msub -l hostlist=nodeA+nodeB+nodeE

hostlist=foo[1-5]
(foo1,foo2,...,foo5)
hostlist=foo1+foo[3-9]
(foo1,foo3,foo4,...,foo9)
hostlist=foo[1,3-9]
(same as previous example)
hostlist=foo[1-3]+bar[72-79]
(foo1,foo2,foo3,bar72,bar73,...,bar79)
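
The range notation shown above can be expanded mechanically. The Python sketch below handles only '+' delimited names and bracketed numeric ranges; it does not cover regular expressions or the caret (^) and asterisk (*) modifiers, and expand_hostlist is a hypothetical helper rather than a Moab function.

```python
import re


def expand_hostlist(spec: str) -> list[str]:
    """Expand 'foo1+foo[3-9]' style host lists into individual hostnames."""
    hosts = []
    for item in spec.split("+"):
        match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]", item)
        if match is None:
            hosts.append(item)  # plain hostname without a bracketed range
            continue
        prefix, ranges = match.groups()
        for part in ranges.split(","):
            # 'low-high' is an inclusive range; a bare number stands alone.
            low, _, high = part.partition("-")
            for index in range(int(low), int(high or low) + 1):
                hosts.append(f"{prefix}{index}")
    return hosts


print(expand_hostlist("foo[1-3]+bar[72-79]"))
# The two spellings below describe the same set of hosts.
print(expand_hostlist("foo1+foo[3-9]") == expand_hostlist("foo[1,3-9]"))  # True
```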
   
JGROUP
<JOBGROUPID>
---
ID of job group to which this job belongs (different from the GID of the user running the job).
> msub -l JGROUP=bluegroup
   
JOBFLAGS (aka FLAGS)
One or more colon-delimited job flags, including ADVRES[:RSVID], NOQUEUE, NORMSTART, PREEMPTEE, PREEMPTOR, RESTARTABLE, or SUSPENDABLE (see job flag overview for a complete listing)
---
Associates various flags with the job.
> qsub -l nodes=1,walltime=3600,jobflags=advres myjob.py
   
   
LOGLEVEL
<INTEGER>
---
Per job log verbosity.
> qsub -W x=loglevel:5 bw.cmd
Job events and analysis will be logged with level 5 verbosity.
   
MAXMEM
<INTEGER> (in megabytes)
---
Maximum amount of memory the job may consume across all tasks before the JOBMEM action is taken.
> qsub -W x=MAXMEM:1000mb bw.cmd
If a RESOURCELIMITPOLICY is set for per-job memory utilization, its action will be taken when this value is reached.
   
MAXPROC
<INTEGER>
---
Maximum CPU load the job may consume across all tasks before the JOBPROC action is taken.
> qsub -W x=MAXPROC:4 bw.cmd
If a RESOURCELIMITPOLICY is set for per-job processor utilization, its action will be taken when this value is reached.
   
MINPREEMPTTIME
[[[DD:]HH:]MM:]SS
---
Minimum time job must run before being eligible for preemption.
Note Can only be specified if associated QoS allows per-job preemption configuration by setting the preemptconfig flag.
> qsub -l minpreempttime=900 bw.cmd
Job cannot be preempted until it has run for 15 minutes.
   
MINPROCSPEED
<INTEGER>
0
Minimum processor speed (in MHz) for every node that this job will run on.
> qsub -W x=MINPROCSPEED:2000 bw.cmd
Every node that runs this job must have a processor speed of at least 2000 MHz.
   
MINWCLIMIT
[[[DD:]HH:]MM:]SS
1:00:00
Minimum wallclock limit job must run before being eligible for extension. (See JOBEXTENDDURATION or JOBEXTENDSTARTWALLTIME.)
> qsub -l minwclimit=300,walltime=16000 bw.cmd
Job will run for at least 300 seconds but up to 16,000 seconds if possible (without interfering with other jobs).
   
MSTAGEIN
[<SRCURL>[|<SRCURL>...]%]<DSTURL>
---
Indicates a job has data staging requirements. The source URL(s) listed will be transferred to the execution system for use by the job. If more than one source URL is specified, the destination URL must be a directory.

The format of <SRCURL> is: [PROTO://][HOST[:PORT]][/PATH] where the path is local.

The format of <DSTURL> is: [PROTO://][HOST[:PORT]][/PATH] where the path is remote.

PROTO can be any of the following protocols: ssh, file, or gsiftp.
HOST is the name of the host where the file resides.
PATH is the path of the source or destination file. The destination path may be a directory when sending a single file and must be a directory when sending multiple files. If a directory is specified, it must end with a forward slash (/).

Valid variables include:
$JOBID
$HOME - Path the script was run from
$RHOME - Home dir of the user on the remote system
$SUBMITHOST
$DEST - This is the Moab where the job will run
$LOCALDATASTAGEHEAD
Note If no destination is given, the protocol and file name will be set to the same as the source.
Note The $RHOME (remote home directory) variable is for when a user's home directory on the compute node is different than on the submission host.
> msub -W x='mstagein=file://$HOME/helperscript.sh|file:///home/dev/datafile.txt%ssh://host/home/dev/' script.sh
Copies datafile.txt and helperscript.sh from the local machine to /home/dev/ on host for use in the execution of script.sh. $HOME is a path with a leading / (for example, /home/adaptive).
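
The [<SRCURL>[|<SRCURL>...]%]<DSTURL> layout and the directory rule for multiple sources can be sketched in Python as follows. This is an assumption-laden illustration, not Moab's implementation: split_stage_spec is a hypothetical name, and the sketch assumes that the last percent sign (%) in the string separates the sources from the destination.

```python
def split_stage_spec(spec: str) -> tuple[list[str], str]:
    """Split a staging spec into its source URLs and its destination URL."""
    # Assumption: the last '%' separates the sources from the destination.
    sources_part, separator, destination = spec.rpartition("%")
    sources = sources_part.split("|") if separator else []
    # With more than one source, the destination must be a directory,
    # and a directory must end with a forward slash.
    if len(sources) > 1 and not destination.endswith("/"):
        raise ValueError("multiple sources require a destination ending in '/'")
    return sources, destination


spec = "file://$HOME/helperscript.sh|file:///home/dev/datafile.txt%ssh://host/home/dev/"
sources, destination = split_stage_spec(spec)
print(sources)
print(destination)
```

The same split applies to MSTAGEOUT, where the sources are remote and the destination is local.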
   
MSTAGEOUT
[<SRCURL>[|<SRCURL>...]%]<DSTURL>
---
Indicates a job has data staging requirements. The source URL(s) listed will be transferred from the execution system after the completion of the job. If more than one source URL is specified, the destination URL must be a directory.

The format of <SRCURL> is: [PROTO://][HOST[:PORT]][/PATH] where the path is remote.

The format of <DSTURL> is: [PROTO://][HOST[:PORT]][/PATH] where the path is local.

PROTO can be any of the following protocols: ssh, file, or gsiftp.
HOST is the name of the host where the file resides.
PATH is the path of the source or destination file. The destination path may be a directory when sending a single file and must be a directory when sending multiple files. If a directory is specified, it must end with a forward slash (/).

Valid variables include:
$JOBID
$HOME - Path the script was run from
$RHOME - Home dir of the user on the remote system
$SUBMITHOST
$DEST - This is the Moab where the job will run
$LOCALDATASTAGEHEAD
Note If no destination is given, the protocol and file name will be set to the same as the source.
Note The $RHOME (remote home directory) variable is for when a user's home directory on the compute node is different than on the submission host.
> msub -W x='mstageout=ssh://$DEST/$HOME/resultfile1.txt|ssh://host/home/dev/resultscript.sh%file:///home/dev/' script.sh
Copies resultfile1.txt and resultscript.sh from the execution system to /home/dev/ after the execution of script.sh is complete. $HOME is a path with a leading / (for example, /home/adaptive).
   
NACCESSPOLICY
one of SHARED, SINGLEJOB, SINGLETASK, SINGLEUSER, or UNIQUEUSER
---
Specifies how node resources should be accessed. (See Node Access Policies for more information.)
Note The naccesspolicy option can only be used to make node access more constraining than is specified by the system, partition, or node policies. For example, if the effective node access policy is shared, naccesspolicy can be set to singleuser; if the effective node access policy is singlejob, naccesspolicy can be set to singletask.
> qsub -l naccesspolicy=singleuser bw.cmd
> bsub -ext naccesspolicy=singleuser lancer.cmd
Job can only allocate free nodes or nodes running jobs by same user.
   
NALLOCPOLICY
one of the valid settings for the parameter NODEALLOCATIONPOLICY
---
Specifies how node resources should be selected and allocated to the job. (See Node Allocation Policies for more information.)
> qsub -l nallocpolicy=minresource bw.cmd
Job should use the minresource node allocation policy.
   
NCPUS
<INTEGER>
---
The number of processors in one task where a task cannot span nodes. If NCPUS is used, then the resource manager's SUBMITPOLICY should be set to NODECENTRIC to get correct behavior. -l ncpus=<#> is equivalent to -l nodes=1:ppn=<#> when JOBNODEMATCHPOLICY is set to EXACTNODE. NCPUS is used when submitting jobs to an SMP.
   
NMATCHPOLICY
one of the valid settings for the parameter JOBNODEMATCHPOLICY
---
Specifies how node resources should be selected and allocated to the job.
> qsub -l nodes=2 -W x=nmatchpolicy:exactnode bw.cmd
Job should use the EXACTNODE JOBNODEMATCHPOLICY.
   
NODESET
<SETTYPE>:<SETATTR>[:<SETLIST>]
---
Specifies nodeset constraints for job resource allocation. (See the NodeSet Overview for more information.)
> qsub -l nodeset=ONEOF:PROCSPEED:350:400:450 bw.cmd
   
NODESETCOUNT
<INTEGER>
---
Specifies how many node sets a job uses. See the Node Set Overview for more information.
> msub -l nodesetcount=2
   
NODESETDELAY
[[[DD:]HH:]MM:]SS
---
The maximum delay allowed when scheduling a job constrained by NODESETS until Moab discards the NODESET request and schedules the job normally.
> qsub -l nodesetdelay=300,walltime=16000 bw.cmd
   
NODESETISOPTIONAL
<BOOLEAN>
---
Specifies whether the nodeset constraint is optional. (See the NodeSet Overview for more information.)
Note Requires SCHEDCFG[] FLAGS=allowperjobnodesetisoptional.
> msub -l nodesetisoptional=true bw.cmd
   
OPSYS
<OperatingSystem>
---
Specifies the job's required operating system.
> qsub -l nodes=1,opsys=rh73 chem92.cmd
   
PARTITION
<STRING>[{,|:}<STRING>]...
---
Specifies the partition (or partitions) in which the job must run.
Note The job must have access to this partition based on system wide or credential based partition access lists.
> qsub -l nodes=1,partition=math:geology
The job must only run in the math partition or the geology partition.
   
PREF
[{feature|variable}:]<STRING>[:<STRING>]...
Note If neither feature nor variable is specified, feature is assumed.
---
Specifies which node features are preferred by the job and should be allocated if available. If preferred node criteria are specified, Moab favors the allocation of matching resources but is not bound to only consider these resources.
Note Preferences are not honored unless the node allocation policy is set to PRIORITY and the PREF priority component is set within the node's PRIORITYF attribute.
> qsub -l nodes=1,pref=bigmem

The job may run on any nodes but prefers to allocate nodes with the bigmem feature.
   
PROCS
<INTEGER>
---

Requests a specific number of processors for the job. Instead of determining how many nodes they need, users can decide how many processors they need, and Moab automatically requests the appropriate number of nodes from the RM. This also works with feature requests, such as procs=12[:feature1[:feature2[…]]].

Note Using this resource request overrides any other processor or node related request, such as nodes=4.
msub -l procs=32 myjob.pl

Moab will request as many nodes as are necessary to meet the 32-processor requirement for the job.
   
PROLOGUE
<STRING>
---
Specifies a user owned prologue script which will be run after the system prologue and prologue.user scripts at the beginning of a job. The syntax is prologue=<file>. The file can be designated with an absolute or relative path.
Note This parameter works only with TORQUE.
msub -l prologue=prologue_script.sh job.s

   
QoS
<STRING>
---
Requests the specified QoS for the job.
> qsub -l walltime=1000,qos=highprio biojob.cmd
   
QUEUEJOB

<BOOLEAN>
TRUE
Indicates whether or not the scheduler should queue the job if resources are not available to run the job immediately.
msub -l nodes=1,queuejob=false test.cmd
   
REQATTR
Required node attributes with version number support: <ATTRIBUTE>[{>=|>|<=|<|=}<VERSION>]
---
Indicates required node attributes.
> qsub -l reqattr=matlab=7.1 testj.txt
   
RESFAILPOLICY
one of CANCEL, HOLD, IGNORE, NOTIFY, or REQUEUE
---
Specifies the action to take on an executing job if one or more allocated nodes fail. This setting overrides the global value specified with the NODEALLOCRESFAILUREPOLICY parameter.
msub -l resfailpolicy=ignore
For this particular job, ignore node failures.
   
RMTYPE
<STRING>
---
One of the resource manager types currently available within the cluster or grid. Typically, this is one of PBS, LSF, LL, SGE, SLURM, BProc, and so forth.
msub -l rmtype=ll
Only run job on a Loadleveler destination resource manager.
   
SIGNAL
<INTEGER>[@<OFFSET>]
---
Specifies the pre-termination signal to be sent to a job prior to it reaching its walltime limit or being terminated by Moab. The optional offset value specifies how long before job termination the signal should be sent. By default, the pre-termination signal is sent one minute before a job is terminated.
> msub -l signal=32@120 bio45.cmd
   
SPRIORITY
<INTEGER>
0
Allows Moab administrators to set a system priority on a job (similar to setspri). This only works if the job submitter is an administrator.
> qsub -l nodes=16,spriority=100 job.cmd
   
TASKDISTPOLICY
RR or PACK
---
Allows users to specify task distribution policies on a per-job basis. (See Task Distribution Overview.)
> qsub -l nodes=16,taskdistpolicy=rr job.cmd
   
TEMPLATE
<STRING>
---
Specifies a job template to be used as a set template. The requested template must have SELECT=TRUE. (See Job Templates.)
> msub -l walltime=1000,nodes=16,template=biojob job.cmd
   
TERMTIME
<TIMESPEC>
0
Specifies the time at which Moab should cancel a queued or active job. (See Job Deadline Support.)
> msub -l nodes=10,walltime=600,termtime=12:00_Jun/14 job.cmd
   
TPN
<INTEGER>[+]
0
Tasks per node allowed on allocated hosts. If the plus (+) character is specified, the tasks per node value is interpreted as a minimum tasks per node constraint; otherwise it is interpreted as an exact tasks per node constraint.

Note on Differences between TPN and PPN:

There are two key differences between the following: (A) qsub -l nodes=12:ppn=3 and (B) qsub -l nodes=12,tpn=3

The first difference is that ppn is interpreted as the minimum required tasks per node, while tpn defaults to exact tasks per node. Case (B) executes the job with exactly 3 tasks on each allocated node, while case (A) executes the job with at least 3 tasks on each allocated node (for example, nodeA:4,nodeB:3,nodeC:5).

The second major difference is that nodes=X:ppn=Y actually requests X*Y tasks, whereas nodes=X,tpn=Y requests only X tasks.

> msub -l nodes=10,walltime=600,tpn=4 job.cmd
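
The two differences between ppn and tpn described above can be restated as a small Python sketch. The function names are hypothetical and the code only reproduces the arithmetic from this entry; it does not model how Moab actually places tasks.

```python
def requested_tasks(nodes: int, ppn: int = 1) -> int:
    """nodes=X:ppn=Y requests X*Y tasks; nodes=X,tpn=Y requests X tasks."""
    return nodes * ppn


def tpn_allows(tpn_spec: str, tasks_on_node: int) -> bool:
    """Check a per-node task count against a <INTEGER>[+] tpn value."""
    if tpn_spec.endswith("+"):
        # A trailing plus makes the value a minimum tasks per node.
        return tasks_on_node >= int(tpn_spec[:-1])
    # Without the plus, the value is an exact tasks per node constraint.
    return tasks_on_node == int(tpn_spec)


print(requested_tasks(12, ppn=3))  # case (A): 36 tasks
print(requested_tasks(12))         # case (B): 12 tasks
print(tpn_allows("3", 4))          # False: tpn=3 means exactly 3 per node
print(tpn_allows("3+", 4))         # True: tpn=3+ means at least 3 per node
```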
   
TRIG
<TRIGSPEC>
---
Adds trigger(s) to the job. (See the Trigger Specification Page for specific syntax.)
Note Job triggers can only be specified if allowed by the QoS flag trigger.
> qsub -l trig=start:exec@/tmp/email.sh job.cmd
   
TRL (Format 1)
<INTEGER>[@<INTEGER>][:<INTEGER>[@<INTEGER>]]...
0
Specifies alternate task requests with their optional walltimes.
> msub -l trl=2@500:4@250:8@125:16@62 job.cmd

or
> qsub -l trl=2:3:4
   
TRL (Format 2)
<INTEGER>-<INTEGER>
0
Specifies a range of task requests that require the same walltime.
> msub -l trl=32-64 job.cmd
Note For optimization purposes Moab does not perform an exhaustive search of all possible values but will at least do the beginning, the end, and 4 equally distributed choices in between.
   
TTC
<INTEGER>
0
Total tasks allowed across the number of hosts requested. TTC is supported in the Wiki resource manager for SLURM. Compressed output must be enabled in the moab.cfg file. (See SLURMFLAGS for more information). NODEACCESSPOLICY should be set to SINGLEJOB and JOBNODEMATCHPOLICY should be set to EXACTNODE in the moab.cfg file.
> msub -l nodes=10,walltime=600,ttc=20 job.cmd
Note In this example, assuming all the nodes are 8 processor nodes, the first allocated node will have 10 tasks, the next node will have 2 tasks, and the remaining 8 nodes will have 1 task each for a total task count of 20 tasks.
   
VAR
<ATTR>:<VALUE>
---
Adds a generic variable or variables to the job.
VAR=testvar1:testvalue1

Single variable

VAR=testvar1:testvalue1+testvar2:testvalue2+testvar3:testvalue3

Multiple variables

13.3.3 Resource Manager Extension Examples

If more than one extension is required in a given job, extensions can be concatenated with a semicolon separator using the format <ATTR>:<VALUE>[;<ATTR>:<VALUE>]...

Example 1

#@comment="HOSTLIST:node1,node2;QOS:special;SID:silverA"

Job must run on nodes node1 and node2 using the QoS special. The job is also associated with the system ID silverA allowing the silver daemon to monitor and control the job.

Example 2

#PBS -W x="NODESET:ONEOF:NETWORK;DMEM:64"

Job will have resources allocated subject to network based nodeset constraints. Further, each task will dedicate 64 MB of memory.

Example 3

> qsub -l nodes=4,walltime=1:00:00 -W x="FLAGS:ADVRES:john.1"

Job will be forced to run within the john.1 reservation.
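
The concatenation format used in these examples, <ATTR>:<VALUE>[;<ATTR>:<VALUE>]..., is simple to generate from a script. The Python sketch below is one possible way to build such a string; build_extension_string is a hypothetical helper, and the sketch assumes that values contain no semicolons and need no escaping.

```python
def build_extension_string(extensions: dict[str, str]) -> str:
    """Join extensions as <ATTR>:<VALUE>[;<ATTR>:<VALUE>]..."""
    return ";".join(f"{attr}:{value}" for attr, value in extensions.items())


# Reproduces the extension string from Example 2.
extension = build_extension_string({"NODESET": "ONEOF:NETWORK", "DMEM": "64"})
print(f'-W x="{extension}"')  # -W x="NODESET:ONEOF:NETWORK;DMEM:64"
```

Quoting the result, as in the examples above, keeps the shell from treating the semicolon as a command separator.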

See Also