Maui Scheduler

Wiki Interface Specification, version 1.1


COMMANDS:

All commands are requested via a socket interface, one command per socket connection. All fields and values are specified in ASCII text. Maui is configured to communicate via the wiki interface by specifying the following parameters in the maui.cfg file:

RMTYPE[X] WIKI
RMSERVER[X] <HOSTNAME>
RMPORT[X] <PORTNUMBER>
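As a rough illustration of the one-command-per-connection model, the Python sketch below opens a connection, sends a single command, and reads the reply. The function names exchange and wiki_request are hypothetical. The framing is also an assumption: this specification does not say how the end of a request or reply is marked, so the sketch half-closes the socket after sending and reads until the peer closes. A real resource manager may frame messages differently.

```python
import socket


def exchange(sock: socket.socket, command: str) -> str:
    """Send one command on an already connected socket and return the reply."""
    sock.sendall(command.encode("ascii"))
    # Assumption: half-closing the write side marks the end of the request.
    sock.shutdown(socket.SHUT_WR)
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:  # peer closed its side, so the reply is complete
            break
        chunks.append(data)
    return b"".join(chunks).decode("ascii")


def wiki_request(host: str, port: int, command: str, timeout: float = 10.0) -> str:
    """Open a fresh connection for a single command and return the reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return exchange(sock, command)
```

Here host and port would correspond to the RMSERVER and RMPORT values configured above; each call to wiki_request uses its own connection.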

Field values must backslash escape the following characters if specified:

'#' ';' ':' (e.g., '\#')
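The escaping rule can be sketched in Python as follows. The helper names escape_value and unescape_value are hypothetical, and the sketch assumes that a literal backslash is passed through unchanged, since this specification does not say how backslashes themselves are treated.

```python
SPECIAL = "#;:"  # characters that must be backslash escaped in field values


def escape_value(value: str) -> str:
    """Prefix each special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in value)


def unescape_value(value: str) -> str:
    """Reverse escape_value; other backslashes are left untouched (assumption)."""
    out = []
    i = 0
    while i < len(value):
        if value[i] == "\\" and i + 1 < len(value) and value[i + 1] in SPECIAL:
            out.append(value[i + 1])  # drop the backslash, keep the character
            i += 2
        else:
            out.append(value[i])
            i += 1
    return "".join(out)
```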

Supported Commands are:

GETNODES, GETJOBS, STARTJOB, CANCELJOB, SUSPENDJOB, RESUMEJOB, JOBADDTASK, JOBRELEASETASK



GetNodes

send

CMD=GETNODES ARG={<UPDATETIME>:<NODEID>[:<NODEID>]... | <UPDATETIME>:ALL}

Only nodes updated more recently than <UPDATETIME> are returned, where <UPDATETIME> is given as an epoch time. Setting <UPDATETIME> to '0' returns information for all nodes. Specify a colon delimited list of NODEIDs if specific nodes are desired, or use the keyword 'ALL' to receive information for all nodes.
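A minimal Python sketch of composing this request is shown below. The function name build_getnodes is hypothetical, and the sketch assumes node IDs contain no characters that need escaping.

```python
def build_getnodes(update_time: int = 0, node_ids=None) -> str:
    """Build a GETNODES command for specific nodes, or for ALL when none are given."""
    target = ":".join(node_ids) if node_ids else "ALL"
    return f"CMD=GETNODES ARG={update_time}:{target}"
```

Calling build_getnodes(0, ["node001", "node002"]) asks for two specific nodes regardless of update time, while build_getnodes(963004212) asks for every node updated after that epoch time.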

receive

SC=<STATUSCODE> ARG=<NODECOUNT>#<NODEID>:<FIELD>=<VALUE>;[<FIELD>=<VALUE>;]...[#<NODEID>:<FIELD>=<VALUE>;[<FIELD>=<VALUE>;]...]...

or

SC=<STATUSCODE> RESPONSE=<RESPONSE>

STATUSCODE Values:

0 SUCCESS
-1 INTERNAL ERROR

FIELD is either the text name listed below or 'A<FIELDNUM>' (e.g., 'UPDATETIME' or 'A2')

RESPONSE is a status-code-specific message describing error or state details

EXAMPLE:

send 'CMD=GETNODES ARG=0:node001:node002:node003'

receive 'SC=0 ARG=3#node001:UPDATETIME=963004212;STATE=Busy;OS=AIX43;ARCH=RS6000...'
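The reply format can be parsed roughly as sketched below in Python. The function name parse_reply and its return shape are hypothetical. The sketch treats a backslash-escaped delimiter as part of the value, leaves values in their escaped form, and ignores the declared record count; those choices are assumptions, not requirements of this specification.

```python
import re


def parse_reply(reply: str):
    """Return (status_code, records, message) for a GETNODES style reply.

    records maps each object ID to a dict of FIELD -> VALUE strings.
    """
    match = re.match(r"SC=(-?\d+) (ARG|RESPONSE)=(.*)$", reply, re.DOTALL)
    if match is None:
        raise ValueError("unrecognized reply: %r" % reply)
    code, kind, body = int(match.group(1)), match.group(2), match.group(3)
    if kind == "RESPONSE":
        return code, {}, body
    # Split on '#' not preceded by a backslash; the first piece is the count.
    _count, *chunks = re.split(r"(?<!\\)#", body)
    records = {}
    for chunk in chunks:
        # The first unescaped ':' separates the object ID from its fields.
        parts = re.split(r"(?<!\\):", chunk, maxsplit=1)
        object_id = parts[0]
        fields_text = parts[1] if len(parts) > 1 else ""
        fields = {}
        for item in re.split(r"(?<!\\);", fields_text):
            if item:
                name, _, value = item.partition("=")
                fields[name] = value
        records[object_id] = fields
    return code, records, ""
```

Because GETJOBS replies share the same layout, the same sketch applies to them with job IDs in place of node IDs.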

Field Values
INDEX | NAME | FORMAT | DEFAULT | DESCRIPTION
1 | UPDATETIME* | <EPOCHTIME> | 0 | time node information was last updated
2 | STATE* | one of the following: Idle, Running, Busy, Unknown, Draining, or Down | Down | state of the node
3 | OS | <STRING> | [NONE] | operating system running on node
4 | ARCH | <STRING> | [NONE] | compute architecture of node
5 | CMEMORY | <INTEGER> | 0 | configured RAM on node (in MB)
6 | AMEMORY | <INTEGER> | 0 | available/free RAM on node (in MB)
7 | CSWAP | <INTEGER> | 0 | configured swap on node (in MB)
8 | ASWAP | <INTEGER> | 0 | available swap on node (in MB)
9 | CDISK | <INTEGER> | 0 | configured local disk on node (in MB)
10 | ADISK | <INTEGER> | 0 | available local disk on node (in MB)
11 | CPROC | <INTEGER> | 1 | configured processors on node
12 | APROC | <INTEGER> | 1 | available processors on node
13 | CNET | one or more colon delimited <STRING>s (e.g., ETHER:FDDI:ATM) | [NONE] | configured network interfaces on node
14 | ANET | one or more colon delimited <STRING>s (e.g., ETHER:ATM) | [NONE] | available network interfaces on node. Available interfaces are those which are 'up' and not already dedicated to a job.
15 | CPULOAD | <DOUBLE> | 0.0 | one minute BSD load average
16 | CCLASS | one or more bracket enclosed <NAME>:<COUNT> pairs (e.g., [batch:5][sge:3]) | [NONE] | run classes supported by node. Typically, one class is 'consumed' per task. Thus, an 8 processor node may have 8 instances of each class it supports present, e.g., [batch:8][interactive:8]
17 | ACLASS | one or more bracket enclosed <NAME>:<COUNT> pairs (e.g., [batch:5][sge:3]) | [NONE] | run classes currently available on node. If not specified, scheduler will attempt to determine actual ACLASS value.
18 | FEATURE | one or more colon delimited <STRING>s (e.g., WIDE:HSM) | [NONE] | generic attributes, often describing hardware or software features, associated with the node
19 | PARTITION | <STRING> | DEFAULT | partition to which node belongs
20 | EVENT | <STRING> | [NONE] | event or exception which occurred on the node
21 | CURRENTTASK | <INTEGER> | 0 | number of tasks currently active on the node
22 | MAXTASK | <INTEGER> | <CPROC> | maximum number of tasks allowed on the node at any given time
23 | SPEED | <DOUBLE> | 1.0 | relative processor speed of the node
24 | FRAME | <INTEGER> | 0 | frame location of the node
25 | SLOT | <INTEGER> | 0 | slot location of the node
26 | CRES | one or more colon delimited <NAME>,<VALUE> pairs (e.g., MATLAB,6:COMPILER,100) | [NONE] | arbitrary consumable resources supported and tracked on the node, e.g., software licenses or tape drives
27 | ARES | one or more colon delimited <NAME>,<VALUE> pairs (e.g., MATLAB,6:COMPILER,100) | [NONE] | arbitrary consumable resources currently available on the node

* indicates required field
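Since a node field may arrive either by name or as 'A<FIELDNUM>', a client will usually want to normalize the two forms. The Python sketch below shows one way to do that and to check the two required fields; the names NODE_FIELDS, canonical_field, and normalize_node are hypothetical, and raising an error on unknown fields is a design choice of the sketch rather than something this specification mandates.

```python
# Node field names in table order; position N (1-based) corresponds to 'A<N>'.
NODE_FIELDS = [
    "UPDATETIME", "STATE", "OS", "ARCH", "CMEMORY", "AMEMORY", "CSWAP",
    "ASWAP", "CDISK", "ADISK", "CPROC", "APROC", "CNET", "ANET", "CPULOAD",
    "CCLASS", "ACLASS", "FEATURE", "PARTITION", "EVENT", "CURRENTTASK",
    "MAXTASK", "SPEED", "FRAME", "SLOT", "CRES", "ARES",
]
REQUIRED_NODE_FIELDS = {"UPDATETIME", "STATE"}


def canonical_field(name: str) -> str:
    """Map 'A<FIELDNUM>' or a text name to the text name from the table."""
    if name.startswith("A") and name[1:].isdigit():
        index = int(name[1:])
        if 1 <= index <= len(NODE_FIELDS):
            return NODE_FIELDS[index - 1]
        raise ValueError("field index out of range: " + name)
    if name in NODE_FIELDS:
        return name
    raise ValueError("unknown node field: " + name)


def normalize_node(fields: dict) -> dict:
    """Return the record keyed by text names, checking required fields."""
    record = {canonical_field(key): value for key, value in fields.items()}
    missing = REQUIRED_NODE_FIELDS - record.keys()
    if missing:
        raise ValueError("missing required fields: " + ", ".join(sorted(missing)))
    return record
```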

NOTE 1: node states have the following definitions:
Idle: Node is ready to run jobs but currently is not running any.
Running: Node is running some jobs and will accept additional jobs
Busy: Node is running some jobs and will not accept additional jobs
Unknown: Node is capable of running jobs but the scheduler will need to determine if the node state is actually Idle, Running, or Busy.
Draining: Node is responding but will not accept new jobs
Down: Resource Manager problems have been detected. Node is incapable of running jobs.
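The state definitions above can be captured in a small lookup, sketched here in Python. The names NodeState and may_accept_jobs are hypothetical; returning None for Unknown reflects that the scheduler still has to determine the real state.

```python
from enum import Enum
from typing import Optional


class NodeState(Enum):
    IDLE = "Idle"
    RUNNING = "Running"
    BUSY = "Busy"
    UNKNOWN = "Unknown"
    DRAINING = "Draining"
    DOWN = "Down"


def may_accept_jobs(state: NodeState) -> Optional[bool]:
    """True if the node accepts new jobs, False if not, None if undetermined."""
    if state in (NodeState.IDLE, NodeState.RUNNING):
        return True
    if state is NodeState.UNKNOWN:
        return None  # scheduler must work out whether it is Idle, Running, or Busy
    return False  # Busy, Draining, and Down take no additional jobs
```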



GetJobs

send

CMD=GETJOBS ARG={<UPDATETIME>:<JOBID>[:<JOBID>]... | <UPDATETIME>:ALL }

Only jobs updated more recently than <UPDATETIME> are returned, where <UPDATETIME> is given as an epoch time. Setting <UPDATETIME> to '0' returns information for all jobs. Specify a colon delimited list of JOBIDs if information for specific jobs is desired, or use the keyword 'ALL' to receive information about all jobs.

receive

SC=<STATUSCODE> ARG=<JOBCOUNT>#<JOBID>:<FIELD>=<VALUE>;[<FIELD>=<VALUE>;]...[#<JOBID>:<FIELD>=<VALUE>;[<FIELD>=<VALUE>;]...]...

or

SC=<STATUSCODE> RESPONSE=<RESPONSE>

FIELD is either the text name listed below or 'A<FIELDNUM>' (e.g., 'UPDATETIME' or 'A2')

STATUSCODE values:

0 SUCCESS
-1 INTERNAL ERROR

RESPONSE is a status-code-specific message describing error or state details

EXAMPLE:

send 'CMD=GETJOBS ARG=0:ALL'

receive 'SC=0 ARG=2#nebo3001.0:UPDATETIME=9780000320;STATE=Idle;WCLIMIT=3600;...'

Table of Job Field Values
INDEX | NAME | FORMAT | DEFAULT | DESCRIPTION
1 | UPDATETIME* | <EPOCHTIME> | 0 | time job was last updated
2 | STATE* | one of Idle, Running, Hold, Suspended, Completed, or Cancelled | Idle | state of job
3 | WCLIMIT* | <INTEGER> | 864000 | seconds of wall time required by job
4 | TASKS* | <INTEGER> | 1 | number of tasks required by job
5 | NODES | <INTEGER> | 1 | number of nodes required by job
6 | GEOMETRY | <STRING> | [NONE] | string describing task geometry required by job
7 | QUEUETIME* | <EPOCHTIME> | 0 | time job was submitted to resource manager
8 | STARTDATE | <EPOCHTIME> | 0 | earliest time job should be allowed to start
9 | STARTTIME* | <EPOCHTIME> | 0 | time job was started by the resource manager
10 | COMPLETIONTIME* | <EPOCHTIME> | 0 | time job completed execution
11 | UNAME* | <STRING> | [NONE] | UserID under which job will run
12 | GNAME* | <STRING> | [NONE] | GroupID under which job will run
13 | ACCOUNT | <STRING> | [NONE] | AccountID associated with job
14 | RFEATURES | colon delimited list of <STRING>s | [NONE] | list of features required on nodes
15 | RNETWORK | <STRING> | [NONE] | network adapter required by job
16 | DNETWORK | <STRING> | [NONE] | network adapter which must be dedicated to job
17 | RCLASS | list of bracket enclosed <STRING>:<INTEGER> pairs | [NONE] | list of <CLASSNAME>:<COUNT> pairs indicating type and number of class instances required per task (e.g., '[batch:1]' or '[batch:2][tape:1]')
18 | ROPSYS | <STRING> | [NONE] | operating system required by job
19 | RARCH | <STRING> | [NONE] | architecture required by job
20 | RMEM | <INTEGER> | 0 | real memory (RAM, in MB) required to be configured on nodes allocated to the job
21 | RMEMCMP | one of '>=', '>', '==', '<', or '<=' | >= | real memory comparison (e.g., node must have >= 512 MB RAM)
22 | DMEM | <INTEGER> | 0 | quantity of real memory (RAM, in MB) which must be dedicated to each task of the job
23 | RDISK | <INTEGER> | 0 | local disk space (in MB) required to be configured on nodes allocated to the job
24 | RDISKCMP | one of '>=', '>', '==', '<', or '<=' | >= | local disk comparison (e.g., node must have > 2048 MB local disk)
25 | DDISK | <INTEGER> | 0 | quantity of local disk space (in MB) which must be dedicated to each task of the job
26 | RSWAP | <INTEGER> | 0 | virtual memory (swap, in MB) required to be configured on nodes allocated to the job
27 | RSWAPCMP | one of '>=', '>', '==', '<', or '<=' | >= | virtual memory comparison (e.g., node must have == 4096 MB virtual memory)
28 | DSWAP | <INTEGER> | 0 | quantity of virtual memory (swap, in MB) which must be dedicated to each task of the job
29 | PARTITIONMASK | one or more colon delimited <STRING>s | [ANY] | list of partitions in which job can run
30 | EXEC | <STRING> | [NONE] | job executable command
31 | IWD | <STRING> | [NONE] | job's initial working directory
32 | COMMENT | <STRING> | 0 | general job attributes not described by other fields
33 | REJCOUNT | <INTEGER> | 0 | number of times job was rejected
34 | REJMESSAGE | <STRING> | [NONE] | text description of reason job was rejected
35 | REJCODE | <INTEGER> | 0 | reason job was rejected
36 | EVENT | <EVENT> | [NONE] | event or exception experienced by job
37 | TASKLIST | one or more colon delimited <STRING>s | [NONE] | nodeid associated with each active task of job (e.g., cl01:cl02:cl01:cl02:cl03)
38 | TASKPERNODE | <INTEGER> | 0 | exact number of tasks required per node
39 | QOS | <INTEGER> | 0 | quality of service requested
40 | ENDDATE | <EPOCHTIME> | [ANY] | time by which job must complete
41 | CBSERVER | <STRING>[:<INTEGER>] | [NONE] | location of server which will handle callback requests in <HOSTNAME>:<PORT> format
42 | CBTYPE | one or more of the following delimited by colons: CANCEL and START | START:CANCEL | list of callback types requested by job
43 | DPROCS | <INTEGER> | 1 | number of processors dedicated per task
44 | SUSPENDTIME | <INTEGER> | 0 | number of seconds job has been suspended
45 | RESERVATION | <STRING> | [NONE] | name of reservation in which job must run

* indicates required field
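The bracket enclosed <NAME>:<COUNT> format used by RCLASS (and by CCLASS and ACLASS in the node fields) can be decoded as sketched below in Python. The function name parse_class_list is hypothetical, and treating an empty string or the literal '[NONE]' as "no classes" is an assumption about how the default is transmitted.

```python
import re


def parse_class_list(text: str) -> dict:
    """Turn '[batch:2][tape:1]' into {'batch': 2, 'tape': 1}."""
    if text in ("", "[NONE]"):  # assumption: both mean no classes
        return {}
    # Require the whole string to be a sequence of [name:count] groups.
    if re.fullmatch(r"(\[[^\[\]:]+:\d+\])+", text) is None:
        raise ValueError("malformed class list: %r" % text)
    pairs = re.findall(r"\[([^\[\]:]+):(\d+)\]", text)
    return {name: int(count) for name, count in pairs}
```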

NOTE 1: job states have the following definitions:
Idle: job is ready to run
Running: job is currently executing
Hold: job is in the queue but is not allowed to run
Suspended: job has started but execution has temporarily been suspended
Completed: job has completed
Cancelled: job has been cancelled

NOTE 2: completed and cancelled jobs should be maintained by the resource manager for a brief time, perhaps 1 to 5 minutes, before being purged. This provides the scheduler time to obtain all final job state information for scheduler statistics.
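On the resource manager side, the retention suggested in NOTE 2 might be implemented along the lines of the Python sketch below. The function name purge_finished_jobs and the in-memory dict of jobs are hypothetical. The 300 second default is taken from the upper end of the suggested range, and the sketch assumes COMPLETIONTIME is also set when a job is cancelled.

```python
FINAL_STATES = {"Completed", "Cancelled"}


def purge_finished_jobs(jobs: dict, now: int, retention_seconds: int = 300) -> dict:
    """Return the jobs to keep, dropping finished jobs older than the retention.

    jobs maps job ID to a dict holding at least STATE and COMPLETIONTIME.
    """
    kept = {}
    for job_id, job in jobs.items():
        finished = job["STATE"] in FINAL_STATES
        age = now - int(job.get("COMPLETIONTIME", 0))
        if finished and age >= retention_seconds:
            continue  # the scheduler has had time to collect final state
        kept[job_id] = job
    return kept
```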


StartJob

The 'StartJob' command may only be applied to jobs in the 'Idle' state. It causes the job to begin running using the resources listed in the TASKLIST argument.

send CMD=STARTJOB ARG=<JOBID> TASKLIST=<NODEID>[:<NODEID>]...

receive SC=<STATUSCODE> RESPONSE=<RESPONSE>

STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message possibly further describing an error or state

EXAMPLE:

Start job nebo.1 on nodes cluster001 and cluster002

send 'CMD=STARTJOB ARG=nebo.1 TASKLIST=cluster001:cluster002'

receive 'SC=0 RESPONSE=job nebo.1 started with 2 tasks'
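A scheduler-side helper for composing this command could look like the Python sketch below. The function name build_startjob is hypothetical. The sketch assumes that TASKLIST carries one node entry per task, so a node hosting several tasks is repeated, as in the TASKLIST job field; this specification does not state that explicitly for STARTJOB.

```python
def build_startjob(job_id: str, tasks_per_node: dict) -> str:
    """Build a STARTJOB command from a mapping of node ID to task count."""
    entries = []
    for node_id, count in tasks_per_node.items():
        if count < 1:
            raise ValueError("task count must be positive for " + node_id)
        entries.extend([node_id] * count)  # assumption: one entry per task
    if not entries:
        raise ValueError("at least one task is required")
    return f"CMD=STARTJOB ARG={job_id} TASKLIST={':'.join(entries)}"
```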



CancelJob

The 'CancelJob' command, if applied to an active job, will terminate its execution. If applied to an idle or active job, the CancelJob command will change the job's state to 'Cancelled'.

send CMD=CANCELJOB ARG=<JOBID> TYPE=<CANCELTYPE>

<CANCELTYPE> is one of the following:

ADMIN (command initiated by scheduler administrator)
WALLCLOCK (command initiated by scheduler because job exceeded its specified wallclock limit)

receive SC=<STATUSCODE> RESPONSE=<RESPONSE>

STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message further describing an error or state

EXAMPLE:

Cancel job nebo.2

send 'CMD=CANCELJOB ARG=nebo.2 TYPE=ADMIN'

receive 'SC=0 RESPONSE=job nebo.2 cancelled'
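The request and the status reply can be handled as in the following Python sketch. The names build_canceljob and parse_status are hypothetical; the success test follows the STATUSCODE rule given above.

```python
import re

CANCEL_TYPES = {"ADMIN", "WALLCLOCK"}


def build_canceljob(job_id: str, cancel_type: str = "ADMIN") -> str:
    """Build a CANCELJOB command, rejecting unknown cancel types."""
    if cancel_type not in CANCEL_TYPES:
        raise ValueError("unknown cancel type: " + cancel_type)
    return f"CMD=CANCELJOB ARG={job_id} TYPE={cancel_type}"


def parse_status(reply: str):
    """Return (succeeded, status_code, message) for an SC/RESPONSE reply."""
    match = re.fullmatch(r"SC=(-?\d+) RESPONSE=(.*)", reply, re.DOTALL)
    if match is None:
        raise ValueError("unrecognized reply: %r" % reply)
    code = int(match.group(1))
    return code >= 0, code, match.group(2)
```

The same parse_status sketch applies to the replies of the other job control commands, which share the SC/RESPONSE layout.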



SuspendJob

The 'SuspendJob' command can only be issued against a job in the state 'Running'. This command suspends job execution and results in the job changing to the 'Suspended' state.

send CMD=SUSPENDJOB ARG=<JOBID>

receive SC=<STATUSCODE> RESPONSE=<RESPONSE>

STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message possibly further describing an error or state

EXAMPLE:

Suspend job nebo.3

send 'CMD=SUSPENDJOB ARG=nebo.3'

receive 'SC=0 RESPONSE=job nebo.3 suspended'



ResumeJob

The 'ResumeJob' command can only be issued against a job in the state 'Suspended'. This command resumes a suspended job returning it to the 'Running' state.

send CMD=RESUMEJOB ARG=<JOBID>

receive SC=<STATUSCODE> RESPONSE=<RESPONSE>

STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message further describing an error or state

EXAMPLE:

Resume job nebo.3

send 'CMD=RESUMEJOB ARG=nebo.3'

receive 'SC=0 RESPONSE=job nebo.3 resumed'
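The state preconditions described for StartJob, CancelJob, SuspendJob, and ResumeJob can be collected into one transition table, sketched here in Python. The names TRANSITIONS and apply_command are hypothetical, and the sketch assumes that an "active" job means one in the Running or Suspended state, which this specification does not define precisely.

```python
# command -> (states in which it is allowed, resulting state)
TRANSITIONS = {
    "STARTJOB": ({"Idle"}, "Running"),
    "SUSPENDJOB": ({"Running"}, "Suspended"),
    "RESUMEJOB": ({"Suspended"}, "Running"),
    # Assumption: "active" covers both Running and Suspended jobs.
    "CANCELJOB": ({"Idle", "Running", "Suspended"}, "Cancelled"),
}


def apply_command(state: str, command: str) -> str:
    """Return the job state after a command, or raise if it is not allowed."""
    allowed, new_state = TRANSITIONS[command]
    if state not in allowed:
        raise ValueError(f"{command} is not valid for a job in state {state}")
    return new_state
```

A resource manager could use such a table to reject a request with a negative STATUSCODE instead of raising an exception.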



JobAddTask

The 'JobAddTask' command allocates additional tasks to an active job.

send

CMD=JOBADDTASK ARG=<JOBID> <NODEID> [<NODEID>]...

receive

SC=<STATUSCODE> RESPONSE=<RESPONSE>

STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message possibly further describing an error or state

EXAMPLE:

Add 3 default tasks to job nebo30023.0 using resources located on nodes cluster002, cluster016, and cluster112.

send 'CMD=JOBADDTASK ARG=nebo30023.0 DEFAULT cluster002 cluster016 cluster112'

receive 'SC=0 RESPONSE=3 tasks added'



JobReleaseTask

The 'JobReleaseTask' command removes tasks from an active job.

send

CMD=JOBREMOVETASK ARG=<JOBID> <TASKID> [<TASKID>]...

receive

SC=<STATUSCODE> RESPONSE=<RESPONSE>

STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message further describing an error or state

EXAMPLE:

Free resources allocated to tasks 14, 15, and 16 of job nebo30023.0

send 'CMD=JOBREMOVETASK ARG=nebo30023.0 14 15 16'

receive 'SC=0 RESPONSE=3 tasks removed'