Appendix W: Wiki Interface Specification, version 1.2

W.1.1 COMMANDS

All commands are requested via a socket interface, one command per socket connection. All fields and values are specified in ASCII text. Moab is configured to communicate via the wiki interface by specifying the following parameters in the moab.cfg file:
moab.cfg
RMCFG[base] TYPE=WIKI SERVER=<HOSTNAME>[:<PORT>]
...

Field values must backslash escape the following characters if specified:

'#' ';' ':' (e.g., '\#')
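The escaping rule above can be sketched as a pair of helpers. The function names are hypothetical, and the sketch assumes that only the three listed characters are escaped; the specification does not say how a literal backslash should be handled.

```python
ESCAPED_CHARS = "#;:"  # characters the specification says must be backslash escaped


def escape_field_value(value: str) -> str:
    """Put a backslash in front of every '#', ';' and ':' in a field value."""
    return "".join("\\" + ch if ch in ESCAPED_CHARS else ch for ch in value)


def unescape_field_value(value: str) -> str:
    """Reverse escape_field_value by dropping the backslash before '#', ';' or ':'."""
    out = []
    i = 0
    while i < len(value):
        if value[i] == "\\" and i + 1 < len(value) and value[i + 1] in ESCAPED_CHARS:
            out.append(value[i + 1])  # keep the escaped character, skip the backslash
            i += 2
        else:
            out.append(value[i])
            i += 1
    return "".join(out)


print(escape_field_value("queue#1;a:b"))  # -> queue\#1\;a\:b
```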

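The supported commands are Query Resources (GETNODES), Query Workload (GETJOBS), StartJob, CancelJob, SuspendJob, ResumeJob, RequeueJob, SignalJob, ModifyJob, JobAddTask, and JobRemoveTask.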


W.1.1.1 Wiki Query Resources
W.1.1.1.1 Wiki Query Resources Request Format
CMD=GETNODES ARG={<UPDATETIME>:<NODEID>[:<NODEID>]... | <UPDATETIME>:ALL}

Only nodes updated more recently than <UPDATETIME> are returned, where <UPDATETIME> is an epoch time. Setting <UPDATETIME> to '0' returns information for all nodes. Specify a colon delimited list of NODEIDs if specific nodes are desired, or use the keyword 'ALL' to receive information for all nodes.
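As a minimal sketch, the request string could be assembled as follows. The helper name build_getnodes is hypothetical; only the CMD=GETNODES text it produces comes from the format above.

```python
from typing import Iterable, Optional


def build_getnodes(update_time: int = 0, node_ids: Optional[Iterable[str]] = None) -> str:
    """Build a GETNODES request; an empty or missing node list means 'ALL'."""
    ids = list(node_ids) if node_ids is not None else []
    target = ":".join(ids) if ids else "ALL"
    return f"CMD=GETNODES ARG={update_time}:{target}"


print(build_getnodes(0, ["node001", "node002", "node003"]))
print(build_getnodes(963004212))
```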

W.1.1.1.2 Wiki Query Resources Response Format
The query resources response consists of one or more lines of the following format (separated with a newline, "\n"):

<NODEID> <ATTR>=<VALUE>[;<ATTR>=<VALUE>]...

<ATTR> is one of the names in the table below and the format of <VALUE> is dependent on <ATTR>.

W.1.1.1.3 Wiki Query Resources Example
request:
wiki resource query
CMD=GETNODES ARG=0:node001:node002:node003

response:

wiki resource query response
node001 UPDATETIME=963004212;STATE=Busy;OS=AIX43;ARCH=RS6000...
node002 UPDATETIME=963004213;STATE=Busy;OS=AIX43;ARCH=RS6000...
...
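A response in this format could be parsed along the following lines. This is a sketch, not a reference implementation: the function names are hypothetical, values are kept as strings in their escaped form, and a ';' is treated as a separator only when no backslash precedes it.

```python
import re

# Split on ';' only when it is not preceded by a backslash (escaped separators stay in the value).
_UNESCAPED_SEMICOLON = re.compile(r"(?<!\\);")


def parse_node_line(line: str):
    """Split '<NODEID> <ATTR>=<VALUE>[;<ATTR>=<VALUE>]...' into (node_id, attribute dict)."""
    node_id, _, attr_text = line.strip().partition(" ")
    attrs = {}
    for pair in _UNESCAPED_SEMICOLON.split(attr_text):
        name, sep, value = pair.partition("=")
        if sep:  # ignore empty or malformed fragments
            attrs[name] = value
    return node_id, attrs


def parse_node_response(text: str):
    """Parse a newline separated response into {node_id: {attr: value}}."""
    return dict(parse_node_line(line) for line in text.splitlines() if line.strip())


sample = (
    "node001 UPDATETIME=963004212;STATE=Busy;OS=AIX43;ARCH=RS6000\n"
    "node002 UPDATETIME=963004213;STATE=Busy;OS=AIX43;ARCH=RS6000\n"
)
print(parse_node_response(sample)["node001"]["STATE"])  # -> Busy
```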
W.1.1.1.4 Wiki Query Resources Data Format
| NAME | FORMAT | DEFAULT | DESCRIPTION |
| ACLASS | one or more bracket enclosed <NAME>:<COUNT> pairs (e.g., [batch:5][sge:3]) | --- | Run classes currently available on node. If not specified, scheduler will attempt to determine actual ACLASS value. |
|  | <INTEGER> | 0 | Available local disk on node (in MB) |
|  | <fs id="X" size="X" io="Y" rcount="X" wcount="X" ocount="X"></fs>[...] | 0 | Available filesystem state |
|  | <INTEGER> | 0 | Available/free RAM on node (in MB) |
|  | <INTEGER> | 1 | Available processors on node |
| ARCH | <STRING> | --- | Compute architecture of node |
|  | one or more comma delimited <NAME>:<VALUE> pairs (e.g., MATLAB:6,COMPILER:100) | --- | Arbitrary consumable resources currently available on the node |
|  | <INTEGER> | 0 | Available swap on node (in MB) |
|  | one or more bracket enclosed <NAME>:<COUNT> pairs (e.g., [batch:5][sge:3]) | --- | Run classes supported by node. Typically, one class is 'consumed' per task. Thus, an 8 processor node may have 8 instances of each class it supports present, e.g., [batch:8][interactive:8] |
|  | <INTEGER> | 0 | Configured local disk on node (in MB) |
|  | <STRING> | 0 | Configured filesystem state |
|  | <INTEGER> | 0 | Configured RAM on node (in MB) |
| CPROC | <INTEGER> | 1 | Configured processors on node |
|  | <DOUBLE> | 0.0 | One minute BSD load average |
|  | one or more comma delimited <NAME>:<VALUE> pairs (e.g., MATLAB:6,COMPILER:100) | --- | Arbitrary consumable resources supported and tracked on the node, e.g., software licenses or tape drives. |
|  | <INTEGER> | 0 | Configured swap on node (in MB) |
|  | <INTEGER> | 0 | Number of tasks currently active on the node |
|  | <STRING> | --- | Event or exception which occurred on the node |
|  | one or more colon delimited <STRING>s (e.g., WIDE:HSM) | --- | Generic attributes, often describing hardware or software features, associated with the node. |
|  | <INTEGER> | --- | Current total number of gevent event occurrences since epoch. This value should be monotonically increasing. |
| GEVENT | GEVENT[<EVENTNAME>]=<STRING> | --- | Generic event occurrence and context data. |
| GMETRIC | GMETRIC[<METRICNAME>]=<DOUBLE> | --- | Current value of generic metric, e.g., 'GMETRIC[temp]=103.5'. |
|  | <INTEGER> | --- | Number of seconds since last detected keyboard or mouse activity (often used with desktop harvesting) |
|  | <INTEGER> | <CPROC> | Maximum number of tasks allowed on the node at any given time |
| OS | <STRING> | --- | Operating system running on node |
|  | one or more comma delimited <STRING>s, quoted if the string has spaces (e.g., "SAS7 AS3 Core Baseline Build v0.1.0","RedHat AS3-U5Development Build v0.2") | --- | Operating systems accepted by node |
|  | <ATTR>=<VALUE>[,<ATTR>=<VALUE>]... | --- | Opaque node attributes assigned to node |
|  | <STRING> | DEFAULT | Partition to which node belongs |
|  | <INTEGER> | 0 | Rack location of the node |
|  | <INTEGER> | 0 | Slot location of the node |
|  | <DOUBLE> | 1.0 | Relative processor speed of the node. For more information, see Node Attributes Speed. |
| STATE | one of the following: Idle, Running, Busy, Unknown, Drained, Draining, or Down | Down | State of the node |
| UPDATETIME | <EPOCHTIME> | 0 | Time node information was last updated |
|  | <ATTR>=<VAL> | --- | Generic variables to be associated with node |
|  | one or more comma delimited <NAME>:<VALUE> pairs (e.g., MATLAB:6,COMPILER:100) | --- | Amount of external usage of a particular generic resource |

* indicates required field

Note: node states have the following definitions:

Busy: Node is running some jobs and will not accept additional jobs.
Down: Resource Manager problems have been detected. Node is incapable of running jobs.
Draining: Node is responding but will not accept new jobs.
Idle: Node is ready to run jobs but currently is not running any.
Running: Node is running some jobs and will accept additional jobs.
Unknown: Node is capable of running jobs but the scheduler will need to determine if the node state is actually Idle, Running, or Busy.
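A scheduler-side check of these definitions might look like the sketch below. The pairing of state names with definitions is inferred from the state list in the data format table, the handling of 'Drained' is an assumption (it has no definition here), and the names are hypothetical.

```python
from typing import Optional

# Whether a node in each state can take additional jobs.
# None means the scheduler has to work out the effective state itself.
ACCEPTS_NEW_JOBS = {
    "Idle": True,
    "Running": True,
    "Busy": False,
    "Draining": False,
    "Drained": False,  # assumption: not defined in the note, treated like Draining
    "Down": False,
    "Unknown": None,
}


def accepts_new_jobs(state: str) -> Optional[bool]:
    """Return True/False for a known state, or None when the state is 'Unknown'."""
    try:
        return ACCEPTS_NEW_JOBS[state]
    except KeyError:
        raise ValueError(f"unrecognized node state: {state!r}") from None


print(accepts_new_jobs("Running"))  # -> True
```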

W.1.1.2 Wiki Query Workload
W.1.1.2.1 Wiki Query Workload Request Format
CMD=GETJOBS ARG={<UPDATETIME>:<JOBID>[:<JOBID>]... | <UPDATETIME>:ALL }

Only jobs updated more recently than <UPDATETIME> are returned, where <UPDATETIME> is an epoch time. Setting <UPDATETIME> to '0' returns information for all jobs. Specify a colon delimited list of JOBIDs if information for specific jobs is desired, or use the keyword 'ALL' to receive information about all jobs.

W.1.1.2.2 Wiki Query Workload Response Format

SC=<STATUSCODE> ARG=<JOBCOUNT>#<JOBID>:<FIELD>=<VALUE>;[<FIELD>=<VALUE>;]...[#<JOBID>:<FIELD>=<VALUE>;[<FIELD>=<VALUE>;]...]...

or

SC=<STATUSCODE> RESPONSE=<RESPONSE>

FIELD is either the text name listed below or 'A<FIELDNUM>'
(e.g., 'UPDATETIME' or 'A2')

STATUSCODE values:

0 SUCCESS
-1 INTERNAL ERROR

RESPONSE is a message, dependent on the status code, describing error or state details
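The two response shapes above can be handled by one parser, sketched here under a few assumptions: the function name is hypothetical, field values are returned as unconverted strings, job IDs are assumed to contain no ':' characters, and '#' or ';' only act as separators when not preceded by a backslash.

```python
import re

_UNESCAPED_HASH = re.compile(r"(?<!\\)#")
_UNESCAPED_SEMICOLON = re.compile(r"(?<!\\);")


def parse_getjobs_response(text: str):
    """Return (status_code, {job_id: {field: value}}, message_or_None)."""
    head, _, rest = text.strip().partition(" ")
    if not head.startswith("SC="):
        raise ValueError("response does not start with SC=")
    status = int(head[3:])
    if rest.startswith("RESPONSE="):
        return status, {}, rest[len("RESPONSE="):]
    if not rest.startswith("ARG="):
        raise ValueError("expected ARG= or RESPONSE= after the status code")
    # The first '#'-separated piece is the job count; each later piece is one job record.
    count_text, *records = _UNESCAPED_HASH.split(rest[len("ARG="):])
    jobs = {}
    for record in records:
        job_id, _, field_text = record.partition(":")
        fields = {}
        for pair in _UNESCAPED_SEMICOLON.split(field_text):
            name, sep, value = pair.partition("=")
            if sep:
                fields[name] = value
        jobs[job_id] = fields
    if int(count_text) != len(jobs):
        raise ValueError("JOBCOUNT does not match the number of job records")
    return status, jobs, None


reply = "SC=0 ARG=1#job.1:STATE=Idle;WCLIMIT=3600;"
print(parse_getjobs_response(reply))
```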

W.1.1.2.3 Wiki Query Workload Example

request syntax
CMD=GETJOBS ARG=0:ALL
response syntax
ARG=2#nebo3001.0:UPDATETIME=9780000320;STATE=Idle;WCLIMIT=3600;...
W.1.1.2.4 Wiki Query Workload Data Format
| NAME | FORMAT | DEFAULT | DESCRIPTION |
|  | <STRING> | --- | AccountID associated with job |
|  | <INTEGER> | --- | number of application tasks to allocate at each allocation adjustment. |
|  | <DOUBLE> | --- | backlogged quantity of workload for associated application (units are opaque), value may be compared against TARGETBACKLOG |
|  | <DOUBLE> | --- | load of workload for associated application (units are opaque), value may be compared against TARGETLOAD |
|  | <DOUBLE> | --- | response time of workload for associated application (units are opaque), value may be compared against TARGETRESPONSETIME |
|  | <DOUBLE> | --- | throughput of workload for associated application (units are opaque), value may be compared against TARGETTHROUGHPUT |
|  | <STRING> | --- | job command-line arguments |
|  | <STRING> | 0 | job resource manager extension arguments including qos, dependencies, reservation constraints, etc |
|  | <EPOCHTIME> | 0 | time job completed execution |
|  | <INTEGER> | 0 | quantity of local disk space (in MB) which must be dedicated to each task of the job |
|  | name:value[,name:value] | --- | Dedicated generic resources per task. |
|  | <INTEGER> | 1 | number of processors dedicated per task |
|  | <STRING> | --- | network adapter which must be dedicated to job |
|  | <INTEGER> | 0 | quantity of virtual memory (swap, in MB) which must be dedicated to each task of the job |
|  | <EPOCHTIME> | [ANY] | time by which job must complete |
|  | <STRING> | --- | job environment variables |
|  | <EVENT> | --- | event or exception experienced by job |
|  | <STRING> | --- | file to contain STDERR |
|  | <STRING> | --- | job executable command |
|  | <INTEGER> | --- | job exit code |
|  | <STRING> | --- | job flags |
|  | <STRING> | --- | String describing task geometry required by job |
|  | <STRING> | --- | GroupID under which job will run |
| HOSTLIST | comma or colon delimited list of hostnames; suffix the hostlist with a caret (^) to mean superset; suffix with an asterisk (*) to mean subset; otherwise, the hostlist is interpreted as an exact set | [ANY] | list of required hosts on which job must run. (see TASKLIST) A subset means the specified hostlist is used first to select hosts for the job. If the job requires more hosts than are in the hostlist, they will be obtained from elsewhere if possible. If the job does not require all of the hosts in the hostlist, it will use only the ones it needs. A superset means the hostlist is the only source of hosts that should be considered for running the job. If the job can't find the necessary resources in the hosts in this list it should not run. No other hosts should be considered in allocating the job. |
|  | <STRING> | --- | file containing STDIN |
|  | <STRING> | --- | job's initial working directory |
|  | <STRING> | --- | User specified name of job |
|  | <INTEGER>[,<INTEGER>] | --- | Minimum and maximum nodes allowed to be allocated to job. |
|  | <INTEGER> | 1 | Number of nodes required by job (See Node Definition for more info) |
|  | <STRING> | --- | file to contain STDOUT |
|  | one or more colon delimited <STRING>s | [ANY] | list of partitions in which job can run |
| PREF | colon delimited list of <STRING>s | --- | List of preferred node features or variables. (See PREF for more information.) |
|  | <INTEGER> | --- | system priority (absolute or relative - use '+' and '-' to specify relative) |
|  | <INTEGER> | 0 | quality of service requested |
|  | <EPOCHTIME> | 0 | time job was submitted to resource manager |
|  | <STRING> | --- | architecture required by job |
|  | list of bracket enclosed <STRING>:<INTEGER> pairs | --- | list of <CLASSNAME>:<COUNT> pairs indicating type and number of class instances required per task. (e.g., '[batch:1]' or '[batch:2][tape:1]') |
|  | <INTEGER> | 0 | local disk space (in MB) required to be configured on nodes allocated to the job |
|  | one of '>=', '>', '==', '<', or '<=' | >= | local disk comparison (e.g., node must have > 2048 MB local disk) |
|  | <INTEGER> | 0 | reason job was rejected |
|  | <INTEGER> | 0 | number of times job was rejected |
|  | <STRING> | --- | text description of reason job was rejected |
|  | <STRING> | --- | Name of reservation in which job must run |
|  | <STRING> | --- | List of reservations in which job can run |
|  | colon delimited list of <STRING>s | --- | List of features required on nodes |
|  | <INTEGER> | 0 | real memory (RAM, in MB) required to be configured on nodes allocated to the job |
|  | one of '>=', '>', '==', '<', or '<=' | >= | real memory comparison (e.g., node must have >= 512MB RAM) |
|  | <STRING> | --- | network adapter required by job |
|  | <STRING> | --- | operating system required by job |
|  | <RESTYPE>[{+|:}<COUNT>][@<TIMEFRAME>] | --- | software required by job |
|  | <INTEGER> | 0 | virtual memory (swap, in MB) required to be configured on nodes allocated to the job |
|  | one of '>=', '>', '==', '<', or '<=' | >= | virtual memory comparison (e.g., node must have ==4096 MB virtual memory) |
|  | <STRING> | --- | system id (global job system owner) |
|  | <STRING> | --- | system job id (global job id) |
|  | <EPOCHTIME> | 0 | earliest time job should be allowed to start |
|  | <EPOCHTIME> | 0 | time job was started by the resource manager |
| STATE | one of Idle, Running, Hold, Suspended, Completed, or Removed | Idle | State of job |
|  | <INTEGER> | 0 | Number of seconds job has been suspended |
| TARGETBACKLOG | <DOUBLE>[,<DOUBLE>] | --- | Minimum and maximum backlog for application within job. |
| TARGETLOAD | <DOUBLE>[,<DOUBLE>] | --- | Minimum and maximum load for application within job. |
| TARGETRESPONSETIME | <DOUBLE>[,<DOUBLE>] | --- | Minimum and maximum response time for application within job. |
| TARGETTHROUGHPUT | <DOUBLE>[,<DOUBLE>] | --- | Minimum and maximum throughput for application within job. |
|  | <ALLOCATIONTIME>[,<DEALLOCATIONTIME>] where values are specified using the format [[[DD:]HH:]MM:]SS | --- | Amount of time an application performance target must be exceeded before Moab adjusts the resource allocation. By default, Moab allocates/deallocates resources as soon as a performance target violation is detected. |
| TASKLIST | one or more comma-delimited <STRING>s | --- | list of allocated tasks, or in other words, comma-delimited list of node IDs associated with each active task of job (e.g., cl01, cl02, cl01, cl02, cl03). The tasklist is initially selected by the scheduler at the time the StartJob command is issued. The resource manager is then responsible for starting the job on these nodes and maintaining this task distribution information throughout the life of the job. (see HOSTLIST) |
|  | <INTEGER> | 1 | Number of tasks required by job (See Task Definition for more info) |
|  | <INTEGER> | 0 | exact number of tasks required per node |
|  | <STRING> | --- | UserID under which job will run |
| UPDATETIME | <EPOCHTIME> | 0 | Time job was last updated |
| WCLIMIT | [[HH:]MM:]SS | 864000 | walltime required by job |

* indicates required field
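Walltime values use the [[HH:]MM:]SS format listed in the table above. A minimal conversion to seconds could look like this; the function name is hypothetical, and the sketch assumes each component is a non-negative integer without range checks (so a bare '3600' is accepted as 3600 seconds).

```python
def walltime_to_seconds(text: str) -> int:
    """Convert '[[HH:]MM:]SS' into a number of seconds."""
    parts = text.split(":")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"expected [[HH:]MM:]SS, got {text!r}")
    seconds = 0
    for part in parts:
        # Each step shifts the running total up one unit (hours -> minutes -> seconds).
        seconds = seconds * 60 + int(part)
    return seconds


print(walltime_to_seconds("3600"))     # -> 3600
print(walltime_to_seconds("1:00:00"))  # -> 3600
print(walltime_to_seconds("10:00"))    # -> 600
```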

Note: Job states have the following definitions:

Completed: Job has completed
Hold: Job is in the queue but is not allowed to run
Idle: Job is ready to run
Removed: Job has been canceled or otherwise terminated externally
Running: Job is currently executing
Suspended: Job has started but execution has temporarily been suspended

Note: Completed and canceled jobs should be maintained by the resource manager for a brief time, perhaps 1 to 5 minutes, before being purged. This provides the scheduler time to obtain all final job state information for scheduler statistics.


W.1.1.3 StartJob
The 'StartJob' command may only be applied to jobs in the 'Idle' state. It causes the job to begin running using the resources listed in the NodeID list.

send CMD=STARTJOB ARG=<JOBID> TASKLIST=<NODEID>[:<NODEID>]...

receive SC=<STATUSCODE> RESPONSE=<RESPONSE>

STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message possibly further describing an error or state

job start example

# Start job nebo.1 on nodes cluster001 and cluster002
send 'CMD=STARTJOB ARG=nebo.1 TASKLIST=cluster001:cluster002'
receive 'SC=0 RESPONSE=job nebo.1 started with 2 tasks'
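The StartJob exchange can be sketched as a request builder plus a reply parser. All names here are hypothetical; the parser tolerates either a space or a ';' between the SC and RESPONSE fields, which is an assumption rather than something the specification states.

```python
import re
from typing import NamedTuple, Sequence


class Reply(NamedTuple):
    status: int
    message: str

    @property
    def ok(self) -> bool:
        # STATUSCODE >= 0 indicates success, < 0 indicates failure.
        return self.status >= 0


def build_startjob(job_id: str, node_ids: Sequence[str]) -> str:
    """Build a STARTJOB request with a colon delimited TASKLIST."""
    if not node_ids:
        raise ValueError("TASKLIST needs at least one node")
    return f"CMD=STARTJOB ARG={job_id} TASKLIST={':'.join(node_ids)}"


def parse_reply(text: str) -> Reply:
    """Parse 'SC=<STATUSCODE> RESPONSE=<RESPONSE>' into a Reply."""
    match = re.match(r"SC=(-?\d+)[ ;]RESPONSE=(.*)", text.strip(), re.DOTALL)
    if match is None:
        raise ValueError(f"unrecognized reply: {text!r}")
    return Reply(int(match.group(1)), match.group(2))


print(build_startjob("nebo.1", ["cluster001", "cluster002"]))
print(parse_reply("SC=0 RESPONSE=job nebo.1 started with 2 tasks").ok)  # -> True
```

The same parse_reply sketch applies to every command below that answers with SC=<STATUSCODE> RESPONSE=<RESPONSE>.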

W.1.1.4 CancelJob
The 'CancelJob' command, if applied to an active job, will terminate its execution. If applied to an idle or active job, the CancelJob command will change the job's state to 'Canceled'.

send CMD=CANCELJOB ARG=<JOBID> TYPE=<CANCELTYPE>

<CANCELTYPE> is one of the following:

ADMIN (command initiated by scheduler administrator)
WALLCLOCK (command initiated by scheduler because job exceeded its specified wallclock limit)

receive SC=<STATUSCODE> RESPONSE=<RESPONSE>

STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message further describing an error or state

job cancel example
# Cancel job nebo.2

send 'CMD=CANCELJOB ARG=nebo.2 TYPE=ADMIN'
receive 'SC=0 RESPONSE=job nebo.2 canceled'
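A small builder can reject cancel types other than the two listed above before anything is sent. The helper name is hypothetical, and defaulting to ADMIN is a choice made for this sketch only.

```python
CANCEL_TYPES = ("ADMIN", "WALLCLOCK")  # the two <CANCELTYPE> values listed above


def build_canceljob(job_id: str, cancel_type: str = "ADMIN") -> str:
    """Build a CANCELJOB request, refusing unknown cancel types."""
    if cancel_type not in CANCEL_TYPES:
        raise ValueError(f"TYPE must be one of {CANCEL_TYPES}, got {cancel_type!r}")
    return f"CMD=CANCELJOB ARG={job_id} TYPE={cancel_type}"


print(build_canceljob("nebo.2"))  # -> CMD=CANCELJOB ARG=nebo.2 TYPE=ADMIN
```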

W.1.1.5 SuspendJob
The 'SuspendJob' command can only be issued against a job in the state 'Running'. This command suspends job execution and results in the job changing to the 'Suspended' state.

send CMD=SUSPENDJOB ARG=<JOBID>

receive SC=<STATUSCODE> RESPONSE=<RESPONSE>

STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message possibly further describing an error or state

job suspend example

# Suspend job nebo.3
send 'CMD=SUSPENDJOB ARG=nebo.3'
receive 'SC=0 RESPONSE=job nebo.3 suspended'

W.1.1.6 ResumeJob
The 'ResumeJob' command can only be issued against a job in the state 'Suspended'. This command resumes a suspended job returning it to the 'Running' state.

send CMD=RESUMEJOB ARG=<JOBID>

receive SC=<STATUSCODE> RESPONSE=<RESPONSE>

STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message further describing an error or state

job resume example

# Resume job nebo.3
send 'CMD=RESUMEJOB ARG=nebo.3'
receive 'SC=0 RESPONSE=job nebo.3 resumed'

W.1.1.7 RequeueJob
The 'RequeueJob' command can only be issued against an active job in the state 'Starting' or 'Running'. This command requeues the job, stopping execution and returning the job to an idle state in the queue. The requeued job will be eligible for execution the next time resources are available.

send CMD=REQUEUEJOB ARG=<JOBID>

receive SC=<STATUSCODE> RESPONSE=<RESPONSE>

STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message further describing an error or state

job requeue example

# Requeue job nebo.3
send 'CMD=REQUEUEJOB ARG=nebo.3'
receive 'SC=0 RESPONSE=job nebo.3 requeued'

W.1.1.8 SignalJob
The 'SignalJob' command can only be issued against an active job in the state 'Starting' or 'Running'. This command sends the specified signal to the job's master process. The signalled job remains in the same state it was in before the signal was issued.

send CMD=SIGNALJOB ARG=<JOBID> ACTION=signal VALUE=<SIGNAL>

receive SC=<STATUSCODE> RESPONSE=<RESPONSE>

STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message further describing an error or state

job signal example

# Signal job nebo.3
send 'CMD=SIGNALJOB ARG=nebo.3 ACTION=signal VALUE=13'
receive 'SC=0 RESPONSE=job nebo.3 signalled'
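The state preconditions stated for StartJob, SuspendJob, ResumeJob, RequeueJob, and SignalJob can be collected into one lookup so a caller can check a job's state before sending a command. This is a sketch with hypothetical names; commands without a stated single-state precondition are treated here as always allowed, which is an assumption.

```python
# Job states in which each command may be issued, taken from the command descriptions.
REQUIRED_STATES = {
    "STARTJOB": {"Idle"},
    "SUSPENDJOB": {"Running"},
    "RESUMEJOB": {"Suspended"},
    "REQUEUEJOB": {"Starting", "Running"},
    "SIGNALJOB": {"Starting", "Running"},
}


def command_allowed(command: str, job_state: str) -> bool:
    """Return True if the command may be issued against a job in job_state."""
    allowed = REQUIRED_STATES.get(command.upper())
    # Commands missing from the table are not restricted by this sketch.
    return allowed is None or job_state in allowed


print(command_allowed("SUSPENDJOB", "Running"))  # -> True
print(command_allowed("RESUMEJOB", "Running"))   # -> False
```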

W.1.1.9 ModifyJob
The 'ModifyJob' command can be issued against any active or queued job. This command modifies specified attributes of the job.

send CMD=MODIFYJOB ARG=<JOBID> [BANK=name] [NODES=num] [PARTITION=name] [TIMELIMIT=minutes]

receive SC=<STATUSCODE> RESPONSE=<RESPONSE>

STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message further describing an error or state

job modify example

# Modify the time limit of job nebo.3
send 'CMD=MODIFYJOB ARG=nebo.3 TIMELIMIT=9600'
receive 'SC=0 RESPONSE=job nebo.3 modified'
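Because every ModifyJob attribute is optional, a builder only needs to emit the ones that were supplied. The helper below is a hypothetical sketch; requiring at least one attribute is an assumption made so that an empty modification is not sent.

```python
from typing import Optional


def build_modifyjob(
    job_id: str,
    bank: Optional[str] = None,
    nodes: Optional[int] = None,
    partition: Optional[str] = None,
    timelimit: Optional[int] = None,
) -> str:
    """Build a MODIFYJOB request containing only the attributes that were given."""
    options = [("BANK", bank), ("NODES", nodes), ("PARTITION", partition), ("TIMELIMIT", timelimit)]
    parts = [f"CMD=MODIFYJOB ARG={job_id}"]
    parts += [f"{name}={value}" for name, value in options if value is not None]
    if len(parts) == 1:
        raise ValueError("at least one attribute to modify is required")
    return " ".join(parts)


print(build_modifyjob("nebo.3", timelimit=9600))  # -> CMD=MODIFYJOB ARG=nebo.3 TIMELIMIT=9600
```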

W.1.1.10 JobAddTask
The 'JobAddTask' command allocates additional tasks to an active job.

send

CMD=JOBADDTASK ARG=<JOBID> <NODEID> [<NODEID>]...

receive

SC=<STATUSCODE> RESPONSE=<RESPONSE>

STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message possibly further describing an error or state

job addtask example

# Add 3 default tasks to job nebo30023.0 using resources located on nodes cluster002, cluster016, and cluster112.
send 'CMD=JOBADDTASK ARG=nebo30023.0 DEFAULT cluster002 cluster016 cluster112'
receive 'SC=0 RESPONSE=3 tasks added'

W.1.1.11 JobRemoveTask
The 'JobRemoveTask' command removes tasks from an active job.

send

CMD=JOBREMOVETASK ARG=<JOBID> <TASKID> [<TASKID>]...

receive

SC=<STATUSCODE> RESPONSE=<RESPONSE>

STATUSCODE >= 0 indicates SUCCESS
STATUSCODE < 0 indicates FAILURE
RESPONSE is a text message further describing an error or state

job removetask example

# Free resources allocated to tasks 14, 15, and 16 of job nebo30023.0
send 'CMD=JOBREMOVETASK ARG=nebo30023.0 14 15 16'
receive 'SC=0 RESPONSE=3 tasks removed'
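The two task commands share the same shape: a job ID followed by a space separated list. The builders below are a hypothetical sketch that follows the format lines only; the extra DEFAULT token shown in the JobAddTask example is not part of the stated format and is not generated here.

```python
from typing import Iterable


def build_jobaddtask(job_id: str, node_ids: Iterable[str]) -> str:
    """Build a JOBADDTASK request: job ID followed by the nodes to add tasks on."""
    nodes = list(node_ids)
    if not nodes:
        raise ValueError("at least one NODEID is required")
    return "CMD=JOBADDTASK ARG=" + " ".join([job_id, *nodes])


def build_jobremovetask(job_id: str, task_ids: Iterable[int]) -> str:
    """Build a JOBREMOVETASK request: job ID followed by the task IDs to remove."""
    tasks = [str(task) for task in task_ids]
    if not tasks:
        raise ValueError("at least one TASKID is required")
    return "CMD=JOBREMOVETASK ARG=" + " ".join([job_id, *tasks])


print(build_jobremovetask("nebo30023.0", [14, 15, 16]))
```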

W.1.2 Rejection Codes

Copyright © 2012 Adaptive Computing Enterprises, Inc.®