Moab Workload Manager

Appendix W: Wiki Interface Specification, version 1.2

W.1.1   COMMANDS

   All commands are requested via a socket interface, one command per socket connection. All fields and values are specified in ASCII text.  Moab is configured to communicate via the wiki interface by specifying the following parameters in the moab.cfg file:

moab.cfg

RMCFG[base] TYPE=WIKI SERVER=<HOSTNAME>[:<PORT>]
...

   Field values must backslash escape the following characters if specified:

        '#'  ';'  ':'      (i.e.  '\#')
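
The escaping rule above can be sketched as a pair of small helpers. This is only an illustration: the function names are hypothetical, and the sketch assumes that backslashes already present in a value are left untouched, which the specification does not address.

```python
import re

# Characters the specification says must be backslash escaped in field values.
WIKI_SPECIAL_CHARS = "#;:"


def escape_wiki_value(value: str) -> str:
    """Prefix every '#', ';' and ':' in value with a backslash."""
    return "".join("\\" + ch if ch in WIKI_SPECIAL_CHARS else ch for ch in value)


def unescape_wiki_value(value: str) -> str:
    """Reverse escape_wiki_value by dropping the backslash before '#', ';' or ':'."""
    return re.sub(r"\\([#;:])", r"\1", value)
```

For example, escape_wiki_value("10:30") yields the text 10\:30, and unescape_wiki_value restores the original value.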

    Supported Commands are:


W.1.1.1   Wiki Query Resources

W.1.1.1.1   Wiki Query Resources Request Format
   CMD=GETNODES ARG={<UPDATETIME>:<NODEID>[:<NODEID>]... | <UPDATETIME>:ALL}

   Only nodes updated more recently than <UPDATETIME> will be returned, where <UPDATETIME> is specified as the epoch time of interest.  Setting <UPDATETIME> to '0' will return information for all nodes.  Specify a colon-delimited list of NODEIDs if specific nodes are desired, or use the keyword 'ALL' to receive information for all nodes.
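
As a rough sketch of the request format above, the following helper assembles a GETNODES command string. The function name and its parameters are hypothetical, and node IDs are assumed to need no escaping.

```python
def build_getnodes_request(update_time=0, node_ids=None):
    """Build a GETNODES command.

    update_time: epoch time; 0 asks for information on all nodes.
    node_ids:    optional list of node IDs; when omitted, the keyword ALL is used.
    """
    target = ":".join(node_ids) if node_ids else "ALL"
    return f"CMD=GETNODES ARG={update_time}:{target}"
```

Calling build_getnodes_request(0, ["node001", "node002"]) produces a request for two specific nodes, while build_getnodes_request(963004212) asks for every node updated after that time.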

W.1.1.1.2   Wiki Query Resources Response Format
The query resources response consists of one or more lines of the following format (lines are separated with a newline, "\n"):

<NODEID> <ATTR>=<VALUE>[;<ATTR>=<VALUE>]...

<ATTR> is one of the names in the table below and the format of <VALUE> is dependent on <ATTR>.
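
A minimal parsing sketch for this response format follows. The function names are hypothetical; the sketch assumes that a ';' preceded by a backslash is an escaped character rather than a separator, and it returns attribute values as raw (still escaped) strings.

```python
import re


def parse_node_line(line):
    """Split '<NODEID> <ATTR>=<VALUE>[;<ATTR>=<VALUE>]...' into (node_id, attrs)."""
    node_id, _, attr_text = line.strip().partition(" ")
    attrs = {}
    # Split on ';' unless it is backslash escaped.
    for pair in re.split(r"(?<!\\);", attr_text):
        if pair:
            name, _, value = pair.partition("=")
            attrs[name] = value
    return node_id, attrs


def parse_getnodes_response(text):
    """Parse a newline-separated response into {node_id: {attr: value}}."""
    return dict(parse_node_line(line) for line in text.splitlines() if line.strip())
```

Numeric attributes such as UPDATETIME are left as strings here; converting them is left to the caller.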

W.1.1.1.3   Wiki Query Resources Example
request:

wiki resource query

CMD=GETNODES ARG=0:node001:node002:node003

response:

wiki resource query response

node001 UPDATETIME=963004212;STATE=Busy;OS=AIX43;ARCH=RS6000...
node002 UPDATETIME=963004213;STATE=Busy;OS=AIX43;ARCH=RS6000...
...

W.1.1.1.4   Wiki Query Resources Data Format

* indicates required field

Note:  node states have the following definitions:
Busy: Node is running some jobs and will not accept additional jobs
Down: Resource Manager problems have been detected.  Node is incapable of running jobs.
Draining: Node is responding but will not accept new jobs
Idle: Node is ready to run jobs but currently is not running any.
Running: Node is running some jobs and will accept additional jobs
Unknown: Node is capable of running jobs but the scheduler will need to determine if the node state is actually Idle, Running, or Busy.
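
The state definitions above can be summarized in a short sketch. The enum and function names are hypothetical; the only logic encoded is what the definitions state, namely that Idle and Running nodes can take additional jobs, and that Unknown must be resolved by the scheduler.

```python
from enum import Enum


class NodeState(Enum):
    BUSY = "Busy"
    DOWN = "Down"
    DRAINING = "Draining"
    IDLE = "Idle"
    RUNNING = "Running"
    UNKNOWN = "Unknown"


# States whose definition says the node can take additional jobs.
ACCEPTS_NEW_JOBS = {NodeState.IDLE, NodeState.RUNNING}


def accepts_new_jobs(state_text):
    """Return True or False per the definitions, or None when the state is Unknown."""
    state = NodeState(state_text)
    if state is NodeState.UNKNOWN:
        return None  # the scheduler must determine Idle, Running, or Busy
    return state in ACCEPTS_NEW_JOBS
```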


W.1.1.2   Wiki Query Workload

W.1.1.2.1   Wiki Query Workload Request Format
   CMD=GETJOBS ARG={<UPDATETIME>:<JOBID>[:<JOBID>]... | <UPDATETIME>:ALL }

   Only jobs updated more recently than <UPDATETIME> will be returned, where <UPDATETIME> is specified as the epoch time of interest.  Setting <UPDATETIME> to '0' will return information for all jobs.  Specify a colon-delimited list of JOBIDs if information for specific jobs is desired, or use the keyword 'ALL' to receive information about all jobs.

W.1.1.2.2   Wiki Query Workload Response Format
   SC=<STATUSCODE> ARG=<JOBCOUNT>#<JOBID>:<FIELD>=<VALUE>;[<FIELD>=<VALUE>;]...[#<JOBID>:<FIELD>=<VALUE>;[<FIELD>=<VALUE>;]...]...

        or

        SC=<STATUSCODE> RESPONSE=<RESPONSE>

        FIELD      is either the text name listed below or 'A<FIELDNUM>'
                      (ie, 'UPDATETIME' or 'A2')

        STATUSCODE values:

             0   SUCCESS
            -1   INTERNAL ERROR

        RESPONSE   is a statuscode sensitive message describing error or state details
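
The two response forms above could be handled along the following lines. This is a sketch under assumptions: the function name is hypothetical, job IDs are assumed to contain no ':' characters, and '#' or ';' preceded by a backslash is treated as escaped data rather than a separator.

```python
import re


def parse_getjobs_response(text):
    """Return (status_code, message, jobs) for a GETJOBS response.

    jobs maps each job ID to a dict of field names and raw values;
    message is filled only for the 'SC=... RESPONSE=...' form.
    """
    match = re.match(r"SC=(-?\d+)\s+(ARG|RESPONSE)=(.*)", text.strip(), re.DOTALL)
    if match is None:
        raise ValueError("unrecognized GETJOBS response")
    status = int(match.group(1))
    if match.group(2) == "RESPONSE":
        return status, match.group(3), {}
    # An unescaped '#' separates the job count and each job record.
    count_text, *records = re.split(r"(?<!\\)#", match.group(3))
    jobs = {}
    for record in records:
        job_id, _, field_text = record.partition(":")
        fields = {}
        for pair in re.split(r"(?<!\\);", field_text):
            if pair:
                name, _, value = pair.partition("=")
                fields[name] = value
        jobs[job_id] = fields
    if len(jobs) != int(count_text):
        raise ValueError("job count does not match number of job records")
    return status, "", jobs
```

Checking the parsed record count against <JOBCOUNT> is a design choice of this sketch; it catches truncated responses early.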

W.1.1.2.3   Wiki Query Workload Example
request syntax
CMD=GETJOBS ARG=0:ALL

response syntax

ARG=2#nebo3001.0:UPDATETIME=9780000320;STATE=Idle;WCLIMIT=3600;...

W.1.1.2.4   Wiki Query Workload Data Format
NAME | FORMAT | DEFAULT | DESCRIPTION
ACCOUNT | <STRING> | --- | AccountID associated with job
ALLOCSIZE | <INTEGER> | --- | number of application tasks to allocate at each allocation adjustment.
APPBACKLOG | <DOUBLE> | --- | backlogged quantity of workload for associated application (units are opaque), value may be compared against TARGETBACKLOG
APPLOAD | <DOUBLE> | --- | load of workload for associated application (units are opaque), value may be compared against TARGETLOAD
APPRESPONSETIME | <DOUBLE> | --- | response time of workload for associated application (units are opaque), value may be compared against TARGETRESPONSETIME
APPTHROUGHPUT | <DOUBLE> | --- | throughput of workload for associated application (units are opaque), value may be compared against TARGETTHROUGHPUT
ARGS | <STRING> | --- | job command-line arguments
COMMENT | <STRING> | 0 | job resource manager extension arguments including qos, dependencies, reservation constraints, etc
COMPLETETIME* | <EPOCHTIME> | 0 | time job completed execution
DDISK | <INTEGER> | 0 | quantity of local disk space (in MB) which must be dedicated to each task of the job
DGRES | name:value[,name:value] | --- | Dedicated generic resources per task.
DPROCS | <INTEGER> | 1 | number of processors dedicated per task
DNETWORK | <STRING> | --- | network adapter which must be dedicated to job
DSWAP | <INTEGER> | 0 | quantity of virtual memory (swap, in MB) which must be dedicated to each task of the job
ENDDATE | <EPOCHTIME> | [ANY] | time by which job must complete
ENV | <STRING> | --- | job environment variables
EVENT | <EVENT> | --- | event or exception experienced by job
ERROR | <STRING> | --- | file to contain STDERR
EXEC | <STRING> | --- | job executable command
EXITCODE | <INTEGER> | --- | job exit code
FLAGS | <STRING> | --- | job flags
GEOMETRY | <STRING> | --- | String describing task geometry required by job
GNAME* | <STRING> | --- | GroupID under which job will run
HOSTLIST | comma or colon delimited list of hostnames - suffix the hostlist with a caret (^) to mean superset; suffix with an asterisk (*) to mean subset; otherwise, the hostlist is interpreted as an exact set | [ANY] | list of required hosts on which job must run.  (see TASKLIST)
INPUT | <STRING> | --- | file containing STDIN
IWD | <STRING> | --- | job's initial working directory
NAME | <STRING> | --- | User specified name of job
NODERANGE | <INTEGER>[,<INTEGER>] | --- | Minimum and maximum nodes allowed to be allocated to job.
NODES | <INTEGER> | 1 | Number of nodes required by job (See Node Definition for more info)
OUTPUT | <STRING> | --- | file to contain STDOUT
PARTITIONMASK | one or more colon delimited <STRING>s | [ANY] | list of partitions in which job can run
PREF | colon delimited list of <STRING>s | --- | List of preferred node features or variables. (See PREF for more information.)
PRIORITY | <INTEGER> | --- | system priority (absolute or relative - use '+' and '-' to specify relative)
QOS | <INTEGER> | 0 | quality of service requested
QUEUETIME* | <EPOCHTIME> | 0 | time job was submitted to resource manager
RARCH | <STRING> | --- | architecture required by job
RCLASS | list of bracket enclosed <STRING>:<INTEGER> pairs | --- | list of <CLASSNAME>:<COUNT> pairs indicating type and number of class instances required per task.  (ie, '[batch:1]' or '[batch:2][tape:1]')
RDISK | <INTEGER> | 0 | local disk space (in MB) required to be configured on nodes allocated to the job
RDISKCMP | one of '>=', '>', '==', '<', or '<=' | >= | local disk comparison (ie, node must have > 2048 MB local disk)
REJCODE | <INTEGER> | 0 | reason job was rejected
REJCOUNT | <INTEGER> | 0 | number of times job was rejected
REJMESSAGE | <STRING> | --- | text description of reason job was rejected
REQRSV | <STRING> | --- | Name of reservation in which job must run
RESACCESS | <STRING> | --- | List of reservations in which job can run
RFEATURES | colon delimited list of <STRING>s | --- | List of features required on nodes
RMEM | <INTEGER> | 0 | real memory (RAM, in MB) required to be configured on nodes allocated to the job
RMEMCMP | one of '>=', '>', '==', '<', or '<=' | >= | real memory comparison (ie, node must have >= 512MB RAM)
RNETWORK | <STRING> | --- | network adapter required by job
ROPSYS | <STRING> | --- | operating system required by job
RSOFTWARE | <RESTYPE>[{+|:}<COUNT>][@<TIMEFRAME>] | --- | software required by job
RSWAP | <INTEGER> | 0 | virtual memory (swap, in MB) required to be configured on nodes allocated to the job
RSWAPCMP | one of '>=', '>', '==', '<', or '<=' | >= | virtual memory comparison (ie, node must have ==4096 MB virtual memory)
SID | <STRING> | --- | system id (global job system owner)
SJID | <STRING> | --- | system job id (global job id)
STARTDATE | <EPOCHTIME> | 0 | earliest time job should be allowed to start
STARTTIME* | <EPOCHTIME> | 0 | time job was started by the resource manager
STATE* | one of Idle, Running, Hold, Suspended, Completed, or Removed | Idle | State of job
SUSPENDTIME | <INTEGER> | 0 | Number of seconds job has been suspended
TARGETBACKLOG | <DOUBLE>[,<DOUBLE>] | --- | Minimum and maximum backlog for application within job.
TARGETLOAD | <DOUBLE>[,<DOUBLE>] | --- | Minimum and maximum load for application within job.
TARGETRESPONSETIME | <DOUBLE>[,<DOUBLE>] | --- | Minimum and maximum response time for application within job.
TARGETTHROUGHPUT | <DOUBLE>[,<DOUBLE>] | --- | Minimum and maximum throughput for application within job.
TARGETVIOLATIONTIME | <ALLOCATIONTIME>[,<DEALLOCATIONTIME>] where values are specified using the format [[[DD:]HH:]MM:]SS | --- | By default, Moab allocates/deallocates resources as soon as a performance target violation is detected.
TASKLIST | one or more comma-delimited <STRING>s | --- | list of allocated tasks, or in other words, comma-delimited list of node IDs associated with each active task of job (i.e., cl01, cl02, cl01, cl02, cl03)  The tasklist is initially selected by the scheduler at the time the StartJob command is issued.  The resource manager is then responsible for starting the job on these nodes and maintaining this task distribution information throughout the life of the job.  (see HOSTLIST)
TASKS* | <INTEGER> | 1 | Number of tasks required by job (See Task Definition for more info)
TASKPERNODE | <INTEGER> | 0 | exact number of tasks required per node
UNAME* | <STRING> | --- | UserID under which job will run
UPDATETIME* | <EPOCHTIME> | 0 | Time job was last updated
WCLIMIT* | [[HH:]MM:]SS | 864000 | walltime required by job

* indicates required field
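
As an illustration of one of the formats listed above, the sketch below converts a WCLIMIT value in the [[HH:]MM:]SS form to seconds. The function and constant names are hypothetical, and falling back to the listed default of 864000 when no value is supplied is an assumption of the sketch.

```python
DEFAULT_WCLIMIT_SECONDS = 864000  # default listed for WCLIMIT


def wclimit_to_seconds(text=None):
    """Convert a [[HH:]MM:]SS walltime string to a number of seconds."""
    if not text:
        return DEFAULT_WCLIMIT_SECONDS
    parts = text.split(":")
    if len(parts) > 3:
        raise ValueError(f"too many fields in WCLIMIT value: {text!r}")
    seconds = 0
    for part in parts:
        # Each field to the left is worth 60 times the one to its right.
        seconds = seconds * 60 + int(part)
    return seconds
```

With this sketch, the values 3600 and 1:00:00 both denote a one-hour limit.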

Note:  Job states have the following definitions:
Completed: Job has completed
Hold: Job is in the queue but is not allowed to run
Idle: Job is ready to run
Removed: Job has been canceled or otherwise terminated externally
Running: Job is currently executing
Suspended: job has started but execution has temporarily been suspended

Note:  Completed and canceled jobs should be maintained by the resource manager for a brief time, perhaps 1 to 5 minutes, before being purged.  This provides the scheduler time to obtain all final job state information for scheduler statistics.


W.1.1.3   StartJob

   The 'StartJob' command may only be applied to jobs in the 'Idle' state.  It causes the job to begin running using the resources listed in the NodeID list.

    send     CMD=STARTJOB ARG=<JOBID> TASKLIST=<NODEID>[:<NODEID>]...

    receive  SC=<STATUSCODE> RESPONSE=<RESPONSE>

           STATUSCODE >= 0 indicates SUCCESS
           STATUSCODE <  0 indicates FAILURE
           RESPONSE   is a text message possibly further describing an error or state

job start example

# Start job nebo.1 on nodes cluster001 and cluster002
send 'CMD=STARTJOB ARG=nebo.1 TASKLIST=cluster001:cluster002'
receive 'SC=0 RESPONSE=job nebo.1 started with 2 tasks'
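
A minimal sketch of the StartJob exchange follows: one helper builds the command and another interprets the status reply. Both function names are hypothetical, and the reply parser tolerates either a space or a ';' between the SC and RESPONSE fields as an assumption of the sketch.

```python
import re


def build_startjob(job_id, node_ids):
    """Build a STARTJOB command with a colon-delimited TASKLIST."""
    return f"CMD=STARTJOB ARG={job_id} TASKLIST={':'.join(node_ids)}"


def parse_status_response(text):
    """Return (succeeded, status_code, message) for an 'SC=... RESPONSE=...' reply."""
    match = re.match(r"SC=(-?\d+)[ ;]+RESPONSE=(.*)", text.strip(), re.DOTALL)
    if match is None:
        raise ValueError("unrecognized status response")
    status = int(match.group(1))
    # STATUSCODE >= 0 indicates success, STATUSCODE < 0 indicates failure.
    return status >= 0, status, match.group(2)
```

The same reply parser applies to the other job commands in this section, since they share the SC/RESPONSE reply form.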

W.1.1.4   CancelJob

   The 'CancelJob' command, if applied to an active job, will terminate its execution.  If applied to an idle or active job, the CancelJob command will change the job's state to 'Canceled'.

    send     CMD=CANCELJOB ARG=<JOBID> TYPE=<CANCELTYPE>

    <CANCELTYPE> is one of the following:

    ADMIN               (command initiated by scheduler administrator)
    WALLCLOCK           (command initiated by scheduler because job exceeded its specified wallclock limit)

    receive  SC=<STATUSCODE> RESPONSE=<RESPONSE>

           STATUSCODE >= 0 indicates SUCCESS
           STATUSCODE <  0 indicates FAILURE
           RESPONSE   is a text message further describing an error or state

job cancel example

# Cancel job nebo.2
send 'CMD=CANCELJOB ARG=nebo.2 TYPE=ADMIN'
receive 'SC=0 RESPONSE=job nebo.2 canceled'
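
The CancelJob request could be assembled as sketched below, rejecting any cancel type other than the two listed above. The function name is hypothetical, and defaulting to ADMIN is an arbitrary choice made for the sketch.

```python
# The two <CANCELTYPE> values listed in the specification.
CANCEL_TYPES = ("ADMIN", "WALLCLOCK")


def build_canceljob(job_id, cancel_type="ADMIN"):
    """Build a CANCELJOB command, validating the cancel type first."""
    if cancel_type not in CANCEL_TYPES:
        raise ValueError(f"unsupported cancel type: {cancel_type!r}")
    return f"CMD=CANCELJOB ARG={job_id} TYPE={cancel_type}"
```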

W.1.1.5   SuspendJob

   The 'SuspendJob' command can only be issued against a job in the state 'Running'.  This command suspends job execution and results in the job changing to the 'Suspended' state.

    send     CMD=SUSPENDJOB ARG=<JOBID>

    receive  SC=<STATUSCODE> RESPONSE=<RESPONSE>

           STATUSCODE >= 0 indicates SUCCESS
           STATUSCODE <  0 indicates FAILURE
           RESPONSE   is a text message possibly further describing an error or state

job suspend example

# Suspend job nebo.3
send 'CMD=SUSPENDJOB ARG=nebo.3'
receive 'SC=0 RESPONSE=job nebo.3 suspended'

W.1.1.6   ResumeJob

   The 'ResumeJob' command can only be issued against a job in the state 'Suspended'.  This command resumes a suspended job returning it to the 'Running' state.

  send     CMD=RESUMEJOB ARG=<JOBID>

  receive  SC=<STATUSCODE> RESPONSE=<RESPONSE>

           STATUSCODE >= 0 indicates SUCCESS
           STATUSCODE <  0 indicates FAILURE
           RESPONSE   is a text message further describing an error or state

job resume example

# Resume job nebo.3
send 'CMD=RESUMEJOB ARG=nebo.3'
receive 'SC=0 RESPONSE=job nebo.3 resumed'

W.1.1.7   RequeueJob

   The 'RequeueJob' command can only be issued against an active job in the state 'Starting' or 'Running'.  This command requeues the job, stopping execution and returning the job to an idle state in the queue.  The requeued job will be eligible for execution the next time resources are available.

  send     CMD=REQUEUEJOB ARG=<JOBID>

  receive  SC=<STATUSCODE> RESPONSE=<RESPONSE>

           STATUSCODE >= 0 indicates SUCCESS
           STATUSCODE <  0 indicates FAILURE
           RESPONSE   is a text message further describing an error or state

job requeue example

# Requeue job nebo.3
send 'CMD=REQUEUEJOB ARG=nebo.3'
receive 'SC=0 RESPONSE=job nebo.3 requeued'

W.1.1.8   SignalJob

   The 'SignalJob' command can only be issued against an active job in the state 'Starting' or 'Running'.  This command signals the job, sending the specified signal to the master process.  The signalled job will remain in the same state it was in before the signal was issued.

  send     CMD=SIGNALJOB ARG=<JOBID> ACTION=signal VALUE=<SIGNAL>

  receive  SC=<STATUSCODE> RESPONSE=<RESPONSE>

           STATUSCODE >= 0 indicates SUCCESS
           STATUSCODE <  0 indicates FAILURE
           RESPONSE   is a text message further describing an error or state

job signal example

# Signal job nebo.3
send 'CMD=SIGNALJOB ARG=nebo.3 ACTION=signal VALUE=13'
receive 'SC=0 RESPONSE=job nebo.3 signalled'
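
The state preconditions stated for StartJob, SuspendJob, ResumeJob, RequeueJob, and SignalJob can be collected into one lookup, sketched below together with a SignalJob request builder. All names here are hypothetical; commands without a stated precondition are treated as always allowed, which is an assumption of the sketch.

```python
# Job states in which each command may be issued, as stated in the text.
COMMAND_REQUIRED_STATES = {
    "STARTJOB": {"Idle"},
    "SUSPENDJOB": {"Running"},
    "RESUMEJOB": {"Suspended"},
    "REQUEUEJOB": {"Starting", "Running"},
    "SIGNALJOB": {"Starting", "Running"},
}


def command_allowed(command, job_state):
    """Check a job state against the precondition recorded for a command."""
    allowed = COMMAND_REQUIRED_STATES.get(command)
    return allowed is None or job_state in allowed


def build_signaljob(job_id, signal_number):
    """Build a SIGNALJOB command carrying the signal number in VALUE."""
    return f"CMD=SIGNALJOB ARG={job_id} ACTION=signal VALUE={signal_number}"
```

A caller might check command_allowed("SIGNALJOB", state) before sending the string returned by build_signaljob.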

W.1.1.9   ModifyJob

   The 'ModifyJob' command can be issued against any active or queued job.  This command modifies specified attributes of the job.

  send     CMD=MODIFYJOB ARG=<JOBID> [BANK=name] [NODES=num] [PARTITION=name] [TIMELIMIT=minutes]

  receive  SC=<STATUSCODE> RESPONSE=<RESPONSE>

           STATUSCODE >= 0 indicates SUCCESS
           STATUSCODE <  0 indicates FAILURE
           RESPONSE   is a text message further describing an error or state

job modify example

# Modify the time limit of job nebo.3
send 'CMD=MODIFYJOB ARG=nebo.3 TIMELIMIT=9600'
receive 'SC=0 RESPONSE=job nebo.3 modified'
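
Since every ModifyJob attribute is optional, a request builder only needs to emit the attributes that were supplied. The sketch below shows one way to do that; the function name and keyword parameters are hypothetical, and values are assumed to need no escaping.

```python
def build_modifyjob(job_id, bank=None, nodes=None, partition=None, timelimit=None):
    """Build a MODIFYJOB command containing only the attributes that were given."""
    parts = [f"CMD=MODIFYJOB ARG={job_id}"]
    optional = (
        ("BANK", bank),
        ("NODES", nodes),
        ("PARTITION", partition),
        ("TIMELIMIT", timelimit),
    )
    for name, value in optional:
        if value is not None:
            parts.append(f"{name}={value}")
    return " ".join(parts)
```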

W.1.1.10   JobAddTask

   The 'JobAddTask' command allocates additional tasks to an active job.

    send

        CMD=JOBADDTASK ARG=<JOBID> <NODEID> [<NODEID>]...

    receive

        SC=<STATUSCODE> RESPONSE=<RESPONSE>

           STATUSCODE >= 0 indicates SUCCESS
           STATUSCODE <  0 indicates FAILURE
           RESPONSE   is a text message possibly further describing an error or state

job addtask example

# Add 3 default tasks to job nebo30023.0 using resources located on nodes cluster002, cluster016, and cluster112.
send 'CMD=JOBADDTASK ARG=nebo30023.0 DEFAULT cluster002 cluster016 cluster112'
receive 'SC=0 RESPONSE=3 tasks added'

W.1.1.11   JobRemoveTask

   The 'JobRemoveTask' command removes tasks from an active job.

    send

        CMD=JOBREMOVETASK ARG=<JOBID> <TASKID> [<TASKID>]...

    receive

        SC=<STATUSCODE> RESPONSE=<RESPONSE>

           STATUSCODE >= 0 indicates SUCCESS
           STATUSCODE <  0 indicates FAILURE
           RESPONSE   is a text message further describing an error or state

job removetask example

# Free resources allocated to tasks 14, 15, and 16 of job nebo30023.0
send 'CMD=JOBREMOVETASK ARG=nebo30023.0 14 15 16'
receive 'SC=0 RESPONSE=3 tasks removed'
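
Unlike the colon-delimited lists used elsewhere, the JobRemoveTask request separates the job ID and task IDs with spaces. A small sketch of a builder follows; the function name is hypothetical, and requiring at least one task ID reflects the request format shown above.

```python
def build_jobremovetask(job_id, task_ids):
    """Build a JOBREMOVETASK command with space-separated task IDs."""
    if not task_ids:
        raise ValueError("at least one task ID is required")
    fields = [job_id] + [str(task_id) for task_id in task_ids]
    return "CMD=JOBREMOVETASK ARG=" + " ".join(fields)
```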

W.1.2   Rejection Codes

  • 0xx - success - no error
    • 00x - success
      • 000 - success
    • 01x - usage/help reply
      • 010 - usage/help reply
    • 02x - status reply
      • 020 - general status reply
  • 1xx - warning
    • 10x - general warning
      • 100 - general warning
    • 11x - wire protocol or network warning
      • 110 - general wire protocol or network warning
      • 112 - redirect
      • 114 - protocol warning
    • 12x - message format warning
      • 120 - general message format warning
      • 122 - incomplete specification (best guess action/response applied)
    • 13x - security warning
      • 130 - general security warning
      • 132 - insecure request
      • 134 - insufficient privileges (response was censored/action reduced in scope)
    • 14x - content or action warning
      • 140 - general content/action warning
      • 142 - no content (server has processed the request but there is no data to be returned)
      • 144 - no action (no object to act upon)
      • 146 - partial content
      • 148 - partial action
    • 15x - component defined
    • 18x - application defined
  • 2xx - wire protocol/network failure
    • 20x - protocol failure
      • 200 - general protocol/network failure
    • 21x - network failure
      • 210 - general network failure
      • 212 - cannot resolve host
      • 214 - cannot resolve port
      • 216 - cannot create socket
      • 218 - cannot bind socket
    • 22x - connection failure
      • 220 - general connection failure
      • 222 - cannot connect to service
      • 224 - cannot send data
      • 226 - cannot receive data
    • 23x - connection rejected
      • 230 - general connection failure
      • 232 - connection timed-out
      • 234 - connection rejected - too busy
      • 236 - connection rejected - message too big
    • 24x - malformed framing
      • 240 - general framing failure
      • 242 - malformed framing protocol
      • 244 - invalid message size
      • 246 - unexpected end of file
    • 25x - component defined
    • 28x - application defined
  • 3xx - messaging format error
    • 30x - general messaging format error
      • 300 - general messaging format error
    • 31x - malformed XML document
      • 310 - general malformed XML error
    • 32x - XML schema validation error
      • 320 - general XML schema validation
    • 33x - general syntax error in request
      • 330 - general syntax error in request
      • 332 - object incorrectly specified
      • 334 - action incorrectly specified
      • 336 - option/parameter incorrectly specified
    • 34x - general syntax error in response
      • 340 - general response syntax error
      • 342 - object incorrectly specified
      • 344 - action incorrectly specified
      • 346 - option/parameter incorrectly specified
    • 35x - synchronization failure
      • 350 - general synchronization failure
      • 352 - request identifier is not unique
      • 354 - request id values do not match
      • 356 - request id count does not match
  • 4xx - security error occurred
    • 40x - authentication failure - client signature
      • 400 - general client signature failure
      • 402 - invalid authentication type
      • 404 - cannot generate security token key - inadequate information
      • 406 - cannot canonicalize request
      • 408 - cannot sign request
    • 41x - negotiation failure
      • 410 - general negotiation failure
      • 412 - negotiation request malformed
      • 414 - negotiation request not understood
      • 416 - negotiation request not supported
    • 42x - authentication failure
      • 420 - general authentication failure
      • 422 - client signature failure
      • 424 - server authentication failure
      • 426 - server signature failure
      • 428 - client authentication failure
    • 43x - encryption failure
      • 430 - general encryption failure
      • 432 - client encryption failure
      • 434 - server decryption failure
      • 436 - server encryption failure
      • 438 - client decryption failure
    • 44x - authorization failure
      • 440 - general authorization failure
      • 442 - client authorization failure
      • 444 - server authorization failure
    • 45x - component defined failure
    • 48x - application defined failure
  • 5xx - event management request failure
    • 50x - reserved
      • 500 - reserved
  • 6xx - reserved for future use
    • 60x - reserved
      • 600 - reserved
  • 7xx - server side error occurred
    • 70x - server side error
      • 700 - general server side error
    • 71x - server does not support requested function
      • 710 - server does not support requested function
    • 72x - internal server error
      • 720 - general internal server error
    • 73x - resource unavailable
      • 730 - general resource unavailable error
      • 732 - software resource unavailable error
      • 734 - hardware resource unavailable error
    • 74x - request violates policy
      • 740 - general policy violation
    • 75x - component-defined failure
    • 78x - application-defined failure
  • 8xx - client side error occurred
    • 80x - general client side error
      • 800 - general client side error
    • 81x - request not supported
      • 810 - request not supported
    • 82x - application specific failure
      • 820 - general application specific failure
  • 9xx - miscellaneous
    • 90x - general miscellaneous error
      • 900 - general miscellaneous error
    • 91x - general insufficient resources error
      • 910 - general insufficient resources error
    • 99x - general unknown error
      • 999 - unknown error
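
Because the leading digit of a rejection code selects its top-level class, a code can be classified with integer division. The sketch below uses the class descriptions from the list above; the dictionary and function names are hypothetical.

```python
# Top-level classes, keyed by the leading digit of the three-digit code.
REJECTION_CLASSES = {
    0: "success - no error",
    1: "warning",
    2: "wire protocol/network failure",
    3: "messaging format error",
    4: "security error occurred",
    5: "event management request failure",
    6: "reserved for future use",
    7: "server side error occurred",
    8: "client side error occurred",
    9: "miscellaneous",
}


def classify_rejection_code(code):
    """Return the top-level class description for a code in the range 0 to 999."""
    if not 0 <= code <= 999:
        raise ValueError(f"rejection code out of range: {code}")
    return REJECTION_CLASSES[code // 100]
```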