Job Exit Status

Once a job under TORQUE has completed, the exit_status attribute contains the result code returned by the job script. To see this attribute, run qstat -f, which shows the entire set of information associated with a job. The exit_status field appears near the bottom of the output.

Example 4-23: qstat -f (job failure)

Job Id: 179.host
    Job_Name = STDIN
    Job_Owner = user@host
    job_state = C
    queue = batchq
    server = host
    Checkpoint = u
    ctime = Fri Aug 29 14:55:55 2008
    Error_Path = host:/opt/moab/STDIN.e179
    exec_host = node1/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Fri Aug 29 14:55:55 2008
    Output_Path = host:/opt/moab/STDIN.o179
    Priority = 0
    qtime = Fri Aug 29 14:55:55 2008
    Rerunable = True
    Resource_List.ncpus = 2
    Resource_List.nodect = 1
    Resource_List.nodes = node1
    Variable_List = PBS_O_HOME=/home/user,PBS_O_LOGNAME=user,
        PBS_O_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:,PBS_O_SHELL=/bin/bash,PBS_O_HOST=host,
        PBS_O_WORKDIR=/opt/moab,PBS_O_QUEUE=batchq
    sched_hint = Post job file processing error; job 179.host on host node1/0Ba
        d UID for job execution REJHOST=pala.cridomain MSG=cannot find user 'user' in password file
    etime = Fri Aug 29 14:55:55 2008
    exit_status = -1

The value of Resource_List.* is the amount of resources requested.

The exit_status code can be useful in diagnosing problems with jobs that terminated unexpectedly.
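
As a minimal sketch of pulling this one field out of the full listing, the shell function below reads qstat -f output on standard input and prints the exit_status value. The name get_exit_status is hypothetical and not part of TORQUE, and the sketch assumes the attribute appears on its own line in the "name = value" form shown in Example 4-23.

```shell
# Hypothetical helper (not part of TORQUE): print the exit_status value
# found in `qstat -f` output supplied on standard input.
get_exit_status() {
    # Find the line whose first field is exit_status, print the value
    # after the "=" sign, and stop reading.
    awk '$1 == "exit_status" && $2 == "=" { print $3; exit }'
}

# Typical use on a cluster (job ID taken from Example 4-23):
#   qstat -f 179.host | get_exit_status
```

If the job has no exit_status yet (for example, it is still running), the function prints nothing.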

If TORQUE was unable to start the job, this field will contain a negative number produced by the pbs_mom. Otherwise, if the job script was successfully started, the value in this field will be the return value of the script.

Example 4-24: TORQUE supplied exit codes

Name                      Value  Description
JOB_EXEC_OK                   0  Job execution successful
JOB_EXEC_FAIL1               -1  Job execution failed, before files, no retry
JOB_EXEC_FAIL2               -2  Job execution failed, after files, no retry
JOB_EXEC_RETRY               -3  Job execution failed, do retry
JOB_EXEC_INITABT             -4  Job aborted on MOM initialization
JOB_EXEC_INITRST             -5  Job aborted on MOM init, chkpt, no migrate
JOB_EXEC_INITRMG             -6  Job aborted on MOM init, chkpt, ok migrate
JOB_EXEC_BADRESRT            -7  Job restart failed
JOB_EXEC_CMDFAIL             -8  Exec() of user command failed
JOB_EXEC_STDOUTFAIL          -9  Could not create/open stdout stderr files
JOB_EXEC_OVERLIMIT_MEM      -10  Job exceeded a memory limit
JOB_EXEC_OVERLIMIT_WT       -11  Job exceeded a walltime limit
JOB_EXEC_OVERLIMIT_CPUT     -12  Job exceeded a CPU time limit
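
The negative-versus-non-negative rule and the codes in Example 4-24 can be sketched as a small lookup. The function name describe_exit_status is hypothetical and not part of TORQUE. The sketch covers only the codes listed in this table and reports any other negative value as unlisted rather than guessing at its meaning.

```shell
# Hypothetical helper (not part of TORQUE): describe an exit_status value.
# Zero and the negative values are looked up among the TORQUE-supplied
# codes in Example 4-24; any other value is treated as the job script's
# own return value.
describe_exit_status() {
    case "$1" in
        0)   echo "JOB_EXEC_OK: job execution successful" ;;
        -1)  echo "JOB_EXEC_FAIL1: job execution failed, before files, no retry" ;;
        -2)  echo "JOB_EXEC_FAIL2: job execution failed, after files, no retry" ;;
        -3)  echo "JOB_EXEC_RETRY: job execution failed, do retry" ;;
        -4)  echo "JOB_EXEC_INITABT: job aborted on MOM initialization" ;;
        -5)  echo "JOB_EXEC_INITRST: job aborted on MOM init, chkpt, no migrate" ;;
        -6)  echo "JOB_EXEC_INITRMG: job aborted on MOM init, chkpt, ok migrate" ;;
        -7)  echo "JOB_EXEC_BADRESRT: job restart failed" ;;
        -8)  echo "JOB_EXEC_CMDFAIL: exec() of user command failed" ;;
        -9)  echo "JOB_EXEC_STDOUTFAIL: could not create/open stdout stderr files" ;;
        -10) echo "JOB_EXEC_OVERLIMIT_MEM: job exceeded a memory limit" ;;
        -11) echo "JOB_EXEC_OVERLIMIT_WT: job exceeded a walltime limit" ;;
        -12) echo "JOB_EXEC_OVERLIMIT_CPUT: job exceeded a CPU time limit" ;;
        -*)  echo "unlisted negative code $1" ;;
        *)   echo "script returned $1" ;;
    esac
}

describe_exit_status -1    # the job failure in Example 4-23
```

Note that a walltime overrun (-11) and a script that itself returns 11 are different outcomes; the sign of the value tells them apart.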

Example 4-25: Exit code from C program

$ cat error.c

#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
   /* 256+11 = 267; only the low-order byte (11) reaches the parent */
   exit(256+11);
}

$ gcc -o error error.c

$ echo ./error | qsub
180.xxx.yyy

$ qstat -f

Job Id: 180.xxx.yyy
    Job_Name = STDIN
    Job_Owner = test.xxx.yyy
    resources_used.cput = 00:00:00
    resources_used.mem = 0kb
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:00
    job_state = C
    queue = batch
    server = xxx.yyy
    Checkpoint = u
    ctime = Wed Apr 30 11:29:37 2008
    Error_Path = xxx.yyy:/home/test/STDIN.e180
    exec_host = node01/0
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Wed Apr 30 11:29:37 2008
    Output_Path = xxx.yyy:/home/test/STDIN.o180
    Priority = 0
    qtime = Wed Apr 30 11:29:37 2008
    Rerunable = True
    Resource_List.neednodes = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 01:00:00
    session_id = 14107
    substate = 59
    Variable_List = PBS_O_HOME=/home/test,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=test,
        PBS_O_PATH=/usr/local/perltests/bin:/home/test/bin:/usr/local/s
        bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games,
        PBS_O_SHELL=/bin/bash,PBS_SERVER=xxx.yyy,
        PBS_O_HOST=xxx.yyy,PBS_O_WORKDIR=/home/test,
        PBS_O_QUEUE=batch
    euser = test
    egroup = test
    hashname = 180.xxx.yyy
    queue_rank = 8
    queue_type = E
    comment = Job started on Wed Apr 30 at 11:29
    etime = Wed Apr 30 11:29:37 2008
    exit_status = 11
    start_time = Wed Apr 30 11:29:37 2008
    start_count = 1

Notice that only the low-order byte of the argument passed to the C routine exit is returned as the exit code. In this case, 256+11 is 267, but the resulting exit code is only 11 (267 modulo 256), as seen in the output.
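
The truncation can be checked with plain shell arithmetic. This is a sketch only, and the function name visible_exit_code is hypothetical: it keeps the low-order 8 bits of a value, which is the part of the exit argument that the parent process sees.

```shell
# Hypothetical helper: compute the exit code a parent process observes
# when a program calls exit() with the given value. Only the low-order
# byte (the value AND 255) survives.
visible_exit_code() {
    echo $(( $1 & 255 ))
}

visible_exit_code 267    # prints 11, matching exit_status in Example 4-25
```

Any two values that differ by a multiple of 256 are therefore indistinguishable in exit_status, so job scripts should keep their return values in the range 0 to 255.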

© 2014 Adaptive Computing