Once a job under TORQUE has completed, the exit_status attribute contains the result code returned by the job script. To see this attribute, run qstat -f, which shows the entire set of information associated with a job. The exit_status field appears near the bottom of the output.
    Job Id: 179.host
        Job_Name = STDIN
        Job_Owner = user@host
        job_state = C
        queue = batchq
        server = host
        Checkpoint = u
        ctime = Fri Aug 29 14:55:55 2008
        Error_Path = host:/opt/moab/STDIN.e179
        exec_host = node1/0
        Hold_Types = n
        Join_Path = n
        Keep_Files = n
        Mail_Points = a
        mtime = Fri Aug 29 14:55:55 2008
        Output_Path = host:/opt/moab/STDIN.o179
        Priority = 0
        qtime = Fri Aug 29 14:55:55 2008
        Rerunable = True
        Resource_List.ncpus = 2
        Resource_List.nodect = 1
        Resource_List.nodes = node1
        Variable_List = PBS_O_HOME=/home/user,PBS_O_LOGNAME=user,
            PBS_O_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:,PBS_O_SHELL=/bin/bash,PBS_O_HOST=host,
            PBS_O_WORKDIR=/opt/moab,PBS_O_QUEUE=batchq
        sched_hint = Post job file processing error; job 179.host on host node1/0Bad UID
            for job execution REJHOST=pala.cridomain MSG=cannot find user 'user' in password file
        etime = Fri Aug 29 14:55:55 2008
        exit_status = -1
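When many jobs need checking, the exit_status field can be pulled out of the qstat -f text by a script. The following is a minimal Python sketch, not part of TORQUE; the function name `parse_exit_status` and the shortened sample text are assumptions made for illustration.

```python
import re
from typing import Optional

# Matches "exit_status = <integer>" anywhere in the text, including negative values.
_EXIT_STATUS_RE = re.compile(r"\bexit_status\s*=\s*(-?\d+)")


def parse_exit_status(qstat_output: str) -> Optional[int]:
    """Return the exit_status value found in `qstat -f` text, or None if absent."""
    match = _EXIT_STATUS_RE.search(qstat_output)
    return int(match.group(1)) if match else None


# Shortened, hypothetical sample modeled on the output shown above.
sample = """Job Id: 179.host
    Job_Name = STDIN
    job_state = C
    etime = Fri Aug 29 14:55:55 2008
    exit_status = -1
"""

print(parse_exit_status(sample))  # -1
```

The function returns None for a job that has not completed yet, since qstat -f shows no exit_status line in that case. In practice the text would come from capturing the output of qstat -f for a given job rather than from a literal string.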
This code can be useful in diagnosing problems with jobs that may have unexpectedly terminated.
If TORQUE was unable to start the job, this field will contain a negative number produced by the pbs_mom.
Otherwise, if the job script was successfully started, the value in this field will be the return value of the script.
TORQUE Supplied Exit Codes

Name | Value | Description |
---|---|---|
JOB_EXEC_OK | 0 | job exec successful |
JOB_EXEC_FAIL1 | -1 | job exec failed, before files, no retry |
JOB_EXEC_FAIL2 | -2 | job exec failed, after files, no retry |
JOB_EXEC_RETRY | -3 | job execution failed, do retry |
JOB_EXEC_INITABT | -4 | job aborted on MOM initialization |
JOB_EXEC_INITRST | -5 | job aborted on MOM init, chkpt, no migrate |
JOB_EXEC_INITRMG | -6 | job aborted on MOM init, chkpt, ok migrate |
JOB_EXEC_BADRESRT | -7 | job restart failed |
JOB_EXEC_CMDFAIL | -8 | exec() of user command failed |
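The rule above (a negative value comes from pbs_mom, anything else is the script's own return value) can be sketched as a small lookup. This is an illustrative Python sketch only: the dictionary copies the table, while the names `TORQUE_EXIT_CODES` and `describe_exit_status` and the wording of the returned messages are assumptions, not part of TORQUE.

```python
# Names and descriptions copied from the table of TORQUE supplied exit codes.
TORQUE_EXIT_CODES = {
    0: ("JOB_EXEC_OK", "job exec successful"),
    -1: ("JOB_EXEC_FAIL1", "job exec failed, before files, no retry"),
    -2: ("JOB_EXEC_FAIL2", "job exec failed, after files, no retry"),
    -3: ("JOB_EXEC_RETRY", "job execution failed, do retry"),
    -4: ("JOB_EXEC_INITABT", "job aborted on MOM initialization"),
    -5: ("JOB_EXEC_INITRST", "job aborted on MOM init, chkpt, no migrate"),
    -6: ("JOB_EXEC_INITRMG", "job aborted on MOM init, chkpt, ok migrate"),
    -7: ("JOB_EXEC_BADRESRT", "job restart failed"),
    -8: ("JOB_EXEC_CMDFAIL", "exec() of user command failed"),
}


def describe_exit_status(status: int) -> str:
    """Explain an exit_status value: negative means TORQUE could not run the job."""
    if status < 0:
        name, text = TORQUE_EXIT_CODES.get(
            status, ("UNKNOWN", "negative code not listed in the table")
        )
        return f"TORQUE failure {name}: {text}"
    # Zero or positive: the job script started, so this is its own return value.
    return f"job script returned {status}"


print(describe_exit_status(-1))  # TORQUE failure JOB_EXEC_FAIL1: job exec failed, before files, no retry
print(describe_exit_status(11))  # job script returned 11
```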
    $ cat error.c
    #include <stdio.h>
    #include <stdlib.h>
    int main(int argc, char *argv[])
    {
        exit(256+11);
    }
    $ gcc -o error error.c
    $ echo ./error | qsub
    180.xxx.yyy
    $ qstat -f
    Job Id: 180.xxx.yyy
        Job_Name = STDIN
        Job_Owner = test.xxx.yyy
        resources_used.cput = 00:00:00
        resources_used.mem = 0kb
        resources_used.vmem = 0kb
        resources_used.walltime = 00:00:00
        job_state = C
        queue = batch
        server = xxx.yyy
        Checkpoint = u
        ctime = Wed Apr 30 11:29:37 2008
        Error_Path = xxx.yyy:/home/test/STDIN.e180
        exec_host = node01/0
        Hold_Types = n
        Join_Path = n
        Keep_Files = n
        Mail_Points = a
        mtime = Wed Apr 30 11:29:37 2008
        Output_Path = xxx.yyy:/home/test/STDIN.o180
        Priority = 0
        qtime = Wed Apr 30 11:29:37 2008
        Rerunable = True
        Resource_List.neednodes = 1
        Resource_List.nodect = 1
        Resource_List.nodes = 1
        Resource_List.walltime = 01:00:00
        session_id = 14107
        substate = 59
        Variable_List = PBS_O_HOME=/home/test,PBS_O_LANG=en_US.UTF-8,
            PBS_O_LOGNAME=test,
            PBS_O_PATH=/usr/local/perltests/bin:/home/test/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games,
            PBS_O_SHELL=/bin/bash,PBS_SERVER=xxx.yyy,
            PBS_O_HOST=xxx.yyy,PBS_O_WORKDIR=/home/test,
            PBS_O_QUEUE=batch
        euser = test
        egroup = test
        hashname = 180.xxx.yyy
        queue_rank = 8
        queue_type = E
        comment = Job started on Wed Apr 30 at 11:29
        etime = Wed Apr 30 11:29:37 2008
        exit_status = 11
        start_time = Wed Apr 30 11:29:37 2008
        start_count = 1
Notice that only the low-order byte of the argument passed to the C routine exit reaches the parent process. In this case, 256+11 is 267, but the resulting exit code is 267 mod 256 = 11, as seen in the exit_status field of the output.
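The truncation is plain arithmetic on the low-order eight bits. A minimal Python sketch of it follows; the helper names `low_order_byte` and `child_exit_status` are hypothetical, and the second helper assumes a POSIX system such as Linux, where a child's exit status is reported to the parent as a single byte.

```python
import subprocess
import sys


def low_order_byte(status: int) -> int:
    """Keep only the low-order eight bits of an exit argument."""
    return status & 0xFF


def child_exit_status(status: int) -> int:
    """Run a child Python process that exits with `status`; return what the parent observes."""
    result = subprocess.run(
        [sys.executable, "-c", f"import sys; sys.exit({status})"]
    )
    return result.returncode


print(low_order_byte(256 + 11))  # 11
print(low_order_byte(256))       # 0
```

Note the second case: a program that calls exit(256) is reported as 0, which is indistinguishable from success. On a POSIX system, `child_exit_status(256 + 11)` is expected to agree with `low_order_byte(256 + 11)`.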