Once a job under TORQUE has completed, the exit_status attribute contains the result code returned by the job script. To see this attribute, run qstat -f, which shows the full set of information associated with a job. The exit_status field appears near the bottom of the output.
Example 2-1: qstat -f (job failure)
Job Id: 179.host
Job_Name = STDIN
Job_Owner = user@host
job_state = C
queue = batchq
server = host
Checkpoint = u
ctime = Fri Aug 29 14:55:55 2008
Error_Path = host:/opt/moab/STDIN.e179
exec_host = node1/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Fri Aug 29 14:55:55 2008
Output_Path = host:/opt/moab/STDIN.o179
Priority = 0
qtime = Fri Aug 29 14:55:55 2008
Rerunable = True
Resource_List.ncpus = 2
Resource_List.nodect = 1
Resource_List.nodes = node1
Variable_List = PBS_O_HOME=/home/user,PBS_O_LOGNAME=user,
PBS_O_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:,PBS_O_SHELL=/bin/bash,PBS_O_HOST=host,
PBS_O_WORKDIR=/opt/moab,PBS_O_QUEUE=batchq
sched_hint = Post job file processing error; job 179.host on host node1/0Ba
d UID for job execution REJHOST=pala.cridomain MSG=cannot find user 'user' in password file
etime = Fri Aug 29 14:55:55 2008
exit_status = -1
This code can be useful in diagnosing problems with jobs that may have unexpectedly terminated.
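When only the exit code is of interest, the full qstat -f listing can be filtered down to that one field. The following is a minimal sketch, not part of TORQUE: the function name extract_exit_status is hypothetical, and it assumes the "name = value" line layout shown in the example above.

```shell
# Hypothetical helper: read `qstat -f` output on stdin and print
# the value of the exit_status attribute, if present.
extract_exit_status() {
    awk -F' = ' '$1 ~ /^[ \t]*exit_status$/ { print $2; exit }'
}

# Demonstration on sample lines taken from Example 2-1.
printf '%s\n' \
    'job_state = C' \
    'etime = Fri Aug 29 14:55:55 2008' \
    'exit_status = -1' | extract_exit_status   # prints -1
```

On a live system the same function could be fed from the real command, for example qstat -f 179.host | extract_exit_status, assuming qstat is on the PATH and the server still holds the completed job.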
If TORQUE was unable to start the job, this field will contain a negative number produced by the pbs_mom. Otherwise, if the job script was successfully started, the value in this field will be the return value of the script.
Example 2-2: TORQUE supplied exit codes
Name | Value | Description |
---|---|---|
JOB_EXEC_OK | 0 | Job execution successful |
JOB_EXEC_FAIL1 | -1 | Job execution failed, before files, no retry |
JOB_EXEC_FAIL2 | -2 | Job execution failed, after files, no retry |
JOB_EXEC_RETRY | -3 | Job execution failed, do retry |
JOB_EXEC_INITABT | -4 | Job aborted on MOM initialization |
JOB_EXEC_INITRST | -5 | Job aborted on MOM init, chkpt, no migrate |
JOB_EXEC_INITRMG | -6 | Job aborted on MOM init, chkpt, ok migrate |
JOB_EXEC_BADRESRT | -7 | Job restart failed |
JOB_EXEC_CMDFAIL | -8 | Exec() of user command failed |
JOB_EXEC_STDOUTFAIL | -9 | |
JOB_EXEC_OVERLIMIT | -10 | |
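The rule above (negative values come from pbs_mom, anything else is the script's own return value) can be sketched as a small lookup. This is an illustrative helper only: the function name describe_exit_status is hypothetical, and the names and descriptions are copied from the table, so the two codes the table leaves undescribed are reported by name alone.

```shell
# Hypothetical helper: translate an exit_status value into the
# TORQUE-supplied name from the table, or report it as the
# job script's own return value.
describe_exit_status() {
    case "$1" in
        0)   echo "JOB_EXEC_OK: Job execution successful" ;;
        -1)  echo "JOB_EXEC_FAIL1: Job execution failed, before files, no retry" ;;
        -2)  echo "JOB_EXEC_FAIL2: Job execution failed, after files, no retry" ;;
        -3)  echo "JOB_EXEC_RETRY: Job execution failed, do retry" ;;
        -4)  echo "JOB_EXEC_INITABT: Job aborted on MOM initialization" ;;
        -5)  echo "JOB_EXEC_INITRST: Job aborted on MOM init, chkpt, no migrate" ;;
        -6)  echo "JOB_EXEC_INITRMG: Job aborted on MOM init, chkpt, ok migrate" ;;
        -7)  echo "JOB_EXEC_BADRESRT: Job restart failed" ;;
        -8)  echo "JOB_EXEC_CMDFAIL: Exec() of user command failed" ;;
        -9)  echo "JOB_EXEC_STDOUTFAIL" ;;
        -10) echo "JOB_EXEC_OVERLIMIT" ;;
        -*)  echo "negative code $1 not listed in the table" ;;
        *)   echo "return value of the job script: $1" ;;
    esac
}

describe_exit_status -1
describe_exit_status 11
```

Applied to the two jobs in this section, the first call corresponds to job 179 (TORQUE could not start the job) and the second to job 180 (the script ran and returned 11).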
Example 2-3: Exit code from C program
$ cat error.c
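#include <stdio.h>
#include <stdlib.h>
int
main(int argc, char *argv[])
{
    /* 256+11 = 267; only the low-order byte (11) is reported as the exit code */
    exit(256+11);
}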
$ gcc -o error error.c
$ echo ./error | qsub
180.xxx.yyy
$ qstat -f
Job Id: 180.xxx.yyy
Job_Name = STDIN
Job_Owner = test@xxx.yyy
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
job_state = C
queue = batch
server = xxx.yyy
Checkpoint = u
ctime = Wed Apr 30 11:29:37 2008
Error_Path = xxx.yyy:/home/test/STDIN.e180
exec_host = node01/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Wed Apr 30 11:29:37 2008
Output_Path = xxx.yyy:/home/test/STDIN.o180
Priority = 0
qtime = Wed Apr 30 11:29:37 2008
Rerunable = True
Resource_List.neednodes = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
Resource_List.walltime = 01:00:00
session_id = 14107
substate = 59
Variable_List = PBS_O_HOME=/home/test,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=test,
PBS_O_PATH=/usr/local/perltests/bin:/home/test/bin:/usr/local/s
bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games,
PBS_O_SHELL=/bin/bash,PBS_SERVER=xxx.yyy,
PBS_O_HOST=xxx.yyy,PBS_O_WORKDIR=/home/test,
PBS_O_QUEUE=batch
euser = test
egroup = test
hashname = 180.xxx.yyy
queue_rank = 8
queue_type = E
comment = Job started on Wed Apr 30 at 11:29
etime = Wed Apr 30 11:29:37 2008
exit_status = 11
start_time = Wed Apr 30 11:29:37 2008
start_count = 1
Notice that only the low-order byte of the argument passed to the C routine exit is reported as the exit code. In this case, 256+11 is 267, but the resulting exit code is only 11 (267 modulo 256), as seen in the output.
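The arithmetic behind this truncation can be checked directly in the shell. The sketch below masks a value to its low-order byte; the function name low_order_byte is hypothetical and is used here only for illustration.

```shell
# Hypothetical helper: keep only the low-order 8 bits of a value,
# mirroring what is reported for an exit() argument.
low_order_byte() {
    echo $(( $1 & 255 ))
}

low_order_byte $(( 256 + 11 ))   # prints 11
```

The same reasoning means that any value passed to exit that is a multiple of 256 would show up as 0, so a job script should keep its return values in the range 0 to 255 if they are meant to be read back from exit_status.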