Once a job under TORQUE has completed, the exit_status attribute will contain the result code returned by the job script. This attribute can be seen by submitting a qstat -f command to show the entire set of information associated with a job. The exit_status field is found near the bottom of the set of output lines.
Example 2-14: qstat -f (job failure)
Job Id: 179.host
Job_Name = STDIN
Job_Owner = user@host
job_state = C
queue = batchq
server = host
Checkpoint = u
ctime = Fri Aug 29 14:55:55 2008
Error_Path = host:/opt/moab/STDIN.e179
exec_host = node1/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Fri Aug 29 14:55:55 2008
Output_Path = host:/opt/moab/STDIN.o179
Priority = 0
qtime = Fri Aug 29 14:55:55 2008
Rerunable = True
Resource_List.ncpus = 2
Resource_List.nodect = 1
Resource_List.nodes = node1
Variable_List = PBS_O_HOME=/home/user,PBS_O_LOGNAME=user,
PBS_O_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:,PBS_O_SHELL=/bin/bash,PBS_O_HOST=host,
PBS_O_WORKDIR=/opt/moab,PBS_O_QUEUE=batchq
sched_hint = Post job file processing error; job 179.host on host node1/0Ba
d UID for job execution REJHOST=pala.cridomain MSG=cannot find user 'user' in password file
etime = Fri Aug 29 14:55:55 2008
exit_status = -1
The value of Resource_List.* is the amount of resources requested.
The exit_status code can be useful in diagnosing problems with jobs that terminated unexpectedly.
If TORQUE was unable to start the job, this field will contain a negative number produced by the pbs_mom. Otherwise, if the job script was successfully started, the value in this field will be the return value of the script.
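As a rough illustration of that rule, the following C sketch pulls the exit_status value out of a single line of qstat -f output and reports whether it came from pbs_mom or from the job script. The function names parse_exit_status and exit_status_origin are hypothetical helpers, not part of TORQUE, and the sketch assumes the attribute appears on its own line in the "exit_status = N" form shown in the qstat -f output.

```c
#include <stdio.h>

/* Hypothetical helper (not a TORQUE API): scan one line of `qstat -f`
 * output for "exit_status = N". Returns 1 and stores N in *status on a
 * match, 0 otherwise. */
int parse_exit_status(const char *line, int *status)
{
    /* The leading space in the format skips any indentation on the line. */
    return sscanf(line, " exit_status = %d", status) == 1;
}

/* Hypothetical helper: negative values are produced by pbs_mom; any other
 * value is the return value of the job script. */
const char *exit_status_origin(int status)
{
    return status < 0 ? "pbs_mom" : "job script";
}
```

Given the line "exit_status = -1" from Example 2-14, parse_exit_status stores -1 and exit_status_origin returns "pbs_mom". A line such as "exit_status = 11" is attributed to the job script.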
Example 2-15: TORQUE supplied exit codes
Name | Value | Description |
---|---|---|
JOB_EXEC_OK | 0 | Job execution successful |
JOB_EXEC_FAIL1 | -1 | Job execution failed, before files, no retry |
JOB_EXEC_FAIL2 | -2 | Job execution failed, after files, no retry |
JOB_EXEC_RETRY | -3 | Job execution failed, do retry |
JOB_EXEC_INITABT | -4 | Job aborted on MOM initialization |
JOB_EXEC_INITRST | -5 | Job aborted on MOM init, chkpt, no migrate |
JOB_EXEC_INITRMG | -6 | Job aborted on MOM init, chkpt, ok migrate |
JOB_EXEC_BADRESRT | -7 | Job restart failed |
JOB_EXEC_CMDFAIL | -8 | Exec() of user command failed |
JOB_EXEC_STDOUTFAIL | -9 | Could not create/open stdout stderr files |
JOB_EXEC_OVERLIMIT_MEM | -10 | Job exceeded a memory limit |
JOB_EXEC_OVERLIMIT_WT | -11 | Job exceeded a walltime limit |
JOB_EXEC_OVERLIMIT_CPUT | -12 | Job exceeded a CPU time limit |
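When post-processing job results, it can help to translate a negative exit_status back into its symbolic name. The sketch below copies the names and values from the table above into a lookup array. The struct and the function torque_exit_name are hypothetical and written for illustration only; they are not taken from the TORQUE headers.

```c
#include <stddef.h>

/* Illustrative copy of the table above; not the TORQUE source definitions. */
struct torque_exit_code {
    int value;
    const char *name;
};

static const struct torque_exit_code torque_exit_codes[] = {
    {   0, "JOB_EXEC_OK" },
    {  -1, "JOB_EXEC_FAIL1" },
    {  -2, "JOB_EXEC_FAIL2" },
    {  -3, "JOB_EXEC_RETRY" },
    {  -4, "JOB_EXEC_INITABT" },
    {  -5, "JOB_EXEC_INITRST" },
    {  -6, "JOB_EXEC_INITRMG" },
    {  -7, "JOB_EXEC_BADRESRT" },
    {  -8, "JOB_EXEC_CMDFAIL" },
    {  -9, "JOB_EXEC_STDOUTFAIL" },
    { -10, "JOB_EXEC_OVERLIMIT_MEM" },
    { -11, "JOB_EXEC_OVERLIMIT_WT" },
    { -12, "JOB_EXEC_OVERLIMIT_CPUT" },
};

/* Hypothetical helper: return the symbolic name for a listed code, or NULL
 * when the value is not in the table (for example, a positive value
 * returned by the job script itself). */
const char *torque_exit_name(int status)
{
    size_t count = sizeof torque_exit_codes / sizeof torque_exit_codes[0];
    for (size_t i = 0; i < count; i++) {
        if (torque_exit_codes[i].value == status)
            return torque_exit_codes[i].name;
    }
    return NULL;
}
```

A NULL result is a hint that the value should be read as the script's own return code rather than as a TORQUE-supplied one.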
Example 2-16: Exit code from C program
$ cat error.c
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
    /* 256+11 = 267; only the low-order byte (11) reaches the parent */
    exit(256+11);
}
$ gcc -o error error.c
$ echo ./error | qsub
180.xxx.yyy
$ qstat -f
Job Id: 180.xxx.yyy
Job_Name = STDIN
Job_Owner = test.xxx.yyy
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
job_state = C
queue = batch
server = xxx.yyy
Checkpoint = u
ctime = Wed Apr 30 11:29:37 2008
Error_Path = xxx.yyy:/home/test/STDIN.e180
exec_host = node01/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Wed Apr 30 11:29:37 2008
Output_Path = xxx.yyy:/home/test/STDIN.o180
Priority = 0
qtime = Wed Apr 30 11:29:37 2008
Rerunable = True
Resource_List.neednodes = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
Resource_List.walltime = 01:00:00
session_id = 14107
substate = 59
Variable_List = PBS_O_HOME=/home/test,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=test,
PBS_O_PATH=/usr/local/perltests/bin:/home/test/bin:/usr/local/s
bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games,
PBS_O_SHELL=/bin/bash,PBS_SERVER=xxx.yyy,
PBS_O_HOST=xxx.yyy,PBS_O_WORKDIR=/home/test,
PBS_O_QUEUE=batch
euser = test
egroup = test
hashname = 180.xxx.yyy
queue_rank = 8
queue_type = E
comment = Job started on Wed Apr 30 at 11:29
etime = Wed Apr 30 11:29:37 2008
exit_status = 11
start_time = Wed Apr 30 11:29:37 2008
start_count = 1
Notice that the C routine exit passes only the low-order byte of its argument to the parent process. In this case, 256+11 is 267, but the resulting exit code is only 11 (267 modulo 256), as seen in the exit_status line of the output.
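The same truncation can be observed without a batch system. The following sketch, which assumes a POSIX system, forks a child that exits with a given value and returns the code the parent sees through waitpid. The function name observed_exit_code is hypothetical, and the child calls _exit rather than exit only to avoid flushing the parent's stdio buffers a second time; both pass the same low-order byte to the parent.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run a child process that exits with `code` and
 * return the exit code visible to the parent, or -1 on error. */
int observed_exit_code(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;              /* fork failed */

    if (pid == 0)
        _exit(code);            /* child: terminate immediately */

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;              /* wait failed or child did not exit normally */

    /* WEXITSTATUS yields only the low-order 8 bits of the child's value. */
    return WEXITSTATUS(status);
}
```

Calling observed_exit_code(256 + 11) returns 11, matching the exit_status reported by qstat -f in Example 2-16.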