2.7 Job exit status

Once a job under TORQUE has completed, the exit_status attribute contains the result code returned by the job script. To see it, run qstat -f, which shows the full set of attributes associated with a job. The exit_status field appears near the bottom of the output.

Example 2-1: qstat -f (job failure)

Job Id: 179.host

    Job_Name = STDIN

    Job_Owner = user@host

    job_state = C

    queue = batchq

    server = host

    Checkpoint = u

    ctime = Fri Aug 29 14:55:55 2008

    Error_Path = host:/opt/moab/STDIN.e179

    exec_host = node1/0

    Hold_Types = n

    Join_Path = n

    Keep_Files = n

    Mail_Points = a

    mtime = Fri Aug 29 14:55:55 2008

    Output_Path = host:/opt/moab/STDIN.o179

    Priority = 0

    qtime = Fri Aug 29 14:55:55 2008

    Rerunable = True

    Resource_List.ncpus = 2

    Resource_List.nodect = 1

    Resource_List.nodes = node1

    Variable_List = PBS_O_HOME=/home/user,PBS_O_LOGNAME=user,

 PBS_O_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:,PBS_O_SHELL=/bin/bash,PBS_O_HOST=host,

 PBS_O_WORKDIR=/opt/moab,PBS_O_QUEUE=batchq

    sched_hint = Post job file processing error; job 179.host on host node1/0Ba

 d UID for job execution REJHOST=pala.cridomain MSG=cannot find user 'user' in password file

    etime = Fri Aug 29 14:55:55 2008

    exit_status = -1

This code can be useful in diagnosing problems with jobs that may have unexpectedly terminated.
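When checking many jobs, it can be convenient to pull just this field out of the qstat -f output. The following is a minimal sketch, not part of TORQUE: the function name get_exit_status is hypothetical, and it assumes the "name = value" line layout shown in the example above.

```shell
# Hypothetical helper: print the exit_status value found in `qstat -f`
# output read from standard input.
# Returns 1 and prints nothing if no exit_status line is present
# (for example, when the job has not completed yet).
get_exit_status() {
    awk -F' = ' '
        $1 ~ /^[ \t]*exit_status$/ {
            sub(/[ \t\r]+$/, "", $2)   # drop trailing whitespace
            print $2
            found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

A possible use is `qstat -f 179.host | get_exit_status`. If the input describes several jobs, one value is printed per job that has the field.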

If TORQUE was unable to start the job, this field will contain a negative number produced by the pbs_mom. Otherwise, if the job script was successfully started, the value in this field will be the return value of the script.

Example 2-2: TORQUE supplied exit codes

Name                 Value  Description
JOB_EXEC_OK              0  Job execution successful
JOB_EXEC_FAIL1          -1  Job execution failed, before files, no retry
JOB_EXEC_FAIL2          -2  Job execution failed, after files, no retry
JOB_EXEC_RETRY          -3  Job execution failed, do retry
JOB_EXEC_INITABT        -4  Job aborted on MOM initialization
JOB_EXEC_INITRST        -5  Job aborted on MOM init, chkpt, no migrate
JOB_EXEC_INITRMG        -6  Job aborted on MOM init, chkpt, ok migrate
JOB_EXEC_BADRESRT       -7  Job restart failed
JOB_EXEC_CMDFAIL        -8  Exec() of user command failed
JOB_EXEC_STDOUTFAIL     -9
JOB_EXEC_OVERLIMIT     -10
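The rule above (negative values come from pbs_mom, anything else is the script's own return value) can be sketched as a small lookup. This is an illustration only: the function name describe_exit_status and the wording of its fallback messages are assumptions, while the names and values are copied from the table.

```shell
# Hypothetical helper: translate an exit_status value into the TORQUE
# name listed in the table, or report it as the job script's own code.
describe_exit_status() {
    case "$1" in
        0)   echo "JOB_EXEC_OK" ;;
        -1)  echo "JOB_EXEC_FAIL1" ;;
        -2)  echo "JOB_EXEC_FAIL2" ;;
        -3)  echo "JOB_EXEC_RETRY" ;;
        -4)  echo "JOB_EXEC_INITABT" ;;
        -5)  echo "JOB_EXEC_INITRST" ;;
        -6)  echo "JOB_EXEC_INITRMG" ;;
        -7)  echo "JOB_EXEC_BADRESRT" ;;
        -8)  echo "JOB_EXEC_CMDFAIL" ;;
        -9)  echo "JOB_EXEC_STDOUTFAIL" ;;
        -10) echo "JOB_EXEC_OVERLIMIT" ;;
        -*)  echo "unknown pbs_mom code $1" ;;   # negative, but not in the table
        *)   echo "script exit code $1" ;;       # returned by the job script
    esac
}
```

For the failed job in Example 2-1, `describe_exit_status -1` prints JOB_EXEC_FAIL1, which points at a problem before the script ever ran rather than at the script itself.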

Example 2-3: Exit code from C program

$ cat error.c

#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
   /* 256+11 = 267; only the low-order byte (11) reaches the parent */
   exit(256+11);
}

$ gcc -o error error.c

$ echo ./error | qsub
180.xxx.yyy

$ qstat -f

Job Id: 180.xxx.yyy

    Job_Name = STDIN

    Job_Owner = test.xxx.yyy

    resources_used.cput = 00:00:00

    resources_used.mem = 0kb

    resources_used.vmem = 0kb

    resources_used.walltime = 00:00:00

    job_state = C

    queue = batch

    server = xxx.yyy

    Checkpoint = u

    ctime = Wed Apr 30 11:29:37 2008

    Error_Path = xxx.yyy:/home/test/STDIN.e180

    exec_host = node01/0

    Hold_Types = n

    Join_Path = n

    Keep_Files = n

    Mail_Points = a

    mtime = Wed Apr 30 11:29:37 2008

    Output_Path = xxx.yyy:/home/test/STDIN.o180     

    Priority = 0

    qtime = Wed Apr 30 11:29:37 2008

    Rerunable = True

    Resource_List.neednodes = 1

    Resource_List.nodect = 1

    Resource_List.nodes = 1

    Resource_List.walltime = 01:00:00

    session_id = 14107

    substate = 59

    Variable_List = PBS_O_HOME=/home/test,PBS_O_LANG=en_US.UTF-8,

        PBS_O_LOGNAME=test,

        PBS_O_PATH=/usr/local/perltests/bin:/home/test/bin:/usr/local/s

        bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games,

        PBS_O_SHELL=/bin/bash,PBS_SERVER=xxx.yyy,

        PBS_O_HOST=xxx.yyy,PBS_O_WORKDIR=/home/test,

        PBS_O_QUEUE=batch

    euser = test

    egroup = test

    hashname = 180.xxx.yyy

    queue_rank = 8

    queue_type = E

    comment = Job started on Wed Apr 30 at 11:29

    etime = Wed Apr 30 11:29:37 2008

    exit_status = 11

    start_time = Wed Apr 30 11:29:37 2008

    start_count = 1

Notice that only the low-order byte of the argument passed to the C routine exit is reported as the exit code. In this case the argument 256+11 is 267, but 267 modulo 256 is 11, so the exit_status shown in the output is 11.
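The truncation can be reproduced with plain shell arithmetic, without submitting a job. This is a minimal sketch and the function name observed_exit_code is hypothetical: it masks the value with 255, which keeps only the low-order 8 bits.

```shell
# Hypothetical helper: compute the exit code a parent process would
# observe for a given argument to exit(), keeping only the low 8 bits.
observed_exit_code() {
    echo $(( $1 & 255 ))
}

observed_exit_code 267   # prints 11
```

One consequence worth keeping in mind: an argument that is an exact multiple of 256 maps to 0, so a script that calls exit with such a value would appear to have succeeded.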