5.494 Using "tracejob" to Locate Job Failures

5.494.1 Overview

The tracejob utility extracts job status and job events from accounting records, MOM log files, server log files, and scheduler log files. Using it can help identify where, how, and why a job failed. The tool takes a job ID as a parameter, along with arguments that specify which logs to search, how far into the past to search, and other conditions.

5.494.2 Syntax

tracejob [-a|s|l|m|q|v|z] [-c count] [-w size] [-p path] [-n <DAYS>] [-f filter_type] <JOBID>
 
-p  :  path to PBS_SERVER_HOME
-w  :  number of columns of your terminal
-n  :  number of days in the past to look for job(s) [default 1]
-f  :  filter out types of log entries, multiple -f's can be specified
       error, system, admin, job, job_usage, security, sched, debug,
       debug2, or absolute numeric hex equivalent
-z  :  toggle filtering excessive messages
-c  :  what message count is considered excessive
-a  :  don't use accounting log files
-s  :  don't use server log files
-l  :  don't use scheduler log files
-m  :  don't use MOM log files
-q  :  quiet mode - hide all error messages
-v  :  verbose mode - show more error messages
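
As a minimal sketch of how these options combine, the following Python helper assembles a tracejob command line. The function name build_tracejob_cmd and its parameters are hypothetical and not part of TORQUE; only flags listed above are used.

```python
def build_tracejob_cmd(job_id, days=1, skip=(), filters=(), path=None):
    """Assemble an argv list for tracejob (hypothetical helper, not part of TORQUE)."""
    # Map readable log names to the documented "don't use ..." flags.
    skip_flags = {
        "accounting": "-a",
        "server": "-s",
        "scheduler": "-l",
        "mom": "-m",
    }
    cmd = ["tracejob"]
    for name in skip:
        cmd.append(skip_flags[name])
    if path is not None:
        cmd += ["-p", path]          # path to PBS_SERVER_HOME
    cmd += ["-n", str(days)]         # number of days in the past to search
    for filter_type in filters:
        cmd += ["-f", filter_type]   # -f may be given more than once
    cmd.append(str(job_id))
    return cmd


# Search the last 10 days for job 1131, skipping the scheduler logs.
print(" ".join(build_tracejob_cmd(1131, days=10, skip=["scheduler"])))
# prints: tracejob -l -n 10 1131
```

The resulting list could be passed to subprocess.run on a host where tracejob is installed and the log files are readable.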

5.494.3 Example

> tracejob -n 10 1131

Job: 1131.icluster.org

03/02/2005 17:58:28  S   enqueuing into batch, state 1 hop 1
03/02/2005 17:58:28  S   Job Queued at request of [email protected], owner =
                         [email protected], job name = STDIN, queue = batch
03/02/2005 17:58:28  A   queue=batch
03/02/2005 17:58:41  S   Job Run at request of [email protected]
03/02/2005 17:58:41  M   evaluating limits for job
03/02/2005 17:58:41  M   phase 2 of job launch successfully completed
03/02/2005 17:58:41  M   saving task (TMomFinalizeJob3)
03/02/2005 17:58:41  M   job successfully started
03/02/2005 17:58:41  M   job 1131.koa.icluster.org reported successful start on 1 node(s)
03/02/2005 17:58:41  A   user=dev group=dev jobname=STDIN queue=batch ctime=1109811508
                         qtime=1109811508 etime=1109811508 start=1109811521
                         exec_host=icluster.org/0 Resource_List.neednodes=1 Resource_List.nodect=1
                         Resource_List.nodes=1 Resource_List.walltime=00:01:40
03/02/2005 18:02:11  M   walltime 210 exceeded limit 100
03/02/2005 18:02:11  M   kill_job
03/02/2005 18:02:11  M   kill_job found a task to kill
03/02/2005 18:02:11  M   sending signal 15 to task
03/02/2005 18:02:11  M   kill_task: killing pid 14060 task 1 with sig 15
03/02/2005 18:02:11  M   kill_task: killing pid 14061 task 1 with sig 15
03/02/2005 18:02:11  M   kill_task: killing pid 14063 task 1 with sig 15
03/02/2005 18:02:11  M   kill_job done
03/02/2005 18:04:11  M   kill_job
03/02/2005 18:04:11  M   kill_job found a task to kill
03/02/2005 18:04:11  M   sending signal 15 to task
03/02/2005 18:06:27  M   kill_job
03/02/2005 18:06:27  M   kill_job done
03/02/2005 18:06:27  M   performing job clean-up
03/02/2005 18:06:27  A   user=dev group=dev jobname=STDIN queue=batch ctime=1109811508
                         qtime=1109811508 etime=1109811508 start=1109811521
                         exec_host=icluster.org/0 Resource_List.neednodes=1 Resource_List.nodect=1
                         Resource_List.nodes=1 Resource_List.walltime=00:01:40 session=14060
                         end=1109811987 Exit_status=265 resources_used.cput=00:00:00
                         resources_used.mem=3544kb resources_used.vmem=10632kb
                         resources_used.walltime=00:07:46

 

...
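
When the output is long, it can help to scan it programmatically for the entry that explains the failure. The sketch below parses lines shaped like the output above and picks out MOM entries that report an exceeded limit. It assumes the layout shown in this example (date, time, a one-letter source column, then the message) and that "M" marks MOM log entries, as the entries above suggest. The names parse_events and find_limit_violations are hypothetical.

```python
import re

# A few lines copied from the example output above.
SAMPLE = """\
03/02/2005 17:58:41  M   job successfully started
03/02/2005 17:58:41  A   user=dev group=dev jobname=STDIN queue=batch ctime=1109811508
                         qtime=1109811508 etime=1109811508 start=1109811521
03/02/2005 18:02:11  M   walltime 210 exceeded limit 100
03/02/2005 18:02:11  M   kill_job
"""

# Assumed layout: "MM/DD/YYYY HH:MM:SS", a one-letter source, then the message.
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+([A-Z])\s+(.*)$")


def parse_events(text):
    """Return a list of (timestamp, source, message) tuples."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            events.append(match.groups())
        elif events and line.strip():
            # Indented continuation line: append it to the previous message.
            ts, src, msg = events[-1]
            events[-1] = (ts, src, msg + " " + line.strip())
    return events


def find_limit_violations(events):
    """Return (timestamp, message) for MOM entries reporting an exceeded limit."""
    return [(ts, msg) for ts, src, msg in events
            if src == "M" and "exceeded limit" in msg]


for ts, msg in find_limit_violations(parse_events(SAMPLE)):
    print(ts, msg)
# prints: 03/02/2005 18:02:11 walltime 210 exceeded limit 100
```

In this example the matching entry shows that the job ran past its requested walltime, which is why the MOM then issued kill_job.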

The tracejob command operates by searching the pbs_server accounting records and the pbs_server, MOM, and scheduler logs. To function properly, it must be run on a node, and as a user, that can access these files. By default, these files are accessible only to the root user and are available only on the cluster management node. In particular, the files required by tracejob are located in the following directories:

TORQUE_HOME/server_priv/accounting

TORQUE_HOME/server_logs

TORQUE_HOME/mom_logs

TORQUE_HOME/sched_logs

tracejob may only be used on systems where these files are made available. Non-root users may be able to use this command if the permissions on these directories or files are changed appropriately.
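
Before running tracejob as a non-root user, it may be useful to check which of these directories the current user can actually read. The following is a rough sketch only: the function name unreadable_log_dirs is hypothetical, and the fallback location /var/spool/torque is an assumption that may not match a given installation.

```python
import os

# Directories listed above, relative to TORQUE_HOME.
LOG_SUBDIRS = [
    "server_priv/accounting",
    "server_logs",
    "mom_logs",
    "sched_logs",
]


def unreadable_log_dirs(torque_home):
    """Return the log directories that are missing or not readable by this user."""
    problems = []
    for sub in LOG_SUBDIRS:
        path = os.path.join(torque_home, sub)
        # Listing a directory needs both read and search (execute) permission.
        if not (os.path.isdir(path) and os.access(path, os.R_OK | os.X_OK)):
            problems.append(path)
    return problems


if __name__ == "__main__":
    # ASSUMPTION: /var/spool/torque is only a guess at TORQUE_HOME; adjust as needed.
    home = os.environ.get("TORQUE_HOME", "/var/spool/torque")
    for path in unreadable_log_dirs(home):
        print("cannot read:", path)
```

A directory reported here is one whose logs tracejob would be unable to search for the current user. The check covers directory permissions only, not the permissions of individual log files.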

The value of Resource_List.* is the amount of resources requested, and the value of resources_used.* is the amount of resources actually used.
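
To make the comparison concrete, the sketch below takes fields from the final accounting record in the example and compares the requested walltime with the walltime actually used. The helper names hms_to_seconds and parse_record are hypothetical. The sketch assumes that fields are whitespace-separated key=value pairs and that start and end are epoch seconds, as the values in the example suggest.

```python
def hms_to_seconds(hms):
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_record(text):
    """Split a record of whitespace-separated key=value pairs into a dict."""
    return dict(tok.split("=", 1) for tok in text.split() if "=" in tok)


# Selected fields from the last accounting (A) entry in the example.
RECORD = (
    "start=1109811521 Resource_List.walltime=00:01:40 "
    "end=1109811987 Exit_status=265 resources_used.walltime=00:07:46"
)

fields = parse_record(RECORD)
requested = hms_to_seconds(fields["Resource_List.walltime"])
used = hms_to_seconds(fields["resources_used.walltime"])
elapsed = int(fields["end"]) - int(fields["start"])

print(requested, used, elapsed)
# prints: 100 466 466
```

Here the job requested 100 seconds of walltime but used 466, which matches the difference between the end and start timestamps and is consistent with the "exceeded limit 100" message in the MOM log.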

© 2017 Adaptive Computing