The tracejob utility extracts job status and job events from accounting records, MOM log files, server log files, and scheduler log files. It can help identify where, how, and why a job failed. The tool takes a job ID as its parameter, along with options that specify which logs to search, how far into the past to search, and other conditions.
tracejob [-a|s|l|m|q|v|z] [-c count] [-w size] [-p path] [-n <DAYS>] [-f filter_type] <JOBID>
-p : path to PBS_SERVER_HOME
-w : number of columns of your terminal
-n : number of days in the past to look for job(s) [default 1]
-f : filter out types of log entries, multiple -f's can be specified
error, system, admin, job, job_usage, security, sched, debug,
debug2, or absolute numeric hex equivalent
-z : toggle filtering excessive messages
-c : what message count is considered excessive
-a : don't use accounting log files
-s : don't use server log files
-l : don't use scheduler log files
-m : don't use MOM log files
-q : quiet mode - hide all error messages
-v : verbose mode - show more error messages
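As an illustration of how the options above combine, here is a minimal Python sketch that assembles a tracejob command line. The helper name build_tracejob_cmd and its keyword arguments are hypothetical; only the flags themselves come from the option list above.

```python
import shlex

# Map each log source to its documented "don't use" switch.
SKIP_FLAGS = {"accounting": "-a", "server": "-s", "scheduler": "-l", "mom": "-m"}

def build_tracejob_cmd(job_id, days=None, skip=(), filters=(), verbose=False, home=None):
    """Assemble an argv list for tracejob (hypothetical helper, documented flags only)."""
    cmd = ["tracejob"]
    if home is not None:
        cmd += ["-p", home]           # path to PBS_SERVER_HOME
    if days is not None:
        cmd += ["-n", str(days)]      # number of days in the past to search
    for source in skip:
        cmd.append(SKIP_FLAGS[source])
    for filter_type in filters:
        cmd += ["-f", filter_type]    # -f may be given more than once
    if verbose:
        cmd.append("-v")
    cmd.append(str(job_id))
    return cmd

# Search the last 10 days, skip scheduler logs, filter out debug entries.
cmd = build_tracejob_cmd(1131, days=10, skip=("scheduler",), filters=("debug", "debug2"))
print(shlex.join(cmd))  # tracejob -n 10 -l -f debug -f debug2 1131
```

The resulting list can be passed to subprocess.run on a node where tracejob is installed and the log files are readable.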
> tracejob -n 10 1131
Job: 1131.icluster.org
03/02/2005 17:58:28 S enqueuing into batch, state 1 hop 1
03/02/2005 17:58:28 S Job Queued at request of dev@icluster.org, owner =
dev@icluster.org, job name = STDIN, queue = batch
03/02/2005 17:58:28 A queue=batch
03/02/2005 17:58:41 S Job Run at request of dev@icluster.org
03/02/2005 17:58:41 M evaluating limits for job
03/02/2005 17:58:41 M phase 2 of job launch successfully completed
03/02/2005 17:58:41 M saving task (TMomFinalizeJob3)
03/02/2005 17:58:41 M job successfully started
03/02/2005 17:58:41 M job 1131.koa.icluster.org reported successful start on 1 node(s)
03/02/2005 17:58:41 A user=dev group=dev jobname=STDIN queue=batch ctime=1109811508
qtime=1109811508 etime=1109811508 start=1109811521
exec_host=icluster.org/0 Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1 Resource_List.walltime=00:01:40
03/02/2005 18:02:11 M walltime 210 exceeded limit 100
03/02/2005 18:02:11 M kill_job
03/02/2005 18:02:11 M kill_job found a task to kill
03/02/2005 18:02:11 M sending signal 15 to task
03/02/2005 18:02:11 M kill_task: killing pid 14060 task 1 with sig 15
03/02/2005 18:02:11 M kill_task: killing pid 14061 task 1 with sig 15
03/02/2005 18:02:11 M kill_task: killing pid 14063 task 1 with sig 15
03/02/2005 18:02:11 M kill_job done
03/02/2005 18:04:11 M kill_job
03/02/2005 18:04:11 M kill_job found a task to kill
03/02/2005 18:04:11 M sending signal 15 to task
03/02/2005 18:06:27 M kill_job
03/02/2005 18:06:27 M kill_job done
03/02/2005 18:06:27 M performing job clean-up
03/02/2005 18:06:27 A user=dev group=dev jobname=STDIN queue=batch ctime=1109811508
qtime=1109811508 etime=1109811508 start=1109811521
exec_host=icluster.org/0 Resource_List.neednodes=1 Resource_List.nodect=1
Resource_List.nodes=1 Resource_List.walltime=00:01:40 session=14060
end=1109811987 Exit_status=265 resources_used.cput=00:00:00
resources_used.mem=3544kb resources_used.vmem=10632kb
resources_used.walltime=00:07:46
...
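Output like the above can be post-processed when tracing many jobs. The following is a rough Python sketch, not part of tracejob itself. It assumes each entry starts with a MM/DD/YYYY HH:MM:SS timestamp followed by a one-letter source code (S, M, and A appear in the sample), and that lines without a timestamp continue the previous entry. The function name parse_tracejob is hypothetical.

```python
import re
from datetime import datetime

# Timestamp, one-letter source code, then the message text.
ENTRY = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+([A-Z])\s+(.*)$")

def parse_tracejob(text):
    """Split tracejob output into entries, joining wrapped continuation lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        match = ENTRY.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S")
            entries.append({"time": stamp, "source": match.group(2),
                            "message": match.group(3)})
        elif entries and line and line != "...":
            # No timestamp: treat as a continuation of the previous entry.
            entries[-1]["message"] += " " + line
        # Lines before the first entry (such as the "Job:" header) are skipped.
    return entries

sample = """Job: 1131.icluster.org
03/02/2005 17:58:28 A queue=batch
03/02/2005 18:02:11 M walltime 210 exceeded limit 100
03/02/2005 18:06:27 A user=dev group=dev
qtime=1109811508 Exit_status=265
"""

for entry in parse_tracejob(sample):
    print(entry["time"].isoformat(), entry["source"], entry["message"])
```

Filtering the result on a single source letter, for example "M", isolates the MOM messages that show the walltime limit being exceeded.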
The tracejob command operates by searching the pbs_server accounting records and the pbs_server, MOM, and scheduler logs. To function properly, it must be run on a node, and as a user, with access to these files. By default, these files are accessible only to the root user and are available only on the cluster management node. The files required by tracejob are located in the following directories:
TORQUE_HOME/server_priv/accounting
TORQUE_HOME/server_logs
TORQUE_HOME/mom_logs
TORQUE_HOME/sched_logs
tracejob may only be used on systems where these files are made available. Non-root users may be able to use this command if the permissions on these directories or files are changed appropriately.
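Whether the current user can read these locations can be checked before running tracejob. The sketch below is one possible approach in Python. The function name readable_log_dirs is hypothetical, and the fallback path /var/spool/torque is only an assumed value for TORQUE_HOME; substitute the path used by your installation.

```python
import os

# Subdirectories of TORQUE_HOME that tracejob searches, as listed above.
LOG_SUBDIRS = ["server_priv/accounting", "server_logs", "mom_logs", "sched_logs"]

def readable_log_dirs(torque_home):
    """Return {subdir: bool} telling whether each directory exists and can be listed."""
    result = {}
    for sub in LOG_SUBDIRS:
        path = os.path.join(torque_home, sub)
        # Listing a directory needs both read and execute (search) permission.
        result[sub] = os.path.isdir(path) and os.access(path, os.R_OK | os.X_OK)
    return result

# Assumption: TORQUE_HOME is taken from the environment, else a guessed default.
home = os.environ.get("TORQUE_HOME", "/var/spool/torque")
for sub, ok in readable_log_dirs(home).items():
    print(f"{sub}: {'readable' if ok else 'not accessible'}")
```

This only checks the directories; individual log files inside them may still carry more restrictive permissions.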
The value of Resource_List.* is the amount of resources requested, and the value of resources_used.* is the amount of resources actually used.
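To make the distinction concrete, the sketch below compares the requested and used walltime from the final accounting record in the sample output. It is an illustrative Python fragment with hypothetical helper names, and it assumes the record is a whitespace-separated list of key=value pairs with no spaces inside values.

```python
def parse_record(record):
    """Turn 'key=value key=value ...' into a dict (assumes no spaces in values)."""
    fields = {}
    for token in record.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields

def hms_to_seconds(hms):
    """Convert HH:MM:SS to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# Excerpt of the end-of-job accounting record from the sample output.
record = ("user=dev group=dev jobname=STDIN queue=batch "
          "Resource_List.walltime=00:01:40 session=14060 "
          "Exit_status=265 resources_used.walltime=00:07:46")

fields = parse_record(record)
requested = hms_to_seconds(fields["Resource_List.walltime"])   # 100 seconds
used = hms_to_seconds(fields["resources_used.walltime"])       # 466 seconds
print(requested, used, used > requested)  # 100 466 True
```

Here the job requested 100 seconds of walltime but used 466, which matches the "walltime 210 exceeded limit 100" message and the subsequent kill_job entries in the MOM log.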