11.1 Troubleshooting

There are a few general strategies that can be followed to determine the cause of unexpected behavior. The following are a few of the tools available to help determine where problems occur.

11.1.1 Host Resolution

The TORQUE server host must be able to perform both forward and reverse name lookup on itself and on all compute nodes. Likewise, each compute node must be able to perform forward and reverse name lookup on itself, the TORQUE server host, and all other compute nodes. In many cases, name resolution is handled by configuring the node's /etc/hosts file, although DNS and NIS services may also be used. Commands such as nslookup or dig can be used to verify proper host resolution.

Note Invalid host resolution may exhibit itself with compute nodes reporting as down within the output of pbsnodes -a and with failure of the momctl -d 3 command.
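
For example, the following commands, run from the server host and repeated on each compute node, verify that forward and reverse lookups agree (the hostname and address shown are placeholders for your own):

> nslookup node01
> dig +short node01
> dig +short -x 10.1.1.101

The forward lookup of each name should return the address whose reverse lookup maps back to that same name.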

11.1.2 Firewall Configuration

Be sure that, if you have firewalls running on the server or node machines, you allow connections on the appropriate ports for each machine. TORQUE pbs_mom daemons use UDP ports 1023 and below if privileged ports are configured (the use of privileged ports is the default). The pbs_server and pbs_mom daemons use TCP and UDP ports 15001-15004 by default.

Firewall-based issues are often associated with server-to-MOM communication failures and with messages such as 'premature end of message' in the log files.

Also, the tcpdump program can be used to verify that the correct network packets are being sent.
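
For example, a capture similar to the following (the interface name is an assumption; the port range matches the defaults listed above) can show whether server/MOM traffic is actually reaching a machine:

> tcpdump -i eth0 portrange 15001-15004

If packets appear on one host but never arrive on the other, a firewall in between is the likely cause.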

11.1.3 TORQUE Log Files

The pbs_server keeps a daily log of all activity in the TORQUE_HOME/server_logs directory. The pbs_mom also keeps a daily log of all activity in the TORQUE_HOME/mom_logs/ directory. These logs contain information on communication between server and MOM as well as information on jobs as they enter the queue and as they are dispatched, run, and terminated. These logs can be very helpful in determining general job failures. For MOM logs, the verbosity of the logging can be adjusted by setting the loglevel parameter in the mom_priv/config file. For server logs, the verbosity of the logging can be adjusted by setting the server log_level attribute in qmgr.

For both the pbs_mom and pbs_server daemons, the log verbosity level can also be adjusted by setting the environment variable PBSLOGLEVEL to a value between 0 and 7. Further, to dynamically change the log level of a running daemon, use the SIGUSR1 and SIGUSR2 signals to increase and decrease the active loglevel by one. Signals are sent to a process using the kill command. For example, kill -USR1 `pgrep pbs_mom` would raise the log level by one. The current loglevel for pbs_mom can be displayed with the command momctl -d 3.
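
For example, the following are minimal sketches of each adjustment (the value 7 is the maximum verbosity; pick a level appropriate for your site). For the MOM, add the line below to mom_priv/config and restart pbs_mom (the leading $ follows the MOM configuration file convention):

$loglevel 7

For the server, set the attribute via qmgr:

> qmgr -c "set server log_level = 7"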

11.1.4 Using tracejob to Locate Job Failures

Overview

The tracejob utility extracts job status and job events from accounting records, MOM log files, server log files, and scheduler log files. Using it can help identify where, how, and why a job failed. This tool takes a job ID as a parameter as well as arguments to specify which logs to search, how far into the past to search, and other conditions.

Syntax

tracejob [-a|s|l|m|q|v|z] [-c count] [-w size] [-p path] [ -n <DAYS>] [-f filter_type] <JOBID>

  -p : path to PBS_SERVER_HOME
  -w : number of columns of your terminal
  -n : number of days in the past to look for job(s) [default 1]
  -f : filter out types of log entries, multiple -f's can be specified
       error, system, admin, job, job_usage, security, sched, debug, 
       debug2, or absolute numeric hex equivalent
  -z : toggle filtering excessive messages
  -c : what message count is considered excessive
  -a : don't use accounting log files
  -s : don't use server log files
  -l : don't use scheduler log files
  -m : don't use MOM log files
  -q : quiet mode - hide all error messages
  -v : verbose mode - show more error messages

Example

> tracejob -n 10 1131

Job: 1131.icluster.org

03/02/2005 17:58:28  S    enqueuing into batch, state 1 hop 1
03/02/2005 17:58:28  S    Job Queued at request of dev@icluster.org, owner =
                          dev@icluster.org, job name = STDIN, queue = batch
03/02/2005 17:58:28  A    queue=batch
03/02/2005 17:58:41  S    Job Run at request of dev@icluster.org
03/02/2005 17:58:41  M    evaluating limits for job
03/02/2005 17:58:41  M    phase 2 of job launch successfully completed
03/02/2005 17:58:41  M    saving task (TMomFinalizeJob3)
03/02/2005 17:58:41  M    job successfully started
03/02/2005 17:58:41  M    job 1131.koa.icluster.org reported successful start on 1 node(s)
03/02/2005 17:58:41  A    user=dev group=dev jobname=STDIN queue=batch ctime=1109811508
                          qtime=1109811508 etime=1109811508 start=1109811521
                          exec_host=icluster.org/0 Resource_List.neednodes=1 Resource_List.nodect=1
                          Resource_List.nodes=1 Resource_List.walltime=00:01:40
03/02/2005 18:02:11  M    walltime 210 exceeded limit 100
03/02/2005 18:02:11  M    kill_job
03/02/2005 18:02:11  M    kill_job found a task to kill
03/02/2005 18:02:11  M    sending signal 15 to task
03/02/2005 18:02:11  M    kill_task: killing pid 14060 task 1 with sig 15
03/02/2005 18:02:11  M    kill_task: killing pid 14061 task 1 with sig 15
03/02/2005 18:02:11  M    kill_task: killing pid 14063 task 1 with sig 15
03/02/2005 18:02:11  M    kill_job done
03/02/2005 18:04:11  M    kill_job
03/02/2005 18:04:11  M    kill_job found a task to kill
03/02/2005 18:04:11  M    sending signal 15 to task
03/02/2005 18:06:27  M    kill_job
03/02/2005 18:06:27  M    kill_job done
03/02/2005 18:06:27  M    performing job clean-up
03/02/2005 18:06:27  A    user=dev group=dev jobname=STDIN queue=batch ctime=1109811508
                          qtime=1109811508 etime=1109811508 start=1109811521
                          exec_host=icluster.org/0 Resource_List.neednodes=1 Resource_List.nodect=1
                          Resource_List.nodes=1 Resource_List.walltime=00:01:40 session=14060
                          end=1109811987 Exit_status=265 resources_used.cput=00:00:00
                          resources_used.mem=3544kb resources_used.vmem=10632kb
                          resources_used.walltime=00:07:46

...

Note The tracejob command operates by searching the pbs_server accounting records and the pbs_server, mom, and scheduler logs. To function properly, it must be run on a node and as a user which can access these files. By default, these files are all accessible by the user root and only available on the cluster management node. In particular, the files required by tracejob are located in the following directories:

  • TORQUE_HOME/server_priv/accounting
  • TORQUE_HOME/server_logs
  • TORQUE_HOME/mom_logs
  • TORQUE_HOME/sched_logs

tracejob may only be used on systems where these files are made available. Non-root users may be able to use this command if the permissions on these directories or files are changed appropriately.

11.1.5 Using GDB to Locate Failures

If either the pbs_mom or pbs_server daemon fails unexpectedly (and the log files contain no information on the failure), gdb can be used to determine whether or not the program is crashing. To start pbs_mom or pbs_server under GDB, export the environment variable PBSDEBUG=yes and start the program under gdb (i.e., gdb pbs_mom, then issue the run subcommand at the gdb prompt). GDB may run for some time until a failure occurs, at which point a message will be printed to the screen and a gdb prompt will again be made available. If this occurs, use the gdb where subcommand to determine the exact location in the code. The information provided may be adequate to allow local diagnosis and correction. If not, this output may be sent to the mailing list or to help for further assistance. (For more information on submitting bugs or requests for help, please see the Mailing List Instructions.)
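
A minimal session following the steps above might look like this:

> export PBSDEBUG=yes
> gdb pbs_mom
(gdb) run
... the daemon runs until the failure occurs ...
(gdb) where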

Note See the PBSCOREDUMP parameter for enabling creation of core files.

11.1.6 Other Diagnostic Options

When PBSDEBUG is set, some client commands will print additional diagnostic information.

$ export PBSDEBUG=yes
$ cmd

To debug different kinds of problems, it can be useful to see where in the code time is being spent. This is called profiling, and the Linux utility gprof will output a listing of routines and the amount of time spent in them. This does require that the code be compiled with special options to instrument it and to produce a file, gmon.out, that is written at the end of program execution.

The following listing shows how to build TORQUE with profiling enabled. Notice that the output file for pbs_mom will end up in the mom_priv directory because its startup code changes the default directory to this location.

# ./configure "CFLAGS=-pg -lgcov -fPIC"
# make -j5
# make install
# pbs_mom
... do some stuff for a while ...
# momctl -s
# cd /var/spool/torque/mom_priv
# gprof -b `which pbs_mom` gmon.out |less
#

Another way to see areas where a program is spending most of its time is with the valgrind program. The advantage of using valgrind is that the programs do not have to be specially compiled.

# valgrind --tool=callgrind pbs_mom
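
The callgrind run writes its results to a callgrind.out.<pid> file in the working directory; the callgrind_annotate tool that ships with valgrind can then be used to summarize where time was spent (the file name shown is a placeholder for the actual output file):

# callgrind_annotate callgrind.out.<pid> | less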

11.1.7 Stuck Jobs

If a job gets stuck in TORQUE, try these suggestions to resolve the issue.

  • Use the qdel command to cancel the job.
  • Force the MOM to send an obituary of the job ID to the server.
    > qsig -s 0 <JOBID>
  • You can try clearing the stale jobs by using the momctl command on the compute nodes where the jobs are still listed.
    > momctl -c 58925 -h compute-5-20
  • Setting the server parameter mom_job_sync to True via qmgr might help prevent jobs from hanging.
    > qmgr -c "set server mom_job_sync = True"

    To check and see if this is already set, use:

    > qmgr -c "p s"
  • If the suggestions above cannot remove the stuck job, you can try qdel -p. However, since the -p option purges all information generated by the job, this is not a recommended option unless the above suggestions fail to remove the stuck job.
    > qdel -p <JOBID>
  • The last suggestion for removing stuck jobs from compute nodes is to restart the pbs_mom.

For additional troubleshooting, run a tracejob on one of the stuck jobs. You can then create an online support ticket with the full server log for the time period displayed in the trace job.

11.1.8 Frequently Asked Questions (FAQ)


Cannot connect to server: error=15034

This error occurs in TORQUE clients (or their APIs) because TORQUE cannot find the server_name file and/or the PBS_DEFAULT environment variable is not set. The server_name file or PBS_DEFAULT variable indicate the pbs_server's hostname that the client tools should communicate with. The server_name file is usually located in TORQUE's local state directory. Make sure the file exists, has proper permissions, and that the version of TORQUE you are running was built with the proper directory settings. Alternatively you can set the PBS_DEFAULT environment variable. Restart TORQUE daemons if you make changes to these settings.
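
For example, on a default build the following checks confirm that clients know which server to contact (the path and hostname are assumptions for your installation):

> cat /var/spool/torque/server_name
headnode.example.org

> export PBS_DEFAULT=headnode.example.org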


Deleting 'Stuck' Jobs

To manually delete a stale job which has no process, and for which the mother superior is still alive, send a signal 0 with qsig; this will often cause MOM to realize the job is stale and issue the proper JobObit notice. Failing that, use momctl -c to forcefully cause MOM to purge the job. The following process should never be necessary:
  • shut down the MOM on the mother superior node
  • delete all files and directories related to the job from TORQUE_HOME/mom_priv/jobs
  • restart the MOM on the mother superior node.

If the mother superior MOM has been lost and cannot be recovered (i.e., hardware or disk failure), a job running on that node can be purged from the output of qstat using the qdel -p command or can be removed manually using the following steps:

To remove job X:

  1. Shut down pbs_server.
     > qterm
  2. Remove the job spool files.
     > rm TORQUE_HOME/server_priv/jobs/X.SC TORQUE_HOME/server_priv/jobs/X.JB
  3. Restart pbs_server.
     > pbs_server

Which user must run TORQUE?

TORQUE (pbs_server & pbs_mom) must be started by a user with root privileges.


Scheduler cannot run jobs - rc: 15003

For a scheduler, such as Moab or Maui, to control jobs with TORQUE, the scheduler needs to be run as a user in the server operators / managers list (see qmgr (set server operators / managers)). The default for the server operators / managers list is root@localhost. For TORQUE to be used in a grid setting with Silver, the scheduler needs to be run as root.
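
For example, if the scheduler runs as the user maui on the head node, it could be added to both lists with commands similar to the following (the user and hostname are examples):

> qmgr -c "set server managers += maui@headnode.example.org"
> qmgr -c "set server operators += maui@headnode.example.org"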


PBS_Server: pbsd_init, Unable to read server database

If this message is displayed upon starting pbs_server, it means that the local database cannot be read. This can happen for several reasons. The most likely is a version mismatch. Most versions of TORQUE can read each other's databases. However, there are a few incompatibilities between OpenPBS and TORQUE. Because of enhancements to TORQUE, it cannot read the job database of an OpenPBS server (job structure sizes have been altered to increase functionality). Also, a pbs_server compiled in 32-bit mode cannot read a database generated by a 64-bit pbs_server, and vice versa.

To reconstruct a database (excluding the job database), first print out the old data with this command:

%> qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch acl_host_enable = False
set queue batch resources_max.nodect = 6
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch resources_available.nodect = 18
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server managers = griduser@oahu.icluster.org
set server managers += scott@*.icluster.org
set server managers += wightman@*.icluster.org
set server operators = griduser@oahu.icluster.org
set server operators += scott@*.icluster.org
set server operators += wightman@*.icluster.org
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server resources_available.nodect = 80
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6

Copy this information somewhere. Restart pbs_server with the following command:

> pbs_server -t create

When it prompts to overwrite the previous database, enter 'y', then enter the data exported by the qmgr command with a command similar to the following:

> cat data | qmgr

Restart pbs_server without the flags:

> qterm
> pbs_server

This will reinitialize the database to the current version. Note that reinitializing the server database will reset the next jobid to 1.


qsub will not allow the submission of jobs requesting many processors

TORQUE's definition of a node is context sensitive and can appear inconsistent. The qsub '-l nodes=<X>' expression can at times indicate a request for X processors and at other times be interpreted as a request for X nodes. While qsub allows multiple interpretations of the keyword nodes, aspects of the TORQUE server's logic are not so flexible. Consequently, if a job is using '-l nodes' to specify processor count and the requested number of processors exceeds the available number of physical nodes, the server daemon will reject the job.

To get around this issue, the server can be told it has an inflated number of nodes using the resources_available attribute. To take effect, this attribute should be set on both the server and the associated queue as in the example below. See resources_available for more information.

> qmgr
Qmgr: set server resources_available.nodect=2048
Qmgr: set queue batch resources_available.nodect=2048

Note The pbs_server daemon will need to be restarted before these changes will take effect.


qsub reports 'Bad UID for job execution'

[guest@login2]$ qsub test.job
qsub: Bad UID for job execution

Job submission hosts must be explicitly specified within TORQUE or enabled via RCmd security mechanisms in order to be trusted. In the example above, the host 'login2' is not configured to be trusted. This process is documented in Configuring Job Submission Hosts.
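
As a sketch, on versions of TORQUE that support the submit_hosts server parameter, an additional submission host can be trusted with a command similar to the following (the hostname is an example; see Configuring Job Submission Hosts for the mechanisms that apply to your version):

> qmgr -c "set server submit_hosts += login2"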


Why does my job keep bouncing from running to queued?

There are several reasons why a job will fail to start. Do you see any errors in the MOM logs? Be sure to increase the loglevel on MOM if you don't see anything. Also be sure TORQUE is configured with --enable-syslog and look in /var/log/messages (or wherever your syslog writes).

Also verify the following on all machines:

  • DNS resolution works correctly with matching forward and reverse lookups
  • time is synchronized across the head and compute nodes
  • user accounts exist on all compute nodes
  • user home directories can be mounted on all compute nodes
  • prologue scripts (if specified) exit with 0

If using a scheduler such as Moab or Maui, use a scheduler tool such as checkjob to identify job start issues.
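
For example, with Moab or Maui installed (the job ID is a placeholder):

> checkjob -v 1131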


How do I use PVM with TORQUE?

  • Start the master pvmd on a compute node and then add the slaves
  • mpiexec can be used to launch slaves using rsh or ssh (use export PVM_RSH=/usr/bin/ssh to use ssh)

Note Access can be managed by allowing rsh/ssh without passwords between the batch nodes while denying it from anywhere else, including the interactive nodes. This can be done with xinetd and sshd configuration (root is allowed to ssh everywhere). This way, the pvm daemons can be started and killed from the job script.

The problem is that this setup allows the users to bypass the batch system by writing a job script that uses rsh/ssh to launch processes on the batch nodes. If there are relatively few users and they can more or less be trusted, this setup can work.


My build fails attempting to use the TCL library

TORQUE builds can fail on TCL dependencies even if a version of TCL is available on the system. TCL is only utilized to support the xpbsmon client. If your site does not use this tool (most sites do not use xpbsmon), you can work around this failure by rerunning configure with the --disable-gui argument.
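
For example (add any other configure options your site normally uses):

> ./configure --disable-gui
> make
> make install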

My job will not start, failing with the message 'cannot send job to mom, state=PRERUN'

If a node crashes or other major system failures occur, it is possible that a job may be stuck in a corrupt state on a compute node. TORQUE 2.2.0 and higher automatically handle this when the mom_job_sync parameter is set via qmgr (the default). For earlier versions of TORQUE, set this parameter and restart the pbs_mom daemon.

This error can also occur if not enough free space is available on the partition that holds TORQUE.
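
A quick way to verify the free space on that partition (the path assumes the default TORQUE_HOME) is:

> df -h /var/spool/torque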


I want to submit and run jobs as root

While this can be a very bad idea from a security point of view, in some restricted environments this can be quite useful and can be enabled by setting the acl_roots parameter via qmgr command as in the following example:

> qmgr -c 's s acl_roots+=root@*'

How do I determine what version of TORQUE I am using?

There are times when you want to find out what version of TORQUE you are using. An easy way to do this is to run the following command:

> qmgr -c "p s" | grep pbs_ver

See Also