There are a few general strategies that can be followed to determine the cause of unexpected behavior. These are a few of the tools available to help determine where problems occur.
The TORQUE server host must be able to perform both forward and reverse name lookup on itself and on all compute nodes. Likewise, each compute node must be able to perform forward and reverse name lookup on itself, the TORQUE server host, and all other compute nodes. In many cases, name resolution is handled by configuring the node's /etc/hosts file although DNS and NIS services may also be used. Commands such as nslookup or dig can be used to verify proper host resolution.
Invalid host resolution may show up as compute nodes reported as down in the output of pbsnodes -a and as failure of the momctl -d 3 command.
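The forward and reverse checks described above can be scripted. The following is a minimal sketch, assuming a system that provides getent (as glibc-based Linux systems do); the function name check_resolution is illustrative and not part of TORQUE. Unlike nslookup and dig, which query DNS only, getent follows the system's configured lookup order, so it also reflects entries in /etc/hosts or NIS.

```shell
#!/bin/sh
# check_resolution HOST
# Forward-resolve HOST to an address, then reverse-resolve that address.
# Prints one status line and returns non-zero if either lookup fails.
check_resolution() {
    host=$1

    # Forward lookup: getent prints "ADDRESS NAME [ALIAS...]" on success.
    addr=$(getent hosts "$host" 2>/dev/null | awk 'NR == 1 { print $1 }')
    if [ -z "$addr" ]; then
        echo "FAIL forward $host"
        return 1
    fi

    # Reverse lookup: given an address, getent looks up the matching name.
    name=$(getent hosts "$addr" 2>/dev/null | awk 'NR == 1 { print $2 }')
    if [ -z "$name" ]; then
        echo "FAIL reverse $host ($addr)"
        return 1
    fi

    echo "OK $host -> $addr -> $name"
}
```

Running the function for the local host name, the TORQUE server host, and every compute node, on each machine in turn, covers the combinations listed above. The name printed after the second arrow should be the name the other hosts use for that machine.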
Be sure that, if you have firewalls running on the server or node machines, you allow connections on the appropriate ports for each machine. TORQUE pbs_mom daemons use UDP ports 1023 and below if privileged ports are configured (privileged ports is the default). The pbs_server and pbs_mom daemons use TCP and UDP ports 15001-15004 by default.
Firewall-based issues are often associated with server-to-MOM communication failures and with messages such as 'premature end of message' in the log files.
Also, the tcpdump program can be used to verify the correct network packets are being sent.
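When capturing with tcpdump, it helps to restrict the capture to the default TORQUE ports named above. The sketch below only builds the filter expression; the function name torque_capture_filter and the host name node01 are illustrative, and the port range would need adjusting on a site that has changed the defaults.

```shell
#!/bin/sh
# torque_capture_filter HOST
# Print a pcap filter expression that matches traffic to or from HOST on
# the default TORQUE ports (15001-15004, both TCP and UDP).
torque_capture_filter() {
    echo "host $1 and portrange 15001-15004"
}

# Example (needs root; "-i any" is Linux-specific, node01 is a placeholder):
#   tcpdump -n -i any "$(torque_capture_filter node01)"
```

If no packets from the node appear on the server during a capture, or the reverse, a firewall between the two machines is a likely cause.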
The pbs_server keeps a daily log of all activity in the TORQUE_HOME/server_logs directory. The pbs_mom also keeps a daily log of all activity in the TORQUE_HOME/mom_logs/ directory. These logs contain information on communication between server and MOM as well as information on jobs as they enter the queue and as they are dispatched, run, and terminated. These logs can be very helpful in determining general job failures. For MOM logs, the verbosity of the logging can be adjusted by setting the loglevel parameter in the mom_priv/config file. For server logs, the verbosity of the logging can be adjusted by setting the server log_level attribute in qmgr.
For both pbs_mom and pbs_server daemons, the log verbosity level can also be adjusted by setting the environment variable PBSLOGLEVEL to a value between 0 and 7. To change the log level of a running daemon dynamically, use the SIGUSR1 and SIGUSR2 signals, which increase and decrease the active loglevel by one, respectively. Signals are sent to a process using the kill command. For example, kill -USR1 `pgrep pbs_mom` raises the log level by one. The current loglevel for pbs_mom can be displayed with the command momctl -d 3.
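Because the daily logs are plain text files, a simple search can show on which days a given message appeared. The following is a minimal sketch; the function name count_log_matches is illustrative, and the log directory is passed in explicitly because TORQUE_HOME differs between installations.

```shell
#!/bin/sh
# count_log_matches PATTERN LOGDIR
# For every regular file in LOGDIR, print "FILE COUNT" when the fixed
# string PATTERN occurs on at least one line of that file.
count_log_matches() {
    pattern=$1
    logdir=$2
    for f in "$logdir"/*; do
        [ -f "$f" ] || continue
        # grep -c prints the number of matching lines; -F disables regex.
        # "|| true" keeps a zero-match result from aborting under "set -e".
        n=$(grep -c -F -- "$pattern" "$f" 2>/dev/null || true)
        if [ "${n:-0}" -gt 0 ]; then
            echo "$(basename "$f") $n"
        fi
    done
}
```

For example, count_log_matches 'premature end of message' /var/spool/torque/server_logs lists the daily server logs that contain that message, assuming TORQUE_HOME is /var/spool/torque.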
tracejob [-a|s|l|m|q|v|z] [-c count] [-w size] [-p path] [ -n <DAYS>] [-f filter_type] <JOBID>

-p : path to PBS_SERVER_HOME
-w : number of columns of your terminal
-n : number of days in the past to look for job(s) [default 1]
-f : filter out types of log entries, multiple -f's can be specified
     error, system, admin, job, job_usage, security, sched, debug, debug2, or absolute numeric hex equivalent
-z : toggle filtering excessive messages
-c : what message count is considered excessive
-a : don't use accounting log files
-s : don't use server log files
-l : don't use scheduler log files
-m : don't use MOM log files
-q : quiet mode - hide all error messages
-v : verbose mode - show more error messages
> tracejob -n 10 1131
Job: 1131.icluster.org
03/02/2005 17:58:28  S    enqueuing into batch, state 1 hop 1
03/02/2005 17:58:28  S    Job Queued at request of dev@icluster.org, owner = dev@icluster.org, job name = STDIN, queue = batch
03/02/2005 17:58:28  A    queue=batch
03/02/2005 17:58:41  S    Job Run at request of dev@icluster.org
03/02/2005 17:58:41  M    evaluating limits for job
03/02/2005 17:58:41  M    phase 2 of job launch successfully completed
03/02/2005 17:58:41  M    saving task (TMomFinalizeJob3)
03/02/2005 17:58:41  M    job successfully started
03/02/2005 17:58:41  M    job 1131.koa.icluster.org reported successful start on 1 node(s)
03/02/2005 17:58:41  A    user=dev group=dev jobname=STDIN queue=batch ctime=1109811508 qtime=1109811508 etime=1109811508 start=1109811521 exec_host=icluster.org/0 Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=00:01:40
03/02/2005 18:02:11  M    walltime 210 exceeded limit 100
03/02/2005 18:02:11  M    kill_job
03/02/2005 18:02:11  M    kill_job found a task to kill
03/02/2005 18:02:11  M    sending signal 15 to task
03/02/2005 18:02:11  M    kill_task: killing pid 14060 task 1 with sig 15
03/02/2005 18:02:11  M    kill_task: killing pid 14061 task 1 with sig 15
03/02/2005 18:02:11  M    kill_task: killing pid 14063 task 1 with sig 15
03/02/2005 18:02:11  M    kill_job done
03/02/2005 18:04:11  M    kill_job
03/02/2005 18:04:11  M    kill_job found a task to kill
03/02/2005 18:04:11  M    sending signal 15 to task
03/02/2005 18:06:27  M    kill_job
03/02/2005 18:06:27  M    kill_job done
03/02/2005 18:06:27  M    performing job clean-up
03/02/2005 18:06:27  A    user=dev group=dev jobname=STDIN queue=batch ctime=1109811508 qtime=1109811508 etime=1109811508 start=1109811521 exec_host=icluster.org/0 Resource_List.neednodes=1 Resource_List.nodect=1 Resource_List.nodes=1 Resource_List.walltime=00:01:40 session=14060 end=1109811987 Exit_status=265 resources_used.cput=00:00:00 resources_used.mem=3544kb resources_used.vmem=10632kb resources_used.walltime=00:07:46
...
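The third column of each tracejob entry is a one-letter source code; in the sample above, S, M, and A appear to mark server, MOM, and accounting entries. The sketch below shows two small helpers for reading such output. Both function names are illustrative, and the column layout is assumed to match the sample.

```shell
#!/bin/sh
# tracejob_source LETTER
# Read tracejob output on stdin and keep only the entries whose third
# whitespace-separated field (the source column) equals LETTER.
tracejob_source() {
    awk -v src="$1" '$3 == src'
}

# hms_to_seconds HH:MM:SS
# Convert a walltime such as 00:01:40 to a number of seconds.
# awk is used so that fields like "08" are not parsed as octal.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}
```

For example, tracejob -n 10 1131 | tracejob_source M keeps only the MOM entries, and hms_to_seconds 00:01:40 prints 100, which matches the limit reported in the "walltime 210 exceeded limit 100" entry of the sample.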
The tracejob command operates by searching the pbs_server accounting records and the pbs_server, MOM, and scheduler logs. To function properly, it must be run on a node, and as a user, that can access these files. By default, these files are accessible only to the user root and are available only on the cluster management node. The files required by tracejob are located under TORQUE_HOME, for example in the server_logs and mom_logs directories.
tracejob may only be used on systems where these files are made available. Non-root users may be able to use this command if the permissions on these directories or files are changed appropriately.
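Before running tracejob as a non-root user, it can help to confirm that the log directories are actually accessible. This is a minimal sketch: the function name check_log_dirs is illustrative, and it checks only the server_logs and mom_logs directories; the accounting and scheduler log locations would need to be added for a complete check.

```shell
#!/bin/sh
# check_log_dirs TORQUE_HOME
# Report whether the current user can read and enter the log directories
# under TORQUE_HOME. Prints "OK DIR" or "NO ACCESS DIR" for each one and
# returns non-zero if any directory is missing or inaccessible.
check_log_dirs() {
    home=$1
    status=0
    for sub in server_logs mom_logs; do
        dir=$home/$sub
        if [ -d "$dir" ] && [ -r "$dir" ] && [ -x "$dir" ]; then
            echo "OK $dir"
        else
            echo "NO ACCESS $dir"
            status=1
        fi
    done
    return $status
}
```

Called as check_log_dirs /var/spool/torque (assuming that is TORQUE_HOME), it prints one line per directory for the invoking user.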
See the PBSCOREDUMP parameter for enabling creation of core files.
When PBSDEBUG is set, some client commands will print additional diagnostic information.
$ export PBSDEBUG=yes
$ cmd
When debugging, it can be useful to see where in the code time is being spent. This is called profiling, and the Linux utility gprof outputs a listing of routines and the amount of time spent in each. This requires compiling the code with special options that instrument it and produce a file, gmon.out, which is written at the end of program execution.
The following listing shows how to build TORQUE with profiling enabled. Notice that the output file for pbs_mom will end up in the mom_priv directory because its startup code changes the default directory to this location.
# ./configure "CFLAGS=-pg -lgcov -fPIC"
# make -j5
# make install
# pbs_mom
... do some stuff for a while ...
# momctl -s
# cd /var/spool/torque/mom_priv
# gprof -b `which pbs_mom` gmon.out |less
Another way to see areas where a program is spending most of its time is with the valgrind program. The advantage of using valgrind is that the programs do not have to be specially compiled.
# valgrind --tool=callgrind pbs_mom
> qsig -s 0 <JOBID>
> momctl -c 58925 -h compute-5-20
> qmgr -c "set server mom_job_sync = True"
To check and see if this is already set, use:
> qmgr -c "p s"
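The printed server configuration can be checked mechanically. The sketch below assumes the attribute is printed in the same "set server NAME = VALUE" form as other server attributes in qmgr output; the function name mom_job_sync_enabled is illustrative. A missing line means only that the attribute was not printed, not necessarily that it is disabled.

```shell
#!/bin/sh
# mom_job_sync_enabled
# Read the output of `qmgr -c "p s"` on stdin and succeed (exit status 0)
# if it contains the line "set server mom_job_sync = True".
mom_job_sync_enabled() {
    grep -q '^set server mom_job_sync = True[[:space:]]*$'
}

# Example:
#   qmgr -c "p s" | mom_job_sync_enabled && echo "mom_job_sync is set"
```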
> qdel -p <JOBID>
If the mother superior MOM has been lost and cannot be recovered (for example, because of hardware or disk failure), a job running on that node can be purged from the output of qstat using the qdel -p command, or it can be removed manually using the following steps:
To remove job X:
> qterm
> rm TORQUE_HOME/server_priv/jobs/X.SC TORQUE_HOME/server_priv/jobs/X.JB
> pbs_server
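Since the manual procedure deletes files from the server's job directory, it can be worth listing the exact paths first. The following sketch only prints the two file names used in the steps above; the function name job_files is illustrative, and TORQUE_HOME is passed in explicitly because it differs between installations.

```shell
#!/bin/sh
# job_files TORQUE_HOME JOBID
# Print the two server-side files that the manual procedure removes for
# job JOBID, one per line, so they can be inspected before deletion.
job_files() {
    echo "$1/server_priv/jobs/$2.SC"
    echo "$1/server_priv/jobs/$2.JB"
}

# Example (X stands for the job to remove, as in the steps above):
#   job_files /var/spool/torque X | xargs ls -l
```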
If this message is displayed upon starting pbs_server, it means that the local database cannot be read. There are several possible reasons; the most likely is a version mismatch. Most versions of TORQUE can read each other's databases. However, there are a few incompatibilities between OpenPBS and TORQUE. Because of enhancements to TORQUE, it cannot read the job database of an OpenPBS server (job structure sizes have been altered to increase functionality). Also, a pbs_server compiled in 32-bit mode cannot read a database generated by a 64-bit pbs_server, and vice versa.
To reconstruct a database (excluding the job database), first print out the old data with this command:
%> qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch acl_host_enable = False
set queue batch resources_max.nodect = 6
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch resources_available.nodect = 18
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server managers = griduser@oahu.icluster.org
set server managers += scott@*.icluster.org
set server managers += wightman@*.icluster.org
set server operators = griduser@oahu.icluster.org
set server operators += scott@*.icluster.org
set server operators += wightman@*.icluster.org
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server resources_available.nodect = 80
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
Copy this information somewhere. Restart pbs_server with the following command:
> pbs_server -t create
When it prompts to overwrite the previous database, enter 'y', then enter the data exported by the qmgr command with a command similar to the following:
> cat data | qmgr
Restart pbs_server without the flags:
> qterm
> pbs_server
This will reinitialize the database to the current version. Note that reinitializing the server database will reset the next jobid to 1.
To get around this issue, the server can be told it has an inflated number of nodes using the resources_available attribute. To take effect, this attribute should be set on both the server and the associated queue, as in the example below. See resources_available for more information.
> qmgr
Qmgr: set server resources_available.nodect=2048
Qmgr: set queue batch resources_available.nodect=2048
The pbs_server daemon must be restarted before these changes take effect.
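The same two settings can be generated for any node count and queue and piped to qmgr instead of being typed interactively. This is a minimal sketch; the function name nodect_commands is illustrative, and the values 2048 and batch are taken from the example above.

```shell
#!/bin/sh
# nodect_commands COUNT QUEUE
# Print the qmgr directives that set resources_available.nodect to COUNT
# on the server and on QUEUE, one directive per line.
nodect_commands() {
    echo "set server resources_available.nodect=$1"
    echo "set queue $2 resources_available.nodect=$1"
}

# Example (run as a TORQUE manager; qmgr reads directives from stdin):
#   nodect_commands 2048 batch | qmgr
```

Printing the directives first, without the pipe, gives a chance to review them before they are applied.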
[guest@login2]$ qsub test.job
qsub: Bad UID for job execution
Job submission hosts must be explicitly specified within TORQUE or enabled via RCmd security mechanisms in order to be trusted. In the example above, the host 'login2' is not configured to be trusted. This process is documented in Configuring Job Submission Hosts.
Also verify the following on all machines:
If using a scheduler such as Moab or Maui, use a scheduler tool such as checkjob to identify job start issues.
Access can be managed by allowing passwordless rsh/ssh between the batch nodes while denying it from anywhere else, including the interactive nodes. This can be done with xinetd and sshd configuration (root is allowed to ssh everywhere). This way, the pvm daemons can be started and killed from the job script.
The problem is that this setup allows the users to bypass the batch system by writing a job script that uses rsh/ssh to launch processes on the batch nodes. If there are relatively few users and they can more or less be trusted, this setup can work.
If a node crashes or other major system failures occur, it is possible that a job may be stuck in a corrupt state on a compute node. TORQUE 2.2.0 and higher automatically handle this when the mom_job_sync parameter is set via qmgr (the default). For earlier versions of TORQUE, set this parameter and restart the pbs_mom daemon.
This error can also occur if not enough free space is available on the partition that holds TORQUE.
> qmgr -c 's s acl_roots+=root@*'
> qmgr -c "p s" | grep pbs_ver