Frequently Asked Questions (FAQ)

Cannot connect to server: error=15034

This error occurs in TORQUE clients (or their APIs) because TORQUE cannot find the server_name file and/or the PBS_DEFAULT environment variable is not set. The server_name file or the PBS_DEFAULT variable indicates the hostname of the pbs_server that the client tools should communicate with. The server_name file is usually located in TORQUE's local state directory. Make sure the file exists, has proper permissions, and that the version of TORQUE you are running was built with the proper directory settings. Alternatively, you can set the PBS_DEFAULT environment variable. Restart the TORQUE daemons if you make changes to these settings.
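The check described above can be sketched as a small shell function. This is only an illustration: the function name check_torque_server is made up for this example, /var/spool/torque is assumed as the state directory (your build may use another path), and PBS_DEFAULT is checked first on the assumption that clients prefer it over the file.

```shell
#!/bin/sh
# Report where a TORQUE client would learn the pbs_server hostname.
# Assumption: the state directory defaults to /var/spool/torque.
check_torque_server() {
    torque_home="${1:-/var/spool/torque}"
    if [ -n "${PBS_DEFAULT:-}" ]; then
        # The environment variable is set, so report it.
        echo "PBS_DEFAULT=$PBS_DEFAULT"
    elif [ -r "$torque_home/server_name" ]; then
        # Fall back to the readable server_name file.
        echo "server_name=$(cat "$torque_home/server_name")"
    else
        echo "no server configured" >&2
        return 1
    fi
}
```

Calling check_torque_server with no argument inspects the assumed default directory; pass a different path if TORQUE was built with other directory settings.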

Deleting 'stuck' jobs

To manually delete a "stale" job which has no process, and for which the mother superior is still alive, sending a sig 0 with qsig will often cause MOM to realize the job is stale and issue the proper JobObit notice. Failing that, use momctl -c to forcefully cause MOM to purge the job. The following process should never be necessary:

If the mother superior MOM has been lost and cannot be recovered (i.e. hardware or disk failure), a job running on that node can be purged from the output of qstat using the qdel -p command or can be removed manually using the following steps:

To remove job X

  1. Shut down pbs_server.
    > qterm

  2. Remove job spool files.
    > rm TORQUE_HOME/server_priv/jobs/X.SC TORQUE_HOME/server_priv/jobs/X.JB

  3. Restart pbs_server.
    > pbs_server
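Because these steps delete server state, it can help to print the exact commands for review before running anything. The sketch below does only that; the function name print_job_removal is hypothetical, and /var/spool/torque is an assumed value for TORQUE_HOME.

```shell
#!/bin/sh
# Print (do not run) the commands for manually removing job X.
# Assumption: TORQUE_HOME defaults to /var/spool/torque.
print_job_removal() {
    job="$1"
    torque_home="${2:-/var/spool/torque}"
    echo "qterm"
    echo "rm $torque_home/server_priv/jobs/$job.SC $torque_home/server_priv/jobs/$job.JB"
    echo "pbs_server"
}
```

Review the printed commands, then run them by hand as root on the pbs_server host.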

Which user must run TORQUE?

TORQUE (pbs_server & pbs_mom) must be started by a user with root privileges.

Scheduler cannot run jobs - rc: 15003

For a scheduler, such as Moab or Maui, to control jobs with TORQUE, the scheduler needs to be run by a user in the server operators / managers list (see qmgr). The default for the server operators / managers list is root@localhost. For TORQUE to be used in a grid setting with Silver, the scheduler needs to be run as root.
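As a rough sketch, the qmgr directives that add a scheduler account to both lists can be generated and piped into qmgr. The function name and the account moab@head.example.org are hypothetical; the directive syntax follows the "set server managers +=" form used elsewhere in this FAQ.

```shell
#!/bin/sh
# Emit qmgr directives that add one user@host to the managers and operators lists.
# The account passed in is an example value, not a TORQUE default.
print_scheduler_acl() {
    principal="$1"
    echo "set server managers += $principal"
    echo "set server operators += $principal"
}
```

A possible use, run as an existing manager: print_scheduler_acl moab@head.example.org | qmgr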

PBS_Server: pbsd_init, Unable to read server database

If this message is displayed upon starting pbs_server, it means that the local database cannot be read. This can happen for several reasons. The most likely is a version mismatch. Most versions of TORQUE can read each other's databases. However, there are a few incompatibilities between OpenPBS and TORQUE. Because of enhancements to TORQUE, it cannot read the job database of an OpenPBS server (job structure sizes have been altered to increase functionality). Also, a pbs_server compiled in 32-bit mode cannot read a database generated by a 64-bit pbs_server and vice versa.

To reconstruct a database (excluding the job database)

  1. First, print out the old data with this command:
    > qmgr -c "p s"

    #
    # Create queues and set their attributes.
    #
    #
    # Create and define queue batch
    #
    create queue batch
    set queue batch queue_type = Execution
    set queue batch acl_host_enable = False
    set queue batch resources_max.nodect = 6
    set queue batch resources_default.nodes = 1
    set queue batch resources_default.walltime = 01:00:00
    set queue batch resources_available.nodect = 18
    set queue batch enabled = True
    set queue batch started = True
    #
    # Set server attributes.
    #
    set server scheduling = True
    set server managers = [email protected]
    set server managers += scott@*.icluster.org
    set server managers += wightman@*.icluster.org
    set server operators = [email protected]
    set server operators += scott@*.icluster.org
    set server operators += wightman@*.icluster.org
    set server default_queue = batch
    set server log_events = 511
    set server mail_from = adm
    set server resources_available.nodect = 80
    set server node_ping_rate = 300
    set server node_check_rate = 600
    set server tcp_timeout = 6

  2. Copy this information somewhere.
  3. Restart pbs_server with the following command:
    > pbs_server -t create

  4. When you are prompted to overwrite the previous database, enter y, then enter the data exported by the qmgr command as in this example:
    > cat data | qmgr

  5. Restart pbs_server without the flags:
    > qterm
    > pbs_server

    This will reinitialize the database to the current version.

    Reinitializing the server database will reset the next jobid to 1.
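Before replaying a saved dump into a freshly created database, it may be worth confirming that every queue configured in the file is also created in it, since a "set queue" line for a queue that does not exist would presumably be rejected. The sketch below is one way to check this; the function name dump_is_replayable is hypothetical, and the file name data matches the example above.

```shell
#!/bin/sh
# Check that each queue named in a "set queue" line of a qmgr dump
# also has an uncommented "create queue" line in the same file.
dump_is_replayable() {
    dump="$1"
    for q in $(awk '$1 == "set" && $2 == "queue" {print $3}' "$dump" | sort -u); do
        # -x matches the whole line, -F treats the queue name literally.
        grep -qxF "create queue $q" "$dump" || {
            echo "missing: create queue $q" >&2
            return 1
        }
    done
}
```

For example, dump_is_replayable data && cat data | qmgr replays the file only when the check passes.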

qsub will not allow the submission of jobs requesting many processors

TORQUE's definition of a node is context sensitive and can appear inconsistent. The qsub -l nodes=<X> expression can at times indicate a request for X processors and at other times be interpreted as a request for X nodes. While qsub allows multiple interpretations of the keyword nodes, aspects of the TORQUE server's logic are not so flexible. Consequently, if a job is using -l nodes to specify processor count and the requested number of processors exceeds the available number of physical nodes, the server daemon will reject the job.

To get around this issue, the server can be told it has an inflated number of nodes using the resources_available attribute. To take effect, this attribute should be set on both the server and the associated queue as in the example below. (See resources_available for more information.)

> qmgr
Qmgr: set server resources_available.nodect=2048
Qmgr: set queue batch resources_available.nodect=2048

The pbs_server daemon will need to be restarted before these changes will take effect.
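When several queues need the same inflated count, the directives can be generated rather than typed. This is a minimal sketch: the function name print_nodect_directives is hypothetical, and the queue names are whatever your site uses.

```shell
#!/bin/sh
# Emit qmgr directives that set resources_available.nodect on the server
# and on every queue named after the first argument.
print_nodect_directives() {
    nodect="${1:-}"
    [ "$#" -gt 0 ] && shift
    case "$nodect" in
        ''|*[!0-9]*) echo "nodect must be a whole number" >&2; return 1 ;;
    esac
    echo "set server resources_available.nodect=$nodect"
    for q in "$@"; do
        echo "set queue $q resources_available.nodect=$nodect"
    done
}
```

A possible use: print_nodect_directives 2048 batch | qmgr, followed by the pbs_server restart.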

qsub reports 'Bad UID for job execution'

[guest@login2]$ qsub test.job
qsub: Bad UID for job execution

Job submission hosts must be explicitly specified within TORQUE or enabled via RCmd security mechanisms in order to be trusted. In the example above, the host 'login2' is not configured to be trusted. This process is documented in Configuring Job Submission Hosts.

Why does my job keep bouncing from running to queued?

There are several reasons why a job will fail to start. Do you see any errors in the MOM logs? Be sure to increase the loglevel on MOM if you don't see anything. Also be sure TORQUE is configured with --enable-syslog and look in /var/log/messages (or wherever your syslog writes).

Also verify the following on all machines:

If using a scheduler such as Moab or Maui, use a scheduler tool such as checkjob to identify job start issues.
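To look for errors in the MOM logs, it is usually enough to search today's log on the compute node for the job ID. The sketch below assumes that MOM logs live in TORQUE_HOME/mom_logs with one file per day named YYYYMMDD, and that TORQUE_HOME is /var/spool/torque; the function name mom_log_for_job is made up for this example.

```shell
#!/bin/sh
# Print today's MOM log lines that mention a given job ID.
# Assumptions: logs are at TORQUE_HOME/mom_logs/YYYYMMDD and
# TORQUE_HOME defaults to /var/spool/torque.
mom_log_for_job() {
    job="$1"
    torque_home="${2:-/var/spool/torque}"
    log="$torque_home/mom_logs/$(date +%Y%m%d)"
    [ -r "$log" ] || { echo "cannot read $log" >&2; return 1; }
    # -F searches for the job ID as a literal string.
    grep -F -- "$job" "$log"
}
```

Run it on the node where the job was sent, for example mom_log_for_job 1234.head, substituting the job ID reported by qstat.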

How do I use PVM with TORQUE?

Access can be managed by allowing rsh/ssh without passwords between the batch nodes while denying it from anywhere else, including the interactive nodes. This can be done with xinetd and sshd configuration (root is allowed to ssh everywhere). This way, the PVM daemons can be started and killed from the job script.

The problem is that this setup allows the users to bypass the batch system by writing a job script that uses rsh/ssh to launch processes on the batch nodes. If there are relatively few users and they can more or less be trusted, this setup can work.

My build fails attempting to use the TCL library

TORQUE builds can fail on TCL dependencies even if a version of TCL is available on the system. TCL is only utilized to support the xpbsmon client. If your site does not use this tool (most sites do not use xpbsmon), you can work around this failure by rerunning configure with the --disable-gui argument.

My job will not start, failing with the message 'cannot send job to mom, state=PRERUN'

If a node crashes or other major system failures occur, it is possible that a job may be stuck in a corrupt state on a compute node. TORQUE 2.2.0 and higher automatically handle this when the mom_job_sync parameter is set via qmgr (the default). For earlier versions of TORQUE, set this parameter and restart the pbs_mom daemon.

This error can also occur if not enough free space is available on the partition that holds TORQUE.
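A quick way to rule out the disk space cause is to check the free space on the partition that holds TORQUE. The sketch below assumes /var/spool/torque as that location; the function name torque_free_kb is hypothetical.

```shell
#!/bin/sh
# Print the free space, in kilobytes, on the partition holding a directory.
# Assumption: TORQUE lives under /var/spool/torque unless a path is given.
torque_free_kb() {
    dir="${1:-/var/spool/torque}"
    # -P forces one line per filesystem; column 4 is the available space.
    df -Pk "$dir" | awk 'NR == 2 {print $4}'
}
```

Run it on both the pbs_server host and the affected compute node, since either side could be the one that is full.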

How do I determine what version of TORQUE I am using?

There are times when you want to find out what version of TORQUE you are using. An easy way to do this is to run the following command:

> qmgr -c "p s" | grep pbs_ver

How do I resolve autogen.sh errors that contain "error: possibly undefined macro: AC_MSG_ERROR"?

Verify the pkg-config package is installed.

How do I resolve compile errors with libssl or libcrypto for TORQUE 4.0 on Ubuntu 10.04?

When compiling TORQUE 4.0 on Ubuntu 10.04 the following errors might occur:

libtool: link: gcc -Wall -pthread -g -D_LARGEFILE64_SOURCE -o .libs/trqauthd trq_auth_daemon.o trq_main.o -ldl -lssl -lcrypto -L/home/adaptive/torques/torque-4.0.0/src/lib/Libpbs/.libs /home/adaptive/torques/torque-4.0.0/src/lib/Libpbs/.libs/libtorque.so -lpthread -lrt -pthread
/usr/bin/ld: cannot find -lssl
collect2: ld returned 1 exit status
make[3]: *** [trqauthd] Error 1

libtool: link: gcc -Wall -pthread -g -D_LARGEFILE64_SOURCE -o .libs/trqauthd trq_auth_daemon.o trq_main.o -ldl -lssl -lcrypto -L/home/adaptive/torques/torque-4.0.0/src/lib/Libpbs/.libs /home/adaptive/torques/torque-4.0.0/src/lib/Libpbs/.libs/libtorque.so -lpthread -lrt -pthread
/usr/bin/ld: cannot find -lcrypto
collect2: ld returned 1 exit status
make[3]: *** [trqauthd] Error 1

To resolve the compile issue, use these commands:

> cd /usr/lib
> ln -s /lib/libcrypto.so.0.9. libcrypto.so
> ln -s /lib/libssl.so.0.9.8 libssl.so

Why are there so many error messages in the client logs (trqauthd logs) when I don't notice client commands failing?

If a client makes a connection to the server and the trqauthd connection for that client command is authorized before the client's connection, the trqauthd connection is rejected. The connection is retried, but if all retry attempts are rejected, trqauthd logs a message indicating a failure. Some client commands then open a new connection to the server and try again. The client command fails only if all its retries fail.
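The retry behavior described above explains why logged failures do not always surface as failed commands. The following generic shell loop illustrates the pattern only; it is not TORQUE's implementation, and the function name retry is made up for this example.

```shell
#!/bin/sh
# Run a command up to N times and succeed as soon as one attempt succeeds.
# Individual failed attempts may still be logged by the command itself.
retry() {
    attempts="$1"
    shift
    n=1
    while [ "$n" -le "$attempts" ]; do
        "$@" && return 0
        n=$((n + 1))
    done
    # Every attempt failed, so the caller sees a failure.
    return 1
}
```

With this pattern, a command that fails twice and succeeds on the third attempt reports success overall, even though two failures were recorded along the way.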

© 2015 Adaptive Computing