TORQUE Resource Manager
11.2 Compute Node Health Check

11.2.1 Compute Node Health Check Overview

TORQUE provides the ability to perform health checks on each compute node. If these checks fail, a failure message can be associated with the node and routed to the scheduler. Schedulers (such as Moab) can forward this information to administrators by way of scheduler triggers, make it available through scheduler diagnostic commands, and automatically mark the node down until the issue is resolved. (See the RMMSGIGNORE parameter in Appendix F of the Moab Workload Manager Administrator's Guide for more information.)

11.2.2 Configuring MOMs to Launch a Health Check

The health check feature is configured via the pbs_mom config file using the parameters described below:

Parameter | Format | Default | Description
$node_check_script | <STRING> | N/A | (required) Specifies the fully qualified pathname of the health check script to run.
$node_check_interval | <INTEGER> | 1 | (optional) Specifies the number of MOM intervals between health checks. By default, each MOM interval is 45 seconds long; this is controlled via the DEFAULT_SERVER_STAT_UPDATES #define located in TORQUE_HOME/src/resmom/mom_main.c. The integer may be followed by a list of event names (currently supported: jobstart and jobend; see the pbs_mom command page for more information).
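As a minimal sketch of what such a configuration might look like, the snippet below writes two directives to a scratch file for review. It assumes the two parameters are the pbs_mom directives $node_check_script and $node_check_interval and that the MOM config normally lives at TORQUE_HOME/mom_priv/config; the script path /usr/local/sbin/node_health.sh and the interval value are hypothetical.

```shell
#!/bin/sh
# Write the example to a scratch file rather than the live MOM config
# (assumed to be TORQUE_HOME/mom_priv/config; verify on your installation).
cfg="${TMPDIR:-/tmp}/mom_config.example"

# The quoted 'EOF' stops the shell from expanding the leading $ of each directive.
cat > "$cfg" <<'EOF'
$node_check_script /usr/local/sbin/node_health.sh
$node_check_interval 10
EOF

# Show the result for review before merging it into the real config.
cat "$cfg"
```

With the default 45-second MOM interval, an interval of 10 would run the check roughly every 450 seconds. Event names such as jobstart would follow the integer; the exact list syntax is described on the pbs_mom command page.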

11.2.3 Creating the Health Check Script

The health check script is executed directly by the pbs_mom daemon under the root user id. It must be accessible from the compute node and may be a script or a compiled executable program. It may make any needed system calls and execute any combination of system utilities, but it should not execute resource manager client commands. Also, as of TORQUE 1.0.1, the pbs_mom daemon blocks until the health check is completed and does not possess a built-in timeout. Consequently, it is advisable to keep the health check script's execution time short and to verify that the script will not block even under failure conditions.
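Because pbs_mom waits for the script to finish, one way to guard against a hung probe is to give each probe its own time limit. The sketch below assumes the GNU coreutils timeout utility is installed on the node; the bounded helper, the CHECK_TIMEOUT variable, and the /global path are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Maximum number of seconds any single probe may run (hypothetical default).
CHECK_TIMEOUT=${CHECK_TIMEOUT:-5}

# Run a command under a time limit and discard its output.
# Returns non-zero if the command fails or is stopped by timeout.
bounded() {
    timeout "$CHECK_TIMEOUT" "$@" > /dev/null 2>&1
}

# Hypothetical probe: listing a shared filesystem that may hang when unhealthy.
if ! bounded ls /global; then
    echo "ERROR /global missing or not responding within ${CHECK_TIMEOUT}s"
fi
```

This is only a partial safeguard: a process stuck in uninterruptible I/O may not respond to the signal that timeout sends, so the probes themselves should still be chosen with care.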

If the script detects a failure, it should print the keyword 'ERROR' to stdout, followed by an error message, before any other output. The message (up to 1024 characters) must be contained entirely on the same line as the ERROR keyword. The message is assigned to the 'message' attribute of the associated node.
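The output rules above (ERROR first, a single line, at most 1024 characters of message) can be enforced in one place by a small helper. This is a sketch only; report_error is a hypothetical function name, not part of TORQUE.

```shell
#!/bin/sh
# Print a failure report in the expected form: the ERROR keyword,
# then the message on the same line.
report_error() {
    # Fold embedded newlines into spaces and keep at most 1024 characters.
    msg=$(printf '%s' "$*" | tr '\n' ' ' | cut -c1-1024)
    echo "ERROR $msg"
}

report_error "cannot locate filesystem global"
```

Routing every failure through one function like this makes it harder for a stray multi-line message to end up in the node's 'message' attribute in truncated form.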

11.2.4 Adjusting Node State Based on the Health Check Output

If the health check reports an error, the node attribute 'message' is set to the error string returned. Cluster schedulers can be configured to adjust a given node's state based on this information. For example, by default, Moab sets a node's state to down if a node error message is detected and restores the state as soon as the failure disappears.

11.2.5 Example Health Check Script

As mentioned, the health check can be a shell script, a Perl or Python script, a compiled C executable, or anything else that can be run from the command line and write to stdout. The example below demonstrates a very simple health check:

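#!/bin/sh

# Report an error if no mounted filesystem matches "global".
# grep output is discarded so that ERROR is the first thing written to stdout.
if ! /bin/mount | grep global > /dev/null 2>&1
  then
    echo "ERROR cannot locate filesystem global"
  fi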
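Before pointing pbs_mom at a new script, it can help to run the script by hand and look at what it prints. The sketch below is a hypothetical helper, classify, that reads a script's output on stdin and reports whether the first line begins with the ERROR keyword. Treating only a leading ERROR as a failure is an assumption based on the output rule described in 11.2.3, not a description of how pbs_mom parses the output.

```shell
#!/bin/sh
# Read health check output on stdin and summarize it.
classify() {
    # Only the first line is inspected, since ERROR must precede other data.
    first=$(head -n 1)
    case "$first" in
        "ERROR "*) echo "unhealthy: ${first#ERROR }" ;;
        *)         echo "healthy" ;;
    esac
}

# Simulated script output; in practice, pipe the real script into classify.
echo "ERROR cannot locate filesystem global" | classify
# prints: unhealthy: cannot locate filesystem global
```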