Z.55 Creating the Health Check Script

The health check script is executed directly by the pbs_mom daemon under the root user id. It must be accessible from the compute node and may be a script or a compiled executable program. It may make any needed system calls and execute any combination of system utilities, but it should not execute resource manager client commands. Also, as of Torque 1.0.1, the pbs_mom daemon blocks until the health check completes and has no built-in timeout. Consequently, it is advisable to keep the script's execution time short and to verify that the script will not block even under failure conditions.
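Because pbs_mom waits for the script to finish, one way to keep a check from blocking is to put a hard time limit on every system utility it calls. The following is a minimal sketch in Python, not part of Torque; the helper name run_bounded and the 5-second default are assumptions chosen for illustration.

```python
import subprocess


def run_bounded(cmd, timeout_s=5.0):
    """Run cmd (a list of arguments) and return (ok, detail).

    The call gives up after timeout_s seconds, so a hung utility
    cannot stall the health check indefinitely.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # never wait for terminal input
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # fold stderr into the captured output
            timeout=timeout_s,          # child is killed when this expires
            text=True,
        )
    except subprocess.TimeoutExpired:
        return False, f"{cmd[0]} timed out after {timeout_s:g}s"
    except OSError as exc:
        # Raised when the program is missing or not executable.
        return False, f"{cmd[0]} could not be started: {exc}"
    if proc.returncode != 0:
        return False, f"{cmd[0]} exited with status {proc.returncode}"
    return True, proc.stdout.strip()
```

A check would call, for example, run_bounded(["df", "-P", "/tmp"]) and treat a False result as a failure. Note that a process stuck in uninterruptible I/O (for instance on a hung network file system) may not respond to being killed, so a time limit of this kind reduces the risk of blocking rather than eliminating it.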

By default, the EVENT: keyword in the script's output indicates success. If the script detects a failure, it should print the keyword ERROR to stdout, followed by an error message. The ERROR keyword must be printed to stdout before any other data, and the message that follows it must fit entirely on the same line. The message is assigned to the 'message' attribute of the associated node.

For the node health check script to log a positive run, the message the script returns must begin with the keyword EVENT:. Failure to do so may result in unexpected outcomes.
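The output rules above can be sketched as a small Python script. This is an illustrative outline only: the function names, the disk-space check, the 1 GiB threshold, and the wording of the messages are all assumptions, and a real script would substitute its own checks.

```python
import shutil
import sys


def format_report(failures, ok_message="node is healthy"):
    """Return the single line a health check should print to stdout."""
    if failures:
        # Join all failures, then collapse every run of whitespace
        # (including newlines) so the message stays on one line.
        message = " ".join("; ".join(failures).split())
        return f"ERROR {message}"
    return f"EVENT: {ok_message}"


def check_free_space(path="/", min_free_bytes=1 << 30):
    """Hypothetical check: return an error string, or None if the path is fine."""
    try:
        free = shutil.disk_usage(path).free
    except OSError as exc:
        return f"cannot stat {path}: {exc}"
    if free < min_free_bytes:
        return f"{path} has only {free} bytes free"
    return None


def main():
    failures = [msg for msg in (check_free_space(),) if msg]
    # Nothing else is written to stdout, so the keyword comes first.
    sys.stdout.write(format_report(failures) + "\n")


if __name__ == "__main__":
    main()
```

Collecting failures first and printing exactly once at the end keeps the keyword at the start of the output and avoids stray diagnostic lines appearing ahead of it.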

Both the ERROR and EVENT: keywords are case-insensitive.
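When testing a health check script before deploying it, it can help to confirm that its output starts with one of the two keywords. The helper below is a hypothetical test aid written in Python. It does not reproduce how pbs_mom parses output; it only applies the rules stated in this section (keyword first, message on the same line, case-insensitive match).

```python
def classify_first_line(stdout_text):
    """Classify captured script output as 'error', 'event', or 'unknown'.

    Only the first line is examined, and the keyword must be the very
    first thing on it. Returns a (kind, message) tuple.
    """
    lines = stdout_text.splitlines()
    first = lines[0] if lines else ""
    lowered = first.lower()  # keywords are matched case-insensitively
    if lowered.startswith("error"):
        return "error", first[len("error"):].strip()
    if lowered.startswith("event:"):
        return "event", first[len("event:"):].strip()
    return "unknown", first
```

An 'unknown' result for a script that is expected to succeed usually means the EVENT: prefix is missing or that some other text was printed ahead of the keyword.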


© 2016 Adaptive Computing