13.4 Notifying Administrators of Failures

13.4-A Enabling Administrator Email

Moab can automatically send email to administrators when certain events occur. To enable mail notification, set the MAILPROGRAM parameter either to DEFAULT or to the path of a locally available mail client. Once MAILPROGRAM is set, policies such as JOBREJECTPOLICY send email to administrators when they are set to a value of MAIL.
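As an illustration, the two settings described above might appear in the Moab configuration file as follows. This is a minimal sketch that assumes the usual one-parameter-per-line layout of moab.cfg; only the parameter names and values come from the text above.

```
# moab.cfg fragment (illustrative sketch)
# Use the default mail client for outgoing notifications
MAILPROGRAM      DEFAULT
# Email administrators when a job is rejected
JOBREJECTPOLICY  MAIL
```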

13.4-B Handling Events with the Notification Routine

Moab provides a basic event management facility through a notification program, which is called each time an event of interest occurs. Currently, most events are associated with failures of some kind, but use of this facility need not be limited to failures. The NOTIFICATIONPROGRAM parameter allows a site to specify the name of the program to run. This program is most often developed locally and designed to take action based on the event that has occurred.

The location of the notification program may be specified as a relative or absolute path. If a relative path is specified, Moab looks for the notification program relative to the $(INSTDIR)/tools directory. In all cases, Moab verifies at startup that the notification program exists and is executable, and disables it if it cannot be found or is not executable.
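Because Moab disables a notification program it cannot find or execute, a site may want to check the configured path before starting the scheduler. The following Python sketch mirrors the path rules described above. It is not Moab's own check; the function names and the example file name notify_handler.py are hypothetical.

```python
import os


def resolve_notification_program(configured_path, instdir):
    """Return the path to check for a NOTIFICATIONPROGRAM value.

    Absolute paths are used as given; relative paths are resolved against
    <instdir>/tools, following the rule described in the text.
    """
    if os.path.isabs(configured_path):
        return configured_path
    return os.path.join(instdir, "tools", configured_path)


def is_runnable(path):
    """Return True if path is an existing regular file that is executable."""
    return os.path.isfile(path) and os.access(path, os.X_OK)
```

For example, is_runnable(resolve_notification_program("notify_handler.py", instdir)) reports whether a relative setting would pass this check, where instdir is the site's Moab installation directory.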

The notification program's action may include steps such as reporting the event via email, adjusting scheduling parameters, rebooting a node, or even recycling the scheduler.

For most events, the notification program is called with command line arguments in a simple <EVENTTYPE>: <MESSAGE> format. The following event types are currently enabled:

Event Type | Format | Description
JOBCORRUPTION | <MESSAGE> | An active job is in an unexpected state or has one or more allocated nodes that are in unexpected states.
JOBHOLD | <MESSAGE> | A job hold has been placed on a job.
JOBWCVIOLATION | <MESSAGE> | A job has exceeded its wallclock limit.
RESERVATIONCORRUPTION | <MESSAGE> | Reservation corruption has been detected.
RESERVATIONCREATED | <RSVNAME> <RSVTYPE> <NAME> <PRESENTTIME> <STARTTIME> <ENDTIME> <NODECOUNT> | A new reservation has been created.
RESERVATIONDESTROYED | <RSVNAME> <RSVTYPE> <PRESENTTIME> <STARTTIME> <ENDTIME> <NODECOUNT> | A reservation has been destroyed.
RMFAILURE | <MESSAGE> | The interface to the resource manager has failed.
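A locally developed notification program typically begins by separating the event type from the rest of its arguments. The following Python sketch shows one way to do that for the formats listed above. It is an illustration, not a program shipped with Moab. The script name notify_handler.py, the function names, and the lowercase field names are hypothetical. It also assumes that rejoining the command line arguments with spaces recovers the original text, because the source does not say how the arguments are split, and it accepts the event type with or without a trailing colon.

```python
import re
import sys

# Field names for events whose arguments are positional rather than a
# free-form message, in the order given in the event table.
POSITIONAL_FIELDS = {
    "RESERVATIONCREATED": ["rsvname", "rsvtype", "name", "presenttime",
                           "starttime", "endtime", "nodecount"],
    "RESERVATIONDESTROYED": ["rsvname", "rsvtype", "presenttime",
                             "starttime", "endtime", "nodecount"],
}


def parse_event(argv):
    """Return (event_type, fields) parsed from the program's arguments."""
    # Assumption: joining the arguments recovers the original text whether
    # the scheduler passes one string or several words.
    text = " ".join(argv).strip()
    match = re.match(r"([A-Za-z_]+)\s*:?\s*(.*)", text, re.DOTALL)
    if match is None:
        return None, {}
    event_type = match.group(1).upper()
    rest = match.group(2).strip()
    names = POSITIONAL_FIELDS.get(event_type)
    if names is None:
        # Message-style events: keep the remaining text as one message.
        return event_type, {"message": rest}
    # Reservation events: pair each positional value with its field name.
    return event_type, dict(zip(names, rest.split()))


def main(argv):
    event_type, fields = parse_event(argv)
    if event_type is None:
        print("usage: notify_handler.py <EVENTTYPE>: <MESSAGE>", file=sys.stderr)
        return 1
    # Placeholder action: a real handler might send mail or adjust
    # scheduling parameters here, depending on the event type.
    print(f"{event_type} {fields}")
    return 0


if __name__ == "__main__":
    main(sys.argv[1:])
```

Keeping the parsing in its own function makes it straightforward to add per-event actions later, for example a dictionary that maps event types to handler functions.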

Perhaps the most valuable aspect of the notification program is that additional notifications can easily be inserted into Moab to handle site-specific issues. To do this, locate the appropriate routine, specify the correct conditional statement, and add a call to the routine notify(<MESSAGE>);.
