Moab Workload Manager

12.8 Enabling Generic Events

Generic events are used to identify failures and other occurrences that Moab or other systems must be made aware of. This information may result in automated resource recovery, notifications, adjustments to statistics, or changes in policy. Generic events can also carry an arbitrary human-readable message that may be attached to associated objects or passed to administrators or external systems. Generic events typically signify the occurrence of a specific event, as opposed to generic metrics, which indicate a change in a measured value.

Using generic events, Moab can be configured to automatically address many failures and environmental changes, improving overall performance. Sample events that sites may want to monitor, record, and act on include:

  • Machine Room Status
    • Excessive Room Temperature
    • Power Failure or Power Fluctuation
    • Chiller Health
  • Network File Server Status
    • Failed Network Connectivity
    • Server Hardware Failure
    • Full Network File System
  • Compute Node Status
    • Machine Check Event (MCE)
    • Network Card (NIC) Failure
    • Excessive Motherboard/CPU Temperature
    • Hard Drive Failures

12.8.1 Configuring Generic Events

Generic events are defined in the moab.cfg file and have several different configuration options. The only required option is action.

The full list of configurable options for generic events is shown in the following table:

Attribute  Description
ACTION     Comma-delimited list of actions to be processed when a new event is received.
ECOUNT     Number of events that must occur before launching action.
           Note Action will be launched every <ECOUNT> events if rearm is set.
REARM      Minimum time between events specified in [[[DD:]HH:]MM:]SS format.
SEVERITY   An arbitrary severity level from 1 through 4, inclusive. SEVERITY appears in the output of mdiag -n -v -v --xml.
           Note The severity level will not be used for any other purpose.
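
The [[[DD:]HH:]MM:]SS format can be easy to misread, so the following minimal Python sketch converts such a value into seconds. The function name rearm_to_seconds is hypothetical and is not part of Moab; the sketch only illustrates how the optional leading fields line up, under the assumption that every field is a non-negative integer.

```python
def rearm_to_seconds(spec: str) -> int:
    """Convert a [[[DD:]HH:]MM:]SS string into a number of seconds.

    Hypothetical helper for illustration only; Moab does its own parsing.
    """
    fields = spec.strip().split(":")
    if not 1 <= len(fields) <= 4 or not all(f.isdigit() for f in fields):
        raise ValueError(f"not a [[[DD:]HH:]MM:]SS value: {spec!r}")
    # Right-align the given fields against (days, hours, minutes, seconds),
    # because the leading fields are the optional ones.
    weights = (86400, 3600, 60, 1)[-len(fields):]
    return sum(int(f) * w for f, w in zip(fields, weights))


if __name__ == "__main__":
    for value in ("01:00:00", "00:05:00", "30", "1:00:00:00"):
        print(value, "->", rearm_to_seconds(value), "seconds")
```

Read this way, a rearm value of 01:00:00 is one hour and 00:05:00 is five minutes.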

12.8.1.1 Action Types

The impact of the event is controlled using the ACTION attribute of the GEVENTCFG parameter. The ACTION attribute is comma-delimited and may include any combination of the actions in the following table:

Value     Description
DISABLE   Marks event object (or specified object) down until event report is cleared.
EXECUTE   Executes a script at the provided path. Arguments are allowed at the end of the path and are separated by question marks (?). Trigger variables (such as $OID) are allowed.
NOTIFY    Notifies administrators of the event occurrence.
OFF       Powers off node or resource.
ON        Powers on node or resource.
PREEMPT   Preempts workload associated with object (valid for node, job, reservation, partition, resource manager, user, group, account, class, QoS, and cluster objects).
RECORD    Records events to the event log. The record action causes a line to be added to the event log regardless of whether or not RECORDEVENTLIST includes GEVENT.
RESERVE   Reserves node for specified duration (default: 24 hours).
RESET     Resets object (valid for nodes - causes reboot).
SIGNAL    Sends signal to associated jobs or services (valid for node, job, reservation, partition, resource manager, user, group, account, class, QoS, and cluster objects).
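
To make the question-mark argument syntax of the execute action concrete, here is a minimal Python sketch that splits an execute action string into a script path and its arguments. The helper name split_execute_action is hypothetical, and the sketch only mirrors the syntax described above; how Moab substitutes trigger variables and hands the arguments to the script is not shown here and should not be inferred from it.

```python
def split_execute_action(action: str):
    """Split 'execute:<path>?<arg>?<arg>' into (path, [args]).

    Hypothetical helper for illustration; it does not expand trigger
    variables such as $OID, which Moab substitutes itself.
    """
    keyword, sep, target = action.partition(":")
    if keyword.lower() != "execute" or not sep or not target:
        raise ValueError(f"not an execute action: {action!r}")
    # The first field is the script path; the rest are its arguments.
    path, *args = target.split("?")
    return path, args


if __name__ == "__main__":
    path, args = split_execute_action(
        "execute:/home/jason/MyPython/cleartmp.py?$OID?nodefix"
    )
    print("script:", path)
    print("arguments:", args)
```

In this reading, the script receives two arguments: the value Moab substitutes for $OID, followed by the literal string nodefix.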

12.8.1.2 Named Events

In general, generic events are named, with the exception of those based on generic metrics. Names are used primarily to differentiate between different events and do not have any intrinsic meaning to Moab. It is suggested that the administrator choose names that denote specific meanings within the organization.

Example

# Note: cpu failures require admin attention, create maintenance reservation
GEVENTCFG[cpufail] action=notify,record,disable,reserve rearm=01:00:00

# Note: power failures are transient, minimize future use
GEVENTCFG[powerfail] action=notify,record rearm=00:05:00

# Note: fs full can be automatically fixed
GEVENTCFG[fsfull] action=notify,execute:/home/jason/MyPython/cleartmp.py?$OID?nodefix

# Note: memory errors can cause invalid job results, clear node immediately 
GEVENTCFG[badmem] action=notify,record,preempt,disable,reserve

12.8.1.3 Generic Metric (GMetric) Events

GMetric events are generic events based on generic metrics. They execute an action when a generic metric passes a defined threshold. Unlike named events, GMetric events have no name of their own; they are identified by a threshold expression in the following format:

GEVENTCFG[GMETRIC<COMPARISON>VALUE] ACTION=...

Example

GEVENTCFG[cputemp>150] action=off

This form of generic event uses the GMetric name as returned by a GMETRIC attribute in a native Resource Manager interface.

Note Only one generic event may be specified for any given generic metric.

Valid comparison operators are shown in the following table:

Type  Comparison                Notes
>     greater than              Numeric values only
>=    greater than or equal to  Numeric values only
==    equal to                  Numeric values only
<     less than                 Numeric values only
<=    less than or equal to     Numeric values only
<>    not equal                 Numeric values only
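
As an illustration of how a threshold key such as cputemp>150 can be read, the following Python sketch splits a key into metric name, operator, and numeric value, then checks a reading against it. This is a sketch under stated assumptions, not Moab's implementation: the function names are hypothetical, and the operator spellings it accepts (>, >=, ==, <, <=, <>) are assumed rather than confirmed by this section, which shows only > in its example.

```python
import operator
import re

# Assumed operator spellings, mapped to Python comparison functions.
COMPARISONS = {
    ">=": operator.ge,
    "<=": operator.le,
    "<>": operator.ne,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}

# Two-character operators come first so ">=" is not read as ">".
_KEY = re.compile(
    r"^(?P<metric>[A-Za-z_]\w*)"
    r"(?P<op>>=|<=|<>|==|>|<)"
    r"(?P<value>-?\d+(?:\.\d+)?)$"
)


def parse_gmetric_key(key: str):
    """Split a key like 'cputemp>150' into (metric, operator, value)."""
    match = _KEY.match(key)
    if match is None:
        raise ValueError(f"not a GMETRIC<COMPARISON>VALUE key: {key!r}")
    return match["metric"], match["op"], float(match["value"])


def threshold_crossed(key: str, reading: float) -> bool:
    """Return True when the reading satisfies the key's comparison."""
    _, op, value = parse_gmetric_key(key)
    return COMPARISONS[op](reading, value)


if __name__ == "__main__":
    print(parse_gmetric_key("cputemp>150"))
    print(threshold_crossed("cputemp>150", 151.5))
    print(threshold_crossed("cputemp>150", 150.0))
```

Because only numeric values are valid, the sketch rejects any key whose right-hand side is not a number.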

12.8.2 Reporting Generic Events

Unlike generic metrics, generic events can optionally be configured at the global level to adjust rearm policies and other behaviors. In all cases, this is accomplished using the GEVENTCFG parameter.

To report an event associated with a job or node, use the native Resource Manager interface or the mjobctl or mnodectl commands.

If using the native Resource Manager interface, use the GEVENT attribute as in the following example:

node001 GEVENT[hitemp]='temperature exceeds 150 degrees'
node017 GEVENT[fullfs]='/var/tmp is full'

Note The time at which the event occurred can be passed to Moab to prevent the same event from being processed more than once. This is accomplished by specifying the event type in the format <GEVENTID>[:<EVENTTIME>], as in the following example:

node001 GEVENT[hitemp:1130325993]='temperature exceeds 150 degrees'
node017 GEVENT[fullfs:1130325142]='/var/tmp is full'

The messages specified after GEVENT are routed to Moab Cluster Manager for graphical display and can be used to dynamically adjust scheduling behavior.
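
A monitoring script that feeds a native Resource Manager interface might build these lines programmatically. The following Python sketch formats one line in the <GEVENTID>[:<EVENTTIME>] form shown above. The function name format_gevent_line is hypothetical, and the sketch assumes that the event time is an integer epoch timestamp (as the sample values suggest) and that messages contain no single quotes, since quoting rules are not described here.

```python
import time


def format_gevent_line(node, event_id, message, event_time=None):
    """Build '<node> GEVENT[<id>[:<time>]]='<message>''.

    Hypothetical helper; the epoch-seconds timestamp and the ban on
    single quotes in the message are assumptions of this sketch.
    """
    if "'" in message:
        raise ValueError("message must not contain a single quote")
    if event_time is None:
        tag = event_id
    else:
        tag = f"{event_id}:{int(event_time)}"
    return f"{node} GEVENT[{tag}]='{message}'"


if __name__ == "__main__":
    # Reproduces the first sample line, with and without a timestamp.
    print(format_gevent_line("node001", "hitemp",
                             "temperature exceeds 150 degrees"))
    print(format_gevent_line("node001", "hitemp",
                             "temperature exceeds 150 degrees", 1130325993))
    # A freshly detected event stamped with the current time.
    print(format_gevent_line("node017", "fullfs", "/var/tmp is full",
                             time.time()))
```

Stamping the time when the condition is first detected, rather than each time it is reported, is what lets repeated reports be recognized as the same event.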

12.8.3 Generic Events Attributes

Each node records the following about reported generic events:

  • status - whether the event is active
  • message - human-readable message associated with the event
  • count - number of event occurrences reported since statistics were cleared
  • time - time of the most recent event

Each event can be individually cleared, annotated, or deleted by cluster administrators using the mnodectl command.

Note Generic events are only available in Moab 4.5.0 and later.

12.8.4 Manually Creating Generic Events

Generic events may be manually created on a physical node or VM.

To add GEVENT "event" with message "hello" to node02, do the following:

> mnodectl -m gevent=event:"hello" node02

To add GEVENT "event" with message "hello" to myvm, do the following:

> mvmctl -m gevent=event:"hello" myvm
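
When events are raised from a script rather than typed at a prompt, it can help to build the command as an argument list so that messages containing spaces need no shell quoting. The Python sketch below assembles the same two commands shown above. The function name gevent_command is hypothetical; the sketch only constructs and prints the command, and running it requires a host where the Moab client commands are installed.

```python
import shlex


def gevent_command(tool, target, event_id, message):
    """Return the argument list for '<tool> -m gevent=<id>:<message> <target>'.

    Hypothetical helper. In the shell examples the double quotes around
    the message are removed by the shell, so the message is passed bare.
    """
    if tool not in ("mnodectl", "mvmctl"):
        raise ValueError(f"unsupported tool: {tool!r}")
    return [tool, "-m", f"gevent={event_id}:{message}", target]


if __name__ == "__main__":
    node_cmd = gevent_command("mnodectl", "node02", "event", "hello")
    vm_cmd = gevent_command("mvmctl", "myvm", "event", "hello")
    # Print shell-safe renderings of both commands.
    print(shlex.join(node_cmd))
    print(shlex.join(vm_cmd))
    # To actually run one of them where Moab is installed:
    #   import subprocess
    #   subprocess.run(node_cmd, check=True)
```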
