Generic events are used to identify failures and other occurrences that Moab or other systems must be made aware of. This information may result in automated resource recovery, notifications, adjustments to statistics, or changes in policy. Generic events can also carry an arbitrary human-readable message that may be attached to associated objects or passed to administrators or external systems. Generic events typically signify the occurrence of a specific event, as opposed to generic metrics, which indicate a change in a measured value.
Using generic events, Moab can be configured to automatically address many failures and environmental changes, improving overall performance. Some sample events that sites may be interested in monitoring, recording, and taking action on include:
Generic events are defined in the moab.cfg file and have several different configuration options. The only required option is ACTION.
The full list of configurable options for generic events is given in the following table:
Attribute | Description |
---|---|
ACTION | Comma-delimited list of actions to be processed when a new event is received. |
ECOUNT | Number of events that must occur before launching action. |
REARM | Minimum time between events specified in [[[DD:]HH:]MM:]SS format. |
SEVERITY | An arbitrary severity level from 1 through 4, inclusive. SEVERITY appears in the output of mdiag -n -v -v --xml. |
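The REARM value uses the [[[DD:]HH:]MM:]SS format, in which leading fields may be omitted. The following is a minimal Python sketch of how a site tool might convert such a value to seconds, for example to sanity-check a configuration before deploying it. The function name `parse_rearm` is hypothetical and is not part of Moab.

```python
def parse_rearm(spec: str) -> int:
    """Convert a [[[DD:]HH:]MM:]SS string to a number of seconds.

    Hypothetical helper for illustration; Moab's own parsing may differ.
    """
    parts = spec.split(":")
    if not 1 <= len(parts) <= 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a [[[DD:]HH:]MM:]SS value: {spec!r}")
    # The rightmost field is seconds; pad omitted leading fields with zeros.
    fields = [0] * (4 - len(parts)) + [int(p) for p in parts]
    days, hours, minutes, seconds = fields
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds
```

With this sketch, `parse_rearm("01:00:00")` yields 3600 and `parse_rearm("00:05:00")` yields 300.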
The impact of the event is controlled using the ACTION attribute of the GEVENTCFG parameter. The ACTION attribute is comma-delimited and may include any combination of the actions in the following table:
Value | Description |
---|---|
DISABLE[:<OTYPE>:<OID>] | Marks event object (or specified object) down until event report is cleared. |
EXECUTE | Executes a script at the provided path. Arguments are allowed at the end of the path and are separated by question marks (?). Trigger variables (such as $OID) are allowed. |
NOTIFY | Notifies administrators of the event occurrence. |
OFF | Powers off node or resource. |
ON | Powers on node or resource. |
PREEMPT[:<POLICY>] | Preempts workload associated with object (valid for node, job, reservation, partition, resource manager, user, group, account, class, QoS, and cluster objects). |
RECORD | Records events to the event log. The record action causes a line to be added to the event log regardless of whether or not RECORDEVENTLIST includes GEVENT. |
RESERVE[:<DURATION>] | Reserves node for specified duration (default: 24 hours). |
RESET | Resets object (valid for nodes - causes reboot). |
SIGNAL[:<SIGNO>] | Sends signal to associated jobs or services (valid for node, job, reservation, partition, resource manager, user, group, account, class, QoS, and cluster objects). |
In general, generic events are named, with the exception of those based on generic metrics. Names are used primarily to differentiate between different events and do not have any intrinsic meaning to Moab. It is suggested that the administrator choose names that denote specific meanings within the organization.
Example
    # Note: cpu failures require admin attention, create maintenance reservation
    GEVENTCFG[cpufail] action=notify,record,disable,reserve rearm=01:00:00

    # Note: power failures are transient, minimize future use
    GEVENTCFG[powerfail] action=notify,record rearm=00:05:00

    # Note: fs full can be automatically fixed
    GEVENTCFG[fsfull] action=notify,execute:/home/jason/MyPython/cleartmp.py?$OID?nodefix

    # Note: memory errors can cause invalid job results, clear node immediately
    GEVENTCFG[badmem] action=notify,record,preempt,disable,reserve
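To make the ACTION syntax concrete, here is a small Python sketch that splits an ACTION value on commas and then splits an EXECUTE argument on question marks, as described in the action table. It only illustrates the syntax. The helper names are hypothetical, and the upper-casing of action names is an assumption based on the examples using lowercase where the table uses uppercase.

```python
def parse_action_list(value: str) -> list[tuple[str, str]]:
    """Split a comma-delimited ACTION value into (NAME, argument) pairs.

    Hypothetical helper; not Moab's parser.
    """
    actions = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # ignore empty entries such as a trailing comma
        # Everything after the first colon is the action's argument, if any.
        name, _, argument = item.partition(":")
        actions.append((name.upper(), argument))
    return actions


def split_execute_argument(argument: str) -> tuple[str, list[str]]:
    """Split '<path>?<arg>?<arg>' into the script path and its arguments."""
    path, *args = argument.split("?")
    return path, args
```

Applied to the fsfull example, `parse_action_list` returns a NOTIFY entry with an empty argument and an EXECUTE entry whose argument splits into the path `/home/jason/MyPython/cleartmp.py` and the arguments `$OID` and `nodefix`. Substitution of trigger variables such as `$OID` is left to Moab and is not modeled here.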
GMetric events are generic events based on generic metrics. They are used for executing an action when a generic metric passes a defined threshold. Unlike other generic events, GMetric events are not named and use the following format:
GEVENTCFG[GMETRIC<COMPARISON>VALUE] ACTION=...
Example
GEVENTCFG[cputemp>150] action=off
This form of generic event uses the GMetric name, as returned by a GMETRIC attribute in a native Resource Manager interface.
Only one generic event may be specified for any given generic metric.
Valid comparative operators are shown in the following table:
Type | Comparison | Notes |
---|---|---|
> | greater than | Numeric values only |
>= | greater than or equal to | Numeric values only |
== | equal to | Numeric values only |
< | less than | Numeric values only |
<= | less than or equal to | Numeric values only |
<> | not equal | Numeric values only |
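The comparison semantics in the table can be sketched in Python as follows. This is an illustration of how a threshold key such as `cputemp>150` reads, not Moab's implementation. The function names and the assumption that metric names look like identifiers are hypothetical.

```python
import operator
import re

# Operators from the table above, mapped to Python comparison functions.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    "<>": operator.ne,
}

# Two-character operators are listed first so ">=" is not read as ">".
_KEY = re.compile(r"^\s*([A-Za-z_]\w*)\s*(>=|<=|<>|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_threshold(key: str) -> tuple[str, str, float]:
    """Split a key such as 'cputemp>150' into (metric, operator, value)."""
    match = _KEY.match(key)
    if match is None:
        raise ValueError(f"not a GMETRIC<COMPARISON>VALUE key: {key!r}")
    name, op, value = match.groups()
    return name, op, float(value)  # numeric values only, per the table


def threshold_crossed(key: str, reading: float) -> bool:
    """Return True if the reading satisfies the comparison in the key."""
    _, op, limit = parse_threshold(key)
    return COMPARATORS[op](reading, limit)
```

For the `cputemp>150` example, a reading of 151 satisfies the comparison and a reading of 150 does not.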
Unlike generic metrics, generic events can optionally be configured at the global level to adjust rearm policies and other behaviors. In all cases, this is accomplished using the GEVENTCFG parameter.
To report an event associated with a job or node, use the native Resource Manager interface or the mjobctl or mnodectl commands.
If using the native Resource Manager interface, use the GEVENT attribute as in the following example:
    node001 GEVENT[hitemp]='temperature exceeds 150 degrees'
    node017 GEVENT[fullfs]='/var/tmp is full'
The time at which the event occurred can be passed to Moab to prevent multiple processing of the same event. This is accomplished by specifying the event type in the format <GEVENTID>[:<EVENTTIME>], as in the following example:
    node001 GEVENT[hitemp:1130325993]='temperature exceeds 150 degrees'
    node017 GEVENT[fullfs:1130325142]='/var/tmp is full'
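A native Resource Manager script could generate lines in this format programmatically. The following Python sketch formats one report line per event, with the optional event time in epoch seconds to match the examples above. The function name `format_gevent` is hypothetical, and rejecting single quotes in the message is a conservative assumption, since the quoting rules are not described here.

```python
import time
from typing import Optional


def format_gevent(node: str, event: str, message: str,
                  event_time: Optional[float] = None) -> str:
    """Build a '<node> GEVENT[<id>[:<time>]]=...' report line.

    Hypothetical helper for a site-written native RM script.
    """
    if "'" in message:
        # Assumption: avoid messages that would break the single-quoted value.
        raise ValueError("message must not contain a single quote")
    tag = event if event_time is None else f"{event}:{int(event_time)}"
    return f"{node} GEVENT[{tag}]='{message}'"


if __name__ == "__main__":
    # Report an event stamped with the current time.
    print(format_gevent("node001", "hitemp",
                        "temperature exceeds 150 degrees", time.time()))
```

Passing the time at which the event was first detected, rather than the time of each report, is what lets repeated reports of the same event share one timestamp.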
The messages specified after GEVENT are routed to Moab Cluster Manager for graphical display and can be used to dynamically adjust scheduling behavior.
Each node will record the following about reported generic events:
Each event can be individually cleared, annotated, or deleted by cluster administrators using the mnodectl command.
Generic events are only available in Moab 4.5.0 and later.
Generic events may be manually created on a physical node or VM.
To add GEVENT "event" with message "hello" to node02, do the following:
> mnodectl -m gevent=event:"hello" node02
To add GEVENT "event" with message "hello" to myvm, do the following:
> mvmctl -m gevent=event:"hello" myvm
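When events are raised from a monitoring script rather than by hand, the same commands can be assembled as argument lists. The Python sketch below only builds the argument list, mirroring the two commands above. The function name `gevent_command` is hypothetical, and actually running the command (for example with `subprocess.run`) is left to the caller.

```python
import shlex


def gevent_command(target: str, event: str, message: str,
                   is_vm: bool = False) -> list[str]:
    """Return the argument list for adding a generic event to a node or VM.

    Hypothetical helper; it mirrors the mnodectl and mvmctl examples above.
    """
    tool = "mvmctl" if is_vm else "mnodectl"
    # As a list element, the message needs no shell quoting of its own.
    return [tool, "-m", f"gevent={event}:{message}", target]


if __name__ == "__main__":
    cmd = gevent_command("node02", "event", "hello")
    # shlex.join renders the list as a shell-safe command line for logging.
    print(shlex.join(cmd))
```

Building a list instead of a shell string avoids quoting problems when the message contains spaces.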