Generic events are used to identify failures and other occurrences that Moab or other systems must be made aware of. This information may result in automated resource recovery, notifications, adjustments to statistics, or changes in policy. Generic events can also carry an arbitrary human-readable message that may be attached to associated objects or passed to administrators or external systems. Generic events typically signify the occurrence of a specific event, as opposed to generic metrics, which indicate a change in a measured value.
Using generic events, Moab can be configured to automatically address many failures and environmental changes, improving overall performance. Some sample events that sites may be interested in monitoring, recording, and taking action on include:
Generic events are defined in the moab.cfg file and have several different configuration options. The only required option is action.
The full list of configurable options for generic events is in the following table:
Attribute | Description |
---|---|
ACTION | Comma-delimited list of actions to be processed when a new event is received. |
ECOUNT | Number of events that must occur before launching action. Action will be launched each <ECOUNT> event if rearm is set. |
REARM | Minimum time between events specified in [[[DD:]HH:]MM:]SS format. |
SEVERITY | An arbitrary severity level from 1 through 4, inclusive. SEVERITY appears in the output of mdiag -n -v -v --xml. The severity level will not be used for any other purpose. |
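The [[[DD:]HH:]MM:]SS format used by REARM can be read right to left: seconds, then optional minutes, hours, and days. The following Python sketch converts such a value to seconds. The function name `parse_duration` is hypothetical and is not part of Moab; it only illustrates the format.

```python
def parse_duration(text: str) -> int:
    """Convert a [[[DD:]HH:]MM:]SS string to a number of seconds.

    Hypothetical helper for illustration only; not a Moab API.
    """
    parts = text.strip().split(":")
    if not 1 <= len(parts) <= 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a [[[DD:]HH:]MM:]SS value: {text!r}")
    # The rightmost field is seconds; fields to its left are
    # minutes, hours, and days, in that order.
    multipliers = (1, 60, 3600, 86400)
    return sum(int(p) * m for p, m in zip(reversed(parts), multipliers))


print(parse_duration("01:00:00"))  # 3600 (one hour)
print(parse_duration("00:05:00"))  # 300 (five minutes)
```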
The impact of the event is controlled using the ACTION attribute of the GEVENTCFG parameter. The ACTION attribute is comma-delimited and may include any combination of the actions in the following table:
Value | Description |
---|---|
DISABLE[:<OTYPE>:<OID>] | Marks event object (or specified object) down until event report is cleared. |
EXECUTE | Executes a script at the provided path. The value of EXECUTE is not contained in quotation marks. Arguments are allowed at the end of the path and are separated by question marks (?). Trigger variables (such as $OID) are allowed. |
NOTIFY | Notifies administrators of the event occurrence. |
OBJECTXMLSTDIN | If the EXECUTE action type is also specified, this flag passes an XML description of the firing gevent to the script. |
OFF | Powers off node or resource. |
ON | Powers on node or resource. |
PREEMPT[:<POLICY>] | Preempts workload associated with object (valid for node, job, reservation, partition, resource manager, user, group, account, class, QoS, and cluster objects). |
RECORD | Records events to the event log. The record action causes a line to be added to the event log regardless of whether or not RECORDEVENTLIST includes GEVENT. |
RESERVE[:<DURATION>] | Reserves node for specified duration (default: 24 hours). |
RESET | Resets object (valid for nodes - causes reboot). |
SIGNAL[:<SIGNO>] | Sends signal to associated jobs or services (valid for node, job, reservation, partition, resource manager, user, group, account, class, QoS, and cluster objects). |
This is an example of the XML description passed to the script when OBJECTXMLSTDIN is used with a gevent:
<gevent name="bob" statuscode="0" time="1320334763">Testing</gevent>
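A script started by the EXECUTE action could read this XML from standard input and pull out the fields it needs. The Python sketch below parses the sample shown above; it assumes the element carries only the `name`, `statuscode`, and `time` attributes seen in that sample, and the function name `parse_gevent` is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_gevent(xml_text: str) -> dict:
    """Parse a <gevent> element into a plain dictionary.

    Assumes the attributes shown in the sample (name, statuscode, time).
    """
    root = ET.fromstring(xml_text)
    if root.tag != "gevent":
        raise ValueError(f"unexpected root element: {root.tag}")
    return {
        "name": root.get("name"),
        "statuscode": root.get("statuscode"),
        "time": int(root.get("time")),   # epoch seconds in the sample
        "message": root.text or "",      # the human-readable message
    }


# In a real script the text would come from sys.stdin.read().
sample = '<gevent name="bob" statuscode="0" time="1320334763">Testing</gevent>'
event = parse_gevent(sample)
print(event["name"], event["time"], event["message"])
```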
In general, generic events are named, with the exception of those based on generic metrics. Names are used primarily to differentiate between different events and do not have any intrinsic meaning to Moab. It is suggested that the administrator choose names that denote specific meanings within the organization.
Example 11-9:
# Note: cpu failures require admin attention, create maintenance reservation
GEVENTCFG[cpufail] action=notify,record,disable,reserve rearm=01:00:00
# Note: power failures are transient, minimize future use
GEVENTCFG[powerfail] action=notify,record rearm=00:05:00
# Note: fs full can be automatically fixed
GEVENTCFG[fsfull] action=notify,execute:/home/jason/MyPython/cleartmp.py?$OID?nodefix
# Note: memory errors can cause invalid job results, clear node immediately
GEVENTCFG[badmem] action=notify,record,preempt,disable,reserve
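The EXECUTE value in the fsfull entry packs a script path and its arguments into one string separated by question marks. The Python sketch below shows how such a value breaks apart, assuming a plain split on `?`; how Moab tokenizes the value internally is not stated here, and `split_execute` is a hypothetical name.

```python
def split_execute(action: str):
    """Split an 'execute:<path>?<arg>?<arg>' value into (path, [args]).

    Illustrative only: assumes arguments are separated by a bare '?'.
    """
    prefix, sep, rest = action.partition(":")
    if prefix.lower() != "execute" or not sep or not rest:
        raise ValueError(f"not an execute action: {action!r}")
    path, *args = rest.split("?")
    return path, args


path, args = split_execute("execute:/home/jason/MyPython/cleartmp.py?$OID?nodefix")
print(path)  # /home/jason/MyPython/cleartmp.py
print(args)  # ['$OID', 'nodefix']
```

Here `$OID` is left as literal text; it is a trigger variable that Moab would substitute before the script runs.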
Generic Metric (GMetric) Events
GMetric events are generic events based on generic metrics. They are used for executing an action when a generic metric passes a defined threshold. Unlike named events, GMetric events are not named and use the following format:
GEVENTCFG[GMETRIC<COMPARISON>VALUE] ACTION=...
Example 11-10:
GEVENTCFG[cputemp>150] action=off
This form of generic events uses the GMetric name, as returned by a GMETRIC attribute in a native Resource Manager interface.
Only one generic event may be specified for any given generic metric.
Valid comparative operators are shown in the following table:
Type | Comparison | Notes |
---|---|---|
> | greater than | Numeric values only |
>= | greater than or equal to | Numeric values only |
== | equal to | Numeric values only |
< | less than | Numeric values only |
<= | less than or equal to | Numeric values only |
<> | not equal | Numeric values only |
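To make the semantics of these operators concrete, the Python sketch below parses an expression such as `cputemp>150` and tests a reading against it. It is an illustration, not Moab's parser: it assumes the two-character operators are written without an internal space and that metric names are simple identifiers. The names `parse_threshold` and `threshold_crossed` are hypothetical.

```python
import operator
import re

# Map each operator from the table to its numeric comparison.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    "<>": operator.ne,
}

# Two-character operators are listed first so ">=" is not read as ">".
PATTERN = re.compile(
    r"^(?P<name>[A-Za-z_]\w*)(?P<op>>=|<=|<>|==|>|<)(?P<value>-?\d+(?:\.\d+)?)$"
)


def parse_threshold(expr: str):
    """Split 'cputemp>150' into ('cputemp', '>', 150.0)."""
    match = PATTERN.match(expr)
    if match is None:
        raise ValueError(f"not a GMetric threshold: {expr!r}")
    return match["name"], match["op"], float(match["value"])


def threshold_crossed(expr: str, reading: float) -> bool:
    """Return True if the reading satisfies the threshold expression."""
    _, op, value = parse_threshold(expr)
    return COMPARATORS[op](reading, value)


print(parse_threshold("cputemp>150"))         # ('cputemp', '>', 150.0)
print(threshold_crossed("cputemp>150", 151))  # True
print(threshold_crossed("cputemp>150", 150))  # False
```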
Unlike generic metrics, generic events can optionally be configured at the global level to adjust rearm policies and other behaviors. In all cases, this is accomplished using the GEVENTCFG parameter.
To report an event associated with a job or node, use the native Resource Manager interface or the mjobctl or mnodectl commands. You can report generic events on the scheduler with the mschedctl command.
If using the native Resource Manager interface, use the GEVENT attribute as in the following example:
node001 GEVENT[hitemp]='temperature exceeds 150 degrees'
node017 GEVENT[fullfs]='/var/tmp is full'
The time at which the event occurred can be passed to Moab to prevent multiple processing of the same event. This is accomplished by specifying the event type in the format <GEVENTID>[:<EVENTTIME>], as in the following example:
node001 GEVENT[hitemp:1130325993]='temperature exceeds 150 degrees'
node017 GEVENT[fullfs:1130325142]='/var/tmp is full'
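A native Resource Manager script could build these lines with a small helper. The Python sketch below produces the `<GEVENTID>[:<EVENTTIME>]` form shown above; `format_gevent_line` is a hypothetical name, and because quoting rules for the message are not described here, the sketch simply refuses messages that contain a single quote.

```python
def format_gevent_line(node, event, message, event_time=None):
    """Build one native-interface line reporting a generic event.

    event_time is an optional epoch timestamp appended as ':<EVENTTIME>'.
    """
    if "'" in message:
        # Escaping rules are not documented here, so avoid guessing.
        raise ValueError("message must not contain a single quote")
    event_id = event if event_time is None else f"{event}:{int(event_time)}"
    return f"{node} GEVENT[{event_id}]='{message}'"


print(format_gevent_line("node001", "hitemp",
                         "temperature exceeds 150 degrees", 1130325993))
print(format_gevent_line("node017", "fullfs", "/var/tmp is full"))
```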
Using Generic Events for VM Detection
To enable Moab to detect a virtual machine (VM) reported by a generic event, do the following:
GEVENTCFG[NewVM] ACTION=execute:/opt/moab/AddVM.py,OBJECTXMLSTDIN
> mschedctl -c gevent -n NewVM -m "VM=newVMName"
With the OBJECTXMLSTDIN action set, Moab sends an XML description of the generic event to the script, so the message is passed through to it. The script then creates a VMTracking job to attach to the newly discovered VM.
The following sample Perl script submits a VMTracking job for the new VM:
#!/usr/bin/perl
# in moab.cfg: GEVENTCFG[NewVM] ACTION=execute:$TOOLSDIR/newvm_event.pl,OBJECTXMLSTDIN
# trigger gevent with: mschedctl -c gevent -n NewVM -m "VM=TestVM1"
# input to this script: <gevent name="NewVM" statuscode="0" time="1318500261">VM=TestVM1</gevent>
use strict;
my $vmidVarName = "preVMID";
my $vmTemplate = "existingVM";
my $vmOwner = "operator";
$ENV{MOABHOMEDIR} = '/opt/moab';
my $xml = join "", <STDIN>;
my ($vmid) = ($xml =~ m/VM=([^\<]+)\</);
if ( defined $vmid )
{
  my $cmd = qq| $ENV{MOABHOMEDIR}/bin/mvmctl -q $vmid --xml |;
  my $vmxml = `$cmd`;
  my ($hv, $os, $proc, $disk, $mem) = (undef, undef, undef, undef, undef);
  ($hv) = ($vmxml =~ m/CONTAINERNODE="([^"]+)"/);
  ($os) = ($vmxml =~ m/OS="([^"]+)"/);
  ($proc) = ($vmxml =~ m/RCPROC="([^"]+)"/);
  ($mem) = ($vmxml =~ m/RCMEM="([^"]+)"/);
  ($disk) = ($vmxml =~ m/RCDISK="([^"]+)"/);
  die "Error parsing VM XML. Invalid VMID $vmid or $hv || $os || $proc || $mem || $disk? "
    if ( ! defined $hv || !defined $os || !defined $proc || !defined $mem || !defined $disk );
  $cmd = qq| $ENV{MOABHOMEDIR}/bin/msub -l hostlist=$hv,os=$os,nodes=1:ppn=$proc,mem=$mem,file=$disk,template=$vmTemplate,VAR=$vmidVarName=$vmid --proxy=$vmOwner /dev/null |;
  my $msubout = `$cmd`;
  die "Error executing msub. Output is: $msubout " if ( $? );
}
else
{
  die "Error parsing VMID from GEVENT message ";
}
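The two central steps of the Perl script, extracting the VM name from the gevent message and assembling the msub request, can be sketched in Python as follows. This is an illustration under assumptions, not a replacement for the script: it does not call mvmctl or msub, the function names are hypothetical, and the hypervisor, OS, and resource values in the usage lines are made-up placeholders.

```python
import xml.etree.ElementTree as ET


def extract_vmid(gevent_xml: str):
    """Return the VM name from a message of the form 'VM=<name>', or None."""
    message = (ET.fromstring(gevent_xml).text or "").strip()
    key, sep, value = message.partition("=")
    if key != "VM" or not value:
        return None
    return value


def build_msub_args(vmid, hv, os_name, proc, mem, disk,
                    template="existingVM", owner="operator",
                    var_name="preVMID"):
    """Assemble the msub argument list used by the Perl script above."""
    resources = (
        f"hostlist={hv},os={os_name},nodes=1:ppn={proc},mem={mem},"
        f"file={disk},template={template},VAR={var_name}={vmid}"
    )
    return ["msub", "-l", resources, f"--proxy={owner}", "/dev/null"]


xml_in = '<gevent name="NewVM" statuscode="0" time="1318500261">VM=TestVM1</gevent>'
vmid = extract_vmid(xml_in)
print(vmid)  # TestVM1
# Placeholder values; a real script would take them from mvmctl -q output.
print(build_msub_args(vmid, "hv01", "linux", "2", "2048", "10240"))
```

Returning the command as a list rather than a single string keeps each argument intact if it is later handed to a process launcher.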
Each node will record the following about reported generic events:
Each event can be individually cleared, annotated, or deleted by cluster administrators using the mnodectl command.
Generic events are only available in Moab 4.5.0 and later.
Generic events may be manually created on a physical node or VM.
To add a generic event named event with the message "hello" to node02, do the following:
> mnodectl -m gevent=event:"hello" node02
To add a generic event named event with the message "hello" to myvm, do the following:
> mvmctl -m gevent=event:"hello" myvm