
11.8 Enabling Generic Events

Generic events are used to identify failures and other occurrences that Moab or other systems must be made aware of. This information may result in automated resource recovery, notifications, adjustments to statistics, or changes in policy. Generic events can also carry an arbitrary human-readable message that may be attached to associated objects or passed to administrators or external systems. Generic events typically signify the occurrence of a specific event, as opposed to generic metrics, which indicate a change in a measured value.

Using generic events, Moab can be configured to automatically address many failures and environmental changes, improving overall performance. Sample events that sites may be interested in monitoring, recording, and acting on include CPU failures, power failures, full file systems, memory errors, and high temperatures.

11.8-A Configuring Generic Events

Generic events are defined in the moab.cfg file with the GEVENTCFG parameter and have several configuration options. The only required option is ACTION.

The full list of configurable options for generic events is in the following table:

 

Attribute  Description
ACTION     Comma-delimited list of actions to be processed when a new event is received.
ECOUNT     Number of events that must occur before launching the action. If REARM is set, the action is launched on every <ECOUNT>th event.
REARM      Minimum time between events, specified in [[[DD:]HH:]MM:]SS format.
SEVERITY   An arbitrary severity level from 1 through 4, inclusive. SEVERITY appears in the output of mdiag -n -v -v --xml. The severity level is not used for any other purpose.
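
The REARM format [[[DD:]HH:]MM:]SS reads right to left as seconds, minutes, hours, and days. The following is a minimal Python sketch of that conversion, which may be useful for checking a value before it goes into moab.cfg. The function name rearm_to_seconds is hypothetical and is not part of Moab.

```python
def rearm_to_seconds(value: str) -> int:
    """Convert a [[[DD:]HH:]MM:]SS string to a number of seconds."""
    fields = value.split(":")
    if not 1 <= len(fields) <= 4 or not all(f.isdigit() for f in fields):
        raise ValueError(f"not a [[[DD:]HH:]MM:]SS value: {value!r}")
    # The rightmost field is seconds; each field to its left is the next larger unit.
    multipliers = (1, 60, 3600, 86400)
    return sum(int(f) * m for f, m in zip(reversed(fields), multipliers))


print(rearm_to_seconds("01:00:00"))  # 3600
print(rearm_to_seconds("00:05:00"))  # 300
```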

 

Action Types

The impact of the event is controlled using the ACTION attribute of the GEVENTCFG parameter. The ACTION attribute is comma-delimited and may include any combination of the actions in the following table:

 

Value                    Description
DISABLE[:<OTYPE>:<OID>]  Marks event object (or specified object) down until event report is cleared.
EXECUTE                  Executes a script at the provided path. The value of EXECUTE is not contained in quotation marks. Arguments are allowed at the end of the path and are separated by question marks (?). Trigger variables (such as $OID) are allowed.
NOTIFY                   Notifies administrators of the event occurrence.
OBJECTXMLSTDIN           If the EXECUTE action type is also specified, this flag passes an XML description of the firing gevent to the script.
OFF                      Powers off node or resource.
ON                       Powers on node or resource.
PREEMPT[:<POLICY>]       Preempts workload associated with object (valid for node, job, reservation, partition, resource manager, user, group, account, class, QoS, and cluster objects).
RECORD                   Records events to the event log. The record action causes a line to be added to the event log regardless of whether or not RECORDEVENTLIST includes GEVENT.
RESERVE[:<DURATION>]     Reserves node for specified duration (default: 24 hours).
RESET                    Resets object (valid for nodes - causes reboot).
SIGNAL[:<SIGNO>]         Sends signal to associated jobs or services (valid for node, job, reservation, partition, resource manager, user, group, account, class, QoS, and cluster objects).
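
To illustrate the question-mark convention for EXECUTE, the following Python sketch splits an EXECUTE value into a script path and its arguments and expands trigger variables from a lookup table. Moab performs this expansion itself; the function split_execute and the variables mapping are hypothetical and only show how such a value decomposes.

```python
def split_execute(value: str, variables: dict) -> list:
    """Split 'execute:<path>?<arg>?<arg>' into [path, arg, ...], expanding $NAME items."""
    prefix = "execute:"
    if not value.lower().startswith(prefix):
        raise ValueError(f"not an EXECUTE action: {value!r}")
    expanded = []
    for part in value[len(prefix):].split("?"):
        if part.startswith("$"):
            # Unknown variables are left untouched rather than guessed.
            part = variables.get(part[1:], part)
        expanded.append(part)
    return expanded


print(split_execute("execute:/home/jason/MyPython/cleartmp.py?$OID?nodefix", {"OID": "node017"}))
# ['/home/jason/MyPython/cleartmp.py', 'node017', 'nodefix']
```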

The following is an example of the XML description that a script receives for a gevent when OBJECTXMLSTDIN is set:

<gevent name="bob" statuscode="0" time="1320334763">Testing</gevent>
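
A handler script can read this XML with any XML parser. The following is a minimal Python sketch that uses the standard library's xml.etree.ElementTree. The function name parse_gevent is hypothetical, and the only attributes assumed are the ones visible in the sample above.

```python
import xml.etree.ElementTree as ET


def parse_gevent(xml_text: str) -> dict:
    """Return the name, status code, time, and message of a <gevent> element."""
    root = ET.fromstring(xml_text)
    if root.tag != "gevent":
        raise ValueError(f"expected <gevent>, got <{root.tag}>")
    return {
        "name": root.get("name"),
        "statuscode": root.get("statuscode"),
        "time": root.get("time"),
        "message": (root.text or "").strip(),
    }


sample = '<gevent name="bob" statuscode="0" time="1320334763">Testing</gevent>'
event = parse_gevent(sample)
print(event["name"], event["message"])  # bob Testing
```

In a real handler, the argument would be the text read from standard input (for example, sys.stdin.read()) instead of the sample string.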

Named Events

In general, generic events are named, with the exception of those based on generic metrics. Names are used primarily to differentiate between different events and do not have any intrinsic meaning to Moab. It is suggested that the administrator choose names that denote specific meanings within the organization.

Example 11-9:  

# Note: cpu failures require admin attention, create maintenance reservation
GEVENTCFG[cpufail] action=notify,record,disable,reserve rearm=01:00:00
# Note: power failures are transient, minimize future use
GEVENTCFG[powerfail] action=notify,record rearm=00:05:00
# Note: fs full can be automatically fixed
GEVENTCFG[fsfull] action=notify,execute:/home/jason/MyPython/cleartmp.py?$OID?nodefix
# Note: memory errors can cause invalid job results, clear node immediately
GEVENTCFG[badmem] action=notify,record,preempt,disable,reserve
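
Because ACTION is the only required attribute, a quick check of GEVENTCFG lines before restarting Moab can catch omissions. The following Python sketch parses one line into an event name and its attributes. It is not Moab's parser: the function parse_geventcfg is hypothetical, and it assumes attributes are separated by whitespace and contain no embedded spaces.

```python
import re

GEVENTCFG_RE = re.compile(r"^GEVENTCFG\[(?P<name>[^\]]+)\]\s+(?P<attrs>.*)$")


def parse_geventcfg(line: str) -> tuple:
    """Parse one GEVENTCFG line into (event name, {ATTRIBUTE: value})."""
    match = GEVENTCFG_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a GEVENTCFG line: {line!r}")
    attrs = {}
    for token in match.group("attrs").split():
        key, _, value = token.partition("=")
        attrs[key.upper()] = value
    if "ACTION" not in attrs:
        raise ValueError(f"{match.group('name')}: ACTION is required")
    # ACTION is comma-delimited; drop empty items left by stray commas.
    attrs["ACTION"] = [a for a in attrs["ACTION"].split(",") if a]
    return match.group("name"), attrs


name, attrs = parse_geventcfg(
    "GEVENTCFG[cpufail] action=notify,record,disable,reserve rearm=01:00:00"
)
print(name, attrs["ACTION"], attrs["REARM"])
```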

Generic Metric (GMetric) Events

GMetric events are generic events based on generic metrics. They are used for executing an action when a generic metric passes a defined threshold. Unlike named events, GMetric events are not named and use the following format:

GEVENTCFG[<GMETRIC><COMPARISON><VALUE>] ACTION=...

Example 11-10:  

GEVENTCFG[cputemp>150] action=off
			

This form of generic events uses the GMetric name, as returned by a GMETRIC attribute in a native Resource Manager interface.

Only one generic event may be specified for any given generic metric.

Valid comparison operators are shown in the following table:

Type  Comparison                Notes
>     greater than              Numeric values only
>=    greater than or equal to  Numeric values only
==    equal to                  Numeric values only
<     less than                 Numeric values only
<=    less than or equal to     Numeric values only
<>    not equal                 Numeric values only
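
As an illustration of how a GMetric threshold such as cputemp>150 breaks down, the following Python sketch splits a specification into metric name, operator, and value, and tests a reading against it. It is not Moab's implementation. The names parse_threshold and threshold_crossed are hypothetical, and the sketch assumes metric names consist of letters, digits, and underscores.

```python
import operator
import re

OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    "<>": operator.ne,
}

# Two-character operators are listed first so that ">=" is not read as ">".
THRESHOLD_RE = re.compile(
    r"^(?P<metric>[A-Za-z_]\w*)(?P<op>>=|<=|==|<>|>|<)(?P<value>-?\d+(?:\.\d+)?)$"
)


def parse_threshold(spec: str) -> tuple:
    """Split 'cputemp>150' into ('cputemp', '>', 150.0)."""
    match = THRESHOLD_RE.match(spec)
    if match is None:
        raise ValueError(f"not a GMetric threshold: {spec!r}")
    return match["metric"], match["op"], float(match["value"])


def threshold_crossed(spec: str, reading: float) -> bool:
    """Return True when the reading satisfies the comparison in spec."""
    _, op, limit = parse_threshold(spec)
    return OPERATORS[op](reading, limit)


print(parse_threshold("cputemp>150"))           # ('cputemp', '>', 150.0)
print(threshold_crossed("cputemp>150", 151.5))  # True
```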

 

11.8-B Reporting Generic Events

Unlike generic metrics, generic events can be optionally configured at the global level to adjust rearm policies and other behaviors. In all cases, this is accomplished using the GEVENTCFG parameter.

To report an event associated with a job or node, use the native Resource Manager interface or the mjobctl or mnodectl commands. You can report generic events on the scheduler with the mschedctl command.

If using the native Resource Manager interface, use the GEVENT attribute as in the following example:

node001 GEVENT[hitemp]='temperature exceeds 150 degrees'
node017 GEVENT[fullfs]='/var/tmp is full'
			

The time at which the event occurred can be passed to Moab to prevent multiple processing of the same event. This is accomplished by specifying the event type in the format <GEVENTID>[:<EVENTTIME>], as in the following example:

node001 GEVENT[hitemp:1130325993]='temperature exceeds 150 degrees'
node017 GEVENT[fullfs:1130325142]='/var/tmp is full'
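
A native Resource Manager script could produce such lines with a small helper. The following Python sketch formats one report line, with or without an event time. The function name gevent_line is hypothetical. The sketch assumes the event time is an integer timestamp like the ones in the example above, and it rejects single quotes in the message because this section does not describe an escaping rule.

```python
import time


def gevent_line(node: str, event_id: str, message: str, event_time=None) -> str:
    """Format one line that reports a generic event on a node."""
    if "'" in message:
        raise ValueError("message must not contain a single quote")
    # Append the optional event time in the <GEVENTID>[:<EVENTTIME>] form.
    key = event_id if event_time is None else f"{event_id}:{int(event_time)}"
    return f"{node} GEVENT[{key}]='{message}'"


print(gevent_line("node001", "hitemp", "temperature exceeds 150 degrees", 1130325993))
# node001 GEVENT[hitemp:1130325993]='temperature exceeds 150 degrees'

# Stamp a new report with the current time.
print(gevent_line("node017", "fullfs", "/var/tmp is full", time.time()))
```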
					

Using Generic Events for VM Detection

To enable Moab to detect a virtual machine (VM) reported by a generic event, do the following:

  1. Set up your resource manager to detect virtual machine creation and to submit a generic event to Moab.
  2. Configure moab.cfg to recognize a generic event.
    GEVENTCFG[NewVM] ACTION=execute:/opt/moab/AddVM.py,OBJECTXMLSTDIN
  3. Report the event.
    > mschedctl -c gevent -n NewVM -m "VM=newVMName"

    With the OBJECTXMLSTDIN action set, Moab sends an XML description of the generic event to the script, so the message (here, VM=newVMName) reaches the script. The script then creates a VMTracking job to attach to the newly discovered VM.

The following sample Perl script submits a VMTracking job for the new VM:

#!/usr/bin/perl

# in moab.cfg: GEVENTCFG[NewVM] ACTION=execute:$TOOLSDIR/newvm_event.pl,OBJECTXMLSTDIN
# trigger gevent with: mschedctl -c gevent -n NewVM -m "VM=TestVM1"
# input to this script: <gevent name="NewVM" statuscode="0" time="1318500261">VM=TestVM1</gevent>

use strict;
use warnings;

my $vmidVarName = "preVMID";    # job variable that will hold the VM ID
my $vmTemplate = "existingVM";  # job template requested from msub
my $vmOwner = "operator";       # user passed to msub --proxy

$ENV{MOABHOMEDIR} = '/opt/moab';

# Read the gevent XML from standard input and pull the VM ID out of the message.
my $xml = join "", <STDIN>;
my ($vmid) = ($xml =~ m/VM=([^\<]+)\</);
if ( defined $vmid )
{
	# Ask Moab for the VM's XML description and extract the attributes msub needs.
	my $cmd = qq| $ENV{MOABHOMEDIR}/bin/mvmctl -q $vmid --xml |;
	my $vmxml = `$cmd`;
	my ($hv) = ($vmxml =~ m/CONTAINERNODE="([^"]+)"/);
	my ($os) = ($vmxml =~ m/OS="([^"]+)"/);
	my ($proc) = ($vmxml =~ m/RCPROC="([^"]+)"/);
	my ($mem) = ($vmxml =~ m/RCMEM="([^"]+)"/);
	my ($disk) = ($vmxml =~ m/RCDISK="([^"]+)"/);
	die "Error parsing VM XML. Invalid VMID $vmid, or CONTAINERNODE, OS, RCPROC, RCMEM, or RCDISK is missing?\n"
		if ( !defined $hv || !defined $os || !defined $proc || !defined $mem || !defined $disk );

	# Submit the VMTracking job on behalf of the VM owner.
	$cmd = qq| $ENV{MOABHOMEDIR}/bin/msub -l hostlist=$hv,os=$os,nodes=1:ppn=$proc,mem=$mem,file=$disk,template=$vmTemplate,VAR=$vmidVarName=$vmid --proxy=$vmOwner /dev/null |;
	my $msubout = `$cmd`;
	die "Error executing msub. Output is:\n$msubout\n" if ( $? );
} else {
	die "Error parsing VMID from GEVENT message\n";
}

11.8-C Generic Events Attributes

Each node will record the following about reported generic events:

Each event can be individually cleared, annotated, or deleted by cluster administrators using the mnodectl command.

Generic events are only available in Moab 4.5.0 and later.

11.8-D Manually Creating Generic Events

Generic events may be manually created on a physical node or VM.

To add a generic event named event with the message "hello" to node02, do the following:

> mnodectl -m gevent=event:"hello" node02
			

To add a generic event named event with the message "hello" to myvm, do the following:

> mvmctl -m gevent=event:"hello" myvm
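
When these commands are issued from a script rather than typed at a prompt, building the argument list explicitly avoids shell quoting problems with messages that contain spaces. The following Python sketch builds the argument list for either command without running it. The function name gevent_command is hypothetical. The shell removes the quotation marks in the commands above, so the gevent argument built here has the same form as the one Moab receives from them.

```python
import shlex


def gevent_command(tool: str, event_name: str, message: str, target: str) -> list:
    """Build the argument list for 'mnodectl -m gevent=...' or 'mvmctl -m gevent=...'."""
    if tool not in ("mnodectl", "mvmctl"):
        raise ValueError(f"unsupported tool: {tool!r}")
    return [tool, "-m", f"gevent={event_name}:{message}", target]


argv = gevent_command("mnodectl", "event", "hello", "node02")
print(shlex.join(argv))  # mnodectl -m gevent=event:hello node02
```

On a host where Moab is installed, the list could be passed to subprocess.run(argv, check=True).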
			
