13.2 Logging Facilities

The Moab Workload Manager provides the ability to produce detailed logging of all of its activities. This is accomplished using verbose server logging, event logging, and system logging facilities.

13.2-A Log Facility Configuration

The LOGFILE and LOGDIR parameters in the moab.cfg file specify the destination of this logging information. Logging information is written to the file <MOABHOMEDIR>/<LOGDIR>/<LOGFILE> unless <LOGDIR> or <LOGFILE> is specified using an absolute path. If the log file is not specified or points to an invalid file, all logging information is directed to STDERR. However, because of the sheer volume of information that can be logged, this is not recommended in production. By default, LOGDIR and LOGFILE are set to log and moab.log respectively, resulting in scheduler logs being written to <MOABHOMEDIR>/log/moab.log.
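
The path rules above can be sketched in a few lines of Python. This is only an illustration, not Moab's own logic: the function name resolve_log_path and the home directory /opt/moab are hypothetical, and the sketch assumes that an absolute LOGFILE takes precedence over LOGDIR.

```python
from pathlib import PurePosixPath


def resolve_log_path(home_dir, log_dir="log", log_file="moab.log"):
    """Return the log path implied by MOABHOMEDIR, LOGDIR, and LOGFILE."""
    file_path = PurePosixPath(log_file)
    if file_path.is_absolute():
        # Assumption: an absolute LOGFILE is used as is and LOGDIR is ignored.
        return str(file_path)
    dir_path = PurePosixPath(log_dir)
    if dir_path.is_absolute():
        # An absolute LOGDIR is not placed under the Moab home directory.
        return str(dir_path / file_path)
    # Both values are relative, so they are resolved under the home directory.
    return str(PurePosixPath(home_dir) / dir_path / file_path)
```

With the defaults, resolve_log_path("/opt/moab") gives /opt/moab/log/moab.log, which matches the default location described above.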

The parameter LOGFILEMAXSIZE determines how large the log file may become before it is rolled; it is set to 10 MB by default. The parameter LOGFILEROLLDEPTH controls the number of old logs maintained and defaults to 3. Rolled log files have a numeric suffix appended indicating their order.

The parameter LOGLEVEL controls the verbosity of the information. Currently, LOGLEVEL values between 0 and 9 are used to control the amount of information logged, with 0 being the most terse, logging only the most severe problems detected, while 9 is the most verbose, commenting on just about everything. The amount of information provided at each log level is approximately an order of magnitude greater than what is provided at the log level immediately below it. A LOGLEVEL of 2 will record virtually all critical messages, while a LOGLEVEL of 4 will provide general information describing all actions taken by the scheduler. If a problem is detected, you may want to increase the LOGLEVEL value to get more details. However, doing so will cause the logs to roll faster and will also clutter the logs with possibly unrelated information. Also be aware that high LOGLEVEL values cause large volumes of possibly unnecessary file I/O on the scheduling machine. Consequently, high LOGLEVEL values are not recommended unless tracking a problem or similar circumstances warrant the I/O cost.
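
To get a rough feel for the "order of magnitude per level" rule of thumb, the following Python sketch estimates how log volume scales between two levels. The function name is hypothetical and the factor of 10 per level is only the approximation stated above, not a measured value.

```python
def relative_log_volume(base_level, new_level):
    """Estimate the volume at new_level relative to base_level."""
    for level in (base_level, new_level):
        if not 0 <= level <= 9:
            raise ValueError("LOGLEVEL values range from 0 to 9")
    # Rule of thumb from the text: roughly 10x more output per level.
    return 10.0 ** (new_level - base_level)
```

Under this approximation, moving from LOGLEVEL 2 to LOGLEVEL 4 would produce on the order of 100 times as much log output, which is why logs roll much faster at high levels.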

If high log levels are desired for an extended period of time and your Moab home directory is located on a network file system, performance may be improved by moving your log directory to a local file system using the LOGDIR parameter.

A final log-related parameter is LOGFACILITY. It can be used to focus logging on a subset of scheduler activities and is specified as a list of one or more scheduling facilities, as listed in the parameters documentation.

Example 13-1:  

# moab.cfg
# allow up to 30 100MB logfiles
LOGLEVEL         5
LOGDIR           /var/tmp/moab
LOGFILEMAXSIZE   100000000
LOGFILEROLLDEPTH 30
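
When sizing the file system that holds LOGDIR, it can help to estimate the worst-case space used by the log files. The Python sketch below does this arithmetic for the settings in the example. It assumes every rolled file is close to LOGFILEMAXSIZE and, by default, that the active log file is kept in addition to the rolled ones; the function name is hypothetical.

```python
def max_log_disk_usage(max_size_bytes, roll_depth, count_active_log=True):
    """Return an upper bound, in bytes, on space used by scheduler logs."""
    # Assumption: the active log exists alongside the rolled copies.
    file_count = roll_depth + (1 if count_active_log else 0)
    return file_count * max_size_bytes


# Settings from the example: LOGFILEMAXSIZE 100000000, LOGFILEROLLDEPTH 30
budget = max_log_disk_usage(100_000_000, 30)
```

For the example configuration this comes to about 3.1 GB, or 3.0 GB if only the rolled files are counted.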

The logging that occurs is of the following major types: subroutine information, status information, scheduler warnings, scheduler alerts, and scheduler errors.

13.2-B Status Information

Critical internal status is indicated at low LOGLEVELs while less critical and more verbose status information is logged at higher LOGLEVELs. For example:

INFO:     job orion.4228 rejected (max user jobs)
INFO:     job fr4n01.923.0 rejected (maxjobperuser policy failure)

13.2-C Scheduler Warnings

Warnings are logged when the scheduler detects an unexpected value or receives an unexpected result from a system call or subroutine. These messages are not necessarily indicative of problems and are not catastrophic to the scheduler. Most warnings are reported at loglevel 0 to loglevel 3. For example:

WARNING:  cannot open fairshare data file '/opt/moab/stats/FS.87000'

13.2-D Scheduler Alerts

Alerts are logged when the scheduler detects unexpected events that may indicate problems in other systems or in objects. They are typically more severe than warnings and may need to be brought to the attention of scheduler administrators. Most alerts are reported at loglevel 0 to loglevel 2. For example:

ALERT:    job orion.72 cannot run.  deferring job for 360 Seconds

13.2-E Scheduler Errors

Errors are logged when the scheduler detects problems that impact its ability to properly schedule the cluster. Moab will try to remedy or mitigate the problem as best it can, but the problem may be outside of its sphere of control. Errors should definitely be monitored by administrators. Most errors are reported at loglevel 0 to loglevel 1. For example:

ERROR:    cannot connect to Loadleveler API

13.2-F Searching Moab Logs

While major failures are reported via the mdiag -S command, these failures can also be uncovered by searching the logs using the grep command as in the following:

> grep -E "WARNING|ALERT|ERROR" moab.log

On a production system working normally, this list usually includes some ALERT and WARNING messages. The messages are usually self-explanatory, but if not, viewing the log can give context to the message.
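
For a quick summary rather than a raw listing, the same search can be done in Python. The sketch below uses the same pattern as the grep command above and counts matching lines per severity; the function name count_severities is hypothetical and the sample lines are taken from the examples earlier in this section.

```python
import re
from collections import Counter

# Same alternatives as the grep -E pattern shown above.
SEVERITY_PATTERN = re.compile(r"WARNING|ALERT|ERROR")


def count_severities(lines):
    """Count log lines by the first severity keyword found on each line."""
    counts = Counter()
    for line in lines:
        match = SEVERITY_PATTERN.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts


sample = [
    "INFO:     job orion.4228 rejected (max user jobs)",
    "WARNING:  cannot open fairshare data file '/opt/moab/stats/FS.87000'",
    "ALERT:    job orion.72 cannot run.  deferring job for 360 Seconds",
    "ERROR:    cannot connect to Loadleveler API",
]
summary = count_severities(sample)
```

Because an open file is an iterable of lines, count_severities can also be passed a file object opened on moab.log.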

If a problem occurs early during Moab scheduler startup (before the configuration file is read), Moab can be started using the -L <LOGLEVEL> flag. If this is the first flag on the command line, then the LOGLEVEL is set to the specified level immediately, before any setup processing is done, and additional logging is recorded.

If problems are detected in the use of one of the client commands, the client command can be re-issued with the --loglevel=<LOGLEVEL> command line argument specified. This argument causes log information to be written to STDERR as the client command is running. As with the server, <LOGLEVEL> values from 0 to 9 are supported.
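
Re-issuing a client command with extra logging can be scripted. The Python sketch below appends the --loglevel argument described above and captures what the command writes to STDERR. The helper names are hypothetical, and showq is used only as an example of a client command.

```python
import subprocess


def with_loglevel(command, loglevel):
    """Return the command's argument list with --loglevel=<LOGLEVEL> appended."""
    if not 0 <= loglevel <= 9:
        raise ValueError("LOGLEVEL values range from 0 to 9")
    return list(command) + [f"--loglevel={loglevel}"]


def run_with_logging(command, loglevel):
    """Run a client command and return (stdout, stderr) as text."""
    result = subprocess.run(
        with_loglevel(command, loglevel),
        capture_output=True,
        text=True,
        check=False,  # Keep the log output even if the command fails.
    )
    return result.stdout, result.stderr
```

For example, run_with_logging(["showq"], 7) would run the command with --loglevel=7 and return the log text in the second element of the result.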

The LOGLEVEL can be changed dynamically by use of the mschedctl -m command, or by modifying the moab.cfg file and restarting the scheduler. Also, if the scheduler appears to be hung or is not properly responding, the log level can be incremented by one by sending a SIGUSR1 signal to the scheduler process. Repeated SIGUSR1 signals continue to increase the log level. The SIGUSR2 signal can be used to decrease the log level by one.
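
The signal-based adjustment can be sketched in Python as follows. The helper names are hypothetical, the caller is assumed to know the scheduler's process ID and its current log level, and the short pause between signals is a precaution added here because standard UNIX signals sent in quick succession may be merged into one delivery.

```python
import os
import signal
import time


def signals_for_level_change(current_level, target_level):
    """Return the signals needed to move from current_level to target_level."""
    for level in (current_level, target_level):
        if not 0 <= level <= 9:
            raise ValueError("LOGLEVEL values range from 0 to 9")
    step = target_level - current_level
    # SIGUSR1 raises the log level by one; SIGUSR2 lowers it by one.
    chosen = signal.SIGUSR1 if step > 0 else signal.SIGUSR2
    return [chosen] * abs(step)


def adjust_loglevel(pid, current_level, target_level,
                    send=os.kill, pause_seconds=0.5):
    """Send one signal per level step to the scheduler process."""
    for sig in signals_for_level_change(current_level, target_level):
        send(pid, sig)
        # Assumption: a brief pause avoids back-to-back signals coalescing.
        time.sleep(pause_seconds)
```

The send parameter defaults to os.kill and exists mainly so the logic can be exercised without signalling a real process.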

If an unexpected problem does occur, save the log file as it is often very helpful in isolating and correcting the problem.

13.2-G Event Logs

Major events are reported to both the Moab log file and the Moab event log. By default, the event log is maintained in the statistics directory and rolls on a daily basis, using the naming convention events.WWW_MMM_DD_YYYY as in events.Tue_Mar_18_2008.
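
Scripts that need to locate the event log for a given day can build the file name from the date. The Python sketch below follows the naming convention above; the function name is hypothetical, and zero-padding of single-digit days is an assumption, since the example only shows a two-digit day.

```python
import datetime

DAY_NAMES = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
MONTH_NAMES = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def event_log_name(day):
    """Return the events.WWW_MMM_DD_YYYY file name for a datetime.date."""
    # Fixed English abbreviations are used instead of locale-dependent strftime.
    return "events.{}_{}_{:02d}_{}".format(
        DAY_NAMES[day.weekday()],      # weekday() is 0 for Monday
        MONTH_NAMES[day.month - 1],
        day.day,                       # assumption: zero-padded to two digits
        day.year,
    )
```

Calling event_log_name(datetime.date(2008, 3, 18)) reproduces the example name events.Tue_Mar_18_2008.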

Event Log Format

The event log contains information about major job, reservation, node, and scheduler events and failures and reports this information in the following format:

<EVENTTIME> <EPOCHTIME>:<EID> <OBJECT> <OBJECTID> <EVENT> <DETAILS>

Example 13-2:  

VERSION 500
07:03:21 110244322:0 sched clusterA   start
07:03:26 110244327:1 rsv   system.1   start   1124142432 1324142432 2 2 0.0 2342155.3 node1|node2 NA RSV=%=system.1= 
07:03:54 110244355:2 job   1413       end     8 16 llw mcc 432000 Completed [batch:1] 11 08708752 1108703981 ... 
07:04:59 110244410:3 rm    base       failure cannot connect to RM
07:05:20 110244431:4 sched clusterA   stop    admin
...
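
The record layout above lends itself to simple whitespace splitting. The following Python sketch parses one event log line into named fields. The EventRecord type and parse_event_line function are hypothetical names, and lines that do not match the layout (such as the VERSION header) are simply skipped.

```python
from typing import NamedTuple, Optional


class EventRecord(NamedTuple):
    event_time: str
    epoch_time: int
    event_id: int
    object_type: str
    object_id: str
    event: str
    details: str


def parse_event_line(line: str) -> Optional[EventRecord]:
    """Parse '<EVENTTIME> <EPOCHTIME>:<EID> <OBJECT> <OBJECTID> <EVENT> <DETAILS>'."""
    # Split on runs of whitespace; everything after the fifth field is details.
    parts = line.split(None, 5)
    if len(parts) < 5:
        return None  # For example the "VERSION 500" header line.
    epoch_text, separator, eid_text = parts[1].partition(":")
    if not separator or not epoch_text.isdigit() or not eid_text.isdigit():
        return None
    details = parts[5].strip() if len(parts) > 5 else ""
    return EventRecord(parts[0], int(epoch_text), int(eid_text),
                       parts[2], parts[3], parts[4], details)


record = parse_event_line(
    "07:04:59 110244410:3 rm    base       failure cannot connect to RM")
```

Here record.object_type is "rm", record.event is "failure", and record.details holds the message text. The details field is left as a single string because its layout depends on the record type.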

The parameter RECORDEVENTLIST can be used to control which events are reported to the event log. See the sections on job and reservation trace format for more information regarding the values reported in the details section for those records.

Record Type Specific Details Format

The format for each record type is unique and is described in the following table:

Record type: gevent
Description: See Enabling Generic Events for gevent information. Generic events are included within node records. See the node detail format that follows.

Record type: job
Event types: JOBCANCEL, JOBCHECKPOINT, JOBEND, JOBHOLD, JOBMIGRATE, JOBMODIFY, JOBPREEMPT, JOBREJECT, JOBRESUME, JOBSTART, JOBSUBMIT
Description: See Workload Accounting Records.

Record type: node
Event types: NODEDOWN, NODEFAILURE, NODEUP
Description: The following fields are displayed in the event file in a space-delimited line as long as Moab has information pertaining to them: state, partition, disk, memory, maxprocs, swap, os, rm, nodeaccesspolicy, class, and message, where state is the node's current state and message is a human readable message indicating the reason for the node state change.

Record type: rm
Event types: RMDOWN, RMPOLLEND, RMPOLLSTART, RMUP
Description: Human readable message indicating the reason for the resource manager state change.

Note: For SCHEDCOMMAND, only create/modify commands are recorded. No record is created for general list/query commands. ALLSCHEDCOMMAND does the same thing as SCHEDCOMMAND, but it also logs info query commands.

Record type: trigger
Event types: TRIGEND, TRIGFAILURE, TRIGSTART
Description: <ATTR>="<VALUE>"[ <ATTR>="<VALUE>"]... where <ATTR> is one of the following: actiondata, actiontype, description, ebuf, eventtime, eventtype, flags, name, objectid, objecttype, obuf, offset, period, requires, sets, threshold, timeout, and so forth. See About object triggers for more information.

Record type: vm
Event types: VMCREATE, VMDESTROY, VMMIGRATE, VMPOWEROFF, VMPOWERON
Description: The following fields are displayed in the event file in a space-delimited line as long as Moab has information pertaining to them: name, sovereign, powerstate, parentnode, swap, memory, disk, maxprocs, opsys, class, and variables, where class and variables may have 0 or multiple entries.
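
The <ATTR>="<VALUE>" layout used in trigger records can be turned into a dictionary with a short regular expression. The Python sketch below assumes that values contain no embedded double quotes; the function name and the sample values are hypothetical, while the attribute names come from the list above.

```python
import re

# One attribute: a name, an equals sign, and a double-quoted value.
ATTR_PATTERN = re.compile(r'(\w+)="([^"]*)"')


def parse_trigger_details(details):
    """Return a dict of attribute names to values from a trigger details string."""
    # Assumption: values never contain a double quote character.
    return dict(ATTR_PATTERN.findall(details))


attrs = parse_trigger_details('name="cleanup" actiontype="exec" eventtype="end"')
```

Attributes that Moab does not report are simply absent from the resulting dictionary, so callers may prefer attrs.get("timeout") over direct indexing.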

Exporting Events in Real-Time

Moab event information can be exported to external systems in real-time using the ACCOUNTINGINTERFACEURL parameter. When set, Moab activates this URL each time one of the default events or one of the events specified by the RECORDEVENTLIST occurs.

While various protocols can be used, the most common protocol is exec, which indicates that Moab should launch the specified tool or script and pass in event information as command line arguments. This tool can then select those events and fields of interest and re-direct them as appropriate providing significant flexibility and control to the organization.

Exec Protocol Format

When a URL with an exec protocol is specified, the target is launched with the event fields passed in as STDIN. These fields appear exactly as they do in the event logs with the same values and order.

The tools/sql directory included with the Moab distribution contains event.create.sql.pl, a sample accounting interface processing script that may be used as a template.
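
As a rough illustration of what such a processing script might look like, here is a minimal Python sketch. It is not based on the bundled Perl script. Because the text above mentions both command line arguments and STDIN, the sketch accepts event fields from either source. All function names are hypothetical, and the choice of which object types to keep is only an example.

```python
import sys


def collect_fields(argv, stdin_lines):
    """Return event fields from the command line if present, else from STDIN."""
    sources = argv[1:] if len(argv) > 1 else stdin_lines
    fields = []
    for chunk in sources:
        # Split each chunk so both one-field-per-argument and whole-line
        # input end up in the same shape.
        fields.extend(chunk.split())
    return fields


def wanted(fields, object_types=("job", "rsv")):
    """Keep only events whose <OBJECT> field is one of object_types."""
    # Field order follows the event log format: time, epoch:eid, object, ...
    return len(fields) >= 5 and fields[2] in object_types


def main():
    fields = collect_fields(sys.argv, sys.stdin)
    if wanted(fields):
        # A real tool might forward the record to a database or message queue.
        print(" ".join(fields))
    return 0
```

A deployed script would call main() at the bottom of the file; it is left out here so the functions can be imported and tested on their own.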

13.2-H Event logging with web services

Administrators can configure Moab to push event data to Moab Web Services or other web services. This allows you to manage and store event logs from a single location. These event logs are separate from the old Moab event logs. Currently, Moab pushes the following events to web services for storage:

Event type        Facility      Category
jobcancel         job           cancel
jobend            job           end
jobhold           job           hold
jobmodify         job           modify
jobreject         job           reject
jobrelease        job           release
jobstart          job           start
jobsubmit         job           submit
rsvcreate         reservation   create
rsvend            reservation   end
rsvstart          reservation   start
allschedcommand   scheduler     command
schedcommand      scheduler     command
schedcycleend     scheduler     end
schedcyclestart   scheduler     start
schedpause        scheduler     pause
schedrecycle      scheduler     recycle
schedresume       scheduler     resume
schedstart        scheduler     start
schedend          scheduler     end
trigcreate        trigger       create
trigend           trigger       end
trigstart         trigger       start
vmcancel          vm            cancel
vmdestroy         vm            destroy
vmend             vm            end
vmmigrateend      vm            migrate
vmmigratestart    vm            migrate
vmready           vm            ready
vmsubmit          vm            submit

The SCHEDCOMMAND and ALLSCHEDCOMMAND event type log information is not pushed to web services unless you have specified that you want to include it in the RECORDEVENTLIST parameter.

Event category   Description
cancel           Indicates the object was canceled.
command          Indicates that Moab received a command.
create           Indicates the object was created.
destroy          Indicates the object was destroyed. Note: "end" can occur before "destroy".
end              Indicates the object ended normally (it reached its end of life; completed).
hold             Indicates the job had a hold placed on it.
migrate          Indicates a VM migration event.
modify           Indicates the object was modified.
pause            Indicates the scheduler paused.
ready            Indicates the object was ready. Note: "submit" can occur before "ready".
recycle          Indicates the scheduler recycled.
reject           Indicates the object was rejected as invalid.
release          Indicates all holds have been removed from a job.
resume           Indicates the scheduler resumed. Note: "pause" can occur before "resume".
start            Indicates the object started. Note: "submit" can occur before "start".
stop             Indicates the scheduler stopped.
submit           Indicates the object was submitted to Moab. (Note that this does not indicate that Moab accepted it.)

To enable Moab to push event logging to web services, you will need to set the following parameters in moab.cfg:

And these parameters in the moab-private.cfg file:

For more information about Moab event logging in Moab Web Services, see the "Events" section of the Moab Web Services Reference Guide.

13.2-I Enabling Syslog

In addition to the log file, the Moab scheduler can report events it determines to be critical to the UNIX syslog facility via the daemon facility using priorities ranging from INFO to ERROR (see USESYSLOG). The verbosity of this logging is not affected by the LOGLEVEL parameter. In addition to errors and critical events, user commands that affect the state of the jobs, nodes, or the scheduler may also be logged to syslog. Moab syslog messages are reported using the INFO, NOTICE, and ERR syslog priorities.

By default, messages are logged to syslog's user facility. However, using the USESYSLOG parameter, Moab can be configured to use any of the following:

13.2-J Managing Verbosity

In very large systems, a highly verbose log may roll too quickly to be of use in tracking specific targeted behaviors. In these cases, one or more of the following approaches may be of use:
