Logging Facilities

The Moab Workload Manager provides the ability to produce detailed logging of all of its activities. This is accomplished using verbose server logging, event logging, and system logging facilities.

Log Facility Configuration
Status Information
Scheduler Warnings
Scheduler Alerts
Scheduler Errors
Searching Moab Logs
Event Logs
    Event Log Format
    Exporting Events in Real-Time
Enabling Syslog
Managing Log Verbosity

14.2.1 Log Facility Configuration

The LOGFILE and LOGDIR parameters in the moab.cfg file specify the destination of this logging information. Logging information is written to the file <MOABHOMEDIR>/<LOGDIR>/<LOGFILE> unless <LOGDIR> or <LOGFILE> is specified using an absolute path. If the log file is not specified or points to an invalid file, all logging information is directed to STDERR. Because of the sheer volume of information that can be logged, directing logs to STDERR is not recommended in production. By default, LOGDIR and LOGFILE are set to log and moab.log respectively, so scheduler logs are written to <MOABHOMEDIR>/log/moab.log.
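The path rules above can be illustrated with a short Python sketch. It is not Moab code: the function name resolve_log_path and the home directory /opt/moab are chosen only for illustration, and the handling of an absolute LOGFILE (used as-is) is an assumption based on the description above.

```python
import posixpath


def resolve_log_path(moab_home, logdir="log", logfile="moab.log"):
    """Return the log file path implied by LOGDIR and LOGFILE (illustrative only)."""
    # Assumption: an absolute LOGFILE is used as-is and LOGDIR is ignored.
    if posixpath.isabs(logfile):
        return logfile
    # posixpath.join discards moab_home when logdir is itself absolute.
    return posixpath.join(moab_home, logdir, logfile)


print(resolve_log_path("/opt/moab"))
print(resolve_log_path("/opt/moab", logdir="/var/tmp/moab"))
```

With the defaults, the sketch yields a path under the home directory; with an absolute LOGDIR, the home directory no longer appears in the result.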

The parameter LOGFILEMAXSIZE determines how large the log file may become before it is rolled; it is set to 10 MB by default. The parameter LOGFILEROLLDEPTH controls the number of old logs maintained and defaults to 3. Rolled log files have a numeric suffix appended indicating their order.

The parameter LOGLEVEL controls the verbosity of the information. Currently, LOGLEVEL values between 0 and 9 are used to control the amount of information logged: 0 is the most terse, logging only the most severe problems detected, while 9 is the most verbose, commenting on just about everything. The amount of information provided at each log level is approximately an order of magnitude greater than what is provided at the log level immediately below it. A LOGLEVEL of 2 will record virtually all critical messages, while a LOGLEVEL of 4 will provide general information describing all actions taken by the scheduler. If a problem is detected, you may want to increase the LOGLEVEL value to get more details. However, doing so will cause the logs to roll faster and will also clutter the logs with a lot of possibly unrelated information. Also be aware that high LOGLEVEL values cause large volumes of possibly unnecessary file I/O on the scheduling machine. Consequently, high LOGLEVEL values are not recommended unless you are tracking a problem or similar circumstances warrant the I/O cost.
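As a rough planning aid, the order-of-magnitude rule of thumb can be turned into arithmetic. The Python sketch below is only an estimate under that stated approximation; the function name relative_log_volume is hypothetical and actual volumes depend on the workload.

```python
def relative_log_volume(from_level, to_level):
    """Rough ratio of log volume between two LOGLEVEL values (0-9)."""
    for level in (from_level, to_level):
        if not 0 <= level <= 9:
            raise ValueError("LOGLEVEL must be between 0 and 9")
    # Each step up is assumed to multiply the volume by about 10.
    return 10.0 ** (to_level - from_level)


# Raising LOGLEVEL from 3 to 5 is two steps, so roughly 100 times the volume.
print(relative_log_volume(3, 5))
```

By the same estimate, a log that takes a day to roll at LOGLEVEL 3 could roll in about a quarter of an hour at LOGLEVEL 5.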

If high log levels are desired for an extended period of time and your Moab home directory is located on a network file system, performance may be improved by moving your log directory to a local file system using the LOGDIR parameter.

A final log-related parameter is LOGFACILITY. This parameter can be used to focus logging on a subset of scheduler activities. It is specified as a list of one or more scheduling facilities as listed in the parameters documentation.

Example

# moab.cfg
# allow up to 30 100MB logfiles
LOGLEVEL         5
LOGDIR           /var/tmp/moab
LOGFILEMAXSIZE   100000000
LOGFILEROLLDEPTH 30
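
When sizing the file system that holds LOGDIR, it can help to estimate the worst-case space the logs may occupy. The Python sketch below uses the values from the example above and assumes that one active log file exists alongside the rolled files; the function name max_log_bytes is hypothetical.

```python
def max_log_bytes(max_size, roll_depth):
    """Worst-case bytes used by the active log plus all rolled logs."""
    # Assumption: one active file plus roll_depth rolled files,
    # each no larger than max_size bytes.
    return max_size * (roll_depth + 1)


# LOGFILEMAXSIZE 100000000 and LOGFILEROLLDEPTH 30 from the example above
print(max_log_bytes(100_000_000, 30))  # 3100000000
```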

The logging that occurs is of the following major types: subroutine information, status information, scheduler warnings, scheduler alerts, and scheduler errors.

14.2.2 Status Information

Critical internal status is indicated at low LOGLEVELs while less critical and more verbose status information is logged at higher LOGLEVELs. For example:

INFO:     job orion.4228 rejected (max user jobs)
INFO:     job fr4n01.923.0 rejected (maxjobperuser policy failure)

14.2.3 Scheduler Warnings

Warnings are logged when the scheduler detects an unexpected value or receives an unexpected result from a system call or subroutine. These messages are not necessarily indicative of problems and are not catastrophic to the scheduler. Most warnings are reported at loglevel 0 to loglevel 3. For example:

WARNING:  cannot open fairshare data file '/opt/moab/stats/FS.87000'

14.2.4 Scheduler Alerts

Alerts are logged when the scheduler detects events of an unexpected nature that may indicate problems in other systems or in objects. They are typically of a more severe nature than warnings and possibly should be brought to the attention of scheduler administrators. Most alerts are reported at loglevel 0 to loglevel 2. For example:

ALERT:    job orion.72 cannot run.  deferring job for 360 Seconds

14.2.5 Scheduler Errors

Errors are logged when the scheduler detects problems that impact its ability to properly schedule the cluster. Moab will try to remedy or mitigate the problem as best it can, but the problem may be outside of its sphere of control. Errors should definitely be monitored by administrators. Most errors are reported at loglevel 0 to loglevel 1. For example:

ERROR:    cannot connect to Loadleveler API

14.2.6 Searching Moab Logs

While major failures are reported via the mdiag -S command, these failures can also be uncovered by searching the logs using the grep command as in the following:

> grep -E "WARNING|ALERT|ERROR" moab.log

On a production system working normally, this list should usually turn up empty. The messages are usually self-explanatory, but if not, viewing the log can give context to the message.
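
Where a per-severity summary is more useful than the raw grep output, the same search can be sketched in Python. This is an illustration only: the function name count_problems is hypothetical, and the pattern assumes each message carries its severity keyword followed by a colon, as in the examples in the preceding sections.

```python
import re
from collections import Counter

# Severity keyword followed by a colon, as in the sample messages above.
SEVERITY = re.compile(r"\b(WARNING|ALERT|ERROR):")


def count_problems(lines):
    """Count log lines by severity keyword."""
    counts = Counter()
    for line in lines:
        match = SEVERITY.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "INFO:     job orion.4228 rejected (max user jobs)",
    "WARNING:  cannot open fairshare data file '/opt/moab/stats/FS.87000'",
    "ALERT:    job orion.72 cannot run.  deferring job for 360 Seconds",
    "ERROR:    cannot connect to Loadleveler API",
]
print(dict(count_problems(sample)))
```

To scan a real log, pass an open file object, for example count_problems(open("moab.log")).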

If a problem occurs early when starting the Moab scheduler (before the configuration file is read), Moab can be started using the -L <LOGLEVEL> flag. If this is the first flag on the command line, the LOGLEVEL is set to the specified level immediately, before any setup processing is done, and additional logging is recorded.

If problems are detected in the use of one of the client commands, the client command can be re-issued with the --loglevel=<LOGLEVEL> command line argument specified. This argument causes log information to be written to STDERR as the client command is running. As with the server, <LOGLEVEL> values from 0 to 9 are supported.

The LOGLEVEL can be changed dynamically by use of the mschedctl -m command, or by modifying the moab.cfg file and restarting the scheduler. Also, if the scheduler appears to be hung or is not properly responding, the log level can be incremented by one by sending a SIGUSR1 signal to the scheduler process. Repeated SIGUSR1 signals continue to increase the log level. The SIGUSR2 signal can be used to decrease the log level by one.
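
The signal-based adjustment can be scripted. The following Python sketch sends one signal per step to a process ID supplied by the caller; the function name step_log_level is hypothetical, and finding the scheduler's process ID is left to the administrator.

```python
import os
import signal


def step_log_level(pid, steps, send=os.kill):
    """Raise (steps > 0) or lower (steps < 0) the log level one signal at a time."""
    # SIGUSR1 increments the log level by one, SIGUSR2 decrements it by one.
    sig = signal.SIGUSR1 if steps > 0 else signal.SIGUSR2
    for _ in range(abs(steps)):
        send(pid, sig)
```

For example, step_log_level(scheduler_pid, 2) would send two SIGUSR1 signals. The send argument exists only so the function can be exercised without signaling a real process.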

If an unexpected problem does occur, save the log file as it is often very helpful in isolating and correcting the problem.

14.2.7 Event Logs

Major events are reported to both the Moab log file and the Moab event log. By default, the event log is maintained in the statistics directory and rolls on a daily basis, using the naming convention events.WWW_MMM_DD_YYYY, as in events.Tue_Mar_18_2008.
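
Scripts that need to locate the event log for a given day can build the file name from the date. The Python sketch below follows the naming convention above; the function name event_log_name is hypothetical, and zero-padding of single-digit days is an assumption based on the two-character DD field.

```python
from datetime import date

DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def event_log_name(day):
    """Build an event log file name of the form events.WWW_MMM_DD_YYYY."""
    # Explicit name tables avoid any dependence on the process locale.
    return "events.{}_{}_{:02d}_{}".format(
        DAYS[day.weekday()], MONTHS[day.month - 1], day.day, day.year)


print(event_log_name(date(2008, 3, 18)))  # events.Tue_Mar_18_2008
```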

14.2.7.1 Event Log Format

The event log contains information about major job, reservation, node, and scheduler events and failures and reports this information in the following format:

<EVENTTIME> <EPOCHTIME>:<EID> <OBJECT> <OBJECTID> <EVENT> <DETAILS>

Example

VERSION 500
07:03:21 110244322:0 sched clusterA   start
07:03:26 110244327:1 rsv   system.1   start   1124142432 1324142432 2 2 0.0 2342155.3 node1|node2 NA RSV=%=system.1= 
07:03:54 110244355:2 job   1413       end     8 16 llw mcc 432000 Completed [batch:1] 11 08708752 1108703981 ... 
07:04:59 110244410:3 rm    base       failure cannot connect to RM
07:05:20 110244431:4 sched clusterA   stop    admin
...

The parameter RECORDEVENTLIST can be used to control which events are reported to the event log. See the sections on job and reservation trace format for more information regarding the values reported in the details section for those records.
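
The general record format can be split into its fields with a few lines of Python. The sketch below is illustrative rather than an official parser: the names Event and parse_event_line are hypothetical, and lines that do not match the format (such as the VERSION header) are simply skipped.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    event_time: str
    epoch_time: int
    eid: int
    obj: str
    object_id: str
    event: str
    details: str


def parse_event_line(line: str) -> Optional[Event]:
    """Parse '<EVENTTIME> <EPOCHTIME>:<EID> <OBJECT> <OBJECTID> <EVENT> <DETAILS>'."""
    parts = line.split(None, 5)
    if len(parts) < 5:
        return None  # header, ellipsis, or blank line
    epoch, _, eid = parts[1].partition(":")
    if not (epoch.isdigit() and eid.isdigit()):
        return None
    # The details field is optional and is kept as one unparsed string.
    details = parts[5].strip() if len(parts) > 5 else ""
    return Event(parts[0], int(epoch), int(eid),
                 parts[2], parts[3], parts[4], details)


record = parse_event_line(
    "07:04:59 110244410:3 rm    base       failure cannot connect to RM")
print(record.obj, record.event, record.details)
```

Keeping the details as a single string leaves the record-type-specific parsing to the caller, since each record type formats that field differently.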

Record Type Specific Details Format

The format for each record type is unique and is described in the following table:

Record Type: gevent
Event Types: See Enabling Generic Events for gevent information.
Description: Generic events are included within node records. See node detail format that follows.

Record Type: job
Event Types: JOBCANCEL, JOBCHECKPOINT, JOBEND, JOBHOLD, JOBMIGRATE, JOBMODIFY, JOBPREEMPT, JOBREJECT, JOBRESUME, JOBSTART, JOBSUBMIT
Description: See Workload Accounting Records.

Record Type: node
Event Types: NODEDOWN, NODEFAILURE, NODEUP
Description: The following fields are displayed in the event file in a space-delimited line as long as Moab has information pertaining to them: state, partition, disk, memory, maxprocs, swap, os, rm, nodeaccesspolicy, class, and message, where state is the node's current state and message is a human-readable message indicating the reason for the node state change.

Record Type: rm
Event Types: RMDOWN, RMPOLLEND, RMPOLLSTART, RMUP
Description: Human-readable message indicating the reason for the resource manager state change. RMUP and RMDOWN are only logged for PBS resource managers.

Record Type: rsv
Event Types: RSVCANCEL, RSVCREATE, RSVEND, RSVMODIFY, RSVSTART
Description: The following fields (see Reservation Accounting Records):
    creationtime - epoch
    starttime - epoch
    endtime - epoch
    alloc taskcount - integer
    alloc nodecount - integer
    total active proc-seconds - integer
    total proc-seconds - integer
    hostlist - comma-delimited list
    owner - reservation owner
    ACL - semicolon-delimited access control list
    category - reservation usage category
    comment - human-readable description
    command - <command> <argument(s)>
    human readable message - MSG='<message>'

Record Type: sched
Event Types: ALLSCHEDCOMMAND, SCHEDCOMMAND, SCHEDCYCLEEND, SCHEDCYCLESTART, SCHEDFAILURE, SCHEDMODIFY, SCHEDPAUSE, SCHEDRECYCLE, SCHEDRESUME, SCHEDSTART, SCHEDSTOP
Description: Human-readable message indicating the reason for the scheduler action. For SCHEDCOMMAND, only create/modify commands are recorded. No record is created for general list/query commands. ALLSCHEDCOMMAND does the same thing as SCHEDCOMMAND, but it also logs info query commands.

Record Type: trigger
Event Types: TRIGEND, TRIGFAILURE, TRIGSTART
Description: <ATTR>="<VALUE>"[ <ATTR>="<VALUE>"]... where <ATTR> is at least the following: actiondata, actiontype, and eventtype. It also includes all other set optional trigger attributes. See Object Trigger Overview for more information.

Record Type: vm
Event Types: VMCREATE, VMDESTROY, VMMIGRATE, VMPOWEROFF, VMPOWERON
Description: The following fields are displayed in the event file in a space-delimited line as long as Moab has information pertaining to them: name, sovereign, powerstate, parentnode, swap, memory, disk, maxprocs, opsys, class, and variables, where class and variables may have 0 or multiple entries.
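
The trigger details format, a series of <ATTR>="<VALUE>" pairs, lends itself to a simple pattern match. The Python sketch below is one possible approach, assuming values never contain an embedded double quote; the function name parse_trigger_details and the sample values are hypothetical.

```python
import re

# One <ATTR>="<VALUE>" pair; assumes values contain no double quotes.
ATTR_PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_trigger_details(details):
    """Turn 'attr="value" attr="value" ...' into a dictionary."""
    return dict(ATTR_PAIR.findall(details))


# Hypothetical sample values, not taken from a real event log
sample = 'actiondata="/tmp/notify.sh" actiontype="exec" eventtype="start"'
print(parse_trigger_details(sample))
```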

14.2.7.2 Exporting Events in Real-Time

Moab event information can be exported to external systems in real-time using the ACCOUNTINGINTERFACEURL parameter. When set, Moab activates this URL each time one of the default events or one of the events specified by the RECORDEVENTLIST occurs.

While various protocols can be used, the most common protocol is exec, which indicates that Moab should launch the specified tool or script and pass in event information as command line arguments. This tool can then select the events and fields of interest and redirect them as appropriate, giving the organization significant flexibility and control.

Exec Protocol Format

When a URL with an exec protocol is specified, the target is launched with the event fields passed in as STDIN. These fields appear exactly as they do in the event logs with the same values and order.

The tools directory included with the Moab distribution contains event.create.sql.pl, a sample accounting interface processing script that may be used as a template.
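
For sites that prefer Python to the Perl sample, a handler might start from a sketch like the following. It is not part of the Moab distribution, and the names read_event_fields and select_event are hypothetical. Because the way the fields are delivered should be verified on the installed version, the sketch accepts them either as arguments or as a line of input, and it assumes the field order matches the event log format.

```python
def read_event_fields(argv, stdin_lines):
    """Return event fields from the arguments if given, else from the first input line."""
    if argv:
        return list(argv)
    for line in stdin_lines:
        if line.strip():
            return line.split()
    return []


def select_event(argv, stdin_lines, objects=("job",)):
    """Return the event as one line if its object type is of interest, else None."""
    fields = read_event_fields(argv, stdin_lines)
    # Assumed field order: time, epoch:eid, object, object id, event, details...
    if len(fields) >= 5 and fields[2] in objects:
        return " ".join(fields)
    return None
```

A wrapper script would call select_event(sys.argv[1:], sys.stdin) and forward any non-empty result to the external system.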

14.2.8 Enabling Syslog

In addition to the log file, the Moab scheduler can report events it determines to be critical to the UNIX syslog facility via the daemon facility using priorities ranging from INFO to ERROR. (See USESYSLOG). The verbosity of this logging is not affected by the LOGLEVEL parameter. In addition to errors and critical events, user commands that affect the state of the jobs, nodes, or the scheduler may also be logged to syslog. Moab syslog messages are reported using the INFO, NOTICE, and ERR syslog priorities.

By default, messages are logged to syslog's user facility. However, using the USESYSLOG parameter, Moab can be configured to use any of the following:

user
daemon
local0
local1
local2
local3
local4
local5
local6
local7
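
When checking that the local syslog daemon routes a chosen facility as expected, a test message can be sent from a script using the same facility. The Python sketch below maps the facility names listed above to the constants in Python's standard syslog module; the function names and the moab-test identifier are hypothetical, and the sketch does not show or depend on USESYSLOG syntax.

```python
import syslog

# Facility names from the list above mapped to Python's syslog constants.
FACILITIES = {"user": syslog.LOG_USER, "daemon": syslog.LOG_DAEMON}
FACILITIES.update(
    {"local%d" % n: getattr(syslog, "LOG_LOCAL%d" % n) for n in range(8)})


def facility_for(name):
    """Look up the syslog facility constant for a facility name."""
    return FACILITIES[name.lower()]


def send_test_message(facility_name, message):
    """Send one NOTICE-priority message on the given facility."""
    syslog.openlog(ident="moab-test", facility=facility_for(facility_name))
    syslog.syslog(syslog.LOG_NOTICE, message)
    syslog.closelog()
```

For example, send_test_message("local3", "routing check") should appear wherever the syslog configuration sends local3 messages.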

14.2.9 Managing Verbosity

In very large systems, a highly verbose log may roll too quickly to be of use in tracking specific targeted behaviors. In these cases, one or more of the following approaches may be of use:

Use the LOGFACILITY parameter to log only functions and services of interest.
Use syslog to maintain a permanent record of critical events and failures.
Specify higher object loglevels on jobs, nodes, and reservations of interest (such as NODECFG[orion13] LOGLEVEL=6).
Increase the range of events reported to the event log using the RECORDEVENTLIST parameter.
Review object messages for required details.
Run Moab in monitor mode using IGNOREUSERS, IGNOREJOBS, IGNORECLASSES, or IGNORENODES.
