Moab Workload Manager can produce detailed logs of all of its activities. It does this through verbose server logging, event logging, and system logging facilities.
The LOGFILE and LOGDIR parameters in the moab.cfg file specify the destination of this logging information. Logging information is written to the file <MOABHOMEDIR>/<LOGDIR>/<LOGFILE> unless <LOGDIR> or <LOGFILE> is specified using an absolute path. If the log file is not specified or points to an invalid file, all logging information is directed to STDERR. However, because of the sheer volume of information that can be logged, directing logs to STDERR is not recommended in production. By default, LOGDIR and LOGFILE are set to log and moab.log respectively, so scheduler logs are written to <MOABHOMEDIR>/log/moab.log.
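The path rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming POSIX paths and a literal reading of the rule; `resolve_log_path` is a hypothetical helper (not part of Moab), and `/opt/moab` is only an example home directory.

```python
import posixpath


def resolve_log_path(moab_home, logdir="log", logfile="moab.log"):
    """Return where the scheduler log would be written (illustrative only)."""
    # posixpath.join discards everything before an absolute component, which
    # mirrors the rule: an absolute LOGDIR or LOGFILE overrides the prefix.
    return posixpath.join(moab_home, logdir, logfile)


print(resolve_log_path("/opt/moab"))                          # /opt/moab/log/moab.log
print(resolve_log_path("/opt/moab", logdir="/var/tmp/moab"))  # /var/tmp/moab/moab.log
```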
The parameter LOGFILEMAXSIZE determines how large the log file may become before it is rolled; it defaults to 10 MB. The parameter LOGFILEROLLDEPTH controls the number of old logs maintained and defaults to 3. Rolled log files have a numeric suffix appended indicating their order.
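When sizing the log file system, it can help to estimate the worst-case disk footprint of the rolled logs. The sketch below assumes that the active log is rolled promptly at LOGFILEMAXSIZE and that exactly LOGFILEROLLDEPTH old logs are kept alongside it; `max_log_footprint` is a hypothetical helper, and the real footprint may differ slightly.

```python
def max_log_footprint(max_size_bytes, roll_depth):
    """Approximate upper bound on log disk usage, in bytes."""
    # One active log file plus roll_depth rolled copies (assumption).
    return max_size_bytes * (roll_depth + 1)


# 100 MB log files with 30 rolled copies kept.
print(max_log_footprint(100_000_000, 30))  # 3100000000
```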
The parameter LOGLEVEL controls the verbosity of the information. Currently, LOGLEVEL values between 0 and 9 are used, with 0 being the most terse, logging only the most severe problems detected, and 9 the most verbose, commenting on just about everything. The amount of information provided at each log level is approximately an order of magnitude greater than what is provided at the log level immediately below it. A LOGLEVEL of 2 records virtually all critical messages, while a LOGLEVEL of 4 provides general information describing all actions taken by the scheduler. If a problem is detected, you may want to increase the LOGLEVEL value to get more details. However, doing so causes the logs to roll faster and clutters them with possibly unrelated information. High LOGLEVEL values also result in large volumes of possibly unnecessary file I/O on the scheduling machine. Consequently, high LOGLEVEL values are not recommended unless you are tracking a problem or similar circumstances warrant the I/O cost.
If high log levels are desired for an extended period of time and your Moab home directory is located on a network file system, performance may be improved by moving your log directory to a local file system using the LOGDIR parameter.
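The order-of-magnitude rule of thumb for LOGLEVEL can be turned into a rough estimate of how much faster logs will grow (and roll) after a level change. This is only a sketch of that approximation; `volume_ratio` is a hypothetical helper, and actual volumes depend on the workload.

```python
def volume_ratio(from_level, to_level):
    """Approximate factor by which log volume changes between two LOGLEVELs."""
    if not (0 <= from_level <= 9 and 0 <= to_level <= 9):
        raise ValueError("LOGLEVEL values range from 0 to 9")
    # Each level is assumed to log roughly ten times more than the one below.
    return 10.0 ** (to_level - from_level)


# Raising LOGLEVEL from 2 to 5 is expected to produce on the order of
# a thousand times more log data.
print(volume_ratio(2, 5))  # 1000.0
```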
A final log-related parameter is LOGFACILITY. This parameter can be used to focus logging on a subset of scheduler activities. It is specified as a list of one or more scheduling facilities, as listed in the parameters documentation.
Example
# moab.cfg
# allow up to 30 100MB logfiles
LOGLEVEL          5
LOGDIR            /var/tmp/moab
LOGFILEMAXSIZE    100000000
LOGFILEROLLDEPTH  30
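To check which logging settings a configuration file actually contains, a small reader such as the following may be useful. It is a simplified sketch that assumes one `PARAMETER value` pair per line and `#` as the comment character; it does not handle other moab.cfg syntax, and `read_logging_params` is a hypothetical name.

```python
LOGGING_PARAMS = {
    "LOGLEVEL", "LOGDIR", "LOGFILE",
    "LOGFILEMAXSIZE", "LOGFILEROLLDEPTH", "LOGFACILITY",
}


def read_logging_params(text):
    """Extract the logging-related parameters from moab.cfg-style text."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] in LOGGING_PARAMS:
            params[parts[0]] = parts[1].strip()
    return params


sample = """\
# moab.cfg
LOGLEVEL          5
LOGDIR            /var/tmp/moab
LOGFILEMAXSIZE    100000000
LOGFILEROLLDEPTH  30
"""
print(read_logging_params(sample))
```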
The logging that occurs is of the following major types: subroutine information, status information, scheduler warnings, scheduler alerts, and scheduler errors.
Critical internal status is indicated at low LOGLEVELs while less critical and more verbose status information is logged at higher LOGLEVELs. For example:
INFO: job orion.4228 rejected (max user jobs)
INFO: job fr4n01.923.0 rejected (maxjobperuser policy failure)
Warnings are logged when the scheduler detects an unexpected value or receives an unexpected result from a system call or subroutine. These messages are not necessarily indicative of problems and are not catastrophic to the scheduler. Most warnings are reported at loglevel 0 to loglevel 3. For example:
WARNING: cannot open fairshare data file '/opt/moab/stats/FS.87000'
Alerts are logged when the scheduler detects events of an unexpected nature that may indicate problems in other systems or in objects. They are typically of a more severe nature than warnings and possibly should be brought to the attention of scheduler administrators. Most alerts are reported at loglevel 0 to loglevel 2. For example:
ALERT: job orion.72 cannot run. deferring job for 360 Seconds
Errors are logged when the scheduler detects problems that impact its ability to properly schedule the cluster. Moab will try to remedy or mitigate the problem as best it can, but the problem may be outside of its sphere of control. Errors should definitely be monitored by administrators. Most errors are reported at loglevel 0 to loglevel 1. For example:
ERROR: cannot connect to Loadleveler API
While major failures are reported via the mdiag -S command, these failures can also be uncovered by searching the logs using the grep command as in the following:
> grep -E "WARNING|ALERT|ERROR" moab.log
On a production system working normally, this list should usually turn up empty. The messages are usually self-explanatory, but if not, viewing the log can give context to the message.
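For a quick summary rather than a raw list of matches, the same search can be done in Python and tallied by severity. This is a minimal sketch that assumes the severity keyword appears literally in each log line, as in the sample messages shown earlier; `count_severities` is a hypothetical helper.

```python
import re
from collections import Counter

SEVERITY = re.compile(r"\b(WARNING|ALERT|ERROR)\b")


def count_severities(lines):
    """Count log lines per severity keyword (first keyword found per line)."""
    counts = Counter()
    for line in lines:
        match = SEVERITY.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample_lines = [
    "INFO: job orion.4228 rejected (max user jobs)",
    "WARNING: cannot open fairshare data file '/opt/moab/stats/FS.87000'",
    "ALERT: job orion.72 cannot run. deferring job for 360 Seconds",
    "ERROR: cannot connect to Loadleveler API",
]
print(count_severities(sample_lines))
```

Because the function accepts any iterable of lines, an open file object for moab.log can be passed in directly.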
If a problem occurs early in Moab scheduler startup (before the configuration file is read), Moab can be started with the -L <LOGLEVEL> flag. If this is the first flag on the command line, the LOGLEVEL is set to the specified level immediately, before any setup processing is done, and additional logging is recorded.
If problems are detected in the use of one of the client commands, the client command can be re-issued with the --loglevel=<LOGLEVEL> command line argument specified. This argument causes log information to be written to STDERR as the client command is running. As with the server, <LOGLEVEL> values from 0 to 9 are supported.
The LOGLEVEL can be changed dynamically by use of the mschedctl -m command, or by modifying the moab.cfg file and restarting the scheduler. Also, if the scheduler appears to be hung or is not properly responding, the log level can be incremented by one by sending a SIGUSR1 signal to the scheduler process. Repeated SIGUSR1 signals continue to increase the log level. The SIGUSR2 signal can be used to decrease the log level by one.
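The signal-based adjustment can be scripted. The sketch below is illustrative only: `adjust_loglevel` is a hypothetical helper, the scheduler's process ID must be supplied by the caller, and the short pause between signals is an assumption made because pending signals of the same type are not queued by the operating system.

```python
import os
import signal
import time


def adjust_loglevel(pid, steps, send=os.kill, delay=0.5):
    """Send SIGUSR1 (steps > 0) or SIGUSR2 (steps < 0) abs(steps) times."""
    sig = signal.SIGUSR1 if steps > 0 else signal.SIGUSR2
    for _ in range(abs(steps)):
        send(pid, sig)
        # Give the scheduler time to handle each signal before the next one.
        time.sleep(delay)
```

For example, `adjust_loglevel(scheduler_pid, 2)` would request two log level increments, and `adjust_loglevel(scheduler_pid, -2)` would undo them, where `scheduler_pid` is the process ID of the running scheduler.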
If an unexpected problem does occur, save the log file as it is often very helpful in isolating and correcting the problem.
Major events are reported to both the Moab log file and the Moab event log. By default, the event log is maintained in the statistics directory and rolls on a daily basis, using the naming convention events.WWW_MMM_DD_YYYY, as in events.Tue_Mar_18_2008.
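Scripts that collect event logs often need to compute the file name for a given day. The following sketch builds the name from the convention above; it assumes English three-letter day and month abbreviations and a zero-padded two-digit day, neither of which is confirmed beyond the single example, and `event_log_name` is a hypothetical helper.

```python
from datetime import date

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def event_log_name(day):
    """Return the event log file name for the given date."""
    # Explicit name tables keep the result independent of the system locale.
    return "events.{}_{}_{:02d}_{}".format(
        DAYS[day.weekday()], MONTHS[day.month - 1], day.day, day.year
    )


print(event_log_name(date(2008, 3, 18)))  # events.Tue_Mar_18_2008
```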
The event log contains information about major job, reservation, node, and scheduler events and failures and reports this information in the following format:
<EVENTTIME> <EPOCHTIME>:<EID> <OBJECT> <OBJECTID> <EVENT> <DETAILS>
Example
VERSION 500
07:03:21 110244322:0 sched clusterA start
07:03:26 110244327:1 rsv system.1 start 1124142432 1324142432 2 2 0.0 2342155.3 node1|node2 NA RSV=%=system.1=
07:03:54 110244355:2 job 1413 end 8 16 llw mcc 432000 Completed [batch:1] 11 08708752 1108703981 ...
07:04:59 110244410:3 rm base failure cannot connect to RM
07:05:20 110244431:4 sched clusterA stop admin
...
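Records in this format can be split into their common fields with a short parser. This is a minimal sketch, assuming whitespace-separated fields exactly as in the format line above; `EventRecord` and `parse_event_line` are hypothetical names, and the record-specific details are left as an unparsed string.

```python
from typing import NamedTuple, Optional


class EventRecord(NamedTuple):
    event_time: str
    epoch_time: int
    event_id: int
    object_type: str
    object_id: str
    event: str
    details: str


def parse_event_line(line) -> Optional[EventRecord]:
    """Parse one event log line; return None for headers or placeholders."""
    parts = line.strip().split(None, 5)
    if len(parts) < 5:
        return None  # for example the VERSION header line
    epoch, _, eid = parts[1].partition(":")
    if not (epoch.isdigit() and eid.isdigit()):
        return None
    details = parts[5] if len(parts) > 5 else ""
    return EventRecord(parts[0], int(epoch), int(eid),
                       parts[2], parts[3], parts[4], details)


record = parse_event_line(
    "07:04:59 110244410:3 rm base failure cannot connect to RM"
)
print(record.object_type, record.event, record.details)
# rm failure cannot connect to RM
```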
The parameter RECORDEVENTLIST can be used to control which events are reported to the event log. See the sections on job and reservation trace format for more information regarding the values reported in the details section for those records.
Record Type Specific Details Format
The format for each record type is unique and is described in the following table:
Record Type | Event Types | Description
---|---|---
gevent | | See Enabling Generic Events for gevent information.
job | JOBCANCEL, JOBCHECKPOINT, JOBEND, JOBHOLD, JOBMIGRATE, JOBMODIFY, JOBPREEMPT, JOBREJECT, JOBRESUME, JOBSTART, JOBSUBMIT | See Workload Accounting Records.
node | NODEDOWN, NODEFAILURE, NODEUP | <eventid> <message> where <message> is a human-readable message indicating reason for resource manager state change.
rm | RMDOWN, RMPOLLEND, RMPOLLSTART, RMUP | Human-readable message indicating reason for resource manager state change.
rsv | RSVCANCEL, RSVCREATE, RSVEND, RSVMODIFY, RSVSTART | creationtime - epoch; starttime - epoch; endtime - epoch; alloc taskcount - integer; alloc nodecount - integer; total active proc-seconds - integer; total proc-seconds - integer; hostlist - comma-delimited list; owner - reservation owner; ACL - semicolon-delimited access control list; category - reservation usage category; comment - human-readable description; command - <command> <argument(s)>; human-readable message - MSG='<message>'. (See Reservation Accounting Records.)
sched | ALLSCHEDCOMMAND, SCHEDCOMMAND, SCHEDCYCLEEND, SCHEDCYCLESTART, SCHEDFAILURE, SCHEDMODIFY, SCHEDPAUSE, SCHEDRECYCLE, SCHEDRESUME, SCHEDSTART, SCHEDSTOP | Human-readable message indicating reason for scheduler action.
trigger | TRIGEND, TRIGFAILURE, TRIGSTART | <ATTR>="<VALUE>"[ <ATTR>="<VALUE>"]... where <ATTR> is one of the following: actiondata, actiontype, description, ebuf, eventtime, eventtype, flags, name, objectid, objecttype, obuf, offset, period, requires, sets, threshold, timeout, and so forth. See Object Trigger Overview for more information.
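As an illustration of working with the record-specific details, the `<ATTR>="<VALUE>"` pairs of a trigger record can be collected into a dictionary. This sketch assumes that values never contain an embedded double quote, which the format description does not confirm; `parse_trigger_details` is a hypothetical helper and the sample values are invented.

```python
import re

ATTR_PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_trigger_details(details):
    """Turn a trigger details string into an attribute dictionary."""
    return dict(ATTR_PAIR.findall(details))


# Attribute names come from the table above; the values are made up.
sample_details = 'name="cleanup" actiontype="exec" timeout="30"'
print(parse_trigger_details(sample_details))
# {'name': 'cleanup', 'actiontype': 'exec', 'timeout': '30'}
```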
Moab event information can be exported to external systems in real time using the ACCOUNTINGINTERFACEURL parameter. When set, Moab activates this URL each time one of the default events or one of the events specified by RECORDEVENTLIST occurs.
While various protocols can be used, the most common protocol is exec, which indicates that Moab should launch the specified tool or script and pass in event information as command line arguments. This tool can then select the events and fields of interest and redirect them as appropriate, giving the organization significant flexibility and control.
Exec Protocol Format
When a URL with an exec protocol is specified, the target is launched with the event fields passed in as STDIN. These fields appear exactly as they do in the event logs with the same values and order.
The tools directory included with the Moab distribution contains event.create.sql.pl, a sample accounting interface processing script that may be used as a template.
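A receiving tool for the exec protocol could start from something like the following. The text above mentions both command line arguments and STDIN as the delivery channel, so this sketch accepts either. It further assumes that the fields arrive whitespace-separated in event log order, with the object type third; `read_event_fields` and `select_job_event` are hypothetical names and are not taken from the sample script shipped with Moab.

```python
import sys


def read_event_fields(argv, stdin):
    """Return event fields from the arguments if present, else from stdin."""
    if len(argv) > 1:
        return list(argv[1:])
    return stdin.read().split()


def select_job_event(fields):
    """Return (job_id, event) for job records, or None for anything else."""
    # Assumed layout: <EVENTTIME> <EPOCHTIME>:<EID> <OBJECT> <OBJECTID> <EVENT> ...
    if len(fields) >= 5 and fields[2] == "job":
        return fields[3], fields[4]
    return None
```

A complete script would call `select_job_event(read_event_fields(sys.argv, sys.stdin))` and forward the result to the external system of interest.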
In addition to the log file, the Moab scheduler can report events it determines to be critical to the Unix syslog facility via the daemon facility using priorities ranging from INFO to ERROR (see USESYSLOG). The verbosity of this logging is not affected by the LOGLEVEL parameter. In addition to errors and critical events, user commands that affect the state of the jobs, nodes, or the scheduler may also be logged to syslog. Moab syslog messages are reported using the INFO, NOTICE, and ERR syslog priorities.
By default, messages are logged to syslog's user facility. However, using the USESYSLOG parameter, Moab can be configured to use any of the following:
In very large systems, a highly verbose log may roll too quickly to be of use in tracking specific targeted behaviors. In these cases, one or more of the following approaches may be of use: