The Moab Workload Manager provides the ability to produce detailed logging of all of its activities. This is accomplished using verbose server logging, event logging, and system logging facilities.
The LOGFILE and/or LOGDIR parameters within the moab.cfg file specify the destination of this logging information. Logging information will be written in the file <MOABHOMEDIR>/<LOGDIR><LOGFILE> unless <LOGDIR> or <LOGFILE> is specified using an absolute path. If the log file is not specified or points to an invalid file, all logging information is directed to STDERR. However, because of the sheer volume of information that can be logged, it is not recommended that this be done while in production. By default, LOGDIR and LOGFILE are set to log and moab.log respectively, resulting in scheduler logs being written to <MOABHOMEDIR>/log/moab.log.
The LOGFILEMAXSIZE parameter sets the size at which the log file is rolled; it defaults to 10 MB. The LOGFILEROLLDEPTH parameter controls the number of old logs maintained and defaults to 3. Rolled log files have a numeric suffix appended indicating their order.
The parameter LOGLEVEL controls the verbosity of the information. LOGLEVEL values between 1 and 6 are used to control the amount of information logged with 1 being the least verbose (recording only the worst events that occur) while 6 is the most verbose. The amount of information provided at each log level is approximately an order of magnitude greater than what is provided at the log level immediately below it. The first three log levels (1-3) measure the severity of an event and the rest of the levels (4-6) measure verbosity and how much detail is logged.
If a problem is detected, you may want to increase the LOGLEVEL value to get more details. However, doing so causes the logs to roll faster and clutters them with possibly unrelated information. High LOGLEVEL values also generate large volumes of possibly unnecessary file I/O on the scheduling machine. Consequently, high LOGLEVEL values are not recommended unless you are tracking a problem or similar circumstances warrant the I/O cost.
If high log levels are desired for an extended period of time and your Moab home directory is located on a network file system, performance may be improved by moving your log directory to a local file system using the LOGDIR parameter.
Visibility | LOGLEVEL value | Description |
---|---|---|
FATAL | n/a | FATAL events are errors that render part of the system unusable. An example would be failing to create a connection to a database. FATAL event logging cannot be suppressed. |
ERROR | 1 | This is the minimum level of logging that Moab accepts. ERROR events are problems that occur in circumstances where a user's goal has failed. For example, when a user submits a job but the job fails to start, the cause of the failure will be an error. Not all failures are ERROR events, such as failing to open a file because it does not exist. Like FATAL events, ERROR events cannot be suppressed. |
WARNING | 2 | WARNING events are problems that have user consequences that Moab cannot easily evaluate. Their impact has to be judged by users. An example would be if a user job asked Moab to copy a folder and Moab was unable to copy one file in the folder because the file was a temp file and was opened exclusively by another process. The user might consider that failure irrelevant. WARNING event logging can be suppressed at user discretion. |
INFO | 3 | INFO events are occurrences that might be interesting but do not represent problems. An example would be the transition to a "terminated phase" when a service is successfully terminated. This event is potentially interesting to both human and automated observers but is not a problem in any sense. |
TRACE1 | 4 | The TRACE1, TRACE2, and TRACE3 log levels are generally not used in production environments. They are used mainly by Adaptive Computing developers to analyze various issues. Setting LOGLEVEL to one of these levels could seriously impact performance due to Moab attempting to write to the log potentially hundreds of times per second. |
TRACE2 | 5 | Same as TRACE1. |
TRACE3 | 6 | Same as TRACE1. |
A final log related parameter is LOGFACILITY. This parameter can be used to focus logging on a subset of scheduler activities. This parameter is specified as a list of one or more scheduling facilities as listed in the parameters documentation.
Example 3-150:
# moab.cfg
# allow up to 30 100MB logfiles
LOGLEVEL 3
LOGDIR /var/tmp/moab
LOGFILEMAXSIZE 100000000
LOGFILEROLLDEPTH 30
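When sizing the log file system, it can help to estimate how much space a given LOGFILEMAXSIZE and LOGFILEROLLDEPTH combination may consume. The following is a minimal Python sketch, not part of Moab. It assumes that the active log plus every rolled copy can each reach LOGFILEMAXSIZE bytes, and it treats the 10 MB default as 10,000,000 bytes; both are assumptions, and the function names are hypothetical.

```python
# Sketch: estimate worst-case log disk usage from moab.cfg-style settings.
# Assumption (not from the Moab docs): the active log plus each rolled copy
# can grow to LOGFILEMAXSIZE bytes, so the total is size * (depth + 1).

# Defaults as described in the text; 10 MB is taken as 10,000,000 bytes here.
DEFAULTS = {"LOGFILEMAXSIZE": 10_000_000, "LOGFILEROLLDEPTH": 3}

EXAMPLE_CFG = """\
# moab.cfg
# allow up to 30 100MB logfiles
LOGLEVEL 3
LOGDIR /var/tmp/moab
LOGFILEMAXSIZE 100000000
LOGFILEROLLDEPTH 30
"""


def parse_settings(cfg_text):
    """Collect 'NAME VALUE' pairs, ignoring blank lines and # comments."""
    settings = {}
    for raw in cfg_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings[parts[0].upper()] = parts[1].strip()
    return settings


def worst_case_bytes(cfg_text):
    """Return the estimated upper bound on log disk usage in bytes."""
    settings = parse_settings(cfg_text)
    max_size = int(settings.get("LOGFILEMAXSIZE", DEFAULTS["LOGFILEMAXSIZE"]))
    depth = int(settings.get("LOGFILEROLLDEPTH", DEFAULTS["LOGFILEROLLDEPTH"]))
    return max_size * (depth + 1)


if __name__ == "__main__":
    print(worst_case_bytes(EXAMPLE_CFG), "bytes")
```

Under these assumptions, the example configuration above would need roughly 3.1 GB of free space in /var/tmp/moab.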
Each log event line follows a standard, tab-delimited log format:
timestamp <tab> thread ID <tab> visibility <tab> origin <tab> event code <tab> scope IDs <tab> message
Example 3-151:
2014-08-15T05:26:18.108-0600 846 TRACE1 MQueue.c:MQueueCheckStatus:3081 0 MQueueCheckStatus()
2014-08-15T05:26:18.108-0600 846 TRACE1 MNode.c:MNodeCheckStatus:949 0 MNodeCheckStatus()
2014-08-15T05:26:18.108-0600 846 TRACE1 MVC.c:MVCHarvestVCs:2911 0 Checking for VCs to harvest
2014-08-15T05:26:18.108-0600 846 TRACE1 MSU.c:MUClearChild:5301 0 MUClearChild(PID)
2014-08-15T05:26:18.108-0600 846 INFO MSysMainLoop.c:MSysMainLoop:1059 0x1002a14 Scheduling complete. Sleeping for 60 seconds.
2014-08-15T05:26:18.108-0600 846 TRACE1 MSchedStats.c:MSchedUpdateStats:36 0 MSchedUpdateStats()
2014-08-15T05:26:18.108-0600 846 INFO MSchedStats.c:MSchedUpdateStats:45 0x100a9da Iteration: 23; scheduling time: 0.00 seconds.
2014-08-15T05:26:18.108-0600 846 TRACE1 MRsv.c:MRsvUpdateStats:605 0 MRsvUpdateStats()
2014-08-15T05:26:18.108-0600 846 TRACE1 MSchedStats.c:MSchedUpdateStats:164 0 current util[23]: 0/1d (0.002f%) PH: 0.072f% active jobs: 0 of 0 (completed: 6217)
2014-08-15T05:26:18.109-0600 846 INFO MSysMainLoop.c:MSysMainLoop:1138 0x1000193 scheduler:Moab A scheduler iteration is ending.
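Because the format is tab-delimited, a log line can be split into its seven fields without regular expressions. The Python sketch below illustrates one way to do this; it is not a Moab tool, and the `LogEvent` and `parse_log_line` names are hypothetical. It assumes every line carries all seven fields (with an empty scope IDs field when none apply) and that timestamps always look like the ones in the example above.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class LogEvent(NamedTuple):
    """One log event, with fields in the documented order."""
    timestamp: datetime
    thread_id: str
    visibility: str
    origin: str
    event_code: str
    scope_ids: str
    message: str


def parse_log_line(line: str) -> Optional[LogEvent]:
    """Split a tab-delimited log line; return None if it does not fit."""
    # maxsplit=6 keeps any tabs inside the message itself intact.
    parts = line.rstrip("\n").split("\t", 6)
    if len(parts) != 7:
        return None
    try:
        # Assumed timestamp layout, e.g. 2014-08-15T05:26:18.109-0600
        stamp = datetime.strptime(parts[0], "%Y-%m-%dT%H:%M:%S.%f%z")
    except ValueError:
        return None
    return LogEvent(stamp, *parts[1:])


# Sample built from the last line of the example, with real tabs restored.
SAMPLE_LINE = "\t".join([
    "2014-08-15T05:26:18.109-0600",
    "846",
    "INFO",
    "MSysMainLoop.c:MSysMainLoop:1138",
    "0x1000193",
    "scheduler:Moab",
    "A scheduler iteration is ending.",
])

if __name__ == "__main__":
    event = parse_log_line(SAMPLE_LINE)
    print(event.visibility, event.origin, event.message)
```

Filtering on the parsed `visibility` field avoids false matches that a plain substring search can produce when a word such as ERROR appears inside a message.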
While major failures are reported via the mdiag -S command, these failures can also be uncovered by searching the logs using the grep command as in the following:
> grep -E "WARNING|ALERT|ERROR" moab.log
On a production system working normally, this list usually includes some ALERT and WARNING messages. The messages are usually self-explanatory, but if not, viewing the log can give context to the message.
If a problem occurs early in Moab scheduler startup (before the configuration file is read), Moab can be started using the -L <LOGLEVEL> flag. If this is the first flag on the command line, the LOGLEVEL is set to the specified level immediately, before any setup processing is done, and additional logging is recorded.
If problems are detected in the use of one of the client commands, the client command can be re-issued with the --loglevel=<LOGLEVEL> command line argument specified. This argument causes log information to be written to STDERR as the client command is running. As with the server, <LOGLEVEL> values from 0 to 9 are supported.
The LOGLEVEL can be changed dynamically using the mschedctl -m command, or by modifying the moab.cfg file and restarting the scheduler. Also, if the scheduler appears to be hung or is not responding properly, the log level can be incremented by one by sending a SIGUSR1 signal to the scheduler process. Repeated SIGUSR1 signals continue to increase the log level. The SIGUSR2 signal can be used to decrease the log level by one.
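The signal-based adjustment can be scripted. The following Python sketch sends SIGUSR1 or SIGUSR2 the requested number of times; it is illustrative only, the `nudge_loglevel` name is hypothetical, and the caller must supply the scheduler's process ID. The short pause between signals is an assumption added here because standard UNIX signals of the same type are not queued, so signals sent back to back may be merged.

```python
import os
import signal
import time


def nudge_loglevel(pid, delta, send=os.kill, pause=0.5):
    """Raise (delta > 0) or lower (delta < 0) the log level by |delta| steps.

    SIGUSR1 raises the level by one and SIGUSR2 lowers it by one, as
    described in the text. The pause between signals is an assumption,
    not a documented Moab requirement. Returns the number of signals sent.
    """
    if delta == 0:
        return 0
    sig = signal.SIGUSR1 if delta > 0 else signal.SIGUSR2
    steps = abs(delta)
    for step in range(steps):
        send(pid, sig)
        if pause and step < steps - 1:
            time.sleep(pause)
    return steps


# Example (hypothetical PID): raise the log level by two steps.
# nudge_loglevel(12345, +2)
```

The `send` parameter defaults to `os.kill` and exists only so the function can be exercised without signaling a real process.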
If an unexpected problem does occur, save the log file as it is often very helpful in isolating and correcting the problem.
Major events are reported to both the Moab log file and the Moab event log. By default, the event log is maintained in the statistics directory and rolls on a daily basis, using the naming convention events.WWW_MMM_DD_YYYY, as in events.Tue_Mar_18_2008.
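Scripts that collect event logs often need to compute the file name for a given day. A minimal Python sketch of the naming convention follows. The `event_log_name` helper is hypothetical, and zero-padding of single-digit days is an assumption, since the one example in the text uses a two-digit day.

```python
from datetime import date

# English abbreviations are spelled out so the result does not depend on
# the process locale.
_DAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def event_log_name(day: date) -> str:
    """Build events.WWW_MMM_DD_YYYY for the given date.

    Assumption: the day of month is zero-padded to two digits.
    """
    return "events.%s_%s_%02d_%d" % (
        _DAYS[day.weekday()],      # date.weekday(): Monday is 0
        _MONTHS[day.month - 1],
        day.day,
        day.year,
    )


if __name__ == "__main__":
    print(event_log_name(date(2008, 3, 18)))
```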
The event log contains information about major job, reservation, node, and scheduler events and failures and reports this information in the following format:
<EVENTTIME> <EPOCHTIME>:<EID> <OBJECT> <OBJECTID> <EVENT> <DETAILS>
Example 3-152:
VERSION 500
07:03:21 110244322:0 sched clusterA start
07:03:26 110244327:1 rsv system.1 start 1124142432 1324142432 2 2 0.0 2342155.3 node1|node2 NA RSV=%=system.1=
07:03:54 110244355:2 job 1413 end 8 16 llw mcc 432000 Completed [batch:1] 11 08708752 1108703981 ...
07:04:59 110244410:3 rm base failure cannot connect to RM
07:05:20 110244431:4 sched clusterA stop admin
...
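The record layout above can be split into its named parts with a few lines of code. The Python sketch below is one possible reading of the format, not a Moab utility; `EventRecord` and `parse_event_record` are hypothetical names. It assumes fields are separated by whitespace, that everything after `<EVENT>` belongs to `<DETAILS>`, and that lines such as the VERSION header should be skipped.

```python
from typing import NamedTuple, Optional


class EventRecord(NamedTuple):
    event_time: str
    epoch_time: int
    event_id: int
    object_type: str
    object_id: str
    event: str
    details: str


def parse_event_record(line: str) -> Optional[EventRecord]:
    """Parse one event log line; return None for headers and other lines."""
    # At most six parts: the sixth keeps the whole <DETAILS> text.
    parts = line.split(None, 5)
    if len(parts) < 5:
        return None
    epoch, sep, eid = parts[1].partition(":")
    if not sep or not epoch.isdigit() or not eid.isdigit():
        return None
    details = parts[5].strip() if len(parts) == 6 else ""
    return EventRecord(parts[0], int(epoch), int(eid),
                       parts[2], parts[3], parts[4], details)


if __name__ == "__main__":
    for text in ("VERSION 500",
                 "07:03:21 110244322:0 sched clusterA start",
                 "07:04:59 110244410:3 rm base failure cannot connect to RM"):
        print(parse_event_record(text))
```

The contents of `details` vary by record type, so any further parsing of that field would depend on the formats described in the table that follows.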
The parameter RECORDEVENTLIST can be used to control which events are reported to the event log. See the sections on job and reservation trace format for more information regarding the values reported in the details section for those records.
Record Type Specific Details Format
The format for each record type is unique and is described in the following table:
Record Type | Event Types | Description |
---|---|---|
gevent | See Enabling Generic Events for gevent information. | Generic events are included within node records. See node detail format that follows. |
job | JOBCANCEL, JOBCHECKPOINT, JOBEND, JOBHOLD, JOBMIGRATE, JOBMODIFY, JOBPREEMPT, JOBREJECT, JOBRESUME, JOBSTART, JOBSUBMIT | See Workload Accounting Records. |
node | NODEDOWN, NODEFAILURE, NODEUP | The following fields are displayed in the event file in a space-delimited line as long as Moab has information pertaining to it: state, partition, disk, memory, maxprocs, swap, os, rm, nodeaccesspolicy, class, and message, where state is the node's current state and message is a human readable message indicating reason for node state change. |
rm | RMDOWN, RMPOLLEND, RMPOLLSTART, RMUP | Human readable message indicating reason for resource manager state change. For SCHEDCOMMAND, only create/modify commands are recorded. No record is created for general list/query commands. ALLSCHEDCOMMAND does the same thing as SCHEDCOMMAND, but it also logs info query commands. |
trigger | TRIGEND, TRIGFAILURE, TRIGSTART | <ATTR>="<VALUE>"[ <ATTR>="<VALUE>"]... where <ATTR> is one of the following: actiondata, actiontype, description, ebuf, eventtime, eventtype, flags, name, objectid, objecttype, obuf, offset, period, requires, sets, threshold, timeout, and so forth. See About Object Triggers for more information. |
vm | VMCREATE, VMDESTROY, VMMIGRATE, VMPOWEROFF, VMPOWERON | The following fields are displayed in the event file in a space-delimited line as long as Moab has information pertaining to it: name, sovereign, powerstate, parentnode, swap, memory, disk, maxprocs, opsys, class, and variables, where class and variables may have 0 or multiple entries. |
Moab event information can be exported to external systems in real-time using the ACCOUNTINGINTERFACEURL parameter. When set, Moab activates this URL each time one of the default events or one of the events specified by the RECORDEVENTLIST occurs.
While various protocols can be used, the most common protocol is exec, which indicates that Moab should launch the specified tool or script and pass in event information as command line arguments. This tool can then select those events and fields of interest and re-direct them as appropriate providing significant flexibility and control to the organization.
Exec Protocol Format
When a URL with an exec protocol is specified, the target is launched with the event fields passed in as STDIN. These fields appear exactly as they do in the event logs with the same values and order.
The tools/sql directory included with the Moab distribution contains event.create.sql.pl, a sample accounting interface processing script that may be used as a template.
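As a rough illustration of what such a processing script might do, the Python sketch below selects only the events of interest and returns them as a single line. It is not the bundled event.create.sql.pl script, and all function names are hypothetical. Because the text above mentions both command line arguments and STDIN as the way event fields arrive, the sketch accepts either: it uses the arguments when any are present and falls back to STDIN otherwise. It also assumes the third field is `<OBJECT>`, as in the event log format.

```python
import sys


def read_event_fields(argv, stdin):
    """Return the event fields from argv if present, otherwise from stdin."""
    if len(argv) > 1:
        return list(argv[1:])
    return stdin.read().split()


def handle(argv, stdin, object_types=("job",)):
    """Return the event as one line if its object type is wanted, else None.

    Assumption: fields appear in event log order, so fields[2] is <OBJECT>.
    """
    fields = read_event_fields(argv, stdin)
    if len(fields) > 2 and fields[2] in object_types:
        return " ".join(fields)
    return None


def main():
    """Entry point for a script named in ACCOUNTINGINTERFACEURL."""
    line = handle(sys.argv, sys.stdin)
    if line is not None:
        # A real handler would forward the record to an external system;
        # printing stands in for that step here.
        print(line)
```

A script built this way would call `main()` at the bottom of the file; the call is omitted here so the functions can be imported and exercised without reading from STDIN.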
In addition to the log file, the Moab scheduler can report events it determines to be critical to the UNIX syslog facility via the daemon facility using priorities ranging from INFO to ERROR. (See USESYSLOG). The verbosity of this logging is not affected by the LOGLEVEL parameter. In addition to errors and critical events, user commands that affect the state of the jobs, nodes, or the scheduler may also be logged to syslog. Moab syslog messages are reported using the INFO, NOTICE, and ERR syslog priorities.
By default, messages are logged to syslog's user facility. However, using the USESYSLOG parameter, Moab can be configured to use any of the following:
In very large systems, a highly verbose log may roll too quickly to be of use in tracking specific targeted behaviors. In these cases, one or more of the following approaches may be of use: