The parameter LOGFILEMAXSIZE determines how large the log file is allowed to become before it is rolled and is set to 10 MB by default. When the log file reaches this specified size, the log file is rolled. The parameter LOGFILEROLLDEPTH will control the number of old logs maintained and defaults to 1. Rolled log files will have a numeric suffix appended indicating their order.
The parameter LOGLEVEL controls the verbosity of the information. Currently, LOGLEVEL values between 0 and 9 are used to control the amount of information logged, with 0 being the most terse, logging only the most server problems detected, while 9 is the most verbose, commenting on just about everything. The amount of information provided at each log level is approximately an order of magnitude greater than what is provided at the log level immediately below it. A LOGLEVEL of 2 will record virtually all critical messages, while a log level of 4 will provide general information describing all actions taken by the scheduler. If a problem is detected, you may wish to increase the LOGLEVEL value to get more details. However, doing so will cause the logs to roll faster and will also cause a lot of possibly unrelated information to clutter up the logs. Also be aware of the fact that high LOGLEVEL values will result in large volumes of possibly unnecessary file I/O to occur on the scheduling machine. Consequently, it is not recommended that high LOGLEVEL values be used unless tracking a problem or similar circumstances warrant the I/O cost. NOTE: If high log levels are desired for an extended period of time and your Maui home directory is located on a network filesystem, performance may be improved by moving your log directory to a local file system using the 'LOGDIR' parameter.
A final log related parameter is LOGFACILITY. This parameter can be used to focus logging on a subset of scheduler activities. This parameter is specified as a list of one or more scheduling facilities as listed in the parameters documentation.
The logging that occurs is of five major types, subroutine information, status information, scheduler warnings, scheduler alerts, and scheduler errors. These are described in detail below:
1.Subroutine Information. Each subroutine is logged, along with all printable parameters. Major subroutines are logged at lower LOGLEVELs while all subroutines are logged at higher LOGLEVELs. Example:
CheckPolicies(fr4n01.923.0,2,Reason)
2.Status Information. Information about internal status is logged at all LOGLEVELs. Critical internal status is indicated at low LOGLEVELs while less critical and voluminous status information is logged at higher LOGLEVELs. Example:
INFO: Job fr4n01.923.0 Rejected (Max User Jobs)
INFO: Job[25] 'fr4n01.923.0' Rejected (MaxJobPerUser
Policy Failure)
3.Scheduler Warnings. Warnings are logged when the scheduler detects an unexpected value or receives an unexpected result from a system call or subroutine. They are not necessarily indicative of problems and are not catastrophic to the scheduler. Example:
WARNING: cannot open fairshare data file '/home/loadl/maui/stats/FS.87000'
4.Scheduler Alerts. Alerts are logged when the scheduler detects events of an unexpected nature which may indicate problems in other systems or in objects. They are typically of a more severe nature than are warnings and possibly should be brought to the attention of scheduler administrators. Example:
ALERT: job 'fr5n02.202.0' cannot run. deferring job for 360 Seconds
5.Schedulers Errors. Errors are logged when the scheduler detects problems of a nature of which it is not prepared to deal. It will try to back out and recover as best it can, but will not necessarily succeed. Errors should definitely be be monitored by administrators. Example:
ERROR: cannot connect to Loadleveler API
On a regular basis, use the command grep -E "WARNING|ALERT|ERROR" maui.log to get a listing of the problems the scheduler is detecting. On a production system working normally, this list should usually turn up empty. The messages are usually self-explanatory but if not, viewing the log can give context to the message.
If a problem is occurring early when starting the Maui Scheduler (before the configuration file is read) Maui can be started up using the -L LOGLEVEL flag. If this is the first flag on the command line, then the LOGLEVEL is set to the specified level immediately before any setup processing is done and additional logging will be recorded.
If problems are detected in the use of one of the
client commands, the client command can be re-issued with the -L <LOGLEVEL>
command line arg specified. This argument causes debug information
to be logged to STDERR as the client command is running. Again, <LOGLEVEL>
values from 0 to 9 are
supported.
In addition to the log file, the Maui Scheduler reports all events it determines to be critical to the Unix® syslog facility via the 'daemon' facility using priorities ranging from 'INFO' to 'ERROR'. This logging is not affected by LOGLEVEL. In addition to errors and critical events, all user commands that affect the state of the jobs, nodes, or the scheduler are also logged via syslog.
The logging information is extremely helpful in diagnosing problems, but it can also be useful if you are simply trying to become familiar with the "flow" of the scheduler. The scheduler can be run with a low LOGLEVEL value at first to show the highest level functions. This shows high-level data and control flow. Increasing the LOGLEVEL increases the number of functions displayed as familiarity with the scheduler flow grows.
The LOGLEVEL can be changed "on-the-fly" by use of the changeparam command, or by modifying the maui.cfg file and sending the scheduler process a SIGHUP. Also, if the scheduler appears to be "hung" or is not properly responding, the LOGLEVEL can be incremented by one by sending a SIGUSR1 signal to the scheduler process. Repeated SIGUSR1 signals will continue to increase the LOGLEVEL. The SIGUSR2 signal can be used to decrement the LOGLEVEL by one.
If an unexpected problem does occur, save the log file as it is often very helpful in isolating and correcting the problem.