Moab Workload Manager

14.1 Internal Diagnostics/Diagnosing System Behavior and Problems

Moab provides a number of commands for diagnosing system behavior. These diagnostic commands present detailed state information about various aspects of the scheduling problem, summarize performance, and evaluate current operation, reporting on any unexpected or potentially erroneous conditions found. Where possible, Moab's diagnostic commands can even correct detected problems if desired.

At a high level, the diagnostic commands are organized along functional and object-based lines. Diagnostic commands exist to help prioritize workload, evaluate fairness, and determine the effectiveness of scheduling optimizations. Commands are also available to evaluate reservations, reporting state information, potential reservation conflicts, and possible corruption issues. Scheduling is a complicated task; failures and unexpected conditions can occur as a result of resource failures, job failures, or conflicting policies.

Moab's diagnostics can intelligently organize information to help isolate these failures and allow them to be resolved quickly. Another powerful use of the diagnostic commands is to address the situation in which there are no hard failures. In these cases, the jobs, compute nodes, and scheduler are all functioning properly, but the cluster is not behaving exactly as desired. Moab diagnostics can help a site determine how the current configuration is performing and how it can be changed to obtain the desired behavior.

14.1.1 The mdiag Command

The cornerstone of Moab's diagnostics is the mdiag command. This command provides detailed information about scheduler state and also performs a large number of internal sanity checks, presenting any problems it finds as warning messages.

Currently, the mdiag command provides in-depth analysis of the following objects and subsystems:

Object/Subsystem       mdiag Flag   Use
Account                -a           Shows detailed account configuration information.
Blocked jobs           -b           Indicates why blocked (ineligible) jobs are not allowed to run.
Class                  -c           Shows detailed class configuration information.
Config                 -C           Shows configuration lines from moab.cfg and whether or not they are valid.
Fairshare              -f           Shows detailed fairshare configuration information as well as current fairshare usage.
Group                  -g           Shows detailed group information.
Job                    -j           Shows detailed job information. Reports corrupt job attributes, unexpected states, and excessive job failures.
Frame/Rack             -m           Shows detailed frame/rack information.
Node                   -n           Shows detailed node information. Reports unexpected node states and resource allocation conditions.
Priority               -p           Shows detailed job priority information including priority factor contributions to all idle jobs.
QoS                    -q           Shows detailed QoS information.
Reservation            -r           Shows detailed reservation information. Reports reservation corruption and unexpected reservation conditions.
Resource Manager       -R           Shows detailed resource manager information. Reports configured and detected state, configuration, performance, and failures of all configured resource manager interfaces.
Standing Reservation   -s           Shows detailed standing reservation information. Reports reservation corruption and unexpected reservation conditions.
Scheduler              -S           Shows detailed scheduler state information. Indicates if the scheduler is stopped, reports status of the grid interface, and identifies and reports high-level scheduler failures.
Partition              -t           Shows detailed partition information.
User                   -u           Shows detailed user information.
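
For example, the following invocations (flags behave as described in the table above) inspect node state, job priority factors, and overall scheduler health:

  mdiag -n
  mdiag -p
  mdiag -S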

14.1.2 Other Diagnostic Commands

Beyond mdiag, the checkjob and checknode commands also provide detailed information and sanity checking on individual jobs and nodes, respectively. These commands can indicate why a job cannot start, which nodes may be available to it, and provide information regarding recent events affecting current job or node state.
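
For example (the job ID and node name below are placeholders), checking an individual job and node looks like this:

  checkjob <jobid>
  checknode <nodename>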

14.1.3 Using Moab Logs for Troubleshooting

Moab logging is extremely useful in determining the cause of a problem. Where other systems may be cursed for not providing adequate logging to diagnose a problem, Moab may be cursed for the opposite reason. If the logging level is configured too high, huge volumes of log output may be recorded, potentially obscuring the problems in a flood of data. Intelligent searching, combined with use of the LOGLEVEL and LOGFACILITY parameters, can mine out the needed information. Key information associated with various problems is generally marked with the keywords WARNING, ALERT, or ERROR. See the Logging Overview for further information.
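
As a sketch, a moab.cfg fragment like the following (the level shown is illustrative, and the log file path depends on the local installation) raises logging verbosity, after which the log can be searched for the keywords noted above:

  LOGLEVEL  7

  grep -E 'WARNING|ALERT|ERROR' $MOABHOMEDIR/log/moab.log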

14.1.4 Automating Recovery Actions After a Failure

The RECOVERYACTION parameter of SCHEDCFG can be used to control scheduler action in the case of a catastrophic internal failure. Valid actions include die, ignore, restart, and trap.

Recovery Mode   Description
die             Moab will exit and, if core files are externally enabled, will create a core file for analysis. (This is the default behavior.)
ignore          Moab will ignore the signal and continue processing. This may cause Moab to continue running with corrupt data, which may be dangerous. Use this setting with caution.
restart         When a SIGSEGV is received, Moab will relaunch using the current checkpoint file, the original launch environment, and the original command line flags. The receipt of the signal will be logged, but Moab will continue scheduling. Because the scheduler is restarted with a new memory image, no corrupt scheduler data should exist. One caution with this mode is that it may mask underlying system failures by allowing Moab to overcome them. If used, the event log should be checked occasionally to determine if failures are being detected.
trap            When a SIGSEGV is received, Moab stays alive but enters diagnostic mode. In this mode, Moab stops scheduling but responds to client requests, allowing analysis of the failure to occur using internal diagnostics available via the mdiag command.
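
For example, to have Moab relaunch itself after a catastrophic failure, the SCHEDCFG line in moab.cfg might look like the following (the scheduler name "Moab" is a placeholder; use the name already configured for the local scheduler):

  SCHEDCFG[Moab]  RECOVERYACTION=RESTART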

14.1.5 Recovering from Server and VM Failures

When Moab runs a job with a compute resource manager, such as TORQUE, and a node fails, Moab continues running the job by default. Typically the job aborts or runs past its walltime and fails. The JOBACTIONONNODEFAILURE and JOBACTIONONNODEFAILUREDURATION parameters change this default behavior.

JOBACTIONONNODEFAILURE REQUEUE
JOBACTIONONNODEFAILUREDURATION 120

With these settings, Moab waits 120 seconds after detecting a node is down before checking its status again. If the node does not recover, Moab requeues the workload. In the case of on-demand VM jobs, Moab attempts to destroy the underlying VMs and create new ones on either the same hypervisor or a new one, depending on whether the hypervisor or VM failed. As a result, Moab makes recovery from VM and server failures transparent and restarts the workload quickly.

To ensure a quick recovery, shorten some of the poll intervals, such as node_check_rate for pbs_server and RMPOLLINTERVAL and JOBACTIONONNODEFAILUREDURATION for Moab. For example, setting node_check_rate to 15, RMPOLLINTERVAL to 10, and JOBACTIONONNODEFAILUREDURATION to 30 causes Moab to recognize a node failure and a need to requeue in about 70 seconds.
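
As an illustrative sketch, those values might be applied as follows; the qmgr command assumes a TORQUE pbs_server, and the remaining lines belong in moab.cfg:

  qmgr -c "set server node_check_rate = 15"

  RMPOLLINTERVAL                   10
  JOBACTIONONNODEFAILURE           REQUEUE
  JOBACTIONONNODEFAILUREDURATION   30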

See Also