The scheduler has a number of dependencies that may cause failures if not satisfied. These dependencies fall into four areas: disk space, network access, memory, and processor utilization.
The scheduler uses a number of files. If the file system is full or otherwise inaccessible, the following behaviors may be observed:
Unavailable File | Behavior |
---|---|
moab.pid | Scheduler cannot perform single instance check. |
moab.ck* | Scheduler cannot store persistent record of reservations, jobs, policies, summary statistics, and so forth. |
moab.cfg/moab.dat | Scheduler cannot load local configuration. |
log/* | Scheduler cannot log activities. |
stats/* | Scheduler cannot write job records. |
When possible, configure Moab to use local disk space for configuration files, statistics files, and log files. If any of these files are located in a networked file system (such as NFS, DFS, or AFS) and the network or file server experiences heavy loads or failures, the Moab server may appear sluggish or unresponsive and client commands may fail. Using local disk space eliminates susceptibility to this potential issue.
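The file dependencies in the table above can be checked before starting the scheduler. The following is a minimal Python sketch, not part of Moab. It assumes a layout in which moab.cfg, the log directory, and the stats directory all sit directly under one home directory, and that moab.pid and the moab.ck* files are written into that home directory. The function names and the `min_free_bytes` threshold are hypothetical; adjust the paths to match your installation.

```python
import os
import shutil
from pathlib import Path


def check_path(path, need_write=True, min_free_bytes=0):
    """Return a list of human-readable problems found for one path."""
    p = Path(path)
    if not p.exists():
        return [f"{p}: missing"]
    problems = []
    if not os.access(p, os.R_OK):
        problems.append(f"{p}: not readable")
    if need_write and not os.access(p, os.W_OK):
        problems.append(f"{p}: not writable")
    # disk_usage reports on the file system that holds the directory.
    directory = p if p.is_dir() else p.parent
    free = shutil.disk_usage(directory).free
    if free < min_free_bytes:
        problems.append(f"{p}: only {free} bytes free")
    return problems


def check_scheduler_files(home, min_free_bytes=0):
    """Check the files and directories from the table under `home`.

    Assumed layout (hypothetical): moab.cfg, log/ and stats/ live directly
    under `home`, and the pid and checkpoint files are written into `home`.
    """
    home = Path(home)
    problems = []
    # The configuration file only needs to be readable.
    problems += check_path(home / "moab.cfg", need_write=False)
    # "" checks `home` itself, where moab.pid and moab.ck* would be written.
    for name in ("", "log", "stats"):
        problems += check_path(home / name, min_free_bytes=min_free_bytes)
    return problems
```

An empty list from `check_scheduler_files` means that no problem was detected under the assumed layout. It does not guarantee that a networked file system will stay responsive under load.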
The scheduler uses a number of socket connections to perform basic functions. Network failures may affect the following facilities.
Network Connection | Behavior |
---|---|
scheduler client | Scheduler client commands fail. |
resource manager | Scheduler is unable to load/update information regarding nodes and jobs. |
allocation manager | Scheduler is unable to validate account access or reserve/debit account balances. |
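A quick way to tell a network failure from a scheduler failure is to test whether each peer service accepts TCP connections. The sketch below is illustrative only: the function names are hypothetical, and the host names and port numbers for the scheduler, resource manager, and allocation manager must come from your own configuration, because none are assumed here.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def unreachable(services, timeout=3.0):
    """Return the names of services that did not accept a connection.

    `services` maps a label to a (host, port) pair, for example
    {"resource manager": ("rm-host", 1234)}; these values are placeholders.
    """
    return [
        name
        for name, (host, port) in services.items()
        if not can_connect(host, port, timeout)
    ]
```

A successful connection only shows that the port is reachable. It does not confirm that the service behind it is answering requests correctly.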
Depending on cluster size and configuration, the scheduler may require up to 120 MB of memory on the server host. If inadequate memory is available, multiple aspects of scheduling may be negatively affected. The scheduler log files should indicate if memory failures are detected and mark any such messages with the ERROR or ALERT keywords.
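Because such messages carry the ERROR or ALERT keywords, a log can be filtered for them. The following Python sketch makes no assumption about the log line format beyond the presence of one of those two keywords as a whole word; the function name is hypothetical. It returns every flagged line, not only memory-related ones, since the exact wording of memory failure messages is not given here.

```python
import re

# Match either keyword as a whole word anywhere in the line.
KEYWORDS = re.compile(r"\b(ERROR|ALERT)\b")


def find_flagged_lines(lines):
    """Return (line_number, text) pairs for lines containing ERROR or ALERT."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if KEYWORDS.search(line)
    ]
```

Passing an open file object works, since iterating over it yields one line at a time.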
On a heavily loaded system, the scheduler may appear sluggish and unresponsive. However, no direct failures should result from this slowdown. Indirect failures may include timeouts of peer services (such as the resource manager or allocation manager) or timeouts of client commands. All timeouts should be recorded in the scheduler log files.
The Moab scheduling system contains features to assist in diagnosing internal failures. If the scheduler exits unexpectedly, the scheduler logs may provide information regarding the cause. If no reason can be determined, use of a debugger may be required.
The first step in diagnosing an unexpected exit is to check the last few lines of the scheduler log. In many cases, the scheduler exits because of misconfiguration or a detected system failure, and the last few lines of the log should indicate why it exited and what changes are required to correct the situation. If the scheduler did not exit intentionally, increasing the LOGLEVEL parameter to 7 or higher may help isolate the problem.
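Reading the end of the log can be scripted for use in a monitoring or restart wrapper. This is a small Python sketch with a hypothetical function name; the path to the scheduler log is not assumed and must be supplied by the caller.

```python
from collections import deque


def tail(path, count=20):
    """Return the last `count` lines of a text file, without newlines."""
    with open(path, "r", errors="replace") as handle:
        # A bounded deque keeps only the most recent `count` lines.
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]
```

The file is read from the beginning, which is simple but slow for very large logs. Seeking backward from the end of the file would be a reasonable alternative there.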
If an internal failure is detected on your system, the information of greatest value to developers in isolating the problem is the output of the gdb where subcommand and a printout of all variables associated with the failure. A level 7 log covering the failure can also help in determining the environment that caused it. If you encounter such a failure and require assistance, please submit a ticket at the following address:
http://support.adaptivecomputing.com/
If you do not already have a support username and password, please send an e-mail message to info@adaptivecomputing.com to request one.