Beyond its ability to relocate automatically to another execution server, Moab offers several features that improve the availability of a cluster. The following table describes some of these features.
Feature | Description |
---|---|
JOBACTIONONNODEFAILURE | If a node allocated to an active job fails, it is possible for the job to continue running indefinitely even though the output it produces is of no value. Setting this parameter allows the scheduler to automatically preempt these jobs when a node failure is detected, possibly allowing the job to run elsewhere and also allowing other allocated nodes to be used by other jobs. |
SCHEDCFG[] FBSERVER | Specifies the fallback or secondary server in an HA setup. |
SCHEDCFG[] RECOVERYACTION | If a catastrophic failure event occurs (SIGSEGV or SIGILL signal is triggered), Moab can be configured to automatically restart, trap the failure, ignore the failure, or behave in the default manner for the specified signal. These actions are specified using the values RESTART, TRAP, IGNORE, or DIE, as in the following example: SCHEDCFG[bas] MODE=NORMAL RECOVERYACTION=RESTART |
SCHEDCFG[] SERVER | Specifies the primary server in an HA setup. |
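
As a rough illustration of how the parameters in the table might be combined, the following moab.cfg fragment sketches a two-server HA setup. It reuses the scheduler name bas from the example in the table. The host names are hypothetical placeholders, the port number is an assumption, and REQUEUE is assumed to be an accepted JOBACTIONONNODEFAILURE value in your Moab version; check each against the parameter reference for your installation before relying on it.

```
# Primary and fallback servers (hypothetical host names; the port is an assumption).
SCHEDCFG[bas] SERVER=head1.example.com:42559
SCHEDCFG[bas] FBSERVER=head2.example.com

# Restart Moab automatically after a SIGSEGV or SIGILL, as in the table's example.
SCHEDCFG[bas] MODE=NORMAL RECOVERYACTION=RESTART

# Act on active jobs when an allocated node fails
# (REQUEUE is an assumed value; confirm the supported values for your version).
JOBACTIONONNODEFAILURE REQUEUE
```

With a layout like this, the fallback server would take over scheduling if the primary became unavailable, while the recovery action and the node-failure policy cover failures of the Moab process and of compute nodes respectively.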