Moab Workload Manager

Appendix I: Considerations for Large Clusters

There are several key considerations in getting a batch system to scale.

I.1 Resource Manager Scaling

Proper Resource Manager Configuration

TORQUE
- General Scaling Overview
OpenPBS/PBSPro
- Manage Direct Node Communication with NODEPOLLFREQUENCY

I.2 Handling Large Numbers of Jobs

Aggregating Scheduling Cycles - JOBAGGREGATIONTIME

With event driven resource manager interfaces (such as TORQUE, PBS, and SGE), each time a job is submitted, the resource manager notifies the scheduler of this fact. In an attempt to minimize response time, the scheduler will start a new scheduling cycle to determine if the newly submitted job can run. In systems with large numbers of jobs submitted at once, this may not result in the desired behavior for two reasons. First, by scheduling at every job submission, Moab will schedule newly submitted jobs onto available resources in a first come, first served basis rather than evaluating the entire group of new jobs at once and optimizing the placement accordingly. Second, by launching a scheduling iteration for every job submitted, Moab may place a heavy load on the resource manager. For example, if a user were to submit 1000 new jobs simultaneously, for each job submitted, the resource manager would contact the scheduler, the scheduler would start a new iteration, and in this iteration, the scheduler would contact the resource manager requesting updated information on all jobs and resources available.

The JOBAGGREGATIONTIME parameter works by informing the scheduler to not process jobs as quickly as they are submitted, but rather to process these new jobs in groups.

Limited Job Checkpointing - LIMITEDJOBCP

By default, Moab will checkpoint information about every job it reads from its resource managers. When a cluster routinely runs more than 15000 jobs, they may see some speed-ups by limiting which jobs are checkpointed. When LIMITEDJOBCP is set to TRUE, Moab will only checkpoint jobs that have a hold, a system priority, jobs that have had their QoS modified, and a few other limited attributes. Some minimal statistical information is lost for jobs that are not checkpointed.

Reducing Job Start Time - RMCFG ASYNCSTART flag value.

By default, Moab will launch one job at a time and verify that each job successfully started before launching a subsequent job. For organizations with large numbers of very short jobs (less than 2 minutes in duration), the delay associated with confirming successful job start can lead to productivity losses. If tens or hundreds of jobs must be started per minute, and especially if the workload is composed primarily of serial jobs, then the resource manager ASYNCSTART flag may be set. When set, Moab will launch jobs optimistically and confirm success or failure of the job start on the subsequent scheduling iteration.

Reducing Job Reservation Creation Time - RMCFG JOBRSVRECREATE attribute.

By default, Moab destroys and re-creates job reservations each time a resource manager updates any aspect of a job. Historically, this stems from the fact that certain resource managers would inadvertently or intentionally migrate job tasks from originally requested nodes to other nodes. To maintain synchronization, Moab would re-create reservations each iteration thus incorporating these changes. On most modern resource managers, these changes never occur, but the effort required to handle this case grows with the size of the cluster and the size of the queue. Consequently, on very large systems with thousands of nodes and thousands of jobs, a noticeable delay is present. By setting JOBRSVRECREATE to FALSE on resource managers that do not exhibit this behavior, significant time savings per iteration can be obtained.

Minimizing Compute Intensive Operations - ENABLESTARTESTIMATESTATS

Where possible, these parameters should be disabled as they are expensive per job operations.

Buffering Log Output - MOABENABLELOGBUFFERING environment variable

When large or verbose logs are required, setting this environment variable to true will allow Moab to buffer its logs and speed up log writing. This capability is primarily useful when writing to remote file systems and is only of limited value with local file systems.

Constraining Preemption - PREEMPTSEARCHDEPTH parameter

When a large number of active serial jobs are present in a system, Moab may unnecessarily consider additional jobs even after an adequate number of feasible preemptible jobs have been located. Setting this parameter will cause Moab to cease its search after the specified number of target preemptees has been located.

Note: Setting this parameter only impacts searches for serial preemptor jobs.

Handling Transient Resource Manager Failures - MOABMAXRMFAILCOUNT=<INTEGER>

Disable Job Feasibility Analysis - MOABDISABLEFEASIBILITYCHECK

Constrain the number of jobs started per iteration - JOBMAXSTARTPERITERATION parameter. (Some resource managers can take in excess of two seconds per job start.) Because Moab must serialize job launch, a system where many jobs are started each iteration may appear sluggish from the point of view of client commands. Setting this parameter will reduce the maximum duration of a scheduling cycle and thus the maximum duration a client command will wait for processing.

Constrain the number of jobs preempted per iteration - JOBMAXPREEMPTPERITERATION parameter

Note: For very large job count systems, configuration options controlling the maximum supported limits may need to be adjusted including the maximum number of reservations and the maximum number of supported evaluation ranges.

I.3 Handling Large Numbers of Nodes

For very large clusters (>= 10,000 processors) default scheduling behavior may not scale as desired. To address this, the following parameters should be considered:

Parameter	Recommended Settings
RMPOLLINTERVAL	In large node environments with large and long jobs, scheduling overhead can be minimized by increasing RMPOLLINTERVAL above its default setting. If an event-driven resource management interface is available, values of two minutes or higher may be used. Scheduling overhead can be determined by looking at the scheduling load reported by mdiag -S.
LIMITEDNODECP	Startup/shutdown time can be minimized by disabling full node state checkpointing that includes some statistics covering node availability.
--with-maxnodes	(special builds only - contact support) *
--with-maxtasks	(special builds only - contact support) *

* For clusters where the number of nodes or processors exceeds 50,000, the maximum stack size for the shell in which Moab is started may need to be increased (as Moab may crash if the stack size is too small). On most Unix®/Linux® based systems, the command ulimit -s unlimited may be used to increase the stack size limit before starting Moab. This may be placed in your Moab startup script.

Other considerations include using grid based resource reporting when using peers and enabling virtual nodes to collapse scheduling decisions.

Note: See Appendix D for further information on default and supported object limits.

I.4 Handling Large Jobs

For large jobs, additional parameters beyond those specified for large node systems may be required. These include settings for the maximum number of tasks per job, and the maximum number of nodes per job.

I.5 Handling Large SMP Systems

For large-way SMP systems (> 512 processors/node) Moab defaults may need adjustment.

Parameter	Recommended Settings
MAXRSVPERNODE	By default, Moab does not expect more than 24 jobs per node to be running or have future reservations. Increasing this parameter to a value larger than the expected maximum number of jobs per node is advised.
--with-maxrange	(special builds only - contact support)

I.6 Server Sizing

See Hardware and Software Requirements for recommendations.