Appendix I: Considerations for Large Clusters

I.1 Resource Manager Scaling
I.2 Handling Large Numbers of Jobs
I.3 Handling Large Numbers of Nodes
I.4 Handling Large Jobs
I.5 Handling Large SMP Systems
I.6 Server Sizing

There are several key considerations in getting a batch system to scale.

I.1 Resource Manager Scaling

Proper Resource Manager Configuration

TORQUE
- General Scaling Overview
OpenPBS/PBSPro
- Manage Direct Node Communication with NODEPOLLFREQUENCY

I.2 Handling Large Numbers of Jobs

Reduce Command Processing Time

If your system's scheduling cycle regularly takes longer than the CLIENTTIMEOUT value, you can configure Moab to fork a copy of itself that will respond to certain information-only client commands (checkjob, showbf, showres, and showstart). This enables you to run intense diagnostic commands while Moab is in the middle of its scheduling process.

When you set UIMANAGEMENTPOLICY FORK, Moab forks a copy of itself that listens for client commands on a separate port, which you must configure with CLIENTUIPORT. This forked process responds to checkjob, showbf, showres, and showstart until the main scheduling cycle has finished. After that, it is killed and the normal process resumes responding to client commands. Moab prints a disclaimer at the top of the output of each command answered by the forked process, stating that the information may be a few seconds stale.

Example 25-1: Sample configuration

UIMANAGEMENTPOLICY      FORK
CLIENTUIPORT            41560

Moab forks a copy of itself on port 41560, where it will watch for checkjob, showbf, showres, and showstart commands until the main scheduling process completes.

Example 25-2: Sample command output

$ checkjob 34
--------------------------------------------------------------------
NOTE: The following information has been cached by the remote server
and may be slightly out of date.
--------------------------------------------------------------------
job 34
State: Idle
Creds: user:wightman group:company class:batch
WallTime: 00:00:00 of 00:01:00
SubmitTime: Thu May 22 14:17:06
(Time Queued Total: 00:00:18 Eligible: 00:00:18)
TemplateSets: DEFAULT
Total Requested Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
SystemID: scale
SystemJID: 34
IWD: $HOME/test/scale
SubmitDir: $HOME/test/scale
Executable: sleep 60

Limited Job Checkpointing

Use the LIMITEDJOBCP parameter. By default, Moab checkpoints information about every job it reads from its resource managers. Clusters that routinely run more than 15,000 jobs may see some speed-up by limiting which jobs are checkpointed. When LIMITEDJOBCP is set to TRUE, Moab only checkpoints jobs that have a hold, a system priority, a modified QoS, or one of a few other limited attributes. Some minimal statistical information is lost for jobs that are not checkpointed.
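
As a minimal sketch, assuming the same moab.cfg "PARAMETER VALUE" layout used in Example 25-1, limited checkpointing could be enabled as follows:

```
# Checkpoint only jobs with a hold, a system priority, a modified QoS,
# or one of the few other tracked attributes
LIMITEDJOBCP            TRUE
```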

Minimize Job Processing Time

Use the ENABLEHIGHTHROUGHPUT parameter. By default, Moab processes all job attributes, filters, remap classes, job arrays, and other information when a job is submitted. This requires full access to the Moab configuration and significantly increases the processing time Moab needs when jobs are submitted. When ENABLEHIGHTHROUGHPUT is set to TRUE, Moab stores the job information in an internal queue and returns the job ID immediately. The internal queue is processed when Moab begins its next scheduling iteration. This enables Moab to process hundreds of jobs per second rather than 20-30 per second. Because jobs are processed from a separate queue after the job ID has been returned, a job can be rejected after its submission appears to succeed; it is therefore recommended that MAILPROGRAM be configured so that Moab can send an email to the user if a job is rejected.

Because the job is not fully processed, some attributes may change after the job has been submitted. For example, when a job class is remapped, the new class is not reflected until Moab begins its next scheduling iteration. Additionally, job arrays are not instantiated until Moab begins its next scheduling cycle.

If ENABLEHIGHTHROUGHPUT is TRUE, you must set NODEALLOCATIONPOLICY to FIRSTAVAILABLE.
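
The three settings discussed above might be combined as in the following sketch. The MAILPROGRAM value DEFAULT is an assumption made for illustration; a site may instead need to supply the full path to its mail command.

```
# Queue submissions internally and return the job ID immediately
ENABLEHIGHTHROUGHPUT    TRUE

# Required whenever ENABLEHIGHTHROUGHPUT is TRUE
NODEALLOCATIONPOLICY    FIRSTAVAILABLE

# Recommended so users are notified if a queued job is later rejected
# (DEFAULT is an assumed value, not taken from this appendix)
MAILPROGRAM             DEFAULT
```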

Load all Non-Completed Jobs at Startup

Use the LOADALLJOBCP parameter. By default, Moab loads non-complete jobs for active resource managers only. By setting LOADALLJOBCP to TRUE, Moab will load all non-complete jobs from all checkpoint files at startup, regardless of whether their corresponding resource manager is active.
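
A minimal sketch of this setting, assuming the standard moab.cfg "PARAMETER VALUE" layout:

```
# Load non-complete jobs from all checkpoint files at startup,
# including those whose resource manager is not active
LOADALLJOBCP            TRUE
```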

Reducing Job Start Time

Use the ASYNCSTART parameter. By default, Moab will launch one job at a time and verify that each job successfully started before launching a subsequent job. For organizations with large numbers of very short jobs (less than 2 minutes in duration), the delay associated with confirming successful job start can lead to productivity losses. If tens or hundreds of jobs must be started per minute, and especially if the workload is composed primarily of serial jobs, then the resource manager ASYNCSTART flag may be set. When set, Moab will launch jobs optimistically and confirm success or failure of the job start on the subsequent scheduling iteration. Also consider adding the ASYNCDELETE flag if users frequently cancel jobs.
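
A sketch of how both flags might be set together, assuming a resource manager named torque (the name used in the TIMEOUT example in this section) and that RMCFG accepts a comma-separated FLAGS list:

```
# Launch jobs optimistically and confirm the start on the next iteration;
# ASYNCDELETE is the optional companion flag for frequent job cancellations
RMCFG[torque]           FLAGS=ASYNCSTART,ASYNCDELETE
```

If the resource manager already has a FLAGS entry, the new flags would presumably need to be merged into that existing list rather than added on a second line.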

Increase TORQUE Timeout

The number of jobs in the queue affects how long TORQUE needs to send updates to Moab. Adjust the Moab timeout to prevent "End of File" errors during scheduling iterations. The right value is environment specific, but in general you should make this adjustment if you have more than 50,000 jobs in the queue.

RMCFG[torque] TIMEOUT=300

This sets the connection timeout to 300 seconds.

Reducing Job Reservation Creation Time

Use the RMCFG JOBRSVRECREATE attribute. By default, Moab destroys and re-creates job reservations each time a resource manager updates any aspect of a job. Historically, this stems from the fact that certain resource managers would inadvertently or intentionally migrate job tasks from originally requested nodes to other nodes. To maintain synchronization, Moab would re-create reservations each iteration thus incorporating these changes. On most modern resource managers, these changes never occur, but the effort required to handle this case grows with the size of the cluster and the size of the queue. Consequently, on very large systems with thousands of nodes and thousands of jobs, a noticeable delay is present. By setting JOBRSVRECREATE to FALSE on resource managers that do not exhibit this behavior, significant time savings per iteration can be obtained.
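
As a sketch, assuming a resource manager named torque that never migrates job tasks after they are placed, and the same ATTRIBUTE=VALUE form used for TIMEOUT above:

```
# Keep existing job reservations instead of re-creating them
# on every resource manager update
RMCFG[torque]           JOBRSVRECREATE=FALSE
```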

Optimizing Backfill Time

Use the OPTIMIZEDBACKFILL flag, which speeds up backfill when a system reservation is in use.
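
The sketch below assumes that OPTIMIZEDBACKFILL is set through the scheduler's FLAGS attribute, in the same way as FASTRSVSTARTUP, and that the scheduler is named Moab; both are assumptions to verify against your own configuration.

```
# Speed up backfill when a system reservation is in use
# (the scheduler name "Moab" is a placeholder)
SCHEDCFG[Moab]          FLAGS=OPTIMIZEDBACKFILL
```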

Constraining Moab Logging - LOGLEVEL

Use the LOGLEVEL parameter. When running on large systems, setting LOGLEVEL to 0 or 1 is normal and recommended. Only increase LOGLEVEL above 0 or 1 if you have been instructed to do so by Moab support.
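
A minimal sketch of the recommended setting for a large system:

```
# Keep logging minimal; raise only when instructed by Moab support
LOGLEVEL                1
```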

Preemption

When preemption is enabled, Moab can take considerably more time to schedule jobs in every scheduling iteration. Preemption increases the number of placement options available to Moab, so Moab needs more time to place jobs optimally. If you are running a large cluster or have an unusually large number of jobs (>10,000), consider disabling preemption. If disabling preemption is not possible, consider limiting its scope to a small subset of jobs (as both preemptors and preemptees).

Handling Transient Resource Manager Failures

Use the RMCFG MAXITERATIONFAILURECOUNT attribute.
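
The following sketch only illustrates the attribute's placement, assuming a resource manager named torque and the ATTRIBUTE=VALUE form used for TIMEOUT above. The value 5 is purely illustrative and is not a recommendation from this appendix.

```
# Illustrative value only; choose a count suited to how often
# your resource manager fails transiently
RMCFG[torque]           MAXITERATIONFAILURECOUNT=5
```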

Constrain the Number of Jobs Preempted per Iteration

Use the JOBMAXPREEMPTPERITERATION parameter.
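
A sketch of this setting, assuming the parameter takes a plain integer; the value 10 is an illustrative placeholder rather than a recommended limit:

```
# Cap the number of jobs Moab may preempt in one scheduling iteration
# (10 is an arbitrary example value)
JOBMAXPREEMPTPERITERATION  10
```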

For systems with very large job counts, configuration options controlling the maximum supported limits may need to be adjusted, including the maximum number of reservations and the maximum number of supported evaluation ranges.

Scheduler Settings

If you use Moab, a number of scheduler parameters can be set that may improve TORQUE performance. In an environment containing a large number of short-running jobs, the JOBAGGREGATIONTIME parameter can be set to reduce the number of workload and resource queries performed by the scheduler when an event-based interface is enabled. Setting JOBAGGREGATIONTIME instructs the scheduler to ignore events coming from the resource manager and to schedule at regular intervals rather than around resource manager events. If the pbs_server daemon is heavily loaded and PBS API timeout errors (i.e. "Premature end of message") are reported within the scheduler, the TIMEOUT attribute of the RMCFG parameter may be set to a value between 30 and 90 seconds.
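
The two settings described above might look like the following sketch. It assumes a resource manager named torque and a time value in [[HH:]MM:]SS form; the 4-second aggregation time is an illustrative choice, and the 60-second timeout is simply a value inside the 30 to 90 second range.

```
# Schedule at regular intervals instead of reacting to each
# resource manager event (4 seconds is an example value)
JOBAGGREGATIONTIME      00:00:04

# Allow a heavily loaded pbs_server more time to answer
RMCFG[torque]           TIMEOUT=60
```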

I.3 Handling Large Numbers of Nodes

For very large clusters (>= 10,000 processors) default scheduling behavior may not scale as desired. To address this, the following parameters should be considered:

Parameter	Recommended Settings
RMPOLLINTERVAL	In large node environments with large and long jobs, scheduling overhead can be minimized by increasing RMPOLLINTERVAL above its default setting. If an event-driven resource management interface is available, values of two minutes or higher may be used. Scheduling overhead can be determined by looking at the scheduling load reported by mdiag -S.
LIMITEDNODECP	Startup/shutdown time can be minimized by disabling full node state checkpointing that includes some statistics covering node availability.
SCHEDCFG FLAGS=FASTRSVSTARTUP	When you have reservations on a large number of nodes, it can take Moab a long time to recreate them on startup. Setting the FASTRSVSTARTUP scheduler flag greatly reduces startup time.
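
Taken together, the parameters in the table above might be configured as in this sketch. The scheduler name Moab is a placeholder, the two-minute poll interval assumes an event-driven resource manager interface is available, and setting LIMITEDNODECP to TRUE is assumed to be what disables full node state checkpointing.

```
# Poll the resource manager every two minutes
RMPOLLINTERVAL          00:02:00

# Skip full node state checkpointing to shorten startup and shutdown
LIMITEDNODECP           TRUE

# Speed up re-creation of reservations at startup
# (the scheduler name "Moab" is a placeholder)
SCHEDCFG[Moab]          FLAGS=FASTRSVSTARTUP
```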

* For clusters where the number of nodes or processors exceeds 50,000, the maximum stack size for the shell in which Moab is started may need to be increased (as Moab may crash if the stack size is too small). On most Unix/Linux based systems, the command ulimit -s unlimited may be used to increase the stack size limit before starting Moab. This may be placed in your Moab startup script.

See Appendix D for further information on default and supported object limits.

To keep the Moab boot time low, avoid adding large numbers of NODECFG lines to the moab.cfg or moab.d/*.cfg files.

For example, adding a configuration line to define features for each node in a large cluster (such as NODECFG[x] Features+=green,purple) can greatly increase the Moab boot time. If Moab processes 15 node configuration lines per second, a 50,000-node system needs about 3,333 seconds (50,000 / 15), which adds approximately 55 minutes of node configuration processing to the Moab boot time.

In this case, it is better to define the node features in the resource manager configuration.

I.4 Handling Large Jobs

For large jobs, additional parameters beyond those specified for large node systems may be required. These include settings for the maximum number of tasks per job, and the maximum number of nodes per job.

I.5 Handling Large SMP Systems

For large-way SMP systems (> 512 processors/node), Moab defaults may need adjustment.

Parameter	Recommended Settings
MAXRSVPERNODE	By default, Moab does not expect more than 48 jobs per node to be running or have future reservations. Increasing this parameter to a value larger than the expected maximum number of jobs per node is advised.
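
As a sketch, a site expecting up to roughly 100 jobs per node might raise the limit as shown here. The value 128 is an illustrative assumption, chosen only because it exceeds that hypothetical maximum.

```
# Raise the per-node reservation limit above the expected
# maximum number of jobs per node (128 is an example value)
MAXRSVPERNODE           128
```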

I.6 Server Sizing

See Hardware and Software Requirements for recommendations.