
Appendix I Considerations for Large Clusters

There are several key considerations in getting a batch system to scale.

I.0.1 Resource Manager Scaling

I.0.1.A Proper Resource Manager Configuration

I.0.2 Handling Large Numbers of Jobs

I.0.2.A Set a Minimum RMPOLLINTERVAL

With event-driven resource managers such as Torque, the resource manager notifies the scheduler each time a job is submitted. In an attempt to minimize response time, the scheduler starts a new scheduling cycle to determine whether the newly submitted job can run. In systems where large numbers of jobs are submitted at once, this may not produce the desired behavior, for two reasons. First, by scheduling at every job submission, Moab places newly submitted jobs onto available resources on a first-come, first-served basis rather than evaluating the entire group of new jobs at once and optimizing their placement accordingly. Second, by launching a scheduling iteration for every job submitted, Moab places a heavy load on the resource manager. For example, if a user submits 1000 new jobs simultaneously, then for each job submitted, the resource manager contacts the scheduler, the scheduler starts a new iteration, and in that iteration the scheduler contacts the resource manager to request updated information on all jobs and available resources.

Setting a minimum RMPOLLINTERVAL causes the scheduler not to process jobs as quickly as they are submitted, but rather to wait a minimum amount of time so that more jobs can be submitted and then processed as a group.

RMPOLLINTERVAL 30,60

With this configuration, Moab schedules every 30 seconds if the system is busy and every 60 seconds if it is not.

I.0.2.B Reduce Command Processing Time

If your system's scheduling cycle regularly takes longer than the CLIENTTIMEOUT value, you can configure Moab to fork a copy of itself that will respond to certain information-only client commands (checkjob, showbf, showres, and showstart). This enables you to run intense diagnostic commands while Moab is in the middle of its scheduling process.

When you set UIMANAGEMENTPOLICY FORK and CLIENTUIPORT <port number> on the server side, Moab forks a copy of itself that listens for client commands on the separate port. Systems that run client commands, such as submit hosts, can then set CLIENTUIPORT 41560, for example, and commands such as checkjob, showbf, showres, and showstart will return cached information from the previous scheduling iteration. Moab prints a disclaimer at the top of each command response populated by the forked process, stating that the information may be an iteration behind.

See CLIENTTIMEOUT, CLIENTUIPORT, and UIMANAGEMENTPOLICY for parameter information.

Example I-1: Sample configuration

UIMANAGEMENTPOLICY      FORK
CLIENTUIPORT            41560

Moab forks a copy of itself on port 41560, where it will watch for checkjob, showbf, showres, and showstart commands until the main scheduling process completes.

Example I-2: Sample command output

$ checkjob 34

--------------------------------------------------------------------
NOTE: The following information has been cached by the remote server
and may be slightly out of date.
--------------------------------------------------------------------

job 34

State: Idle
Creds: user:wightman group:company class:batch
WallTime: 00:00:00 of 00:01:00
SubmitTime: Thu May 22 14:17:06
  (Time Queued Total: 00:00:18 Eligible: 00:00:18)

TemplateSets: DEFAULT
Total Requested Tasks: 1

Req[0] TaskCount: 1 Partition: ALL

SystemID: scale
SystemJID: 34

IWD: $HOME/test/scale
SubmitDir: $HOME/test/scale
Executable: sleep 60

I.0.2.C Limited Job Checkpointing

Use the LIMITEDJOBCP parameter. By default, Moab checkpoints information about every job it reads from its resource managers. On a cluster that routinely runs more than 15,000 jobs, you may see some speed-up by limiting which jobs are checkpointed. When LIMITEDJOBCP is set to TRUE, Moab only checkpoints jobs that have a hold or a system priority, jobs whose QoS has been modified, and a few other limited cases. Some minimal statistical information is lost for jobs that are not checkpointed.
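For example, this behavior can be enabled with a single moab.cfg line:

LIMITEDJOBCP TRUE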

I.0.2.D Minimize Job Processing Time

Use the ENABLEHIGHTHROUGHPUT parameter. By default, Moab processes all job attributes, filters, remap classes, job arrays, and other information when a job is submitted. This requires full access to the Moab configuration and significantly increases the processing time Moab needs when jobs are submitted. When ENABLEHIGHTHROUGHPUT is set to TRUE, Moab stores the job information in an internal queue and returns the job ID immediately. The internal queue is processed when Moab begins its next scheduling iteration. This enables Moab to process hundreds of jobs per second rather than 20-30 per second. Because jobs are processed in a separate queue after the job ID has been returned, it is recommended that MAILPROGRAM be configured so that Moab can send an email to the user if a job is rejected.

Because the job is not fully processed, some attributes may change after the job has been submitted. For example, when a job class is remapped, the new class is not reflected until Moab begins its next scheduling iteration. Additionally, job arrays are not instantiated until Moab begins its next scheduling cycle.

If ENABLEHIGHTHROUGHPUT is TRUE, you must set NODEALLOCATIONPOLICY to FIRSTAVAILABLE.
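A minimal moab.cfg sketch for this mode might combine the following parameters (the MAILPROGRAM value shown is only illustrative):

ENABLEHIGHTHROUGHPUT  TRUE
NODEALLOCATIONPOLICY  FIRSTAVAILABLE
MAILPROGRAM           DEFAULT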

I.0.2.E Load all Non-Completed Jobs at Startup

Use the LOADALLJOBCP parameter. By default, Moab loads non-complete jobs for active resource managers only. By setting LOADALLJOBCP to TRUE, Moab will load all non-complete jobs from all checkpoint files at startup, regardless of whether their corresponding resource manager is active.
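For example:

LOADALLJOBCP TRUE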

I.0.2.F Reducing Job Start Time

Use the ASYNCSTART parameter. By default, Moab will launch one job at a time and verify that each job successfully started before launching a subsequent job. For organizations with large numbers of very short jobs (less than 2 minutes in duration), the delay associated with confirming successful job start can lead to productivity losses. If tens or hundreds of jobs must be started per minute, and especially if the workload is composed primarily of serial jobs, then the resource manager ASYNCSTART flag may be set. When set, Moab will launch jobs optimistically and confirm success or failure of the job start on the subsequent scheduling iteration. Also consider adding the ASYNCDELETE flag if users frequently cancel jobs.
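As a sketch, assuming a resource manager named torque, both flags might be set on its RMCFG line:

RMCFG[torque] FLAGS=ASYNCSTART,ASYNCDELETE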

I.0.2.G Reducing Job Reservation Creation Time

Use the RMCFG JOBRSVRECREATE attribute. By default, Moab destroys and re-creates job reservations each time a resource manager updates any aspect of a job. Historically, this stems from the fact that certain resource managers would inadvertently or intentionally migrate job tasks from originally requested nodes to other nodes. To maintain synchronization, Moab would re-create reservations each iteration thus incorporating these changes. On most modern resource managers, these changes never occur, but the effort required to handle this case grows with the size of the cluster and the size of the queue. Consequently, on very large systems with thousands of nodes and thousands of jobs, a noticeable delay is present. By setting JOBRSVRECREATE to FALSE on resource managers that do not exhibit this behavior, significant time savings per iteration can be obtained.
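For example, again assuming a resource manager named torque:

RMCFG[torque] JOBRSVRECREATE=FALSE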

I.0.2.H Optimizing Backfill Time

Use the OPTIMIZEDBACKFILL flag. It speeds up backfill when a system reservation is in use.
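As a sketch, assuming the flag is set on the scheduler definition (the scheduler name Moab is only a placeholder):

SCHEDCFG[Moab] FLAGS=OPTIMIZEDBACKFILL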

I.0.2.I Constraining Moab Logging - LOGLEVEL

Use the LOGLEVEL parameter. When running on large systems, setting LOGLEVEL to 0 or 1 is normal and recommended. Only increase LOGLEVEL above 0 or 1 if you have been instructed to do so by Moab support.
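For example:

LOGLEVEL 0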

I.0.2.J Preemption

When preemption is enabled, Moab can take considerably more time scheduling jobs in every scheduling iteration. Preemption increases the number of options available to Moab and therefore increases the time Moab needs to place jobs optimally. If you are running a large cluster or have an unusually large number of jobs (more than 10,000), consider disabling preemption. If disabling preemption is not possible, consider limiting its scope to only a small subset of jobs (as both preemptors and preemptees).

I.0.2.K Handling Transient Resource Manager Failures

Use the RMCFG MAXITERATIONFAILURECOUNT attribute.
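As a sketch, assuming a resource manager named torque (the value shown is only illustrative):

RMCFG[torque] MAXITERATIONFAILURECOUNT=10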

I.0.2.L Constrain the number of jobs preempted per iteration

Use the JOBMAXPREEMPTPERITERATION parameter.
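For example (the value shown is only illustrative):

JOBMAXPREEMPTPERITERATION 10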

For very large job count systems, configuration options controlling the maximum supported limits may need to be adjusted, including the maximum number of reservations and the maximum number of supported evaluation ranges.

I.0.2.M Scheduler settings

There are a number of parameters that can be set on the scheduler which may improve Torque performance. In an environment containing a large number of short-running jobs, the JOBAGGREGATIONTIME parameter can be set to reduce the number of workload and resource queries performed by the scheduler when an event-based interface is enabled. Setting JOBAGGREGATIONTIME instructs the scheduler to ignore events coming from the resource manager and to perform scheduling at regular intervals rather than around resource manager events. If the pbs_server daemon is heavily loaded and PBS API timeout errors (for example, "Premature end of message") are reported within the scheduler, the TIMEOUT attribute of the RMCFG parameter may be set to a value between 30 and 90 seconds.
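As a sketch, assuming a resource manager named torque (both values are only illustrative):

JOBAGGREGATIONTIME 10
RMCFG[torque]      TIMEOUT=60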

I.0.3 Handling Large Numbers of Nodes

For very large clusters (>= 10,000 processors), default scheduling behavior may not scale as desired. To address this, the following parameters should be considered; a combined configuration sketch follows the table:

Parameter                      Recommended Settings

RMPOLLINTERVAL                 In large node environments with large and long jobs, scheduling overhead can be minimized by increasing RMPOLLINTERVAL above its default setting. If an event-driven resource management interface is available, values of two minutes or higher may be used. Scheduling overhead can be determined by looking at the scheduling load reported by mdiag -S.

LIMITEDNODECP                  Startup/shutdown time can be minimized by disabling full node state checkpointing, which includes some statistics covering node availability.

SCHEDCFG FLAGS=FASTRSVSTARTUP  When you have reservations on a large number of nodes, it can take Moab a long time to re-create them on startup. Setting the FASTRSVSTARTUP scheduler flag greatly reduces startup time.

* For clusters where the number of nodes or processors exceeds 50,000, the maximum stack size for the shell in which Moab is started may need to be increased (as Moab may crash if the stack size is too small). On most Unix/Linux based systems, the command ulimit -s unlimited may be used to increase the stack size limit before starting Moab. This may be placed in your Moab startup script.

See Appendix D for further information on default and supported object limits.

Avoid adding large numbers of NODECFG lines in the moab.cfg or moab.d/*.cfg files to keep the Moab boot time low.

For example, adding a configuration line to define features for each node in a large cluster (such as NODECFG[x] Features+=green,purple) can greatly increase the Moab boot time. If Moab processes 15 node configuration lines per second for a 50,000-node system, it could add approximately 55 minutes of node configuration processing to the Moab boot time.

In this case, it is better to define the node features in the resource manager configuration.
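For example, with Torque, node features can instead be listed as node properties in the server_priv/nodes file (the node names and properties shown are only illustrative):

node001 np=16 green purple
node002 np=16 green purple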

I.0.4 Handling Large Jobs

For large jobs, additional parameters beyond those specified for large node systems may be required. These include settings for the maximum number of tasks per job, and the maximum number of nodes per job.

I.0.5 Handling Large SMP Systems

For large-way SMP systems (> 512 processors/node) Moab defaults may need adjustment.

Parameter       Recommended Settings

MAXRSVPERNODE   By default, Moab does not expect more than 64 jobs per node to be running or to have future reservations. Increasing this parameter to a value larger than the expected maximum number of jobs per node is advised.
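For example (the value shown is only illustrative):

MAXRSVPERNODE 1024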

I.0.6 Server Sizing

See 1.1.1 Environment Requirements for recommendations.

