5.636 Moab and Torque Configuration for Large Clusters

A few basic Moab and Torque configuration settings can improve performance on large clusters.

Moab configuration

In the moab.cfg file, add:

  1. RMPOLLINTERVAL 30,30 - This sets the minimum and maximum poll interval to 30 seconds.
  2. RMCFG[<name>] FLAGS=ASYNCSTART - This tells Moab not to wait for confirmation that a job has started before it continues.
  3. RMCFG[<name>] FLAGS=ASYNCDELETE - This tells Moab not to wait for confirmation that a job has been deleted before it continues.
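
Taken together, the three settings above might look like the following moab.cfg fragment. This is a sketch, not a definitive configuration: the resource manager name torque is a placeholder for your own <name>, and combining both flags in one comma-separated FLAGS list is an assumption about flag-list syntax that should be verified against your Moab version.

```
# moab.cfg fragment (illustrative; "torque" is a placeholder resource manager name)

# Fix the minimum and maximum poll interval at 30 seconds
RMPOLLINTERVAL  30,30

# Do not block waiting for job start or job delete confirmation
# (assumption: both flags can share one comma-separated FLAGS list)
RMCFG[torque]   FLAGS=ASYNCSTART,ASYNCDELETE
```

Moab typically needs to be restarted or told to re-read its configuration before changes to moab.cfg take effect.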

Torque configuration

  1. Follow the Starting Torque in large environments recommendations.
  2. Increase job_start_timeout on pbs_server. The default is 300 (5 minutes), but for large clusters the value should be changed to something like 600 (10 minutes). Sites running very large parallel jobs might want to set this value even higher.
  3. Use a node health check script on all MOM nodes. This helps prevent jobs from being scheduled on bad nodes and is especially helpful for large parallel jobs.
  4. Make sure that ulimit -n (maximum file descriptors) is set to unlimited, or a very large number, and not the default.
  5. For clusters with a high job throughput, increase the server parameter max_threads from its default. The default is (2 * number of cores + 1) * 10; for example, 330 on a 16-core server.
  6. In versions 5.1.3, 6.0.2, and later, if the server sends emails, set email_batch_seconds appropriately. Setting this parameter prevents pbs_server from forking too frequently and improves the server's performance. See email_batch_seconds for more information on this server parameter.
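
The server parameters named in the list above are set through qmgr. The directives below are a minimal sketch: the job_start_timeout value of 600 comes from the recommendation above, while the max_threads and email_batch_seconds values are illustrative assumptions that should be chosen for your own site.

```
# qmgr directives for pbs_server
# (600 is the value suggested above; 500 and 60 are illustrative assumptions)

# Allow up to 10 minutes for a job to start
set server job_start_timeout = 600

# Raise the thread limit above the computed default
set server max_threads = 500

# Batch outgoing job emails so pbs_server forks less often
set server email_batch_seconds = 60
```

These lines can be entered at the qmgr prompt on the pbs_server host, or passed one at a time, for example qmgr -c "set server job_start_timeout = 600". Running ulimit -n in the shell that starts the Torque daemons prints the current file descriptor limit, which is a quick way to check item 4.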

© 2016 Adaptive Computing