Moab and TORQUE Configuration for Large Clusters
A few basic Moab and TORQUE configuration changes can improve performance on large clusters.
Moab configuration
In the moab.cfg file, add:
- RMPOLLINTERVAL 30,30 - This sets the minimum and maximum poll interval to 30 seconds.
- RMCFG[<name>] FLAGS=ASYNCSTART - This tells Moab to continue scheduling without waiting for confirmation that the job has started.
- RMCFG[<name>] FLAGS=ASYNCDELETE - This tells Moab to continue scheduling without waiting for confirmation that the job was deleted.
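As a rough sketch, the three settings above might appear together in moab.cfg as follows. The resource manager name torque is a placeholder for <name>, not something the text above specifies, and combining both flags in one comma-separated FLAGS value is an assumption about how the two RMCFG entries are merged; check it against your Moab version.

```
# moab.cfg fragment (sketch; "torque" is a placeholder resource manager name)
RMPOLLINTERVAL  30,30
RMCFG[torque]   FLAGS=ASYNCSTART,ASYNCDELETE
```

Moab generally reads moab.cfg at startup, so a change like this would normally take effect after the scheduler is restarted or told to reload its configuration.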
TORQUE configuration
- Follow the recommendations in Starting TORQUE in large environments.
- Increase job_start_timeout on pbs_server. The default is 300 (5 minutes), but for large clusters the value should be changed to something like 600 (10 minutes). Sites running very large parallel jobs might want to set this value even higher.
- Use a node health check script on all MOM nodes. This helps prevent jobs from being scheduled on bad nodes and is especially helpful for large parallel jobs.
- Make sure that ulimit -n (maximum file descriptors) is set to unlimited, or a very large number, and not the default.
- For clusters with high job throughput, increase the server parameter max_threads from its default. The default is (2 * number of cores + 1) * 10.
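The server-side settings above can be sketched in shell. This example computes the default max_threads from the formula given, prints (rather than runs) the qmgr commands that would raise job_start_timeout and max_threads, and checks a file descriptor limit. The helper names, the multiplier of 2, and the 65536 threshold are illustrative assumptions, not recommended values.

```shell
#!/bin/sh
# Default max_threads as stated above: (2 * number of cores + 1) * 10.
default_max_threads() {
    cores=$1
    echo $(( (2 * cores + 1) * 10 ))
}

# Print the qmgr commands for review instead of executing them.
# $1 = cores on the pbs_server host, $2 = job_start_timeout in seconds,
# $3 = multiplier applied to the default max_threads (hypothetical choice).
print_qmgr_settings() {
    base=$(default_max_threads "$1")
    threads=$(( base * $3 ))
    printf 'qmgr -c "set server job_start_timeout = %s"\n' "$2"
    printf 'qmgr -c "set server max_threads = %s"\n' "$threads"
}

# Succeed if a file descriptor limit is "unlimited" or at least $2.
fd_limit_ok() {
    [ "$1" = "unlimited" ] || [ "$1" -ge "$2" ]
}

# Example: a 16-core server (default max_threads of 330), 10-minute timeout.
print_qmgr_settings 16 600 2

# 65536 is an arbitrary example threshold for "a very large number".
fd_limit_ok "$(ulimit -n)" 65536 || echo "warning: ulimit -n is below 65536" >&2
```

Printing the commands keeps the sketch safe to run on any host; on a real pbs_server they would be run by a TORQUE administrator after review.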
Related Topics
Appendix F: Large Cluster Considerations