Scalability Guidelines

In very large clusters (in excess of 1,000 nodes), it may be advisable to tune a number of communication layer timeouts. By default, PBS MOM daemons time out on inter-MOM messages after 60 seconds. In TORQUE 1.1.0p5 and higher, this can be adjusted by setting the timeout parameter in the mom_priv/config file (see Appendix C: Node Manager (MOM) Configuration). If 15059 errors (cannot receive message from sisters) appear in the MOM logs, it may be necessary to increase this value.
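As a rough illustration, a raised inter-MOM timeout might look like the fragment below. It assumes the parameter uses the $-prefixed syntax common to mom_priv/config entries, and the value of 120 seconds is only an example; confirm the exact name and syntax against Appendix C for the installed TORQUE version. The MOM daemon typically needs to re-read its configuration before the change takes effect.

```
# mom_priv/config (illustrative value, not a recommendation)
$timeout 120
```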

Client-to-server communication timeouts are specified via the tcp_timeout server option using the qmgr command.
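A minimal sketch of setting this option is shown below. The helper names (tcp_timeout_directive, apply_tcp_timeout) and the 600-second value are hypothetical and chosen only for illustration; the underlying call is the standard qmgr -c form, which generally must be run by a TORQUE manager or root on the server host.

```shell
#!/bin/sh
# Build the qmgr directive that sets the server-wide tcp_timeout (seconds).
tcp_timeout_directive() {
    echo "set server tcp_timeout = $1"
}

# Apply the directive if qmgr is installed; otherwise only report what
# would have been run, so the sketch is safe to try on any host.
apply_tcp_timeout() {
    directive=$(tcp_timeout_directive "$1")
    if command -v qmgr >/dev/null 2>&1; then
        qmgr -c "$directive"
    else
        echo "qmgr not found; would run: qmgr -c \"$directive\""
    fi
}

# Example (illustrative value):
#   apply_tcp_timeout 600
```

The resulting value can then be checked in the output of qmgr -c "print server".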

On some systems, ulimit values may prevent large jobs from running. In particular, the open file descriptor limit (i.e., ulimit -n) should be set to at least the maximum job size in procs plus 20. It may also be worth raising the system-wide fs.file-max setting in /etc/sysctl.conf to a high value, such as:

/etc/sysctl.conf:
fs.file-max = 65536
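The descriptor guideline above can be checked with a short script. The following is a sketch only: the function names (required_nofile, check_nofile) are hypothetical, and the job size passed in is whatever the site considers its largest job in procs. It reads the soft limit of the shell it runs in, which may differ from the limit the MOM daemon actually inherits.

```shell
#!/bin/sh
# Minimum open-file-descriptor limit suggested by the guideline:
# largest job size in procs, plus 20.
required_nofile() {
    echo $(( $1 + 20 ))
}

# Compare the current soft limit (ulimit -n) with the requirement
# for the given maximum job size. Returns non-zero if the limit is too low.
check_nofile() {
    required=$(required_nofile "$1")
    current=$(ulimit -n)
    if [ "$current" = "unlimited" ] || [ "$current" -ge "$required" ]; then
        echo "ok: ulimit -n is $current (need at least $required)"
    else
        echo "too low: ulimit -n is $current (need at least $required)"
        return 1
    fi
}

# Example (illustrative job size):
#   check_nofile 4096
```

After editing /etc/sysctl.conf, running sysctl -p as root reloads the file, and cat /proc/sys/fs/file-max shows the value currently in effect on Linux.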

© 2015 Adaptive Computing