
F.1 Scalability guidelines

In very large clusters (in excess of 1,000 nodes), it may be advisable to tune a number of communication layer timeouts. By default, PBS MOM daemons time out on inter-MOM messages after 60 seconds. In TORQUE 1.1.0p5 and higher, this can be adjusted by setting the timeout parameter in the mom_priv/config file (see Node manager (MOM) configuration). If 15059 errors (cannot receive message from sisters) appear in the MOM logs, it may be necessary to increase this value.
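As an illustration, a MOM configuration entry raising the inter-MOM timeout might look like the sketch below. The value 120 is a hypothetical choice, not a recommendation, and the $ prefix assumes the usual mom_priv/config parameter syntax.

```
# mom_priv/config
# Hypothetical value: allow 120 seconds for inter-MOM messages (default is 60)
$timeout 120
```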

Client-to-PBS server and MOM-to-PBS server communication timeouts are specified via the tcp_timeout server option using the qmgr command.
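A minimal sketch of setting and then inspecting this option with qmgr follows. The value 60 is a hypothetical choice and should be picked to suit the site. The commands are assumed to run on the server host as a PBS manager.

```shell
# Hypothetical value: set the server communication timeout to 60 seconds
qmgr -c 'set server tcp_timeout = 60'

# Print the server settings and show the tcp_timeout line
qmgr -c 'print server' | grep tcp_timeout
```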

On some systems, ulimit values may prevent large jobs from running. In particular, the open file descriptor limit (ulimit -n) should be set to at least the maximum job size in procs plus 20. It may also help to set fs.file-max in sysctl.conf to a high value, such as:

/etc/sysctl.conf:
fs.file-max = 65536
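The descriptor guideline above can be checked with a short script such as the following sketch. The function name required_nofile and the job size of 4096 procs are hypothetical and used only for illustration.

```shell
#!/bin/sh
# Minimum open file descriptor limit suggested by the guideline:
# maximum job size in procs plus 20.
required_nofile() {
    echo $(( $1 + 20 ))
}

# Hypothetical largest job size for this site; adjust to match the cluster.
max_procs=4096
need=$(required_nofile "$max_procs")

# Current soft limit on open file descriptors for this shell.
have=$(ulimit -n)

if [ "$have" = "unlimited" ] || [ "$have" -ge "$need" ]; then
    echo "ok: ulimit -n is $have, need at least $need"
else
    echo "too low: ulimit -n is $have, need at least $need"
fi
```

On most Linux systems, running sysctl -p as root reloads /etc/sysctl.conf so that a changed fs.file-max takes effect without a reboot.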
