In very large clusters (in excess of 1,000 nodes), it may be advisable to additionally tune a number of communication layer timeouts. By default, PBS MOM daemons will timeout on inter-MOM messages after 60 seconds. In TORQUE 1.1.0p5 and higher, this can be adjusted by setting the timeout parameter in the mom_priv/config file (see, Node manager (MOM) configuration). If 15059 errors (cannot receive message from sisters) are seen in the MOM logs, it may be necessary to increase this value.
Client-to-PBS server and MOM-to-PBS server communication timeouts are specified via the tcp_timeout server option using the qmgr command.
On some systems, ulimit values may prevent large jobs from running. In particular, the open file descriptor limit (i.e., ulimit -n) should be set to at least the maximum job size in procs + 20. Further, there may be value in setting the fs.file-max in sysctl.conf to a high value, such as:
/etc/sysctl.conf:
fs.file-max = 65536
Related topics
© 2012 Adaptive Computing