Other considerations

job_stat_rate

In a large system, there may be many users, many jobs, and many requests for information. To speed up response time for users and for programs using the API, the job_stat_rate parameter can be used to adjust how often the pbs_server daemon queries MOMs for job information. Increasing this value keeps the server from constantly querying job information and causing other commands to block.

poll_jobs

The poll_jobs parameter allows a site to configure how the pbs_server daemon polls for job information. When set to TRUE, pbs_server polls job information in the background and does not block on user requests. When set to FALSE, pbs_server may block on user requests when its job information is stale. Large clusters should set this parameter to TRUE.
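As a minimal sketch, the two server parameters above could be set through qmgr directives. The value 120 for job_stat_rate is an illustrative assumption, not a recommendation, and the helper name tuning_directives is hypothetical. The function only prints the directives so they can be reviewed before being applied.

```shell
#!/bin/sh
# Illustrative value only (assumption); choose one suited to the site.
JOB_STAT_RATE=120

# Hypothetical helper: print the qmgr directives for the two parameters.
tuning_directives() {
    printf 'set server job_stat_rate = %s\n' "$JOB_STAT_RATE"
    printf 'set server poll_jobs = TRUE\n'
}

# Print for review. To apply on the pbs_server host, pipe into qmgr:
#   tuning_directives | qmgr
tuning_directives
```

Printing first and piping into qmgr afterwards keeps the change easy to inspect and to record in site documentation.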

Internal settings

On large, slow, and/or heavily loaded systems, it may be desirable to increase the pbs_tcp_timeout setting used by the pbs_mom daemon in MOM-to-MOM communication. This setting defaults to 20 seconds. For client-server communication, the timeout can be set using the qmgr command. For MOM-to-MOM communication, changing it requires a source code modification and a rebuild: edit the $TORQUEBUILDDIR/src/lib/Libifl/tcp_dis.c file and set pbs_tcp_timeout to the desired maximum number of seconds allowed for a MOM-to-MOM request to be serviced.

A system may be heavily loaded if it reports multiple 'End of File from addr' or 'Premature end of message' failures in the pbs_mom or pbs_server logs.
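One rough way to check for this symptom is to count the log lines that contain either message. The sketch below assumes the logs are plain text files; the function name count_load_errors is hypothetical, and the log path in the usage comment is a common default that may differ on a given installation.

```shell
#!/bin/sh
# Hypothetical helper: count lines in the given log files that contain
# either of the two failure messages named above.
count_load_errors() {
    cat -- "$@" 2>/dev/null |
        grep -c -F -e 'End of File from addr' \
                   -e 'Premature end of message' || true
}

# Example usage (path is an assumption; adjust to the local TORQUE home):
#   count_load_errors /var/spool/torque/server_logs/*
```

A count that keeps growing from day to day suggests the timeout settings described above are worth revisiting.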

Scheduler settings

If using Moab, a number of scheduler parameters can be set that may improve TORQUE performance. In an environment containing a large number of short-running jobs, the JOBAGGREGATIONTIME parameter (see the "Parameters" section of the Moab Workload Manager Administrator Guide) can be set to reduce the number of workload and resource queries performed by the scheduler when an event-based interface is enabled. If the pbs_server daemon is heavily loaded and PBS API timeout errors (for example, "Premature end of message") are reported within the scheduler, the "TIMEOUT" attribute of the RMCFG parameter may be set to a value between 30 and 90 seconds.
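The TIMEOUT suggestion can be sketched as a small helper that emits a moab.cfg line only when the value falls in the 30 to 90 second range mentioned above. The resource manager name torque and the function name moab_rm_timeout_line are assumptions for illustration; the name inside RMCFG[...] must match the one already used in the site's moab.cfg.

```shell
#!/bin/sh
# Assumed resource manager name; must match the existing RMCFG[...] entry.
RM_NAME=torque

# Hypothetical helper: print an RMCFG TIMEOUT line for moab.cfg, or fail
# if the value is outside the 30-90 second range suggested above.
moab_rm_timeout_line() {
    t=$1
    if [ "$t" -ge 30 ] && [ "$t" -le 90 ]; then
        printf 'RMCFG[%s] TIMEOUT=%s\n' "$RM_NAME" "$t"
    else
        echo "timeout '$t' is outside the suggested 30-90 second range" >&2
        return 1
    fi
}

# Example usage: moab_rm_timeout_line 60
```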

File system

TORQUE can be configured not to block on file system writes until data is physically written to disk. To do so, pass the --disable-filesync argument to configure. While having filesync enabled is more reliable, it may lead to server delays for sites with either a large number of nodes or a large number of jobs. Filesync is enabled by default.

Network ARP cache

For networks with more than 512 nodes it is mandatory to increase the kernel's internal ARP cache size. For a network of ~1000 nodes, we use these values in /etc/sysctl.conf on all nodes and servers:

/etc/sysctl.conf

# Don't allow the arp table to become bigger than this
net.ipv4.neigh.default.gc_thresh3 = 4096

# Tell the gc when to become aggressive with arp table cleaning.
# Adjust this based on size of the LAN.
net.ipv4.neigh.default.gc_thresh2 = 2048

# Adjust where the gc will leave the arp table alone
net.ipv4.neigh.default.gc_thresh1 = 1024

# Adjust how often the arp table gc runs (in seconds)
net.ipv4.neigh.default.gc_interval = 3600

# ARP cache entry timeout
net.ipv4.neigh.default.gc_stale_time = 3600

Use sysctl -p to reload this file.
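To judge whether the thresholds are large enough, the current number of ARP entries can be compared with gc_thresh3. The following is a sketch for Linux only; the function name arp_usage_percent is hypothetical, and the live check runs only if the ip command and the /proc entry are available on the host.

```shell
#!/bin/sh
# Hypothetical helper: integer percentage of the ARP table limit in use.
# $1 = current number of entries, $2 = hard limit (gc_thresh3).
arp_usage_percent() {
    echo $(( $1 * 100 / $2 ))
}

# Live check, skipped when the required tool or /proc file is missing.
THRESH_FILE=/proc/sys/net/ipv4/neigh/default/gc_thresh3
if command -v ip >/dev/null 2>&1 && [ -r "$THRESH_FILE" ]; then
    entries=$(ip -4 neigh show 2>/dev/null | wc -l)
    limit=$(cat "$THRESH_FILE")
    echo "ARP table at $(arp_usage_percent "$entries" "$limit")% of gc_thresh3"
fi
```

With the values above, 1024 entries against a gc_thresh3 of 4096 would be reported as 25%.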

The ARP cache size on other Unixes can presumably be modified in a similar way.

An alternative approach is to keep a static /etc/ethers file containing all hostnames and MAC addresses and load it with arp -f /etc/ethers. However, maintaining this file is quite cumbersome when nodes get new MAC addresses (due to repairs, for example).