TORQUE Resource Manager

Appendix F: Large Cluster Considerations

F.1 Communication Overview

TORQUE has enhanced much of the communication found in the original OpenPBS project. This has resulted in a number of key advantages:

  • Support for larger clusters
  • Support for more jobs
  • Support for larger jobs
  • Support for larger messages

In most cases, these enhancements apply to all systems and no tuning is required. However, some changes have been made configurable to allow site-specific modification; the configurable communication parameters are described in the sections that follow.

F.2 Scalability Guidelines

In very large clusters (in excess of 1,000 nodes), it may be advisable to tune a number of communication-layer timeouts as well. By default, PBS MOM daemons time out inter-MOM messages after 60 seconds. In TORQUE 1.1.0p5 and later, this can be adjusted by setting the timeout parameter in the mom_priv/config file. If 15059 errors (cannot receive message from sisters) appear in the MOM logs, it may be necessary to increase this value.
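For example, the inter-MOM timeout could be doubled with an entry such as the following in mom_priv/config (the $timeout directive name should be verified against the pbs_mom documentation for your TORQUE version, and the 120-second value is purely illustrative):

```
# mom_priv/config: allow inter-MOM messages up to 120 seconds
$timeout 120
```

The pbs_mom daemon must be restarted for the change to take effect.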

Client-to-PBS server and MOM-to-PBS server communication timeouts are specified via the tcp_timeout server option using the qmgr command.
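For example, to allow slow clients and MOMs up to five minutes, the server attribute could be set as follows (the 300-second value is purely illustrative):

```shell
# Raise the client/MOM-to-server communication timeout to 300 seconds:
qmgr -c "set server tcp_timeout = 300"

# Confirm the new value:
qmgr -c "list server tcp_timeout"
```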

Note

On some systems, ulimit values may prevent large jobs from running. In particular, the open file descriptor limit (i.e., ulimit -n) should be set to at least the maximum job size in procs plus 20. Further, there may be value in setting fs.file-max in /etc/sysctl.conf to a high value, such as:

/etc/sysctl.conf:
fs.file-max = 65536
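As a worked (hypothetical) example, a cluster expected to run jobs of up to 10,000 processes would need a descriptor limit of at least 10,020:

```shell
#!/bin/sh
# Hypothetical sizing: a job of up to 10,000 processes needs at least
# (maximum job size in procs + 20) open file descriptors.
MAXJOBPROCS=10000
REQUIRED=$((MAXJOBPROCS + 20))
echo "required descriptor limit: $REQUIRED"

# Show the current soft limit; raise it with 'ulimit -n <value>' in the
# daemon startup scripts if it falls below the required value.
ulimit -n
```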

F.3 End User Command Caching

F.3.1 qstat

In a large system, users may place excessive load on the system through manual or automated use of the resource manager's end user client commands. A simple way to reduce this load is to wrap the client commands in scripts that cache their output. The example script below caches the output of the command 'qstat -f' for 60 seconds and reports the cached information to end users.

#!/bin/sh

# USAGE: qstat [args]
#
# Wrapper that caches the output of 'qstat -f' for $CACHETIME seconds
# and serves the cached copy to end users.

CMDPATH=/usr/local/bin/qstat
CACHETIME=60
TMPFILE=/tmp/qstat.f.tmp

# Anything other than a bare 'qstat -f' is passed straight through.
if [ "$1" != "-f" ] ; then
  #echo "direct check (arg1=$1)"
  $CMDPATH "$@"
  exit $?
fi

if [ -n "$2" ] ; then
  #echo "direct check (arg2=$2)"
  $CMDPATH "$@"
  exit $?
fi

# Determine the age of the cache file (0 means no cache exists yet).
if [ -f "$TMPFILE" ] ; then
  TMPFILEMTIME=$(stat -c %Z "$TMPFILE")
else
  TMPFILEMTIME=0
fi

NOW=$(date +%s)

AGE=$((NOW - TMPFILEMTIME))

#echo AGE=$AGE

# If the cache is stale but another process is already refreshing it
# (indicated by the presence of $TMPFILE.1), wait briefly and re-check.
for i in 1 2 3 ; do
  if [ "$AGE" -gt "$CACHETIME" ] ; then
    #echo "cache is stale"

    if [ -f "$TMPFILE.1" ] ; then
      #echo someone else is updating cache

      sleep 5

      NOW=$(date +%s)

      if [ -f "$TMPFILE" ] ; then
        TMPFILEMTIME=$(stat -c %Z "$TMPFILE")
      else
        TMPFILEMTIME=0
      fi

      AGE=$((NOW - TMPFILEMTIME))
    else
      break
    fi
  fi
done

# If a partial update file is still present, assume the updater hung.
if [ -f "$TMPFILE.1" ] ; then
  #echo someone else is hung

  rm -f "$TMPFILE.1"
fi

# Refresh the cache if it is still stale, writing to a temporary file
# and renaming it so that readers never see a partially written cache.
if [ "$AGE" -gt "$CACHETIME" ] ; then
  #echo updating cache

  $CMDPATH -f > "$TMPFILE.1"

  mv "$TMPFILE.1" "$TMPFILE"
fi

#echo "using cache"

cat "$TMPFILE"

exit 0

The above script can easily be modified to cache any command and any combination of arguments by changing one or more of the following attributes:

  • script name
  • value of $CMDPATH
  • value of $CACHETIME
  • value of $TMPFILE

For example, to cache the command pbsnodes -a, make the following changes:

  • move original pbsnodes command to pbsnodes.orig
  • save the script as 'pbsnodes'
  • change $CMDPATH to pbsnodes.orig
  • change $TMPFILE to /tmp/pbsnodes.a.tmp

F.5 Other Considerations

F.5.1 job_stat_rate

In a large system, there may be many users, many jobs, and many requests for information. To speed up response time for users and for programs using the API, the job_stat_rate server parameter can be used to control how often the pbs_server daemon queries MOMs for job information. By increasing this value, a system avoids constantly querying job information and causing other commands to block.
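For example, a site could limit MOM queries to at most once every five minutes (the value is illustrative and should be balanced against how stale job information may become):

```shell
qmgr -c "set server job_stat_rate = 300"
```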

F.5.2 poll_jobs

The poll_jobs parameter allows a site to configure how the pbs_server daemon will poll for job information. When set to TRUE, the pbs_server will poll job information in the background and not block on user requests. When set to FALSE, the pbs_server may block on user requests when it has stale job information data. Large clusters should set this parameter to TRUE.
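On a large cluster this is a one-line change via qmgr:

```shell
# Poll job information in the background instead of blocking user requests:
qmgr -c "set server poll_jobs = TRUE"
```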

F.5.3 Internal Settings

On large, slow, and/or heavily loaded systems, it may be desirable to increase the pbs_tcp_timeout setting used by the pbs_mom daemon in MOM-to-MOM communication. For client-server communication, this attribute can be set using the qmgr command; for MOM-to-MOM communication, however, it defaults to 20 seconds and can only be changed with a source code modification. To make this change, edit the $TORQUEBUILDDIR/src/lib/Libifl/tcp_dis.c file and set pbs_tcp_timeout to the maximum number of seconds a MOM-to-MOM request should be allowed to take.

Note

A system may be heavily loaded if it reports multiple 'End of File from addr' or 'Premature end of message' failures in the pbs_mom or pbs_server logs.

F.5.4 Scheduler Settings

If using Moab, there are a number of parameters which can be set on the scheduler which may improve TORQUE performance. In an environment containing a large number of short-running jobs, the JOBAGGREGATIONTIME parameter (see Appendix F of the Moab Workload Manager Administrator's Guide) can be set to reduce the number of workload and resource queries performed by the scheduler when an event-based interface is enabled. If the pbs_server daemon is heavily loaded and PBS API timeout errors (i.e., 'Premature end of message') are reported within the scheduler, the TIMEOUT attribute of the RMCFG parameter (see Appendix F of the Moab Workload Manager Administrator's Guide) may be set to a value between 30 and 90 seconds.

F.5.5 File System

TORQUE can be configured to disable file system blocking until data is physically written to the disk by using the --disable-filesync argument with configure. While having filesync enabled is more reliable, it may lead to server delays at sites with either a large number of nodes or a large number of jobs. Filesync is enabled by default.
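A rebuild with file syncing disabled might look like the following sketch (the build directory and install steps are assumed to match your site's existing TORQUE build procedure):

```shell
cd $TORQUEBUILDDIR
./configure --disable-filesync
make
make install
```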

F.5.6 Network ARP cache

For networks with more than 512 nodes it is mandatory to increase the kernel’s internal ARP cache size. For a network of ~1000 nodes, we use these values in /etc/sysctl.conf on all nodes and servers:

/etc/sysctl.conf
# Don't allow the arp table to become bigger than this
net.ipv4.neigh.default.gc_thresh3 = 4096
# Tell the gc when to become aggressive with arp table cleaning.
# Adjust this based on size of the LAN.
net.ipv4.neigh.default.gc_thresh2 = 2048
# Adjust where the gc will leave arp table alone
net.ipv4.neigh.default.gc_thresh1 = 1024
# Adjust to arp table gc to clean-up more often
net.ipv4.neigh.default.gc_interval = 3600
# ARP cache entry timeout
net.ipv4.neigh.default.gc_stale_time = 3600

Use sysctl -p to reload this file.

The ARP cache size can presumably be adjusted in a similar way on other UNIX systems.

An alternative approach is to maintain a static /etc/ethers file containing every hostname and MAC address and load it with arp -f /etc/ethers. However, this approach is cumbersome to maintain whenever nodes receive new MAC addresses (due to repairs, for example).
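A minimal /etc/ethers might look like this (the MAC addresses and hostnames are illustrative):

```
# /etc/ethers: one "MAC-address hostname" pair per line
00:25:90:aa:bb:01 node001
00:25:90:aa:bb:02 node002
```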