Server high availability

4.2 Server high availability

You can now run TORQUE in a redundant or high availability mode. This means that there can be multiple instances of the server running and waiting to take over processing in the event that the currently running server fails.

The high availability feature is available in the 2.3 and later versions of TORQUE. TORQUE 2.4 includes several enhancements to high availability (see Enhanced high availability).

High availability enables TORQUE to continue running even if pbs_server is brought down. This is done by running multiple copies of pbs_server, each with its torque/server_priv directory mounted on a shared file system. The torque/server_name file must include the host names of all nodes that run pbs_server. All MOM nodes must also list the host names of all nodes running pbs_server in their torque/server_name file. The syntax of the torque/server_name file is a comma-delimited list of host names.

host1,host2,host3

When configuring high availability, do not use $pbsserver to specify the host names. You must use the $TORQUEHOMEDIR/server_name file.
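As a minimal sketch, the server_name file could be generated by a small shell helper so that every pbs_server host and MOM node receives an identical list. The function name write_server_name and its directory argument are illustrative assumptions, not part of TORQUE; pass your actual TORQUE home directory.

```shell
# Hypothetical helper: write a comma-delimited host list to a server_name file.
# Usage: write_server_name <torque_home_dir> <host> [<host> ...]
write_server_name() {
    torque_home=$1
    shift
    list=""
    for host in "$@"; do
        # Join host names with commas and no surrounding spaces.
        if [ -z "$list" ]; then
            list=$host
        else
            list="$list,$host"
        fi
    done
    printf '%s\n' "$list" > "$torque_home/server_name"
}
```

Calling write_server_name with a directory followed by host1 host2 host3 produces the single line shown in the example above.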

All instances of pbs_server need to be started with the --ha command line option that allows the servers to run at the same time. Only the first server to start will complete the full startup. The second server to start will block very early in the startup when it tries to lock the file torque/server_priv/server.lock. When the second server cannot obtain the lock, it will spin in a loop and wait for the lock to clear. The sleep time between checks of the lock file is one second.
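The wait-and-retry behavior described above can be illustrated with a short shell sketch. This is not how pbs_server is implemented: it uses an atomic mkdir as a stand-in for the lock on server.lock, and the function name wait_for_lock and the max_attempts argument are assumptions added so the example terminates.

```shell
# Illustrative only: pbs_server locks server.lock itself; this sketch uses
# an atomic mkdir as a stand-in for that file lock.
# Usage: wait_for_lock <lock_dir> <max_attempts>
wait_for_lock() {
    lock_dir=$1
    max_attempts=$2
    attempt=0
    while [ "$attempt" -lt "$max_attempts" ]; do
        # mkdir succeeds only if the directory does not exist yet.
        if mkdir "$lock_dir" 2>/dev/null; then
            return 0    # lock acquired: this instance continues its startup
        fi
        attempt=$((attempt + 1))
        sleep 1         # one-second pause between checks, as in the text
    done
    return 1            # gave up; a real backup server would keep waiting
}
```

The first caller acquires the lock immediately; a second caller keeps polling once per second until the lock is released or its attempts run out.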

Note that the servers can run on independent server hardware, and multiple instances of pbs_server can also run on the same machine. Before high availability support was added, this was not possible: the second instance to start would always write an error and quit when it could not obtain the lock.

Because the server_priv/serverdb file is written in a format that is not portable across hardware architectures, all machines running pbs_server in high availability mode must share the same architecture. For example, a 32-bit machine cannot read the server_priv/serverdb file of a 64-bit machine. When choosing hardware, verify that all servers are of the same architecture.

The default high availability configuration of TORQUE 2.4 is backward compatible with version 2.3, but an enhanced high availability option is available with version 2.4. The enhanced version in 2.4 fixes some shortcomings in the default configuration and is more robust. The lock file mechanism used to trigger a fail-over in TORQUE 2.3 works correctly only if the primary pbs_server is taken down gracefully, and releases the lock on the file being used as the semaphore. If the server crashes, the lock stays in place and the backup server will not start unless the lock is manually removed by the administrator. With 2.4 enhanced high availability the reliance on the file system is bypassed with a much more reliable mechanism.

In order to use enhanced high availability with TORQUE 2.4, TORQUE must be configured using the --enable-high-availability option (in addition to all other configuration options you specify).

> ./configure --prefix=/usr/var/torque --enable-high-availability

In the above example, TORQUE installs to the /usr/var/torque directory and is configured to use the high availability features.

This configuration option is not necessary in TORQUE 4.0 because high availability is enhanced high availability in TORQUE 4.0.

Once TORQUE has been compiled and installed, it is launched the same way as with TORQUE 2.3; start each instance of pbs_server with the --ha option.

The lock_file option allows the administrator to change the location of the lock file. The default location is torque/server_priv. If the lock_file option is used, the new location must be on the shared partition so all servers have access.

The lock_file_update_time and lock_file_check_time parameters are used by the servers to determine if the primary server is active. The primary pbs_server will update the lock file based on the lock_file_update_time (default value of 3 seconds). All backup pbs_servers will check the lock file as indicated by the lock_file_check_time parameter (default value of 9 seconds). The lock_file_update_time must be less than the lock_file_check_time. When a failure occurs, the backup pbs_server takes up to the lock_file_check_time value to take over.

> qmgr -c "set server lock_file_check_time=5"

In the above example, after the primary pbs_server goes down, the backup pbs_server takes up to 5 seconds to take over. It takes additional time for all MOMs to switch over to the new pbs_server.
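Because lock_file_update_time must be less than lock_file_check_time, it can help to check a planned pair of values before applying them. The following is a minimal sketch; the function name check_lock_times is an assumption and not a TORQUE command.

```shell
# Hypothetical pre-check: succeed only when update_time < check_time.
# Usage: check_lock_times <lock_file_update_time> <lock_file_check_time>
check_lock_times() {
    update_time=$1
    check_time=$2
    if [ "$update_time" -lt "$check_time" ]; then
        return 0
    fi
    echo "lock_file_update_time ($update_time) must be less than lock_file_check_time ($check_time)" >&2
    return 1
}

# Example: the default values from the text (3 and 9 seconds) pass the check.
check_lock_times 3 9 && echo "ok"
```

A site script could run the check first and call qmgr only when it succeeds, for example with the default update time of 3 and the new check time of 5 from the example above.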

The clocks on the primary and redundant servers must be synchronized for high availability to work. Use a utility such as NTP to keep the time on your servers synchronized.

When TORQUE is run with an external scheduler such as Moab, and the pbs_server is not running on the same host as Moab, pbs_server needs to know where to find the scheduler. To do this, use the following syntax (the port is required and the default is 15004):

> pbs_server --ha -l <moabhost:port>

> pbs_server --ha -l <moabhost1:port> -l <moabhost2:port>

The root user of each Moab host must be added to the operators and managers lists of the server. This enables Moab to execute root level operations in TORQUE.
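One way to prepare these list entries is sketched below. The helper only prints qmgr commands for review and executes nothing; the function name print_moab_acl_commands and the host names are illustrative assumptions, and the printed commands should be checked against your qmgr documentation before they are run.

```shell
# Hypothetical helper: print the qmgr commands that would add root on each
# Moab host to the managers and operators lists. Nothing is executed.
# Usage: print_moab_acl_commands <moabhost> [<moabhost> ...]
print_moab_acl_commands() {
    for host in "$@"; do
        printf 'qmgr -c "set server managers += root@%s"\n' "$host"
        printf 'qmgr -c "set server operators += root@%s"\n' "$host"
    done
}

# Example with two placeholder Moab hosts.
print_moab_acl_commands moabhost1 moabhost2
```

Printing the commands first keeps the step auditable; the output can then be run by hand on the active pbs_server host.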

The various commands that send messages to pbs_server usually have an option of specifying the server name on the command line, or if none is specified will use the default server name. The default server name comes either from the environment variable PBS_DEFAULT or from the file torque/server_name.

When a command is executed and no explicit server is mentioned, an attempt is made to connect to the first server name in the list of hosts from PBS_DEFAULT or torque/server_name. If this fails, the next server name is tried. If all servers in the list are unreachable, an error is returned and the command fails.
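The lookup order described above can be sketched in shell. This is an illustration of the behavior, not the client library's code: first_reachable_server is a hypothetical function, and its second argument is the name of a caller-supplied probe command (also hypothetical) that returns 0 when a host answers.

```shell
# Sketch of the lookup order: try each server name in turn and stop at the
# first one that answers.
# Usage: first_reachable_server <comma_list> <probe_cmd>
first_reachable_server() {
    server_list=$1
    probe_cmd=$2
    # Split the comma-delimited list into positional parameters.
    old_ifs=$IFS
    IFS=,
    set -- $server_list
    IFS=$old_ifs
    for host in "$@"; do
        if "$probe_cmd" "$host"; then
            printf '%s\n' "$host"
            return 0
        fi
    done
    # Every server in the list was tried and none answered.
    echo "no server in the list is reachable" >&2
    return 1
}
```

With a probe that only accepts host2, the list host1,host2,host3 yields host2; if no host answers, the function reports an error and returns a nonzero status, mirroring the command failure described above.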

Note that there is a period of time after the failure of the current server during which the new server is starting up where it is unable to process commands. The new server must read the existing configuration and job information from the disk, so the length of time that commands cannot be received varies. Commands issued during this period of time might fail due to timeouts expiring.

One aspect of this enhancement is in the construction of job names. Job names normally contain the name of the host machine where pbs_server is running. When job names are constructed, only the first name from the server specification list is used in building the job name.

The system administrator must ensure that pbs_server continues to run on the server nodes. This could be as simple as a cron job that counts the pbs_server processes in the process table and starts more if needed.
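A minimal sketch of such a cron job follows. The function names and the wanted-count argument are assumptions, pgrep is assumed to be available on the host, and the watchdog function is only defined here, not called.

```shell
# Hypothetical helper: print how many more pbs_server instances are needed.
# Usage: servers_needed <running_count> <wanted_count>
servers_needed() {
    running=$1
    wanted=$2
    if [ "$running" -lt "$wanted" ]; then
        echo $((wanted - running))
    else
        echo 0
    fi
}

# Count pbs_server processes by exact name; the count is 0 when none match.
count_pbs_servers() {
    pgrep -x pbs_server | wc -l | tr -d ' '
}

# Cron entry point (defined but not called here): start any missing
# instances with the --ha option. The wanted count defaults to 1.
watchdog() {
    missing=$(servers_needed "$(count_pbs_servers)" "${1:-1}")
    while [ "$missing" -gt 0 ]; do
        pbs_server --ha
        missing=$((missing - 1))
    done
}
```

A crontab entry on each server node could source this file and call watchdog at a short interval; the interval and the number of instances per host are site decisions.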

One consideration of this implementation is that it depends on the NFS file system also being redundant. NFS can be set up as a redundant service. See the following.