The following procedure demonstrates how to install Moab and TORQUE in high availability (HA) mode.
To install Moab and TORQUE in HA mode
Stop the firewall and prevent it from starting at boot:
> service iptables stop
> chkconfig iptables off
If you are unable to stop the firewall due to infrastructure restrictions, open the ports used for communication by:
TORQUE (pbs_server and pbs_mom)
Moab (the scheduler port; 42559 in the configuration below)
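As a sketch, assuming TORQUE's default pbs_server port (15001) and Moab's default port (42559), rules along the following lines open the required ports with iptables; adjust the port numbers to match your site's configuration:
> iptables -I INPUT -p tcp --dport 15001 -j ACCEPT
> iptables -I INPUT -p tcp --dport 42559 -j ACCEPT
> service iptables save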
Disable SELinux by editing its configuration file and setting:
> vi /etc/sysconfig/selinux
SELINUX=disabled
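The SELINUX setting takes effect at the next reboot. To stop enforcement immediately for the current session, you can also run:
> setenforce 0
> getenforce
Permissive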
Configure the following environment variables on both servers:
# Moab
export MOABHOMEDIR=/opt/moab
# TORQUE
export TORQUEHOME=/var/spool/torque
# Library Path
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${MOABHOMEDIR}/lib:${TORQUEHOME}/lib
# Update system paths
export PATH=${MOABHOMEDIR}/sbin:${MOABHOMEDIR}/bin:${TORQUEHOME}/bin:${TORQUEHOME}/sbin:${PATH}
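These exports can be placed in a file such as /etc/profile.d/moab.sh (a hypothetical path) so that they apply to every login shell. A quick sanity check after sourcing the file:
> source /etc/profile.d/moab.sh
> echo $MOABHOMEDIR
/opt/moab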
On the file server, create the directories that will be shared with both servers:
fileServer# mkdir -m 0755 /var/spool/torque
fileServer# mkdir -m 0750 /var/spool/torque/server_priv
fileServer# mkdir -m 0755 /opt/moab
Add the following lines to /etc/exports on fileServer:
/opt/moab 192.168.0.0/255.255.255.0(rw,sync,no_root_squash)
/var/spool/torque/server_priv 192.168.0.0/255.255.255.0(rw,sync,no_root_squash)
Re-export the file systems:
fileServer# exportfs -r
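To confirm that both directories are exported as intended, you can query the export list; the output should resemble the following:
fileServer# showmount -e localhost
Export list for localhost:
/opt/moab 192.168.0.0/255.255.255.0
/var/spool/torque/server_priv 192.168.0.0/255.255.255.0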
If the NFS daemons are not already running on fileServer, start them.
> systemctl restart rpcbind.service
> systemctl start nfs-server.service
> systemctl start nfs-lock.service
> systemctl start nfs-idmap.service
On server1, create the corresponding mount points:
server1# mkdir -p /opt/moab
server1# mkdir -p /var/spool/torque/server_priv
Repeat this process for server2.
Add the following entries to /etc/fstab on server1:
fileServer:/opt/moab /opt/moab nfs rsize=8192,wsize=8192,timeo=14,intr
fileServer:/var/spool/torque/server_priv /var/spool/torque/server_priv nfs rsize=8192,wsize=8192,timeo=14,intr
Repeat this step for server2.
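Mount the shares and confirm that they are served from fileServer before continuing; for example:
server1# mount -a
server1# df -h /opt/moab /var/spool/torque/server_priv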
server1# wget http://github.com/adaptivecomputing/torque/branches/4.1.4/torque-4.1.4.tar.gz
server1# tar -xvzf torque-4.1.4.tar.gz
server1# cd torque-4.1.4
server1# ./configure --enable-high-availability --with-tcp-retry-limit=3
server1# make
server1# make install
server1# make packages
If the installation directory is not shared, repeat the TORQUE download and installation steps above on server2.
server1# /etc/init.d/trqauthd start
List the host names of all nodes that run pbs_server in the torque/server_name file, and include the same host names in the torque/server_name file of each MOM node. The file contains a comma-delimited list of host names:
server1,server2
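For example, assuming TORQUEHOME=/var/spool/torque as set earlier, the file can be created as follows (repeat on server2 and on each MOM node, since /var/spool/torque itself is not shared):
server1# echo "server1,server2" > /var/spool/torque/server_name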
server1# pbs_server -t create
server1# qmgr -c "set server scheduling=true"
server1# qmgr -c "create queue batch queue_type=execution"
server1# qmgr -c "set queue batch started=true"
server1# qmgr -c "set queue batch enabled=true"
server1# qmgr -c "set queue batch resources_default.nodes=1"
server1# qmgr -c "set queue batch resources_default.walltime=3600"
server1# qmgr -c "set server default_queue=batch"
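To confirm that the queue was created and is accepting jobs, you can list the configured queues; the batch queue should appear with state E R (enabled and started):
server1# qstat -q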
Because server_priv/* is a shared drive, you do not need to repeat this step on server2.
server1# qmgr -c "set server managers += root@server1"
server1# qmgr -c "set server managers += root@server2"
server1# qmgr -c "set server operators += root@server1"
server1# qmgr -c "set server operators += root@server2"
Because server_priv/* is a shared drive, you do not need to repeat this step on server2.
You must update TORQUE's lock file mechanism so that the servers can determine which one is the primary. To do so, use the lock_file_update_time and lock_file_check_time parameters. The primary pbs_server updates the lock file at the interval specified by lock_file_update_time (default: 3 seconds). All backup pbs_servers check the lock file at the interval specified by lock_file_check_time (default: 9 seconds). The lock_file_update_time must be less than the lock_file_check_time; otherwise, a backup server could mistake a live primary for a failed one. When a failure occurs, the backup pbs_server takes up to lock_file_check_time seconds to take over. For example, with the values below, failover completes within about 5 seconds.
server1# qmgr -c "set server lock_file_check_time=5"
server1# qmgr -c "set server lock_file_update_time=3"
Because server_priv/* is a shared drive, you do not need to repeat this step on server2.
server1# qmgr -c "set server acl_hosts += server1"
server1# qmgr -c "set server acl_hosts += server2"
Because server_priv/* is a shared drive, you do not need to repeat this step on server2.
server1# qterm
server1# pbs_server --ha -l server2:<port>
server2# pbs_server --ha -l server1:<port>
Specify the Moab host and port only if Moab HA is configured on a remote server; otherwise, run pbs_server --ha with no -l option. For example, if Moab is running on server1 and you wish to start TORQUE in HA mode, you only need to make TORQUE aware of server2:
> pbs_server --ha -l server2:<port>
server1# qmgr -c "p s"
server2# qmgr -c "p s"
The commands above return all settings from the active TORQUE server, from either node.
Drop one of the pbs_servers to verify that the secondary server picks up the request.
server1# qterm
server2# qmgr -c "p s"
Stop the pbs_server on server2 and restart pbs_server on server1 to verify that both nodes can handle a request from the other.
node1# torque-package-mom-linux-x86_64.sh --install
node1# torque-package-clients-linux-x86_64.sh --install
Repeat this for each compute node. Then verify that the /var/spool/torque/server_priv/nodes file lists all of your compute nodes:
node1 np=2
node2 np=2
Change the np flag to reflect the number of available processors on that node.
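To determine the processor count on a node, you can, for example, run:
node1# nproc
2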
server1# qterm; pbs_server --ha -l server2:<port>
server2# qterm; pbs_server --ha -l server1:<port>
Start a pbs_mom on each compute node:
node1# pbs_mom
node2# pbs_mom
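Once the MOMs are running, you can verify that they have registered with the active pbs_server; each node should be reported with state = free:
server1# pbsnodes -a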
server1# tar -xvzf moab-7.2.0-linux-x86_64-torque-odbc.tar.gz
Navigate to the moab directory.
server1# cd moab-7.2.0
Begin the Moab installation.
server1# ./configure
server1# make install
Edit the Moab configuration file (/opt/moab/etc/moab.cfg, on the shared drive) so that both servers use identical HA settings:
SCHEDCFG[Moab] SERVER=server1:42559
SCHEDCFG[Moab] FBSERVER=server2
SCHEDCFG[Moab] FLAGS=filelockha
SCHEDCFG[Moab] HALOCKFILE=/opt/moab/.moab_lock
ADMINCFG[1] USERS=root
TOOLSDIR /opt/moab/tools
LOGLEVEL 3
...
RMCFG[moabha] TYPE=PBS
RMCFG[moabha] SUBMITCMD=/usr/local/bin/qsub
Start Moab on both servers:
server1# moab
server2# moab
Verify that both servers respond to client commands:
server1# showq
server2# showq
Query the available MOMs via TORQUE and check their status. If everything is working correctly, the MOMs you configured in the nodes file should be reported as available.
server1# mdiag -n
server2# mdiag -n
Submit a test job to Moab from server1:
server1# echo "sleep 60" | msub
Verify that the job is running.
server1# showq
Submit a test job to Moab from server2:
server2# echo "sleep 60" | msub
Verify that the job is running.
server2# showq
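To inspect an individual job in more detail, you can pass the job ID reported by msub to checkjob; for example:
server2# checkjob <jobid>
If both servers schedule and run these jobs, Moab and TORQUE are operating correctly in HA mode.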