Installing Moab and TORQUE in high availability mode

The following procedure demonstrates how to install Moab and TORQUE in high availability (HA) mode.

To install Moab and TORQUE in HA mode

  1. Stop all firewalls or update your firewall to allow traffic from Moab and TORQUE services.

    > service iptables stop

    > chkconfig iptables off

    If you are unable to stop the firewall due to infrastructure restrictions, open the following ports:

    TORQUE

    • 15001[tcp,udp]
    • 15002[tcp,udp]
    • 15003[tcp,udp]

    Moab

    • 42559[tcp]
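
    For example, on a system that uses iptables (as in the service commands above), rules similar to the following open the listed ports. This is a sketch only; adapt it to your site's firewall policy and remember to save the rules:

    > iptables -I INPUT -p tcp -m multiport --dports 15001,15002,15003,42559 -j ACCEPT

    > iptables -I INPUT -p udp -m multiport --dports 15001,15002,15003 -j ACCEPT

    > service iptables save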
  2. Disable SELinux.

    > vi /etc/sysconfig/selinux

    SELINUX=disabled
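
    The edit above makes the change persistent across reboots. As a convenience (not a required step), you can also disable enforcement on the running system immediately and confirm the result:

    > setenforce 0

    > getenforce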

  3. Update your main ~/.bashrc profile on all servers so that you are always referencing the applications to be installed.

    # Moab
    export MOABHOMEDIR=/opt/moab

    # TORQUE
    export TORQUEHOME=/var/spool/torque

    # Library path
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${MOABHOMEDIR}/lib:${TORQUEHOME}/lib

    # Update system paths
    export PATH=${MOABHOMEDIR}/sbin:${MOABHOMEDIR}/bin:${TORQUEHOME}/bin:${TORQUEHOME}/sbin:${PATH}
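
    To pick up the new settings in your current shell and confirm they are in place, you can, for example, reload the profile and echo the variables:

    > source ~/.bashrc

    > echo $MOABHOMEDIR $TORQUEHOME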

  4. Verify that server1 and server2 are resolvable, either via DNS or via entries in the /etc/hosts file.
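
    For example, /etc/hosts entries similar to the following make the head nodes and the file server resolvable; the IP addresses shown are placeholders for your own:

    192.168.0.10    server1
    192.168.0.11    server2
    192.168.0.12    fileServer
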
  5. Configure the NFS mounts by following these steps:
    a. Create the mount point folders on fileServer.

      fileServer# mkdir -m 0755 /var/spool/torque

      fileServer# mkdir -m 0750 /var/spool/torque/server_priv

      fileServer# mkdir -m 0755 /opt/moab

    b. Update /etc/exports on fileServer. The addresses must permit access from server1 and server2 (the example below allows the entire 192.168.0.0/24 subnet).

      /opt/moab                       192.168.0.0/255.255.255.0(rw,sync,no_root_squash)

      /var/spool/torque/server_priv   192.168.0.0/255.255.255.0(rw,sync,no_root_squash)

    c. Update the list of NFS exported file systems.

      fileServer# exportfs -r
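
      To confirm that the file systems are exported as expected, you can query the export list from one of the head nodes, for example:

      server1# showmount -e fileServer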

  6. If the NFS daemons are not already running on fileServer, start them.

    > systemctl restart rpcbind.service

    > systemctl start nfs-server.service

    > systemctl start nfs-lock.service

    > systemctl start nfs-idmap.service

  7. Mount the exported file systems on server1 by following these steps:
    a. Create the directory references and mount them (example mount commands follow this step).

      server1# mkdir /opt/moab

      server1# mkdir /var/spool/torque/server_priv

      Repeat this process for server2.

    b. Update /etc/fstab on server1 to ensure that the NFS mounts are performed on startup.

      fileServer:/opt/moab /opt/moab nfs rsize=8192,wsize=8192,timeo=14,intr

      fileServer:/var/spool/torque/server_priv /var/spool/torque/server_priv nfs rsize=8192,wsize=8192,timeo=14,intr

      Repeat this step for server2.
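
      Once /etc/fstab contains the entries above, you can mount the shares immediately rather than rebooting; for example, on each server:

      server1# mount /opt/moab

      server1# mount /var/spool/torque/server_priv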

  8. Install TORQUE by following these steps:
    a. Download and extract TORQUE 4.1.4 on server1.

      server1# wget http://github.com/adaptivecomputing/torque/branches/4.1.4/torque-4.1.4.tar.gz

      server1# tar -xvzf torque-4.1.4.tar.gz

    b. Navigate to the TORQUE directory and compile TORQUE with the HA flags on server1.

      server1# ./configure --enable-high-availability --with-tcp-retry-limit=3

      server1# make

      server1# make install

      server1# make packages

    c. If the installation directory is shared on both head nodes, then run make install on server1.

      server1# make install

      If the installation directory is not shared, repeat steps 8a-b (downloading and installing TORQUE) on server2.
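
      Because server_priv is on the NFS share created earlier, it can help to confirm at this point that both head nodes see the same directory; for example:

      server1# ls -ld /var/spool/torque/server_priv

      server2# ls -ld /var/spool/torque/server_priv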

  9. Start trqauthd.

    server1# /etc/init.d/trqauthd start

  10. Configure TORQUE for HA.
    a. List the host names of all nodes that run pbs_server in the torque/server_name file. You must also include these host names in the torque/server_name file of each MOM node. The syntax of torque/server_name is a comma-delimited list of host names.

      server1,server2

    b. Create a simple queue configuration for TORQUE job queues on server1.

      server1# pbs_server -t create

      server1# qmgr -c "set server scheduling=true"

      server1# qmgr -c "create queue batch queue_type=execution"

      server1# qmgr -c "set queue batch started=true"

      server1# qmgr -c "set queue batch enabled=true"

      server1# qmgr -c "set queue batch resources_default.nodes=1"

      server1# qmgr -c "set queue batch resources_default.walltime=3600"

      server1# qmgr -c "set server default_queue=batch"

      Because server_priv/* is a shared drive, you do not need to repeat this step on server2.
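
      To confirm that the batch queue was created with the intended defaults, you can list the queues; for example:

      server1# qstat -q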

    c. Add the root users of Moab and TORQUE to the TORQUE configuration as an operator and manager.

      server1# qmgr -c "set server managers += root@server1"

      server1# qmgr -c "set server managers += root@server2"

      server1# qmgr -c "set server operators += root@server1"

      server1# qmgr -c "set server operators += root@server2"

      Because server_priv/* is a shared drive, you do not need to repeat this step on server2.

    d. Update the lock file mechanism for TORQUE in order to determine which server is the primary. To do so, use the lock_file_update_time and lock_file_check_time parameters. The primary pbs_server updates the lock file at the interval specified by lock_file_update_time (default: 3 seconds). All backup pbs_servers check the lock file at the interval specified by lock_file_check_time (default: 9 seconds). The lock_file_update_time must be less than the lock_file_check_time. When a failure occurs, the backup pbs_server can take up to the lock_file_check_time value to take over.

      server1# qmgr -c "set server lock_file_check_time=5"

      server1# qmgr -c "set server lock_file_update_time=3"

      Because server_priv/* is a shared drive, you do not need to repeat this step on server2.

    e. List the servers running pbs_server in the TORQUE acl_hosts file.

      server1# qmgr -c "set server acl_hosts += server1"

      server1# qmgr -c "set server acl_hosts += server2"

      Because server_priv/* is a shared drive, you do not need to repeat this step on server2.

    f. Restart the running pbs_server in HA mode.

      server1# qterm

      server1# pbs_server --ha -l server2:port

    g. Start the pbs_server on the secondary server.

      server2# pbs_server --ha -l server1:port

      Specify the Moab hosts and ports only if Moab HA is configured on a remote server. Otherwise, run pbs_server --ha. For example, if Moab is running on server1 and you wish to start TORQUE in HA, you only need to make TORQUE aware of server2:

      > pbs_server --ha -l server2:<port>
  11. Check the status of TORQUE in HA mode.

    server1# qmgr -c "p s"

    server2# qmgr -c "p s"

    The commands above return all settings of the active TORQUE server from either node.

    Drop one of the pbs_servers to verify that the secondary server picks up the request.

    server1# qterm

    server2# qmgr -c "p s"

    Stop the pbs_server on server2 and restart pbs_server on server1 to verify that both nodes can handle a request from the other.

  12. Install a pbs_mom on the compute nodes.
    a. Copy the install scripts to the compute nodes and install them. Navigate to the shared source directory of TORQUE and run the following:

      node1# torque-package-mom-linux-x86_64.sh --install

      node1# torque-package-clients-linux-x86_64.sh --install

      Repeat this for each compute node. Verify that the /var/spool/torque/server_name file on each compute node lists both pbs_server hosts (server1 and server2).

    b. On server1 or server2, configure the nodes file to identify all available MOMs. To do so, edit the /var/spool/torque/server_priv/nodes file.

      node1 np=2

      node2 np=2

      Change the np flag to reflect the number of available processors on that node.

    c. Recycle the pbs_servers to verify that they pick up the MOM configuration.

      server1# qterm; pbs_server --ha -l server2:port

      server2# qterm; pbs_server --ha -l server1:port

      Again, if Moab HA is configured on a remote server, run pbs_server --ha -l <moabHost1:port> -l <moabHost2:port>.

    d. Start the pbs_mom on each execution node.

      node1# pbs_mom

      node2# pbs_mom
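
      Once the MOMs are running, you can verify from either head node that they have registered with the active pbs_server; for example:

      server1# pbsnodes -a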

  13. Download Moab 7.2 (ODBC + TORQUE). Extract and install the package.

    server1# tar -xvzf moab-7.2.0-linux-x86_64-torque-odbc.tar.gz

    Navigate to the moab directory.

    server1# cd moab-7.2.0

    Begin the Moab installation.

    server1# ./configure

    server1# make install
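
    To confirm which version was installed and how it was built, you can use Moab's informational flag; for example (flag support may vary by build):

    server1# moab --about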

  14. Configure Moab by editing the /opt/moab/etc/moab.cfg file.

    SCHEDCFG[Moab] SERVER=server1:42559

    SCHEDCFG[Moab] FBSERVER=server2

    SCHEDCFG[Moab] FLAGS=filelockha

    SCHEDCFG[Moab] HALOCKFILE=/opt/moab/.moab_lock

    ADMINCFG[1] USERS=root

    TOOLSDIR /opt/moab/tools

    LOGLEVEL 3

    ...

    RMCFG[moabha] TYPE=PBS

    RMCFG[moabha] SUBMITCMD=/usr/local/bin/qsub

  15. Install your Moab license file moab.lic into the directory /opt/moab. You must have a license that permits HA by allowing Moab to run on both server1 and server2.
  16. Because /opt/moab is an NFS share mounted on both servers, and you have already set the system paths for your bash shell in ~/.bashrc (see the earlier step), you can now start your Moab instance on both servers.

    server1# moab

    server2# moab

  17. Run showq to make sure everything is working correctly.

    server1# showq

    server2# showq

    Query the available MOMs via TORQUE and check their status. If everything is working correctly, the MOMs you configured in the server_priv/nodes file should be returned as available.

    server1# mdiag -n

    server2# mdiag -n

  18. Verify that your setup is working correctly. To do so:
    a. Switch to a non-root user, make sure that user has the paths defined earlier in ~/.bashrc, and then run the following:

      server1# echo "sleep 60" | msub

      Verify that the job is running.

      server1# showq

    b. Submit jobs from the secondary server to double-check that it is working there.

      server2# echo "sleep 60" | msub

      Verify that the job is running.

      server2# showq
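
      You can also inspect an individual job in more detail; for example, replacing <jobid> with the ID returned by msub:

      server2# checkjob <jobid>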

    c. While the jobs are running, kill one of the pbs_servers and Moab on the same node to simulate a disaster. Allow about 10 seconds; you should see all traffic being handled by the remaining active server.
