Moab Adaptive Computing Suite Administrator's Guide 5.4

1.0 Installation

The installation process for a Moab system is straightforward, but it is accomplished in two separate steps: installing Moab itself and installing the resource managers.

1.1 Moab

In most cases, Moab is distributed as a binary tarball. Builds are readily available for all major architectures and operating systems. These packages are available to those with a valid full or evaluation license.

> tar xvzf moab-5.1.0-i386-libtorque-p2.tar.gz
> ./configure
> make
> make install

Example 1 shows the commands for a basic command-line installation of Moab. In this case, the install package for Moab 5.1.0 (patch 2) must be in the current directory.

drwxr-xr-x 2 root root 4096 2008-04-02 12:24 bin
drwxr-xr-x 2 root root 4096 2008-04-02 12:24 include
drwxr-xr-x 2 root root 4096 2008-04-02 12:26 log
-rw-r--r-- 1 root root 1354 2008-04-02 12:26 moab.cfg
drwxrwxrwt 2 root root 4096 2008-04-02 12:42 spool
drwxr-xr-x 2 root root 4096 2008-04-02 12:26 stats
drwxr-xr-x 2 root root 4096 2008-04-02 12:24 traces

By default, Moab installs to the /opt/moab directory. Example 2 shows a sample ls -l output in /opt/moab. In addition, the default locations for the Moab executables are /usr/local/bin (client) and /usr/local/sbin (server).

As can be seen in Example 2, the binary installation creates a default moab.cfg file. This file contains the global configuration for Moab and is loaded each time Moab starts. The definitions for users, groups, nodes, resource managers, qualities of service, and standing reservations are placed in this file. While there are many settings for Moab, only a few are discussed here. The default moab.cfg provided with a binary installation is very simple. The installation process defines several important default values, but the majority of the configuration must be done by the administrator, either by editing the file directly or by using one of the provided administrative tools such as Moab Cluster Manager (MCM).

SCHEDCFG[Moab]    SERVER=allbe:42559
ADMINCFG[1]       USERS=root,root
RMCFG[base]       TYPE=PBS

Example 3 shows the default moab.cfg file with the comments removed. The first line defines a new scheduler named Moab. In this case, it is located on a host named allbe and listens on port 42559 for client commands. These values are added by the installation process and in most cases should be left unchanged.

The second line, however, requires some editing by the administrator. This line defines which users on the system have Level 1 administrative rights: users with global access to information and unlimited control over scheduling operations in Moab. There are five default administrative levels defined by Moab, each of which is fully customizable. In Example 3, this line needs to be updated: the second root entry should be changed to the username(s) of the system administrator(s). The first root entry must remain, because Moab needs to run as root in order to submit jobs to the resource managers as the original job owner.
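
For example, if the administrator's login were jsmith (a hypothetical username used purely for illustration), the edited line would read:

ADMINCFG[1]       USERS=root,jsmith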

The final line in this example is the configuration for the default resource manager. This particular binary distribution is for the TORQUE resource manager. Because TORQUE follows the PBS style of job handling, the resource manager is given a type of PBS. To differentiate it from other resource managers that may be added in the future, it is also given the name base. Resource managers will be discussed in the next section.

This constitutes the basic installation of Moab. Many additional parameters can be added to the moab.cfg file in order to fully adapt Moab to the needs of your particular data center. A more detailed installation guide is available.
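
As a purely illustrative sketch (the parameter values below are arbitrary examples rather than recommendations, and the staff group is hypothetical), a slightly expanded moab.cfg might add a per-user job limit, a group priority, and a higher logging level:

USERCFG[DEFAULT]  MAXJOB=100
GROUPCFG[staff]   PRIORITY=1000
LOGLEVEL          3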

1.2 Resource Managers

The job of Moab is to schedule resources. In fact, Moab views the world as a vast collection of resources that can be scheduled for different purposes. It is not, however, responsible for the direct manipulation of these resources. Instead, Moab relies on resource managers to handle those details. Moab makes decisions and sends the necessary commands to the resource managers, which execute the commands and return state information back to Moab. This decoupling of Moab from the actual resources allows Moab to support all possible resource types: it needs no specific knowledge about the resources themselves, only knowledge of how to communicate with the resource manager.

Moab natively supports a wide range of resource managers. For several of these, including TORQUE, LSF, and PBS, Moab interacts directly with the resource manager's API; a specific binary build is required to take advantage of these API calls. For other resource managers, Moab supports a generic interface known as the Native Interface. This allows interfaces to be built for any given type of resource manager, including those developed in-house. Cluster Resources supplies a large number of pre-built interfaces for the most common resource managers, provides information on building custom interfaces, and offers contract services for specialized development.
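
As a rough sketch of a Native Interface definition (the resource manager name and script path are hypothetical, and the exact set of interface URLs needed depends on the site's scripts), such a resource manager is declared along these lines:

RMCFG[storage]    TYPE=NATIVE
RMCFG[storage]    CLUSTERQUERYURL=exec:///opt/moab/tools/storage.query.pl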

The setup of each individual resource manager is beyond the scope of this document. However, most resource managers come with ample instructions and/or wizards to aid in their installation. TORQUE, an open-source resource manager maintained under the auspices of Cluster Resources, has a documentation wiki that includes instructions on installing and testing TORQUE. Please note that special SSH or NFS configuration may also be required in order to get data staging to work correctly.

Once the resource manager(s) are installed and configured, the moab.cfg file will need to be updated if new or different resource managers have been added. Valid resource manager types include: LL, LSF, PBS, SGE, SSS and WIKI. General information on resource managers is available.
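
For instance, if the cluster ran Sun Grid Engine instead of TORQUE, the resource manager line from Example 3 would be changed as shown below (keeping in mind that a binary build matching the resource manager may be required, as described above):

RMCFG[base]       TYPE=SGE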

1.3 Checking the Installation

Once Moab and the resource managers have been installed, there are several steps that should be followed to check the installation.

  1. Start the Resource Manager — See resource manager's documentation
  2. Start Moab — Run Moab from the command line as root (see the example after this list)
  3. Run Resource Manager Tests — See resource manager's documentation
  4. Check Moab Install — See the following
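
For step 2, starting Moab from the command line usually amounts to running the server executable as root; the path below assumes the default install location mentioned earlier:

> /usr/local/sbin/moab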

Checking the Moab installation is a fairly straightforward process. The first test is to run the program showq, which displays the Moab queue information. Example 4 shows a sample output from showq. In this case, the system shown is a single 64-processor SMP node, and there are currently no jobs running or queued, so no processors or nodes are active.

active jobs------------------------
JOBID              USERNAME      STATE  PROC   REMAINING            STARTTIME


0 active jobs              0 of 64 processors in use by local jobs (0.00%)
                            0 of 1 nodes active      (0.00%)

eligible jobs----------------------
JOBID              USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 eligible jobs

blocked jobs-----------------------
JOBID              USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 blocked jobs

Total jobs:  0

The important thing to look for at this point is the total number of processors and nodes. If either the total number of processors or nodes is 0, there is a problem. Generally, this would be caused by a communication problem between Moab and the resource manager, assuming the resource manager is configured correctly and is actively communicating with each of the nodes for which it is responsible.
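
For a TORQUE-based system such as the one in this example, a quick way to confirm that the resource manager itself can see its nodes (assuming the TORQUE client commands are installed) is to run pbsnodes on the server host:

> pbsnodes -a

If pbsnodes reports the expected nodes but showq still shows zero processors, the problem most likely lies in the link between Moab and the resource manager, which can be examined as described next.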

The current state of the communication links between Moab and the resource managers can be viewed using the mdiag -R -v command. This gives a verbose listing of the resource managers configured in Moab, including current state, statistics, and any error messages.

diagnosing resource managers

RM[base]  State: Active
  Type:               PBS  ResourceType: COMPUTE
  Version:            '2.2.0'
  Objects Reported:   Nodes=1 (64 procs)  Jobs=0
  Flags:              executionServer,noTaskOrdering
  Partition:          base
  Event Management:   EPORT=15004  (last event: 00:01:46)
  Note:  SSS protocol enabled
  Submit Command:     /usr/local/bin/qsub
  DefaultClass:       batch
  Total Jobs Started: 3
  RM Performance:     AvgTime=0.00s  MaxTime=1.45s  (8140 samples)
  RM Languages:       PBS
  RM Sub-Languages:   -

RM[internal]  State: ---
  Type:               SSS
  Max Failure Per Iteration:          0
  JobCounter:                         5
  Version:            'SSS4.0'
  Flags:              localQueue
  Event Management:   (event interface disabled)
  RM Performance:     AvgTime=0.00s  MaxTime=0.00s  (5418 samples)
  RM Languages:       -
  RM Sub-Languages:   -


Note:  use 'mrmctl -f messages ' to clear stats/failures

In Example 5, two different resource managers are listed: base and internal. The base resource manager is the TORQUE resource manager that was defined in Example 3. It is currently reporting a healthy state with no communication problems. The internal resource manager is used internally by Moab for a number of procedures. If there were any problems with either resource manager, messages would be displayed here. Where possible, error messages include suggested fixes for the noted problem.

Another command that can be very helpful when testing Moab is mdiag -C, which does a format check on the moab.cfg file to ensure that each line has a recognizable format. Example 6 shows sample output from this command.

INFO:  line #15 is valid:  'SCHEDCFG[Moab]  SERVER=allbe:42559'
INFO:  line #16 is valid:  'ADMINCFG[1]     USERS=root,guest2'
INFO:  line #23 is valid:  'RMCFG[base]     TYPE=PBS'

The state of individual nodes can be checked using the mdiag -n command. Verbose reporting of the same information is available through mdiag -n -v.

compute node summary
Name                    State   Procs      Memory         Opsys

allbe                    Idle   64:64     1010:1010       linux
-----                     ---   64:64     1010:1010       -----

Total Nodes: 1  (Active: 0  Idle: 1  Down: 0)

In this case (Example 7), there is only a single compute node, allbe. This node has 64 processors and is currently idle, meaning it is ready to run jobs but is not currently doing anything. If a job or jobs had been running on the node, the node would be listed as active, and the Procs and Memory columns would indicate not only the totals configured but also the amounts currently available.

The next test is to run a simple job using Moab and the configured resource manager. This can be done either through the command line or through an administrative tool such as MCM. This document shows how it is done using the command line.

> echo "sleep 60" | msub

The command in Example 8 submits a job that simply sleeps for 60 seconds and returns. While this may appear to have little or no point, it allows the job submission procedure to be tested. Because the root user is not allowed to submit jobs, this command must be run as a different user. When the command runs successfully, it returns the Job ID of the new job. The job should also appear in showq as running (assuming the queue was empty), as seen in Example 9.

active jobs------------------------
JOBID              USERNAME      STATE  PROC   REMAINING            STARTTIME

40277                 user1    Running     1    00:59:59  Tue Apr  3 11:23:33

1 active job               1 of 64 processors in use by local jobs (1.56%)
                            1 of 1 nodes active      (100.00%)
eligible jobs----------------------
JOBID              USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 eligible jobs

blocked jobs-----------------------
JOBID              USERNAME      STATE  PROC     WCLIMIT            QUEUETIME


0 blocked jobs

Total jobs:  1

If showq indicated a problem with the job, such as the job being blocked, additional information about the job can be obtained using checkjob job_id. Example 10 shows some sample output of this command. More verbose information can be gathered using checkjob -v job_id.

job 40277

AName: STDIN
State: Running
Creds:  user:user1  group:user1  class:batch
WallTime:   00:01:32 of 1:00:00
SubmitTime: Tue Apr  3 11:23:32
  (Time Queued  Total: 00:00:01  Eligible: 00:00:01)

StartTime: Tue Apr  3 11:23:33
Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: base
Memory >= 0  Disk >= 0  Swap >= 0
Opsys:   ---  Arch: ---  Features: ---
NodesRequested:  1

Allocated Nodes:
[allbe:1]


IWD:            /opt/moab
Executable:     /opt/moab/spool/moab.job.A6wPSf

StartCount:     1
Partition Mask: [base]
Flags:          RESTARTABLE,GLOBALQUEUE
Attr:           checkpoint
StartPriority:  1
Reservation '40277' (-00:01:47 -> 00:58:13  Duration: 1:00:00)
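
Beyond this simple sleep test, msub accepts PBS-style submission options, so a more realistic job with an explicit resource request can also be tried. As a hypothetical example (the script name test.sh is assumed):

> msub -l nodes=1,walltime=00:10:00 test.sh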

If jobs can be submitted and run properly, the system is configured for basic use. As the transition to a Moab-centric system continues, additional items will be placed in the moab.cfg file. After each change to the moab.cfg file, it is necessary to restart Moab for the changes to take effect. This simple process is shown in Example 11.

> mschedctl -R

Other commands administrators will find useful are shown below: mschedctl -k shuts down the Moab server, mdiag -S reports diagnostic information about the scheduler itself, and showq -c lists recently completed jobs.

> mschedctl -k
> mdiag -S
> showq -c