Appendix B: Multi-OS Provisioning

25.0-A Introduction

Moab can dynamically provision compute machines to requested operating systems and power off compute machines when they are not in use. Moab controls xCAT and uses its system configuration mechanisms to adapt systems to current workload requirements. Moab communicates with xCAT through the Moab Service Manager (MSM), a translation utility that sits between Moab and xCAT and acts as an aggregator and interpreter. Moab Workload Manager queries MSM, which in turn queries xCAT, about system resources, configurations, images, and metrics. After learning about these resources from MSM, Moab decides how best to maximize system utilization.

In this model, Moab gathers system information from two resource managers. The first is TORQUE, which handles the workload on the system; the second is MSM, which relays information gathered by xCAT. Using both sources, Moab adapts clusters to meet site goals.

This document assumes that xCAT has been installed and configured. It describes how to get MSM and xCAT communicating and offers troubleshooting guidance for basic integration. It then describes how to get Moab communicating with MSM and the final steps for verifying the complete software stack.

25.0-B xCAT Configuration Requirements

Observe the following xCAT configuration requirements before installing MSM:

You must have a valid Moab license file (moab.lic) with provisioning and green enabled. For information on acquiring an evaluation license, please contact info@adaptivecomputing.com.

25.0-C MSM Installation

> perl -e 'use Storable 2.18'
> perl -MXML::Simple -e 'exit'
> perl -MProc::Daemon -e 'exit'
> perl -MDBD::SQLite -e 'exit'
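
Each of the four commands above tries to load one Perl module (Storable at version 2.18 or later, plus XML::Simple, Proc::Daemon, and DBD::SQLite) and fails if the module is unavailable. The following is a minimal sketch, not part of MSM, that runs the same checks and collects the failures. The names PERL_CHECKS and missing_perl_modules are hypothetical.

```python
import subprocess

# Check commands mirroring the four perl one-liners above.
# Storable carries a minimum version; the others only need to load.
PERL_CHECKS = {
    "Storable": ["perl", "-e", "use Storable 2.18"],
    "XML::Simple": ["perl", "-MXML::Simple", "-e", "exit"],
    "Proc::Daemon": ["perl", "-MProc::Daemon", "-e", "exit"],
    "DBD::SQLite": ["perl", "-MDBD::SQLite", "-e", "exit"],
}


def missing_perl_modules(run=subprocess.run):
    """Return the names of modules whose check command exits non-zero.

    `run` is injectable so the logic can be exercised without perl installed.
    """
    missing = []
    for name, argv in PERL_CHECKS.items():
        result = run(argv, capture_output=True)
        if result.returncode != 0:
            missing.append(name)
    return missing
```

Calling missing_perl_modules() on the MSM host returns an empty list when all four modules load; any name it returns needs to be installed or upgraded before continuing.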

25.0-D Integrating MSM and xCAT

Copy the x_msm table schema to the xCAT schema directory:

> cp $MSMHOMEDIR/contrib/xcat/MSM.pm $XCATROOT/lib/perl/xCAT_schema

Restart xcatd and check that the x_msm table was created correctly:

> service xcatd restart
> tabdump x_msm

Prepare xCAT images and ensure they provision correctly (see the xCAT documentation).

Populate the x_msm table with your image definitions:

> tabedit x_msm
  #flavorname,arch,profile,os,nodeset,features,vmoslist,hvtype,hvgroupname,vmgroupname,comments,disable
  "compute","x86_64","compute","centos5.3","netboot","torque",,,,,,
  "science","x86","compute","scientific_linux","netboot","torque",,,,,,

Ensure that all xCAT group names in the x_msm table exist in the xCAT nodegroup table:

> tabedit nodegroup

Edit as necessary to resemble the following example:

  #groupname,grouptype,members,wherevals,comments,disable
  "compute",,,,,
  "esxi4",,,,,
  "esxhv",,,,,
  "esxvmmgt",,,,,

After making any necessary edits, run the following command:

> nodels compute,esxi4,esxhv,esxvmmgt
  # should complete without error; it is OK if it returns nothing
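
The cross-check between the two tables can also be done on their dumped contents. The sketch below assumes the text comes from tabdump x_msm and tabdump nodegroup and that the group names to verify live in the hvgroupname and vmgroupname columns. That column choice is an assumption, so adjust group_columns for your site. The function names are hypothetical.

```python
import csv
import io


def parse_tabdump(text):
    """Parse tabdump-style CSV whose header line starts with '#'."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].lstrip("#").split(",")
    rows = csv.reader(io.StringIO("\n".join(lines[1:])))
    return [dict(zip(header, row)) for row in rows]


def missing_groups(x_msm_text, nodegroup_text,
                   group_columns=("hvgroupname", "vmgroupname")):
    """Return group names used in x_msm but absent from nodegroup, sorted."""
    known = {row["groupname"] for row in parse_tabdump(nodegroup_text)}
    used = set()
    for row in parse_tabdump(x_msm_text):
        for col in group_columns:
            value = row.get(col, "")
            if value:  # skip empty cells
                used.add(value)
    return sorted(used - known)
```

An empty result means every referenced group already exists; any name returned should be added with tabedit nodegroup.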

25.0-E MSM Configuration

Edit $MSMHOMEDIR/msm.cfg and configure the xCAT plug-in. Below is a generic example for use with TORQUE without virtualization. See the section on configuration parameters for a complete list of parameters and descriptions.

  # MSM configuration options
  RMCFG[msm]        PORT=24603
  RMCFG[msm]        POLLINTERVAL=45
  RMCFG[msm]        LOGFILE=/opt/moab/log/msm.log
  RMCFG[msm]        LOGLEVEL=8
  RMCFG[msm]        DEFAULTNODEAPP=xcat
  
  # xCAT plugin specific options
  APPCFG[xcat]      DESCRIPTION="xCAT plugin"
  APPCFG[xcat]      MODULE=Moab::MSM::App::xCAT
  APPCFG[xcat]      LOGLEVEL=3
  APPCFG[xcat]      POLLINTERVAL=45
  APPCFG[xcat]      TIMEOUT=3600
  APPCFG[xcat]      _USEOPIDS=0
  APPCFG[xcat]      _NODERANGE=moab,esxcompute
  APPCFG[xcat]      _USESTATES=boot,netboot,install
  APPCFG[xcat]      _LIMITCLUSTERQUERY=1
  APPCFG[xcat]      _RPOWERTIMEOUT=120
  APPCFG[xcat]      _DONODESTAT=1
  APPCFG[xcat]      _REPORTNETADDR=1
  APPCFG[xcat]      _CQXCATSESSIONS=4
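
The parameter reference later in this appendix states that TIMEOUT must be greater than POLLINTERVAL. The following sketch checks that constraint on the text of msm.cfg. It assumes one KEY[name] ATTR=VALUE assignment per line, as in the example above, and its function names are hypothetical rather than part of MSM.

```python
import re

# Matches lines such as: APPCFG[xcat]      TIMEOUT=3600
_LINE = re.compile(r"^\s*(\w+)\[(\w+)\]\s+(\w+)=(.*\S)\s*$")


def parse_msm_cfg(text):
    """Collect KEY[name] ATTR=VALUE lines into {(KEY, name): {ATTR: value}}."""
    cfg = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # comment line
        match = _LINE.match(line)
        if match:
            key, name, attr, value = match.groups()
            cfg.setdefault((key, name), {})[attr.upper()] = value.strip('"')
    return cfg


def timeout_problems(cfg):
    """List plug-ins whose TIMEOUT is not greater than their POLLINTERVAL."""
    problems = []
    for (key, name), attrs in cfg.items():
        if key != "APPCFG":
            continue
        if "TIMEOUT" in attrs and "POLLINTERVAL" in attrs:
            if int(attrs["TIMEOUT"]) <= int(attrs["POLLINTERVAL"]):
                problems.append(name)
    return problems
```

For the example configuration above, TIMEOUT=3600 exceeds POLLINTERVAL=45, so timeout_problems reports nothing for the xcat plug-in.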

25.0-F Configuration Validation

Set up environment to manually call MSM commands:

  # substitute appropriate value(s) for path(s)
  export MSMHOMEDIR=/opt/moab/tools/msm
  export MSMLIBDIR=/opt/moab/tools/msm
  export PATH=$PATH:$MSMLIBDIR/contrib:$MSMLIBDIR/bin

Verify that MSM starts without errors:

> msmd

Verify that the expected nodes are listed, without errors, using the value of _NODERANGE from msm.cfg.

> nodels <_NODERANGE>

Verify that the expected nodes are listed in the cluster query output from MSM:

> cluster.query.pl

Provision all nodes through MSM for the first time (pick an image name from x_msm):

> for i in `nodels <_NODERANGE>`; do node.modify.pl $i --set os=<image_name>; done

Verify the nodes correctly provision and that the correct OS is reported (which may take some time after the provisioning requests are made):

> cluster.query.pl

25.0-G Troubleshooting

25.0-H Deploying Images with TORQUE

When using MSM + xCAT to deploy images with TORQUE, there are some special configuration considerations. Most of these also apply to other workload resource managers.

Note that while the MSM xCAT plugin contains support for manipulating TORQUE directly, this is not an ideal solution. If you are using a version of xCAT that supports prescripts, it is more appropriate to write prescripts that manipulate TORQUE based on the state of the xCAT tables. This approach is also applicable to other workload resource managers, while the xCAT plugin only deals with TORQUE.

Several use cases and configuration choices are discussed in what follows.

Each image should be configured to report its image name through TORQUE. In the TORQUE pbs_mom mom_config file the opsys value should mirror the name of the image. See Node Manager (MOM) Configuration in the TORQUE Administrator's Guide for more information.

25.0-I Installing Moab on the Management Node

Moab is the intelligence engine that coordinates the capabilities of xCAT and TORQUE to dynamically provision compute nodes to the requested operating system. Moab also schedules workload on the system and powers off idle nodes. Download and install Moab.

25.0-J Moab Configuration File Example

Moab stores its configuration in the moab.cfg file: /opt/moab/etc/moab.cfg. A sample configuration file, set up and optimized for adaptive computing, follows:

SCHEDCFG[Moab]          SERVER=gpc-sched:42559
ADMINCFG[1]             USERS=root,egan
LOGLEVEL                7

# How often (in seconds) to refresh information from TORQUE and MSM
RMPOLLINTERVAL           60,60
RESERVATIONDEPTH        10
DEFERTIME               0
TOOLSDIR                /opt/moab/tools

###############################################################################
# TORQUE and MSM configuration                                                #
###############################################################################
RMCFG[torque]           TYPE=PBS
RMCFG[msm]        TYPE=NATIVE:msm FLAGS=autosync,NOCREATERESOURCE RESOURCETYPE=PROV
RMCFG[msm]        TIMEOUT=60
RMCFG[msm]        PROVDURATION=10:00
AGGREGATENODEACTIONS    TRUE

###############################################################################
# ON DEMAND PROVISIONING SETUP                                                #
###############################################################################
QOSCFG[od]              QFLAGS=PROVISION
USERCFG[DEFAULT]        QLIST=od
NODEALLOCATIONPOLICY    PRIORITY
NODECFG[DEFAULT]        PRIORITYF=1000*OS+1000*POWER
NODEAVAILABILITYPOLICY  DEDICATED
CLASSCFG[DEFAULT]       DEFAULT.OS=scinetcompute

###############################################################
# GREEN POLICIES                                              #
###############################################################
NODECFG[DEFAULT]        POWERPOLICY=ONDEMAND
PARCFG[ALL]             NODEPOWEROFFDURATION=20:00
NODEIDLEPOWERTHRESHOLD  600
# END Example moab.cfg
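
To see what the PRIORITYF=1000*OS+1000*POWER setting in the sample does to node selection, the sketch below evaluates the expression for a few nodes. It assumes that OS evaluates to 1 when a node already runs the requested operating system and POWER to 1 when the node is powered on (0 otherwise). The function names and node data are hypothetical.

```python
def node_priority(os_matches, powered_on):
    """Evaluate 1000*OS + 1000*POWER with OS and POWER taken as 1 or 0."""
    return 1000 * int(os_matches) + 1000 * int(powered_on)


def rank_nodes(nodes, requested_os):
    """Order node names from most to least preferred for the requested OS.

    `nodes` maps node name -> (current_os, powered_on).
    """
    return sorted(
        nodes,
        key=lambda name: node_priority(nodes[name][0] == requested_os,
                                       nodes[name][1]),
        reverse=True,
    )
```

Under these assumptions, a node that is powered on and already running the requested image scores 2000 and is chosen first, while a powered-off node with a different image scores 0 and is chosen last, so provisioning and power-on happen only when no better node is free.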

25.0-K Verifying the Installation

When Moab starts it immediately communicates with its configured resource managers. In this case Moab communicates with TORQUE to get compute node and job queue information. It then communicates with MSM to determine the state of the nodes according to xCAT. It aggregates this information and processes the jobs discovered from TORQUE.

When a job is submitted, Moab determines whether nodes need to be provisioned to a particular operating system to satisfy the requirements of the job. If any nodes need to be provisioned, Moab creates a provisioning system job (a job that is internal to Moab). This system job communicates with xCAT to provision the nodes and remains active while the nodes are provisioning. Once the system job has provisioned the nodes, it informs the user's job that the nodes are ready, at which time the user's job starts running on the newly provisioned nodes.

When a node has been idle for a specified amount of time (see NODEIDLEPOWERTHRESHOLD), Moab creates a power-off system job. This job communicates with xCAT to power off the nodes and remains active in the job queue until the nodes have powered off. The system job then informs Moab that the nodes are powered off but still available to run jobs, and exits.

To verify correct communication between Moab and MSM, run the mdiag -R -v msm command.

$ mdiag -R -v msm
diagnosing resource managers
RM[msm]       State: Active  Type: NATIVE:MSM  ResourceType: PROV
  Timeout:            30000.00 ms
  Cluster Query URL:  $HOME/tools/msm/contrib/cluster.query.xcat.pl
  Workload Query URL: exec://$TOOLSDIR/msm/contrib/workload.query.pl
  Job Start URL:      exec://$TOOLSDIR/msm/contrib/job.start.pl
  Job Cancel URL:     exec://$TOOLSDIR/msm/contrib/job.modify.pl
  Job Migrate URL:    exec://$TOOLSDIR/msm/contrib/job.migrate.pl
  Job Submit URL:     exec://$TOOLSDIR/msm/contrib/job.submit.pl
  Node Modify URL:    exec://$TOOLSDIR/msm/contrib/node.modify.pl
  Node Power URL:     exec://$TOOLSDIR/msm/contrib/node.power.pl
  RM Start URL:       exec://$TOOLSDIR/msm/bin/msmd
  RM Stop URL:        exec://$TOOLSDIR/msm/bin/msmctl?-k
  System Modify URL:  exec://$TOOLSDIR/msm/contrib/node.modify.pl
  Environment:        MSMHOMEDIR=/home/wightman/test/scinet/tools//msm;MSMLIBDIR=/home/wightman/test/scinet/tools//msm
  Objects Reported:   Nodes=10 (0 procs)  Jobs=0
  Flags:              autosync
  Partition:          SHARED
  Event Management:   (event interface disabled)
  RM Performance:     AvgTime=0.10s  MaxTime=0.25s  (38 samples)
  RM Languages:       NATIVE
  RM Sub-Languages:   -

To verify that nodes are configured to provision, use the checknode -v command. Each node lists its available operating systems.

$ checknode n01
node n01
State:      Idle  (in current state for 00:00:00)
Configured Resources: PROCS: 4  MEM: 1024G  SWAP: 4096M  DISK: 1024G
Utilized   Resources: ---
Dedicated  Resources: ---
Generic Metrics:    watts=25.00,temp=40.00
Power Policy:       Green (global policy)   Selected Power State: Off
Power State:   Off
Power:      Off
  MTBF(longterm):   INFINITY  MTBF(24h):   INFINITY
Opsys:      compute   Arch:      ---
  OS Option: compute
  OS Option: computea
  OS Option: gpfscompute
  OS Option: gpfscomputea
Speed:      1.00      CPULoad:   0.000
Flags:      rmdetected
RM[msm]:    TYPE=NATIVE:MSM  ATTRO=POWER
EffNodeAccessPolicy: SINGLEJOB
Total Time: 00:02:30  Up: 00:02:19 (92.67%)  Active: 00:00:11 (7.33%)
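
When checking many nodes, the list of available operating systems can be pulled out of saved checknode output. This is a small sketch that relies only on the "OS Option:" lines shown in the sample above. The function name is hypothetical.

```python
def os_options(checknode_output):
    """Return the operating systems listed on 'OS Option:' lines, in order."""
    options = []
    for line in checknode_output.splitlines():
        label, sep, value = line.strip().partition(":")
        if sep and label == "OS Option":
            options.append(value.strip())
    return options
```

For the sample output above this yields compute, computea, gpfscompute, and gpfscomputea. An empty list suggests that the node is not receiving image definitions from MSM.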

To verify nodes are configured for Green power management, run the mdiag -G command. Each node will show its power state.

$ mdiag -G
NOTE:  power management enabled for all nodes
Partition ALL:  power management enabled
  Partition NodeList:
Partition local:  power management enabled
  Partition NodeList:
  node n01 is in state Idle, power state On (green powerpolicy enabled)
  node n02 is in state Idle, power state On (green powerpolicy enabled)
  node n03 is in state Idle, power state On (green powerpolicy enabled)
  node n04 is in state Idle, power state On (green powerpolicy enabled)
  node n05 is in state Idle, power state On (green powerpolicy enabled)
  node n06 is in state Idle, power state On (green powerpolicy enabled)
  node n07 is in state Idle, power state On (green powerpolicy enabled)
  node n08 is in state Idle, power state On (green powerpolicy enabled)
  node n09 is in state Idle, power state On (green powerpolicy enabled)
  node n10 is in state Idle, power state On (green powerpolicy enabled)
Partition SHARED:  power management enabled
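
On larger clusters it can help to summarize this output rather than read it line by line. The sketch below parses saved mdiag -G output, assuming the per-node line format shown in the sample above. The function names are hypothetical.

```python
import re

# Matches lines such as:
#   node n01 is in state Idle, power state On (green powerpolicy enabled)
_NODE_LINE = re.compile(r"node (\S+) is in state (\S+), power state (\S+)")


def power_states(mdiag_output):
    """Map each node name to its reported power state."""
    return {m.group(1): m.group(3) for m in _NODE_LINE.finditer(mdiag_output)}


def nodes_in_power_state(mdiag_output, state):
    """Return the sorted names of nodes reporting the given power state."""
    states = power_states(mdiag_output)
    return sorted(node for node, value in states.items() if value == state)
```

For example, nodes_in_power_state(text, "Off") lists the nodes that Moab has powered down.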

To submit a job that dynamically provisions compute nodes, run the msub -l os=<image> command.

$ msub -l os=computea job.sh
yuby.3
$ showq
active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME
provision-4            root    Running     8    00:01:00  Fri Jun 19 09:12:56
1 active job               8 of 40 processors in use by local jobs (20.00%)
                           2 of 10 nodes active      (20.00%)
eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME
yuby.3             wightman       Idle     8    00:10:00  Fri Jun 19 09:12:55
1 eligible job
blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

0 blocked jobs
Total jobs:  2

Notice that Moab created a provisioning system job named provision-4 to provision the nodes. When provision-4 detects that the nodes are correctly provisioned to the requested OS, the submitted job yuby.3 runs:

$ showq
active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME
yuby.3             wightman    Running     8    00:08:49  Fri Jun 19 09:13:29
1 active job               8 of 40 processors in use by local jobs (20.00%)
                           2 of 10 nodes active      (20.00%)
eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

0 eligible jobs
blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

0 blocked jobs
Total job:  1

The checkjob command shows information about the provisioning job as well as the submitted job. If any errors occur, run the checkjob -v <jobid> command to diagnose failures.

25.0-L xCAT Plug-in Configuration Parameters

Parameters that begin with an underscore character are specific to the xCAT plug-in; others are common to all plug-ins and may be set either in RMCFG[msm] for all plug-ins or per plug-in in APPCFG[<plugin_name>].

Description
Module
LogLevel
PollInterval
TimeOut
_NodeRange
_CQxCATSessions
_DORVitals
_PowerString
_DoNodeStat
_DoxCATStats
_LockDir
_HVxCATPasswdKey
_FeatureGroups
_DefaultVMCProc
_DefaultVMDisk
_DefaultVMCMemory
_KVMStoragePath
_ESXStore
_ESXCFGPath
_VMInterfaces
_XenHostInterfaces
_KVMHostInterfaces
_VMSovereign
_UseStates
_ImagesTabName
_VerifyRPower
_RPowerTimeOut
_QueueRPower
_RPowerQueueAge
_RPowerQueueSize
_MaskOSWhenOff
_ModifyTORQUE
_ReportNETADDR
_UseOpIDs
_VMIPRange
_xCATHost
_NoRollbackOnError
Description
Format Double quoted string containing brief description of plugin.
Default ---
Description This information is not visible in Moab, but shows up in msmctl -a.
Module
Format Moab::MSM::App::xCAT
Default ---
Description Name of the plugin module to load.
LogLevel
Format 1-9
Default 5
Description Controls the verbosity of logging, 1 being the lowest (least information logged) and 9 being the highest (most information logged). For initial setup and testing, 8 is recommended; lower it to 3 (only errors logged) for normal operation. Use 9 for debugging or when submitting a log file for support.
PollInterval
Format Integer > 0
Default 60
Description

MSM will query xCAT every POLLINTERVAL seconds to update general node status. This number will likely require tuning for each specific system. To develop it, pick a fraction of the total nodes MSM will manage (1/_CQXCATSESSIONS), time how long it takes to run nodestat, rpower stat, and optionally rvitals on these nodes, and add about 15%.

Increasing the POLLINTERVAL will lower the overall load on the xCAT headnode, but decrease the responsiveness to provisioning and power operations.
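
The tuning rule above is simple arithmetic: take the measured time for one group of nodes and add a margin of about 15%. A minimal sketch follows. The function name is hypothetical, and rounding up to a whole second is a choice made here, not something MSM requires.

```python
import math


def suggested_pollinterval(measured_seconds, margin=0.15):
    """Measured query time for one node group plus a margin, rounded up."""
    return math.ceil(measured_seconds * (1 + margin))
```

For example, if nodestat, rpower stat, and rvitals together take 39 seconds on one group of nodes, suggested_pollinterval(39) gives 45.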

TimeOut
Format Integer value > POLLINTERVAL
Default 300
Description This parameter controls how long MSM will wait for child processes to complete (all xCAT commands are run in child processes). After TIMEOUT seconds, if a child has not returned, it will be killed and an error reported for the operation.
_NodeRange
Format Any valid noderange (see the xCAT noderange man page).
Default All
Description When MSM queries xCAT this is the noderange it will use. At sites where xCAT manages other hardware that Moab is not intended to control, it is important to change this.
_CQxCATSessions
Format Positive integer > 1
Default 10
Description MSM will divide the node list generated by nodels into this many groups and simultaneously query xCAT for each group. The value may need tuning for large installations: higher values reduce the time to complete a single cluster query but put a higher load on the xCAT headnode.
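
To get a feel for the group sizes a given _CQxCATSessions value produces, the sketch below splits a node list into near-equal contiguous groups. How MSM actually assigns nodes to groups is not documented here, so the contiguous split and the function name are assumptions made for illustration.

```python
def split_into_groups(nodes, sessions):
    """Split nodes into at most `sessions` contiguous groups of near-equal size."""
    if sessions < 1:
        raise ValueError("sessions must be a positive integer")
    count = min(sessions, len(nodes))
    groups = []
    start = 0
    for i in range(count):
        # Spread the remainder over the first groups.
        size = len(nodes) // count + (1 if i < len(nodes) % count else 0)
        groups.append(nodes[start:start + size])
        start += size
    return groups
```

With 10 nodes and _CQXCATSESSIONS=4, this produces groups of 3, 3, 2, and 2 nodes. The largest group is the one to time when tuning POLLINTERVAL.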
_DORVitals
Format 0 or 1
Default 0
Description When set to 1, MSM will poll rvitals power and led status (see the xCAT rvitals man page). This currently works only with IBM BMCs. To use this, xCAT should respond without error to the rvitals <noderange> watts and rvitals <noderange> leds commands. Status is reported as GMETRIC[watts] and GMETRIC[leds]. See also the _PowerString configuration parameter.
_PowerString
Format single quote delimited string
Default 'AC Avg Power'
Description Only meaningful when used with _DORVitals=1. Some BMCs return multiple responses to the rvitals command, or use slightly different text to describe the power metrics. Use this parameter to control what is reported to Moab. You can use '$MSMLIBDIR/contrib/xcat/dump.xcat.cmd.pl rvitals <node_name> power' and examine the output to determine what the appropriate value of this string is.
_DoNodeStat
Format 0 or 1
Default 1
Description If set to 0, MSM will not call nodestat to generate a substate. This can speed up xCAT queries if you do not need the substate visible to Moab.
_DoxCATStats
Format 0 or 1
Default 0
Description If set to 1, MSM will track performance statistics about calls to xCAT and the performance of higher-level operations. The information is available via the script $MSMHOMEDIR/contrib/xcat/xcatstats.pl. This parameter is useful for tuning the POLLINTERVAL and _CQxCATSessions configuration parameters.
_LockDir
Format Existing path on MSM host
Default $MSMHOMEDIR/lck
Description This is a path to where MSM maintains lock files to control concurrency with some Xen and KVM operations.
_HVxCATPasswdKey
Format key value in the xCAT passwd table
Default vmware
Description This is where MSM gets the user/password to communicate with ESX hypervisors.
_FeatureGroups
Format Comma delimited string of xCAT group names.
Default ---
Description MSM builds the OSLIST for a node as the intersection of _FEATUREGROUPS, the features specified in x_msm for that image, and the node's group membership. The value 'torque' is special: it indicates that the image uses TORQUE and that the node should be added to or removed from TORQUE during provisioning when used in conjunction with the _ModifyTORQUE parameter.
_DefaultVMCProc
Format 1-?
Default 1
Description If not explicitly specified in the create request, MSM will create VMs with this many processors.
_DefaultVMDisk
Format Positive integer values, minimum is determined by your vm image needs
Default 4096
Description If not explicitly specified in the create request, MSM will create VMs with this much disk allocated.
_DefaultVMCMemory
Format Positive integer values, minimum is determined by your vm image needs
Default 512
Description If not specified, MSM will create VMs with this much memory allocated.
_KVMStoragePath
Format Existing path on MSM host
Default /vms
Description File-backed disks for stateful KVM VMs are placed here.
_ESXStore
Format Mountable NFS Path
Default ---
Description Location of ESX stores.
_ESXCFGPath
Format Mountable NFS Path
Default ESXStore
Description Location of ESX VM configuration files.
_VMInterfaces
Format Name of bridge device in your VM image
Default br0
Description Bridge device name passed to libvirt for network configuration of VMs (overrides _XENHOSTINTERFACES and _KVMHOSTINTERFACES if specified).
_XenHostInterfaces
Format Name of bridge device in your VM image
Default xenbr0
Description Bridge device name passed to libvirt for network configuration of Xen VMs.
_KVMHostInterfaces
Format Name of bridge device in your VM image
Default br0
Description Bridge device name passed to libvirt for network configuration of KVM VMs.
_VMSovereign
Format 0 or 1
Default 0
Description Setting this attribute will cause Moab to reserve each VM's memory and processors on the hypervisor and treat the VM as the workload; additional workload cannot be scheduled on the VM.
_UseStates
Format Valid xCAT chain.currstate values (see the xCAT chain man page)
Default boot,netboot,install
Description Nodes that do not have one of these values in the xCAT chain.currstate field will be reported with STATE=Updating. Use this configuration parameter to prevent Moab from scheduling nodes that are updating firmware, etc.
_ImagesTabName
Format Existing xCAT table that contains your image definitions.
Default x_msm
Description This table specifies the images that may be presented to Moab in a node's OSLIST. The xCAT schema for this table is defined in $MSMHOMEDIR/contrib/xcat/MSM.pm, which needs to be copied to the $XCATROOT/lib/perl/xCAT_schema directory.
_VerifyRPower
Format 0 or 1
Default 0
Description

If set, MSM will attempt to confirm that rpower requests were successful by polling the power state with rpower stat until the node reports the expected state, or _RPowerTimeOut is reached.

NOTE: This can create significant load on the xCAT headnode.

_RPowerTimeOut
Format Positive integer values
Default 60
Description Only meaningful when used with _VerifyRPower. If nodes do not report the expected power state in this amount of time, a GEVENT will be produced on the node (or system job).
_QueueRPower
Format 0 or 1
Default 0
Description

When set, this parameter will cause MSM to aggregate rpower requests to xCAT into batches. The timing and size of these batches is controlled with the _RPowerQueueAge and _RPowerQueueSize parameters.

NOTE: This can significantly reduce load on the xCAT headnode, but will cause the power commands to take longer, and MSM shutdown to take longer.

_RPowerQueueAge
Format Positive integer values
Default 30
Description Only meaningful when used with _QueueRPower. MSM will send any pending rpower requests when the oldest request in the queue exceeds this value (seconds).
_RPowerQueueSize
Format Positive integer values
Default 200
Description Only meaningful when used with _QueueRPower. MSM will send any pending rpower requests when the queue depth exceeds this value.
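
Taken together, _RPowerQueueAge and _RPowerQueueSize describe a flush rule: pending rpower requests are sent when the oldest one is too old or when too many have accumulated. The sketch below expresses that rule using the documented defaults. The function and argument names are hypothetical, and this is not MSM's actual code.

```python
def should_flush(request_times, now, max_age=30, max_size=200):
    """Decide whether queued rpower requests should be sent.

    request_times: enqueue timestamps (seconds) of the pending requests.
    Returns True when the oldest request is older than max_age seconds
    or when more than max_size requests are pending.
    """
    if not request_times:
        return False
    oldest_age = now - min(request_times)
    return oldest_age > max_age or len(request_times) > max_size
```

Lowering either value makes power operations more responsive at the cost of more frequent rpower calls against the xCAT headnode.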
_MaskOSWhenOff
Format 0 or 1
Default 0
Description When set, this parameter will cause MSM to report OS=None for nodes that are powered off. This may be useful when mixing stateless and stateful images, forcing Moab to request provisioning instead of just powering on a node.
_ModifyTORQUE
Format 0 or 1
Default 0
Description When set, this parameter will cause MSM to add and remove nodes and VMs from TORQUE as required by provisioning. See the _FeatureGroups parameter as well.
_ReportNETADDR
Format 0 or 1
Default 0
Description When set, this parameter will cause MSM to report NETADDR=<hosts.ip from xCAT>.
_UseOpIDs
Format 0 or 1
Default 0
Description When set, this parameter will cause errors to be reported as GEVENTs on the provided system job instead of on a node (Moab 5.4 only, with appropriate Moab CFG).
_VMIPRange
Format Comma-separated list of dynamic ranges for VMs (for example, '10.10.23.100-200,10.10.24.1-255')
Default ---
Description Use this parameter to specify a pool of IPs that MSM should assign to VMs at creation time. IPs are selected sequentially from this list as available. Omit this configuration parameter if an external service manages IP assignment or if all addresses have already been statically assigned.
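
The range syntax in the example expands to individual addresses by varying the last octet. The sketch below shows that expansion and a sequential pick of the first unused address. It assumes each entry has the form a.b.c.X-Y (a bare a.b.c.X is also accepted here as a convenience), and the function names are hypothetical.

```python
def expand_vm_ip_range(spec):
    """Expand 'a.b.c.X-Y,...' into the ordered list of individual addresses."""
    addresses = []
    for part in spec.split(","):
        prefix, _, last_octet = part.strip().rpartition(".")
        low, _, high = last_octet.partition("-")
        high = high or low  # assumption: a single address has no '-Y' part
        for octet in range(int(low), int(high) + 1):
            addresses.append(f"{prefix}.{octet}")
    return addresses


def next_free_ip(spec, in_use):
    """Return the first address in the pool that is not in use, or None."""
    for address in expand_vm_ip_range(spec):
        if address not in in_use:
            return address
    return None
```

For '10.10.23.100-200,10.10.24.1-255' the pool holds 356 addresses, starting at 10.10.23.100 and continuing with 10.10.24.1 once the first range is exhausted.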
_xCATHost
Format <xcat_headnode>:<xcatd_port>
Default localhost:3001
Description Use to configure MSM to communicate with xCAT on another host.
_NoRollbackOnError
Format 0 or 1
Default 0
Description When an error occurs and rollback is activated (as it is by default), rollback causes a reversion to the previous successful request. _NoRollbackOnError is useful for debugging to determine the xCAT state if no rollback occurred. If set to 1 and an error occurs between MSM and xCAT when creating a node, assigning a name (DNS) to a node, or assigning an IP address (DHCP) to a node, then no rollback occurs.