B.0.1 Introduction
Moab can dynamically provision compute machines to requested operating systems and power off compute machines when not in use. Moab can intelligently control xCAT and use its advanced system configuration mechanisms to adapt systems to current workload requirements. Moab communicates with xCAT using the Moab Service Manager (MSM). MSM is a translation utility that resides between Moab and xCAT and acts as an aggregator and interpreter. The Moab Workload Manager queries MSM, which in turn queries xCAT, about system resources, configurations, images, and metrics. After learning about these resources from MSM, Moab decides how best to maximize system utilization.
In this model Moab gathers system information from two resource managers. The first is Torque, which handles the workload on the system; the second is MSM, which relays information gathered by xCAT. By leveraging these software packages, Moab intelligently adapts clusters to deliver on site goals.
This document assumes that xCAT has been installed and configured. It describes the process of getting MSM and xCAT communicating, and it offers troubleshooting guidance for basic integration. It also describes how to get Moab communicating with MSM and the final steps for verifying the complete software stack.
B.0.2 xCAT Configuration Requirements
Observe the following xCAT configuration requirements before installing MSM:
You must have a valid Moab license file (moab.lic) with provisioning and green enabled. For information on acquiring an evaluation license, please contact [email protected].
> perl -e 'use Storable 2.18'
> perl -MXML::Simple -e 'exit'
> perl -MProc::Daemon -e 'exit'
> perl -MDBD::SQLite -e 'exit'
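The module checks can also be run in one pass so that every missing module is named at once. The following is a minimal sketch, not part of MSM; the function name `check_perl_modules` is hypothetical, and it only assumes that `perl` is on the PATH.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with MSM): report which Perl modules
# fail to load. Each argument is a module name, optionally followed by a
# minimum version, for example "Storable 2.18".
check_perl_modules() (
    status=0
    for spec in "$@"; do
        # "use Module VERSION;" fails if the module is absent or too old
        if perl -e "use $spec;" >/dev/null 2>&1; then
            echo "ok:      $spec"
        else
            echo "missing: $spec"
            status=1
        fi
    done
    exit "$status"
)
```

A call such as `check_perl_modules "Storable 2.18" XML::Simple Proc::Daemon DBD::SQLite` returns a non-zero status if any module could not be loaded.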
B.0.4 Integrating MSM and xCAT
Copy the x_msm table schema to the xCAT schema directory:
> cp $MSMHOMEDIR/contrib/xcat/MSM.pm $XCATROOT/lib/perl/xCAT_schema
Restart xcatd and check that the x_msm table was created correctly:
> service xcatd restart
> tabdump x_msm
Prepare xCAT images and ensure they provision correctly (see xCAT documentation)
Populate the x_msm table with your image definitions:
> tabedit x_msm
#flavorname,arch,profile,os,nodeset,features,vmoslist,hvtype,hvgroupname,vmgroupname,comments,disable
"compute","x86_64","compute","centos5.3","netboot","torque",,,,,,
"science","x86","compute","scientific_linux","netboot","torque",,,,,,
Ensure all xCAT group names in the x_msm table exist in the xCAT nodegroup table
> tabedit nodegroup
Edit as necessary so that the table resembles the following example:
#groupname,grouptype,members,wherevals,comments,disable
"compute",,,,,
"esxi4",,,,,
"esxhv",,,,,
"esxvmmgt",,,,,
After making any necessary edits, run the following command:
> nodels compute,esxi4,esxhv,esxvmmgt # should complete without error; empty output is fine
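To check many group names at once, the names you expect can be compared against a dump of the nodegroup table. This is a sketch under stated assumptions: it assumes `tabdump nodegroup` prints comma-separated rows with the group name as the first double-quoted field, as in the example above, and the function name `missing_groups` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print each expected group name that is absent from
# a nodegroup table dump read on stdin (CSV, group name in the first field).
missing_groups() {
    awk -F',' -v want="$*" '
        /^#/ { next }                            # skip the header line
        { g = $1; gsub(/"/, "", g); seen[g] = 1 }
        END {
            n = split(want, w, " ")
            for (i = 1; i <= n; i++)
                if (!(w[i] in seen)) print w[i]
        }
    '
}
```

For example, `tabdump nodegroup | missing_groups compute esxi4 esxhv esxvmmgt` prints nothing when every group is defined.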
Edit $MSMHOMEDIR/msm.cfg and configure the xCAT plug-in. Below is a generic example for use with Torque without virtualization. See the section on configuration parameters for a complete list of parameters and descriptions.
# MSM configuration options
RMCFG[msm] PORT=24603
RMCFG[msm] POLLINTERVAL=45
RMCFG[msm] LOGFILE=/opt/moab/log/msm.log
RMCFG[msm] LOGLEVEL=8
RMCFG[msm] DEFAULTNODEAPP=xcat
# xCAT plugin specific options
APPCFG[xcat] DESCRIPTION="xCAT plugin"
APPCFG[xcat] MODULE=Moab::MSM::App::xCAT
APPCFG[xcat] LOGLEVEL=3
APPCFG[xcat] POLLINTERVAL=45
APPCFG[xcat] TIMEOUT=3600
APPCFG[xcat] _USEOPIDS=0
APPCFG[xcat] _NODERANGE=moab,esxcompute
APPCFG[xcat] _USESTATES=boot,netboot,install
APPCFG[xcat] _LIMITCLUSTERQUERY=1
APPCFG[xcat] _RPOWERTIMEOUT=120
APPCFG[xcat] _DONODESTAT=1
APPCFG[xcat] _REPORTNETADDR=1
APPCFG[xcat] _CQXCATSESSIONS=4
B.0.6 Configuration Validation
Set up environment to manually call MSM commands:
# substitute appropriate value(s) for path(s)
export MSMHOMEDIR=/opt/moab/tools/msm
export MSMLIBDIR=/opt/moab/tools/msm
export PATH=$PATH:$MSMLIBDIR/contrib:$MSMLIBDIR/bin
Verify that MSM starts without errors:
> msmd
Verify that the expected nodes are listed, without errors, using the value of _NODERANGE from msm.cfg.
> nodels <_NODERANGE>
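Rather than copying the value by hand, it can be read from msm.cfg. The sketch below assumes the `SECTION[name] KEY=value` line layout shown in the msm.cfg example; the function name `msm_cfg_value` is hypothetical and not part of MSM.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one key from msm.cfg.
# $1 = config file, $2 = section token such as 'APPCFG[xcat]', $3 = key.
msm_cfg_value() {
    awk -v sec="$2" -v key="$3" '
        $1 == sec {
            line = $0
            sub(/^[ \t]*[^ \t]+[ \t]+/, "", line)    # drop the section token
            n = index(line, "=")
            if (n > 0 && substr(line, 1, n - 1) == key)
                val = substr(line, n + 1)             # last match wins
        }
        END { if (val != "") print val }
    ' "$1"
}
```

With this in place, the check becomes `nodels "$(msm_cfg_value "$MSMHOMEDIR/msm.cfg" 'APPCFG[xcat]' _NODERANGE)"`.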
Verify that the expected nodes are listed in the cluster query output from MSM:
> cluster.query.pl
Provision all nodes through MSM for the first time (pick an image name from x_msm):
> for i in `nodels <_NODERANGE>`; do node.modify.pl $i --set os=<image_name>; done
Verify the nodes correctly provision and that the correct OS is reported (which may take some time after the provisioning requests are made):
> cluster.query.pl
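On a large cluster it can help to filter the query output down to the nodes that have not yet reached the requested image. The sketch below rests on an assumption that should be checked against your own output: that each line has the form `<node> KEY=value KEY=value ...` and that the operating system is reported in an `OS=` attribute. The function name `nodes_with_wrong_os` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read node lines on stdin and print the nodes whose
# OS attribute differs from the expected image name ($1).
# ASSUMPTION: each line looks like "<node> KEY=value KEY=value ...".
nodes_with_wrong_os() {
    awk -v want="$1" '
        NF == 0 { next }                         # ignore blank lines
        {
            os = ""
            for (i = 2; i <= NF; i++)
                if ($i ~ /^OS=/) os = substr($i, 4)
            if (os != want)
                print $1, (os == "" ? "(no OS reported)" : os)
        }
    '
}
```

Running `cluster.query.pl | nodes_with_wrong_os <image_name>` repeatedly should eventually print nothing once provisioning has finished.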
B.0.8 Deploying Images with Torque
When using MSM + xCAT to deploy images with Torque, there are some special configuration considerations. Most of these also apply to other workload resource managers.
Note that while the MSM xCAT plugin contains support for manipulating Torque directly, this is not an ideal solution. If you are using a version of xCAT that supports prescripts, it is more appropriate to write prescripts that manipulate Torque based on the state of the xCAT tables. This approach is also applicable to other workload resource managers, while the xCAT plugin only deals with Torque.
Several use cases and configuration choices are discussed in what follows.
Each image should be configured to report its image name through Torque. In the Torque pbs_mom mom_config file the opsys value should mirror the name of the image. See Node Manager (MOM) Configuration in the Torque 6.0.1 Administrator Guide for more information.
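One way to keep the reported name in step with the image is to write the opsys entry when the image is built. The sketch below assumes the MOM configuration file takes one `name value` directive per line, so the entry is `opsys <image_name>`; the function name `set_mom_opsys` and the file path passed to it are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: make sure a pbs_mom config file carries exactly one
# "opsys <image>" line.
# $1 = path to the MOM config file inside the image, $2 = image name.
set_mom_opsys() (
    cfg=$1
    image=$2
    tmp=$(mktemp) || exit 1
    # Keep every line except an existing opsys entry, then append the new one
    if [ -f "$cfg" ]; then
        grep -v '^[[:space:]]*opsys[[:space:]]' "$cfg" > "$tmp" || true
    fi
    echo "opsys $image" >> "$tmp"
    # Write through the original path so its ownership and mode are kept
    cat "$tmp" > "$cfg" && rm -f "$tmp"
)
```

For an image named compute, `set_mom_opsys <path_to_mom_config> compute` leaves a single `opsys compute` line in the file.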
B.0.9 Installing Moab on the Management Node
Moab is the intelligence engine that coordinates the capabilities of xCAT and Torque to dynamically provision compute nodes to the requested operating system. Moab also schedules workload on the system and powers off idle nodes. Download and install Moab.
B.0.10 Moab Configuration File Example
Moab stores its configuration in the moab.cfg file: /opt/moab/etc/moab.cfg. A sample configuration file, set up and optimized for adaptive computing, follows:
SCHEDCFG[Moab] SERVER=gpc-sched:42559
ADMINCFG[1] USERS=root,egan
LOGLEVEL 7
# How often (in seconds) to refresh information from Torque and MSM
RMPOLLINTERVAL 60,60
RESERVATIONDEPTH 10
DEFERTIME 0
TOOLSDIR /opt/moab/tools
###############################################################################
# Torque and MSM configuration #
###############################################################################
RMCFG[torque] TYPE=PBS
RMCFG[msm] TYPE=NATIVE:msm FLAGS=autosync,NOCREATERESOURCE RESOURCETYPE=PROV
RMCFG[msm] TIMEOUT=60
RMCFG[msm] PROVDURATION=10:00
AGGREGATENODEACTIONS TRUE
###############################################################################
# ON DEMAND PROVISIONING SETUP #
###############################################################################
QOSCFG[od] QFLAGS=PROVISION
USERCFG[DEFAULT] QLIST=od
NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT] PRIORITYF=1000*OS+1000*POWER
NODEAVAILABILITYPOLICY DEDICATED
CLASSCFG[DEFAULT] DEFAULT.OS=scinetcompute
###############################################################
# GREEN POLICIES #
###############################################################
NODECFG[DEFAULT] POWERPOLICY=ONDEMAND
PARCFG[ALL] NODEPOWEROFFDURATION=20:00
NODEIDLEPOWERTHRESHOLD 600
# END Example moab.cfg
B.0.11 Verifying the Installation
When Moab starts it immediately communicates with its configured resource managers. In this case Moab communicates with Torque to get compute node and job queue information. It then communicates with MSM to determine the state of the nodes according to xCAT. It aggregates this information and processes the jobs discovered from Torque.
When a job is submitted, Moab determines whether nodes need to be provisioned to a particular operating system to satisfy the requirements of the job. If any nodes need to be provisioned, Moab performs this action by creating a provisioning system job (a job that is internal to Moab). This system job communicates with xCAT to provision the nodes and remains active while the nodes are provisioning. Once the system job has provisioned the nodes, it informs the user's job that the nodes are ready, at which time the user's job starts running on the newly provisioned nodes.
When a node has been idle for a specified amount of time (see NODEIDLEPOWERTHRESHOLD), Moab creates a power-off system job. This job communicates with xCAT to power off the nodes and remains active in the job queue until the nodes have powered off. Then the system job informs Moab that the nodes are powered off but are still available to run jobs. The power-off system job then exits.
To verify correct communication between Moab and MSM run the mdiag -R -v msm command.
$ mdiag -R -v msm
diagnosing resource managers
RM[msm] State: Active Type: NATIVE:MSM ResourceType: PROV
Timeout: 30000.00 ms
Cluster Query URL: $HOME/tools/msm/contrib/cluster.query.xcat.pl
Workload Query URL: exec://$TOOLSDIR/msm/contrib/workload.query.pl
Job Start URL: exec://$TOOLSDIR/msm/contrib/job.start.pl
Job Cancel URL: exec://$TOOLSDIR/msm/contrib/job.modify.pl
Job Migrate URL: exec://$TOOLSDIR/msm/contrib/job.migrate.pl
Job Submit URL: exec://$TOOLSDIR/msm/contrib/job.submit.pl
Node Modify URL: exec://$TOOLSDIR/msm/contrib/node.modify.pl
Node Power URL: exec://$TOOLSDIR/msm/contrib/node.power.pl
RM Start URL: exec://$TOOLSDIR/msm/bin/msmd
RM Stop URL: exec://$TOOLSDIR/msm/bin/msmctl?-k
System Modify URL: exec://$TOOLSDIR/msm/contrib/node.modify.pl
Environment: MSMHOMEDIR=/home/wightman/test/scinet/tools//msm;MSMLIBDIR=/home/wightman/test/scinet/tools//msm
Objects Reported: Nodes=10 (0 procs) Jobs=0
Flags: autosync
Partition: SHARED
Event Management: (event interface disabled)
RM Performance: AvgTime=0.10s MaxTime=0.25s (38 samples)
RM Languages: NATIVE
RM Sub-Languages: -
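For scripted health checks, the state can be pulled out of this output. The following is a rough sketch that relies only on the `RM[<name>] State: <state>` text visible in the sample above; the function name `rm_is_active` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the mdiag -R -v output on stdin reports
# the named resource manager ($1) in state Active.
rm_is_active() {
    # Flatten to one line so the check works whether or not "State:" sits
    # on the same line as the RM[...] header.
    tr '\n' ' ' | grep -q "RM\[$1] *State: *Active"
}
```

Usage: `mdiag -R -v msm | rm_is_active msm || echo "msm is not active"`.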
To verify nodes are configured to provision use the checknode -v command. Each node will have a list of available operating systems.
$ checknode n01
node n01
State: Idle (in current state for 00:00:00)
Configured Resources: PROCS: 4 MEM: 1024G SWAP: 4096M DISK: 1024G
Utilized Resources: ---
Dedicated Resources: ---
Generic Metrics: watts=25.00,temp=40.00
Power Policy: Green (global policy) Selected Power State: Off
Power State: Off
Power: Off
MTBF(longterm): INFINITY MTBF(24h): INFINITY
Opsys: compute Arch: ---
OS Option: compute
OS Option: computea
OS Option: gpfscompute
OS Option: gpfscomputea
Speed: 1.00 CPULoad: 0.000
Flags: rmdetected
RM[msm]: TYPE=NATIVE:MSM ATTRO=POWER
EffNodeAccessPolicy: SINGLEJOB
Total Time: 00:02:30 Up: 00:02:19 (92.67%) Active: 00:00:11 (7.33%)
To verify nodes are configured for Green power management, run the mdiag -G command. Each node will show its power state.
$ mdiag -G
NOTE: power management enabled for all nodes
Partition ALL: power management enabled
Partition NodeList:
Partition local: power management enabled
Partition NodeList:
node n01 is in state Idle, power state On (green powerpolicy enabled)
node n02 is in state Idle, power state On (green powerpolicy enabled)
node n03 is in state Idle, power state On (green powerpolicy enabled)
node n04 is in state Idle, power state On (green powerpolicy enabled)
node n05 is in state Idle, power state On (green powerpolicy enabled)
node n06 is in state Idle, power state On (green powerpolicy enabled)
node n07 is in state Idle, power state On (green powerpolicy enabled)
node n08 is in state Idle, power state On (green powerpolicy enabled)
node n09 is in state Idle, power state On (green powerpolicy enabled)
node n10 is in state Idle, power state On (green powerpolicy enabled)
Partition SHARED: power management enabled
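With many nodes, a per-state tally is easier to read than the full list. This sketch depends only on the `node <name> is in state <state>, power state <power>` wording shown in the sample output; the function name `power_state_summary` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: tally nodes per power state from mdiag -G output
# read on stdin. Prints "<power state> <count>" lines, sorted.
power_state_summary() {
    awk '
        $1 == "node" && /power state/ {
            # The word after "power state" is the reported state
            for (i = 1; i + 2 <= NF; i++)
                if ($i == "power" && $(i + 1) == "state") count[$(i + 2)]++
        }
        END { for (s in count) print s, count[s] }
    ' | sort
}
```

Usage: `mdiag -G | power_state_summary`.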
To submit a job that dynamically provisions compute nodes, run the msub -l os=<image> command.
$ msub -l os=computea job.sh
yuby.3
$ showq
active jobs------------------------
JOBID USERNAME STATE PROCS REMAINING STARTTIME
provision-4 root Running 8 00:01:00 Fri Jun 19 09:12:56
1 active job 8 of 40 processors in use by local jobs (20.00%)
2 of 10 nodes active (20.00%)
eligible jobs----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
yuby.3 wightman Idle 8 00:10:00 Fri Jun 19 09:12:55
1 eligible job
blocked jobs-----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
0 blocked jobs
Total jobs: 2
Notice that Moab created a provisioning system job named provision-4 to provision the nodes. When provision-4 detects that the nodes are correctly provisioned to the requested OS, the submitted job yuby.3 runs:
$ showq
active jobs------------------------
JOBID USERNAME STATE PROCS REMAINING STARTTIME
yuby.3 wightman Running 8 00:08:49 Fri Jun 19 09:13:29
1 active job 8 of 40 processors in use by local jobs (20.00%)
2 of 10 nodes active (20.00%)
eligible jobs----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
0 eligible jobs
blocked jobs-----------------------
JOBID USERNAME STATE PROCS WCLIMIT QUEUETIME
0 blocked jobs
Total job: 1
The checkjob command shows information about the provisioning job as well as the submitted job. If any errors occur, run the checkjob -v <jobid> command to diagnose failures.
B.0.12 xCAT Plug-in Configuration Parameters
Parameters that begin with an underscore character are specific to the xCAT plug-in. All others are common to every plug-in and may be set either in RMCFG[msm], where they apply to all plug-ins, or per plug-in in APPCFG[<plugin_name>].
Description | |
---|---|
Format | Double quoted string containing brief description of plugin. |
Default | --- |
Description | This information is not visible in Moab, but shows up in msmctl -a. |
Module | |
---|---|
Format | Moab::MSM::App::xCAT |
Default | --- |
Description | Name of the plugin module to load. |
_DORVitals | |
---|---|
Format | 0 or 1 |
Default | 0 |
Description | When set to 1, MSM will poll rvitals power and LED status (see the xCAT rvitals man page). This currently works only with IBM BMCs. To use this, xCAT should respond without error to the rvitals <noderange> watts and rvitals <noderange> leds commands. Status is reported as GMETRIC[watts] and GMETRIC[leds]. See also the _PowerString configuration parameter. |
_PowerString | |
---|---|
Format | single quote delimited string |
Default | 'AC Avg Power' |
Description | Only meaningful when used with _DORVitals=1. Some BMCs return multiple responses to the rvitals command, or use slightly different text to describe the power metrics. Use this parameter to control what is reported to Moab. You can use '$MSMLIBDIR/contrib/xcat/dump.xcat.cmd.pl rvitals <node_name> power' and examine the output to determine what the appropriate value of this string is. |
_DoxCATStats | |
---|---|
Format | 0 or 1 |
Default | 0 |
Description | If set to 1, MSM will track performance statistics about calls to xCAT and the performance of higher-level operations. The information is available via the script $MSMHOMEDIR/contrib/xcat/xcatstats.pl. This parameter is useful for tuning the POLLINTERVAL and _CQxCATSessions configuration parameters. |
_LockDir | |
---|---|
Format | Existing path on MSM host |
Default | $MSMHOMEDIR/lck |
Description | This is a path to where MSM maintains lock files to control concurrency with some Xen and KVM operations. |
_HVxCATPasswdKey | |
---|---|
Format | key value in the xCAT passwd table |
Default | vmware |
Description | This is where MSM gets the user/password to communicate with ESX hypervisors. |
_FeatureGroups | |
---|---|
Format | Comma delimited string of xCAT group names. |
Default | --- |
Description | MSM builds the OSLIST for a node as the intersection of _FEATUREGROUPS, the features specified in x_msm for that image, and the node's group membership. The value 'torque' is special: it indicates that the image uses Torque and that, when used in conjunction with the _ModifyTorque parameter, the node should be added to or removed from Torque during provisioning. |
_DefaultVMCProc | |
---|---|
Format | 1-? |
Default | 1 |
Description | If not explicitly specified in the create request, MSM will create VMs with this many processors. |
_DefaultVMCMemory | |
---|---|
Format | Positive integer values, minimum is determined by your vm image needs |
Default | 512 |
Description | If not specified, MSM will create VMs with this much memory allocated. |
_KVMStoragePath | |
---|---|
Format | Existing path on MSM host |
Default | /vms |
Description | File-backed disks for stateful KVM VMs are placed in this location. |
_ESXStore | |
---|---|
Format | Mountable NFS Path |
Default | --- |
Description | Location of ESX stores. |
_ESXCFGPath | |
---|---|
Format | Mountable NFS Path |
Default | ESXStore |
Description | Location of ESX VM configuration files. |
_XenHostInterfaces | |
---|---|
Format | Name of bridge device in your VM image |
Default | xenbr0 |
Description | Bridge device name passed to libvirt for network configuration of Xen VMs. |
_KVMHostInterfaces | |
---|---|
Format | Name of bridge device in your VM image |
Default | br0 |
Description | Bridge device name passed to libvirt for network configuration of KVM VMs. |
_VerifyRPower | |
---|---|
Format | 0 or 1 |
Default | 0 |
Description | If set, MSM will attempt to confirm that rpower requests were successful by polling the power state with rpower stat until the node reports the expected state, or _RPowerTimeOut is reached. NOTE: This can create significant load on the xCAT headnode. |
_RPowerTimeOut | |
---|---|
Format | Positive integer values |
Default | 60 |
Description | Only meaningful when used with _VerifyRPower. If nodes do not report the expected power state in this amount of time, a GEVENT will be produced on the node (or system job). |
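The interaction of _VerifyRPower and _RPowerTimeOut can be pictured as a poll loop with a deadline. The sketch below illustrates that idea only and is not the MSM implementation; the function name `wait_for_power_state` and the poll interval argument are hypothetical, and it assumes `rpower <node> stat` prints `<node>: <state>`.

```shell
#!/bin/sh
# Illustration only (NOT the MSM source): poll a node's power state until it
# matches the expected value or a timeout expires.
# $1 = node, $2 = expected state (on|off), $3 = timeout in seconds,
# $4 = seconds between polls (hypothetical knob, default 5).
wait_for_power_state() (
    node=$1
    want=$2
    timeout=$3
    interval=${4:-5}
    elapsed=0
    while :; do
        # ASSUMPTION: "rpower <node> stat" prints "<node>: <state>"
        state=$(rpower "$node" stat 2>/dev/null | awk -F': *' 'NR == 1 { print $2 }')
        [ "$state" = "$want" ] && exit 0
        # Timed out: this is the point at which MSM would raise a GEVENT
        [ "$elapsed" -ge "$timeout" ] && exit 1
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
)
```

Every poll is an extra request to xCAT, which is why enabling verification across many nodes adds load on the headnode.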
_QueueRPower | |
---|---|
Format | 0 or 1 |
Default | 0 |
Description | When set, this parameter will cause MSM to aggregate rpower requests to xCAT into batches. The timing and size of these batches is controlled with the _RPowerQueueAge and _RPowerQueueSize parameters. NOTE: This can significantly reduce load on the xCAT headnode, but will cause the power commands to take longer, and MSM shutdown to take longer. |
_RPowerQueueAge | |
---|---|
Format | Positive integer values |
Default | 30 |
Description | Only meaningful when used with _QueueRPower. MSM will send any pending rpower requests when the oldest request in the queue exceeds this value (seconds). |
_RPowerQueueSize | |
---|---|
Format | Positive integer values |
Default | 200 |
Description | Only meaningful when used with _QueueRPower. MSM will send any pending rpower requests when the queue depth exceeds this value. |
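Taken together, _RPowerQueueAge and _RPowerQueueSize describe a flush rule: send the batch when either threshold is exceeded. The sketch below restates that rule as a predicate for illustration; it is not the MSM implementation, and the function name `should_flush_rpower_queue` is hypothetical.

```shell
#!/bin/sh
# Illustration only (NOT the MSM source): decide whether queued rpower
# requests should be sent now.
# $1 = current queue depth, $2 = age in seconds of the oldest request,
# $3 = _RPowerQueueSize (default 200), $4 = _RPowerQueueAge (default 30).
should_flush_rpower_queue() (
    depth=$1
    oldest=$2
    max_size=${3:-200}
    max_age=${4:-30}
    [ "$depth" -gt 0 ] || exit 1      # nothing queued, nothing to send
    # "Exceeds" is read as strictly greater than the configured value
    [ "$depth" -gt "$max_size" ] || [ "$oldest" -gt "$max_age" ]
)
```

With the defaults, a lone request waits up to 30 seconds before it is sent, which matches the note that queued power commands take longer to complete.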
_ModifyTorque | |
---|---|
Format | 0 or 1 |
Default | 0 |
Description | When set, this parameter will cause MSM to add and remove nodes and VMs from Torque as required by provisioning. See the _FeatureGroups parameter as well. |
_ReportNETADDR | |
---|---|
Format | 0 or 1 |
Default | 0 |
Description | When set, this parameter will cause MSM to report NETADDR=<hosts.ip from xCAT>. |
_xCATHost | |
---|---|
Format | <xcat_headnode>:<xcatd_port> |
Default | localhost:3001 |
Description | Use to configure MSM to communicate with xCAT on another host. |