21.5 Information Services for Enterprises and Grids

Moab can be used to collect information from multiple scattered resources. Beyond information collection, Moab can also be set up to perform automated diagnostics, produce summary reports, and initiate automated resource recovery and event- and threshold-based reprovisioning. Managed resources can include compute clusters, network resources, storage resources, license resources, system services, applications, and even databases.

21.5.1 General Collection Infrastructure

While significant flexibility is possible, a simple approach for monitoring and managing resources involves setting up a Moab Information Daemon (minfod) to access each of the resources to be monitored. These minfod daemons collect configuration, state, load, and other usage information and report it back to one or more central Moab daemons. The central Moab is responsible for assembling this information, handling conflict resolution, identifying critical events, generating reports, and performing various automated actions.

The minfod daemon can be configured to import information from most existing HPC information sources, including both specialized application APIs and general communication standards. These interfaces include IPMI, Ganglia, SQL, Nagios, HTTP services, Web/SOAP-based services, flat files, LSF, TORQUE/PBS, LoadLeveler, SLURM, locally developed scripts, network routers, license managers, and so forth.
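As a concrete illustration, the snippet below sketches how two of these sources might be attached using Moab's resource-manager interface syntax. The interface names and the query-script path are illustrative placeholders, not shipped defaults:

```
# Import node state and workload information from a local TORQUE/PBS server
RMCFG[torque]  TYPE=PBS

# Import host metrics via a site-provided Ganglia query script
# (script name and path are illustrative)
RMCFG[ganglia] TYPE=NATIVE CLUSTERQUERYURL=exec:///opt/moab/tools/node.query.ganglia.pl
```

The NATIVE interface type is what makes locally developed scripts usable as information sources: any executable that emits node data in the expected format can be plugged in via a query URL.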

The information service uses the Moab peer-to-peer communication facility, the identity management interface, generic event/metric facilities, the generalized resource management infrastructure, and advanced accounting/reporting capabilities to perform resource healing and automated load-balancing.

Moab can be used with various hybrid solutions. Services and resources associated with both open source/open standard protocols and vendor-specific protocols can be integrated and simultaneously managed by Moab. In real time, the information gathered by Moab can be exported to a database, as HTML, or as a Web service. This flexibility allows the information to be of immediate use via both human-readable and machine-readable interfaces.

21.5.2 Sample Uses

Organizations use this capability for multiple purposes, including consolidated resource monitoring, summary reporting, automated resource recovery, and load-balancing across clusters.

21.5.3 General Configuration Guidelines

  1. Establish peer relationships between information service daemons (minfod or moab).
  2. (optional) Enable Starttime Estimation Reporting if manual or automated load-balancing is to occur.
    • Set ENABLESTARTESTIMATESTATS to generate local start estimation statistics.
    • Set REPORTPEERSTARTINFO to report start estimate information to peers.
  3. (optional) Enable Generic Event/Generic Metric Triggers if automated resource recovery or alerts are to be used.
  4. (optional) Enable automated periodic reporting.
  5. (optional) Enable automated data/job staging and environmental translation.
  6. (optional) Enable automated load/event based resource provisioning.
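Steps 2 and 3 above can be sketched as a few lines in the Moab configuration file. The start-estimation parameters are those named in step 2; the trigger line uses Moab's generic-metric trigger syntax, where the metric name "temp", the threshold value, and the recovery-script path are illustrative assumptions rather than defaults:

```
# Step 2: generate start-time estimates locally and share them with peers
ENABLESTARTESTIMATESTATS TRUE
REPORTPEERSTARTINFO      TRUE

# Step 3: run a recovery script when a generic metric crosses a threshold
# (metric name, threshold, and script path are illustrative)
NODECFG[DEFAULT] TRIGGER=atype=exec,etype=threshold,threshold=gmetric[temp]>80,action="/opt/moab/tools/node.recover.pl $OID"
```

With triggers attached to the default node definition, any monitored node that reports the metric above the threshold launches the recovery action automatically, without operator intervention.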

21.5.4 Examples

21.5.4.1 Grid Resource Availability Information Service

The objective of this project is to create a centralized service that can assist users in better utilizing geographically distributed resources within a loosely coupled grid. In this grid, many independent clusters exist, but many jobs may only be able to use a portion of the available resources due to architectural and environmental differences from cluster to cluster. The information service must provide information to both users and services to allow improved decisions regarding job-to-resource mapping.

To address this, a centralized Moab information service is created that collects information from each of the participating clusters. On each cluster where Moab is already managing the local workload, the existing cluster-level Moab is configured to report the needed information to the central Moab daemon. On each cluster where another system is managing the local workload, a Moab Information Service Daemon (minfod) is started.

Because load-balancing information is required, the Moab daemon running on each cluster is configured to report backlog and start estimate information using the REPORTPEERSTARTINFO parameter.

To make this information available as a Web service, the cluster.mon.ws.pl service is started on the master Moab node, allowing Moab to receive Web service based requests and report responses in XML over SOAP. To allow human-readable browser access to the same information and services, the local Web server is configured to use the moab.is.cgi script to drive the Web service interface and report results via a standard Web page.

Due to the broad array of users within the grid, many types of information are provided, ranging from raw configuration, state, and load data to per-cluster backlog and start-time estimates.

With these queries, users and services can obtain and process raw resource information, or can ask a question as simple as, "What is the best cluster for this request?"

ENABLESTARTESTIMATESTATS TRUE
REPORTPEERSTARTINFO      TRUE
...
RMCFG[clusterA] SERVER=moab://clusterA.bnl.gov
RMCFG[clusterB] SERVER=moab://clusterB.qrnl.gov
RMCFG[clusterC] SERVER=moab://clusterC.ocsa.edu
RMCFG[clusterD] SERVER=moab://clusterD.ocsa.edu
...
> mdiag -t -v
Partition Status
System Partition Settings:  PList: clusterA,clusterB  
Name                    Procs
ALL                      1400
clusterA                  800
  RM=clusterA
clusterB                  600
  RM=clusterB
Partition    Configured         Up     U/C  Dedicated     D/U     Active     A/U
Nodes ----------------------------------------------------------------------------
ALL                 700        700 100.00%        650  86.67%        647  85.39%
clusterA            400        400 100.00%          0   0.00%          0   0.00%
clusterB            300        300 100.00%          1 100.00%          1 100.00%
Processors ----------------------------------------------------------------------------
ALL                1400       1400  84.21%          2  12.50%          2  12.50%
clusterA            800        800  84.21%          2  12.50%          2  12.50%
clusterB            600        600  84.21%          2  12.50%          2  12.50%
...
Backlog
             BacklogPS  BacklogDuration  AvgQTime
clusterA      13472.00         00:14:27  00:22:14 
clusterB       7196.00         00:07:55  00:07:06
...
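Beyond the aggregate view above, individual users can drill into availability with the standard Moab client commands. The exact queries depend on the deployment, but a session might look like the following (the job ID is a placeholder):

```
# Show free resources and backfill windows within a given partition
> showbf -p clusterA

# Show the estimated start time for an idle job
> showstart <jobid>
```

Combined with the backlog and start-estimate data reported by each cluster, these queries let a user or a submission service pick the cluster likely to start a job soonest.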

Copyright © 2011 Adaptive Computing Enterprises, Inc.®