Moab can be used to collect information from multiple scattered resources. Beyond information collection, Moab can also be set up to perform automated diagnostics, produce summary reports, and initiate automated resource recovery as well as event-based and threshold-based reprovisioning. Managed resources can include compute clusters, network resources, storage resources, license resources, system services, applications, and even databases.
While significant flexibility is possible, a simple approach for monitoring and managing resources involves setting up a Moab Information Daemon (minfod) to access each of the resources to be monitored. These minfod daemons collect configuration, state, load, and other usage information and report it back to one or more central Moab daemons. The central Moab is responsible for assembling this information, handling conflict resolution, identifying critical events, generating reports, and performing various automated actions.
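The assemble-and-resolve step described above can be sketched in Python. This is an illustration only, not Moab's implementation: the `NodeReport` type, the source names, the state strings, and the "newest report wins" conflict rule are all assumptions introduced here, since the text does not specify how Moab resolves conflicting reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeReport:
    source: str      # hypothetical label for the reporting daemon
    node: str        # node the report describes
    timestamp: int   # seconds since the epoch when the data was sampled
    state: str       # illustrative values such as "Idle", "Busy", "Down"
    load: float      # load average reported for the node


def merge_reports(reports):
    """Return one report per node, keeping the most recently sampled one.

    The "newest wins" rule is an assumption made for illustration only.
    """
    latest = {}
    for report in reports:
        current = latest.get(report.node)
        if current is None or report.timestamp > current.timestamp:
            latest[report.node] = report
    return latest


def critical_nodes(merged, load_threshold):
    """List nodes that are down or whose load exceeds the threshold."""
    return sorted(
        node
        for node, report in merged.items()
        if report.state == "Down" or report.load > load_threshold
    )


# Two daemons report on node01; the later sample takes precedence.
reports = [
    NodeReport("minfod-ganglia", "node01", 100, "Busy", 3.5),
    NodeReport("minfod-ipmi", "node01", 160, "Down", 0.0),
    NodeReport("minfod-ganglia", "node02", 120, "Idle", 0.1),
]
merged = merge_reports(reports)
print(critical_nodes(merged, load_threshold=8.0))  # prints ['node01']
```

A timestamp-based rule is only one option; a real deployment might instead rank sources by trust or merge attributes field by field.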
The minfod daemon can be configured to import information from most existing HPC information sources, including both specialized application APIs and general communication standards. These interfaces include IPMI, Ganglia, SQL, Nagios, HTTP services, Web/SOAP-based services, flat files, LSF, TORQUE/PBS, LoadLeveler, SLURM, locally developed scripts, network routers, license managers, and so forth.
The information service feature takes advantage of the Moab peer-to-peer communication facility, identity management interface, generic event/metric facilities, generalized resource management infrastructure, and advanced accounting/reporting capabilities. With these technologies, solutions ranging from pure information services to more active systems that perform resource healing and automated load-balancing can be created.
With the flexibility of Moab, hybrid solutions anywhere along the active monitoring spectrum can be enabled. Services and resources associated with both open source/open standard protocols and vendor-specific protocols can be integrated and simultaneously managed by Moab. The information gathered by Moab can be exported in real time to a database, as HTML, or as a Web service. This flexibility makes the information immediately usable through both human-readable and machine-readable interfaces.
Organizations use this capability for multiple purposes including the following:
The objective of this project is to create a centralized service that can assist users in better utilizing geographically distributed resources within a loosely coupled grid. In this grid, many independent clusters exist, but many jobs may only be able to use a portion of the available resources due to architectural and environmental differences from cluster to cluster. The information service must provide information to both users and services to allow improved decisions regarding job-to-resource mapping.
To address this, a centralized Moab information service is created that collects information from each of the participating clusters. On each cluster where Moab is already managing the local workload, the existing cluster-level Moab is configured to report the needed information to the central Moab daemon. On each cluster where another system is managing the local workload, a Moab Information Service Daemon (minfod) is started.
Because load-balancing information is required, the Moab daemon running on each cluster is configured to report backlog and start estimate information using the REPORTPEERSTARTINFO parameter.
To make information available via a Web service, the cluster.mon.ws.pl service is started on the master Moab node, allowing Moab to receive Web service-based requests and report responses in XML over SOAP. To allow human-readable browser access to the same information and services, the local Web service is configured to use the moab.is.cgi script to drive the Web service interface and report results via a standard Web page.
Due to the broad array of users within the grid, many types of information are provided. This information includes the following:
With these queries, users and services can obtain and process raw resource information or can ask a question as simple as "What is the best cluster for this request?"
ENABLESTARTESTIMATESTATS  TRUE
REPORTPEERSTARTINFO       TRUE
...
RMCFG[clusterA] SERVER=moab://clusterA.bnl.gov
RMCFG[clusterB] SERVER=moab://clusterB.qrnl.gov
RMCFG[clusterC] SERVER=moab://clusterC.ocsa.edu
RMCFG[clusterD] SERVER=moab://clusterD.ocsa.edu
...
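When many clusters take part, peer definitions like those above are often generated rather than typed by hand. The following is a minimal sketch, assuming each peer needs only the `RMCFG[name] SERVER=moab://host` form shown in this section; the helper name `render_peer_config` and the name check are hypothetical, and any further RMCFG attributes a site requires are not covered.

```python
def render_peer_config(peers):
    """Render one RMCFG line per peer from a {name: host} mapping.

    Only the SERVER=moab://host form shown in this section is produced.
    """
    lines = []
    for name, host in peers.items():
        # Reject names that would break the bracketed RMCFG[...] syntax.
        if not name or any(ch in name for ch in "[] \t"):
            raise ValueError(f"invalid peer name: {name!r}")
        lines.append(f"RMCFG[{name}] SERVER=moab://{host}")
    return "\n".join(lines)


# Host names are copied from the configuration shown above.
PEERS = {
    "clusterA": "clusterA.bnl.gov",
    "clusterB": "clusterB.qrnl.gov",
    "clusterC": "clusterC.ocsa.edu",
    "clusterD": "clusterD.ocsa.edu",
}

print(render_peer_config(PEERS))
```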
> mdiag -t -v
Partition Status

System Partition Settings:  PList: clusterA,clusterB

Name          Procs
ALL            1400
clusterA        800  RM=clusterA
clusterB        600  RM=clusterB

Partition   Configured      Up      U/C  Dedicated      D/U  Active      A/U

Nodes ----------------------------------------------------------------------------
ALL                700     700  100.00%        650   86.67%     647   85.39%
clusterA           400     400  100.00%          0    0.00%       0    0.00%
clusterB           300     300  100.00%          1  100.00%       1  100.00%

Processors ----------------------------------------------------------------------------
ALL               1400    1400   84.21%          2   12.50%       2   12.50%
clusterA           800     800   84.21%          2   12.50%       2   12.50%
clusterB           600     600   84.21%          2   12.50%       2   12.50%
...

Backlog     BacklogPS  BacklogDuration  AvgQTime
clusterA     13472.00         00:14:27  00:22:14
clusterB      7196.00         00:07:55  00:07:06
...
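A service that consumes this output could answer the "best cluster" question from the backlog figures. The sketch below is one possible heuristic, not Moab's own selection logic: it copies the clusterA and clusterB backlog values from the output above into a dictionary and picks the cluster with the shortest backlog duration. The field names and the choice of BacklogDuration as the ranking key are assumptions made for illustration.

```python
def hms_to_seconds(text):
    """Convert an HH:MM:SS string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Values copied from the Backlog rows of the mdiag output shown above.
BACKLOG = {
    "clusterA": {
        "backlog_ps": 13472.00,
        "backlog_duration": "00:14:27",
        "avg_qtime": "00:22:14",
    },
    "clusterB": {
        "backlog_ps": 7196.00,
        "backlog_duration": "00:07:55",
        "avg_qtime": "00:07:06",
    },
}


def best_cluster(backlog):
    """Pick the cluster with the shortest backlog duration.

    Ranking on backlog duration alone is an illustrative assumption.
    """
    return min(
        backlog,
        key=lambda name: hms_to_seconds(backlog[name]["backlog_duration"]),
    )


print(best_cluster(BACKLOG))  # prints clusterB
```

A fuller answer would also check that the request fits the cluster's architecture and environment, which is the constraint the project description highlights.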