Moab Adaptive Computing Suite Administrator's Guide 5.4

5.0 Data Center Workflows

Moab provides support for today's data centers, which run a hybrid of traditional data center workflows and dynamic, service-oriented processes. The sections below describe how a data center can best take advantage of Moab's capabilities.

5.0.1 Transitioning Workflow

With its advanced scheduling capabilities, Moab can easily handle all the scheduling needs of data centers. Core resources can be protected while the workload is still optimized for the highest possible efficiency, productivity, and ROI. Almost all of the data center's existing structure and architecture is maintained. Additional steps must be taken to describe the workflow and related policies to Moab, which will then intelligently schedule the resources to meet the organization's goals. Transitioning to Moab involves some important steps:

  1. Determine existing business processes and related resources and policies.
  2. For each business process, determine and chart the process, making sure to include all required resources, start times, deadlines, internal dependencies, external dependencies, decision points and critical paths.
  3. Where possible, divide processes into functional groups that represent parts of the overall process that should be considered atomic or closely related.
  4. Translate functional groups and larger processes into programmatic units that will be scheduled by Moab.
  5. Build control infrastructure to manage programmatic units to form processes.
  6. Identify on-demand services and related resources and policies.
  7. Implement system-wide policies to support static and dynamic workloads.

Most data centers have already completed many, if not all, of these steps in the normal management of their systems. The other major part of the transition is applying the results of these steps to Moab to gain the desired behavior. The remainder of this section covers approaches and techniques you can use to accomplish these tasks.

5.1 Defining the Workflow

Workflow is the flow of data and processing through a dynamic series of tasks controlled by internal and external dependencies to accomplish a larger business process. As one begins the transition to Moab, it is important to have a clear understanding of the needed workflow and the underlying business processes it supports. Figure 1 shows a simple workflow diagram with multiple processing paths and external data dependencies.

Figure 1: Sample workflow diagram

It is important to create one or more diagrams such as this to document the workflow through one's system. This diagram provides a visual representation of the system, which clearly shows how data flows through the processing jobs, as well as all dependencies.

In this diagram, External Data is a type of external dependency. An external dependency is a dependency that is fulfilled external to the workflow (and maybe even the system). It is not uncommon for external dependencies to be fulfilled by processes completely external to the cluster on which Moab is running. Consequently, it is very important to clearly identify such dependencies, as special steps must be taken for Moab to be alerted when these dependencies are fulfilled.

Once all of the business processes have been succinctly described and diagrammed, it is possible to continue with the conversion process. The next step is to inventory the available resources. It is also at this point that the pairing of resources and workflow tasks is done.

5.2 Inventory of Resources

At the most basic level, Moab views the world as a group of resources onto which reservations are placed, managed, and optimized. It is aware of a large number of resource types, including a customizable generic resource that can be used to meet the needs of even the most complex data center or HPC center. Resources can be defined as software licenses, database connections, network bandwidth, specialized test equipment or even office space. However, the most common resource is a compute node.

5.2.1 Moab Node Structure

During the transition to Moab, the administrator must make an inventory of all relevant resources and then describe them to Moab. Because the approach with Moab differs from the traditional data center paradigm, this section focuses on the configuration of compute nodes. Figure 2 shows a basic view of Moab node structure.

Figure 2: Moab node structure

All clusters have a head node, which is where Moab resides. It is from this node that all compute jobs are farmed out to the compute nodes. These compute nodes need not be homogeneous. Figure 2 depicts a heterogeneous cluster where the different colors represent different compute node types (architecture, processors, memory, software, and so forth).

5.2.2 Defining Nodes in Moab

Moab interacts with the nodes through one or more resource managers. Common resource managers include TORQUE, SLURM, LSF, LoadLeveler and PBS. In general, the resource manager will provide Moab with most of the basic information about each node for which it has responsibility. This basic information includes number of processors, memory, disk space, swap space and current usage statistics. In addition to the information provided by the resource manager, the administrator can specify additional node information in the moab.cfg file. Following is an extract from the moab.cfg file section on nodes. Because this is a simple example, only a few node attributes are shown. However, a large number of possible attributes, including customizable features, are available. Additional node configuration documentation is available.

NODECFG[Node01] ARCH=ppc OS=osx
NODECFG[Node02] ARCH=ppc OS=osx
NODECFG[Node07] ARCH=i386 OS=suse10
NODECFG[Node08] ARCH=i386 OS=suse10
NODECFG[Node18] ARCH=opteron OS=centos CHARGERATE=2.0
NODECFG[Node19] ARCH=opteron OS=centos CHARGERATE=2.0

In the above example, there are six nodes. The architecture and operating system are specified for all of them. In addition, the administrator has enabled a double charge rate for jobs launched on the last two; the CHARGERATE attribute is used for accounting purposes.

Node attributes are considered when a job is scheduled by Moab. For example, if a specific architecture is requested, the job will only be scheduled on those nodes that have the required architecture.
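For example, given the node definitions above, a job can request a specific architecture at submission time. The following is a sketch only; the exact resource names accepted by -l depend on the resource manager in use, and job.cmd is an illustrative script name.

```shell
# Request one node with the i386 architecture; given the NODECFG
# entries above, only Node07 and Node08 would qualify.
# (Sketch only -- resource syntax varies by resource manager.)
msub -l nodes=1,arch=i386 job.cmd
```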

5.2.3 Mapping Workflow Requirements to Resources

Once relevant resources have been configured in Moab, it is necessary to map the different stages of the workflow to the appropriate resources. Each compute stage in the workflow diagrams needs to be mapped to one or more required resources. In some cases, specifying "No Preference" is also valid, meaning it does not matter what resources are used for the computation. This step also provides the opportunity to review the workflow diagrams to ensure that all required resources have an appropriate mapping in Moab.

5.3 Defining Job Groups

With the completed workflow diagrams, it is possible to identify functional groups. Functional groups can be demarcated along many different lines. The most important consideration is whether the jobs need to share information among themselves or have internal dependencies on one another. These functional groups are known as job groups. Job groups share a common name space wherein variables can be created, allowing for communication among the different processes.

Often job groups will be defined along business process lines. In addition, subgroups can also be defined to allow the workflow to be viewed in more manageable sections.

5.4 Setting the Schedule

As was previously mentioned, Moab views the world as a set of resources that can be scheduled. Scheduling of resources is accomplished by creating a reservation. Some reservations are created automatically, such as when a job is scheduled for launch. Other reservations can be created manually by the administrator. Manual reservations can either be static, meaning they are part of Moab's configuration (moab.cfg) or ad hoc, created as needed via the command line or the MCM tool.
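For example, an ad hoc reservation can be created from the command line with the mrsvctl command. The following is a sketch with illustrative host names and times; flag details may vary by Moab version.

```shell
# Create an ad hoc reservation covering Node01 and Node02,
# starting at 2:00 PM and lasting two hours.
# (Sketch only -- host expressions and time formats may vary by version.)
mrsvctl -c -h "Node0[1-2]" -s 14:00:00 -d 2:00:00
```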

5.4.1 Determining Scheduling Requirements

Within any data center, some jobs are high priority, while others are low priority. It is necessary to make sure resources are available to the high priority jobs, especially when these jobs are part of an SLA or part of a time-critical business process. Moab supports a number of configuration options and parameters to support these scheduling goals.

In this section we will look at the situation where part of the cluster needs to be reserved for certain processes during the day. Figure 3 shows a ten node cluster and its daily scheduling requirements. There are four service areas that must have sole access to their associated nodes during specific times of the day. All other jobs are allowed to float freely among the free processors. This example is typical of many data centers where workload is dependent on the time of day or day of week.

Figure 3: Daily standing reservations
[Chart: an hourly grid (12:00 AM through 11:00 PM) across Node01 through Node10, showing when each of four service areas has sole access to its nodes: Nightly Batch Processing, Daily Business Processes, Web Services, and Contracted Service Access.]

It is important to have a clear understanding of which jobs require hard reservations such as the ones shown here and which are more forgiving. A particular strength of Moab is its ability to schedule in a dynamic environment while still supporting many scheduling goals. The standing reservation approach limits Moab's ability to schedule dynamically, as it restricts what can be scheduled on certain nodes at certain times. Cycles lost in a standing reservation, because the jobs for which the reservation was made do not use the entire time block, cannot be reclaimed by Moab. However, standing reservations are a powerful way to guarantee resources for mission-critical processes and jobs.

5.4.2 Creating Standing Reservations

If it is determined that standing reservations are appropriate, they must be created in Moab. Standing reservations are created in the moab.cfg file. However, before the standing reservation can be defined, a quality of service (QoS) should be created that will specify which users are allowed to submit jobs that will run in the new standing reservations.

QoS is a powerful concept in Moab, often used to control access to resources, instigate different billing rates, and control certain administrative privileges. Additional information on QoS and standing reservations is available.

Like the standing reservations, QoS is defined in the moab.cfg file. The following example shows how to create the four QoS options that are required to implement the standing reservations shown in Figure 3. This example assumes the noted users (admin, datacenter, web, apache and customer1) have been previously defined in the moab.cfg file.

QOSCFG[nightly]   QFLAGS=DEADLINE,TRIGGER
QOSCFG[nightly]   MEMBERULIST=admin,datacenter

QOSCFG[daily]     QFLAGS=DEADLINE,TRIGGER
QOSCFG[daily]     MEMBERULIST=admin,datacenter

QOSCFG[web]       MEMBERULIST=web,apache

QOSCFG[contract1] QFLAGS=DEADLINE MEMBERULIST=customer1

The above example shows that four QoS options have been defined. The first and second QoS (nightly and daily) have special flags that allow the jobs to contain triggers and to have a hard deadline. All the QoS options contain a user list of those users who have access to submit to the QoS. Notice that multiple configuration lines are allowed for each QoS.

Once the QoS options have been created, the associated standing reservations must also be created. The following example shows how this is done in the moab.cfg file.

SRCFG[dc_night] STARTTIME=00:00:00 ENDTIME=04:00:00
SRCFG[dc_night] HOSTLIST=Node0[1-8]$ QOSLIST=nightly

SRCFG[dc_day]   STARTTIME=06:00:00 ENDTIME=19:00:00
SRCFG[dc_day]   HOSTLIST=Node0[1-5]$ QOSLIST=daily

SRCFG[websrv1]  STARTTIME=08:00:00 ENDTIME=16:59:59
SRCFG[websrv1]  HOSTLIST=Node0[7-8]$ QOSLIST=web
SRCFG[websrv2]  STARTTIME=06:00:00 ENDTIME=19:00:00
SRCFG[websrv2]  HOSTLIST=Node09 QOSLIST=web
SRCFG[websrv3]  STARTTIME=00:00:00 ENDTIME=23:59:59
SRCFG[websrv3]  HOSTLIST=Node10 QOSLIST=web

SRCFG[cust1]    STARTTIME=17:00:00 ENDTIME=23:59:59
SRCFG[cust1]    HOSTLIST=Node0[6-8]$ QOSLIST=contract1

In this example, we show the creation of the different standing reservations. Each of the standing reservations has a start and end time, as well as the host list and the associated QoS. Three separate standing reservations were used for the web services because of the different node and time period sets. This setup works well here because the web services jobs are small, serial jobs. In other circumstances, a different reservation structure would be needed for maximum performance.

5.4.3 Submitting a Job to a Standing Reservation

It is not uncommon for a particular user to have access to multiple QoS options. For instance, the QoS example in section 5.4.2 shows that the user datacenter has access to both the nightly and daily QoS options. Consequently, it is necessary to denote which QoS option is to be used when a job is submitted.

In the example below, the script nightly.job.cmd is being submitted using the QoS option nightly. Consequently, it will be able to run on the nodes reserved for that QoS option in section 5.4.2.

> msub -l qos=nightly nightly.job.cmd

5.5 Workflow Dependencies

With the different standing reservations and associated QoSs configured in Moab, the process of converting the workflow can continue. The next steps are to convert the compute jobs and build the dependency tree to support the workflow.

5.5.1 Converting Compute Jobs

Most compute jobs will require few, if any, changes to run under the new paradigm. All jobs will be submitted to Moab running on the head node. (See Figure 2.) Moab will then schedule the job based on a number of criteria, including user, QoS, standing reservations, dependencies and other job requirements. While the scheduling is dynamic, proper use of QoS and other policies will ensure the desired execution of jobs on the cluster.

Jobs may read and write files from a network drive. In addition, Moab will return all information written to STDOUT and STDERR in files denoted by the Job ID to the user who submitted the job.

> echo env | msub

40278

> ls -l

-rw------- 1 user1 user1   0 2007-04-03 13:28 STDIN.e40278
-rw------- 1 user1 user1 683 2007-04-03 13:28 STDIN.o40278

The above example shows the created output files. The user user1 submitted the env program to Moab, which will return the environment of the compute node on which it runs. As can be seen, two output files are created: STDIN.e40278 and STDIN.o40278 (representing the output on STDERR and STDOUT, respectively). The Job ID (40278) is used to denote which output files belong to which job. The first part of the file names, STDIN, is the name of the job script. In this case, because the job was submitted to msub's STDIN via a pipe, STDIN was used.

5.5.2 Introduction to Triggers

Triggers are one way to handle dependencies in Moab. This section introduces triggers and several of their key concepts. Additional trigger information is available.

In its simplest form, a trigger has three basic parts:

  1. An object
  2. An event
  3. An action

The concept of resources in Moab has already been introduced. In addition to resources, Moab is aware of a number of other entities, including the scheduler, resource managers, nodes and jobs. Each of these is represented in Moab as an object. Triggers may be attached to any of these object types, and an object instance can have multiple triggers attached to it.

In addition to an object, each trigger must have an event specified. This event determines when the trigger's action will take place. In Moab terminology, this is known as an event type. Some popular event types include cancel, create, end, start and threshold.

An action consists of two parts: (1) the action type and (2) the action command. The action type specifies what type of an action is to occur. Some examples include exec, internal and mail. The format of the action command is determined by the action type. For example, exec requires a command line, while mail needs an email address.

While triggers can be placed on any object type within Moab, the job object is the most useful for creating workflows. A trigger can be attached to an object at submission time through the use of msub and the -l flag.

> msub -l trig=AType=exec\&EType=start\&Action="job_check.pl" Job.cmd

In the previous example, the job Job.cmd is submitted with one trigger. This trigger will execute the job_check.pl script when the compute job starts on the compute node. It is important to note that all triggers are run on the head node, even those attached to compute jobs.

Often, triggers are attached to special jobs known as system jobs. A system job is simply an instantiation of a job object that, by default, does not require any resources and runs on the head node.

> msub -l flags=NORESOURCES,trig=AType=exec\&EType=start\&Action="job_check.pl" Job

Just like normal jobs, system jobs can serve as a job group. In other words, when forming job groups, one is associating one job with another. The job to which all others attach is known as the job group, and its variable name space can be used as a common repository for each of the child jobs.

> echo true | msub -N MyJobGroup -l flags=NORESOURCES
> msub -W x=JGroup:MyJobGroup Job1
> msub -W x=JGroup:MyJobGroup Job2

In the previous example, a system job is created and given the name MyJobGroup to simplify later use of the job group. Then two compute jobs, Job1 and Job2, are submitted and made part of the MyJobGroup job group, meaning they will have access to the variable name space of the first job.

As a security measure, only jobs submitted to a QoS with the TRIGGER flag can have triggers attached. An example of this configuration can be seen in the QoS definitions in section 5.4.2.

Another important point with triggers is that they are event-driven. This means that they operate outside of the normal Moab batch scheduling process. During each scheduling iteration, Moab evaluates all triggers to see if their event has occurred and if all dependencies are fulfilled. If this is the case, the trigger is executed, regardless of the priority of the job to which it is attached. This provides the administrator with another degree of control over the system.

By combining compute jobs, system jobs, and triggers, one can build fairly complex workflows. The next few sections cover additional information on how to map the previously created workflow diagram to these concepts.

5.5.3 Internal Dependencies

Internal dependencies are those that can be fulfilled by other jobs within the workflow. For example, if Job1 must complete before Job2, then Job1 is an internal dependency of Job2. Another possibility is that Job1 may also stage some data that Job3 requires, which is another type of internal dependency.

Internal dependencies are often handled through variables. Each trigger can see all the variables in the object to which it is attached. In the case of the object being a job, the trigger can also see the variables in any of the job's job group hierarchy. Triggers may require that certain variables simply exist or have a specific value in order to launch. Upon completion, triggers can set variables depending on the success or failure of the operation. In some instances, triggers are able to consume variables that already exist when they launch, removing them from the name space.
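For instance, a trigger can be made to wait for a variable and to set another variable on success by using the trigger's Requires and Sets attributes. The following is a sketch; process.pl and the variable names are illustrative.

```shell
# Trigger that waits until var3 exists in the job's name space before
# running process.pl; on successful completion it sets var7.
# (Sketch only -- attribute names follow the Moab trigger specification.)
msub -l trig=AType=exec\&EType=start\&Action="process.pl"\&Requires=var3\&Sets=var7 Job.cmd
```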

Figure 4: Jobs, job group and variables

The image above shows a simple setup where there is a job group (MyJobGroup) with two compute jobs associated with it (Job1 and Job2). In addition to their own variables, both Job1 and Job2 can see var1 and var2 because they are in MyJobGroup. However, Job1 cannot see var5 and var6, nor can Job2 see var3 and var4. Also, MyJobGroup cannot see var3, var4, var5 and var6. One can only see up the hierarchy tree.

Notice that var3 has a caret symbol (^) attached to the front. This means that when Job1 completes, var3 will be exported to the job group, MyJobGroup in this case. This can be seen in the image below, which shows the state of things after Job1 has completed.

Figure 5: Jobs, job group and variables after complete

With the completion of Job1, var3 has been exported to MyJobGroup. However, var4, which was not set to export, has been destroyed. At this point, Job2 can now see var3. If a trigger attached to Job2 required var3, it would now be able to run because its dependency is now fulfilled.

This is a very simple example of how variables can be used in conjunction with triggers. Detailed information on the interaction between triggers and variables is available.

5.5.4 External Dependencies

There are instances where dependencies must be fulfilled by entities external to the cluster. This may occur when external entities are staging in data from a remote source or when a job is waiting for the output of specialized test equipment. In instances such as these, Moab provides a method for injecting a variable into a job's name space from the command line, using the mjobctl command.

> mjobctl -m var=newvar1=1 MyJobGroup

In this example, the variable newvar1 with a value of 1 is injected into MyJobGroup, which was created in the job group example in section 5.5.2. This provides a simple method for allowing external entities to notify Moab when an external dependency has been fulfilled.

5.5.5 Cascading Triggers

Another approach that can be used when converting a workflow is the idea of a dynamic workflow, in which sections of the workflow are created on the fly through the use of triggers or other means. The technique of using triggers to accomplish this is known as cascading triggers, where triggers dynamically create other jobs and triggers, which in turn handle parts of the overall workflow.

Cascading triggers have the benefit of reducing the number of triggers and jobs in the system at any one time, thereby reducing administrative overhead. While not the perfect solution in every case, they provide a great tool for those using Moab in the data center. See the Case Studies for an example of their benefits.
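As a sketch of the technique, the action script of a higher-level trigger can itself submit the next level of jobs, each carrying its own trigger, so that each workflow section exists only while it is needed. The script and job names below are illustrative.

```shell
#!/bin/sh
# spawn_hourly.sh -- illustrative action script for a higher-level trigger.
# Instead of predefining every job at the start of the day, each run
# creates only the next level of jobs and their triggers on demand.
for i in 1 2 3; do
    msub -l trig=AType=exec\&EType=end\&Action="consolidate_hour.pl" hour_job_$i.cmd
done
```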

5.6 Dynamic Workload

A dynamic workload is a workload that changes over the course of a day, week, month or year. Moab supports this through the use of traditional HPC techniques. As jobs are submitted to Moab, they are evaluated for execution. This evaluation is based on a number of factors, including policies, QoS, and standing reservations, and assigns a priority to each job. These priorities are used when scheduling the job.

Where possible, Moab will backfill empty sections of the schedule with lower priority jobs if it will not affect the higher priority jobs. This scheduling occurs every scheduling iteration, allowing Moab to take advantage of changing situations. Because of the dynamic nature of Moab scheduling, it is able to handle the changing workload in a data center—providing services and resources for ad hoc jobs, as well as the normal, static workflow.

5.7 Supporting SLAs and Other Commitments

When SLAs are in place, the workflow must support these agreements. Contracted work must be accomplished accurately and on time. A number of Moab policies exist to ensure these goals can be met.

Moab has a number of different credential types which allow for the grouping of users in a large number of ways. Access rights can be given to each credential, defining limits on system usage for each credential. Users are allowed to have multiple credentials, thereby providing a rich set of access control mechanisms to the administrator.

Moab's scheduling algorithms are customizable on a number of different levels, allowing administrators to determine when and where certain jobs are allowed to run. The dynamic nature of the scheduling engine allows Moab to react to changing circumstances.

Moab's scheduling engine takes multiple circumstances and priorities into consideration when ordering jobs. Because Moab considers the future when scheduling, jobs may be scheduled to start at or complete by specific times. In addition, resources can be statically or dynamically allocated for jobs submitted via certain credentials. Policies can be put in place to describe Moab behavior with jobs that are not abiding by the imposed restrictions.
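For example, a user with access to a QoS carrying the DEADLINE flag can request that a job complete by a specific time. The following is a sketch; the exact deadline syntax may vary by Moab version, and report.job.cmd is an illustrative script name.

```shell
# Submit under the nightly QoS (configured with QFLAGS=DEADLINE above)
# and ask Moab to schedule the job so it completes by 5:00 PM.
# (Sketch only -- deadline syntax may vary by version.)
msub -l qos=nightly,deadline=17:00:00 report.job.cmd
```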

In combination, all these features allow the administrator to customize Moab behavior to perfectly match their data center's needs.

5.8 Case Studies

This section illustrates Moab functionality in the data center. All company names presented in this section are fictitious.

5.8.1 Cascading Triggers

A data collection company handles a large amount of incoming information each day. New information is available every five minutes for processing. The information is processed by their cluster in a series of steps. Each step consolidates the information from the previous steps and then generates a report for the given time period, as shown below:

  • Every 5 Minutes – Information gathered and "5 Minute" report generated
  • Every 15 Minutes – All new "5 Minute" data handled
  • Every Hour – All new "15 Minute" data handled
  • Every 3 Hours – All new "1 Hour" data handled
  • Every Day – All new "3 Hour" data handled

The original approach was to create the entire workflow structure for the entire day with a standing trigger that fired every night at midnight. However, this produced a large number of triggers, which made management of the system more difficult. Every "5 Minute" task requiring a trigger needed 288 triggers per day. This quickly made the output of mdiag -T very difficult for a human to parse, as multiple tasks, and therefore triggers, were required for each step.

To address this issue, the cascading triggers approach was adopted. With this approach, triggers are created only as needed. For example, each "3 Hour" trigger created its underlying "1 Hour" triggers when it was time to run.

The resulting reduction of triggers in the system was very impressive. Figure 6 and Figure 7 show the difference between the two approaches in the number of triggers when only one trigger is needed per step. The difference is even greater when more than one trigger is needed per step.

Figure 6: Non-Cascading Triggers

Figure 7: Cascading Triggers

The charts show the number of triggers over the full day that would be required if all jobs were generated at the beginning of the day versus a dynamic approach where jobs and their associated triggers were created only as needed. The difference in the two approaches is clear.