
10.10 Job Arrays

10.10-A Job Array Overview

You can submit an array of jobs to Moab via the msub command. Array jobs are an easy way to submit many sub-jobs that perform the same work using the same script but operate on different sets of data. Sub-jobs are the jobs created by an array job and are identified by the array job ID and an index; for example, in the identifier 235[1], 235 is the job array ID and 1 is the sub-job index.

Sub-jobs of an array are executed in sub-job index order.

Moab job arrays are different from TORQUE job arrays.

10.10-B Enabling Job Arrays

To enable job arrays, include the ENABLEJOBARRAYS parameter in the Moab configuration file (moab.cfg).

10.10-C Sub-job Definitions

Like a normal job, an array job submits a job script. It additionally has a start index (sidx), an end index (eidx), and an increment (incr), which Moab uses to create sub-jobs that all execute the same script. The number of sub-jobs is the end index minus the start index plus the increment, divided by the increment and rounded down: (eidx - sidx + incr) / incr.

To illustrate, suppose an array job has a start index of 1, an end index of 100, and an increment of 1. This array job creates (100 - 1 + 1) / 1 = 100 sub-jobs with indexes of 1, 2, 3, ..., 100. An increment of 2 produces (100 - 1 + 2) / 2 = 50 sub-jobs (50.5 rounded down) with indexes of 1, 3, 5, ..., 99. An increment of 2 with a start index of 2 produces (100 - 2 + 2) / 2 = 50 sub-jobs with indexes of 2, 4, 6, ..., 100. Again, sub-jobs are jobs in their own right; they only follow a slightly different naming convention, jobID[subJobIndex] (for example, mycluster.45[37] or 45[37]).
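The arithmetic above can be checked with a few lines of shell. This is only a sketch: subjob_count is a hypothetical helper, not part of Moab, and it assumes the division rounds down, as the 50-sub-job examples imply.

```shell
# Hypothetical helper (not a Moab command): number of sub-jobs for a
# given start index, end index, and increment, using integer division.
subjob_count() {
    sidx="$1"; eidx="$2"; incr="$3"
    echo $(( (eidx - sidx + incr) / incr ))
}

subjob_count 1 100 1   # prints 100
subjob_count 1 100 2   # prints 50
subjob_count 2 100 2   # prints 50
```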

10.10-D Using Environment Variables to Specify Array Index Values

A job script can read an environment variable to obtain its array index value and use it to form data file or directory names unique to that particular sub-job. The following two environment variables are supplied so that job scripts can determine their position in the array. To pass the environment parameters to the resource manager, use the msub command with the -V option or include the parameters in the job script; for example: #PBS -V MOAB_JOBARRAYRANGE.

Environment Parameter: MOAB_JOBARRAYINDEX
Description: Used to create dataset file names, directory names, and so forth, when splitting up a single problem into multiple jobs.

For example, a user may split up a problem into 20 separate jobs, each with its own input and output data files whose names contain the numbers 1-20.

To illustrate, assume a user submits the 20 sub-jobs using two msub commands: one to submit the ten odd-numbered jobs and one to submit the ten even-numbered jobs.

msub -t job1.[1-20:2]
msub -t job2.[2-20:2]

MOAB_JOBARRAYINDEX is then set to 1, 3, 5, 7, 9, 11, 13, 15, 17, and 19 in the ten sub-jobs of the first array job, and to 2, 4, 6, 8, 10, 12, 14, 16, 18, and 20 in the ten sub-jobs of the second array job.

Environment Parameter: MOAB_JOBARRAYRANGE
Description: The count of jobs in the array.
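A job script might use the index along the following lines. This is a minimal sketch, not a prescribed layout: the data_file_names helper and the input.N.dat and output.N.dat naming scheme are assumptions made for illustration, and the fallback value of 1 exists only so the script can be dry-run outside Moab.

```shell
#!/bin/sh
# Hypothetical helper: build per-sub-job file names from an array index.
# The naming scheme (input.N.dat / output.N.dat) is illustrative only.
data_file_names() {
    idx="$1"
    printf 'input.%s.dat output.%s.dat\n' "$idx" "$idx"
}

# Inside a sub-job, MOAB_JOBARRAYINDEX holds that sub-job's index.
# The ":-1" default is an assumption for testing outside the scheduler.
data_file_names "${MOAB_JOBARRAYINDEX:-1}"
```

Each sub-job then reads and writes its own files, so the 20 sub-jobs in the example above never collide on a file name.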

Control

Users can control individual sub-jobs in the same manner as normal jobs. In addition, an array job represents its group of sub-jobs and any user or administrator commands performed on an array job apply to its sub-jobs; for example, the command canceljob <arrayJobId> cancels all sub-jobs that belong to the array job. For more information about job control, see the documentation for the mjobctl command.

Reporting

The first example below shows checkjob output for a single sub-job. The fields specific to array sub-jobs include Parent Array ID, Array Index, and Array Range.

$ checkjob -v Moab.1[1]
job Moab.1[1]
				
AName: Moab
State: Running   
Creds:  user:user1  group:usergroup1  
WallTime:   00:00:17 of 8:20:00  
SubmitTime: Thu Nov  4 11:50:03    
(Time Queued  Total: 00:00:00  Eligible:   INFINITY)      
StartTime: Thu Nov  4 11:50:03    
Total Requested Tasks: 1      
Req[0]  TaskCount: 1  Partition: base      
Average Utilized Procs: 0.96    
NodeCount:  1      
Allocated Nodes:    
[node010:1]        

Job Group:        Moab.1
Parent Array ID:  Moab.1    
Array Index:      1    
Array Range:      10    
SystemID:   Moab    
SystemJID:  Moab.1[1]
Task Distribution: node010      
IWD:            /home/user1    
UMask:          0000     
Executable:     /opt/moab/spool/moab.job.3CvNjl      
StartCount:     1    
Partition List: base    
SrcRM:          internal  DstRM: base  DstRMJID: Moab.1[1]
Flags:          ARRAYJOB,GLOBALQUEUE    
StartPriority:  1    
PE:             1.00    
Reservation 'Moab.1[1]' (-00:00:19 -> 8:19:41  Duration: 8:20:00)

If you specify the array job ID without a sub-job index, the output displays all the jobs in the array.

$ checkjob -v Moab.1
job Moab.1
				
AName: Moab
Job Array Info:
  Name: Moab.1
  1 : Moab.1[1] : Running
  2 : Moab.1[2] : Running
  3 : Moab.1[3] : Running
  4 : Moab.1[4] : Running
  5 : Moab.1[5] : Running
  6 : Moab.1[6] : Running
  7 : Moab.1[7] : Running
  8 : Moab.1[8] : Running
  9 : Moab.1[9] : Running
  10 : Moab.1[10] : Running
  11 : Moab.1[11] : Running
  12 : Moab.1[12] : Running
  13 : Moab.1[13] : Running
  14 : Moab.1[14] : Running
  15 : Moab.1[15] : Running
  16 : Moab.1[16] : Running
  17 : Moab.1[17] : Running
  18 : Moab.1[18] : Running
  19 : Moab.1[19] : Running
  20 : Moab.1[20] : Running
  Totals:
    Active:   20
    Idle:     0
    Complete: 0

You can also use showq, which displays each array master job with a count of how many of its sub-jobs are in each queue (active, eligible, and blocked).

$ showq

active jobs------------------------
JOBID              USERNAME      STATE PROCS   REMAINING            STARTTIME

Moab.1(5)           aesplin    Running     5    00:52:41  Thu Jun 23 17:05:56
Moab.2(1)           aesplin    Running     1    00:53:41  Thu Jun 23 17:06:56

6 active jobs               6 of 6 processors in use by local jobs (100.00%)
1 of 1 nodes active      (100.00%)

eligible jobs----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

Moab.2(4)           aesplin       Idle     4     1:00:00  Thu Jun 23 17:06:56

4 eligible jobs

blocked jobs-----------------------
JOBID              USERNAME      STATE PROCS     WCLIMIT            QUEUETIME

Moab.2(1)           aesplin    Blocked     1     1:00:00  Thu Jun 23 17:06:56

1 blocked job

Total jobs:  11

Moab.1 has five sub-jobs running. Moab.2 has one sub-job running, four waiting to run, and one that is currently blocked.

10.10-E Job Array Cancellation Policies

A job array can be canceled based on the success or failure of its first sub-job, on the first success or failure of any sub-job, or when any sub-job exits with a specified exit code. The job array cancellation policies are:

Cancel Policy: CancelOnFirstFailure
Description: Cancels the job array if the first sub-job (JOBARRAYINDEX = 1) fails.
> msub -t myarray[1-1000]%50 -l ...,flags=CancelOnFirstFailure
Exclusivity: Mutually exclusive with CancelOnFirstSuccess

Cancel Policy: CancelOnFirstSuccess
Description: Cancels the job array if the first sub-job (JOBARRAYINDEX = 1) succeeds.
> msub -t myarray[1-1000]%50 -l ...,flags=CancelOnFirstSuccess
Exclusivity: Mutually exclusive with CancelOnFirstFailure

Cancel Policy: CancelOnAnyFailure
Description: Cancels the job array if any sub-job fails.
> msub -t myarray[1-1000]%50 -l ...,flags=CancelOnAnyFailure
Exclusivity: Mutually exclusive with the other "any sub-job" policies

Cancel Policy: CancelOnAnySuccess
Description: Cancels the job array if any sub-job succeeds.
> msub -t myarray[1-1000]%50 -l ...,flags=CancelOnAnySuccess
Exclusivity: Mutually exclusive with the other "any sub-job" policies

Cancel Policy: CancelOnExitCode
Description: Cancels the job array if any sub-job returns the specified exit code.
> msub -t myarray[1-1000]%50 -l ...,flags=CancelOnExitCode:<error code list>
Exclusivity: Mutually exclusive with the other "any sub-job" policies

The error code list consists of ranges specified with a dash and individual codes, delimited by a plus (+) sign; for example: 1-4+9+15

Exit codes 1-387 are accepted.
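To make the error code list syntax concrete, the sketch below tests whether a given exit code falls inside a list such as 1-4+9+15. It illustrates the notation only and says nothing about how Moab parses the list internally; code_in_list is a hypothetical helper name.

```shell
# Hypothetical helper: succeed (return 0) if exit code $1 is covered by
# the list $2, where ranges use a dash and entries are joined with "+".
code_in_list() {
    code="$1"
    old_ifs="$IFS"
    IFS='+'
    set -- $2            # split the list on "+" into positional parameters
    IFS="$old_ifs"
    for item in "$@"; do
        case "$item" in
            *-*) lo="${item%-*}"; hi="${item#*-}" ;;   # range such as 1-4
            *)   lo="$item"; hi="$item" ;;             # single code such as 9
        esac
        if [ "$code" -ge "$lo" ] && [ "$code" -le "$hi" ]; then
            return 0
        fi
    done
    return 1
}

code_in_list 3 "1-4+9+15" && echo "3 matches"
code_in_list 5 "1-4+9+15" || echo "5 does not match"
```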

 

Up to two cancellation policies can be specified for an array; the two policies must be delimited by a colon (:). The two "first sub-job" policies are mutually exclusive, as are the three "any sub-job" policies. You can use either "first sub-job" policy with one of the "any sub-job" policies, as shown in this example:

> msub -t myarray[1-1000]%50 -l ...,flags=CancelOnFirstFailure:CancelOnExitCode:3-7+11

10.10-F Examples

Operations can be performed on individual jobs, a selection of jobs in a job array, or on the entire array.

Submitting Job Arrays

The syntax for submitting job arrays is: msub -t [<jobname>]<indexlist>[%<limit>] arrayscript.sh

The <jobname> and <limit> are optional. The jobname does not override the jobID Moab assigns to the array. When submitting an array with a jobname, Moab returns the jobID, which is the scheduler name followed by a unique ID.

For example, if the scheduler name in moab.cfg is Moab (SCHEDCFG[Moab]), Moab responds to an array submitted with a jobname like this:

> msub -t myarray[1-10] job.sh

Moab.6

To specify that only a certain number of sub-jobs in the array can run at a time, use the percent sign (%) delimiter. In this example, only five sub-jobs in the array can run at a time:

> msub -t myarray[1-1000]%5

To submit a specific set of array sub-jobs, use the comma delimiter in the array index list:

> msub -t myarray[1,2,3,4]
> msub -t myarray[1-5,7,10]
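To preview which sub-job indexes an index list describes, a small expansion helper can be sketched as follows. expand_index_list is hypothetical and not part of Moab. It handles the comma and range forms shown above plus the colon step form used elsewhere in this section; whether Moab accepts every combination this helper accepts is not confirmed here.

```shell
# Hypothetical helper: print one index per line for a list such as
# "1-5,7,10" or "2-10:2" (the text inside the square brackets).
expand_index_list() {
    old_ifs="$IFS"
    IFS=','
    set -- $1            # split the list on commas
    IFS="$old_ifs"
    for item in "$@"; do
        step=1
        case "$item" in
            *:*) step="${item#*:}"; item="${item%%:*}" ;;  # strip ":step"
        esac
        case "$item" in
            *-*) first="${item%-*}"; last="${item#*-}" ;;  # range
            *)   first="$item"; last="$item" ;;            # single index
        esac
        i="$first"
        while [ "$i" -le "$last" ]; do
            printf '%s\n' "$i"
            i=$((i + step))
        done
    done
}

expand_index_list "1-5,7,10"
```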

You can use the checkjob command on either the jobID or the jobname you specified.

> msub -t myarray[1-2] job.sh

Moab.10

$ checkjob -v myarray
  job Moab.10

AName: myarray
Job Array Info:
   Name: Moab.10
   1 : Moab.10[1] : Running
   2 : Moab.10[2] : Running

   Sub-jobs:           2
     Active:           2 ( 100.0% )
     Eligible:         0 ( 0.0% )
     Blocked:          0 ( 0.0% )
     Completed:        0 ( 0.0% )

State: Idle
Creds:  user:tuser1  group:tgroup1
WallTime:   00:00:00 of 99:23:59:59
SubmitTime: Thu Jun  2 16:37:17
   (Time Queued  Total: 00:00:33  Eligible: 00:00:00)

Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL

To submit an array with a step size, add a colon followed by the step size to the array range. The example below requests a step size of 2, so the sub-job indexes advance by 2 within the index range. The array master job is named as explained above.

$ msub -t myarray[2-10:2] job.sh

job Moab.15

$ checkjob -v myarray #or you could use 'checkjob -v Moab.15'
job Moab.15

AName: myarray
Job Array Info:
   Name: Moab.15
   2 : Moab.15[2] : Running
   4 : Moab.15[4] : Running
   6 : Moab.15[6] : Running
   8 : Moab.15[8] : Running   
   10 : Moab.15[10] : Running

   Sub-jobs:           5
     Active:           5 ( 100.0% )
     Eligible:         0 ( 0.0% )
     Blocked:          0 ( 0.0% )
     Completed:        0 ( 0.0% )

State: Idle
Creds:  user:tuser1  group:tgroup1
WallTime:   00:00:00 of 99:23:59:59
SubmitTime: Thu Jun  2 16:37:17
   (Time Queued  Total: 00:00:33  Eligible: 00:00:00)

Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
