11.11 Job Arrays
You can submit an array of jobs to Moab via the msub command. Array jobs are an easy way to submit many sub-jobs that perform the same work using the same script, but operate on different sets of data. Sub-jobs are the jobs created by an array job and are identified by the array job ID and an index; for example, if 235[1] is an identifier, the number 235 is the job array ID and 1 is the sub-job index.
Sub-jobs of an array are executed in sub-job index order.
The job array feature, new in Moab 6.0, does not integrate natively with TORQUE support for job arrays.
To enable job arrays, include the ENABLEJOBARRAYS parameter in the Moab configuration file (moab.cfg).
Like a normal job, an array job submits a job script, but it also has a start index (sidx), an end index (eidx), and an increment (incr). Moab uses these three values to create the sub-jobs, all of which execute the same script. The number of sub-jobs created is the end index minus the start index plus the increment, divided by the increment: (eidx - sidx + incr) / incr.
To illustrate, suppose an array job has a start index of 1, an end index of 100, and an increment of 1. This array job creates (100 - 1 + 1) / 1 = 100 sub-jobs with indexes of 1, 2, 3, ..., 100. An increment of 2 produces (100 - 1 + 2) / 2 = 50 sub-jobs (the quotient 50.5 is truncated to a whole number) with indexes of 1, 3, 5, ..., 99. An increment of 2 with a start index of 2 produces (100 - 2 + 2) / 2 = 50 sub-jobs with indexes of 2, 4, 6, ..., 100. Again, sub-jobs are jobs in their own right; they differ only in their naming convention, jobID[subJobIndex], where jobID is the array job ID (e.g. mycluster.45 or 45).
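The counting rule above can be sketched in a few lines of shell. The helper names subjob_count and subjob_indexes are hypothetical and are not Moab commands; the sketch assumes integer division, which is what the worked examples imply.

```shell
#!/bin/sh
# Hypothetical helper (not part of Moab): number of sub-jobs an array job
# would create, using (eidx - sidx + incr) / incr with integer division.
subjob_count() {
    sidx=$1; eidx=$2; incr=$3
    echo $(( (eidx - sidx + incr) / incr ))
}

# Hypothetical helper: print the sub-job indexes, one per line.
subjob_indexes() {
    i=$1
    while [ "$i" -le "$2" ]; do
        echo "$i"
        i=$(( i + $3 ))
    done
}

subjob_count 1 100 2     # prints 50
subjob_indexes 2 10 2    # prints 2, 4, 6, 8 and 10 on separate lines
```

The arguments are start index, end index, and increment, in that order.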
The script can use an environment variable to obtain the array index value and form data file or directory names unique to a particular sub-job. Moab supplies the following two environment variables so that a job script can determine its position in the array. To pass them to the resource manager, use the msub command with the -V option, or include the parameters in the job script; for example: #PBS -V MOAB_JOBARRAYRANGE.
MOAB_JOBARRAYINDEX: Used to create dataset file names, directory names, and so forth, when splitting up a single problem into multiple jobs.
For example, a user may split up a problem into 20 separate jobs, each with its own input and output data files whose names contain the numbers 1-20.
To illustrate, assume a user submits the 20 sub-jobs using two msub commands: one to submit the ten odd-numbered jobs and one to submit the ten even-numbered jobs.
msub -t job1.[1-20:2]
msub -t job2.[2-20:2]
The MOAB_JOBARRAYINDEX environment variable takes the values 1, 3, 5, 7, 9, 11, 13, 15, 17, and 19 in the first array job's ten sub-jobs, and 2, 4, 6, 8, 10, 12, 14, 16, 18, and 20 in the second array job's ten sub-jobs.
MOAB_JOBARRAYRANGE: The count of jobs in the array.
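As a minimal sketch of how a job script might use these variables, the following builds per-sub-job file names from the index. The file-name pattern (input.N.dat, output.N.dat) and the fallback values are assumptions made for illustration; only the two MOAB_* variable names come from the text above.

```shell
#!/bin/sh
# Sketch of an array job script. The input/output file-name pattern is an
# assumption for illustration; only the two MOAB_* variables come from Moab.

# Fall back to defaults so the script can be tried outside of Moab.
idx=${MOAB_JOBARRAYINDEX:-1}
count=${MOAB_JOBARRAYRANGE:-unknown}

# Each sub-job derives its own data file names from its index.
input_file="input.${idx}.dat"
output_file="output.${idx}.dat"

echo "sub-job ${idx} (array count: ${count}): ${input_file} -> ${output_file}"
```

Because every sub-job runs the same script, the index is the only thing that separates one sub-job's data files from another's.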
Users can control individual sub-jobs in the same manner as normal jobs. In addition, an array job represents its group of sub-jobs and any user or administrator commands performed on an array job apply to its sub-jobs; for example, the command canceljob <arrayJobId> cancels all sub-jobs that belong to the array job. For more information about job control, see the documentation for the mjobctl command.
In the first example below, the fields unique to array sub-jobs include Parent Array ID, Array Index, and Array Range.
$ checkjob -v Moab.1[1]
job Moab.1[1]
AName: Moab
State: Running
Creds: user:user1 group:usergroup1
WallTime: 00:00:17 of 8:20:00
SubmitTime: Thu Nov 4 11:50:03
(Time Queued Total: 00:00:00 Eligible: INFINITY)
StartTime: Thu Nov 4 11:50:03
Total Requested Tasks: 1
Req TaskCount: 1 Partition: base
Average Utilized Procs: 0.96
NodeCount: 1
Allocated Nodes:
[node010:1]
Job Group: Moab.1
Parent Array ID: Moab.1
Array Index: 1
Array Range: 10
SystemID: Moab
SystemJID: Moab.1
Task Distribution: node010
IWD: /home/user1
UMask: 0000
Executable: /opt/moab/spool/moab.job.3CvNjl
StartCount: 1
Partition List: base
SrcRM: internal DstRM: base DstRMJID: Moab.1
Flags: ARRAYJOB,GLOBALQUEUE
StartPriority: 1
PE: 1.00
Reservation 'Moab.1' (-00:00:19 -> 8:19:41 Duration: 8:20:00)
If the array range is not provided, the output displays all the jobs in the array.
$ checkjob -v Moab.1
job Moab.1
AName: Moab
Job Array Info:
Name: Moab.1
1 : Moab.1[1] : Running
2 : Moab.1[2] : Running
3 : Moab.1[3] : Running
4 : Moab.1[4] : Running
5 : Moab.1[5] : Running
6 : Moab.1[6] : Running
7 : Moab.1[7] : Running
8 : Moab.1[8] : Running
9 : Moab.1[9] : Running
10 : Moab.1[10] : Running
11 : Moab.1[11] : Running
12 : Moab.1[12] : Running
13 : Moab.1[13] : Running
14 : Moab.1[14] : Running
15 : Moab.1[15] : Running
16 : Moab.1[16] : Running
17 : Moab.1[17] : Running
18 : Moab.1[18] : Running
19 : Moab.1[19] : Running
20 : Moab.1[20] : Running
Totals:
Active: 20
Idle: 0
Complete: 0
You can also use showq. This displays the array master job with a count of how many sub-jobs are in each queue.
$ showq
active jobs------------------------
JOBID      USERNAME  STATE    PROCS  REMAINING  STARTTIME
Moab.1(5)  aesplin   Running  5      00:52:41   Thu Jun 23 17:05:56
Moab.2(1)  aesplin   Running  1      00:53:41   Thu Jun 23 17:06:56
6 active jobs
6 of 6 processors in use by local jobs (100.00%)
1 of 1 nodes active (100.00%)
eligible jobs----------------------
JOBID      USERNAME  STATE    PROCS  WCLIMIT    QUEUETIME
Moab.2(4)  aesplin   Idle     4      1:00:00    Thu Jun 23 17:06:56
4 eligible jobs
blocked jobs-----------------------
JOBID      USERNAME  STATE    PROCS  WCLIMIT    QUEUETIME
Moab.2(1)  aesplin   Blocked  1      1:00:00    Thu Jun 23 17:06:56
1 blocked job
Total jobs: 11
Moab.1 has five sub-jobs running. Moab.2 has one sub-job running, four waiting to run, and one that is currently blocked.
Job arrays can be canceled based on the success or failure of the first sub-job, the first success or failure of any sub-job, or if any sub-job exits with a specified exit code. The job array cancellation policies are:
CancelOnFirstFailure: Cancels the job array if the first sub-job (JOBARRAYINDEX = 1) fails.
> msub -t myarray[1-1000]%50 -l ...,flags=CancelOnFirstFailure
CancelOnFirstSuccess: Cancels the job array if the first sub-job (JOBARRAYINDEX = 1) succeeds.
> msub -t myarray[1-1000]%50 -l ...,flags=CancelOnFirstSuccess
CancelOnAnyFailure: Cancels the job array if any sub-job fails.
> msub -t myarray[1-1000]%50 -l ...,flags=CancelOnAnyFailure
CancelOnAnySuccess: Cancels the job array if any sub-job succeeds.
> msub -t myarray[1-1000]%50 -l ...,flags=CancelOnAnySuccess
CancelOnExitCode: Cancels the job array if any sub-job returns the specified exit code.
> msub -t myarray[1-1000]%50 -l ...,flags=CancelOnExitCode:<error code list>
The error code list consists of ranges specified with a dash and individual codes delimited by a plus (+) sign, such as: 1-4+9+15
Exit codes 1-387 are accepted.
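To make the list syntax concrete, here is a small sketch that expands an error code list into individual codes. The helpers expand_exit_codes and exit_code_in_list are hypothetical and only illustrate the dash and plus notation; they do not show how Moab parses the list internally.

```shell
#!/bin/sh
# Hypothetical helper (not part of Moab): expand a list such as 1-4+9+15
# into one exit code per line.
expand_exit_codes() {
    # Split on '+', then expand each dash-separated range.
    for part in $(echo "$1" | tr '+' ' '); do
        case $part in
            *-*) lo=${part%-*}; hi=${part#*-} ;;
            *)   lo=$part; hi=$part ;;
        esac
        i=$lo
        while [ "$i" -le "$hi" ]; do
            echo "$i"
            i=$(( i + 1 ))
        done
    done
}

# Hypothetical helper: succeed if exit code $1 appears in list $2.
exit_code_in_list() {
    expand_exit_codes "$2" | grep -qx "$1"
}

expand_exit_codes "1-4+9+15"    # prints 1, 2, 3, 4, 9 and 15 on separate lines
```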
Up to two cancellation policies can be specified for an array, and the two policies must be delimited by a colon (:). The two "first sub-job" policies are mutually exclusive, as are the three "any sub-job" policies. You can use either "first sub-job" policy with one of the "any sub-job" policies, as shown in this example:
> msub -t myarray[1-1000]%50 -l ...,flags=CancelOnFirstFailure:CancelOnExitCode:3-7+11
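The combination rule can be expressed as a short check. This is a sketch under the rules stated above; the helper valid_policy_combo is hypothetical, takes each policy as a separate argument, and is not something Moab provides.

```shell
#!/bin/sh
# Hypothetical helper (not part of Moab): succeed if the given policies
# form a combination allowed by the rules above, that is, at most one
# "first sub-job" policy and at most one "any sub-job" policy.
valid_policy_combo() {
    n_first=0
    n_any=0
    for p in "$@"; do
        case $p in
            CancelOnFirstFailure|CancelOnFirstSuccess)
                n_first=$(( n_first + 1 )) ;;
            CancelOnAnyFailure|CancelOnAnySuccess|CancelOnExitCode:*)
                n_any=$(( n_any + 1 )) ;;
            *)
                return 1 ;;    # unknown policy name
        esac
    done
    [ "$n_first" -le 1 ] && [ "$n_any" -le 1 ] && [ $(( n_first + n_any )) -ge 1 ]
}

valid_policy_combo CancelOnFirstFailure CancelOnExitCode:3-7+11 && echo "allowed"
valid_policy_combo CancelOnAnyFailure CancelOnAnySuccess || echo "rejected"
```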
Operations can be performed on individual jobs, a selection of jobs in a job array, or on the entire array.
The syntax for submitting job arrays is: msub -t [<jobname>]<indexlist>[%<limit>] arrayscript.sh
The <jobname> and <limit> are optional. The jobname does not override the jobID Moab assigns to the array. When submitting an array with a jobname, Moab returns the jobID, which is the scheduler name followed by a unique ID.
For example, if the scheduler name in moab.cfg is Moab (SCHEDCFG[Moab]), submitting an array with a jobname responds like this:
> msub -t myarray[1-10] job.sh
Moab.6
To specify that only a certain number of sub-jobs in the array can run at a time, use the percent sign (%) delimiter. In this example, only five sub-jobs in the array can run at a time:
> msub -t myarray[1-1000]%5
To submit a specific set of array sub-jobs, use the comma delimiter in the array index list:
> msub -t myarray[1,2,3,4]
> msub -t myarray[1-5,7,10]
You can use the checkjob command on either the jobID or the jobname you specified.
> msub -t myarray[1-2] job.sh
Moab.10
$ checkjob -v myarray
job Moab.10
AName: myarray
Job Array Info:
Name: Moab.10
1 : Moab.10[1] : Running
2 : Moab.10[2] : Running
Sub-jobs: 2
Active: 2 ( 100.0% )
Eligible: 0 ( 0.0% )
Blocked: 0 ( 0.0% )
Completed: 0 ( 0.0% )
State: Idle
Creds: user:tuser1 group:tgroup1
WallTime: 00:00:00 of 99:23:59:59
SubmitTime: Thu Jun 2 16:37:17
(Time Queued Total: 00:00:33 Eligible: 00:00:00)
Total Requested Tasks: 1
Req TaskCount: 1 Partition: ALL
To submit a job with a step size, use a colon in the array range and specify how many jobs to step. In the example below, a step size of 2 is requested. The sub-jobs will be numbered according to the step size inside the index limit. The array master job name will be the same as explained above.
$ msub -t myarray[2-10:2] job.sh
job Moab.15
$ checkjob -v myarray #or you could use 'checkjob -v Moab.15'
job Moab.15
AName: myarray
Job Array Info:
Name: Moab.15
2 : Moab.15[2] : Running
4 : Moab.15[4] : Running
6 : Moab.15[6] : Running
8 : Moab.15[8] : Running
10 : Moab.15[10] : Running
Sub-jobs: 5
Active: 5 ( 100.0% )
Eligible: 0 ( 0.0% )
Blocked: 0 ( 0.0% )
Completed: 0 ( 0.0% )
State: Idle
Creds: user:tuser1 group:tgroup1
WallTime: 00:00:00 of 99:23:59:59
SubmitTime: Thu Jun 2 16:37:17
(Time Queued Total: 00:00:33 Eligible: 00:00:00)
Total Requested Tasks: 1
Req TaskCount: 1 Partition: ALL
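To preview which sub-job indexes an index list selects before submitting, a sketch like the following may help. The helper expand_index_list is hypothetical; it takes the list without the surrounding brackets and, as an assumption of this sketch, accepts an optional :step on each comma-separated part. The text above does not say whether Moab allows commas and steps to be combined.

```shell
#!/bin/sh
# Hypothetical helper (not part of Moab): expand an array index list such
# as 1-5,7,10 or 2-10:2 into one index per line.
expand_index_list() {
    for part in $(echo "$1" | tr ',' ' '); do
        # An optional ":step" suffix sets the step size; the default is 1.
        step=1
        case $part in
            *:*) step=${part#*:}; part=${part%:*} ;;
        esac
        # A part is either a single index or a "first-last" range.
        case $part in
            *-*) first=${part%-*}; last=${part#*-} ;;
            *)   first=$part; last=$part ;;
        esac
        i=$first
        while [ "$i" -le "$last" ]; do
            echo "$i"
            i=$(( i + step ))
        done
    done
}

expand_index_list "2-10:2"      # prints 2, 4, 6, 8 and 10 on separate lines
expand_index_list "1-5,7,10"    # prints 1, 2, 3, 4, 5, 7 and 10 on separate lines
```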
Copyright © 2012 Adaptive Computing Enterprises, Inc.®