You can submit an array of jobs to Moab via the msub command. Array jobs are an easy way to submit many sub-jobs that perform the same work using the same script, but operate on different sets of data. Sub-jobs are the jobs created by an array job and are identified by the array job ID and an index; for example, if 235.1 is an identifier, the number 235 is the job array ID, and the number 1 is the sub-job index.
The job array feature, new in Moab 6.0, does not integrate natively with TORQUE support for job arrays. Also, job array usage limits are presently unavailable.
To enable job arrays, include the ENABLEJOBARRAYS parameter in the Moab configuration file (moab.cfg).
Like a normal job, an array job submits a job script, but it additionally has a start index (sidx), an end index (eidx), and an increment (incr). Moab uses these values to create sub-jobs, all executing the same script. The number of sub-jobs created is the end index minus the start index plus the increment, divided by the increment and rounded down to a whole number: (eidx - sidx + incr) / incr.
To illustrate, suppose an array job has a start index of 1, an end index of 100, and an increment of 1. This is an array job that creates (100 - 1 + 1) / 1 = 100 sub-jobs with indexes of 1, 2, 3, ..., 100. An increment of 2 produces (100 - 1 + 2) / 2 = 50 sub-jobs with indexes of 1, 3, 5, ..., 99. An increment of 2 with a start index of 2 produces (100 - 2 + 2) / 2 = 50 sub-jobs with indexes of 2, 4, 6, ..., 100. Again, sub-jobs are jobs in their own right that have a slightly different job naming convention (jobID.subJobIndex).
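The arithmetic above can be checked with a short shell sketch. The function names subjob_count and subjob_indexes are hypothetical helpers written for this illustration; they are not Moab commands, and the sketch assumes the division in the formula is integer division.

```shell
#!/bin/sh
# Illustrative helpers only; these names are hypothetical, not part of Moab.

# Number of sub-jobs: (eidx - sidx + incr) / incr, using integer division.
subjob_count() (
    sidx=$1; eidx=$2; incr=$3
    echo $(( (eidx - sidx + incr) / incr ))
)

# Print the sub-job indexes from sidx to eidx in steps of incr, one per line.
subjob_indexes() (
    sidx=$1; eidx=$2; incr=$3
    i=$sidx
    while [ "$i" -le "$eidx" ]; do
        echo "$i"
        i=$(( i + incr ))
    done
)

subjob_count 1 100 1    # prints 100
subjob_count 1 100 2    # prints 50
subjob_count 2 100 2    # prints 50
```

For a start index of 1, an end index of 100, and an increment of 2, subjob_indexes 1 100 2 lists 1, 3, 5, and so on up to 99, matching the second case in the text.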
A job script can read the array index from an environment variable and use it to form data file or directory names that are unique to its sub-job. Moab supplies two environment variables, including MOAB_JOBARRAYRANGE, so that a job script can recognize which index in the array it is running as. To pass these variables to the resource manager, use the msub command with the -V option, or include them in the job script; for example: #PBS -V MOAB_JOBARRAYRANGE.
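As a rough sketch of how a job script might use the index, the following builds per-sub-job file names. It assumes the index is exported as MOAB_JOBARRAYINDEX, a name inferred here by analogy with MOAB_JOBARRAYRANGE and not confirmed by the text above, so verify it against your installation. The data/ and results/ layout and the helper function names are invented for the illustration.

```shell
#!/bin/sh
# Sketch of a job script for an array sub-job.
# ASSUMPTION: the sub-job index arrives in MOAB_JOBARRAYINDEX (name assumed by
# analogy with MOAB_JOBARRAYRANGE). The data/ and results/ paths are made up.

# Build the input file name for a given sub-job index.
input_for_index() (
    echo "data/input.$1.dat"
)

# Build the output directory name for a given sub-job index.
output_dir_for_index() (
    echo "results/run_$1"
)

# Fall back to index 1 when the variable is unset (for example, in a dry run).
index=${MOAB_JOBARRAYINDEX:-1}

echo "sub-job index ${index}: input $(input_for_index "$index"), output $(output_dir_for_index "$index")"
```

Keeping the name construction in small functions makes it easy to dry-run the script outside the scheduler before submitting the whole array.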
Users can control individual sub-jobs in the same manner as normal jobs. In addition, an array job represents its group of sub-jobs and any user or administrator commands performed on an array job apply to its sub-jobs; for example, the command canceljob <arrayJobId> cancels all sub-jobs that belong to the array job. For more information about job control, see the documentation for the mjobctl command.
If a user submits an array job to a grid head node, Moab must schedule the array job's sub-jobs to a single cluster; that is, its sub-jobs are not permitted to execute across multiple clusters.
In the first example below, the fields specific to array sub-jobs are Parent Array ID, Array Index, and Array Range.
$ checkjob -v Moab.1.1
job Moab.1.1

State: Running
Creds:  user:testuser1  group:testgroup1
WallTime: 00:00:17 of 8:20:00
SubmitTime: Thu Nov 4 11:50:03
  (Time Queued  Total: 00:00:00  Eligible: INFINITY)

StartTime: Thu Nov 4 11:50:03
Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: base
Average Utilized Procs: 0.96
NodeCount: 1

Allocated Nodes:
[node010:1]

Job Group:         Moab.1
Parent Array ID:   Moab.1
Array Index:       1
Array Range:       10
SystemID:          Moab
SystemJID:         Moab.1.1
Task Distribution: node010
IWD:               /home/jbanks
UMask:             0000
Executable:        /usr/test/moab/spool/moab.job.3CvNjl
StartCount:        1
Partition List:    base
SrcRM:             internal  DstRM: base  DstRMJID: Moab.1.1
Flags:             ARRAYJOB,GLOBALQUEUE
StartPriority:     1
PE:                1.00
Reservation 'Moab.1.1' (-00:00:19 -> 8:19:41  Duration: 8:20:00)
If you give checkjob the array job ID without a sub-job index, the output lists all the sub-jobs in the array.
$ checkjob -v medsec.1
job medsec.1

Job Array Info:
  Name: moab
   1 : medsec.1.1 : Running
   2 : medsec.1.2 : Running
   3 : medsec.1.3 : Running
   4 : medsec.1.4 : Running
   5 : medsec.1.5 : Running
   6 : medsec.1.6 : Running
   7 : medsec.1.7 : Running
   8 : medsec.1.8 : Running
   9 : medsec.1.9 : Running
  10 : medsec.1.10 : Running
  11 : medsec.1.11 : Running
  12 : medsec.1.12 : Running
  13 : medsec.1.13 : Running
  14 : medsec.1.14 : Running
  15 : medsec.1.15 : Running
  16 : medsec.1.16 : Running
  17 : medsec.1.17 : Running
  18 : medsec.1.18 : Running
  19 : medsec.1.19 : Running
  20 : medsec.1.20 : Running
  Totals:
    Active: 20
    Idle: 0
    Complete: 0
You can also use showq. This displays all active, eligible, blocked, and/or recently completed jobs on the system.
$ showq

active jobs------------------------
JOBID         USERNAME  STATE     PROCS  REMAINING    STARTTIME

medsec.1.6    testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.13   testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.19   testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.5    testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.8    testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.10   testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.3    testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.4    testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.12   testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.2    testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.1    testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.20   testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.9    testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.14   testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.11   testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.16   testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.7    testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.17   testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.18   testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19
medsec.1.15   testuser  Starting  1      99:23:59:54  Thu Oct 7 16:18:19

20 active jobs    20 of 240 processors in use by local jobs (8.33%)
                  1 of 4 nodes active (25.00%)

eligible jobs----------------------
JOBID         USERNAME  STATE     PROCS  WCLIMIT      QUEUETIME

0 eligible jobs

blocked jobs-----------------------
JOBID         USERNAME  STATE     PROCS  WCLIMIT      QUEUETIME

0 blocked jobs

Total jobs: 21
Operations can be performed on individual jobs, a selection of jobs in a job array, or on the entire array.
The syntax for submitting job arrays is: msub -t <alias>.[<indexlist>]%<limit>
The alias and limit are optional. The alias does not override the arrayid Moab assigns to the array. When submitting an array with an alias, Moab returns the arrayid, which is the scheduler name followed by a unique ID.
For example, if the scheduler name in moab.cfg is Moab, Moab responds to an array submitted with an alias like this:
### SCHEDCFG line in moab.cfg ###
SCHEDCFG[Moab] SERVER=headmaster

### Submitting an array with an alias ###
> msub -t myarray.[1-10] job.sh
Moab.6
To specify that only a certain number of sub-jobs in the array can run at a time, use the percent sign (%) delimiter. In this example, only five sub-jobs in the array can run at a time:
> msub -t myarray.[1-1000]%5
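To make the three parts of the -t argument concrete, the following sketch splits a specification of the form <alias>.[<indexlist>]%<limit> into its pieces. The function parse_array_spec is a hypothetical helper for illustration only. It is not how Moab parses the argument, and it assumes that an alias, when present, is always followed by ".[".

```shell
#!/bin/sh
# Illustrative parser for "<alias>.[<indexlist>]%<limit>".
# parse_array_spec is a hypothetical name; this is not Moab's own parser.
parse_array_spec() (
    spec=$1
    alias_name=""
    limit=""

    # Optional limit: everything after the percent sign.
    case $spec in
        *%*)
            limit=${spec##*"%"}
            spec=${spec%"%$limit"}
            ;;
    esac

    # Optional alias: everything before ".[".
    case $spec in
        *.\[*)
            alias_name=${spec%%.\[*}
            ;;
    esac

    # Index list: the text between the square brackets.
    indexlist=${spec#*\[}
    indexlist=${indexlist%\]}

    echo "alias=$alias_name indexlist=$indexlist limit=$limit"
)

parse_array_spec 'myarray.[1-1000]%5'   # prints alias=myarray indexlist=1-1000 limit=5
```

Quote the specification when passing it on a real command line if your shell treats square brackets as glob characters.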
To submit a specific set of array sub-jobs, use the comma delimiter in the array index list:
> msub -t myarray.[1,2,3,4]
> msub -t myarray.[1-5,7,10]
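The sub-job indexes that a comma-delimited list selects can be previewed with a small sketch such as the one below. The function expand_index_list is a hypothetical helper, not a Moab command, and it handles only single indexes and simple first-last ranges.

```shell
#!/bin/sh
# Expand an index list such as "1-5,7,10" into one index per line.
# expand_index_list is a hypothetical helper; it is not part of Moab.
expand_index_list() (
    list=$1
    IFS=,                       # split the list on commas (subshell only)
    for item in $list; do
        case $item in
            *-*)
                # A range "first-last": print every index in it.
                first=${item%-*}
                last=${item#*-}
                i=$first
                while [ "$i" -le "$last" ]; do
                    echo "$i"
                    i=$(( i + 1 ))
                done
                ;;
            *)
                # A single index.
                echo "$item"
                ;;
        esac
    done
)

expand_index_list "1-5,7,10"    # prints 1 2 3 4 5 7 10, one per line
```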
You can use the checkjob command on either the arrayid or the alias you specified.
> msub -t myarray.[1-2] job.sh
Moab.10

$ checkjob myarray
job Moab.10

AName: myarray
Job Array Info:
  Name: Moab.1
  1 : Moab.1[1] : Running
  2 : Moab.1[2] : Running

  Sub-jobs:    2
    Active:    2 ( 100.0% )
    Eligible:  0 ( 0.0% )
    Blocked:   0 ( 0.0% )
    Completed: 0 ( 0.0% )

State: Idle
Creds:  user:tuser1  group:tgroup1
WallTime: 00:00:00 of 99:23:59:59
SubmitTime: Thu Jun 2 16:37:17
  (Time Queued  Total: 00:00:33  Eligible: 00:00:00)

Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
To submit a job array with a step size, add a colon and the step size after the array range. The example below requests a step size of 2. The sub-jobs are numbered by stepping through the index range in increments of the step size. The array master job is named as explained above.
$ msub -t myarray.[2-10:2] job.sh
job Moab.15

$ checkjob -v myarray    # or you could use 'checkjob -v Moab.15'
job Moab.15

AName: job
Job Array Info:
  Name: Moab.1
  2 : Moab.15[2] : Running
  4 : Moab.15[4] : Running
  6 : Moab.15[6] : Running
  8 : Moab.15[8] : Running
  10 : Moab.15[10] : Running

  Sub-jobs:    4
    Active:    4 ( 100.0% )
    Eligible:  0 ( 0.0% )
    Blocked:   0 ( 0.0% )
    Completed: 0 ( 0.0% )

State: Idle
Creds:  user:tuser1  group:tgroup1
WallTime: 00:00:00 of 99:23:59:59
SubmitTime: Thu Jun 2 16:37:17
  (Time Queued  Total: 00:00:33  Eligible: 00:00:00)

Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
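A stepped range of the form first-last:step can be expanded by hand with a sketch like the following. The function expand_stepped_range is a hypothetical helper for illustration, not a Moab command, and it treats a missing step as a step of 1.

```shell
#!/bin/sh
# Expand a range with an optional step, such as "2-10:2", one index per line.
# expand_stepped_range is a hypothetical helper; it is not part of Moab.
expand_stepped_range() (
    range=$1
    step=1                      # default step when no ":step" is given

    case $range in
        *:*)
            step=${range##*:}
            range=${range%:*}
            ;;
    esac

    first=${range%-*}
    last=${range#*-}

    i=$first
    while [ "$i" -le "$last" ]; do
        echo "$i"
        i=$(( i + step ))
    done
)

expand_stepped_range "2-10:2"   # prints 2 4 6 8 10, one per line
```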
$ releasehold -a myarray OR myarray.1
$ checkjob -v myarray
$ canceljob myarray.1.1
$ releasehold -a myarray1.2
$ mjobctl -c r:myarray1.[20-30]
$ mjobctl -u r:myarray1.[10-99]
$ canceljob r:myarray1.[475-500]