The showres command shows all reservations currently on the system.
> showres
ReservationID Type S Start End Duration N/P StartTime
fr8n01.187.0 Job R 00:00:00 1:00:00:00 1:00:00:00 20/20 Mon Feb 16 11:54:03
fr8n01.189.0 Job R 00:00:00 1:00:00:00 1:00:00:00 20/20 Mon Feb 16 11:54:03
fr8n01.190.0 Job R 00:00:00 1:00:00:00 1:00:00:00 20/20 Mon Feb 16 11:54:03
fr8n01.191.0 Job R 00:00:00 1:00:00:00 1:00:00:00 20/20 Mon Feb 16 11:54:03
fr8n01.276.0 Job R 00:00:00 1:00:00:00 1:00:00:00 20/20 Mon Feb 16 11:54:03
fr1n04.362.0 Job I 1:00:00:00 2:00:00:00 1:00:00:00 20/20 Tue Feb 17 11:54:03
fr1n04.369.0 Job R 00:00:00 1:00:00:00 1:00:00:00 20/20 Mon Feb 16 11:54:03
fr1n04.487.0 Job R 00:00:00 1:00:00:00 1:00:00:00 20/20 Mon Feb 16 11:54:03
fr8n01.804.0 Job R 00:00:00 00:05:00 00:05:00 5/5 Mon Feb 16 11:54:03
fr8n01.960.0 Job R 00:00:00 1:00:00:00 1:00:00:00 32/32 Mon Feb 16 11:54:03
10 reservations located
Here, the S column is the job's state (R = running, I = idle). All the active jobs have a reservation, as does the idle job fr1n04.362.0. That reservation was created by the backfill scheduler for the highest-priority idle job to prevent starvation while lower-priority jobs were being backfilled. (The backfill documentation describes the mechanics of backfill scheduling more fully.)
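The reservation behavior seen here is governed by scheduler parameters. Below is a minimal moab.cfg sketch; BACKFILLPOLICY, RESERVATIONPOLICY, and RESERVATIONDEPTH are standard Moab parameters, but the values shown are illustrative assumptions, not the settings used in this simulation.
# illustrative moab.cfg fragment (values are assumptions)
BACKFILLPOLICY    FIRSTFIT        # let lower-priority jobs backfill into idle resources
RESERVATIONPOLICY CURRENTHIGHEST  # reserve resources for the current highest-priority idle job
RESERVATIONDEPTH  1               # number of priority reservations to maintain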
To display information about the nodes that job fr1n04.362.0 has reserved, use showres -n <JOBID>.
> showres -n fr1n04.362.0
reservations on Mon Feb 16 11:54:03
NodeName Type ReservationID JobState Task Start Duration StartTime
fr5n09 Job fr1n04.362.0 Idle 1 1:00:00:00 1:00:00:00 Tue Feb 17 11:54:03
...
fr7n15 Job fr1n04.362.0 Idle 1 1:00:00:00 1:00:00:00 Tue Feb 17 11:54:03
20 nodes reserved
Now advance the simulator an iteration to allow some jobs to actually run.
> mschedctl -S
scheduling will stop in 00:00:30 at iteration 2
Next, check the queues to see what happened.
> showq
active jobs------------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
fr8n01.804.0 529 Running 5 00:04:30 Mon Feb 16 11:54:03
fr8n01.187.0 570 Running 20 23:59:30 Mon Feb 16 11:54:03
...
9 active jobs 177 of 196 Processors Active (90.31%)
eligible jobs----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
...
fr8n01.963.0 586 Idle 32 9:00:00 Mon Feb 16 11:54:33
fr8n01.1016.0 570 Idle 20 1:00:00:00 Mon Feb 16 11:54:33
16 eligible jobs
...
Two new jobs, fr8n01.963.0 and fr8n01.1016.0, are in the eligible queue. Also, note that the first job will now complete in 4 minutes 30 seconds rather than 5 minutes because simulated time has just advanced by 30 seconds, one RMPOLLINTERVAL. It is important to note that when the simulated jobs were created, both each job's wallclock limit and its actual run time were recorded. The wallclock limit is specified by the user as a best estimate of an upper bound on how long the job will run. The run time is how long the job actually ran before completing and releasing its allocated resources. For example, a job with a wallclock limit of 1 hour will be given the needed resources for up to an hour but may complete in only 20 minutes.
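For context, the 30-second iteration step comes from the RMPOLLINTERVAL parameter. The sketch below shows a minimal simulation-mode configuration; the parameter names are standard Moab parameters, but the scheduler name, trace file paths, and values are hypothetical.
# illustrative moab.cfg fragment for simulation mode (paths and values are assumptions)
SCHEDCFG[sim]        MODE=SIMULATION
RMPOLLINTERVAL       00:00:30              # each iteration advances simulated time by 30 seconds
SIMRESOURCETRACEFILE traces/resource.trace # hypothetical path to the node trace
SIMWORKLOADTRACEFILE traces/workload.trace # hypothetical path to the job trace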
Stop the simulation at iteration 6.
> mschedctl -s 6I
scheduling will stop in 00:03:00 at iteration 6
The -s 6I argument indicates that the scheduler will stop at iteration 6 and will (I)gnore user input until it gets there. This prevents the possibility of obtaining showq output from iteration 5 rather than iteration 6.
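For reference, the stepping and stopping forms used in this tutorial are summarized below; the resume form is assumed from standard mschedctl usage and may vary by version.
> mschedctl -s        (stop scheduling at the next opportunity)
> mschedctl -s <X>I   (stop at absolute iteration X, ignoring user input until then)
> mschedctl -S <N>I   (step N more iterations, then stop)
> mschedctl -r        (resume normal scheduling)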
> showq
active jobs------------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
fr8n01.804.0 529 Running 5 00:02:30 Mon Feb 16 11:54:03
...
fr1n04.501.0 570 Running 20 1:00:00:00 Mon Feb 16 11:56:33
fr8n01.388.0 550 Running 20 1:00:00:00 Mon Feb 16 11:56:33
9 active jobs 177 of 196 Processors Active (90.31%)
...
14 eligible jobs
...
As expected, job fr8n01.804.0 is still 2 minutes 30 seconds away from completing, but notice that jobs fr8n01.189.0 and fr8n01.191.0 have both completed early. Although they had almost 24 hours of wallclock limit remaining, they terminated; in reality, they probably failed on the real-world system where the trace file was created. Their completion freed 40 processors, which the scheduler immediately used to start several more jobs.
Note the system statistics:
> showstats
...
Successful/Completed Jobs: 0/2 (0.000%)
...
Avg WallClock Accuracy: 0.150%
Avg Job Proc Efficiency: 100.000%
Est/Avg Backlog (Hours): 0.00/3652178.74
A few more fields are filled in now that some jobs have completed, providing information from which to generate statistics.
Decrease the default LOGLEVEL with mschedctl -m to avoid unnecessary logging and speed up the simulation.
> mschedctl -m LOGLEVEL 0
INFO: parameter modified
You can use mschedctl -m to immediately change the value of any parameter. The change is only made to the currently running Moab server and is not propagated to the configuration file. Changes can also be made by modifying the configuration file and restarting the scheduler.
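For example, to raise the log level again at run time using the same syntax, or to make a setting permanent by editing the configuration file (assumed here to be the standard moab.cfg):
> mschedctl -m LOGLEVEL 3
# or add the following line to moab.cfg and restart Moab:
LOGLEVEL 3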
Stop at iteration 580 and pull up the scheduler's statistics.
> mschedctl -s 580I; showq
scheduling will stop in 4:47:00 at iteration 580
...
11 active jobs 156 of 196 Processors Active (79.59%)
eligible jobs----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
fr8n01.963.0 586 Idle 32 9:00:00 Mon Feb 16 11:54:33
fr8n01.1075.0 560 Idle 32 23:56:00 Mon Feb 16 11:58:33
fr8n01.1076.0 560 Idle 16 23:56:00 Mon Feb 16 11:59:33
fr1n04.1953.0 520 Idle 46 7:45:00 Mon Feb 16 12:03:03
...
16 eligible jobs
...
You may note that showq hangs for a while as the scheduler simulates up to iteration 580. The output shows that only 156 of the 196 nodes are currently busy, yet at first glance three jobs, fr8n01.963.0, fr8n01.1075.0, and fr8n01.1076.0, appear ready to run.
> checkjob fr8n01.963.0; checkjob fr8n01.1075.0; checkjob fr8n01.1076.0
job fr8n01.963.0
...
Network: hps_user Memory >= 256M Disk >= 0 Swap >= 0
...
Job Eligibility Analysis -------
job cannot run in partition DEFAULT (idle procs do not meet requirements : 20 of 32 procs found)
idle procs: 40 feasible procs: 20
Rejection Reasons: [Memory : 20][State : 156]
job fr8n01.1075.0
...
Network: hps_user Memory >= 256M Disk >= 0 Swap >= 0
...
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 32 procs found)
idle procs: 40 feasible procs: 0
Rejection Reasons: [Memory : 20][State : 156][ReserveTime : 20]
job fr8n01.1076.0
...
Network: hps_user Memory >= 256M Disk >= 0 Swap >= 0
...
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 16 procs found)
idle procs: 40 feasible procs: 0
Rejection Reasons: [Memory : 20][State : 156][ReserveTime : 20]
The checkjob command reveals that job fr8n01.963.0 found only 20 of the 32 processors it needs. The remaining 20 idle processors could not be used because the configured memory on those nodes did not meet the job's requirements. The other two jobs cannot find enough processors because of ReserveTime rejections, which indicate that the processors are idle but already hold a reservation that will start before the job being checked could complete.
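For a more detailed per-requirement analysis, checkjob can also be run with its verbose flag; a minimal sketch:
> checkjob -v fr8n01.963.0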
Use the mdiag -n command to verify that the remaining idle nodes either lack sufficient configured memory or are already reserved; it provides detailed information about the state of every node Moab is currently tracking. The mdiag command can be used with various flags to obtain detailed information about accounts, blocked jobs, fairshare, groups, jobs, nodes, QoS, reservations, the resource manager, and users. It also performs a number of sanity checks on the data and presents warning messages if discrepancies are detected.
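The per-object reports are selected with single-letter flags; the mapping below is assumed from typical Moab usage and may vary by version. Only the node report (-n) is needed next.
> mdiag -a   (accounts)
> mdiag -b   (blocked jobs)
> mdiag -f   (fairshare)
> mdiag -g   (groups)
> mdiag -j   (jobs)
> mdiag -n   (nodes)
> mdiag -q   (QoS)
> mdiag -r   (reservations)
> mdiag -R   (resource managers)
> mdiag -u   (users)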
> mdiag -n -v | grep -e Name -e Idle
Name State Procs Memory Disk Swap Speed Opsys Arch Par Load Rsv ...
fr10n09 Idle 1:1 256:256 9780:9780 411488:411488 1.00 AIX43 R6000 DEF 0.00 001 .
fr10n11 Idle 1:1 256:256 8772:8772 425280:425280 1.00 AIX43 R6000 DEF 0.00 001 .
fr10n13 Idle 1:1 256:256 9272:9272 441124:441124 1.00 AIX43 R6000 DEF 0.00 001 .
fr10n15 Idle 1:1 256:256 8652:8652 440776:440776 1.00 AIX43 R6000 DEF 0.00 001
fr11n01 Idle 1:1 256:256 7668:7668 438624:438624 1.00 AIX43 R6000 DEF 0.00 001
fr11n03 Idle 1:1 256:256 9548:9548 424584:424584 1.00 AIX43 R6000 DEF 0.00 001
fr11n05 Idle 1:1 256:256 11608:11608 454476:454476 1.00 AIX43 R6000 DEF 0.00 001
fr11n07 Idle 1:1 256:256 9008:9008 425292:425292 1.00 AIX43 R6000 DEF 0.00 001
fr11n09 Idle 1:1 256:256 8588:8588 424684:424684 1.00 AIX43 R6000 DEF 0.00 001
fr11n11 Idle 1:1 256:256 9632:9632 424936:424936 1.00 AIX43 R6000 DEF 0.00 001
fr11n13 Idle 1:1 256:256 9524:9524 425432:425432 1.00 AIX43 R6000 DEF 0.00 001
fr11n15 Idle 1:1 256:256 9388:9388 425728:425728 1.00 AIX43 R6000 DEF 0.00 001
fr14n01 Idle 1:1 256:256 6848:6848 424260:424260 1.00 AIX43 R6000 DEF 0.00 001
fr14n03 Idle 1:1 256:256 9752:9752 424192:424192 1.00 AIX43 R6000 DEF 0.00 001
fr14n05 Idle 1:1 256:256 9920:9920 434088:434088 1.00 AIX43 R6000 DEF 0.00 001
fr14n07 Idle 1:1 256:256 2196:2196 434224:434224 1.00 AIX43 R6000 DEF 0.00 001
fr14n09 Idle 1:1 256:256 9368:9368 434568:434568 1.00 AIX43 R6000 DEF 0.00 001
fr14n11 Idle 1:1 256:256 9880:9880 434172:434172 1.00 AIX43 R6000 DEF 0.00 001
fr14n13 Idle 1:1 256:256 9760:9760 433952:433952 1.00 AIX43 R6000 DEF 0.00 001
fr14n15 Idle 1:1 256:256 25000:25000 434044:434044 1.00 AIX43 R6000 DEF 0.00 001
fr17n05 Idle 1:1 128:128 10016:10016 182720:182720 1.00 AIX43 R6000 DEF 0.00 000
...
Total Nodes: 196 (Active: 156 Idle: 40 Down: 0)
The grep keeps the header line and the idle nodes. All the idle nodes with 256 MB of memory installed already have a reservation (see the Rsv column); the rest of the idle nodes have only 128 MB of memory. Use checknode on one of the reserved nodes to see which reservation holds it.
> checknode fr10n09
node fr10n09
State: Idle (in current state for 4:21:00)
Configured Resources: PROCS: 1 MEM: 256M SWAP: 401G DISK: 9780M
Utilized Resources: [NONE]
Dedicated Resources: [NONE]
...
Total Time: 4:50:00 Up: 4:50:00 (100.00%) Active: 00:34:30 (11.90%)
Reservations:
Job 'fr8n01.963.0'(x1) 3:25:00 -> 12:25:00 (9:00:00)
Using checknode reveals that job fr8n01.963.0 holds the reservation.
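To see which resources are free for immediate use (and therefore available to backfill), the showbf command can be run; a minimal sketch, with the duration flag assumed from standard showbf usage:
> showbf
> showbf -d 9:00:00   (resources available for at least 9 hours)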
Moving ahead:
> mschedctl -S 500I; showstats -v
scheduling will stop in 4:10:00 at iteration 1080
...
Eligible/Idle Jobs: 16/16 (100.000%)
Active Jobs: 11
Successful/Completed Jobs: 2/25 (8.000%)
Preempt Jobs: 0
Avg/Max QTime (Hours): 0.00/0.00
Avg/Max XFactor: 0.00/1.04
Avg/Max Bypass: 0.00/13.00
Dedicated/Total ProcHours: 1545.44/1765.63 (87.529%)
Preempt/Dedicated ProcHours: 0.00/1545.44 (0.000%)
Current Active/Total Procs: 156/196 (79.592%)
Avg WallClock Accuracy: 9.960%
Avg Job Proc Efficiency: 100.000%
Min System Utilization: 79.592% (on iteration 33)
Est/Avg Backlog (Hours): 0.00/20289.84
So far, the scheduler appears to be scheduling efficiently; system utilization as reported by showstats -v looks very good.