Z.11 Determining Why Jobs Are Not Running

The showq command reports on the job queues, whereas scheduler statistics are available through the showstats command. For example, the Current Active/Total Procs field shows current system utilization.


> showstats
moab active for      00:00:30  stats initialized on Mon Feb 16 11:53:33
Eligible/Idle Jobs:                    9/9         (100.000%)
Active Jobs:                           0
Successful/Completed Jobs:             0/0         (0.000%)
Avg/Max QTime (Hours):              0.00/0.00
Avg/Max XFactor:                    0.00/0.00
Dedicated/Total ProcHours:          1.17/1.63      (71.429%)

Current Active/Total Procs:          140/196       (71.429%)

Avg WallClock Accuracy:             N/A
Avg Job Proc Efficiency:            N/A
Est/Avg Backlog (Hours):            N/A / N/A
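
As a rough illustration of reading this output programmatically, the following Python sketch extracts the "a/b" style fields from showstats text and recomputes the utilization percentage. The function name parse_ratio_fields and the embedded sample are hypothetical; the parsing assumes the field layout matches the output shown above, which may vary between Moab versions.

```python
import re

# Sample copied from the showstats output above (assumed layout).
SAMPLE = """\
moab active for      00:00:30  stats initialized on Mon Feb 16 11:53:33
Eligible/Idle Jobs:                    9/9         (100.000%)
Active Jobs:                           0
Current Active/Total Procs:          140/196       (71.429%)
Est/Avg Backlog (Hours):            N/A / N/A
"""

# Matches lines of the form "Label:   <number>/<number>".
RATIO_LINE = re.compile(
    r"^(?P<label>[^:]+):\s+(?P<num>\d+(?:\.\d+)?)\s*/\s*(?P<den>\d+(?:\.\d+)?)"
)

def parse_ratio_fields(text):
    """Return {label: (numerator, denominator)} for numeric 'a/b' fields."""
    fields = {}
    for line in text.splitlines():
        match = RATIO_LINE.match(line.strip())
        if match:
            label = match.group("label").strip()
            fields[label] = (float(match.group("num")), float(match.group("den")))
    return fields

fields = parse_ratio_fields(SAMPLE)
active, total = fields["Current Active/Total Procs"]
utilization = 100.0 * active / total   # percentage of processors in use
idle = total - active                  # processors not running jobs

print(f"utilization: {utilization:.3f}%  idle procs: {idle:.0f}")
```

Lines whose values are N/A, and the header line with its timestamp, do not match the pattern and are skipped.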

You might wonder why only 140 of 196 processors are active (as shown with showq) when the first job in the queue (fr1n04.362.0) requires only 20 processors. The checkjob command reports detailed job state information and diagnostic output for a particular job, so we can use it to determine why the job is not running:

> checkjob fr1n04.362.0
job fr1n04.362.0
State: Idle
...
Network: hps_user  Memory >= 256M  Disk >= 0  Swap >= 0
...
Job Eligibility Analysis -------
job cannot run in partition DEFAULT (idle procs do not meet requirements : 8 of 20 procs found)
idle procs:  56  feasible procs:   8
Rejection Reasons: [Memory : 48][State : 140]

checkjob not only reports the job's wallclock limit and the number of requested nodes (elided in the output above) but also explains why the job cannot run. The Job Eligibility Analysis shows that 48 processors rejected the job because of memory limitations and that another 140 processors rejected it because of their state (that is, they are running other jobs). That leaves only 8 of the 56 idle processors feasible, fewer than the 20 the job requires. Notice the job's Memory >= 256M requirement.
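
The accounting behind these numbers can be checked with a short Python sketch. It parses the Rejection Reasons line and reproduces the idle and feasible processor counts. The function name parse_rejections is hypothetical, the total of 196 processors is taken from the showstats output, and the arithmetic assumes each processor is counted under exactly one rejection reason, which holds for this sample but is not guaranteed in general.

```python
import re

# Sample copied from the checkjob output above (assumed layout).
SAMPLE = """\
job cannot run in partition DEFAULT (idle procs do not meet requirements : 8 of 20 procs found)
idle procs:  56  feasible procs:   8
Rejection Reasons: [Memory : 48][State : 140]
"""

def parse_rejections(text):
    """Return {reason: processor_count} from the 'Rejection Reasons' line."""
    for line in text.splitlines():
        if line.startswith("Rejection Reasons:"):
            pairs = re.findall(r"\[([^:\]]+):\s*(\d+)\]", line)
            return {name.strip(): int(count) for name, count in pairs}
    return {}

TOTAL_PROCS = 196      # from 'Current Active/Total Procs' in showstats
REQUESTED_PROCS = 20   # from '8 of 20 procs found'

reasons = parse_rejections(SAMPLE)
idle = TOTAL_PROCS - reasons["State"]    # processors not busy with other jobs
feasible = idle - reasons["Memory"]      # idle processors with enough memory
shortfall = REQUESTED_PROCS - feasible   # processors the job still lacks

print(f"idle={idle} feasible={feasible} shortfall={shortfall}")
```

With the sample values, the job is 12 processors short of its request, which is why it remains idle even though 56 processors are free.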

If you run checkjob with the ID of a running job, it also reports exactly which nodes have been allocated to that job. The checkjob command page describes the remaining output in more detail.

© 2016 Adaptive Computing