Stuck Jobs
If a job gets stuck in TORQUE, try these suggestions to resolve the issue:
- Use the qdel command to cancel the job.
- Force the MOM to send an obituary of the job ID to the server.
- You can try clearing the stale jobs by using the momctl command on the compute nodes where the jobs are still listed.
> momctl -c 58925 -h compute-5-20
- Setting the qmgr server setting mom_job_sync to True might help prevent jobs from hanging.
> qmgr -c "set server mom_job_sync = True"
To check whether this is already set, print the server configuration and look for mom_job_sync in the output:
> qmgr -c "print server"
- If the suggestions above do not remove the stuck job, try qdel -p. The -p option purges all information generated by the job, so use it only after the other suggestions have failed.
- The last suggestion for removing stuck jobs from compute nodes is to restart pbs_mom on those nodes.
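The escalation order above can be sketched as a small shell helper. This is only an illustration under stated assumptions: clear_stuck_job, run, job_listed, DRY_RUN, and WAIT are hypothetical names that are not part of TORQUE, and the job ID and node name are reused from the momctl example. The sketch assumes qstat exits non-zero once the server no longer knows the job. It skips the obituary step, for which no command is given here, and leaves restarting pbs_mom to the administrator.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with TORQUE.
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"
# Seconds to wait before re-checking the job (assumed value; tune per site).
WAIT="${WAIT:-10}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Succeeds while the server still lists the job.
# Assumption: qstat returns non-zero for a job ID it no longer knows.
# In dry-run mode the job is always treated as still listed.
job_listed() {
    if [ "$DRY_RUN" = "1" ]; then
        return 0
    fi
    sleep "$WAIT"
    qstat "$1" >/dev/null 2>&1
}

# Try each suggestion in turn and stop as soon as the job is gone.
clear_stuck_job() {
    jobid="$1"
    node="$2"

    # 1. Ordinary cancel.
    run qdel "$jobid"
    job_listed "$jobid" || return 0

    # 2. Clear the stale job on the compute node that still lists it.
    run momctl -c "$jobid" -h "$node"
    job_listed "$jobid" || return 0

    # 3. Purge: discards all information generated by the job.
    run qdel -p "$jobid"
}

clear_stuck_job 58925 compute-5-20
```

Running the script as-is only prints the three commands in order. Setting DRY_RUN=0 executes them, so review the printed commands first.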
For additional troubleshooting, run tracejob on one of the stuck jobs. You can then create an online support ticket and attach the full server log for the time period shown in the tracejob output.
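As a rough sketch of collecting that trace, the helper below saves the tracejob output to a file that can accompany a ticket. The function name save_trace, the output file name, and the DRY_RUN switch are hypothetical. The -n option sets how many days back tracejob searches, which matters when the job has been stuck for longer than the one-day default.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with TORQUE.
# Usage: save_trace JOBID [DAYS]
# DRY_RUN=1 (the default) prints the command instead of running it.
save_trace() {
    jobid="$1"
    days="${2:-1}"
    outfile="tracejob-$jobid.txt"   # assumed file name, for illustration only
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ tracejob -n $days $jobid > $outfile"
    else
        # Capture both stdout and stderr so any warnings are kept with the trace.
        tracejob -n "$days" "$jobid" > "$outfile" 2>&1
    fi
}

save_trace 58925 2
```

With DRY_RUN=0 the trace is written to the current directory. tracejob usually needs to run on the server host, with permission to read the TORQUE logs.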