TORQUE includes a diagnostic script that gathers the files TORQUE Support needs to investigate an issue. Run it as a user who can run all TORQUE commands and access all TORQUE directories (this is usually root).
The script (contrib/diag/tdiag.sh) is available in TORQUE 2.3.8, TORQUE 2.4.3, and later. It collects the nodes file and the server and MOM log files, captures the output of qmgr -c 'p s', and packs everything into a tar file.
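To make the collection steps concrete, here is a minimal sketch of gathering the same kinds of files by hand. It is not the real tdiag.sh: the function name `collect_torque_diag` is hypothetical, and the paths (`server_priv/nodes`, `server_logs/<date>`, `mom_logs/<date>` under the TORQUE home directory) assume the usual TORQUE layout, which may differ on your installation.

```shell
#!/bin/sh
# Sketch only: NOT the real tdiag.sh. Paths assume the usual TORQUE layout.
# $1 = TORQUE home, $2 = date (YYYYmmdd), $3 = output tar file
collect_torque_diag() {
    torque_home=${1:-/var/spool/torque}
    day=${2:-$(date +%Y%m%d)}
    out=${3:-torque_diag${day}.tar.gz}

    workdir=$(mktemp -d) || return 1

    # Copy the nodes file and that day's server and MOM logs, if present.
    for f in "server_priv/nodes" "server_logs/$day" "mom_logs/$day"; do
        if [ -f "$torque_home/$f" ]; then
            mkdir -p "$workdir/$(dirname "$f")"
            cp "$torque_home/$f" "$workdir/$f"
        fi
    done

    # Capture the server configuration when qmgr is available.
    if command -v qmgr >/dev/null 2>&1; then
        qmgr -c 'p s' > "$workdir/qmgr_print_server.txt" 2>&1
    fi

    # Pack everything that was collected into one compressed tar file.
    tar -czf "$out" -C "$workdir" .
    status=$?
    rm -rf "$workdir"
    return $status
}
```

Files that are missing are skipped silently, so the archive may hold fewer files than expected on a host that runs only the server or only a MOM.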
The script accepts the following options (enter ./tdiag.sh -h on the command line to display them):
USAGE: ./torque_diag [-d DATE] [-h] [-o OUTPUT_FILE] [-t TORQUE_HOME]
DATE should be in the format YYYYmmdd. For example, 20091130 would be the date for November 30th, 2009. If no date is specified, today's date is used. OUTPUT_FILE is the optional name of the output file. The default output file is torque_diag<today's_date>.tar.gz. TORQUE_HOME should be the path to your TORQUE directory. If no directory is specified, /var/spool/torque is the default.
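As an illustration of the options described above, the following sketch wraps a tdiag.sh call and rejects a DATE that is not eight digits. The helper names `is_diag_date` and `run_tdiag` are hypothetical and not part of TORQUE; only the `-d`, `-o`, and `-t` flags come from the usage line. The wrapper assumes it is run from the directory that contains tdiag.sh.

```shell
#!/bin/sh
# Hypothetical check: accept only eight digits (YYYYmmdd).
# It does not verify that the month and day are real calendar values.
is_diag_date() {
    case "$1" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical wrapper: run tdiag.sh for one day.
# $1 = DATE, $2 = OUTPUT_FILE, $3 = TORQUE_HOME (defaults to /var/spool/torque)
run_tdiag() {
    is_diag_date "$1" || { echo "DATE must be YYYYmmdd: $1" >&2; return 2; }
    ./tdiag.sh -d "$1" -o "$2" -t "${3:-/var/spool/torque}"
}
```

For example, `run_tdiag 20091130 nov30_diag.tar.gz` would request the files for November 30th, 2009 from the default TORQUE directory.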
Error Code Name | Number | Description |
---|---|---|
PBSE_NONE | 15000 | No error |
PBSE_UNKJOBID | 15001 | Unknown job identifier |
PBSE_NOATTR | 15002 | Undefined attribute |
PBSE_ATTRRO | 15003 | Attempt to set READ ONLY attribute |
PBSE_IVALREQ | 15004 | Invalid request |
PBSE_UNKREQ | 15005 | Unknown batch request |
PBSE_TOOMANY | 15006 | Too many submit retries |
PBSE_PERM | 15007 | No permission |
PBSE_BADHOST | 15008 | Access from host not allowed |
PBSE_JOBEXIST | 15009 | Job already exists |
PBSE_SYSTEM | 15010 | System error occurred |
PBSE_INTERNAL | 15011 | Internal server error occurred |
PBSE_REGROUTE | 15012 | Parent job of dependent in route queue |
PBSE_UNKSIG | 15013 | Unknown signal name |
PBSE_BADATVAL | 15014 | Bad attribute value |
PBSE_MODATRRUN | 15015 | Cannot modify attribute in run state |
PBSE_BADSTATE | 15016 | Request invalid for job state |
PBSE_UNKQUE | 15018 | Unknown queue name |
PBSE_BADCRED | 15019 | Invalid credential in request |
PBSE_EXPIRED | 15020 | Expired credential in request |
PBSE_QUNOENB | 15021 | Queue not enabled |
PBSE_QACESS | 15022 | No access permission for queue |
PBSE_BADUSER | 15023 | Bad user - no password entry |
PBSE_HOPCOUNT | 15024 | Max hop count exceeded |
PBSE_QUEEXIST | 15025 | Queue already exists |
PBSE_ATTRTYPE | 15026 | Incompatible queue attribute type |
PBSE_QUEBUSY | 15027 | Queue busy (not empty) |
PBSE_QUENBIG | 15028 | Queue name too long |
PBSE_NOSUP | 15029 | Feature/function not supported |
PBSE_QUENOEN | 15030 | Cannot enable queue, needs additional definition |
PBSE_PROTOCOL | 15031 | Protocol (ASN.1) error |
PBSE_BADATLST | 15032 | Bad attribute list structure |
PBSE_NOCONNECTS | 15033 | No free connections |
PBSE_NOSERVER | 15034 | No server to connect to |
PBSE_UNKRESC | 15035 | Unknown resource |
PBSE_EXCQRESC | 15036 | Job exceeds queue resource limits |
PBSE_QUENODFLT | 15037 | No default queue defined |
PBSE_NORERUN | 15038 | Job not rerunnable |
PBSE_ROUTEREJ | 15039 | Route rejected by all destinations |
PBSE_ROUTEEXPD | 15040 | Time in route queue expired |
PBSE_MOMREJECT | 15041 | Request to the MOM failed |
PBSE_BADSCRIPT | 15042 | (qsub) cannot access script file |
PBSE_STAGEIN | 15043 | Stage In of files failed |
PBSE_RESCUNAV | 15044 | Resources temporarily unavailable |
PBSE_BADGRP | 15045 | Bad group specified |
PBSE_MAXQUED | 15046 | Max number of jobs in queue |
PBSE_CKPBSY | 15047 | Checkpoint busy, may retry |
PBSE_EXLIMIT | 15048 | Limit exceeds allowable |
PBSE_BADACCT | 15049 | Bad account attribute value |
PBSE_ALRDYEXIT | 15050 | Job already in exit state |
PBSE_NOCOPYFILE | 15051 | Job files not copied |
PBSE_CLEANEDOUT | 15052 | Unknown job id after clean init |
PBSE_NOSYNCMSTR | 15053 | No master in Sync Set |
PBSE_BADDEPEND | 15054 | Invalid dependency |
PBSE_DUPLIST | 15055 | Duplicate entry in List |
PBSE_DISPROTO | 15056 | Bad DIS based request protocol |
PBSE_EXECTHERE | 15057 | Cannot execute there |
PBSE_SISREJECT | 15058 | Sister rejected |
PBSE_SISCOMM | 15059 | Sister could not communicate |
PBSE_SVRDOWN | 15060 | Request rejected - server shutting down |
PBSE_CKPSHORT | 15061 | Not all tasks could checkpoint |
PBSE_UNKNODE | 15062 | Named node is not in the list |
PBSE_UNKNODEATR | 15063 | Node-attribute not recognized |
PBSE_NONODES | 15064 | Server has no node list |
PBSE_NODENBIG | 15065 | Node name is too big |
PBSE_NODEEXIST | 15066 | Node name already exists |
PBSE_BADNDATVAL | 15067 | Bad node-attribute value |
PBSE_MUTUALEX | 15068 | State values are mutually exclusive |
PBSE_GMODERR | 15069 | Error(s) during global modification of nodes |
PBSE_NORELYMOM | 15070 | Could not contact the MOM |
PBSE_NOTSNODE | 15071 | No time-shared nodes |
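
When an error number turns up in a log or in command output, a small lookup can translate it back to its name and description. The sketch below is a hypothetical helper (`pbse_describe` is not a TORQUE command) and covers only a few rows of the table above; extend the `case` statement with further rows as needed.

```shell
#!/bin/sh
# Hypothetical helper: translate a PBSE error number into its name and
# description. Only a subset of the table is included here.
pbse_describe() {
    case "$1" in
        15000) echo "PBSE_NONE: No error" ;;
        15001) echo "PBSE_UNKJOBID: Unknown job identifier" ;;
        15007) echo "PBSE_PERM: No permission" ;;
        15018) echo "PBSE_UNKQUE: Unknown queue name" ;;
        15034) echo "PBSE_NOSERVER: No server to connect to" ;;
        15041) echo "PBSE_MOMREJECT: Request to the MOM failed" ;;
        *) echo "unknown error number: $1" >&2; return 1 ;;
    esac
}
```

For example, `pbse_describe 15034` prints `PBSE_NOSERVER: No server to connect to`, and a number that is not listed returns a non-zero status.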