TORQUE includes a diagnostic script that collects the files TORQUE Support needs to troubleshoot an issue. Run it as a user who can run all TORQUE commands and access all TORQUE directories (usually root).
The script (contrib/diag/tdiag.sh) is available in TORQUE 2.3.8, TORQUE 2.4.3, and later. It collects the node file and the server and MOM log files, captures the output of qmgr -c 'p s', and packages everything in a tar file.
The script accepts the following options (to display them on the command line, run ./tdiag.sh -h):
USAGE: ./torque_diag [-d DATE] [-h] [-o OUTPUT_FILE] [-t TORQUE_HOME]
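
As a rough illustration of the kind of collection described above, the following Python sketch bundles a few files under a TORQUE home directory into a compressed tar file. It is not the contents of tdiag.sh. The function name bundle_diagnostics and the relative paths server_priv/nodes, server_logs, and mom_logs are assumptions made for this example, and the sketch does not capture the qmgr output.

```python
import os
import tarfile

# Assumed layout under TORQUE_HOME (hypothetical for this sketch);
# adjust these relative paths to match the local installation.
CANDIDATES = ("server_priv/nodes", "server_logs", "mom_logs")


def bundle_diagnostics(torque_home, output_file):
    """Add whichever candidate paths exist under torque_home to a gzipped tar.

    Returns the list of relative paths that were actually added.
    """
    added = []
    with tarfile.open(output_file, "w:gz") as tar:
        for rel in CANDIDATES:
            path = os.path.join(torque_home, rel)
            if os.path.exists(path):
                # Directories are added recursively; arcname keeps paths relative.
                tar.add(path, arcname=rel)
                added.append(rel)
    return added
```

Calling bundle_diagnostics("/var/spool/torque", "diag.tar.gz") would be a typical use, assuming that directory is the TORQUE home on the system in question.
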
Table 4-5: TORQUE error codes
| Error code name | Number | Description |
|---|---|---|
| PBSE_FLOOR | 15000 | No error |
| PBSE_UNKJOBID | 15001 | Unknown job ID error |
| PBSE_NOATTR | 15002 | Undefined attribute |
| PBSE_ATTRRO | 15003 | Cannot set attribute, read only or insufficient permission |
| PBSE_IVALREQ | 15004 | Invalid request |
| PBSE_UNKREQ | 15005 | Unknown request |
| PBSE_TOOMANY | 15006 | Too many submit retries |
| PBSE_PERM | 15007 | Unauthorized Request |
| PBSE_IFF_NOT_FOUND | 15008 | trqauthd unable to authenticate |
| PBSE_MUNGE_NOT_FOUND | 15009 | Munge executable not found, unable to authenticate |
| PBSE_BADHOST | 15010 | Access from host not allowed, or unknown host |
| PBSE_JOBEXIST | 15011 | Job with requested ID already exists |
| PBSE_SYSTEM | 15012 | System error |
| PBSE_INTERNAL | 15013 | PBS server internal error |
| PBSE_REGROUTE | 15014 | Dependent parent job currently in routing queue |
| PBSE_UNKSIG | 15015 | Unknown/illegal signal name |
| PBSE_BADATVAL | 15016 | Illegal attribute or resource value for |
| PBSE_MODATRRUN | 15017 | Cannot modify attribute while job running |
| PBSE_BADSTATE | 15018 | Request invalid for state of job |
| PBSE_UNKQUE | 15020 | Unknown queue |
| PBSE_BADCRED | 15021 | Invalid credential |
| PBSE_EXPIRED | 15022 | Expired credential |
| PBSE_QUNOENB | 15023 | Queue is not enabled |
| PBSE_QACESS | 15024 | Access to queue is denied |
| PBSE_BADUSER | 15025 | Bad UID for job execution |
| PBSE_HOPCOUNT | 15026 | Job routing over too many hops |
| PBSE_QUEEXIST | 15027 | Queue already exists |
| PBSE_ATTRTYPE | 15028 | Incompatible type |
| PBSE_QUEBUSY | 15029 | Cannot delete busy queue |
| PBSE_QUENBIG | 15030 | Queue name too long |
| PBSE_NOSUP | 15031 | No support for requested service |
| PBSE_QUENOEN | 15032 | Cannot enable queue, incomplete definition |
| PBSE_PROTOCOL | 15033 | Batch protocol error |
| PBSE_BADATLST | 15034 | Bad attribute list structure |
| PBSE_NOCONNECTS | 15035 | No free connections |
| PBSE_NOSERVER | 15036 | No server specified |
| PBSE_UNKRESC | 15037 | Unknown resource type |
| PBSE_EXCQRESC | 15038 | Job exceeds queue resource limits |
| PBSE_QUENODFLT | 15039 | No default queue specified |
| PBSE_NORERUN | 15040 | Job is not rerunnable |
| PBSE_ROUTEREJ | 15041 | Job rejected by all possible destinations (check syntax, queue resources, …) |
| PBSE_ROUTEEXPD | 15042 | Time in Route Queue Expired |
| PBSE_MOMREJECT | 15043 | Execution server rejected request |
| PBSE_BADSCRIPT | 15044 | (qsub) cannot access script file |
| PBSE_STAGEIN | 15045 | Stage-in of files failed |
| PBSE_RESCUNAV | 15046 | Resource temporarily unavailable |
| PBSE_BADGRP | 15047 | Bad GID for job execution |
| PBSE_MAXQUED | 15048 | Maximum number of jobs already in queue |
| PBSE_CKPBSY | 15049 | Checkpoint busy, may retry |
| PBSE_EXLIMIT | 15050 | Resource limit exceeds allowable |
| PBSE_BADACCT | 15051 | Invalid Account |
| PBSE_ALRDYEXIT | 15052 | Job already in exit state |
| PBSE_NOCOPYFILE | 15053 | Job files not copied |
| PBSE_CLEANEDOUT | 15054 | Unknown job id after clean init |
| PBSE_NOSYNCMSTR | 15055 | No master found for sync job set |
| PBSE_BADDEPEND | 15056 | Invalid Job Dependency |
| PBSE_DUPLIST | 15057 | Duplicate entry in list |
| PBSE_DISPROTO | 15058 | Bad DIS based Request Protocol |
| PBSE_EXECTHERE | 15059 | Cannot execute at specified host because of checkpoint or stagein files |
| PBSE_SISREJECT | 15060 | Sister rejected |
| PBSE_SISCOMM | 15061 | Sister could not communicate |
| PBSE_SVRDOWN | 15062 | Request not allowed: Server shutting down |
| PBSE_CKPSHORT | 15063 | Not all tasks could checkpoint |
| PBSE_UNKNODE | 15064 | Unknown node |
| PBSE_UNKNODEATR | 15065 | Unknown node-attribute |
| PBSE_NONODES | 15066 | Server has no node list |
| PBSE_NODENBIG | 15067 | Node name is too big |
| PBSE_NODEEXIST | 15068 | Node name already exists |
| PBSE_BADNDATVAL | 15069 | Illegal value for |
| PBSE_MUTUALEX | 15070 | Mutually exclusive values for |
| PBSE_GMODERR | 15071 | Modification failed for |
| PBSE_NORELYMOM | 15072 | Server could not connect to MOM |
| PBSE_NOTSNODE | 15073 | No time-share node available |
| PBSE_JOBTYPE | 15074 | Wrong job type |
| PBSE_BADACLHOST | 15075 | Bad ACL entry in host list |
| PBSE_MAXUSERQUED | 15076 | Maximum number of jobs already in queue for user |
| PBSE_BADDISALLOWTYPE | 15077 | Bad type in disallowed_types list |
| PBSE_NOINTERACTIVE | 15078 | Queue does not allow interactive jobs |
| PBSE_NOBATCH | 15079 | Queue does not allow batch jobs |
| PBSE_NORERUNABLE | 15080 | Queue does not allow rerunnable jobs |
| PBSE_NONONRERUNABLE | 15081 | Queue does not allow nonrerunnable jobs |
| PBSE_UNKARRAYID | 15082 | Unknown Array ID |
| PBSE_BAD_ARRAY_REQ | 15083 | Bad Job Array Request |
| PBSE_BAD_ARRAY_DATA | 15084 | Bad data reading job array from file |
| PBSE_TIMEOUT | 15085 | Time out |
| PBSE_JOBNOTFOUND | 15086 | Job not found |
| PBSE_NOFAULTTOLERANT | 15087 | Queue does not allow fault tolerant jobs |
| PBSE_NOFAULTINTOLERANT | 15088 | Queue does not allow fault intolerant jobs |
| PBSE_NOJOBARRAYS | 15089 | Queue does not allow job arrays |
| PBSE_RELAYED_TO_MOM | 15090 | Request was relayed to a MOM |
| PBSE_MEM_MALLOC | 15091 | Error allocating memory - out of memory |
| PBSE_MUTEX | 15092 | Error allocating controlling mutex (lock/unlock) |
| PBSE_THREADATTR | 15093 | Error setting thread attributes |
| PBSE_THREAD | 15094 | Error creating thread |
| PBSE_SELECT | 15095 | Error in socket select |
| PBSE_SOCKET_FAULT | 15096 | Unable to get connection to socket |
| PBSE_SOCKET_WRITE | 15097 | Error writing data to socket |
| PBSE_SOCKET_READ | 15098 | Error reading data from socket |
| PBSE_SOCKET_CLOSE | 15099 | Socket close detected |
| PBSE_SOCKET_LISTEN | 15100 | Error listening on socket |
| PBSE_AUTH_INVALID | 15101 | Invalid auth type in request |
| PBSE_NOT_IMPLEMENTED | 15102 | This functionality is not yet implemented |
| PBSE_QUENOTAVAILABLE | 15103 | Queue is currently not available |
| PBSE_TMPDIFFOWNER | 15104 | tmpdir owned by another user |
| PBSE_TMPNOTDIR | 15105 | tmpdir exists but is not a directory |
| PBSE_TMPNONAME | 15106 | tmpdir cannot be named for job |
| PBSE_CANTOPENSOCKET | 15107 | Cannot open demux sockets |
| PBSE_CANTCONTACTSISTERS | 15108 | Cannot send join job to all sisters |
| PBSE_CANTCREATETMPDIR | 15109 | Cannot create tmpdir for job |
| PBSE_BADMOMSTATE | 15110 | MOM is down, cannot run job |
| PBSE_SOCKET_INFORMATION | 15111 | Socket information is not accessible |
| PBSE_SOCKET_DATA | 15112 | Data on socket does not process correctly |
| PBSE_CLIENT_INVALID | 15113 | Client is not allowed/trusted |
| PBSE_PREMATURE_EOF | 15114 | Premature End of File |
| PBSE_CAN_NOT_SAVE_FILE | 15115 | Error saving file |
| PBSE_CAN_NOT_OPEN_FILE | 15116 | Error opening file |
| PBSE_CAN_NOT_WRITE_FILE | 15117 | Error writing file |
| PBSE_JOB_FILE_CORRUPT | 15118 | Job file corrupt |
| PBSE_JOB_RERUN | 15119 | Job cannot be rerun |
| PBSE_CONNECT | 15120 | Cannot establish connection |
| PBSE_JOBWORKDELAY | 15121 | Job function must be temporarily delayed |
| PBSE_BAD_PARAMETER | 15122 | Parameter of function was invalid |
| PBSE_CONTINUE | 15123 | Continue processing on job. (Not an error) |
| PBSE_JOBSUBSTATE | 15124 | Current sub state does not allow transaction. |
| PBSE_CAN_NOT_MOVE_FILE | 15125 | Error moving file |
| PBSE_JOB_RECYCLED | 15126 | Job is being recycled |
| PBSE_JOB_ALREADY_IN_QUEUE | 15127 | Job is already in destination queue. |
| PBSE_INVALID_MUTEX | 15128 | Mutex is NULL or otherwise invalid |
| PBSE_MUTEX_ALREADY_LOCKED | 15129 | The mutex is already locked by this object |
| PBSE_MUTEX_ALREADY_UNLOCKED | 15130 | The mutex has already been unlocked by this object |
| PBSE_INVALID_SYNTAX | 15131 | Command syntax invalid |
| PBSE_NODE_DOWN | 15132 | A node is down. Check the MOM and host |
| PBSE_SERVER_NOT_FOUND | 15133 | Could not connect to batch server |
| PBSE_SERVER_BUSY | 15134 | Server busy. Currently no available threads |
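
When a numeric code appears in a log or command output, it can be convenient to look it up programmatically. The following Python sketch parses rows laid out like the table above into a dictionary keyed by number. The function name parse_error_table and the SAMPLE excerpt are illustrative assumptions, not part of TORQUE.

```python
def parse_error_table(text):
    """Map error number -> (name, description) from pipe-delimited table rows.

    Rows that are not data rows (the header and the |---| separator)
    are skipped.
    """
    codes = {}
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) == 3 and cells[0].startswith("PBSE_") and cells[1].isdigit():
            codes[int(cells[1])] = (cells[0], cells[2])
    return codes


# A short excerpt of the table, used here only as sample input.
SAMPLE = """\
| Error code name | Number | Description |
|---|---|---|
| PBSE_UNKJOBID | 15001 | Unknown job ID error |
| PBSE_UNKQUE | 15020 | Unknown queue |
| PBSE_SERVER_NOT_FOUND | 15133 | Could not connect to batch server |
"""

codes = parse_error_table(SAMPLE)
name, description = codes[15020]
print(f"{name}: {description}")
```

Feeding the full table text to parse_error_table instead of SAMPLE gives a lookup for every listed code. Numbers that do not appear in the table, such as 15019, are simply absent from the result.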