TORQUE provides a diagnostic script that gathers the files TORQUE Support needs to investigate an issue. Run it as a user who can run all TORQUE commands and access all TORQUE directories (usually root).
The script (contrib/diag/tdiag.sh) is available in TORQUE 2.3.8, TORQUE 2.4.3, and later. It collects the node file and the server and MOM log files, captures the output of qmgr -c 'p s', and bundles everything into a tar file.
The script accepts the following options (run ./tdiag.sh -h to display them on the command line):
USAGE: ./torque_diag [-d DATE] [-h] [-o OUTPUT_FILE] [-t TORQUE_HOME]
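
To make the collection step concrete, here is a minimal sketch of the kind of gathering described above. It is not the actual tdiag.sh: the function name collect_torque_diag is hypothetical, and the paths server_priv/nodes, server_logs, and mom_logs under TORQUE_HOME are an assumed directory layout that may differ on your installation.

```shell
#!/bin/sh
# Illustrative stand-in only; NOT the real contrib/diag/tdiag.sh.
# Usage: collect_torque_diag TORQUE_HOME OUTPUT_FILE
collect_torque_diag() {
    torque_home=$1
    output_file=$2
    workdir=$(mktemp -d) || return 1

    # Capture the server configuration if qmgr is on the PATH.
    if command -v qmgr >/dev/null 2>&1; then
        qmgr -c 'p s' > "$workdir/qmgr_p_s.txt" 2>&1 || true
    else
        echo "qmgr not found" > "$workdir/qmgr_p_s.txt"
    fi

    # Copy the node file and log directories that exist (assumed layout).
    for item in server_priv/nodes server_logs mom_logs; do
        if [ -e "$torque_home/$item" ]; then
            mkdir -p "$workdir/$(dirname "$item")"
            cp -R "$torque_home/$item" "$workdir/$item"
        fi
    done

    # Bundle everything into one tar file, then clean up.
    tar -cf "$output_file" -C "$workdir" .
    status=$?
    rm -rf "$workdir"
    return $status
}
```

For example, collect_torque_diag /var/spool/torque /tmp/torque_diag.tar would write the archive to /tmp, assuming /var/spool/torque is your TORQUE home. For real support requests, use tdiag.sh itself.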
Table D-1: TORQUE error codes
Error code name | Number | Description |
---|---|---|
PBSE_FLOOR | 15000 | No error |
PBSE_UNKJOBID | 15001 | Unknown job ID error |
PBSE_NOATTR | 15002 | Undefined attribute |
PBSE_ATTRRO | 15003 | Cannot set attribute, read only or insufficient permission |
PBSE_IVALREQ | 15004 | Invalid request |
PBSE_UNKREQ | 15005 | Unknown request |
PBSE_TOOMANY | 15006 | Too many submit retries |
PBSE_PERM | 15007 | Unauthorized Request |
PBSE_IFF_NOT_FOUND | 15008 | trqauthd unable to authenticate |
PBSE_MUNGE_NOT_FOUND | 15009 | Munge executable not found, unable to authenticate |
PBSE_BADHOST | 15010 | Access from host not allowed, or unknown host |
PBSE_JOBEXIST | 15011 | Job with requested ID already exists |
PBSE_SYSTEM | 15012 | System error |
PBSE_INTERNAL | 15013 | PBS server internal error |
PBSE_REGROUTE | 15014 | Dependent parent job currently in routing queue |
PBSE_UNKSIG | 15015 | Unknown/illegal signal name |
PBSE_BADATVAL | 15016 | Illegal attribute or resource value for |
PBSE_MODATRRUN | 15017 | Cannot modify attribute while job running |
PBSE_BADSTATE | 15018 | Request invalid for state of job |
PBSE_UNKQUE | 15020 | Unknown queue |
PBSE_BADCRED | 15021 | Invalid credential |
PBSE_EXPIRED | 15022 | Expired credential |
PBSE_QUNOENB | 15023 | Queue is not enabled |
PBSE_QACESS | 15024 | Access to queue is denied |
PBSE_BADUSER | 15025 | Bad UID for job execution |
PBSE_HOPCOUNT | 15026 | Job routing over too many hops |
PBSE_QUEEXIST | 15027 | Queue already exists |
PBSE_ATTRTYPE | 15028 | Incompatible type |
PBSE_QUEBUSY | 15029 | Cannot delete busy queue |
PBSE_QUENBIG | 15030 | Queue name too long |
PBSE_NOSUP | 15031 | No support for requested service |
PBSE_QUENOEN | 15032 | Cannot enable queue, incomplete definition |
PBSE_PROTOCOL | 15033 | Batch protocol error |
PBSE_BADATLST | 15034 | Bad attribute list structure |
PBSE_NOCONNECTS | 15035 | No free connections |
PBSE_NOSERVER | 15036 | No server specified |
PBSE_UNKRESC | 15037 | Unknown resource type |
PBSE_EXCQRESC | 15038 | Job exceeds queue resource limits |
PBSE_QUENODFLT | 15039 | No default queue specified |
PBSE_NORERUN | 15040 | Job is not rerunnable |
PBSE_ROUTEREJ | 15041 | Job rejected by all possible destinations (check syntax, queue resources, …) |
PBSE_ROUTEEXPD | 15042 | Time in Route Queue Expired |
PBSE_MOMREJECT | 15043 | Execution server rejected request |
PBSE_BADSCRIPT | 15044 | (qsub) cannot access script file |
PBSE_STAGEIN | 15045 | Stage-in of files failed |
PBSE_RESCUNAV | 15046 | Resource temporarily unavailable |
PBSE_BADGRP | 15047 | Bad GID for job execution |
PBSE_MAXQUED | 15048 | Maximum number of jobs already in queue |
PBSE_CKPBSY | 15049 | Checkpoint busy, may retry |
PBSE_EXLIMIT | 15050 | Resource limit exceeds allowable |
PBSE_BADACCT | 15051 | Invalid Account |
PBSE_ALRDYEXIT | 15052 | Job already in exit state |
PBSE_NOCOPYFILE | 15053 | Job files not copied |
PBSE_CLEANEDOUT | 15054 | Unknown job id after clean init |
PBSE_NOSYNCMSTR | 15055 | No master found for sync job set |
PBSE_BADDEPEND | 15056 | Invalid Job Dependency |
PBSE_DUPLIST | 15057 | Duplicate entry in list |
PBSE_DISPROTO | 15058 | Bad DIS based Request Protocol |
PBSE_EXECTHERE | 15059 | Cannot execute at specified host because of checkpoint or stagein files |
PBSE_SISREJECT | 15060 | Sister rejected |
PBSE_SISCOMM | 15061 | Sister could not communicate |
PBSE_SVRDOWN | 15062 | Request not allowed: Server shutting down |
PBSE_CKPSHORT | 15063 | Not all tasks could checkpoint |
PBSE_UNKNODE | 15064 | Unknown node |
PBSE_UNKNODEATR | 15065 | Unknown node-attribute |
PBSE_NONODES | 15066 | Server has no node list |
PBSE_NODENBIG | 15067 | Node name is too big |
PBSE_NODEEXIST | 15068 | Node name already exists |
PBSE_BADNDATVAL | 15069 | Illegal value for |
PBSE_MUTUALEX | 15070 | Mutually exclusive values for |
PBSE_GMODERR | 15071 | Modification failed for |
PBSE_NORELYMOM | 15072 | Server could not connect to MOM |
PBSE_NOTSNODE | 15073 | No time-share node available |
PBSE_JOBTYPE | 15074 | Wrong job type |
PBSE_BADACLHOST | 15075 | Bad ACL entry in host list |
PBSE_MAXUSERQUED | 15076 | Maximum number of jobs already in queue for user |
PBSE_BADDISALLOWTYPE | 15077 | Bad type in disallowed_types list |
PBSE_NOINTERACTIVE | 15078 | Queue does not allow interactive jobs |
PBSE_NOBATCH | 15079 | Queue does not allow batch jobs |
PBSE_NORERUNABLE | 15080 | Queue does not allow rerunable jobs |
PBSE_NONONRERUNABLE | 15081 | Queue does not allow nonrerunable jobs |
PBSE_UNKARRAYID | 15082 | Unknown Array ID |
PBSE_BAD_ARRAY_REQ | 15083 | Bad Job Array Request |
PBSE_BAD_ARRAY_DATA | 15084 | Bad data reading job array from file |
PBSE_TIMEOUT | 15085 | Time out |
PBSE_JOBNOTFOUND | 15086 | Job not found |
PBSE_NOFAULTTOLERANT | 15087 | Queue does not allow fault tolerant jobs |
PBSE_NOFAULTINTOLERANT | 15088 | Queue does not allow fault intolerant jobs |
PBSE_NOJOBARRAYS | 15089 | Queue does not allow job arrays |
PBSE_RELAYED_TO_MOM | 15090 | Request was relayed to a MOM |
PBSE_MEM_MALLOC | 15091 | Error allocating memory - out of memory |
PBSE_MUTEX | 15092 | Error allocating controlling mutex (lock/unlock) |
PBSE_THREADATTR | 15093 | Error setting thread attributes |
PBSE_THREAD | 15094 | Error creating thread |
PBSE_SELECT | 15095 | Error in socket select |
PBSE_SOCKET_FAULT | 15096 | Unable to get connection to socket |
PBSE_SOCKET_WRITE | 15097 | Error writing data to socket |
PBSE_SOCKET_READ | 15098 | Error reading data from socket |
PBSE_SOCKET_CLOSE | 15099 | Socket close detected |
PBSE_SOCKET_LISTEN | 15100 | Error listening on socket |
PBSE_AUTH_INVALID | 15101 | Invalid auth type in request |
PBSE_NOT_IMPLEMENTED | 15102 | This functionality is not yet implemented |
PBSE_QUENOTAVAILABLE | 15103 | Queue is currently not available |
PBSE_TMPDIFFOWNER | 15104 | tmpdir owned by another user |
PBSE_TMPNOTDIR | 15105 | tmpdir exists but is not a directory |
PBSE_TMPNONAME | 15106 | tmpdir cannot be named for job |
PBSE_CANTOPENSOCKET | 15107 | Cannot open demux sockets |
PBSE_CANTCONTACTSISTERS | 15108 | Cannot send join job to all sisters |
PBSE_CANTCREATETMPDIR | 15109 | Cannot create tmpdir for job |
PBSE_BADMOMSTATE | 15110 | Mom is down, cannot run job |
PBSE_SOCKET_INFORMATION | 15111 | Socket information is not accessible |
PBSE_SOCKET_DATA | 15112 | Data on socket does not process correctly |
PBSE_CLIENT_INVALID | 15113 | Client is not allowed/trusted |
PBSE_PREMATURE_EOF | 15114 | Premature End of File |
PBSE_CAN_NOT_SAVE_FILE | 15115 | Error saving file |
PBSE_CAN_NOT_OPEN_FILE | 15116 | Error opening file |
PBSE_CAN_NOT_WRITE_FILE | 15117 | Error writing file |
PBSE_JOB_FILE_CORRUPT | 15118 | Job file corrupt |
PBSE_JOB_RERUN | 15119 | Job can not be rerun |
PBSE_CONNECT | 15120 | Can not establish connection |
PBSE_JOBWORKDELAY | 15121 | Job function must be temporarily delayed |
PBSE_BAD_PARAMETER | 15122 | Parameter of function was invalid |
PBSE_CONTINUE | 15123 | Continue processing on job. (Not an error) |
PBSE_JOBSUBSTATE | 15124 | Current sub state does not allow transaction. |
PBSE_CAN_NOT_MOVE_FILE | 15125 | Error moving file |
PBSE_JOB_RECYCLED | 15126 | Job is being recycled |
PBSE_JOB_ALREADY_IN_QUEUE | 15127 | Job is already in destination queue. |
PBSE_INVALID_MUTEX | 15128 | Mutex is NULL or otherwise invalid |
PBSE_MUTEX_ALREADY_LOCKED | 15129 | The mutex is already locked by this object |
PBSE_MUTEX_ALREADY_UNLOCKED | 15130 | The mutex has already been unlocked by this object |
PBSE_INVALID_SYNTAX | 15131 | Command syntax invalid |
PBSE_NODE_DOWN | 15132 | A node is down. Check the MOM and host |
PBSE_SERVER_NOT_FOUND | 15133 | Could not connect to batch server |
PBSE_SERVER_BUSY | 15134 | Server busy. Currently no available threads |
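
When one of these numbers turns up in a log entry or an error message, a small lookup helper can save a trip to the table. The sketch below is an illustration only: pbse_lookup is a hypothetical function name, and it embeds just a handful of rows from Table D-1, so extend the embedded list with the codes you care about.

```shell
#!/bin/sh
# Look up a TORQUE error code by number or by name.
# Only a few rows of Table D-1 are embedded here as an example.
# Usage: pbse_lookup 15020   or   pbse_lookup PBSE_UNKQUE
pbse_lookup() {
    query=$1
    while IFS='|' read -r name number description; do
        if [ "$query" = "$number" ] || [ "$query" = "$name" ]; then
            printf '%s (%s): %s\n' "$name" "$number" "$description"
            return 0
        fi
    done <<'EOF'
PBSE_UNKJOBID|15001|Unknown job ID error
PBSE_PERM|15007|Unauthorized Request
PBSE_BADHOST|15010|Access from host not allowed, or unknown host
PBSE_UNKQUE|15020|Unknown queue
PBSE_NORELYMOM|15072|Server could not connect to MOM
PBSE_SERVER_NOT_FOUND|15133|Could not connect to batch server
EOF
    # No embedded row matched the query.
    printf 'unknown TORQUE error code: %s\n' "$query" >&2
    return 1
}
```

Calling pbse_lookup 15020 prints the name and description for that code; a code that is not in the embedded list produces a message on standard error and a non-zero return status.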