Torque supports a number of diagnostic and debug options including the following:
PBSDEBUG environment variable - If set to 'yes', this variable prevents pbs_server, pbs_mom, and/or pbs_sched from backgrounding themselves, which allows them to be launched directly under a debugger. Some client commands also provide additional diagnostic information when this value is set.
PBSLOGLEVEL environment variable - Can be set to any value from 0 to 7 and specifies the logging verbosity level (default = 0).
PBSCOREDUMP environment variable - If set, it will cause the offending resource manager daemon to create a core file if a SIGSEGV, SIGILL, SIGFPE, SIGSYS, or SIGTRAP signal is received. The core dump will be placed in the daemon's home directory ($PBSHOME/mom_priv for pbs_mom and $PBSHOME/server_priv for pbs_server).
To enable core dumping in a Red Hat-based system, you must add the following line to the /etc/init.d/pbs_mom and /etc/init.d/pbs_server scripts:
export DAEMON_COREFILE_LIMIT=unlimited
NDEBUG #define - If set at build time, it causes additional low-level logging information to be output to stdout for the pbs_server and pbs_mom daemons.
tracejob reporting tool - Can be used to collect and report logging and accounting information for specific jobs. See Using "tracejob" to Locate Job Failures for more information.
PBSLOGLEVEL and PBSCOREDUMP must be added to the $PBSHOME/pbs_environment file, not just the current environment. To set these variables, add a line to the pbs_environment file as either "variable=value" or just "variable". With "variable=value", the environment variable is set to the value specified. With "variable" alone, the environment variable takes its value from the current environment.
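The two line forms accepted in pbs_environment can be sketched as follows. This is a minimal illustration, not a procedure from the Torque documentation: the scratch directory `demo_home` is a hypothetical stand-in for $PBSHOME so the commands can be tried without touching a real installation, and the log level of 7 is just an example value.

```shell
# Scratch directory standing in for $PBSHOME (assumption: on a real system
# you would edit $PBSHOME/pbs_environment itself).
demo_home=$(mktemp -d)
env_file="$demo_home/pbs_environment"

# "variable=value" form: PBSLOGLEVEL is set to the value given here.
echo 'PBSLOGLEVEL=7' >> "$env_file"

# Bare "variable" form: PBSCOREDUMP takes its value from the current environment.
echo 'PBSCOREDUMP' >> "$env_file"

# Show the resulting file contents.
cat "$env_file"
```

On a real system the same two lines would go into $PBSHOME/pbs_environment itself.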
5.619.1 Torque Error Codes
Error code name | Number | Description |
---|---|---|
PBSE_FLOOR | 15000 | No error |
PBSE_UNKJOBID | 15001 | Unknown job identifier |
PBSE_NOATTR | 15002 | Undefined attribute |
PBSE_ATTRRO | 15003 | Attempt to set READ ONLY attribute |
PBSE_IVALREQ | 15004 | Invalid request |
PBSE_UNKREQ | 15005 | Unknown batch request |
PBSE_TOOMANY | 15006 | Too many submit retries |
PBSE_PERM | 15007 | No permission |
PBSE_IFF_NOT_FOUND | 15008 | "pbs_iff" not found; unable to authenticate |
PBSE_MUNGE_NOT_FOUND | 15009 | "munge" executable not found; unable to authenticate |
PBSE_BADHOST | 15010 | Access from host not allowed |
PBSE_JOBEXIST | 15011 | Job already exists |
PBSE_SYSTEM | 15012 | System error occurred |
PBSE_INTERNAL | 15013 | Internal server error occurred |
PBSE_REGROUTE | 15014 | Parent job of dependent in rte queue |
PBSE_UNKSIG | 15015 | Unknown signal name |
PBSE_BADATVAL | 15016 | Bad attribute value |
PBSE_MODATRRUN | 15017 | Cannot modify attribute in run state |
PBSE_BADSTATE | 15018 | Request invalid for job state |
PBSE_UNKQUE | 15020 | Unknown queue name |
PBSE_BADCRED | 15021 | Invalid credential in request |
PBSE_EXPIRED | 15022 | Expired credential in request |
PBSE_QUNOENB | 15023 | Queue not enabled |
PBSE_QACESS | 15024 | No access permission for queue |
PBSE_BADUSER | 15025 | Bad user - no password entry |
PBSE_HOPCOUNT | 15026 | Max hop count exceeded |
PBSE_QUEEXIST | 15027 | Queue already exists |
PBSE_ATTRTYPE | 15028 | Incompatible queue attribute type |
PBSE_QUEBUSY | 15029 | Queue busy (not empty) |
PBSE_QUENBIG | 15030 | Queue name too long |
PBSE_NOSUP | 15031 | Feature/function not supported |
PBSE_QUENOEN | 15032 | Cannot enable queue, needs add def |
PBSE_PROTOCOL | 15033 | Protocol (ASN.1) error |
PBSE_BADATLST | 15034 | Bad attribute list structure |
PBSE_NOCONNECTS | 15035 | No free connections |
PBSE_NOSERVER | 15036 | No server to connect to |
PBSE_UNKRESC | 15037 | Unknown resource |
PBSE_EXCQRESC | 15038 | Job exceeds queue resource limits |
PBSE_QUENODFLT | 15039 | No default queue defined |
PBSE_NORERUN | 15040 | Job not rerunnable |
PBSE_ROUTEREJ | 15041 | Route rejected by all destinations |
PBSE_ROUTEEXPD | 15042 | Time in route queue expired |
PBSE_MOMREJECT | 15043 | Request to MOM failed |
PBSE_BADSCRIPT | 15044 | (qsub) Cannot access script file |
PBSE_STAGEIN | 15045 | Stage-In of files failed |
PBSE_RESCUNAV | 15046 | Resources temporarily unavailable |
PBSE_BADGRP | 15047 | Bad group specified |
PBSE_MAXQUED | 15048 | Max number of jobs in queue |
PBSE_CKPBSY | 15049 | Checkpoint busy, may be retries |
PBSE_EXLIMIT | 15050 | Limit exceeds allowable |
PBSE_BADACCT | 15051 | Bad account attribute value |
PBSE_ALRDYEXIT | 15052 | Job already in exit state |
PBSE_NOCOPYFILE | 15053 | Job files not copied |
PBSE_CLEANEDOUT | 15054 | Unknown job id after clean init |
PBSE_NOSYNCMSTR | 15055 | No master in sync set |
PBSE_BADDEPEND | 15056 | Invalid dependency |
PBSE_DUPLIST | 15057 | Duplicate entry in list |
PBSE_DISPROTO | 15058 | Bad DIS based request protocol |
PBSE_EXECTHERE | 15059 | Cannot execute there |
PBSE_SISREJECT | 15060 | Sister rejected |
PBSE_SISCOMM | 15061 | Sister could not communicate |
PBSE_SVRDOWN | 15062 | Requirement rejected - server shutting down |
PBSE_CKPSHORT | 15063 | Not all tasks could checkpoint |
PBSE_UNKNODE | 15064 | Named node is not in the list |
PBSE_UNKNODEATR | 15065 | Node-attribute not recognized |
PBSE_NONODES | 15066 | Server has no node list |
PBSE_NODENBIG | 15067 | Node name is too big |
PBSE_NODEEXIST | 15068 | Node name already exists |
PBSE_BADNDATVAL | 15069 | Bad node-attribute value |
PBSE_MUTUALEX | 15070 | State values are mutually exclusive |
PBSE_GMODERR | 15071 | Error(s) during global modification of nodes |
PBSE_NORELYMOM | 15072 | Could not contact MOM |
PBSE_NOTSNODE | 15073 | No time-shared nodes |
PBSE_JOBTYPE | 15074 | Wrong job type |
PBSE_BADACLHOST | 15075 | Bad ACL entry in host list |
PBSE_MAXUSERQUED | 15076 | Maximum number of jobs already in queue for user |
PBSE_BADDISALLOWTYPE | 15077 | Bad type in "disallowed_types" list |
PBSE_NOINTERACTIVE | 15078 | Interactive jobs not allowed in queue |
PBSE_NOBATCH | 15079 | Batch jobs not allowed in queue |
PBSE_NORERUNABLE | 15080 | Rerunable jobs not allowed in queue |
PBSE_NONONRERUNABLE | 15081 | Non-rerunable jobs not allowed in queue |
PBSE_UNKARRAYID | 15082 | Unknown array ID |
PBSE_BAD_ARRAY_REQ | 15083 | Bad job array request |
PBSE_TIMEOUT | 15084 | Time out |
PBSE_JOBNOTFOUND | 15085 | Job not found |
PBSE_NOFAULTTOLERANT | 15086 | Fault tolerant jobs not allowed in queue |
PBSE_NOFAULTINTOLERANT | 15087 | Only fault tolerant jobs allowed in queue |
PBSE_NOJOBARRAYS | 15088 | Job arrays not allowed in queue |
PBSE_RELAYED_TO_MOM | 15089 | Request was relayed to a MOM |
PBSE_MEM_MALLOC | 15090 | Failed to allocate memory for memmgr |
PBSE_MUTEX | 15091 | Failed to allocate controlling mutex (lock/unlock) |
PBSE_TRHEADATTR | 15092 | Failed to set thread attributes |
PBSE_THREAD | 15093 | Failed to create thread |
PBSE_SELECT | 15094 | Failed to select socket |
PBSE_SOCKET_FAULT | 15095 | Failed to get connection to socket |
PBSE_SOCKET_WRITE | 15096 | Failed to write data to socket |
PBSE_SOCKET_READ | 15097 | Failed to read data from socket |
PBSE_SOCKET_CLOSE | 15098 | Socket closed |
PBSE_SOCKET_LISTEN | 15099 | Failed to listen in on socket |
PBSE_AUTH_INVALID | 15100 | Invalid auth type in request |
PBSE_NOT_IMPLEMENTED | 15101 | Functionality not yet implemented |
PBSE_QUENOTAVAILABLE | 15102 | Queue is not available |
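When a numeric code turns up in a log or message, a small lookup helper can save a trip to the table. The sketch below is only an illustration: `pbs_errname` is a hypothetical function name, not part of Torque, and it covers just a handful of entries copied from the table above.

```shell
# Hypothetical helper (not a Torque command): translate a few error numbers
# from the table above into their names and descriptions.
pbs_errname() {
  case "$1" in
    15001) echo "PBSE_UNKJOBID: Unknown job identifier" ;;
    15007) echo "PBSE_PERM: No permission" ;;
    15020) echo "PBSE_UNKQUE: Unknown queue name" ;;
    15036) echo "PBSE_NOSERVER: No server to connect to" ;;
    15084) echo "PBSE_TIMEOUT: Time out" ;;
    *)     echo "unknown error code: $1"; return 1 ;;
  esac
}

# Example lookup for code 15007.
pbs_errname 15007
```

Extending the `case` statement with further rows from the table follows the same pattern; codes not listed fall through to the final branch, which returns a non-zero status.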