Debugging

TORQUE supports a number of diagnostic and debug options including the following:

PBSDEBUG environment variable - If set to 'yes', this variable prevents pbs_server, pbs_mom, and/or pbs_sched from backgrounding themselves, which allows them to be launched directly under a debugger. Some client commands also provide additional diagnostic information when this value is set.

PBSLOGLEVEL environment variable - Can be set to any value from 0 to 7 and specifies the logging verbosity level (default = 0).

PBSCOREDUMP environment variable - If set, it causes the offending resource manager daemon to create a core file when it receives a SIGSEGV, SIGILL, SIGFPE, SIGSYS, or SIGTRAP signal. The core dump is placed in the daemon's home directory ($PBSHOME/mom_priv for pbs_mom).

NDEBUG #define - If set at build time, it causes additional low-level logging information to be output to stdout for the pbs_server and pbs_mom daemons.

tracejob reporting tool - Can be used to collect and report logging and accounting information for specific jobs (for more information, see Using "tracejob" to locate job failures).
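
As a rough illustration of the PBSDEBUG option, the following Python sketch builds an environment with PBSDEBUG=yes and starts a daemon under gdb. The helper names (debug_env, gdb_command, run_under_gdb), the choice of gdb as the debugger, and the assumption that the daemon binary is on PATH are illustrative only and are not part of TORQUE.

```python
import os
import subprocess


def debug_env(base_env=None):
    """Return a copy of the environment with PBSDEBUG set to 'yes'."""
    env = dict(os.environ if base_env is None else base_env)
    env["PBSDEBUG"] = "yes"
    return env


def gdb_command(daemon="pbs_mom"):
    """Build a gdb command line for the daemon (assumes gdb is installed)."""
    return ["gdb", "--args", daemon]


def run_under_gdb(daemon="pbs_mom"):
    """Start the daemon under gdb with PBSDEBUG=yes so it does not background."""
    return subprocess.call(gdb_command(daemon), env=debug_env())
```

Calling run_under_gdb("pbs_mom") would open a gdb session for pbs_mom, typically as root on the node in question.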

PBSLOGLEVEL and PBSCOREDUMP must be added to the $PBSHOME/pbs_environment file, not just the current environment. To set these variables, add a line to the pbs_environment file in the form "variable=value" or just "variable". With "variable=value", the environment variable is set to the specified value. With "variable" alone, the environment variable takes its value from the current environment.
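
The two line forms described above can be sketched in Python as follows. This is a minimal sketch, not a TORQUE tool: the function names are hypothetical, the path to pbs_environment must be supplied by the caller, and the daemons are assumed to need a restart before they pick up the change.

```python
import os


def pbs_environment_lines(log_level=None, core_dump=False):
    """Return the lines to add for PBSLOGLEVEL and PBSCOREDUMP."""
    lines = []
    if log_level is not None:
        if not 0 <= log_level <= 7:
            raise ValueError("PBSLOGLEVEL must be between 0 and 7")
        # "variable=value" form: the value is set explicitly.
        lines.append("PBSLOGLEVEL=%d" % log_level)
    if core_dump:
        # "variable" form: the value is taken from the current environment.
        lines.append("PBSCOREDUMP")
    return lines


def append_to_pbs_environment(path, lines):
    """Append lines that are not already present; return the lines added."""
    content = ""
    if os.path.exists(path):
        with open(path) as handle:
            content = handle.read()
    existing = content.splitlines()
    missing = [line for line in lines if line not in existing]
    if not missing:
        return []
    # Start on a fresh line if the file does not end with a newline.
    prefix = "" if content == "" or content.endswith("\n") else "\n"
    with open(path, "a") as handle:
        handle.write(prefix + "\n".join(missing) + "\n")
    return missing
```

For example, append_to_pbs_environment(path, pbs_environment_lines(log_level=7, core_dump=True)) adds the lines PBSLOGLEVEL=7 and PBSCOREDUMP to the file at path, leaving existing lines untouched.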

TORQUE error codes

Error code name  Number  Description
PBSE_NONE        15000   No error
PBSE_UNKJOBID    15001   Unknown job identifier
PBSE_NOATTR      15002   Undefined attribute
PBSE_ATTRRO      15003   Attempt to set READ ONLY attribute
PBSE_IVALREQ     15004   Invalid request
PBSE_UNKREQ      15005   Unknown batch request
PBSE_TOOMANY     15006   Too many submit retries
PBSE_PERM        15007   No permission
PBSE_BADHOST     15008   Access from host not allowed
PBSE_JOBEXIST    15009   Job already exists
PBSE_SYSTEM      15010   System error occurred
PBSE_INTERNAL    15011   Internal server error occurred
PBSE_REGROUTE    15012   Parent job of dependent in route queue
PBSE_UNKSIG      15013   Unknown signal name
PBSE_BADATVAL    15014   Bad attribute value
PBSE_MODATRRUN   15015   Cannot modify attribute in run state
PBSE_BADSTATE    15016   Request invalid for job state
PBSE_UNKQUE      15018   Unknown queue name
PBSE_BADCRED     15019   Invalid credential in request
PBSE_EXPIRED     15020   Expired credential in request
PBSE_QUNOENB     15021   Queue not enabled
PBSE_QACESS      15022   No access permission for queue
PBSE_BADUSER     15023   Bad user - no password entry
PBSE_HOPCOUNT    15024   Max hop count exceeded
PBSE_QUEEXIST    15025   Queue already exists
PBSE_ATTRTYPE    15026   Incompatible queue attribute type
PBSE_QUEBUSY     15027   Queue busy (not empty)
PBSE_QUENBIG     15028   Queue name too long
PBSE_NOSUP       15029   Feature/function not supported
PBSE_QUENOEN     15030   Cannot enable queue, needs additional definition
PBSE_PROTOCOL    15031   Protocol (ASN.1) error
PBSE_BADATLST    15032   Bad attribute list structure
PBSE_NOCONNECTS  15033   No free connections
PBSE_NOSERVER    15034   No server to connect to
PBSE_UNKRESC     15035   Unknown resource
PBSE_QUENODFLT   15036   No default queue defined
PBSE_EXCQRESC    15037   Job exceeds queue resource limits
PBSE_NORERUN     15038   Job not rerunnable
PBSE_ROUTEREJ    15039   Route rejected by all destinations
PBSE_ROUTEEXPD   15040   Time in route queue expired
PBSE_MOMREJECT   15041   Request to MOM failed
PBSE_BADSCRIPT   15042   (qsub) Cannot access script file
PBSE_STAGEIN     15043   Stage-in of files failed
PBSE_RESCUNAV    15044   Resources temporarily unavailable
PBSE_BADGRP      15045   Bad group specified
PBSE_MAXQUED     15046   Max number of jobs in queue
PBSE_CKPBSY      15047   Checkpoint busy, may retry
PBSE_EXLIMIT     15048   Limit exceeds allowable
PBSE_BADACCT     15049   Bad account attribute value
PBSE_ALRDYEXIT   15050   Job already in exit state
PBSE_NOCOPYFILE  15051   Job files not copied
PBSE_CLEANEDOUT  15052   Unknown job id after clean init
PBSE_NOSYNCMSTR  15053   No master in sync set
PBSE_BADDEPEND   15054   Invalid dependency
PBSE_DUPLIST     15055   Duplicate entry in list
PBSE_DISPROTO    15056   Bad DIS based request protocol
PBSE_EXECTHERE   15057   Cannot execute there
PBSE_SISREJECT   15058   Sister rejected
PBSE_SISCOMM     15059   Sister could not communicate
PBSE_SVRDOWN     15060   Request rejected - server shutting down
PBSE_CKPSHORT    15061   Not all tasks could checkpoint
PBSE_UNKNODE     15062   Named node is not in the list
PBSE_UNKNODEATR  15063   Node attribute not recognized
PBSE_NONODES     15064   Server has no node list
PBSE_NODENBIG    15065   Node name is too big
PBSE_NODEEXIST   15066   Node name already exists
PBSE_BADNDATVAL  15067   Bad node attribute value
PBSE_MUTUALEX    15068   State values are mutually exclusive
PBSE_GMODERR     15069   Error(s) during global modification of nodes
PBSE_NORELYMOM   15070   Could not contact MOM
PBSE_NOTSNODE    15071   No time-shared nodes
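
When an error number shows up in a log or in command output, a small lookup can save a trip to the table. The Python sketch below parses rows in the "name number description" layout used above and reports what a given number means. The function names and the SAMPLE_ROWS constant are hypothetical; SAMPLE_ROWS holds only a few rows copied from the table, and a real lookup would be given the full table text.

```python
SAMPLE_ROWS = """\
PBSE_NONE 15000 No error
PBSE_UNKJOBID 15001 Unknown job identifier
PBSE_PERM 15007 No permission
PBSE_UNKQUE 15018 Unknown queue name
"""


def parse_error_table(text):
    """Map each error number to a (name, description) pair."""
    codes = {}
    for line in text.splitlines():
        # Split on any run of whitespace: name, number, rest of line.
        parts = line.split(None, 2)
        # Skip blank lines, the header, and anything that is not a code row.
        if len(parts) != 3:
            continue
        name, number, description = parts
        if not name.startswith("PBSE_") or not number.isdigit():
            continue
        codes[int(number)] = (name, description)
    return codes


def describe(codes, number):
    """Return a one-line summary, or a fallback for numbers not in the table."""
    if number not in codes:
        return "%d: not in table" % number
    name, description = codes[number]
    return "%d %s: %s" % (number, name, description)
```

With the sample rows loaded, describe(parse_error_table(SAMPLE_ROWS), 15007) returns "15007 PBSE_PERM: No permission", and a number absent from the table, such as 15017, falls through to the "not in table" message.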
