Debugging

TORQUE supports several diagnostic and debugging options, including the following:

PBSDEBUG environment variable - If set to 'yes', this variable prevents pbs_server, pbs_mom, and pbs_sched from backgrounding themselves, so each daemon can be launched directly under a debugger. Some client commands also provide additional diagnostic information when this value is set.

PBSLOGLEVEL environment variable - Specifies the logging verbosity level. Can be set to any value from 0 to 7 (default = 0).

PBSCOREDUMP environment variable - If set, a resource manager daemon that receives a SIGSEGV, SIGILL, SIGFPE, SIGSYS, or SIGTRAP signal creates a core file. The core dump is placed in the daemon's home directory ($PBSHOME/mom_priv for pbs_mom and $PBSHOME/server_priv for pbs_server).

To enable core dumping on a Red Hat system, add the following line to the /etc/init.d/pbs_mom and /etc/init.d/pbs_server scripts:

export DAEMON_COREFILE_LIMIT=unlimited

NDEBUG #define - If set at build time, causes additional low-level logging information to be written to stdout by the pbs_server and pbs_mom daemons.

tracejob reporting tool - Collects and reports logging and accounting information for specific jobs (for more information, see Using "tracejob" to Locate Job Failures).
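
As a rough illustration of the PBSDEBUG and core-dump options above, the following shell sketch wraps the two steps in helper functions. The function names (run_in_foreground, has_corefile_limit) are made up for this example and are not part of TORQUE. The gdb invocation and the pbs_mom install path shown in the comments are assumptions about the local system.

```shell
# Hypothetical helper: run a command with PBSDEBUG=yes set only for that
# process, so a TORQUE daemon started this way stays in the foreground.
run_in_foreground() {
    PBSDEBUG=yes "$@"
}

# Hypothetical helper: succeed if an init script already contains the
# "export DAEMON_COREFILE_LIMIT=unlimited" line described above.
# Commented-out copies of the line do not count.
has_corefile_limit() {
    grep -Eq '^[[:space:]]*export[[:space:]]+DAEMON_COREFILE_LIMIT=unlimited[[:space:]]*$' "$1"
}

# Example use (not run here; the pbs_mom path is an assumption):
#   run_in_foreground gdb /usr/local/sbin/pbs_mom
#   has_corefile_limit /etc/init.d/pbs_mom || echo "core dumps not enabled for pbs_mom"
```

Because PBSDEBUG is set only for the launched command, the environment of the calling shell is left unchanged.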

PBSLOGLEVEL and PBSCOREDUMP must be added to the $PBSHOME/pbs_environment file, not just the current environment. To set these variables, add a line to the pbs_environment file as either "variable=value" or just "variable". With "variable=value", the environment variable is set to the specified value. With "variable" alone, the environment variable takes its value from the current environment.
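
The two line formats can be sketched with a small shell helper. This is only an illustration: set_pbs_env is a hypothetical function, not a TORQUE command, and it assumes the pbs_environment file holds one entry per line with no surrounding whitespace.

```shell
# Hypothetical helper: add NAME or NAME=VALUE to a pbs_environment file,
# replacing any earlier entry for the same variable.
# Usage: set_pbs_env FILE NAME [VALUE]
set_pbs_env() {
    file=$1
    name=$2
    tmp=$(mktemp) || return 1
    # Copy every line except an existing "NAME" or "NAME=..." entry.
    if [ -f "$file" ]; then
        grep -v -E "^${name}(=.*)?\$" "$file" > "$tmp" || true
    fi
    if [ $# -ge 3 ]; then
        echo "${name}=$3" >> "$tmp"   # "variable=value" form
    else
        echo "${name}" >> "$tmp"      # "variable" form, value taken from the environment
    fi
    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example use (not run here):
#   set_pbs_env "$PBSHOME/pbs_environment" PBSLOGLEVEL 7
#   set_pbs_env "$PBSHOME/pbs_environment" PBSCOREDUMP
```

After the first call the file contains the line PBSLOGLEVEL=7, and after the second it also contains the bare line PBSCOREDUMP.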

TORQUE error codes

Error code name         Number  Description
PBSE_FLOOR              15000   No error
PBSE_UNKJOBID           15001   Unknown job identifier
PBSE_NOATTR             15002   Undefined attribute
PBSE_ATTRRO             15003   Attempt to set READ ONLY attribute
PBSE_IVALREQ            15004   Invalid request
PBSE_UNKREQ             15005   Unknown batch request
PBSE_TOOMANY            15006   Too many submit retries
PBSE_PERM               15007   No permission
PBSE_IFF_NOT_FOUND      15008   "pbs_iff" not found; unable to authenticate
PBSE_MUNGE_NOT_FOUND    15009   "munge" executable not found; unable to authenticate
PBSE_BADHOST            15010   Access from host not allowed
PBSE_JOBEXIST           15011   Job already exists
PBSE_SYSTEM             15012   System error occurred
PBSE_INTERNAL           15013   Internal server error occurred
PBSE_REGROUTE           15014   Parent job of dependent in route queue
PBSE_UNKSIG             15015   Unknown signal name
PBSE_BADATVAL           15016   Bad attribute value
PBSE_MODATRRUN          15017   Cannot modify attribute in run state
PBSE_BADSTATE           15018   Request invalid for job state
PBSE_UNKQUE             15020   Unknown queue name
PBSE_BADCRED            15021   Invalid credential in request
PBSE_EXPIRED            15022   Expired credential in request
PBSE_QUNOENB            15023   Queue not enabled
PBSE_QACESS             15024   No access permission for queue
PBSE_BADUSER            15025   Bad user - no password entry
PBSE_HOPCOUNT           15026   Max hop count exceeded
PBSE_QUEEXIST           15027   Queue already exists
PBSE_ATTRTYPE           15028   Incompatible queue attribute type
PBSE_QUEBUSY            15029   Queue busy (not empty)
PBSE_QUENBIG            15030   Queue name too long
PBSE_NOSUP              15031   Feature/function not supported
PBSE_QUENOEN            15032   Cannot enable queue, needs additional definition
PBSE_PROTOCOL           15033   Protocol (ASN.1) error
PBSE_BADATLST           15034   Bad attribute list structure
PBSE_NOCONNECTS         15035   No free connections
PBSE_NOSERVER           15036   No server to connect to
PBSE_UNKRESC            15037   Unknown resource
PBSE_EXCQRESC           15038   Job exceeds queue resource limits
PBSE_QUENODFLT          15039   No default queue defined
PBSE_NORERUN            15040   Job not rerunnable
PBSE_ROUTEREJ           15041   Route rejected by all destinations
PBSE_ROUTEEXPD          15042   Time in route queue expired
PBSE_MOMREJECT          15043   Request to MOM failed
PBSE_BADSCRIPT          15044   (qsub) Cannot access script file
PBSE_STAGEIN            15045   Stage-In of files failed
PBSE_RESCUNAV           15046   Resources temporarily unavailable
PBSE_BADGRP             15047   Bad group specified
PBSE_MAXQUED            15048   Max number of jobs in queue
PBSE_CKPBSY             15049   Checkpoint busy, may retry
PBSE_EXLIMIT            15050   Limit exceeds allowable
PBSE_BADACCT            15051   Bad account attribute value
PBSE_ALRDYEXIT          15052   Job already in exit state
PBSE_NOCOPYFILE         15053   Job files not copied
PBSE_CLEANEDOUT         15054   Unknown job id after clean init
PBSE_NOSYNCMSTR         15055   No master in sync set
PBSE_BADDEPEND          15056   Invalid dependency
PBSE_DUPLIST            15057   Duplicate entry in list
PBSE_DISPROTO           15058   Bad DIS based request protocol
PBSE_EXECTHERE          15059   Cannot execute there
PBSE_SISREJECT          15060   Sister rejected
PBSE_SISCOMM            15061   Sister could not communicate
PBSE_SVRDOWN            15062   Request rejected - server shutting down
PBSE_CKPSHORT           15063   Not all tasks could checkpoint
PBSE_UNKNODE            15064   Named node is not in the list
PBSE_UNKNODEATR         15065   Node-attribute not recognized
PBSE_NONODES            15066   Server has no node list
PBSE_NODENBIG           15067   Node name is too big
PBSE_NODEEXIST          15068   Node name already exists
PBSE_BADNDATVAL         15069   Bad node-attribute value
PBSE_MUTUALEX           15070   State values are mutually exclusive
PBSE_GMODERR            15071   Error(s) during global modification of nodes
PBSE_NORELYMOM          15072   Could not contact MOM
PBSE_NOTSNODE           15073   No time-shared nodes
PBSE_JOBTYPE            15074   Wrong job type
PBSE_BADACLHOST         15075   Bad ACL entry in host list
PBSE_MAXUSERQUED        15076   Maximum number of jobs already in queue for user
PBSE_BADDISALLOWTYPE    15077   Bad type in "disallowed_types" list
PBSE_NOINTERACTIVE      15078   Interactive jobs not allowed in queue
PBSE_NOBATCH            15079   Batch jobs not allowed in queue
PBSE_NORERUNABLE        15080   Rerunable jobs not allowed in queue
PBSE_NONONRERUNABLE     15081   Non-rerunable jobs not allowed in queue
PBSE_UNKARRAYID         15082   Unknown array ID
PBSE_BAD_ARRAY_REQ      15083   Bad job array request
PBSE_TIMEOUT            15084   Time out
PBSE_JOBNOTFOUND        15085   Job not found
PBSE_NOFAULTTOLERANT    15086   Fault tolerant jobs not allowed in queue
PBSE_NOFAULTINTOLERANT  15087   Only fault tolerant jobs allowed in queue
PBSE_NOJOBARRAYS        15088   Job arrays not allowed in queue
PBSE_RELAYED_TO_MOM     15089   Request was relayed to a MOM
PBSE_MEM_MALLOC         15090   Failed to allocate memory for memmgr
PBSE_MUTEX              15091   Failed to allocate controlling mutex (lock/unlock)
PBSE_TRHEADATTR         15092   Failed to set thread attributes
PBSE_THREAD             15093   Failed to create thread
PBSE_SELECT             15094   Failed to select socket
PBSE_SOCKET_FAULT       15095   Failed to get connection to socket
PBSE_SOCKET_WRITE       15096   Failed to write data to socket
PBSE_SOCKET_READ        15097   Failed to read data from socket
PBSE_SOCKET_CLOSE       15098   Socket closed
PBSE_SOCKET_LISTEN      15099   Failed to listen in on socket
PBSE_AUTH_INVALID       15100   Invalid auth type in request
PBSE_NOT_IMPLEMENTED    15101   Functionality not yet implemented
PBSE_QUENOTAVAILABLE    15102   Queue is not available
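
When a numeric code turns up during troubleshooting, a lookup helper along the following lines can save a trip to the table. This is a minimal sketch: pbs_errname is a hypothetical function, not a TORQUE utility, and it covers only a handful of rows copied from the table above.

```shell
# Hypothetical helper: translate a numeric TORQUE error code into its
# name and description. Only a few table rows are included here;
# extend the case list with further rows as needed.
pbs_errname() {
    case "$1" in
        15001) echo "PBSE_UNKJOBID: Unknown job identifier" ;;
        15007) echo "PBSE_PERM: No permission" ;;
        15010) echo "PBSE_BADHOST: Access from host not allowed" ;;
        15020) echo "PBSE_UNKQUE: Unknown queue name" ;;
        15036) echo "PBSE_NOSERVER: No server to connect to" ;;
        15046) echo "PBSE_RESCUNAV: Resources temporarily unavailable" ;;
        15084) echo "PBSE_TIMEOUT: Time out" ;;
        *)     echo "code not in this lookup list: $1"; return 1 ;;
    esac
}

# Example use:
#   pbs_errname 15020
```

For a code that is not in the list, the function prints a notice and returns a non-zero status, so it can be used in a shell conditional.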

© 2014 Adaptive Computing