TORQUE provides administrators the ability to run scripts before and/or after each job executes. With such a script, a site can prepare systems, perform node health checks, prepend and append text to output and error log files, cleanup systems, and so forth.
The following table shows which MOM runs which script. All scripts must be in the TORQUE_HOME/mom_priv/ directory and be available on every compute node. Mother Superior, as referenced in the following table, is the pbs_mom on the first node allocated to the job, and the term Sisters refers to the pbs_moms on the job's allocated nodes (note that the Mother Superior is also a sister node).
Script | Execution Location | Executed as | Execution Directory | File Permissions |
---|---|---|---|---|
prologue | Mother Superior | root | TORQUE_HOME/mom_priv/ | readable and executable by root and NOT writable by anyone besides root (e.g., -r-x------) |
epilogue | Mother Superior | root | TORQUE_HOME/mom_priv/ | readable and executable by root and NOT writable by anyone besides root (e.g., -r-x------) |
prologue.user | Mother Superior | user | TORQUE_HOME/mom_priv/ | readable and executable by root and other (e.g., -r-x---r-x) |
epilogue.user | Mother Superior | user | TORQUE_HOME/mom_priv/ | readable and executable by root and other (e.g., -r-x---r-x) |
prologue.parallel | Sister | user | TORQUE_HOME/mom_priv/ | readable and executable by user and NOT writable by anyone besides user (e.g., -r-x---r-x) |
epilogue.parallel* | Sister | user | TORQUE_HOME/mom_priv/ | readable and executable by user and NOT writable by anyone besides user (e.g., -r-x---r-x) |
epilogue.precancel** | Mother Superior | user | TORQUE_HOME/mom_priv/ | readable and executable by user and NOT writable by anyone besides user (e.g., -r-x---r-x) |

* Available in Version 2.1 and later.
** This script is run after a job cancel request is received from pbs_server and before a kill signal is sent to the job process.
When jobs start, the order of script execution is prologue followed by prologue.user. On job exit, the order of execution is epilogue.user followed by epilogue, unless the job is canceled, in which case epilogue.precancel is executed first. The epilogue.parallel script is executed only on the Sister nodes when the job completes.
The epilogue and prologue scripts are controlled by the system administrator. However, beginning in TORQUE version 2.4, a user epilogue and prologue script can be used on a per-job basis. See G.2 Per Job Prologue and Epilogue Scripts for more information.
Root squashing is now supported for epilogue and prologue scripts.
The prologue and epilogue scripts can be very simple. On most systems, the script must declare the execution shell using the #!<SHELL> syntax (e.g., #!/bin/sh). In addition, the script may process the context-sensitive arguments that TORQUE passes to it.
Prolog Environment
The following arguments are passed to the prologue, prologue.user, and prologue.parallel scripts:
Argument | Description |
---|---|
argv[1] | job id |
argv[2] | job execution user name |
argv[3] | job execution group name |
argv[4] | job name (TORQUE 1.2.0p4 and higher only) |
argv[5] | list of requested resource limits (TORQUE 1.2.0p4 and higher only) |
argv[6] | job execution queue (TORQUE 1.2.0p4 and higher only) |
argv[7] | job account (TORQUE 1.2.0p4 and higher only) |
Epilog Environment
TORQUE supplies the following arguments to the epilogue, epilogue.user, epilogue.precancel, and epilogue.parallel scripts:
Argument | Description |
---|---|
argv[1] | job id |
argv[2] | job execution user name |
argv[3] | job execution group name |
argv[4] | job name |
argv[5] | session id |
argv[6] | list of requested resource limits |
argv[7] | list of resources used by job |
argv[8] | job execution queue |
argv[9] | job account |
argv[10] | job exit code |
The epilogue.precancel script is run after a job cancel request is received by the MOM and before any signals are sent to job processes. If this script exists, it is run whether the canceled job was active or idle.
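As a concrete illustration, the following is a minimal epilogue.precancel sketch. It only records the cancellation (its standard output is attached to the job's output file, as described below); any real pre-kill cleanup would be site specific:

    #!/bin/sh
    # epilogue.precancel sketch: runs after a cancel request arrives and before
    # the job's processes are signaled. Arguments follow the epilogue list above.
    echo "Job $1 (name: $4) received a cancel request; performing pre-kill cleanup."
    # Site-specific actions (e.g., triggering an application checkpoint) would go here.
    exit 0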
For all scripts, the environment passed to the script is empty. Also, standard input for these scripts is connected to a system-dependent file; currently, for all systems this is /dev/null. Except for the epilogue scripts of an interactive job, prologue.parallel, and epilogue.parallel, the standard output and error are connected to the output and error files associated with the job. For an interactive job, since the pseudo-terminal connection is released after the job completes, the standard output and error point to /dev/null. For prologue.parallel and epilogue.parallel, the user must redirect stdout and stderr manually.
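Because prologue.parallel and epilogue.parallel output is not attached to the job, a script that needs to record anything must set up its own redirection. A minimal sketch follows; the log path used here is an assumption for illustration and should be replaced with a location appropriate for your site:

    #!/bin/sh
    # prologue.parallel sketch: send all output to a per-job log file, since
    # TORQUE does not attach this script's stdout/stderr to the job's output.
    jobid=$1
    # Assumed log location for illustration; pick a directory writable on your nodes.
    exec >>/var/spool/torque/mom_logs/prologue.parallel.$jobid.log 2>&1
    echo "prologue.parallel started on $(hostname) for job $jobid"
    exit 0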
TORQUE supports per-job prologue and epilogue scripts when using the qsub -l option. The syntax is qsub -l prologue=<prologue_script_path>,epilogue=<epilogue_script_path> <script>. The path can be either relative (from the directory where the job is submitted) or absolute. The files must be owned by the user, who must have at least execute and write privileges on them, and they must not be writable by group or other.
    TORQUE_HOME/mom_priv/:
    -r-x------ 1 usertom usertom 24 2009-11-09 16:11 prologue_script.sh
    -r-x------ 1 usertom usertom 24 2009-11-09 16:11 epilogue_script.sh
    $ qsub -l prologue=/home/usertom/dev/prologue_script.sh,epilogue=/home/usertom/dev/epilogue_script.sh job14.pl
TORQUE takes preventive measures against runaway prologue and epilogue scripts by placing an alarm around each script's execution. By default, TORQUE sets the alarm to go off after 5 minutes of execution. If a script exceeds this time, it is terminated and the node is marked down. This timeout can be adjusted by setting the prologalarm parameter in the mom_priv/config file.
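For example, to give these scripts up to 10 minutes before the alarm fires, the MOM configuration could contain a line such as the following (a sketch; confirm the exact parameter name and units against the pbs_mom documentation for your TORQUE version):

    # TORQUE_HOME/mom_priv/config
    # Allow prologue/epilogue scripts to run for up to 600 seconds.
    $prologalarm 600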
While TORQUE is executing the epilogue, epilogue.user, or epilogue.precancel scripts, the job will be in the E (exiting) state.
If the prologue script executes successfully, it should exit with a zero status. Otherwise, the script should return the appropriate error code as defined in the table below. pbs_mom reports the script's exit status to pbs_server, which in turn takes the associated action. The following table describes each exit code for the prologue scripts and the action taken.
Error | Description | Action |
---|---|---|
-4 | The script timed out | Job will be requeued |
-3 | The wait(2) call returned an error | Job will be requeued |
-2 | Input file could not be opened | Job will be requeued |
-1 | Permission error (script is not owned by root, or is writable by others) | Job will be requeued |
0 | Successful completion | Job will run |
1 | Abort exit code | Job will be aborted |
>1 | other | Job will be requeued |
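A common use of the abort exit code is a node health check in the prologue, so that a job is kept off a broken node rather than failing later. The following is a minimal sketch; the /scratch directory checked here is an assumption chosen for illustration, not something TORQUE requires:

    #!/bin/sh
    # prologue sketch: abort the job if a required scratch directory is not
    # accessible on this node; otherwise allow the job to run.
    if ! df -P /scratch >/dev/null 2>&1 ; then
        echo "prologue: /scratch is unavailable on $(hostname); aborting job $1"
        exit 1    # abort exit code: the job will be aborted
    fi
    exit 0    # successful completion: the job will run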
Following are example prologue and epilogue scripts that write the arguments passed to them to the job's standard output file:
prologue

Script:

    #!/bin/sh
    echo "Prologue Args:"
    echo "Job ID: $1"
    echo "User ID: $2"
    echo "Group ID: $3"
    echo ""
    exit 0

stdout:

    Prologue Args:
    Job ID: 13724.node01
    User ID: user1
    Group ID: user1

epilogue

Script:

    #!/bin/sh
    echo "Epilogue Args:"
    echo "Job ID: $1"
    echo "User ID: $2"
    echo "Group ID: $3"
    echo "Job Name: $4"
    echo "Session ID: $5"
    echo "Resource List: $6"
    echo "Resources Used: $7"
    echo "Queue Name: $8"
    echo "Account String: $9"
    echo ""
    exit 0

stdout:

    Epilogue Args:
    Job ID: 13724.node01
    User ID: user1
    Group ID: user1
    Job Name: script.sh
    Session ID: 28244
    Resource List: neednodes=node01,nodes=1,walltime=00:01:00
    Resources Used: cput=00:00:00,mem=0kb,vmem=0kb,walltime=00:00:07
    Queue Name: batch
    Account String:
The Ohio Supercomputer Center contributed the following scripts:
"prologue creates a unique temporary directory on each node assigned to a job before the job begins to run, and epilogue deletes that directory after the job completes.
Having a separate temporary directory on each node is probably not as good as having a good, high performance parallel filesystem. |
    #!/bin/sh
    # Create TMPDIR on all the nodes
    # Copyright 1999, 2000, 2001 Ohio Supercomputer Center
    # prologue gets 3 arguments:
    # 1 -- jobid
    # 2 -- userid
    # 3 -- grpid
    #
    jobid=$1
    user=$2
    group=$3
    nodefile=/var/spool/pbs/aux/$jobid
    if [ -r $nodefile ] ; then
        nodes=$(sort $nodefile | uniq)
    else
        nodes=localhost
    fi
    tmp=/tmp/pbstmp.$jobid
    for i in $nodes ; do
        ssh $i mkdir -m 700 $tmp \&\& chown $user.$group $tmp
    done
    exit 0
    #!/bin/sh
    # Clear out TMPDIR
    # Copyright 1999, 2000, 2001 Ohio Supercomputer Center
    # epilogue gets 9 arguments:
    # 1 -- jobid
    # 2 -- userid
    # 3 -- grpid
    # 4 -- job name
    # 5 -- sessionid
    # 6 -- resource limits
    # 7 -- resources used
    # 8 -- queue
    # 9 -- account
    #
    jobid=$1
    nodefile=/var/spool/pbs/aux/$jobid
    if [ -r $nodefile ] ; then
        nodes=$(sort $nodefile | uniq)
    else
        nodes=localhost
    fi
    tmp=/tmp/pbstmp.$jobid
    for i in $nodes ; do
        ssh $i rm -rf $tmp
    done
    exit 0
Prologue, prologue.user, and prologue.parallel scripts can have dramatic effects on job scheduling if written improperly.