TORQUE supports job preemption by allowing authorized users to suspend and resume jobs. Preemption is implemented in one of two ways. If the node supports OS-level preemption, TORQUE detects this during the configure process and enables it. Otherwise, the MOM may be configured to launch a custom checkpoint script to preempt a job. Using a custom checkpoint script requires that the job know how to resume itself from a checkpoint after the preemption occurs.
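The requirement that a job resume itself from a checkpoint can be illustrated with a minimal Python sketch. It is not part of TORQUE and does not show how the checkpoint script talks to the job. The state layout, the function names, and the `stop_after` argument (which only simulates a preemption) are all assumptions made for illustration.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {"next_step": 0, "total": 0}


def save_state(path, state):
    """Write the checkpoint to a temporary file, then rename it into place,
    so an interruption in the middle of a write cannot leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run(path, steps, stop_after=None):
    """Perform `steps` units of work, checkpointing after each unit.

    `stop_after` is a hypothetical knob that simulates preemption by
    returning early after that many units in this invocation.
    """
    state = load_state(path)  # resume from the checkpoint if one exists
    done_here = 0
    while state["next_step"] < steps:
        if stop_after is not None and done_here >= stop_after:
            return state  # simulated preemption
        state["total"] += state["next_step"]  # the "work" for this unit
        state["next_step"] += 1
        save_state(path, state)
        done_here += 1
    return state
```

Running `run("job.ckpt", 10, stop_after=4)` and then `run("job.ckpt", 10)` completes the remaining six units without repeating the first four, which is the behavior a preempted job needs.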
To configure the MOM to support a checkpoint script, the $checkpoint_script parameter must be set in the MOM's configuration file, found in TORQUE_HOME/mom_priv/config. The checkpoint script should have execute permissions set. A typical configuration file might look as follows:
$pbsserver node06
$logevent 255
$restricted *.mycluster.org
$checkpoint_script /opt/moab/tools/mom-checkpoint.sh
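The two requirements above (the parameter is set, and the script it names is executable) can be checked with a short Python sketch. This is an illustrative helper, not a TORQUE tool. The function names are hypothetical, and the parser assumes each directive occupies its own line in the form `$name value`.

```python
import os


def find_checkpoint_script(config_text):
    """Return the path given to $checkpoint_script, or None if it is not set.

    Assumes one directive per line, written as '$name value'.
    """
    for line in config_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "$checkpoint_script":
            return fields[1]
    return None


def is_executable(path):
    """True if `path` is a regular file that the current user may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


def check_mom_config(config_path):
    """Return a list of human-readable problems; an empty list means none found."""
    with open(config_path) as fh:
        script = find_checkpoint_script(fh.read())
    if script is None:
        return ["$checkpoint_script is not set in " + config_path]
    if not is_executable(script):
        return [script + " is missing or not executable"]
    return []
```

A call such as `check_mom_config("/var/spool/torque/mom_priv/config")` would report either condition; the path shown is only an example, since TORQUE_HOME depends on the installation.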
The second step in enabling the checkpoint script is to set the value of MOM_CHECKPOINT to 1 in /src/include/pbs_config.h. In some instances, MOM_CHECKPOINT may already be defined as 1. Because this is a compile-time definition, TORQUE must be rebuilt for a change to take effect. The line should read as follows:
#define MOM_CHECKPOINT 1