TORQUE supports job preemption by allowing authorized users to suspend and resume jobs. Preemption works in one of two ways. If the node supports OS-level preemption, TORQUE detects this during the configure process and enables it. Otherwise, the MOM can be configured to launch a custom checkpoint script to preempt a job. Using a custom checkpoint script requires that the job know how to resume itself from a checkpoint after the preemption occurs.
Configuring a checkpoint script on a MOM
To configure the MOM to support a checkpoint script, set the $checkpoint_script parameter in the MOM's configuration file, TORQUE_HOME/mom_priv/config. The checkpoint script should have execute permissions set. A typical configuration file might look as follows:
mom_priv/config:
$pbsserver node06
$logevent 255
$restricted *.mycluster.org
$checkpoint_script /opt/moab/tools/mom-checkpoint.sh
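Since the MOM can only launch a script that is executable, it may help to verify the configured path before relying on it. The following is a minimal sketch, not part of TORQUE: check_checkpoint_script is a hypothetical helper that reads a MOM configuration file, extracts the path given to $checkpoint_script, and confirms that the path is a regular file with execute permission. Using the last entry when several are present is an assumption of this sketch, not documented TORQUE behavior.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with TORQUE): check that the script named
# by $checkpoint_script in a MOM config file exists and is executable.
check_checkpoint_script() {
    config_file="$1"

    # Second field of the last line whose first field is $checkpoint_script.
    # Inside awk, "$checkpoint_script" is a plain string literal.
    script_path=$(awk '$1 == "$checkpoint_script" { path = $2 } END { print path }' "$config_file")

    if [ -z "$script_path" ]; then
        echo "no \$checkpoint_script entry in $config_file" >&2
        return 1
    fi

    if [ ! -f "$script_path" ] || [ ! -x "$script_path" ]; then
        echo "$script_path is missing or not executable" >&2
        return 2
    fi

    # Print the validated path so callers can capture it.
    echo "$script_path"
}
```

Called with the path to the MOM configuration file as its only argument, the function prints the script path and returns 0 on success, returns 1 if no $checkpoint_script entry is found, and returns 2 if the script is missing or lacks execute permission.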
The second step in enabling the checkpoint script is to set MOM_CHECKPOINT to 1 in /src/include/pbs_config.h. (In some instances, MOM_CHECKPOINT may already be defined as 1.) Because this is a compile-time definition, TORQUE must be rebuilt for the change to take effect. The line should read as follows:
/src/include/pbs_config.h:
#define MOM_CHECKPOINT 1
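The header edit can also be checked or applied from the shell. The sketch below is illustrative only: mom_checkpoint_enabled and enable_mom_checkpoint are hypothetical helper names, and the header path is passed in as an argument rather than assumed. The first function reports whether MOM_CHECKPOINT is already defined as 1; the second rewrites an existing definition to 1 and leaves the file unchanged if no definition with a value is present.

```shell
#!/bin/sh
# Hypothetical helpers (not part of TORQUE) for inspecting pbs_config.h.

# Succeeds if the header already contains "#define MOM_CHECKPOINT 1".
mom_checkpoint_enabled() {
    header="$1"
    grep -Eq '^[[:space:]]*#[[:space:]]*define[[:space:]]+MOM_CHECKPOINT[[:space:]]+1([[:space:]]|$)' "$header"
}

# Rewrites an existing MOM_CHECKPOINT definition so that its value is 1.
# Fails if the header has no such definition after the rewrite.
enable_mom_checkpoint() {
    header="$1"

    # Nothing to do when the value is already 1.
    if mom_checkpoint_enabled "$header"; then
        return 0
    fi

    # Write to a temporary file and move it into place, which avoids
    # relying on the non-portable in-place option of sed.
    tmp_header="$header.tmp.$$"
    sed -E 's/^([[:space:]]*#[[:space:]]*define[[:space:]]+MOM_CHECKPOINT)[[:space:]]+.*$/\1 1/' \
        "$header" > "$tmp_header" && mv "$tmp_header" "$header"

    # Confirm the definition now reads as expected.
    mom_checkpoint_enabled "$header"
}
```

Both functions take the path to pbs_config.h as their only argument. A non-zero return from enable_mom_checkpoint indicates that the header did not contain a MOM_CHECKPOINT definition to rewrite, in which case the file should be inspected by hand.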