TORQUE supports job preemption by allowing authorized users to suspend and resume jobs. Preemption works in one of two ways. If the node supports OS-level preemption, TORQUE detects this during the configure process and enables it. Otherwise, the MOM can be configured to launch a custom checkpoint script to preempt a job. With a custom checkpoint script, the job itself must know how to resume from a checkpoint after it has been preempted.
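To illustrate what "resuming from a checkpoint" can mean for a job, here is a minimal sketch in Python. It is not part of TORQUE: the state file layout, the function names (load_state, save_state, run), and the stop_after argument used to simulate a preemption are all assumptions made for this example.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint file exists."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {"next_step": 0, "total": 0}


def save_state(path, state):
    """Write the state to a temporary file, then rename it into place,
    so an interruption cannot leave a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run(path, steps, stop_after=None):
    """Run up to `steps` units of work, checkpointing after each one.

    stop_after is hypothetical: it ends the loop early to mimic a job
    that was preempted partway through.
    """
    state = load_state(path)
    done = 0
    while state["next_step"] < steps:
        if stop_after is not None and done >= stop_after:
            break
        state["total"] += state["next_step"]  # stand-in for real work
        state["next_step"] += 1
        save_state(path, state)
        done += 1
    return state
```

Calling run() a second time with the same path picks up at the saved step instead of starting over, which is the behavior a job needs when it is restarted after preemption.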
Configuring a checkpoint script on a MOM
To configure the MOM to use a checkpoint script, set the $checkpoint_script parameter in the MOM's configuration file, TORQUE_HOME/mom_priv/config. The checkpoint script should have execute permissions set. A typical configuration file might look as follows:
mom_priv/config:
$pbsserver node06
$logevent 255
$restricted *.mycluster.org
$checkpoint_script /opt/moab/tools/mom-checkpoint.sh
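One way to sanity-check this setup is a small Python helper that reads the MOM configuration file and confirms the referenced script exists and is executable. This is a sketch only: the helper names are hypothetical, and it assumes each parameter sits on its own line as a name followed by a whitespace-separated value, as in the example above.

```python
import os


def find_checkpoint_script(config_path):
    """Return the path given to $checkpoint_script, or None if it is not set."""
    with open(config_path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) >= 2 and fields[0] == "$checkpoint_script":
                return fields[1]
    return None


def checkpoint_script_ready(config_path):
    """True if the configured script exists and the current user may execute it."""
    script = find_checkpoint_script(config_path)
    return (
        script is not None
        and os.path.isfile(script)
        and os.access(script, os.X_OK)
    )
```

For example, checkpoint_script_ready("mom_priv/config"), with the path adjusted to the local TORQUE_HOME, returns False when the parameter is missing or the script lacks execute permission.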
The second step in enabling the checkpoint script is to set MOM_CHECKPOINT to 1 in /src/include/pbs_config.h. (In some instances, MOM_CHECKPOINT may already be defined as 1.) Because this is a compile-time setting, TORQUE must be rebuilt for a change to take effect. The line should read as follows:
/src/include/pbs_config.h:
#define MOM_CHECKPOINT 1
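Before rebuilding, it may help to confirm what the header currently defines. The following Python sketch scans header text for the MOM_CHECKPOINT definition; the function name is hypothetical, and the sketch assumes the macro is set with a plain #define line like the one shown above.

```python
import re


def mom_checkpoint_value(header_text):
    """Return the value of the last '#define MOM_CHECKPOINT <value>' line,
    or None if the macro is not defined in the given text."""
    matches = re.findall(
        r"^[ \t]*#[ \t]*define[ \t]+MOM_CHECKPOINT[ \t]+(\S+)",
        header_text,
        flags=re.MULTILINE,
    )
    return matches[-1] if matches else None
```

Reading the contents of pbs_config.h and passing them to mom_checkpoint_value() should yield "1" once the change has been made.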
Related topics