Checkpointing a job

To checkpoint a job, issue the qhold command. This writes an image file representing the state of the process to disk. By default, the image file is written to the directory /var/spool/torque/checkpoint.

This default can be changed at the queue level with the qmgr command. For example, the command qmgr -c "set queue batch checkpoint_dir=/tmp" changes the checkpoint directory to /tmp for the queue 'batch'. The quotes are needed so that the shell passes the whole directive to qmgr as a single argument to -c.
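The queue-level change can be wrapped in a small shell function, as in the following minimal sketch. The function name set_queue_checkpoint_dir is hypothetical and not part of TORQUE; only the qmgr directive itself comes from the text above.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): set the checkpoint directory
# for one queue.
#   $1 = queue name, $2 = checkpoint directory
set_queue_checkpoint_dir() {
    # The directive is quoted so that qmgr receives it as a single
    # argument to -c.
    qmgr -c "set queue $1 checkpoint_dir=$2"
}
```

Calling set_queue_checkpoint_dir batch /tmp would issue the same command as the example above. Changing queue attributes normally requires manager privileges on the server.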

The default directory can also be overridden for an individual job at submission time with the qsub command line option -c dir=/tmp.
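A submission-time override might look like the following sketch. The function name submit_with_checkpoint_dir and the job script name used in the usage note are hypothetical; only the -c dir= option comes from the text above.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): submit a job script with a
# per-job checkpoint directory.
#   $1 = checkpoint directory, $2 = path to the job script
submit_with_checkpoint_dir() {
    # Both values are quoted so paths containing spaces survive intact.
    qsub -c "dir=$1" "$2"
}
```

For example, submit_with_checkpoint_dir /tmp myjob.sh runs qsub -c dir=/tmp myjob.sh, where myjob.sh stands in for a real job script.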

The name of the checkpoint directory and the name of the checkpoint image file become attributes of the job and can be viewed with the command qstat -f. Look for the names checkpoint_dir and checkpoint_name in the output. The checkpoint_name attribute is set when the image file is created, so it does not appear if no checkpoint has been taken.
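To pull a single attribute out of the qstat -f output, a filter along the following lines may help. This is a sketch that assumes each attribute appears on its own indented line in the form "name = value"; the function name job_attr is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): print the value of one job
# attribute from `qstat -f` output supplied on standard input.
#   $1 = attribute name, for example checkpoint_dir
# Assumption: attributes are printed as indented "name = value" lines.
job_attr() {
    awk -v name="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)        # drop leading indentation
            prefix = name " = "
            if (index(line, prefix) == 1) { # line starts with "name = "
                print substr(line, length(prefix) + 1)
                exit
            }
        }'
}
```

Typical use would be qstat -f JOBID | job_attr checkpoint_name, where JOBID is a placeholder for a real job identifier. The function prints nothing when the attribute is absent, which is the expected result for checkpoint_name before any checkpoint has been taken. Values that qstat wraps onto continuation lines are not rejoined by this sketch.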

A job can also be checkpointed without stopping or holding it by using the qchkpt command.
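The two ways of taking a checkpoint can be placed side by side in one helper, as sketched here. The function name checkpoint_job and its "hold" keyword are hypothetical conventions for this example; qhold and qchkpt are the commands named in the text.

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): checkpoint a job, optionally
# holding it as well.
#   $1 = job ID
#   $2 = "hold" to checkpoint and hold the job; anything else (or nothing)
#        checkpoints the job and leaves it running
checkpoint_job() {
    if [ "$2" = "hold" ]; then
        qhold "$1"
    else
        qchkpt "$1"
    fi
}
```

With this helper, checkpoint_job JOBID hold issues qhold, while checkpoint_job JOBID issues qchkpt; JOBID is a placeholder for a real job identifier.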

© 2014 Adaptive Computing