4.329 Starting a Checkpointable Job

Not every job is checkpointable. A job for which checkpointing is desirable must be started with the -c command line option. This option takes a comma-separated list of arguments that control checkpointing behavior. The valid options available in the 2.4 version of Torque are shown below.

Option            Description
----------------  -----------
none              No checkpointing (not highly useful, but included for completeness).
enabled           Specify that checkpointing is allowed, but must be explicitly invoked by either the qhold or qchkpt commands.
shutdown          Specify that checkpointing is to be done on a job at pbs_mom shutdown.
periodic          Specify that periodic checkpointing is enabled. The default interval is 10 minutes and can be changed by the $checkpoint_interval option in the MOM configuration file, or by specifying an interval when the job is submitted.
interval=minutes  Specify the checkpoint interval in minutes.
depth=number      Specify a number (depth) of checkpoint images to be kept in the checkpoint directory.
dir=path          Specify a checkpoint directory (default is /var/spool/torque/checkpoint).
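
As a rough illustration of how these options combine, the sketch below joins several of them into the single comma-separated argument that -c takes, then prints the resulting qsub command instead of running it. The helper name join_ckpt_opts, the script name job.sh, and the directory /scratch/ckpt are placeholders chosen for this example and are not part of Torque.

```shell
#!/bin/bash
# Hypothetical helper: join checkpoint options into the comma-separated
# list that qsub -c expects. Not part of Torque.
join_ckpt_opts() {
    local IFS=,
    echo "$*"
}

# Periodic checkpoints every 5 minutes, keeping up to 3 images in a
# placeholder directory.
ckpt_opts=$(join_ckpt_opts periodic interval=5 depth=3 dir=/scratch/ckpt)

# Print the command instead of running it, so the sketch works without Torque.
echo "qsub -c $ckpt_opts job.sh"
```

Building the list in one place keeps the options readable when several are used together. To submit for real, run the printed command on a host where qsub is available.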

Example 4-214: Sample test program

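#include <stdio.h>
#include <unistd.h>   /* sleep() */

/* Print a counter once per second for 100 seconds so that the job
   runs long enough to be checkpointed. */
int main(int argc, char *argv[])
{
    int i;

    for (i = 0; i < 100; i++)
    {
        printf("i = %d\n", i);
        fflush(stdout);   /* make each line visible immediately */
        sleep(1);
    }

    return 0;
}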

Example 4-215: Instructions for building test program

> gcc -o test test.c

Example 4-216: Sample test script

#!/bin/bash
./test

Example 4-217: Starting the test job

> qstat
> qsub -c enabled,periodic,shutdown,interval=1 test.sh
77.jakaa.cridomain
> qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
77.jakaa                  test.sh          jsmith                 0 Q batch
>

If you have no scheduler running, you might need to start the job with qrun.
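
One way to tell whether the job is still waiting is to read the S (state) column of the qstat listing. The sketch below is only an illustration: job_state is a hypothetical helper, the sample listing is copied from the example above instead of being read from a live server, and the qrun command is printed, not executed.

```shell
#!/bin/bash
# Hypothetical helper: read qstat's default listing on stdin and print the
# state column (S) for the job whose id starts with the given prefix.
job_state() {
    awk -v id="$1" 'index($1, id) == 1 { print $5 }'
}

# Sample listing copied from the example above; in practice pipe in `qstat`.
sample='Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
77.jakaa                  test.sh          jsmith                 0 Q batch'

state=$(printf '%s\n' "$sample" | job_state 77.jakaa)

# A job still in state Q has not been started. Print the qrun command that
# could be issued for it; nothing is run here.
if [ "$state" = "Q" ]; then
    echo "qrun 77.jakaa.cridomain"
fi
```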

As this program runs, it writes its output to a file in /var/spool/torque/spool. This file can be observed with the command tail -f.
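
The name of the output file depends on the job ID, so the following sketch does not guess it and simply picks the most recently modified file in the spool directory. The helper newest_file is hypothetical, and the tail command is printed instead of being executed so that the sketch also runs on a machine without Torque.

```shell
#!/bin/bash
# Hypothetical helper: print the path of the most recently modified file in
# a directory, or fail if the directory is empty or unreadable.
newest_file() {
    local dir=$1 name
    name=$(ls -t "$dir" 2>/dev/null | head -n 1)
    [ -n "$name" ] && echo "$dir/$name"
}

spool_dir=/var/spool/torque/spool   # path taken from the text above
if out=$(newest_file "$spool_dir"); then
    echo "tail -f $out"             # printed, not executed
else
    echo "no output file found in $spool_dir"
fi
```

If several jobs are running on the same node, the newest file may belong to a different job, so check the file name against the job ID before relying on it.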


© 2017 Adaptive Computing