Not every job is checkpointable. A job for which checkpointing is desirable must be submitted with the -c command line option of qsub. This option takes a comma-separated list of arguments that control checkpointing behavior. The valid options available in the 2.4 version of TORQUE are shown below.
Option | Description |
---|---|
none | No checkpointing (not highly useful, but included for completeness). |
enabled | Specify that checkpointing is allowed, but must be explicitly invoked by either the qhold or qchkpt commands. |
shutdown | Specify that checkpointing is to be done on a job at pbs_mom shutdown. |
periodic | Specify that periodic checkpointing is enabled. The default interval is 10 minutes and can be changed by the $checkpoint_interval option in the MOM configuration file, or by specifying an interval when the job is submitted. |
interval=minutes | Specify the checkpoint interval in minutes. |
depth=number | Specify a number (depth) of checkpoint images to be kept in the checkpoint directory. |
dir=path | Specify a checkpoint directory (default is /var/spool/torque/checkpoint). |
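The options in the table are combined into a single comma-separated argument to -c. As a minimal sketch, the helper below assembles that argument from individual options and prints the resulting qsub command rather than running it. The function name join_ckpt_opts and the script name job.sh are hypothetical placeholders, not part of TORQUE, and the interval and depth values are illustrative only.

```shell
#!/bin/bash
# Hypothetical helper (not part of TORQUE): join checkpoint options
# into the comma-separated list expected by the -c option.
join_ckpt_opts() {
    local IFS=,
    printf '%s\n' "$*"
}

# Illustrative values: periodic checkpoints every 5 minutes, keeping 3 images.
ckpt_opts=$(join_ckpt_opts periodic interval=5 depth=3)

# Print the command instead of running it; job.sh is a placeholder script name.
echo "qsub -c $ckpt_opts job.sh"
```

Printing the command first makes it easy to check the option list before submitting a real job.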
Example 2-1: Sample test program
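#include <stdio.h>
#include <unistd.h>   /* sleep() */

int main(int argc, char *argv[])
{
    int i;

    /* Print a counter once per second for 100 seconds. */
    for (i = 0; i < 100; i++)
    {
        printf("i = %d\n", i);
        fflush(stdout);   /* flush so each line reaches the output file right away */
        sleep(1);
    }
    return 0;
}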
Example 2-2: Instructions for building test program
> gcc -o test test.c
Example 2-3: Sample test script
#!/bin/bash
./test
Example 2-4: Starting the test job
> qstat
> qsub -c enabled,periodic,shutdown,interval=1 test.sh
77.jakaa.cridomain
> qstat
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
77.jakaa test.sh jsmith 0 Q batch
>
If you have no scheduler running, you might need to start the job with qrun.
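A sketch of that manual start, using the job id returned by qsub in Example 2-4. The guard around qrun is an added assumption so the snippet exits cleanly on a host without TORQUE installed; qrun typically requires operator or manager privileges.

```shell
#!/bin/bash
# Job id as printed by qsub in Example 2-4.
job_id=77.jakaa.cridomain

# Only call qrun if it is available on this host.
if command -v qrun >/dev/null 2>&1; then
    qrun "$job_id"
else
    echo "qrun not found; is TORQUE installed on this host?"
fi
```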
As this program runs, it writes its output to a file in /var/spool/torque/spool. This file can be observed with the command tail -f.
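The following sketch locates and displays that file for the job from Example 2-4. It assumes the standard output of a job is spooled under the name of the job id with an .OU suffix; check the contents of the spool directory if your installation differs. The helper name spool_file is hypothetical.

```shell
#!/bin/bash
# Assumption: stdout for a job is spooled as <job id>.OU in the spool directory.
spool_dir=/var/spool/torque/spool

# Hypothetical helper: print the expected spool file path for a job id.
spool_file() {
    printf '%s/%s.OU\n' "$spool_dir" "$1"
}

out=$(spool_file 77.jakaa.cridomain)
if [ -r "$out" ]; then
    tail -n 5 "$out"    # use tail -f instead to follow the output live
else
    echo "no readable spool file at $out"
fi
```

The spool directory is usually readable only on the node where pbs_mom runs the job, so run this on that node.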