Not every job is checkpointable. A job for which checkpointing is desirable must be started with the -c command line option. This option takes a comma-separated list of arguments that control checkpointing behavior. The valid options available in the 2.4 version of TORQUE are shown below.
Option | Description |
---|---|
none | No checkpointing (not highly useful, but included for completeness). |
enabled | Specify that checkpointing is allowed, but must be explicitly invoked by either the qhold or qchkpt commands. |
shutdown | Specify that checkpointing is to be done on a job at pbs_mom shutdown. |
periodic | Specify that periodic checkpointing is enabled. The default interval is 10 minutes and can be changed by the $checkpoint_interval option in the MOM configuration file, or by specifying an interval when the job is submitted. |
interval=minutes | Specify the checkpoint interval in minutes. |
depth=number | Specify a number (depth) of checkpoint images to be kept in the checkpoint directory. |
dir=path | Specify a checkpoint directory (default is /var/spool/torque/checkpoint). |
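As a rough illustration of how the comma-separated list is put together, the sketch below joins individual options into a single -c argument and prints the resulting command instead of submitting it. The helper name build_ckpt_opts is hypothetical and not part of TORQUE; the option values are arbitrary examples drawn from the table above.

```shell
# Hypothetical helper (not part of TORQUE): join its arguments with commas
# so they can be passed as one value to the -c option.
# The function body runs in a subshell so the IFS change does not leak.
build_ckpt_opts() (
    IFS=,
    printf '%s\n' "$*"
)

# Example values only: periodic checkpoints every 5 minutes, keep 3 images.
opts=$(build_ckpt_opts periodic interval=5 depth=3)

# Print the command rather than running it, so this works without TORQUE.
echo "qsub -c $opts test.sh"
```

Printing the command first makes it easy to check the option string before an actual submission.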
Example 2-1: Sample test program
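    #include <stdio.h>
    #include <unistd.h>   /* sleep() */

    int main(int argc, char *argv[])
    {
        int i;

        /* Print a counter once per second for 100 seconds. */
        for (i = 0; i < 100; i++) {
            printf("i = %d\n", i);
            fflush(stdout);   /* flush so the output appears immediately */
            sleep(1);
        }
        return 0;
    }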
Example 2-2: Instructions for building test program
    > gcc -o test test.c
Example 2-3: Sample test script
    #!/bin/bash
    ./test
Example 2-4: Starting the test job
    > qstat
    > qsub -c enabled,periodic,shutdown,interval=1 test.sh
    77.jakaa.cridomain
    > qstat
    Job id                    Name             User            Time Use S Queue
    ------------------------- ---------------- --------------- -------- - -----
    77.jakaa                  test.sh          jsmith                 0 Q batch
    >
If you have no scheduler running, you might need to start the job with qrun.
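A minimal sketch of that manual start, assuming that qstat -f prints a job_state line for the job and that the caller has the manager or operator privilege that qrun normally requires. The function name run_if_queued is hypothetical.

```shell
# Hypothetical helper: call qrun only if the job is still queued.
run_if_queued() {
    jobid=$1
    # qstat -f prints a line such as "    job_state = Q"
    state=$(qstat -f "$jobid" | awk -F' = ' '/job_state/ {print $2}')
    if [ "$state" = "Q" ]; then
        qrun "$jobid"
    else
        echo "job $jobid is in state '$state'; not calling qrun"
    fi
}
```

With the job from Example 2-4, the call would be run_if_queued 77.jakaa.cridomain.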
As this program runs, it writes its output to a file in /var/spool/torque/spool. This file can be observed with the command tail -f.
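The sketch below builds the path that might be passed to tail -f. It assumes the spooled standard output file is named after the full job ID with an .OU suffix, which should be verified on the execution host. The function spool_file and the TORQUE_SPOOL variable are hypothetical names used only for illustration.

```shell
# Hypothetical helper: guess the spooled stdout path for a job.
# Assumption: the file is named <jobid>.OU inside the spool directory.
spool_file() {
    echo "${TORQUE_SPOOL:-/var/spool/torque/spool}/$1.OU"
}

# Print the command to run on the node where the job is executing.
echo "tail -f $(spool_file 77.jakaa.cridomain)"
```

If the guessed file does not exist, listing the spool directory shows the actual file names.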
© 2012 Adaptive Computing