All these tests assume the following test program and shell script, test.sh.
#include <stdio.h>
#include <unistd.h>

int main( int argc, char *argv[] )
{
    int i;

    for (i = 0; i < 100; i++)
    {
        printf("i = %d\n", i);
        fflush(stdout);
        sleep(1);
    }
    return 0;
}
#!/bin/bash
/home/test/test
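The tests also assume the compiled program is installed where the script expects it. A minimal build step might look like the following (the source file name test.c and the install path /home/test are assumptions based on the script above):

> gcc -o /home/test/test test.c
> /home/test/test        # should print "i = 0", "i = 1", ... once per second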
This test determines if the proper environment has been established.
Submit a test job and then issue a hold on the job.
> qsub -c enabled test.sh
999.xxx.yyy
> qhold 999
Normally qhold produces no output. If an error message is produced saying that qhold is not a supported feature, then checkpoint/restart support has probably not been configured or installed correctly on the server or the execution node.
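Before digging into the Torque configuration, it can help to confirm that BLCR itself is present on the execution node. A hedged sketch (the module and command names are the standard BLCR ones; the mom_priv path assumes the /var/spool/torque layout used below):

> lsmod | grep blcr                                   # BLCR kernel modules should be loaded
> which cr_run cr_checkpoint cr_restart               # BLCR user commands should be on the path
> grep checkpoint /var/spool/torque/mom_priv/config   # a checkpoint script should be configured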
If no specific directory location was configured for the checkpoint files, the default location is under the Torque home directory, which in this case is /var/spool/torque/checkpoint.
Otherwise, go to the directory specified for the checkpoint image files. The location was set either by specifying an option on job submission, e.g. -c dir=/home/test, or by setting an attribute on the execution queue with the command qmgr -c 'set queue batch checkpoint_dir=/home/test'.
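For reference, the two ways of directing checkpoint images to /home/test might look like this (the dir= sub-option and the queue name batch are taken from the text above; adjust for your site):

> qsub -c enabled,dir=/home/test test.sh
> qmgr -c 'set queue batch checkpoint_dir=/home/test'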
Doing a directory listing shows the following.
# find /var/spool/torque/checkpoint
/var/spool/torque/checkpoint
/var/spool/torque/checkpoint/999.xxx.yyy.CK
/var/spool/torque/checkpoint/999.xxx.yyy.CK/ckpt.999.xxx.yyy.1205266630

# find /var/spool/torque/checkpoint | xargs ls -l
-r-------- 1 root root 543779 2008-03-11 14:17 /var/spool/torque/checkpoint/999.xxx.yyy.CK/ckpt.999.xxx.yyy.1205266630

/var/spool/torque/checkpoint:
total 4
drwxr-xr-x 2 root root 4096 2008-03-11 14:17 999.xxx.yyy.CK

/var/spool/torque/checkpoint/999.xxx.yyy.CK:
total 536
-r-------- 1 root root 543779 2008-03-11 14:17 ckpt.999.xxx.yyy.1205266630
Running qstat -f should show the job in a held state (job_state = H). Note that the attribute checkpoint_name is set to the name of the file seen above.
If a checkpoint directory has been specified, there will also be an attribute checkpoint_dir in the output of qstat -f.
$ qstat -f
Job Id: 999.xxx.yyy
    Job_Name = test.sh
    Job_Owner = test@xxx.yyy
    resources_used.cput = 00:00:00
    resources_used.mem = 0kb
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:06
    job_state = H
    queue = batch
    server = xxx.yyy
    Checkpoint = u
    ctime = Tue Mar 11 14:17:04 2008
    Error_Path = xxx.yyy:/home/test/test.sh.e999
    exec_host = test/0
    Hold_Types = u
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Tue Mar 11 14:17:10 2008
    Output_Path = xxx.yyy:/home/test/test.sh.o999
    Priority = 0
    qtime = Tue Mar 11 14:17:04 2008
    Rerunable = True
    Resource_List.neednodes = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 01:00:00
    session_id = 9402
    substate = 20
    Variable_List = PBS_O_HOME=/home/test,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=test,
        PBS_O_PATH=/usr/local/perltests/bin:/home/test/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games,
        PBS_O_SHELL=/bin/bash,PBS_SERVER=xxx.yyy,
        PBS_O_HOST=xxx.yyy,PBS_O_WORKDIR=/home/test,
        PBS_O_QUEUE=batch
    euser = test
    egroup = test
    hashname = 999.xxx.yyy
    queue_rank = 3
    queue_type = E
    comment = Job started on Tue Mar 11 at 14:17
    exit_status = 271
    submit_args = test.sh
    start_time = Tue Mar 11 14:17:04 2008
    start_count = 1
    checkpoint_dir = /var/spool/torque/checkpoint/999.xxx.yyy.CK
    checkpoint_name = ckpt.999.xxx.yyy.1205266630
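When only the checkpoint-related attributes are of interest, the qstat -f output can be filtered; a small sketch using the job id from this example:

> qstat -f 999 | grep -E 'job_state|checkpoint_dir|checkpoint_name'

The lines returned should match the job_state, checkpoint_dir, and checkpoint_name values shown above.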
This test determines if the checkpoint files remain in the default directory after the job is removed from the Torque queue.
Note that this behavior was requested by a customer, but it may not be the right thing to do, as it leaves the checkpoint files on the execution node. These will gradually build up over time on the node, limited only by disk space. It would seem better for the checkpoint files to be copied to the user's home directory after the job is purged from the execution node.
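Until such copy-back behavior exists, the images can be preserved by hand; a hedged sketch that copies the job's checkpoint directory into the owner's home directory (the paths, job id, and user name are the ones used in these tests):

> cp -a /var/spool/torque/checkpoint/999.xxx.yyy.CK /home/test/
> chown -R test:test /home/test/999.xxx.yyy.CK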
Assuming the steps of Test 1, delete the job and then wait until the job leaves the queue after the completed job hold time. Then look at the contents of the default checkpoint directory to see if the files are still there.
> qsub -c enabled test.sh
999.xxx.yyy
> qhold 999
> qdel 999
> sleep 100
> qstat
>
> find /var/spool/torque/checkpoint
... files ...
If the files are not there, check whether Test 1 actually passed. If the files are still there, the test has passed.
This test determines if the job can be restarted after a checkpoint hold.
Assuming the steps of Test 1, issue a qrls command. Have another window open into the /var/spool/torque/spool directory and tail the job output file.
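A sketch of the tail command (Torque normally spools the job's stdout in its spool directory as <job id>.OU until the job completes; the exact file name may differ on your installation):

> tail -f /var/spool/torque/spool/999.xxx.yyy.OU

After the qrls, the counting output should resume.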
This test determines if the checkpoint/restart cycle can be repeated multiple times.
Start a job and then, while tailing the job output, do multiple qhold/qrls operations.
> qsub -c enabled test.sh
999.xxx.yyy
> qhold 999
> qrls 999
> qhold 999
> qrls 999
> qhold 999
> qrls 999
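The same cycle can be scripted; a small sketch assuming job id 999 and a delay long enough for each checkpoint and restart to complete:

#!/bin/bash
# Repeatedly hold (checkpoint) and release (restart) job 999.
for n in 1 2 3; do
    qhold 999
    sleep 10    # give the checkpoint time to complete
    qrls 999
    sleep 10    # let the job run before the next hold
done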
This test determines if automatic periodic checkpoint will work.
Start the job with the option -c enabled,periodic,interval=1 and look in the checkpoint directory for checkpoint images to be generated about every minute.
> qsub -c enabled,periodic,interval=1 test.sh
999.xxx.yyy
The checkpoint directory should contain multiple checkpoint images, and the times on the files should be roughly a minute apart.
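One way to watch the images accumulate, using the .CK directory name seen in Test 1 (the directory name follows the job id):

> watch -n 30 'ls -l /var/spool/torque/checkpoint/999.xxx.yyy.CK'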
This test determines if the job can be restarted from a previous checkpoint image.
Start the job with the option -c enabled,periodic,interval=1 and look in the checkpoint directory for checkpoint images to be generated about every minute. Do a qhold on the job to stop it. Change the attribute checkpoint_name with the qalter command. Then do a qrls to restart the job.
> qsub -c enabled,periodic,interval=1 test.sh
999.xxx.yyy
> qhold 999
> qalter -W checkpoint_name=ckpt.999.xxx.yyy.1234567 999
> qrls 999
The job output file should be truncated back and the count should resume at an earlier number.
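One rough way to confirm the rollback is to check the spooled output around the qrls step (a sketch; the spool file name is assumed as in the earlier tail example):

> wc -l /var/spool/torque/spool/999.xxx.yyy.OU    # note the line count just before qrls
> qrls 999
> sleep 5
> wc -l /var/spool/torque/spool/999.xxx.yyy.OU    # a smaller count indicates the file was truncated back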