While TORQUE has had a job checkpoint and restart capability for many years, this was tied to machine-specific features. TORQUE now supports BLCR, an architecture-independent package that provides process checkpoint and restart.
BLCR support covers serial jobs only, not MPI jobs of any type.
BLCR is a kernel level package. It must be downloaded and installed from BLCR.
After the package is built, its modules must be loaded into the kernel with commands such as the following. The modules can also be listed in the file /etc/modules, but all of the testing was done with explicit invocations of modprobe.
Installing BLCR into the kernel:
# /sbin/insmod /usr/local/lib/blcr/2.6.12-1.234/blcr_imports.ko
# /sbin/insmod /usr/local/lib/blcr/2.6.12-1.234/blcr_vmadump.ko
# /sbin/insmod /usr/local/lib/blcr/2.6.12-1.234/blcr.ko
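One way to confirm that all three modules are loaded is to look for their names in /proc/modules, which holds the same data lsmod prints. The following is a minimal sketch, not part of TORQUE or BLCR: the function name missing_blcr_modules is hypothetical, and the module names are assumed to match the .ko file names shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each BLCR module that is NOT present
# in a module listing given in /proc/modules format (module name is field 1).
# Assumption: loaded module names match the .ko file names used above.
missing_blcr_modules() {
    listing=$1
    for mod in blcr_imports blcr_vmadump blcr; do
        # Compare the whole first field so "blcr" does not match "blcr_imports".
        if ! printf '%s\n' "$listing" |
            awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }'; then
            printf '%s\n' "$mod"
        fi
    done
}

# Usage on a live system (assumes /proc/modules is readable):
#   missing_blcr_modules "$(cat /proc/modules)"
```

An empty result means every module was found; any printed name still needs to be loaded.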
The BLCR system provides four command line utilities: (1) cr_checkpoint, (2) cr_info, (3) cr_restart, and (4) cr_run.
For more information about BLCR, see the BLCR Administrator's Guide.
Configuring and Building TORQUE for BLCR:
> ./configure --enable-unixsockets=no --enable-blcr
> make
> sudo make install
Depending on where BLCR is installed you may also need to use the following configure options to specify BLCR paths:
The pbs_mom configuration file located in /var/spool/torque/mom_priv must be modified to identify the script names associated with invoking the BLCR commands. The following variables should be used in the configuration file when using BLCR checkpointing.
The following example shows the contents of the configuration file used for testing the BLCR feature in TORQUE.
The script files below must be executable by the user. Be sure to use chmod to set the permissions to 754.
# chmod 754 blcr*
# ls -l
total 20
-rwxr-xr-- 1 root root 2112 2008-03-11 13:14 blcr_checkpoint_script
-rwxr-xr-- 1 root root 1987 2008-03-11 13:14 blcr_restart_script
-rw-r--r-- 1 root root  215 2008-03-11 13:13 config
drwxr-x--x 2 root root 4096 2008-03-11 13:21 jobs
-rw-r--r-- 1 root root    7 2008-03-11 13:15 mom.lock
$checkpoint_script /var/spool/torque/mom_priv/blcr_checkpoint_script
$restart_script /var/spool/torque/mom_priv/blcr_restart_script
$checkpoint_run_exe /usr/local/bin/cr_run
$pbsserver makua.cridomain
$loglevel 7
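Because pbs_mom depends on the script paths named in this configuration file, it can be worth checking that both directives are present and point at executable files before restarting the daemon. The sketch below assumes only the "$name value" layout visible in the example configuration; the helper names mom_config_value and check_blcr_scripts are hypothetical and are not TORQUE commands.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a "$name value" directive from a
# pbs_mom config file. The first matching line wins.
mom_config_value() {
    name=$1
    file=$2
    # key becomes the literal text "$checkpoint_script", "$loglevel", etc.
    awk -v key="\$$name" '$1 == key { print $2; exit }' "$file"
}

# Hypothetical check: both BLCR script directives must be set and must name
# files that the current user can execute. Returns non-zero on the first problem.
check_blcr_scripts() {
    file=$1
    for name in checkpoint_script restart_script; do
        path=$(mom_config_value "$name" "$file")
        if [ -z "$path" ]; then
            echo "missing directive: \$$name" >&2
            return 1
        fi
        if [ ! -x "$path" ]; then
            echo "not executable: $path" >&2
            return 1
        fi
    done
}

# Usage (path taken from the text above):
#   check_blcr_scripts /var/spool/torque/mom_priv/config
```

The check runs with the permissions of whoever invokes it, so running it as the job owner gives the most relevant answer.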
#! /usr/bin/perl
################################################################################
#
# Usage: checkpoint_script
#
# This script is invoked by pbs_mom to checkpoint a job.
#
################################################################################
use strict;
use Sys::Syslog;

# Log levels:
# 0 = none -- no logging
# 1 = fail -- log only failures
# 2 = info -- log invocations
# 3 = debug -- log all subcommands
my $logLevel = 3;

logPrint(2, "Invoked: $0 " . join(' ', @ARGV) . "\n");

my ($sessionId, $jobId, $userId, $signalNum, $checkpointDir, $checkpointName, $depth);
my $usage = "Usage: $0 \n";

# Expected arguments, in order: session ID, job ID, user ID, checkpoint
# directory, checkpoint name, signal number, depth.
#
# Note that depth is not used in this script but could control a limit to the
# number of checkpoint image files that are preserved on the disk.
#
# Note also that a request was made to identify whether this script was invoked
# by the job's owner or by a system administrator. While this information is
# known to pbs_server, it is not propagated to pbs_mom and thus it is not
# possible to pass this to the script. Therefore, a workaround is to invoke
# qmgr and attempt to set a trivial variable. This will fail if the invoker is
# not a manager.
if (@ARGV == 7)
{
    ($sessionId, $jobId, $userId, $checkpointDir, $checkpointName, $signalNum,
        $depth) = @ARGV;
}
else
{
    logDie(1, $usage);
}

# Change to the checkpoint directory where we want the checkpoint to be created.
# The chdir must happen regardless of the log level.
chdir $checkpointDir
    or logDie(1, "Unable to cd to checkpoint dir ($checkpointDir): $!\n");

my $cmd = "cr_checkpoint";
$cmd .= " --signal $signalNum" if $signalNum;
$cmd .= " --tree $sessionId";
$cmd .= " --file $checkpointName";
my $output = `$cmd 2>&1`;
my $rc = $? >> 8;

# Fail on a non-zero exit status even when logging is disabled.
logDie(1, "Subcommand ($cmd) failed with rc=$rc:\n$output") if $rc;
logPrint(3, "Subcommand ($cmd) yielded rc=$rc:\n$output") if $logLevel >= 3;
exit 0;

################################################################################
# logPrint($level, $message)
# Write a message (to syslog) if $level does not exceed $logLevel
################################################################################
sub logPrint
{
    my ($level, $message) = @_;
    my @severity = ('none', 'warning', 'info', 'debug');

    return if $level > $logLevel;

    openlog('checkpoint_script', '', 'user');
    syslog($severity[$level], $message);
    closelog();
}

################################################################################
# logDie($level, $message)
# Write a message (to syslog) and die
################################################################################
sub logDie
{
    my ($level, $message) = @_;
    logPrint($level, $message);
    die($message);
}
#! /usr/bin/perl
################################################################################
#
# Usage: restart_script
#
# This script is invoked by pbs_mom to restart a job.
#
################################################################################
use strict;
use Sys::Syslog;

# Log levels:
# 0 = none -- no logging
# 1 = fail -- log only failures
# 2 = info -- log invocations
# 3 = debug -- log all subcommands
my $logLevel = 3;

logPrint(2, "Invoked: $0 " . join(' ', @ARGV) . "\n");

my ($sessionId, $jobId, $userId, $checkpointDir, $restartName);
my $usage = "Usage: $0 \n";

# Expected arguments, in order: session ID, job ID, user ID, checkpoint
# directory, restart (image) name.
if (@ARGV == 5)
{
    ($sessionId, $jobId, $userId, $checkpointDir, $restartName) = @ARGV;
}
else
{
    logDie(1, $usage);
}

# Change to the checkpoint directory that holds the checkpoint image.
# The chdir must happen regardless of the log level.
chdir $checkpointDir
    or logDie(1, "Unable to cd to checkpoint dir ($checkpointDir): $!\n");

my $cmd = "cr_restart";
$cmd .= " $restartName";
my $output = `$cmd 2>&1`;
my $rc = $? >> 8;

# Fail on a non-zero exit status even when logging is disabled.
logDie(1, "Subcommand ($cmd) failed with rc=$rc:\n$output") if $rc;
logPrint(3, "Subcommand ($cmd) yielded rc=$rc:\n$output") if $logLevel >= 3;
exit 0;

################################################################################
# logPrint($level, $message)
# Write a message (to syslog) if $level does not exceed $logLevel
################################################################################
sub logPrint
{
    my ($level, $message) = @_;
    my @severity = ('none', 'warning', 'info', 'debug');

    return if $level > $logLevel;

    openlog('restart_script', '', 'user');
    syslog($severity[$level], $message);
    closelog();
}

################################################################################
# logDie($level, $message)
# Write a message (to syslog) and die
################################################################################
sub logDie
{
    my ($level, $message) = @_;
    logPrint($level, $message);
    die($message);
}
Not every job is checkpointable. A job for which checkpointing is desirable must be started with the -c command line option. This option takes a comma-separated list of arguments that control checkpointing behavior. The list of valid options available in the 2.4 version of TORQUE is shown below.
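#include <stdio.h>
#include <unistd.h>  /* sleep() */

int main(int argc, char *argv[])
{
    int i;

    /* Print one line per second so checkpoint and restart are easy to observe. */
    for (i = 0; i < 100; i++) {
        printf("i = %d\n", i);
        fflush(stdout);
        sleep(1);
    }
    return 0;
}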
> gcc -o test test.c
#!/bin/bash
./test
> qstat
> qsub -c enabled,periodic,shutdown,interval=1 test.sh
77.jakaa.cridomain
> qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
77.jakaa                  test.sh          jsmith                 0 Q batch
>
If you have no scheduler running, you might need to start the job with qrun.
As this program runs, it writes its output to a file in /var/spool/torque/spool. This file can be observed with the command tail -f.
Jobs are checkpointed by issuing a qhold command. This causes an image file representing the state of the process to be written to disk. The directory by default is /var/spool/torque/checkpoint.
This default can be altered at the queue level with the qmgr command. For example, the command qmgr -c 'set queue batch checkpoint_dir=/tmp' would change the checkpoint directory to /tmp for the queue 'batch'. The quotes keep the shell from splitting the directive into separate arguments.
The default directory can also be altered at job submission time with the -c dir=/tmp command line option.
The name of the checkpoint directory and the name of the checkpoint image file become attributes of the job and can be observed with the command qstat -f. Notice in the output the names checkpoint_dir and checkpoint_name. The variable checkpoint_name is set when the image file is created and will not exist if no checkpoint has been taken.
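To pull just these two attributes out of the full job listing, the qstat -f output can be filtered. The sketch below assumes each attribute appears on its own line in the form "name = value" and does not handle values that wrap onto continuation lines; the function name job_attr and the job ID 77 in the usage comment are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: read "qstat -f" style text on stdin and print the value
# of one attribute. Assumption: the attribute is on a single "name = value"
# line; values wrapped onto continuation lines are returned truncated.
job_attr() {
    name=$1
    sed -n "s/^[[:space:]]*$name = //p" | head -n 1
}

# Usage on a live system (job ID is illustrative):
#   qstat -f 77 | job_attr checkpoint_dir
#   qstat -f 77 | job_attr checkpoint_name
```

An empty result for checkpoint_name is expected until the first checkpoint has been taken.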
A job can also be checkpointed without stopping or holding the job with the command qchkpt.
The qrls command is used to restart the hibernated job. If you were using the tail -f command to watch the output file, you will see the test program start counting again.
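The hold and release steps can be tied together in a small wrapper. This is a sketch only: it uses the qhold and qrls commands described above with a job ID as their sole argument, while the function names run and hold_then_release and the DRY_RUN variable are hypothetical conveniences for previewing the commands without a TORQUE installation.

```shell
#!/bin/sh
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Checkpoint and hold a job, then release it so it restarts from the image.
# Stops before the release if the hold step fails.
hold_then_release() {
    job=$1
    run qhold "$job" || return 1
    run qrls "$job"
}

# Preview without touching the batch system (job ID is illustrative):
#   DRY_RUN=1 hold_then_release 77
```

In practice some work would happen between the two steps, such as node maintenance; the immediate release here only demonstrates the order of the commands.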
It is possible to use the qalter command to change the name of the checkpoint file associated with a job. This could be useful if a job has several checkpoint images and it needs to be restarted from an older one.
In this case, the job must first be moved to the Queued state with the qrerun command. The job must then go to the Run state, either by action of the scheduler or, if there is no scheduler, by using the qrun command.
A number of tests were made to verify the functioning of the BLCR implementation. See tests-2.4 for a description of the testing.