TORQUE Resource Manager
2.6 Job Checkpoint and Restart

While TORQUE has had a job checkpoint and restart capability for many years, that capability was tied to machine-specific features. TORQUE now supports BLCR, an architecture-independent package that provides process checkpoint and restart.

Note: BLCR support is only for serial jobs, not for MPI jobs.

Introduction to BLCR

BLCR is a kernel level package. It must be downloaded and installed from BLCR.

After the package is built, its modules must be loaded into the kernel with commands such as the following. The modules can also be listed in the file /etc/modules, but all of the testing was done with explicit invocations of modprobe.

Installing BLCR into the kernel:
# /sbin/insmod /usr/local/lib/blcr/2.6.12-1.234/blcr_imports.ko
# /sbin/insmod /usr/local/lib/blcr/2.6.12-1.234/blcr_vmadump.ko
# /sbin/insmod /usr/local/lib/blcr/2.6.12-1.234/blcr.ko
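Before going further it can help to confirm that all three modules are actually loaded. The following is a minimal sketch, not part of BLCR or TORQUE: the function name blcr_missing_modules is hypothetical, and it assumes the loaded module names match the .ko file names shown above.

```shell
#!/bin/bash
# Hypothetical helper: print each BLCR module that is absent from an
# lsmod-style listing read on stdin (module name in the first column).
blcr_missing_modules() {
    local listing mod
    listing=$(cat)
    for mod in blcr_imports blcr_vmadump blcr; do
        # awk exits 0 only if some line has the module name as its first field.
        if ! printf '%s\n' "$listing" |
             awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }'; then
            printf '%s\n' "$mod"
        fi
    done
}

# Typical use on a live system (not executed here):
#   lsmod | blcr_missing_modules
```

An empty result means every module was found; any printed name still needs to be loaded.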

The BLCR system provides four command line utilities: (1) cr_checkpoint, (2) cr_info, (3) cr_restart, and (4) cr_run.

For more information about BLCR, see the BLCR Administrator's Guide.

Configuration files and scripts

Configuring and Building TORQUE for BLCR:

> ./configure --enable-unixsockets=no --enable-blcr
> make
> sudo make install

Depending on where BLCR is installed you may also need to use the following configure options to specify BLCR paths:

  • --with-blcr-include=DIR - Include path for libcr.h.
  • --with-blcr-lib=DIR - Library path for libcr.
  • --with-blcr-bin=DIR - Binary path for the BLCR utilities.

The pbs_mom configuration file located in /var/spool/torque/mom_priv must be modified to identify the script names associated with invoking the BLCR commands. The following variables should be used in the configuration file when using BLCR checkpointing.

  • $checkpoint_interval - How often periodic job checkpoints will be taken (minutes).
  • $checkpoint_script - The name of the script file to execute to perform a job checkpoint.
  • $restart_script - The name of the script file to execute to perform a job restart.
  • $checkpoint_run_exe - The name of an executable program to be run when starting a checkpointable job (for BLCR, cr_run).

The following example shows the contents of the configuration file used for testing the BLCR feature in TORQUE.

Note: The script files below must be executable by the user. Use chmod to set their permissions to 754.

Script file permissions:
# chmod 754 blcr*
# ls -l
total 20
-rwxr-xr-- 1 root root 2112 2008-03-11 13:14 blcr_checkpoint_script
-rwxr-xr-- 1 root root 1987 2008-03-11 13:14 blcr_restart_script
-rw-r--r-- 1 root root  215 2008-03-11 13:13 config
drwxr-x--x 2 root root 4096 2008-03-11 13:21 jobs
-rw-r--r-- 1 root root    7 2008-03-11 13:15 mom.lock

mom_priv/config:
$checkpoint_script  /var/spool/torque/mom_priv/blcr_checkpoint_script
$restart_script  /var/spool/torque/mom_priv/blcr_restart_script
$checkpoint_run_exe /usr/local/bin/cr_run
$pbsserver makua.cridomain
$loglevel 7
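
A configuration file like the one above can be sanity-checked before restarting pbs_mom. The sketch below is illustrative only: the function name missing_blcr_directives is hypothetical, and it checks only that the three script and executable directives described earlier appear with a value.

```shell
#!/bin/bash
# Hypothetical helper: print each BLCR-related directive that is missing
# (or has no value) in the pbs_mom config file given as the first argument.
missing_blcr_directives() {
    local config_file=$1 directive
    for directive in '$checkpoint_script' '$restart_script' '$checkpoint_run_exe'; do
        # Compare the first field literally and require a second field (the value).
        if ! awk -v d="$directive" '$1 == d && NF >= 2 { ok = 1 } END { exit !ok }' \
                "$config_file"; then
            printf '%s\n' "$directive"
        fi
    done
}

# Typical use (not executed here):
#   missing_blcr_directives /var/spool/torque/mom_priv/config
```

The check does not verify that the named scripts exist or are executable; that is covered by the permissions listing above.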

mom_priv/blcr_checkpoint_script:
#! /usr/bin/perl
################################################################################
#
# Usage: checkpoint_script <sessionId> <jobId> <userId> <checkpointDir> <checkpointName> <signalNum> <depth>
#
# This script is invoked by pbs_mom to checkpoint a job.
#
################################################################################
use strict;
use Sys::Syslog;

# Log levels:
# 0 = none -- no logging
# 1 = fail -- log only failures
# 2 = info -- log invocations
# 3 = debug -- log all subcommands
my $logLevel = 3;

logPrint(2, "Invoked: $0 " . join(' ', @ARGV) . "\n");

my ($sessionId, $jobId, $userId, $checkpointDir, $checkpointName, $signalNum, $depth);
my $usage =
  "Usage: $0 <sessionId> <jobId> <userId> <checkpointDir> <checkpointName> <signalNum> <depth>\n";

# Note that depth is not used in this script but could control a limit to the number of checkpoint
# image files that are preserved on the disk.
#
# Note also that a request was made to identify whether this script was invoked by the job's
# owner or by a system administrator.  While this information is known to pbs_server, it
# is not propagated to pbs_mom and thus it is not possible to pass this to the script.
# Therefore, a workaround is to invoke qmgr and attempt to set a trivial variable.
# This will fail if the invoker is not a manager.

if (@ARGV == 7)
{
    ($sessionId, $jobId, $userId, $checkpointDir, $checkpointName, $signalNum, $depth) =
      @ARGV;
}
else { logDie(1, $usage); }

# Change to the checkpoint directory where we want the checkpoint to be created
chdir $checkpointDir
  or logDie(1, "Unable to cd to checkpoint dir ($checkpointDir): $!\n");

my $cmd = "cr_checkpoint";
$cmd .= " --signal $signalNum" if $signalNum;
$cmd .= " --tree $sessionId";
$cmd .= " --file $checkpointName";
my $output = `$cmd 2>&1`;
my $rc     = $? >> 8;
# Always stop on failure; logPrint itself filters messages by $logLevel.
logDie(1, "Subcommand ($cmd) failed with rc=$rc:\n$output")
  if $rc;
logPrint(3, "Subcommand ($cmd) yielded rc=$rc:\n$output");
exit 0;

################################################################################
# logPrint($level, $message)
# Write a message to syslog if $level does not exceed $logLevel
################################################################################
sub logPrint
{
    my ($level, $message) = @_;
    my @severity = ('none', 'warning', 'info', 'debug');

    return if $level > $logLevel;

    openlog('checkpoint_script', '', 'user');
    # Pass the message as an argument so any '%' in it is not treated as a format.
    syslog($severity[$level], '%s', $message);
    closelog();
}

################################################################################
# logDie($level, $message)
# Write a message (to syslog) and die
################################################################################
sub logDie
{
    my ($level, $message) = @_;

    logPrint($level, $message);
    die($message);
}

mom_priv/blcr_restart_script:
#! /usr/bin/perl
################################################################################
#
# Usage: restart_script <sessionId> <jobId> <userId> <checkpointDir> <restartName>
#
# This script is invoked by pbs_mom to restart a job.
#
################################################################################
use strict;
use Sys::Syslog;

# Log levels:
# 0 = none -- no logging
# 1 = fail -- log only failures
# 2 = info -- log invocations
# 3 = debug -- log all subcommands
my $logLevel = 3;

logPrint(2, "Invoked: $0 " . join(' ', @ARGV) . "\n");

my ($sessionId, $jobId, $userId, $checkpointDir, $restartName);
my $usage =
  "Usage: $0 <sessionId> <jobId> <userId> <checkpointDir> <restartName>\n";
if (@ARGV == 5)
{
    ($sessionId, $jobId, $userId, $checkpointDir, $restartName) =
      @ARGV;
}
else { logDie(1, $usage); }

# Change to the checkpoint directory that holds the checkpoint image
chdir $checkpointDir
  or logDie(1, "Unable to cd to checkpoint dir ($checkpointDir): $!\n");


my $cmd = "cr_restart";
$cmd .= " $restartName";
my $output = `$cmd 2>&1`;
my $rc     = $? >> 8;
# Always stop on failure; logPrint itself filters messages by $logLevel.
logDie(1, "Subcommand ($cmd) failed with rc=$rc:\n$output")
  if $rc;
logPrint(3, "Subcommand ($cmd) yielded rc=$rc:\n$output");
exit 0;

################################################################################
# logPrint($level, $message)
# Write a message to syslog if $level does not exceed $logLevel
################################################################################
sub logPrint
{
    my ($level, $message) = @_;
    my @severity = ('none', 'warning', 'info', 'debug');

    return if $level > $logLevel;

    openlog('restart_script', '', 'user');
    # Pass the message as an argument so any '%' in it is not treated as a format.
    syslog($severity[$level], '%s', $message);
    closelog();
}

################################################################################
# logDie($level, $message)
# Write a message (to syslog) and die
################################################################################
sub logDie
{
    my ($level, $message) = @_;

    logPrint($level, $message);
    die($message);
}

Starting a checkpointable job

Not every job is checkpointable. A job for which checkpointing is desirable must be started with the -c command line option. This option takes a comma-separated list of arguments that control checkpointing behavior. The valid options available in the 2.4 version of TORQUE are shown below.

  • none - No checkpointing (not highly useful, but included for completeness).
  • enabled - Specify that checkpointing is allowed, but must be explicitly invoked by either the qhold or qchkpt commands.
  • shutdown - Specify that checkpointing is to be done on a job at pbs_mom shutdown.
  • periodic - Specify that periodic checkpointing is enabled. The default interval is 10 minutes and can be changed by the $checkpoint_interval option in the MOM configuration file, or by specifying an interval when the job is submitted.
  • interval=minutes - Specify the checkpoint interval in minutes.
  • depth=number - Specify a number (depth) of checkpoint images to be kept in the checkpoint directory.
  • dir=path - Specify a checkpoint directory (default is /var/spool/torque/checkpoint).
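
A -c argument can be checked against the list above before it is handed to qsub. This is a minimal sketch under stated assumptions: validate_checkpoint_opts is a hypothetical name, the numeric check for interval and depth and the absolute-path check for dir are assumptions of this sketch, and TORQUE's own parsing may accept or reject other forms.

```shell
#!/bin/bash
# Hypothetical helper: return 0 if every comma-separated token in $1 is one of
# the options listed above, otherwise print a message to stderr and return 1.
validate_checkpoint_opts() {
    local opts=$1 token
    local IFS=','
    for token in $opts; do
        case $token in
            none|enabled|shutdown|periodic) ;;
            interval=*|depth=*)
                # Assumption: the value must be a non-empty string of digits.
                case ${token#*=} in
                    ''|*[!0-9]*) echo "bad numeric value: $token" >&2; return 1 ;;
                esac ;;
            dir=/*) ;;   # Assumption: require an absolute path.
            *) echo "unknown option: $token" >&2; return 1 ;;
        esac
    done
}

# Typical use (not executed here):
#   opts=enabled,periodic,interval=5
#   validate_checkpoint_opts "$opts" && qsub -c "$opts" test.sh
```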
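
Sample test program:
#include <stdio.h>
#include <unistd.h>   /* sleep() */

int main(int argc, char *argv[])
{
        int i;

        /* Print a counter once per second so progress shows in the job output. */
        for (i = 0; i < 100; i++)
        {
                printf("i = %d\n", i);
                fflush(stdout);
                sleep(1);
        }
        return 0;
}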

Instructions for building test program:
>  gcc -o test test.c

Sample test script:
#!/bin/bash

./test

Starting the test job:
>  qstat
>  qsub -c enabled,periodic,shutdown,interval=1 test.sh
77.jakaa.cridomain
>  qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
77.jakaa                  test.sh          jsmith                 0 Q batch          
>  

If you have no scheduler running, you might need to start the job with qrun.

As this program runs, it writes its output to a file in /var/spool/torque/spool. This file can be observed with the command tail -f.

Checkpointing a job

Jobs are checkpointed by issuing a qhold command. This causes an image file representing the state of the process to be written to disk. The directory by default is /var/spool/torque/checkpoint.

This default can be altered at the queue level with the qmgr command. For example, the command qmgr -c "set queue batch checkpoint_dir=/tmp" changes the checkpoint directory to /tmp for the queue batch.

The default directory can also be altered at job submission time with the -c dir=/tmp command line option.

The name of the checkpoint directory and the name of the checkpoint image file become attributes of the job and can be observed with the command qstat -f. Notice in the output the names checkpoint_dir and checkpoint_name. The variable checkpoint_name is set when the image file is created and will not exist if no checkpoint has been taken.
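
The two attributes can be pulled out of the qstat -f output with a small filter. The sketch below assumes the usual "name = value" layout of that output and does not handle values that wrap onto continuation lines; job_checkpoint_attr is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: read qstat -f style output on stdin and print the value
# of the attribute named in $1 (for example checkpoint_dir or checkpoint_name).
# Prints nothing if the attribute is absent, e.g. before any checkpoint exists.
job_checkpoint_attr() {
    local attr=$1
    awk -v a="$attr" '
        $1 == a && $2 == "=" {
            sub(/^[^=]*= */, "")   # drop everything up to and including "= "
            print
            exit
        }'
}

# Typical use (not executed here; 77 is the job id from the example above):
#   qstat -f 77 | job_checkpoint_attr checkpoint_dir
#   qstat -f 77 | job_checkpoint_attr checkpoint_name
```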

A job can also be checkpointed without stopping or holding the job with the command qchkpt.

Restarting a job in the Held state

The qrls command is used to restart the hibernated job. If you were using the tail -f command to watch the output file, you will see the test program start counting again.

It is possible to use the qalter command to change the name of the checkpoint file associated with a job. This could be useful if a job had several checkpoints and you wanted to restart it from an older image.

Restarting a job in the Completed state

In this case, the job must be moved to the Queued state with the qrerun command. The job must then go to the Run state, either by action of the scheduler or, if there is no scheduler, through the qrun command.
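
The two restart paths can be summarized in a small helper that prints the commands to run for a given job state. This is a sketch only: restart_steps is a hypothetical name, and it assumes the single-letter codes H (Held) and C (Completed) from the S column of qstat. It prints commands rather than running them.

```shell
#!/bin/bash
# Hypothetical helper: given a qstat state letter and a job id, print the
# command(s) described above for restarting a checkpointed job.
restart_steps() {
    local state=$1 job_id=$2
    case $state in
        H)  # Held: release the hold and the job restarts from its checkpoint.
            echo "qrls $job_id" ;;
        C)  # Completed: requeue first; qrun is needed only without a scheduler.
            echo "qrerun $job_id"
            echo "qrun $job_id" ;;
        *)  echo "job $job_id is in state $state; nothing to restart" >&2
            return 1 ;;
    esac
}

# Typical use (not executed here):
#   restart_steps H 77
```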

Acceptance tests

A number of tests were made to verify the functioning of the BLCR implementation. See tests-2.4 for a description of the testing.