Configuring and Building TORQUE for BLCR:
> ./configure --enable-unixsockets=no --enable-blcr
> make
> sudo make install
Depending on where BLCR is installed, you may also need the following configure options to specify the BLCR paths:
Option | Description |
---|---|
--with-blcr-include=DIR | include path for libcr.h |
--with-blcr-lib=DIR | lib path for libcr |
--with-blcr-bin=DIR | bin path for BLCR utilities |
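As a rough illustration, suppose BLCR was installed under a single prefix. The path `/opt/blcr` below is hypothetical, and the `include`, `lib`, and `bin` subdirectory layout is an assumption about a typical install, not something TORQUE requires. Under those assumptions the three options can be derived from the prefix and appended to the configure line shown earlier. The sketch only prints the command so it can be reviewed before being run from the TORQUE source directory.

```shell
#!/bin/sh
# Hypothetical BLCR install prefix; replace with the location on your system.
BLCR_PREFIX=/opt/blcr

# Assemble the configure command from the options in the table above.
configure_cmd="./configure --enable-unixsockets=no --enable-blcr"
configure_cmd="$configure_cmd --with-blcr-include=$BLCR_PREFIX/include"
configure_cmd="$configure_cmd --with-blcr-lib=$BLCR_PREFIX/lib"
configure_cmd="$configure_cmd --with-blcr-bin=$BLCR_PREFIX/bin"

# Print the command for review rather than executing it.
echo "$configure_cmd"
```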
The pbs_mom configuration file (/var/spool/torque/mom_priv/config) must be modified to identify the scripts that invoke the BLCR commands. Use the following variables in the configuration file when using BLCR checkpointing.
Variable | Description |
---|---|
$checkpoint_interval | How often periodic job checkpoints will be taken (minutes) |
$checkpoint_script | The name of the script file to execute to perform a job checkpoint |
$restart_script | The name of the script file to execute to perform a job restart |
$checkpoint_run_exe | The name of an executable program to be run when starting a checkpointable job (for BLCR, cr_run) |
The following examples show the contents of the configuration file and the scripts used for testing the BLCR feature in TORQUE.
The script files must be executable by the user. Use chmod to set their permissions to 754.
Example 2-6: Script file permissions
# chmod 754 blcr*
# ls -l
total 20
-rwxr-xr-- 1 root root 2112 2008-03-11 13:14 blcr_checkpoint_script
-rwxr-xr-- 1 root root 1987 2008-03-11 13:14 blcr_restart_script
-rw-r--r-- 1 root root 215 2008-03-11 13:13 config
drwxr-x--x 2 root root 4096 2008-03-11 13:21 jobs
-rw-r--r-- 1 root root 7 2008-03-11 13:15 mom.lock
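One way to confirm the result without reading the listing by eye is sketched below. The helper name `has_mode_754` is made up for illustration, and the demonstration runs in a scratch directory that stands in for mom_priv, so it can be tried without touching a live installation.

```shell
#!/bin/sh
# has_mode_754 FILE: succeed if FILE's permission string is -rwxr-xr-- (754).
# Hypothetical helper. Only the first 10 characters of the ls output are
# compared, so any trailing ACL or SELinux marker is ignored.
has_mode_754() {
    mode=$(ls -ld "$1" | cut -c1-10)
    [ "$mode" = "-rwxr-xr--" ]
}

# Demonstration in a scratch directory standing in for mom_priv.
scratch=$(mktemp -d)
touch "$scratch/blcr_checkpoint_script" "$scratch/blcr_restart_script"
chmod 754 "$scratch"/blcr*

result=ok
for f in "$scratch"/blcr*; do
    has_mode_754 "$f" || result=bad
done
echo "permissions: $result"
rm -rf "$scratch"
```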
Example 2-7: mom_priv/config
$checkpoint_script /var/spool/torque/mom_priv/blcr_checkpoint_script
$restart_script /var/spool/torque/mom_priv/blcr_restart_script
$checkpoint_run_exe /usr/local/bin/cr_run
$pbsserver makua.cridomain
$loglevel 7
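Before restarting pbs_mom it can be useful to check that the three BLCR-related directives are present and point at executable files. The sketch below is one possible way to do that. The function names `config_value` and `check_mom_config` are hypothetical and are not part of TORQUE, and the parsing assumes the simple "directive, whitespace, value" layout used in Example 2-7. The demonstration builds a throwaway config in a scratch directory instead of reading the real one.

```shell
#!/bin/sh
# config_value FILE NAME: print the value of directive $NAME in FILE.
config_value() {
    awk -v key="\$$2" '$1 == key { print $2; exit }' "$1"
}

# check_mom_config FILE: report directives that are missing or that do not
# point at an executable file. Returns non-zero if any check fails.
check_mom_config() {
    status=0
    for name in checkpoint_script restart_script checkpoint_run_exe; do
        path=$(config_value "$1" "$name")
        if [ -z "$path" ]; then
            echo "missing: \$$name"
            status=1
        elif [ ! -x "$path" ]; then
            echo "not executable: $path"
            status=1
        fi
    done
    return "$status"
}

# Demonstration with a scratch config; all paths here are stand-ins.
scratch=$(mktemp -d)
for f in blcr_checkpoint_script blcr_restart_script cr_run; do
    touch "$scratch/$f" && chmod 754 "$scratch/$f"
done
cat > "$scratch/config" <<EOF
\$checkpoint_script $scratch/blcr_checkpoint_script
\$restart_script $scratch/blcr_restart_script
\$checkpoint_run_exe $scratch/cr_run
EOF

if check_mom_config "$scratch/config"; then config_ok=yes; else config_ok=no; fi
echo "config ok: $config_ok"
rm -rf "$scratch"
```

On a real node the function would be pointed at /var/spool/torque/mom_priv/config instead of the scratch file.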
Example 2-8: mom_priv/blcr_checkpoint_script
#! /usr/bin/perl
################################################################################
#
# Usage: checkpoint_script
#
# This script is invoked by pbs_mom to checkpoint a job.
#
################################################################################
use strict;
use Sys::Syslog;
# Log levels:
# 0 = none -- no logging
# 1 = fail -- log only failures
# 2 = info -- log invocations
# 3 = debug -- log all subcommands
my $logLevel = 3;
logPrint(2, "Invoked: $0 " . join(' ', @ARGV) . "\n");
my ($sessionId, $jobId, $userId, $signalNum, $checkpointDir, $checkpointName, $depth);
my $usage =
"Usage: $0 <sessionId> <jobId> <userId> <checkpointDir> <checkpointName> <signalNum> <depth>\n";
# Note that depth is not used in this script but could control a limit to the number of checkpoint
# image files that are preserved on the disk.
#
# Note also that a request was made to identify whether this script was invoked by the job's
# owner or by a system administrator. While this information is known to pbs_server, it
# is not propagated to pbs_mom and thus it is not possible to pass this to the script.
# Therefore, a workaround is to invoke qmgr and attempt to set a trivial variable.
# This will fail if the invoker is not a manager.
if (@ARGV == 7)
{
($sessionId, $jobId, $userId, $checkpointDir, $checkpointName, $signalNum, $depth) =
@ARGV;
}
else { logDie(1, $usage); }
# Change to the checkpoint directory where we want the checkpoint to be created
# Always change directory; a trailing "if $logLevel" would skip the chdir when logging is off
chdir $checkpointDir
or logDie(1, "Unable to cd to checkpoint dir ($checkpointDir): $!\n");
my $cmd = "cr_checkpoint";
$cmd .= " --signal $signalNum" if $signalNum;
$cmd .= " --tree $sessionId";
$cmd .= " --file $checkpointName";
my $output = `$cmd 2>&1`;
my $rc = $? >> 8;
# Die on any failure regardless of log level; logPrint already filters what is logged
logDie(1, "Subcommand ($cmd) failed with rc=$rc:\n$output")
if $rc;
logPrint(3, "Subcommand ($cmd) yielded rc=$rc:\n$output")
if $logLevel >= 3;
exit 0;
################################################################################
# logPrint($level, $message)
# Write a message to syslog if $level does not exceed $logLevel
################################################################################
sub logPrint
{
my ($level, $message) = @_;
my @severity = ('none', 'warning', 'info', 'debug');
return if $level > $logLevel;
openlog('checkpoint_script', '', 'user');
# Pass the message as an argument so any '%' in it is not treated as a format directive
syslog($severity[$level], '%s', $message);
closelog();
}
################################################################################
# logDie($level, $message)
# Write a message (to syslog) and die
################################################################################
sub logDie
{
my ($level, $message) = @_;
logPrint($level, $message);
die($message);
}
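The comment in the checkpoint script notes that depth is unused but could cap the number of checkpoint images kept on disk. One possible way to do that is sketched below, in shell rather than Perl. The function name `prune_checkpoints`, the `ckpt.*` file-name pattern, and the choice to rank images by modification time are all assumptions made for illustration; none of them come from TORQUE or BLCR. The sketch also assumes file names contain no newlines.

```shell
#!/bin/sh
# prune_checkpoints DIR DEPTH: keep the DEPTH newest files matching
# DIR/ckpt.* (hypothetical naming) and delete the older ones.
prune_checkpoints() {
    ls -1t "$1"/ckpt.* 2>/dev/null |
        tail -n +"$(($2 + 1))" |
        while IFS= read -r old; do
            rm -f -- "$old"
        done
}

# Demonstration: five images with distinct timestamps, keep the newest two.
scratch=$(mktemp -d)
for i in 1 2 3 4 5; do
    touch -t "20240101000$i" "$scratch/ckpt.$i"
done
prune_checkpoints "$scratch" 2
remaining=$(ls "$scratch" | tr '\n' ' ')
echo "remaining: $remaining"
rm -rf "$scratch"
```

A real implementation would need to agree with whatever naming scheme pbs_mom passes as the checkpoint name, which is why the pattern is left as a placeholder here.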
Example 2-9: mom_priv/blcr_restart_script
#! /usr/bin/perl
################################################################################
#
# Usage: restart_script
#
# This script is invoked by pbs_mom to restart a job.
#
################################################################################
use strict;
use Sys::Syslog;
# Log levels:
# 0 = none -- no logging
# 1 = fail -- log only failures
# 2 = info -- log invocations
# 3 = debug -- log all subcommands
my $logLevel = 3;
logPrint(2, "Invoked: $0 " . join(' ', @ARGV) . "\n");
my ($sessionId, $jobId, $userId, $checkpointDir, $restartName);
my $usage =
"Usage: $0 <sessionId> <jobId> <userId> <checkpointDir> <restartName>\n";
if (@ARGV == 5)
{
($sessionId, $jobId, $userId, $checkpointDir, $restartName) =
@ARGV;
}
else { logDie(1, $usage); }
# Change to the checkpoint directory where the checkpoint image is located.
# Always change directory; a trailing "if $logLevel" would skip the chdir when logging is off
chdir $checkpointDir
or logDie(1, "Unable to cd to checkpoint dir ($checkpointDir): $!\n");
my $cmd = "cr_restart";
$cmd .= " $restartName";
my $output = `$cmd 2>&1`;
my $rc = $? >> 8;
# Die on any failure regardless of log level; logPrint already filters what is logged
logDie(1, "Subcommand ($cmd) failed with rc=$rc:\n$output")
if $rc;
logPrint(3, "Subcommand ($cmd) yielded rc=$rc:\n$output")
if $logLevel >= 3;
exit 0;
################################################################################
# logPrint($level, $message)
# Write a message to syslog if $level does not exceed $logLevel
################################################################################
sub logPrint
{
my ($level, $message) = @_;
my @severity = ('none', 'warning', 'info', 'debug');
return if $level > $logLevel;
openlog('restart_script', '', 'user');
# Pass the message as an argument so any '%' in it is not treated as a format directive
syslog($severity[$level], '%s', $message);
closelog();
}
################################################################################
# logDie($level, $message)
# Write a message (to syslog) and die
################################################################################
sub logDie
{
my ($level, $message) = @_;
logPrint($level, $message);
die($message);
}
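Stripped of logging and argument checks, each of the two scripts reduces to a single BLCR command line. The sketch below mirrors that string-building logic in shell and only prints the commands, so it can be run without BLCR installed. The function names and the sample argument values (session ID 1234, signal 15, file name ckpt.1234) are invented for illustration; the `cr_checkpoint` and `cr_restart` options are the ones the Perl scripts already use.

```shell
#!/bin/sh
# checkpoint_cmd SESSION_ID CHECKPOINT_NAME [SIGNAL_NUM]
# Mirrors the command assembled in blcr_checkpoint_script.
checkpoint_cmd() {
    cmd="cr_checkpoint"
    # Perl treats both an empty string and 0 as false, so both skip --signal.
    if [ -n "${3:-}" ] && [ "${3:-}" != "0" ]; then
        cmd="$cmd --signal $3"
    fi
    cmd="$cmd --tree $1 --file $2"
    echo "$cmd"
}

# restart_cmd RESTART_NAME
# Mirrors the command assembled in blcr_restart_script.
restart_cmd() {
    echo "cr_restart $1"
}

# Dry run with made-up values.
ckpt_line=$(checkpoint_cmd 1234 ckpt.1234 15)
restart_line=$(restart_cmd ckpt.1234)
echo "$ckpt_line"
echo "$restart_line"
```

Printing the command first is a convenient way to check the argument order pbs_mom is expected to pass before wiring the real scripts into mom_priv/config.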