2.6.2 Configuration files and scripts

Configuring and Building TORQUE for BLCR:

> ./configure --enable-unixsockets=no --enable-blcr

> make

> sudo make install

Depending on where BLCR is installed you may also need to use the following configure options to specify BLCR paths:

Option                     Description
--with-blcr-include=DIR    include path for libcr.h
--with-blcr-lib=DIR        lib path for libcr
--with-blcr-bin=DIR        bin path for BLCR utilities
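
The three path options can be derived from a single install prefix. The following is a minimal sketch, assuming BLCR keeps its headers, libraries, and utilities in the include, lib, and bin subdirectories of one prefix; the BLCR_PREFIX variable and the blcr_configure_args function are hypothetical names used only for illustration. It prints the configure command for review instead of running it.

```shell
#!/bin/sh
# Hypothetical BLCR install prefix (an assumption); adjust to your site.
BLCR_PREFIX="${BLCR_PREFIX:-/usr/local}"

# Print the configure options from this section, one per line,
# with the three BLCR paths derived from a single prefix.
blcr_configure_args() {
    prefix="$1"
    printf '%s\n' \
        "--enable-unixsockets=no" \
        "--enable-blcr" \
        "--with-blcr-include=${prefix}/include" \
        "--with-blcr-lib=${prefix}/lib" \
        "--with-blcr-bin=${prefix}/bin"
}

# Show the full command for review instead of running it.
echo "./configure $(blcr_configure_args "$BLCR_PREFIX" | tr '\n' ' ')"
```

If BLCR is split across unrelated directories, pass each option explicitly instead of relying on a shared prefix.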

The pbs_mom configuration file, /var/spool/torque/mom_priv/config, must be modified to identify the scripts that invoke the BLCR commands. Set the following variables in the configuration file when using BLCR checkpointing.

Variable               Description
$checkpoint_interval   How often periodic job checkpoints will be taken (minutes)
$checkpoint_script     The name of the script file to execute to perform a job checkpoint
$restart_script        The name of the script file to execute to perform a job restart
$checkpoint_run_exe    The name of an executable program to be run when starting a checkpointable job (for BLCR, cr_run)
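
A quick way to confirm that a configuration file sets these variables is to search it for each name. The sketch below is illustrative only: check_blcr_config is a hypothetical helper, and it checks just the three script and executable variables, on the assumption that $checkpoint_interval may be left unset.

```shell
#!/bin/sh
# Report which BLCR-related variables are absent from a pbs_mom config file.
# Returns 0 when all three script/executable variables are present.
check_blcr_config() {
    cfg="$1"
    status=0
    for var in checkpoint_script restart_script checkpoint_run_exe; do
        # Match a line that starts with a literal "$" followed by the name.
        if ! grep -q '^\$'"$var"'[[:space:]]' "$cfg"; then
            echo "missing: \$$var"
            status=1
        fi
    done
    return "$status"
}

# Usage (path taken from this section):
#   check_blcr_config /var/spool/torque/mom_priv/config
```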

The following examples show the configuration file and scripts used for testing the BLCR feature in TORQUE.

The script files must be executable by the user. Use chmod to set their permissions to 754, as shown in Example 2-1.

Example 2-1: Script file permissions

# chmod 754 blcr*

# ls -l

total 20

-rwxr-xr-- 1 root root 2112 2008-03-11 13:14 blcr_checkpoint_script

-rwxr-xr-- 1 root root 1987 2008-03-11 13:14 blcr_restart_script

-rw-r--r-- 1 root root 215 2008-03-11 13:13 config

drwxr-x--x 2 root root 4096 2008-03-11 13:21 jobs

-rw-r--r-- 1 root root 7 2008-03-11 13:15 mom.lock
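
To see how the octal mode 754 relates to the permission string in the listing above, the following sketch translates a three-digit mode into the form that ls -l prints. The digit_to_rwx and mode_to_string functions are hypothetical helpers written for this illustration and are not part of TORQUE or BLCR.

```shell
#!/bin/sh
# Translate one octal digit (0-7) into its rwx triplet.
digit_to_rwx() {
    case "$1" in
        0) echo "---" ;; 1) echo "--x" ;; 2) echo "-w-" ;; 3) echo "-wx" ;;
        4) echo "r--" ;; 5) echo "r-x" ;; 6) echo "rw-" ;; 7) echo "rwx" ;;
        *) return 1 ;;
    esac
}

# Translate a three-digit octal mode such as 754 into the
# owner/group/other string that ls -l prints.
mode_to_string() {
    mode="$1"
    owner=${mode%??}          # first digit
    rest=${mode#?}
    group=${rest%?}           # second digit
    other=${mode#??}          # third digit
    echo "$(digit_to_rwx "$owner")$(digit_to_rwx "$group")$(digit_to_rwx "$other")"
}

mode_to_string 754   # prints rwxr-xr--
```

Mode 754 therefore gives the owner full access, the group read and execute access, and everyone else read-only access, which matches the two script entries in Example 2-1.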

Example 2-2: mom_priv/config

$checkpoint_script /var/spool/torque/mom_priv/blcr_checkpoint_script

$restart_script /var/spool/torque/mom_priv/blcr_restart_script

$checkpoint_run_exe /usr/local/bin/cr_run

$pbsserver makua.cridomain

$loglevel 7

Example 2-3: mom_priv/blcr_checkpoint_script

#! /usr/bin/perl

################################################################################

#

# Usage: checkpoint_script <sessionId> <jobId> <userId> <checkpointDir> <checkpointName> <signalNum> <depth>

#

# This script is invoked by pbs_mom to checkpoint a job.

#

################################################################################

use strict;

use Sys::Syslog;

 

# Log levels:

# 0 = none -- no logging

# 1 = fail -- log only failures

# 2 = info -- log invocations

# 3 = debug -- log all subcommands

my $logLevel = 3;

 

logPrint(2, "Invoked: $0 " . join(' ', @ARGV) . "\n");

 

my ($sessionId, $jobId, $userId, $signalNum, $checkpointDir, $checkpointName, $depth);

my $usage =

  "Usage: $0 <sessionId> <jobId> <userId> <checkpointDir> <checkpointName> <signalNum> <depth>\n";



# Note that depth is not used in this script but could control a limit to the number of checkpoint

# image files that are preserved on the disk.

#

# Note also that a request was made to identify whether this script was invoked by the job's

# owner or by a system administrator. While this information is known to pbs_server, it

# is not propagated to pbs_mom and thus it is not possible to pass this to the script.

# Therefore, a workaround is to invoke qmgr and attempt to set a trivial variable.

# This will fail if the invoker is not a manager.



# pbs_mom passes exactly seven arguments, in the order shown in the usage message

if (@ARGV == 7)

{

    ($sessionId, $jobId, $userId, $checkpointDir, $checkpointName, $signalNum, $depth) =

       @ARGV;

}

else { logDie(1, $usage); }

 

# Change to the checkpoint directory where we want the checkpoint to be created

chdir $checkpointDir

  or logDie(1, "Unable to cd to checkpoint dir ($checkpointDir): $!\n");

 

my $cmd = "cr_checkpoint";

$cmd .= " --signal $signalNum" if $signalNum;

$cmd .= " --tree $sessionId";

$cmd .= " --file $checkpointName";

my $output = `$cmd 2>&1`;

my $rc = $? >> 8;

# Always abort on failure; logPrint decides whether the message is logged

logDie(1, "Subcommand ($cmd) failed with rc=$rc:\n$output")

  if $rc;

logPrint(3, "Subcommand ($cmd) yielded rc=$rc:\n$output")

  if $logLevel >= 3;

exit 0;

 

################################################################################

# logPrint($level, $message)

# Write a message (to syslog) if $level does not exceed $logLevel

################################################################################

sub logPrint

{

    my ($level, $message) = @_;

    my @severity = ('none', 'warning', 'info', 'debug');

 

    return if $level > $logLevel;

 

    openlog('checkpoint_script', '', 'user');

    syslog($severity[$level], '%s', $message);

    closelog();

}

 

################################################################################

# logDie($level, $message)

# Write a message (to syslog) and die

################################################################################

sub logDie

{

    my ($level, $message) = @_;

    logPrint($level, $message);

    die($message);

}
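
The command line that the checkpoint script assembles can be previewed without running BLCR. The sketch below mirrors the script's logic under the assumption that an empty or zero signal number means no signal is sent; build_checkpoint_cmd and the sample values are hypothetical and serve only as an illustration.

```shell
#!/bin/sh
# Mirror the command assembly in the Perl script: --signal is added only
# when the signal number is non-empty and non-zero.
build_checkpoint_cmd() {
    session_id="$1"
    checkpoint_name="$2"
    signal_num="$3"
    cmd="cr_checkpoint"
    if [ -n "$signal_num" ] && [ "$signal_num" != "0" ]; then
        cmd="$cmd --signal $signal_num"
    fi
    cmd="$cmd --tree $session_id --file $checkpoint_name"
    echo "$cmd"
}

# Hypothetical values, for illustration only.
build_checkpoint_cmd 12345 ckpt.example 15
build_checkpoint_cmd 12345 ckpt.example 0
```

The first call prints cr_checkpoint --signal 15 --tree 12345 --file ckpt.example, and the second omits the --signal option.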

Example 2-4: mom_priv/blcr_restart_script

#! /usr/bin/perl

################################################################################

#

# Usage: restart_script <sessionId> <jobId> <userId> <checkpointDir> <restartName>

#

# This script is invoked by pbs_mom to restart a job.

#

################################################################################

use strict;

use Sys::Syslog;

 

# Log levels:

# 0 = none -- no logging

# 1 = fail -- log only failures

# 2 = info -- log invocations

# 3 = debug -- log all subcommands

my $logLevel = 3;

 

logPrint(2, "Invoked: $0 " . join(' ', @ARGV) . "\n");

 

my ($sessionId, $jobId, $userId, $checkpointDir, $restartName);

my $usage =

  "Usage: $0 <sessionId> <jobId> <userId> <checkpointDir> <restartName>\n";

if (@ARGV == 5)

{

    ($sessionId, $jobId, $userId, $checkpointDir, $restartName) =

       @ARGV;

}

else { logDie(1, $usage); }

 

# Change to the checkpoint directory that holds the checkpoint image to restart from

chdir $checkpointDir

  or logDie(1, "Unable to cd to checkpoint dir ($checkpointDir): $!\n");

 

my $cmd = "cr_restart";

$cmd .= " $restartName";

my $output = `$cmd 2>&1`;

my $rc = $? >> 8;

# Always abort on failure; logPrint decides whether the message is logged

logDie(1, "Subcommand ($cmd) failed with rc=$rc:\n$output")

  if $rc;

logPrint(3, "Subcommand ($cmd) yielded rc=$rc:\n$output")

  if $logLevel >= 3;

exit 0;

 

################################################################################

# logPrint($level, $message)

# Write a message (to syslog) if $level does not exceed $logLevel

################################################################################

sub logPrint

{

    my ($level, $message) = @_;

    my @severity = ('none', 'warning', 'info', 'debug');

 

    return if $level > $logLevel;

    openlog('restart_script', '', 'user');

    syslog($severity[$level], '%s', $message);

    closelog();

}

 

################################################################################

# logDie($level, $message)

# Write a message (to syslog) and die

################################################################################

sub logDie

{

    my ($level, $message) = @_;

 

    logPrint($level, $message);

    die($message);

}
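
Both scripts write to syslog under the identifiers passed to openlog, checkpoint_script and restart_script. As a rough sketch, the messages can be picked out of a system log by filtering on those identifiers; filter_blcr_log is a hypothetical helper, and the location of the log file depends on the local syslog configuration.

```shell
#!/bin/sh
# Keep only syslog lines written by the two example scripts, which call
# openlog() with the identifiers checkpoint_script and restart_script.
filter_blcr_log() {
    grep -E '(checkpoint|restart)_script'
}

# Usage (the log path is an assumption; it varies by system):
#   filter_blcr_log < /var/log/messages
```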