.\"
.\" Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
.\" University Research and Technology
.\" Corporation. All rights reserved.
.\" Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
.\"
.\" Man page for OPAL's CRS Functionality
.\"
.\" .TH name section center-footer left-footer center-header
.TH OPAL_CRS 7 "#OPAL_DATE#" "#PACKAGE_VERSION#" "#PACKAGE_NAME#"
.\" **************************
.\" Name Section
.\" **************************
.SH NAME
.
OPAL_CRS \- Open PAL MCA Checkpoint/Restart Service (CRS): Overview of Open PAL's
CRS framework, and selected modules. #PACKAGE_NAME# #PACKAGE_VERSION#.
.
.\" **************************
.\" Description Section
.\" **************************
.SH DESCRIPTION
.
.PP
Open PAL can involuntarily checkpoint and restart sequential programs.
Doing so requires that Open PAL was compiled with thread support and
that the back-end checkpointing systems are available at run-time.
.
.SS Phases of Checkpoint / Restart
.PP
Open PAL defines three phases for checkpoint / restart support in a
process:
.
.TP 4
Checkpoint
When the checkpoint request arrives, the process is notified of the
request before the checkpoint is taken.
.
.TP 4
Continue
After a checkpoint has successfully completed, the process that was
checkpointed is notified that it is continuing execution.
.
.TP 4
Restart
After a checkpoint has successfully completed, a new / restarted
process is notified of its successful restart.
.
.PP
The Continue and Restart phases are identical except for the process
in which they are invoked. The Continue phase is invoked in the same process
in which the Checkpoint phase was invoked. The Restart phase is invoked only in
newly restarted processes.
.
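.PP
The rule above can be sketched in C as follows. This is only an illustration:
the \fIdemo_\fR names are hypothetical and are not part of the Open PAL API.
.nf
```c
#include <stdbool.h>

/* Hypothetical names; these are not part of the Open PAL API. */
enum demo_phase {
    DEMO_PHASE_CHECKPOINT,
    DEMO_PHASE_CONTINUE,
    DEMO_PHASE_RESTART
};

/*
 * Phase a process is notified of once a checkpoint has completed:
 * the process that was checkpointed continues, a newly restarted
 * process restarts.
 */
enum demo_phase demo_phase_after_checkpoint(bool is_restarted_process)
{
    return is_restarted_process ? DEMO_PHASE_RESTART : DEMO_PHASE_CONTINUE;
}
```
.fi
.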
.\" **************************
.\" General Process Requirements Section
.\" **************************
.SH GENERAL PROCESS REQUIREMENTS
.PP
For a process to use the Open PAL CRS components, it must adhere to a
few programmatic requirements.
.PP
First, the program must call \fIOPAL_INIT\fR early in its execution. This
should only be called once, and it is not possible to checkpoint the process
without it first having called this function.
.PP
The program must call \fIOPAL_FINALIZE\fR before termination. This does a
significant amount of cleanup. If it is not called, then it is very likely that
remnants are left in the filesystem.
.PP
You must use the Open PAL tools to checkpoint and restart a process.
Using the back-end checkpointer's own checkpoint and restart tools leads
to undefined behavior.
To checkpoint a process use \fIopal_checkpoint\fR (opal_checkpoint(1)).
To restart a process use \fIopal_restart\fR (opal_restart(1)).
.
.\" **********************************
.\" Available Components Section
.\" **********************************
.SH AVAILABLE COMPONENTS
.PP
Open PAL ships with two CRS components: \fIself\fR and \fIblcr\fR.
.
.PP
The following MCA parameters apply to all components:
.
.TP 4
crs_base_verbose
Set the verbosity level for all components. Default is 0, or silent except on error.
.
.\" Self Component
.\" ******************
.SS self CRS Component
.PP
The \fIself\fR component invokes user-defined functions to save and restore
checkpoints. It is simply a mechanism for user-defined functions to be invoked
at Open PAL's Checkpoint, Continue, and Restart phases. Hence, the only data
that is saved during the checkpoint is what is written in the user's checkpoint
function. No library state is saved at all.
.
.PP
As such, the model for the \fIself\fR component is slightly different from that
of other components. Specifically, the Restart function is not invoked in the
process image that was checkpointed. The Restart phase is
invoked during \fIOPAL_INIT\fR of the new instance of the application (i.e., it
starts over from main()).
.
.PP
The \fIself\fR component has the following MCA parameters:
.TP 4
crs_self_prefix
Specify a string prefix for the name of the checkpoint, continue, and restart
functions that Open PAL will invoke during the respective phases. That is,
specifying "-mca crs_self_prefix foo" means that Open PAL expects to find
three functions at run-time:
.nf
int foo_checkpoint()
int foo_continue()
int foo_restart()
.fi
By default, the prefix is set to "opal_crs_self_user".
.
.TP 4
crs_self_priority
Set the \fIself\fR component's default priority.
.
.TP 4
crs_self_verbose
Set the verbosity level. Default is 0, or silent except on error.
.
.TP 4
crs_self_do_restart
This is mostly used internally. A general user should never need to set this
value. It is set to non-0 when the new process should invoke the restart
callback in \fIOPAL_INIT\fR. Default is 0, or normal execution.
.
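.PP
A minimal sketch of the three user functions for the prefix "foo" is shown
below. The function names follow the \fIcrs_self_prefix\fR description above.
Everything else is an assumption made for illustration: the counter, the
state file name, and the convention that 0 means success and non-0 means
failure.
.nf
```c
#include <stdio.h>

/* Hypothetical application state and file name, chosen for illustration. */
static int foo_counter = 0;
static const char *foo_state_file = "foo_state.ckpt";

/* Checkpoint phase: write the state that must survive a restart. */
int foo_checkpoint(void)
{
    FILE *fp = fopen(foo_state_file, "w");
    int ok;

    if (fp == NULL) {
        return -1;
    }
    ok = fprintf(fp, "%d", foo_counter) > 0;
    if (fclose(fp) != 0) {
        ok = 0;
    }
    return ok ? 0 : -1; /* assumed convention: 0 is success */
}

/* Continue phase: same process, state is still in memory. */
int foo_continue(void)
{
    return 0;
}

/* Restart phase: new process, reload the state that was saved. */
int foo_restart(void)
{
    FILE *fp = fopen(foo_state_file, "r");
    int value;
    int ok;

    if (fp == NULL) {
        return -1;
    }
    ok = fscanf(fp, "%d", &value) == 1;
    fclose(fp);
    if (!ok) {
        return -1;
    }
    foo_counter = value;
    return 0;
}
```
.fi
.PP
With functions like these linked into the application, the prefix would be
selected with "-mca crs_self_prefix foo". Only what \fIfoo_checkpoint\fR
writes is saved, so any state it omits is lost on restart.
.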
.\" BLCR Component
.\" ******************
.SS blcr CRS Component
.PP
Berkeley Lab Checkpoint/Restart (BLCR) is a single-process checkpointer
developed at Lawrence Berkeley National Laboratory. See the
project website for more details:
\fIhttp://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml\fR
.
.PP
The \fIblcr\fR component has the following MCA parameters:
.TP 4
crs_blcr_priority
Set the \fIblcr\fR component's default priority.
.
.TP 4
crs_blcr_verbose
Set the verbosity level. Default is 0, or silent except on error.
.
.\" Special 'none' option
.\" ************************
.SS none CRS Component
.PP
The \fInone\fP component simply selects no CRS component. All of the CRS
function calls return immediately with OPAL_SUCCESS.
.
.PP
This component is the last component to be selected by default. This means that
if another component is available and the \fInone\fP component was not
explicitly requested, then Open PAL will attempt to activate all of the
available components before falling back to this component.
.
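.PP
The fallback behavior described above can be sketched in C as follows. This is
not Open PAL's selection code: the \fIdemo_\fR names, the structure layout, and
the assumption that a higher priority value is preferred are all hypothetical.
.nf
```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of a CRS component, for illustration only. */
struct demo_component {
    const char *name;
    int priority;   /* assumed: a higher value is preferred */
    bool available; /* whether the component could be activated */
};

/*
 * Return the name of the available component with the highest
 * priority, or "none" when no component can be activated.
 */
const char *demo_select(const struct demo_component *c, size_t n)
{
    const struct demo_component *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (!c[i].available) {
            continue;
        }
        if (best == NULL || c[i].priority > best->priority) {
            best = &c[i];
        }
    }
    return best != NULL ? best->name : "none";
}
```
.fi
.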
.\" **************************
.\" See Also Section
.\" **************************
.
.SH SEE ALSO
opal_checkpoint(1), opal_restart(1)
.\", orte_crs(7), ompi_crs(7)