1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
|
<HTML>
<BODY BGCOLOR=white>
<PRE>
<!-- Manpage converted by man2html 3.0.1 -->
NAME
checkpoint - Sun Grid Engine checkpointing environment con-
figuration file format
DESCRIPTION
Checkpointing is a facility to save the complete status of
an executing program or job and to restore and restart from
this so called checkpoint at a later point of time if the
original program or job was halted, e.g. through a system
crash.
Sun Grid Engine provides various levels of checkpointing
support (see <B><A HREF="../htmlman1/sge_ckpt.html?pathrev=V62u5_TAG">sge_ckpt(1)</A></B>). The checkpointing environment
described here is a means to configure the different types
of checkpointing in use for your Sun Grid Engine cluster or
parts thereof. For that purpose you can define the opera-
tions which have to be executed in initiating a checkpoint
generation, a migration of a checkpoint to another host or a
restart of a checkpointed application as well as the list of
queues which are eligible for a checkpointing method.
Supporting different operating systems may easily force Sun
Grid Engine to introduce operating system dependencies for
the configuration of the checkpointing configuration file
and updates of the supported operating system versions may
lead to frequently changing implementation details. Please
refer to the <sge_root>/ckpt directory for more information.
Please use the -<I>ackpt</I>, -<I>dckpt</I>, -<I>mckpt</I> or -<I>sckpt</I> options to
the <B><A HREF="../htmlman1/qconf.html?pathrev=V62u5_TAG">qconf(1)</A></B> command to manipulate checkpointing environ-
ments from the command-line or use the corresponding <B><A HREF="../htmlman1/qmon.html?pathrev=V62u5_TAG">qmon(1)</A></B>
dialogue for X-Windows based interactive configuration.
Note, Sun Grid Engine allows backslashes (\) be used to
escape newline (\newline) characters. The backslash and the
newline are replaced with a space (" ") character before any
interpretation.
FORMAT
The format of a <I>checkpoint</I> file is defined as follows:
ckpt_name
The name of the checkpointing environment as defined for
<I>ckpt</I>_<I>name</I> in <B><A HREF="../htmlman1/sge_types.html?pathrev=V62u5_TAG">sge_types(1)</A></B>. <B><A HREF="../htmlman1/qsub.html?pathrev=V62u5_TAG">qsub(1)</A></B> -ckpt switch or for the
<B><A HREF="../htmlman1/qconf.html?pathrev=V62u5_TAG">qconf(1)</A></B> options mentioned above.
interface
The type of checkpointing to be used. Currently, the follow-
ing types are valid:
<I>hibernator</I>
The Hibernator kernel level checkpointing is
interfaced.
<I>cpr</I> The SGI kernel level checkpointing is used.
<I>cray</I>-<I>ckpt</I>
The Cray kernel level checkpointing is assumed.
<I>transparent</I>
Sun Grid Engine assumes that the jobs submitted with
reference to this checkpointing interface use a check-
pointing library such as provided by the public domain
package <I>Condor</I>.
<I>userdefined</I>
Sun Grid Engine assumes that the jobs submitted with
reference to this checkpointing interface perform their
private checkpointing method.
<I>application</I>-<I>level</I>
Uses all of the interface commands configured in the
checkpointing object like in the case of one of the
kernel level checkpointing interfaces (<I>cpr</I>, <I>cray</I>-<I>ckpt</I>,
etc.) except for the restart_command (see below), which
is not used (even if it is configured) but the job
script is invoked in case of a restart instead.
ckpt_command
A command-line type command string to be executed by Sun
Grid Engine in order to initiate a checkpoint.
migr_command
A command-line type command string to be executed by Sun
Grid Engine during a migration of a checkpointing job from
one host to another.
restart_command
A command-line type command string to be executed by Sun
Grid Engine when restarting a previously checkpointed appli-
cation.
clean_command
A command-line type command string to be executed by Sun
Grid Engine in order to cleanup after a checkpointed appli-
cation has finished.
ckpt_dir
A file system location to which checkpoints of potentially
considerable size should be stored.
ckpt_signal
A Unix signal to be sent to a job by Sun Grid Engine to ini-
tiate a checkpoint generation. The value for this field can
either be a symbolic name from the list produced by the -<I>l</I>
option of the <B><A HREF="../htmlman1/kill.html?pathrev=V62u5_TAG">kill(1)</A></B> command or an integer number which
must be a valid signal on the systems used for checkpoint-
ing.
when
The points of time when checkpoints are expected to be gen-
erated. Valid values for this parameter are composed by the
letters <I>s</I>, <I>m</I>, <I>x</I> and <I>r</I> and any combinations thereof without
any separating character in between. The same letters are
allowed for the -<I>c</I> option of the <B><A HREF="../htmlman1/qsub.html?pathrev=V62u5_TAG">qsub(1)</A></B> command which will
overwrite the definitions in the used checkpointing environ-
ment. The meaning of the letters is defined as follows:
<I>s</I> A job is checkpointed, aborted and if possible migrated
if the corresponding <B><A HREF="../htmlman8/sge_execd.html?pathrev=V62u5_TAG">sge_execd(8)</A></B> is shut down on the
job's machine.
<I>m</I> Checkpoints are generated periodically at the
<I>min</I>_<I>cpu</I>_<I>interval</I> interval defined by the queue (see
<B><A HREF="../htmlman5/queue_conf.html?pathrev=V62u5_TAG">queue_conf(5)</A></B>) in which a job executes.
<I>x</I> A job is checkpointed, aborted and if possible migrated
as soon as the job gets suspended (manually as well as
automatically).
<I>r</I> A job will be rescheduled (not checkpointed) when the
host on which the job currently runs went into unknown
state and the time interval <I>reschedule</I>_<I>unknown</I> (see
<B><A HREF="../htmlman5/sge_conf.html?pathrev=V62u5_TAG">sge_conf(5)</A></B>) defined in the global/local cluster confi-
guration will be exceeded.
RESTRICTIONS
Note, that the functionality of any checkpointing, migration
or restart procedures provided by default with the Sun Grid
Engine distribution as well as the way how they are invoked
in the <I>ckpt</I>_<I>command</I>, <I>migr</I>_<I>command</I> or <I>restart</I>_<I>command</I> parame-
ters of any default checkpointing environments should not be
changed or otherwise the functionality remains the full
responsibility of the administrator configuring the check-
pointing environment. Sun Grid Engine will just invoke
these procedures and evaluate their exit status. If the pro-
cedures do not perform their tasks properly or are not
invoked in a proper fashion, the checkpointing mechanism may
behave unexpectedly, Sun Grid Engine has no means to detect
this.
SEE ALSO
<B><A HREF="../htmlman1/sge_intro.html?pathrev=V62u5_TAG">sge_intro(1)</A></B>, <B><A HREF="../htmlman1/sge_ckpt.html?pathrev=V62u5_TAG">sge_ckpt(1)</A></B>, <B><A HREF="../htmlman1/sge__types.html?pathrev=V62u5_TAG">sge__types(1)</A></B>, <B><A HREF="../htmlman1/qconf.html?pathrev=V62u5_TAG">qconf(1)</A></B>, <B><A HREF="../htmlman1/qmod.html?pathrev=V62u5_TAG">qmod(1)</A></B>,
<B><A HREF="../htmlman1/qsub.html?pathrev=V62u5_TAG">qsub(1)</A></B>, <B><A HREF="../htmlman8/sge_execd.html?pathrev=V62u5_TAG">sge_execd(8)</A></B>.
COPYRIGHT
See <B><A HREF="../htmlman1/sge_intro.html?pathrev=V62u5_TAG">sge_intro(1)</A></B> for a full statement of rights and permis-
sions.
</PRE>
<HR>
<ADDRESS>
Man(1) output converted with
<a href="http://www.oac.uci.edu/indiv/ehood/man2html.html">man2html</a>
</ADDRESS>
</BODY>
</HTML>
|