<HTML>
<BODY BGCOLOR=white>
<PRE>
<!-- Manpage converted by man2html 3.0.1 -->
NAME
     checkpoint - Sun Grid Engine checkpointing environment  con-
     figuration file format

DESCRIPTION
     Checkpointing is a facility to save the complete  status  of
     an  executing program or job and to restore and restart from
     this so-called checkpoint at a later point in  time  if  the
     original  program or job was halted, e.g. by a system crash.

     Sun Grid Engine provides  various  levels  of  checkpointing
     support  (see  <B><A HREF="../htmlman1/sge_ckpt.html?pathrev=V62u5_TAG">sge_ckpt(1)</A></B>).   The checkpointing environment
     described here is a means to configure the  different  types
     of  checkpointing in use for your Sun Grid Engine cluster or
     parts thereof. For that purpose you can define the operations
     to  be  executed  when  initiating a checkpoint, migrating a
     checkpoint  to  another  host  or restarting a checkpointed
     application, as well as the list of queues which are eligible
     for a checkpointing method.

     Supporting different operating systems may  force  Sun  Grid
     Engine  to  introduce operating system dependencies into the
     checkpointing configuration file, and updates  of  the  sup-
     ported  operating  system  versions  may  lead to frequently
     changing implementation details.  Please refer to the
     &lt;sge_root&gt;/ckpt directory for more information.

     Please use the -<I>ackpt</I>, -<I>dckpt</I>, -<I>mckpt</I> or -<I>sckpt</I>  options  to
     the  <B><A HREF="../htmlman1/qconf.html?pathrev=V62u5_TAG">qconf(1)</A></B>  command  to manipulate checkpointing environ-
     ments from the command-line or use the corresponding <B><A HREF="../htmlman1/qmon.html?pathrev=V62u5_TAG">qmon(1)</A></B>
     dialogue for X-Windows based interactive configuration.

     Note that Sun Grid Engine allows a backslash (\) to be  used
     to  escape a newline (\newline) character. The backslash and
     the newline are replaced with a space (" ") character before
     any interpretation.
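
     The continuation rule can be sketched in  Python  as  shown
     below.  This is only an illustration of the rule as stated
     here, not the parser used by Sun Grid  Engine.  The  helper
     name  join_continuations  and  the  command  path  in  the
     sample entry are hypothetical.

```python
def join_continuations(text):
    """Replace each backslash-newline pair with a single space."""
    return text.replace("\\\n", " ")


# Hypothetical two-line entry; the script path is made up.
raw = "ckpt_command /opt/ckpt/do_ckpt.sh \\\n    --verbose\n"
print(join_continuations(raw))
```

     Leading white space on the continued line is kept as is, so
     the joined entry may contain a run of several spaces.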

FORMAT
     The format of a <I>checkpoint</I> file is defined as follows:

  ckpt_name
     The name of the checkpointing  environment  as  defined  for
     <I>ckpt</I>_<I>name</I>  in <B><A HREF="../htmlman1/sge_types.html?pathrev=V62u5_TAG">sge_types(1)</A></B>.  To be used in the <B><A HREF="../htmlman1/qsub.html?pathrev=V62u5_TAG">qsub(1)</A></B> -ckpt
     switch or for the <B><A HREF="../htmlman1/qconf.html?pathrev=V62u5_TAG">qconf(1)</A></B> options mentioned above.

  interface
     The type of checkpointing to be used. Currently, the follow-
     ing types are valid:

     <I>hibernator</I>
          The   Hibernator   kernel   level   checkpointing    is
          interfaced.

     <I>cpr</I>  The SGI kernel level checkpointing is used.

     <I>cray</I>-<I>ckpt</I>
          The Cray kernel level checkpointing is assumed.

     <I>transparent</I>
          Sun Grid Engine assumes that the  jobs  submitted  with
          reference  to this checkpointing interface use a check-
          pointing library such as the one provided by the public
          domain package <I>Condor</I>.

     <I>userdefined</I>
          Sun Grid Engine assumes that the  jobs  submitted  with
          reference to this checkpointing interface perform their
          own private checkpointing method.

     <I>application</I>-<I>level</I>
          Uses all of the interface commands  configured  in  the
          checkpointing  object,  as with the kernel level check-
          pointing interfaces (<I>cpr</I>,  <I>cray</I>-<I>ckpt</I>,  etc.),  except
          for the restart_command (see below). The restart_command
          is not used, even if it is configured; the  job  script
          is invoked instead in case of a restart.

  ckpt_command
     A command-line type command string to  be  executed  by  Sun
     Grid Engine in order to initiate a checkpoint.

  migr_command
     A command-line type command string to  be  executed  by  Sun
     Grid  Engine  during a migration of a checkpointing job from
     one host to another.

  restart_command
     A command-line type command string to  be  executed  by  Sun
     Grid Engine when restarting a previously checkpointed appli-
     cation.

  clean_command
     A command-line type command string to  be  executed  by  Sun
     Grid Engine in order to clean up after a checkpointed appli-
     cation has finished.

  ckpt_dir
     A file system location in which checkpoints  of  potentially
     considerable size should be stored.

  ckpt_signal
     A Unix signal to be sent to a job by Sun Grid Engine to ini-
     tiate  a checkpoint generation. The value for this field can
     either be a symbolic name from the list produced by  the  -<I>l</I>
     option  of  the  <B><A HREF="../htmlman1/kill.html?pathrev=V62u5_TAG">kill(1)</A></B>  command or an integer number which
     must be a valid signal on the systems used  for  checkpoint-
     ing.
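
     A value for this field can be checked  ahead  of  time,  as
     sketched  below  in Python. The helper name resolve_signal
     is hypothetical, and accepting names without the SIG prefix
     is  an  assumption.  The check only covers the host running
     the script; the signal must also be valid  on  the  systems
     used for checkpointing.

```python
import signal


def resolve_signal(value):
    """Return a signal.Signals member for a name or number, else None."""
    text = value.strip()
    if text.isdigit():
        try:
            return signal.Signals(int(text))
        except ValueError:
            return None
    name = text.upper()
    # Assumption: names may be given with or without the SIG prefix.
    if not name.startswith("SIG"):
        name = "SIG" + name
    try:
        return signal.Signals[name]
    except KeyError:
        return None


for candidate in ("SIGTERM", "term", "NOSUCH"):
    print(candidate, resolve_signal(candidate))
```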

  when
     The points in time at which checkpoints are expected  to  be
     generated.  Valid  values for this parameter are composed of
     the letters <I>s</I>, <I>m</I>, <I>x</I> and <I>r</I> in any combination, without  any
     separating  character  in between. The same letters are
     allowed for the -<I>c</I> option of the <B><A HREF="../htmlman1/qsub.html?pathrev=V62u5_TAG">qsub(1)</A></B> command, which over-
     rides  the definitions in the checkpointing environment used.
     The meaning of the letters is defined as follows:

     <I>s</I>    A job is checkpointed, aborted and if possible migrated
          if  the  corresponding <B><A HREF="../htmlman8/sge_execd.html?pathrev=V62u5_TAG">sge_execd(8)</A></B> is shut down on the
          job's machine.

     <I>m</I>    Checkpoints   are   generated   periodically   at   the
          <I>min</I>_<I>cpu</I>_<I>interval</I>  interval  defined  by  the queue (see
          <B><A HREF="../htmlman5/queue_conf.html?pathrev=V62u5_TAG">queue_conf(5)</A></B>) in which a job executes.

     <I>x</I>    A job is checkpointed, aborted and if possible migrated
          as  soon as the job gets suspended (manually as well as
          automatically).

     <I>r</I>    A job will be rescheduled (not checkpointed)  when  the
          host  on which the job currently runs goes into an unk-
          nown state and the time interval <I>reschedule</I>_<I>unknown</I> (see
          <B><A HREF="../htmlman5/sge_conf.html?pathrev=V62u5_TAG">sge_conf(5)</A></B>) defined in the global/local cluster confi-
          guration is exceeded.
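
     A value for the when parameter can be validated as sketched
     below in Python. The names WHEN_LETTERS and parse_when  are
     hypothetical.  Rejecting  an  empty  string  and tolerating
     repeated letters are assumptions, since  this  page  states
     neither.

```python
# Short summaries of the four letters described above.
WHEN_LETTERS = {
    "s": "checkpoint when sge_execd is shut down",
    "m": "checkpoint periodically at min_cpu_interval",
    "x": "checkpoint when the job is suspended",
    "r": "reschedule when the host state is unknown",
}


def parse_when(value):
    """Return the set of letters in a when string, or raise ValueError."""
    letters = set(value)
    unknown = letters - set(WHEN_LETTERS)
    # Assumption: an empty value is treated as invalid.
    if not value or unknown:
        raise ValueError("invalid when value: " + repr(value))
    return letters


for letter in sorted(parse_when("sx")):
    print(letter, "-", WHEN_LETTERS[letter])
```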


RESTRICTIONS
     Note that the functionality of any checkpointing,  migration
     or  restart procedures provided by default with the Sun Grid
     Engine distribution, as well as the way in  which  they  are
     invoked   in  the  <I>ckpt</I>_<I>command</I>,  <I>migr</I>_<I>command</I>  or
     <I>restart</I>_<I>command</I> parameters of any default checkpointing
     environments,  should  not be changed. Otherwise, the func-
     tionality remains the full responsibility of the  adminis-
     trator  configuring the checkpointing environment. Sun Grid
     Engine will just invoke these procedures and evaluate their
     exit  status.  If the procedures do not perform their tasks
     properly or are not  invoked  in  a  proper  fashion,  the
     checkpointing  mechanism  may behave unexpectedly, and Sun
     Grid Engine has no means to detect this.

SEE ALSO
     <B><A HREF="../htmlman1/sge_intro.html?pathrev=V62u5_TAG">sge_intro(1)</A></B>, <B><A HREF="../htmlman1/sge_ckpt.html?pathrev=V62u5_TAG">sge_ckpt(1)</A></B>, <B><A HREF="../htmlman1/sge_types.html?pathrev=V62u5_TAG">sge_types(1)</A></B>, <B><A HREF="../htmlman1/qconf.html?pathrev=V62u5_TAG">qconf(1)</A></B>, <B><A HREF="../htmlman1/qmod.html?pathrev=V62u5_TAG">qmod(1)</A></B>,
     <B><A HREF="../htmlman1/qsub.html?pathrev=V62u5_TAG">qsub(1)</A></B>, <B><A HREF="../htmlman8/sge_execd.html?pathrev=V62u5_TAG">sge_execd(8)</A></B>.

COPYRIGHT
     See <B><A HREF="../htmlman1/sge_intro.html?pathrev=V62u5_TAG">sge_intro(1)</A></B> for a full statement of rights and  permis-
     sions.

</PRE>
<HR>
<ADDRESS>
Man(1) output converted with
<a href="http://www.oac.uci.edu/indiv/ehood/man2html.html">man2html</a>
</ADDRESS>
</BODY>
</HTML>