File: checkpoint.html

<HTML>
<BODY BGCOLOR=white>
<PRE>
<!-- Manpage converted by man2html 3.0.1 -->
NAME
     checkpoint - Grid Engine checkpointing environment
     configuration file format

DESCRIPTION
     Checkpointing is a facility to save the complete state of an
     executing program or job and to restart it from this so-called
     checkpoint at a later point in time if the original program or
     job was halted, e.g. by a system crash.

     Grid Engine provides various levels of checkpointing support
     (see <B><A HREF="../htmlman1/sge_ckpt.html">sge_ckpt(1)</A></B>). The checkpointing environment described
     here is a means to configure the different types of
     checkpointing in use for your Grid Engine cluster or parts
     thereof. For that purpose you can define the operations which
     have to be executed to initiate the generation of a checkpoint,
     to migrate a checkpoint to another host or to restart a
     checkpointed application, as well as the list of queues which
     are eligible for a checkpointing method.

     Supporting different operating systems may force Grid Engine
     to introduce operating system dependencies into the
     checkpointing configuration file, and updates of the supported
     operating system versions may lead to frequently changing
     implementation details. Please refer to the
     &lt;sge_root&gt;/ckpt directory for more information.

     Please use the -<I>ackpt</I>, -<I>dckpt</I>, -<I>mckpt</I> or -<I>sckpt</I> options of
     the <B><A HREF="../htmlman1/qconf.html">qconf(1)</A></B> command to manipulate checkpointing
     environments from the command line, or use the corresponding
     <B><A HREF="../htmlman1/qmon.html">qmon(1)</A></B> dialogue for interactive configuration under the X
     Window System.
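
     As a minimal sketch, the four qconf(1) options named above can
     be turned into command lines as follows. The helper name
     qconf_ckpt_argv and the action keywords are hypothetical and
     not part of Grid Engine. The sketch assumes that each option
     takes the name of the checkpointing environment as its
     argument, and it only builds the argument list without
     running anything.

```python
# Hypothetical mapping from an action keyword to the qconf option
# named in the text above.
QCONF_CKPT_OPTIONS = {
    "add": "-ackpt",
    "delete": "-dckpt",
    "modify": "-mckpt",
    "show": "-sckpt",
}


def qconf_ckpt_argv(action, ckpt_name):
    """Build (but do not run) a qconf command line for ckpt_name."""
    if action not in QCONF_CKPT_OPTIONS:
        raise ValueError("unknown action: " + repr(action))
    if not ckpt_name:
        raise ValueError("ckpt_name must not be empty")
    return ["qconf", QCONF_CKPT_OPTIONS[action], ckpt_name]
```

     The returned list could be handed to subprocess.run on a host
     where qconf is installed; "add" and "modify" are interactive
     and would open an editor there.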

FORMAT
     The format of a <I>checkpoint</I> file is defined as follows:

  ckpt_name
     The name of the checkpointing environment, as used with the
     -ckpt switch of <B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B> or with the <B><A HREF="../htmlman1/qconf.html">qconf(1)</A></B> options mentioned
     above.

  interface
     The type of checkpointing to be used. Currently, the
     following types are valid:

     <I>hibernator</I>
          The Hibernator kernel level checkpointing is used.

     <I>cpr</I>  The SGI kernel level checkpointing is used.

     <I>cray</I>-<I>ckpt</I>
          The Cray kernel level checkpointing is assumed.

     <I>transparent</I>
          Grid Engine assumes that jobs submitted with reference
          to this checkpointing interface use a checkpointing
          library such as the one provided by the public domain
          package <I>Condor</I>.

     <I>userdefined</I>
          Grid Engine assumes that jobs submitted with reference
          to this checkpointing interface perform their own
          private checkpointing.

     <I>application</I>-<I>level</I>
          Uses all of the interface commands configured in the
          checkpointing environment, as with the kernel level
          checkpointing interfaces (<I>cpr</I>, <I>cray</I>-<I>ckpt</I>, etc.), except
          for the restart_command (see below). The restart_command
          is not used, even if it is configured; the job script is
          invoked instead when the job is restarted.
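
     The restart rule for the interface types above can be sketched
     as follows. The function name restart_target and the two
     grouping constants are hypothetical. Only the kernel level
     and application-level cases are modelled, since the text does
     not state what is invoked on restart for transparent or
     userdefined; the sketch returns None for those.

```python
# Interface names as listed in the text above.
KERNEL_LEVEL = frozenset({"hibernator", "cpr", "cray-ckpt"})
OTHER = frozenset({"transparent", "userdefined", "application-level"})
VALID_INTERFACES = KERNEL_LEVEL | OTHER


def restart_target(interface, restart_command, job_script):
    """Return what is invoked on restart, as far as the text says.

    Hypothetical helper: returns None where the text is silent.
    """
    if interface not in VALID_INTERFACES:
        raise ValueError("unknown interface: " + repr(interface))
    if interface == "application-level":
        # restart_command is ignored even if it is configured.
        return job_script
    if interface in KERNEL_LEVEL:
        return restart_command
    # transparent and userdefined: not specified in the text.
    return None
```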

  ckpt_command
     A command line to be executed by Grid Engine in order to
     initiate a checkpoint.

  migr_command
     A command line to be executed by Grid Engine during the
     migration of a checkpointing job from one host to another.

  restart_command
     A command line to be executed by Grid Engine when restarting
     a previously checkpointed application.

  clean_command
     A command line to be executed by Grid Engine in order to
     clean up after a checkpointed application has finished.

  ckpt_dir
     A file system location in which checkpoints, which may be of
     considerable size, should be stored.

  ckpt_signal
     A Unix signal to be sent to a job by Grid Engine to initiate
     the generation of a checkpoint. The value of this field can be
     either a symbolic name from the list produced by the -<I>l</I> option
     of the <B><A HREF="../htmlman1/kill.html">kill(1)</A></B> command or an integer, which must be a valid
     signal number on the systems used for checkpointing.
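
     A rough check of a candidate ckpt_signal value might look like
     the sketch below. The function name is hypothetical, and the
     check is made against the signals known to the local Python
     interpreter (Python 3.8 or later), which is only an
     approximation of what kill -l reports on the hosts used for
     checkpointing. It assumes that symbolic names may be written
     with or without the SIG prefix.

```python
import signal


def is_valid_ckpt_signal(value):
    """Check a candidate value against the signals of this host."""
    text = str(value).strip()
    if text.isdigit():
        number = int(text)
        # valid_signals() yields Signals members or plain integers.
        return any(int(s) == number for s in signal.valid_signals())
    name = text.upper()
    if not name.startswith("SIG"):
        # Assumption: kill -l may print names without the prefix.
        name = "SIG" + name
    return name in signal.Signals.__members__
```

     A name accepted here may still be unknown on another
     operating system in the cluster, so the result should be
     treated as a hint only.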


  when
     The occasions on which checkpoints are expected to be
     generated. Valid values for this parameter are composed of the
     letters <I>s</I>, <I>m</I>, <I>x</I> and <I>r</I> and any combination thereof, without
     any separating character in between. The same letters are
     allowed for the -<I>c</I> option of the <B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B> command, which
     overrides the definitions in the checkpointing environment
     used. The meaning of the letters is defined as follows:

     <I>s</I>    A job is checkpointed, aborted and if possible migrated
          if  the  corresponding <B><A HREF="../htmlman8/sge_execd.html">sge_execd(8)</A></B> is shut down on the
          job's machine.

     <I>m</I>    Checkpoints   are   generated   periodically   at   the
          <I>min</I>_<I>cpu</I>_<I>interval</I>  interval  defined  by  the queue (see
          <B><A HREF="../htmlman5/queue_conf.html">queue_conf(5)</A></B>) in which a job executes.

     <I>x</I>    A job is checkpointed, aborted and if possible migrated
          as  soon as the job gets suspended (manually as well as
          automatically).

     <I>r</I>    A job will be rescheduled (not checkpointed)  when  the
          host on which the job currently runs goes into an
          unknown state and the time interval <I>reschedule</I>_<I>unknown</I>
          (see <B><A HREF="../htmlman5/sge_conf.html">sge_conf(5)</A></B>) defined in the global/local cluster
          configuration is exceeded.
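
     The rules for the when parameter can be sketched as follows.
     The names is_valid_when and effective_when are hypothetical.
     The sketch models only the four letters described above,
     assumes that a repeated letter is harmless, and treats a
     value given with qsub -c as replacing the one from the
     checkpointing environment.

```python
# The four occasion letters and a short reminder of their meaning.
WHEN_LETTERS = {
    "s": "sge_execd on the job's machine is shut down",
    "m": "periodically, at min_cpu_interval",
    "x": "the job gets suspended",
    "r": "reschedule when the host state is unknown",
}


def is_valid_when(value):
    """True for a non-empty string made only of s, m, x and r."""
    return bool(value) and all(ch in WHEN_LETTERS for ch in value)


def effective_when(env_when, qsub_c=None):
    """Return the sorted letters in effect for a job.

    A value passed via qsub -c replaces the environment setting.
    """
    chosen = env_when if qsub_c is None else qsub_c
    if not is_valid_when(chosen):
        raise ValueError("invalid when value: " + repr(chosen))
    return sorted(set(chosen))
```

     For instance, effective_when("sm", "x") yields ["x"], because
     the job level setting takes precedence over the environment.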


RESTRICTIONS
     Note that the functionality of any checkpointing, migration
     or restart procedures provided by default with the Grid
     Engine distribution, as well as the way they are invoked in
     the <I>ckpt</I>_<I>command</I>, <I>migr</I>_<I>command</I> or <I>restart</I>_<I>command</I>
     parameters of any default checkpointing environments, should
     not be changed. Otherwise, the functionality remains the full
     responsibility of the administrator configuring the
     checkpointing environment. Grid Engine will just invoke these
     procedures and evaluate their exit status. If the procedures
     do not perform their tasks properly or are not invoked in a
     proper fashion, the checkpointing mechanism may behave
     unexpectedly; Grid Engine has no means to detect this.

SEE ALSO
     <B><A HREF="../htmlman1/sge_intro.html">sge_intro(1)</A></B>,  <B><A HREF="../htmlman1/sge_ckpt.html">sge_ckpt(1)</A></B>,  <B><A HREF="../htmlman1/qconf.html">qconf(1)</A></B>,   <B><A HREF="../htmlman1/qmod.html">qmod(1)</A></B>,   <B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B>,
     <B><A HREF="../htmlman8/sge_execd.html">sge_execd(8)</A></B>.

COPYRIGHT
     See <B><A HREF="../htmlman1/sge_intro.html">sge_intro(1)</A></B> for a full statement of rights and
     permissions.



</PRE>
<HR>
<ADDRESS>
Man(1) output converted with
<a href="http://www.oac.uci.edu/indiv/ehood/man2html.html">man2html</a>
</ADDRESS>
</BODY>
</HTML>