File: checkpoint.5

package info (click to toggle)
gridengine 6.2-4
  • links: PTS, VCS
  • area: main
  • in suites: lenny
  • size: 51,532 kB
  • ctags: 51,172
  • sloc: ansic: 418,155; java: 37,080; sh: 22,593; jsp: 7,699; makefile: 5,292; csh: 4,244; xml: 2,901; cpp: 2,086; perl: 1,895; tcl: 1,188; lisp: 669; ruby: 642; yacc: 393; lex: 266
file content (192 lines) | stat: -rw-r--r-- 6,947 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
'\" t
.\"___INFO__MARK_BEGIN__
.\"
.\" Copyright: 2004 by Sun Microsystems, Inc.
.\"
.\"___INFO__MARK_END__
.\" $RCSfile$     Last Update: $Date$     Revision: $Revision$
.\"
.\"
.\" Some handy macro definitions [from Tom Christensen's man(1) manual page].
.\"
.de SB		\" small and bold
.if !"\\$1"" \\s-2\\fB\&\\$1\\s0\\fR\\$2 \\$3 \\$4 \\$5
..
.\"
.de T		\" switch to typewriter font
.ft CW		\" probably want CW if you don't have TA font
..
.\"
.de TY		\" put $1 in typewriter font
.if t .T
.if n ``\c
\\$1\c
.if t .ft P
.if n \&''\c
\\$2
..
.\"
.de M		\" man page reference
\\fI\\$1\\fR\\|(\\$2)\\$3
..
.TH CHECKPOINT 5 "$Date$" "xxRELxx" "xxQS_NAMExx File Formats"
.\"
.SH NAME
checkpoint \- xxQS_NAMExx checkpointing environment configuration file format
.\"
.\"
.SH DESCRIPTION
Checkpointing is a facility to save the complete status of an executing
program or job and to restore and restart from this so called checkpoint
at a later point of time if the original program or job was halted, e.g.
through a system crash.
.PP
xxQS_NAMExx provides various levels of checkpointing support (see
.M xxqs_name_sxx_ckpt 1 ).
The checkpointing environment described here is a means to configure
the different types of checkpointing in use for your xxQS_NAMExx cluster or
parts thereof. For that purpose you can define the operations which
have to be executed in initiating a checkpoint generation, a migration
of a checkpoint to another host or a restart of a checkpointed
application as well as the list of queues which are eligible for a
checkpointing method.
.PP
Supporting different operating systems may easily force xxQS_NAMExx to 
introduce operating system dependencies for the configuration of the 
checkpointing configuration file and updates of the supported operating 
system versions may lead to frequently changing implementation details. 
Please refer to the <xxqs_name_sxx_root>/ckpt directory for more 
information.
.PP
Please use the \fI\-ackpt\fP, \fI\-dckpt\fP, \fI\-mckpt\fP or \fI\-sckpt\fP
options to the
.M qconf 1
command to manipulate checkpointing environments from the command-line or
use the corresponding
.M qmon 1
dialogue for X-Windows based interactive configuration.
.PP
Note, xxQS_NAMExx allows backslashes (\\) be used to escape newline
(\\newline) characters. The backslash and the newline are replaced with a
space (" ") character before any interpretation.
.\"
.\"
.SH FORMAT
The format of a
.I checkpoint
file is defined as follows:
.SS "\fBckpt_name\fP"
The name of the checkpointing environment as defined for \fIckpt_name\fP in
.M sge_types 1 .
To be used in the
.M qsub 1
\fB\-ckpt\fP switch or for the
.M qconf 1
options mentioned above.
.SS "\fBinterface\fP"
The type of checkpointing to be used. Currently, the following types are
valid:
.IP "\fIhibernator\fP"
The Hibernator kernel level checkpointing is interfaced.
.IP "\fIcpr\fP"
The SGI kernel level checkpointing is used.
.IP "\fIcray-ckpt\fP"
The Cray kernel level checkpointing is assumed.
.IP "\fItransparent\fP"
xxQS_NAMExx assumes that the jobs submitted with reference to this checkpointing
interface use a checkpointing library such as provided by 
the public domain package \fICondor\fP.
.IP "\fIuserdefined\fP"
xxQS_NAMExx assumes that the jobs submitted with reference to this checkpointing
interface perform their private checkpointing method.
.IP "\fIapplication-level\fP"
Uses all of the interface commands configured in the checkpointing object
like in the case of one of the kernel level checkpointing interfaces
(\fIcpr\fP, \fIcray-ckpt\fP, etc.) except for the
.B restart_command
(see below), which is not
used (even if it is configured) but the job script is invoked in case of a
restart instead.
.SS "\fBckpt_command\fP"
A command-line type command string to be executed by xxQS_NAMExx in order to
initiate a checkpoint.
.SS "\fBmigr_command\fP"
A command-line type command string to be executed by xxQS_NAMExx during a
migration of a checkpointing job from one host to another.
.SS "\fBrestart_command\fP"
A command-line type command string to be executed by xxQS_NAMExx when restarting
a previously checkpointed application.
.SS "\fBclean_command\fP"
A command-line type command string to be executed by xxQS_NAMExx in order
to cleanup after a checkpointed application has finished.
.SS "\fBckpt_dir\fP"
A file system location to which checkpoints of potentially considerable
size should be stored.
.SS "\fBckpt_signal\fP"
A Unix signal to be sent to a job by xxQS_NAMExx to initiate a checkpoint
generation. The value for this field can either be a symbolic name from the
list produced by the \fI\-l\fP option of the
.M kill 1
command or an integer number which must be a valid signal on the systems
used for checkpointing.
.SS "\fBwhen\fP"
The points of time when checkpoints are expected to be generated.
Valid values for this parameter are composed by the letters \fIs\fP,
\fIm\fP,
\fIx\fP and
\fIr\fP and
any combinations thereof without any separating character in between. The
same letters are allowed for the \fI\-c\fP option of the
.M qsub 1
command which will overwrite the definitions in the used checkpointing
environment.
The meaning of the letters is defined as follows:
.IP "\fIs\fP"
A job is checkpointed, aborted and if possible migrated if the
corresponding
.M xxqs_name_sxx_execd 8
is shut down on the job's machine.
.IP "\fIm\fP"
Checkpoints are generated periodically at the \fImin_cpu_interval\fP
interval defined by the queue (see
.M queue_conf 5 )
in which a job executes.
.IP "\fIx\fP"
A job is checkpointed, aborted and if possible migrated as soon as the job
gets suspended (manually as well as automatically).
.IP "\fIr\fP"
A job will be rescheduled (not checkpointed) when the host on which the job
currently runs went into unknown state and the time interval
\fIreschedule_unknown\fP (see
.M xxqs_name_sxx_conf 5 )
defined in the global/local cluster configuration will be exceeded.

.\"
.\"
.SH RESTRICTIONS
\fBNote\fP, that the functionality of any checkpointing,
migration or restart procedures provided by default with
the xxQS_NAMExx distribution as well as the way how they are invoked in
the \fIckpt_command\fP, \fImigr_command\fP or \fIrestart_command\fP
parameters of any default checkpointing environments should not be
changed or otherwise the functionality remains the full responsibility
of the administrator configuring the checkpointing environment.
xxQS_NAMExx will just invoke these procedures and evaluate their
exit status. If the procedures do not perform their tasks
properly or are not invoked in a proper fashion, the checkpointing
mechanism may behave unexpectedly, xxQS_NAMExx has no means to detect this.
.\"
.\"
.SH "SEE ALSO"
.M xxqs_name_sxx_intro 1 ,
.M xxqs_name_sxx_ckpt 1 ,
.M xxqs_name_sxx__types 1 ,
.M qconf 1 ,
.M qmod 1 ,
.M qsub 1 ,
.M xxqs_name_sxx_execd 8 .
.\"
.SH "COPYRIGHT"
See
.M xxqs_name_sxx_intro 1
for a full statement of rights and permissions.