File: shadowd.html

<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
   <meta name="Author" content="christian reissmann">
   <meta name="GENERATOR" content="Mozilla/4.76C-CCK-MCD Netscape [en] (X11; U; SunOS 5.8 sun4u) [Netscape]">
</head>
<body>

<h1>
<a NAME="shadowd"></a>shadowd - qmaster breakdown detection daemon</h1>
<font size=+1>The&nbsp; <a href="/cod_home/cr114091/CollabNet_OpenSource/gridengine/source/daemons/qmaster/qmaster.html">qmaster</a>
is the central control unit of a Grid Engine cluster. If it becomes unavailable,
cluster operation is severely affected (note, however,
that jobs already executing will finish normally). To prevent the qmaster from becoming
a single point of failure, the shadowd is available. It detects when
the currently active qmaster becomes unavailable and starts up a
new qmaster/ <a href="/cod_home/cr114091/CollabNet_OpenSource/gridengine/source/daemons/schedd/schedd.html">schedd</a>
pair. The shadowd has to run on a so-called shadow master host, which must
have the same access permissions to the qmaster spool files as the
currently active qmaster host. If multiple shadowds are active in a cluster,
they run a protocol to ensure that only one new qmaster/schedd pair
is started. For more detailed information, see the man page <i>sge_shadowd(8)</i>.</font>
<p><font size=+1>The shadowd mechanism only works if all Grid Engine
client commands can access the same shared file system for the cell directory
(the cell directory, often $SGE_ROOT/default, holds all configuration
files and the spooling directories for the cluster).</font>
<p><font size=+1>To start up a shadowd, use the Grid Engine cluster startup
script <i>sge5</i> with the <i>-shadowd</i> option.</font>
<p><font size=+1>Note that the shadowd will <b>not</b> start a new qmaster/schedd
if the file "lock" exists in the qmaster spool directory. This file is
written by the qmaster in the case of a regular shutdown of the cluster
(<i>qconf -km</i>). The shadowd detects unavailability of the qmaster
service first by reading a so-called heartbeat file, which an active qmaster
updates periodically. If the file is not updated, the shadowd will
<b>not</b>
start a new qmaster/schedd immediately. It waits for several minutes
to ensure that a temporary network overload or breakdown is not the cause
of the problem.</font>
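<p><font size=+1>The two file checks described above can be sketched as
follows. This is only an illustration in Python, not the code in shadowd.c.
The file names "heartbeat" and "lock" come from the text; the function names
<i>read_heartbeat</i> and <i>lock_file_present</i> are hypothetical, and the
sketch assumes that the heartbeat file holds a single integer.</font>

```python
import os


def read_heartbeat(spool_dir):
    """Return the heartbeat value as an int, or None if it cannot be read.

    Assumption: the file "heartbeat" in the qmaster spool directory
    contains a single integer that an active qmaster keeps changing.
    """
    path = os.path.join(spool_dir, "heartbeat")
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or malformed file: no usable heartbeat.
        return None


def lock_file_present(spool_dir):
    """True if the file "lock" exists in the qmaster spool directory.

    While this file exists, no new qmaster/schedd pair is started.
    """
    return os.path.exists(os.path.join(spool_dir, "lock"))
```

<p><font size=+1>A caller would poll <i>read_heartbeat</i> at intervals and
compare successive values; an unchanged value alone does not justify a
takeover, because the waiting period and the lock file check still apply.</font>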
<br>&nbsp;
<br>&nbsp;
<p><b><font size=+1>The shadowd main function is implemented in the file:</font></b>
<blockquote><font size=+1>gridengine/source/daemons/shadowd/shadowd.c</font></blockquote>
<b><font size=+1>The main loop looks as follows:</font></b>
<blockquote><font size=+1>1. Check for a running shadowd on the local host.
If one is already running: stop</font>
<p><font size=+1>2. Prepare for the later enroll() call of commlib:
prepare_enroll()</font>
<blockquote><font size=+1>Set all parameters for communication with commd</font></blockquote>
<font size=+1>3. Install exit functions and setup signal handlers</font>
<p><font size=+1>4. Parse command line parameters (only -help is available)</font>
<p><font size=+1>5. Get qmaster spool directory from configuration file</font>
<blockquote><font size=+1>The configuration file is <i>$SGE_ROOT/default/common/configuration</i></font></blockquote>
<font size=+1>6. Switch to the admin user (also stored in the configuration
file)</font>
<p><font size=+1>7. Get the qmaster heartbeat (from the file heartbeat in the
qmaster spool directory)</font>
<p><font size=+1>8. If the heartbeat is unchanged for more than one heartbeat
interval (240 seconds), start the delay timer (600 seconds)</font>
<p><font size=+1>9. If the heartbeat is still unchanged after both the heartbeat
interval and the delay time have elapsed:</font>
<blockquote><font size=+1>if no valid lock file (from another shadowd) is
found:</font>
<blockquote><font size=+1>- lock qmaster</font>
<br><font size=+1>- start up&nbsp; <a href="/cod_home/cr114091/CollabNet_OpenSource/gridengine/source/daemons/qmaster/qmaster.html">qmaster</a>&nbsp;
and <a href="/cod_home/cr114091/CollabNet_OpenSource/gridengine/source/daemons/schedd/schedd.html">schedd</a>
on the local host</font></blockquote>
</blockquote>
</blockquote>
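<p><font size=+1>The timing in steps 7 to 9 can be sketched as a small state
tracker. This is an illustrative Python sketch, not a transcription of
shadowd.c. The 240 and 600 second values are taken from the steps above; the
class <i>HeartbeatWatch</i>, the function <i>next_action</i>, and the state
names are hypothetical. The sketch assumes that a takeover becomes possible
only once the heartbeat has been unchanged for longer than the sum of both
periods.</font>

```python
HEARTBEAT_INTERVAL = 240  # seconds, from step 8
DELAY_TIME = 600          # seconds, from step 8


class HeartbeatWatch:
    """Tracks how long the qmaster heartbeat value has been unchanged."""

    def __init__(self, now, heartbeat):
        self.last_value = heartbeat
        self.last_change = now

    def observe(self, now, heartbeat):
        """Classify one poll as 'alive', 'waiting', or 'takeover'."""
        if heartbeat != self.last_value:
            # The qmaster updated its heartbeat: reset the clock.
            self.last_value = heartbeat
            self.last_change = now
            return "alive"
        stale = now - self.last_change
        if stale > HEARTBEAT_INTERVAL + DELAY_TIME:
            return "takeover"  # step 9: interval and delay both elapsed
        if stale > HEARTBEAT_INTERVAL:
            return "waiting"   # step 8: delay timer is running
        return "alive"


def next_action(state, valid_lock_from_other_shadowd):
    """Step 9: act only if no other shadowd holds a valid lock file."""
    if state == "takeover" and not valid_lock_from_other_shadowd:
        return "lock qmaster, then start qmaster and schedd"
    return "keep polling"
```

<p><font size=+1>Passing the current time into <i>observe</i> instead of
reading a clock inside it keeps the sketch deterministic: a poll at 241
seconds with an unchanged heartbeat yields "waiting", and one at 841 seconds
yields "takeover".</font>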

</body>
</html>