1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
|
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Author" content="christian reissmann">
<meta name="GENERATOR" content="Mozilla/4.76C-CCK-MCD Netscape [en] (X11; U; SunOS 5.8 sun4u) [Netscape]">
</head>
<body>
<h1>
<a NAME="shadowd"></a>shadowd - qmaster breakdown detection daemon</h1>
<font size=+1>The <a href="/cod_home/cr114091/CollabNet_OpenSource/gridengine/source/daemons/qmaster/qmaster.html">qmaster</a>
is the central control unit of a Grid Engine cluster. If it is unavailable
to provide service, the cluster operation is strongly affected (note, however,
that executing jobs will finish normally). To prevent from qmaster becoming
a single point of failure, the shadowd is available. It will detect when
the currently active qmaster becomes unavailable and it will startup a
new qmaster/ <a href="/cod_home/cr114091/CollabNet_OpenSource/gridengine/source/daemons/schedd/schedd.html">schedd</a>
pair. The shadowd has to run on a so called shadow master host, which has
to have identical access permissions to the qmaster spool files as the
currently active qmaster host. If multiple shadowds are active in a cluster,
they will run a protocol to ensure that only one new qmaster/schedd pair
is started. For more detailed information look at the man page <i>sge_shadowd(8).</i></font>
<p><font size=+1>The shadowd mechanism will only work if all Grid Engine
client commands can access the same shared file system for the cell directory
(In the cell directory - often $SGE_ROOT/default - are all configuration
files and the spooling directories for the cluster).</font>
<p><font size=+1>To start up a shadowd use the Grid Engine cluster startup
script <i>sge5</i> with the <i>-shadowd</i> option.</font>
<p><font size=+1>Note that the shadowd will <b>not</b> start a new qmaster/schedd
if the file "lock" exists in the qmaster spool directory. This file is
written by the qmaster in the case of a regular shutdown of the cluster
(<i>qconf -km</i>). The shadowd identifies unavailability of the qmaster
service first by reading a so called heartbeat file, which an active qmaster
updates from time to time. If it does not get updated, the shadowd will
<b>not</b>
start a new qmaster/schedd immediatelly. It will wait for several minutes
to ensure that a temporary network overload or breakdown are not the cause
of the problem.</font>
<br>
<br>
<p><b><font size=+1>The shadowd main function is implemented in the file:</font></b>
<blockquote><font size=+1>gridengine/source/daemons/shadowd/shadowd.c</font></blockquote>
<b><font size=+1>The main loop looks like follows:</font></b>
<blockquote><font size=+1>1. Check for running shadowd on the local host.
If there is already a running shadowd: stop</font>
<p><font size=+1>2. Prepare enroll for later enroll() call of commlib:
prepare_enroll()</font>
<blockquote><font size=+1>Set all parameter for comunication with commd</font></blockquote>
<font size=+1>3. Install exit functions and setup signal handlers</font>
<p><font size=+1>4. Parse cmd line parameters ( only -help available )</font>
<p><font size=+1>5. Get qmaster spool directory from configuration file</font>
<blockquote><font size=+1>The configuration file is <i>$SGE_ROOT/default/common/configuration</i></font></blockquote>
<font size=+1>6. Switch to admin user ( is also stored in the configuration
file )</font>
<p><font size=+1>7. Get qmaster heartbeat ( from file heartbeat in the
qmaster spool directory )</font>
<p><font size=+1>8. When heartbeat is unchanged more than one heartbeat
interval (240 seconds) start delay timer ( 600 seconds )</font>
<p><font size=+1>9. When the heartbeat is still unchanged after heartbeat
interval time and delay time:</font>
<blockquote><font size=+1>if no valid lock file ( from other shadowd )
found:</font>
<blockquote><font size=+1>- lock qmaster</font>
<br><font size=+1>- startup <a href="/cod_home/cr114091/CollabNet_OpenSource/gridengine/source/daemons/qmaster/qmaster.html">qmaster</a>
and <a href="/cod_home/cr114091/CollabNet_OpenSource/gridengine/source/daemons/schedd/schedd.html">schedd</a>
on local host</font></blockquote>
</blockquote>
</blockquote>
</body>
</html>
|