File: redundancy.html

package info (click to toggle)
icinga 1.0.2-2%2Bsqueeze1
  • links: PTS, VCS
  • area: main
  • in suites: squeeze
  • size: 33,952 kB
  • ctags: 13,294
  • sloc: xml: 154,821; ansic: 99,198; sh: 14,585; sql: 5,852; php: 5,126; perl: 2,838; makefile: 1,268
file content (451 lines) | stat: -rw-r--r-- 23,955 bytes parent folder | download | duplicates (2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Redundant and Failover Network Monitoring</title>
<meta name="generator" content="DocBook XSL Stylesheets V1.75.1">
<meta name="keywords" content="Supervision, Icinga, Nagios, Linux">
<link rel="home" href="index.html" title="Icinga Version 1.0.2 Documentation">
<link rel="up" href="ch06.html" title="Chapter 6. Advanced Topics">
<link rel="prev" href="distributed.html" title="Distributed Monitoring">
<link rel="next" href="flapping.html" title="Detection and Handling of State Flapping">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<CENTER><IMG src="../images/logofullsize.png" border="0" alt="Icinga" title="Icinga"></CENTER>
<div class="navheader">
<table width="100%" summary="Navigation header">
<tr><th colspan="3" align="center">Redundant and Failover Network Monitoring</th></tr>
<tr>
<td width="20%" align="left">
<a accesskey="p" href="distributed.html">Prev</a> </td>
<th width="60%" align="center">Chapter 6. Advanced Topics</th>
<td width="20%" align="right"> <a accesskey="n" href="flapping.html">Next</a>
</td>
</tr>
</table>
<hr>
</div>
<div class="section" title="Redundant and Failover Network Monitoring">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="redundancy"></a><a name="redundancy_failover"></a>Redundant and Failover Network Monitoring</h2></div></div></div>
  

  <p><span class="bold"><strong>Introduction</strong></span></p>

  <p>This section describes a few scenarios for implementing redundant monitoring hosts an various types of network layouts.
  With redundant hosts, you can maintain the ability to monitor your network when the primary host that runs Icinga fails
  or when portions of your network become unreachable.</p>

  <div class="note" title="Note" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Note">
<tr>
<td rowspan="2" align="center" valign="top" width="25"><img alt="[Note]" src="../images/note.png"></td>
<th align="left">Note</th>
</tr>
<tr><td align="left" valign="top">
    <p><span class="color"></span> If you are just learning how to use Icinga, we would suggest
    not trying to implement redundancy until you have becoming familiar with the <a class="link" href="redundancy.html#redundancy-prerequisites">prerequisites</a> that have been laid out. Redundancy is a relatively complicated
    issue to understand, and even more difficult to implement properly.</p>
  </td></tr>
</table></div>

  <p><span class="bold"><strong>Index</strong></span></p>

  <p><a class="link" href="redundancy.html#redundancy-prerequisites">Prerequisites</a></p>

  <p><a class="link" href="redundancy.html#redundancy-samples">Sample scripts</a></p>

  <p><a class="link" href="redundancy.html#redundancy-scenario_1">Scenario 1 - Redundant monitoring</a></p>

  <p><a class="link" href="redundancy.html#redundancy-scenario_2">Scenario 2 - Failover monitoring</a></p>

  <p><a name="redundancy-prerequisites"></a></p>

  <div class="informaltable">
    <table border="0">
<colgroup><col></colgroup>
<tbody><tr><td><p> <span class="bold"><strong>Prerequisites</strong></span> </p></td></tr></tbody>
</table>
  </div>

  <p>Before you can even think about implementing redundancy with Icinga, you need to be familiar with the
  following...</p>

  <div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
      <p>Implementing <a class="link" href="eventhandlers.html" title="Event Handlers">event handlers</a> for hosts and services</p>
    </li>
<li class="listitem">
      <p>Issuing <a class="link" href="extcommands.html" title="External Commands">external commands</a> to Icinga via shell scripts</p>
    </li>
<li class="listitem">
      <p>Executing plugins on remote hosts using either the <a class="link" href="addons.html#addons-nrpe">NRPE addon</a> or some other
      method</p>
    </li>
<li class="listitem">
      <p>Checking the status of the Icinga process with the <span class="emphasis"><em>check_nagios</em></span> plugin</p>
    </li>
</ul></div>

  <p><a name="redundancy-samples"></a></p>

  <div class="informaltable">
    <table border="0">
<colgroup><col></colgroup>
<tbody><tr><td><p> <span class="bold"><strong>Sample Scripts</strong></span> </p></td></tr></tbody>
</table>
  </div>

  <p>All of the sample scripts that we use in this documentation can be found in the <span class="emphasis"><em>eventhandlers/</em></span>
  subdirectory of the Icinga distribution. You'll probably need to modify them to work on your system...</p>

  <p><a name="redundancy-scenario_1"></a></p>

  <div class="informaltable">
    <table border="0">
<colgroup><col></colgroup>
<tbody><tr><td><p> <span class="bold"><strong>Scenario 1 - Redundant Monitoring</strong></span> </p></td></tr></tbody>
</table>
  </div>

  <p><span class="bold"><strong>Introduction</strong></span></p>

  <p>This is an easy (and naive) method of implementing redundant monitoring hosts on your network and it will only protect
  against a limited number of failures. More complex setups are necessary in order to provide smarter redundancy, better
  redundancy across different network segments, etc.</p>

  <p><span class="bold"><strong>Goals</strong></span></p>

  <p>The goal of this type of redundancy implementation is simple. Both the "master" and "slave" hosts monitor the same hosts
  and service on the network. Under normal circumstances only the "master" host will be sending out notifications to contacts
  about problems. We want the "slave" host running Icinga to take over the job of notifying contacts about problems
  if:</p>

  <div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">
      <p>The "master" host that runs Icinga is down or..</p>
    </li>
<li class="listitem">
      <p>The Icinga process on the "master" host stops running for some reason</p>
    </li>
</ol></div>

  <p><span class="bold"><strong>Network Layout Diagram</strong></span></p>

  <p>The diagram below shows a very simple network setup. For this scenario we will be assuming that hosts A and E are both
  running Icinga and are monitoring all the hosts shown. Host A will be considered the "master" host and host E will be
  considered the "slave" host.</p>

  <div class="informaltable">
    <table border="1">
<colgroup><col></colgroup>
<tbody><tr><td><p> <span class="inlinemediaobject"><img src="../images/redundancy.png"></span> </p></td></tr></tbody>
</table>
  </div>

  <p><span class="bold"><strong>Initial Program Settings</strong></span></p>

  <p>The slave host (host E) has its initial <a class="link" href="configmain.html#configmain-enable_notifications">enable_notifications</a>
  directive disabled, thereby preventing it from sending out any host or service notifications. You also want to make sure that
  the slave host has its <a class="link" href="configmain.html#configmain-check_external_commands">check_external_commands</a> directive enabled. That
  was easy enough...</p>

  <p><span class="bold"><strong>Initial Configuration</strong></span></p>

  <p>Next we need to consider the differences between the <a class="link" href="configobject.html" title="Object Configuration Overview">object configuration file(s)</a> on
  the master and slave hosts...</p>

  <p>We will assume that you have the master host (host A) setup to monitor services on all hosts shown in the diagram above.
  The slave host (host E) should be setup to monitor the same services and hosts, with the following additions in the
  configuration file...</p>

  <div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
      <p>The host definition for host A (in the host E configuration file) should have a host <a class="link" href="eventhandlers.html" title="Event Handlers">event handler</a> defined. Lets say the name of the host event handler is <span class="color">handle-master-host-event</span>.</p>
    </li>
<li class="listitem">
      <p>The configuration file on host E should have a service defined to check the status of the Icinga process on
      host A. Lets assume that you define this service check to run the <span class="emphasis"><em>check_nagios</em></span> plugin on host A. This
      can be done by using one of the methods described in <span class="bold"><strong>this FAQ</strong></span> (update this!).</p>
    </li>
<li class="listitem">
      <p>The service definition for the Icinga process check on host A should have an <a class="link" href="eventhandlers.html" title="Event Handlers">event handler</a> defined. Lets say the name of the service event handler is <span class="color">handle-master-proc-event</span>.</p>
    </li>
</ul></div>

  <p>It is important to note that host A (the master host) has no knowledge of host E (the slave host). In this scenario it
  simply doesn't need to. Of course you may be monitoring services on host E from host A, but that has nothing to do with the
  implementation of redundancy...</p>

  <p><span class="bold"><strong>Event Handler Command Definitions</strong></span></p>

  <p>We need to stop for a minute and describe what the command definitions for the event handlers on the slave host look like.
  Here is an example...</p>

  <pre class="screen"> define command{
     command_name handle-master-host-event
     command_line /usr/local/icinga/libexec/eventhandlers/handle-master-host-event $HOSTSTATE$ $HOSTSTATETYPE$
 } 

 define command{
     command_name handle-master-proc-event 
     command_line /usr/local/icinga/libexec/eventhandlers/handle-master-proc-event $SERVICESTATE$ $SERVICESTATETYPE$
 }</pre>

  <p>This assumes that you have placed the event handler scripts in the
  <span class="emphasis"><em>/usr/local/icinga/libexec/eventhandlers</em></span> directory. You may place them anywhere you wish, but you'll need to
  modify the examples that are given here.</p>

  <p><span class="bold"><strong>Event Handler Scripts</strong></span></p>

  <p>Okay, now lets take a look at what the event handler scripts look like...</p>

  <p>Host Event Handler (<span class="color">handle-master-host-event</span>):</p>

  <pre class="screen">#!/bin/sh

# Only take action on hard host states...
case "$2" in
HARD)
        case "$1" in
        DOWN)
                # The master host has gone down!
                # We should now become the master host and take
                # over the responsibilities of monitoring the 
                # network, so enable notifications...
                /usr/local/icinga/libexec/eventhandlers/enable_notifications
                ;;
        UP)
                # The master host has recovered!
                # We should go back to being the slave host and
                # let the master host do the monitoring, so 
                # disable notifications...
                /usr/local/icinga/libexec/eventhandlers/disable_notifications
                ;;
        esac
        ;;
esac
exit 0</pre>

  <p>Service Event Handler (<span class="color">handle-master-proc-event</span>):</p>

  <pre class="screen">#!/bin/sh

# Only take action on hard service states...
case "$2" in
HARD)
        case "$1" in
        CRITICAL)
                # The master Icinga process is not running!
                # We should now become the master host and
                # take over the responsibility of monitoring
                # the network, so enable notifications...
                /usr/local/icinga/libexec/eventhandlers/enable_notifications
                ;;
        WARNING)
        UNKNOWN)
                # The master Icinga process may or may not
                # be running.. We won't do anything here, but
                # to be on the safe side you may decide you 
                # want the slave host to become the master in
                # these situations...
                ;;
        OK)
                # The master Icinga process running again!
                # We should go back to being the slave host, 
                # so disable notifications...
                /usr/local/icinga/libexec/eventhandlers/disable_notifications
                ;;
        esac
        ;;
esac
exit 0</pre>

  <p><span class="bold"><strong>What This Does For Us</strong></span></p>

  <p>The slave host (host E) initially has notifications disabled, so it won't send out any host or service notifications while
  the Icinga process on the master host (host A) is still running.</p>

  <p>The Icinga process on the slave host (host E) becomes the master host when...</p>

  <div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
      <p>The master host (host A) goes down and the <span class="emphasis"><em> <span class="color">handle-master-host-event</span> </em></span> host event handler is executed.</p>
    </li>
<li class="listitem">
      <p>The Icinga process on the master host (host A) stops running and the <span class="emphasis"><em> <span class="color">handle-master-proc-event</span> </em></span> service event handler is executed.</p>
    </li>
</ul></div>

  <p>When the Icinga process on the slave host (host E) has notifications enabled, it will be able to send out
  notifications about any service or host problems or recoveries. At this point host E has effectively taken over the
  responsibility of notifying contacts of host and service problems!</p>

  <p>The Icinga process on host E returns to being the slave host when...</p>

  <div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
      <p>Host A recovers and the <span class="emphasis"><em> <span class="color">handle-master-host-event</span>
      </em></span> host event handler is executed.</p>
    </li>
<li class="listitem">
      <p>The Icinga process on host A recovers and the <span class="emphasis"><em> <span class="color">handle-master-proc-event</span> </em></span> service event handler is executed.</p>
    </li>
</ul></div>

  <p>When the Icinga process on host E has notifications disabled, it will not send out notifications about any service
  or host problems or recoveries. At this point host E has handed over the responsibilities of notifying contacts of problems to
  the Icinga process on host A. Everything is now as it was when we first started!</p>

  <p><span class="bold"><strong>Time Lags</strong></span></p>

  <p>Redundancy in Icinga is by no means perfect. One of the more obvious problems is the lag time between the master
  host failing and the slave host taking over. This is affected by the following...</p>

  <div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
      <p>The time between a failure of the master host and the first time the slave host detects a problem</p>
    </li>
<li class="listitem">
      <p>The time needed to verify that the master host really does have a problem (using service or host check retries on the
      slave host)</p>
    </li>
<li class="listitem">
      <p>The time between the execution of the event handler and the next time that Icinga checks for external
      commands</p>
    </li>
</ul></div>

  <p>You can minimize this lag by...</p>

  <div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
      <p>Ensuring that the Icinga process on host E (re)checks one or more services at a high frequency. This is done by
      using the <span class="emphasis"><em>check_interval</em></span> and <span class="emphasis"><em>retry_interval</em></span> arguments in each service
      definition.</p>
    </li>
<li class="listitem">
      <p>Ensuring that the number of host rechecks for host A (on host E) allow for fast detection of host problems. This is
      done by using the <span class="emphasis"><em>max_check_attempts</em></span> argument in the host definition.</p>
    </li>
<li class="listitem">
      <p>Increase the frequency of <a class="link" href="extcommands.html" title="External Commands">external command</a> checks on host E. This is done by
      modifying the <a class="link" href="configmain.html#configmain-command_check_interval">command_check_interval</a> option in the main
      configuration file.</p>
    </li>
</ul></div>

  <p>When Icinga recovers on the host A, there is also some lag time before host E returns to being a slave host. This
  is affected by the following...</p>

  <div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
      <p>The time between a recovery of host A and the time the Icinga process on host E detects the recovery</p>
    </li>
<li class="listitem">
      <p>The time between the execution of the event handler on host B and the next time the Icinga process on host E
      checks for external commands</p>
    </li>
</ul></div>

  <p>The exact lag times between the transfer of monitoring responsibilities will vary depending on how many services you have
  defined, the interval at which services are checked, and a lot of pure chance. At any rate, its definitely better than
  nothing.</p>

  <p><span class="bold"><strong>Special Cases</strong></span></p>

  <p>Here is one thing you should be aware of... If host A goes down, host E will have notifications enabled and take over the
  responsibilities of notifying contacts of problems. When host A recovers, host E will have notifications disabled. If - when
  host A recovers - the Icinga process on host A does not start up properly, there will be a period of time when neither
  host is notifying contacts of problems! Fortunately, the service check logic in Icinga accounts for this. The next time
  the Icinga process on host E checks the status of the Icinga process on host A, it will find that it is not
  running. Host E will then have notifications enabled again and take over all responsibilities of notifying contacts of
  problems.</p>

  <p>The exact amount of time that neither host is monitoring the network is hard to determine. Obviously, this period can be
  minimized by increasing the frequency of service checks (on host E) of the Icinga process on host A. The rest is up to
  pure chance, but the total "blackout" time shouldn't be too bad.</p>

  <p><a name="redundancy-scenario_2"></a></p>

  <div class="informaltable">
    <table border="0">
<colgroup><col></colgroup>
<tbody><tr><td><p> <span class="bold"><strong>Scenario 2 - Failover Monitoring</strong></span> </p></td></tr></tbody>
</table>
  </div>

  <p><span class="bold"><strong>Introduction</strong></span></p>

  <p>Failover monitoring is similiar to, but slightly different than redundant monitoring (as discussed above in <a class="link" href="redundancy.html#redundancy-scenario_1">scenario 1</a>).</p>

  <p><span class="bold"><strong>Goals</strong></span></p>

  <p>The basic goal of failover monitoring is to have the Icinga process on the slave host sit idle while the
  Icinga process on the master host is running. If the process on the master host stops running (or if the host goes down),
  the Icinga process on the slave host starts monitoring everything.</p>

  <p>While the method described in <a class="link" href="redundancy.html#redundancy-scenario_1">scenario 1</a> will allow you to continue
  receive notifications if the master monitoring hosts goes down, it does have some pitfalls. The biggest problem is that the
  slave host is monitoring the same hosts and servers as the master <span class="emphasis"><em>at the same time as the master</em></span>! This can
  cause problems with excessive traffic and load on the machines being monitored if you have a lot of services defined. Here's how
  you can get around that problem...</p>

  <p><span class="bold"><strong>Initial Program Settings</strong></span></p>

  <p>Disable active service checks and notifications on the slave host using the <a class="link" href="configmain.html#configmain-execute_service_checks">execute_service_checks</a> and <a class="link" href="configmain.html#configmain-enable_notifications">enable_notifications</a> directives. This will prevent the slave host from
  monitoring hosts and services and sending out notifications while the Icinga process on the master host is still up and
  running. Make sure you also have the <a class="link" href="configmain.html#configmain-check_external_commands">check_external_commands</a> directive
  enabled on the slave host.</p>

  <p><span class="bold"><strong>Master Process Check</strong></span></p>

  <p>Set up a cron job on the slave host that periodically (say every minute) runs a script that checks the status of the
  Icinga process on the master host (using the <span class="emphasis"><em>check_nrpe</em></span> plugin on the slave host and the <a class="link" href="addons.html#addons-nrpe">nrpe daemon</a> and <span class="emphasis"><em>check_nagios</em></span> plugin on the master host). The script should
  check the return code of the <span class="emphasis"><em>check_nrpe plugin</em></span> . If it returns a non-OK state, the script should send the
  appropriate commands to the <a class="link" href="configmain.html#configmain-command_file">external command file</a> to enable both notifications
  and active service checks. If the plugin returns an OK state, the script should send commands to the external command file to
  disable both notifications and active checks.</p>

  <p>By doing this you end up with only one process monitoring hosts and services at a time, which is much more efficient that
  monitoring everything twice.</p>

  <p>Also of note, you <span class="emphasis"><em>don't</em></span> need to define host and service handlers as mentioned in <a class="link" href="redundancy.html#redundancy-scenario_1">scenario 1</a> because things are handled differently.</p>

  <p><span class="bold"><strong>Additional Issues</strong></span></p>

  <p>At this point, you have implemented a very basic failover monitoring setup. However, there is one more thing you should
  consider doing to make things work smoother.</p>

  <p>The big problem with the way things have been setup thus far is the fact that the slave host doesn't have the current
  status of any services or hosts at the time it takes over the job of monitoring. One way to solve this problem is to enable the
  <a class="link" href="configmain.html#configmain-ocsp_command">ocsp command</a> on the master host and have it send all service check results to the
  slave host using the <a class="link" href="addons.html#addons-nsca">nsca addon</a>. The slave host will then have up-to-date status information
  for all services at the time it takes over the job of monitoring things. Since active service checks are not enabled on the
  slave host, it will not actively run any service checks. However, it will execute host checks if necessary. This means that both
  the master and slave hosts will be executing host checks as needed, which is not really a big deal since the majority of
  monitoring deals with service checks.</p>

  <p>That's pretty much it as far as setup goes.</p>
  <a class="indexterm" name="id1994073"></a>
  <a class="indexterm" name="id1994082"></a>
  <a class="indexterm" name="id1994084"></a>
</div>
<div class="navfooter">
<hr>
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left">
<a accesskey="p" href="distributed.html">Prev</a> </td>
<td width="20%" align="center"><a accesskey="u" href="ch06.html">Up</a></td>
<td width="40%" align="right"> <a accesskey="n" href="flapping.html">Next</a>
</td>
</tr>
<tr>
<td width="40%" align="left" valign="top">Distributed Monitoring </td>
<td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td>
<td width="40%" align="right" valign="top"> Detection and Handling of State Flapping</td>
</tr>
</table>
</div>
<P class="copyright">© 2009-2010 Icinga Development Team, http://www.icinga.org</P>
</body>
</html>