1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433
|
<html>
<head>
<title>Redundant and Failover Network Monitoring</title>
<STYLE type="text/css">
<!--
.Default { font-family: verdana,arial,serif; font-size: 8pt; }
.PageTitle { font-family: verdana,arial,serif; font-size: 12pt; font-weight: bold; }
-->
</STYLE>
</head>
<body bgcolor="#FFFFFF" text="black" class="Default">
<p>
<div align="center">
<h2 class="PageTitle">Redundant and Failover Network Monitoring</h2>
</div>
</p>
<hr>
<p>
<strong><u>Introduction</u></strong>
</p>
<p>
This section describes a few scenarios for implementing redundant monitoring hosts an various types of network layouts. With redundant hosts, you can maintain the ability to monitor your network when the primary host that runs Nagios fails or when portions of your network become unreachable.
</p>
<p>
<font color="red"><strong>Note:</strong></font> If you are just learning how to use Nagios, I would suggest not trying to implement redudancy until you have becoming familiar with the <a href="#prerequisites">prerequisites</a> I've laid out. Redundancy is a relatively complicated issue to understand, and even more difficult to implement properly.
</p>
<p>
<strong><u>Index</u></strong>
</p>
<p>
<a href="#prerequisites">Prerequisites</a><br>
<a href="#samples">Sample scripts</a><br>
<a href="#scenario_1">Scenario 1 - Redundant monitoring</a><br>
<a href="#scenario_2">Scenario 2 - Failover monitoring</a><br>
</p>
<p>
<a name="prerequisites"></a>
<table border="0" width="100%" class="Default">
<tr>
<td bgcolor="#cbcbcb"><strong>Prerequisites</strong></td>
</tr>
</table>
</p>
<p>
Before you can even think about implementing redundancy with Nagios, you need to be familiar with the following...
</p>
<p>
<ul>
<li>Implementing <a href="eventhandlers.html">event handlers</a> for hosts and services
<li>Issuing <a href="extcommands.html">external commands</a> to Nagios via shell scripts
<li>Executing plugins on <a href="faqs.html#remote_host_monitoring">remote hosts</a> using either the <a href="addons.html#nrpe">nrpe addon</a> or some other method
<li>Checking the status of the Nagios process with the <a href="plugins.html#check_nagios">check_nagios</a> plugin
</ul>
</p>
<p>
<a name="samples"></a>
<table border="0" width="100%" class="Default">
<tr>
<td bgcolor="#cbcbcb"><strong>Sample Scripts</strong></td>
</tr>
</table>
</p>
<p>
All of the sample scripts that I use in this documentation can be found in the <i>eventhandlers/</i> subdirectory of the Nagios distribution. You'll probably need to modify them to work on your system...
</p>
<p>
<a name="scenario_1"></a>
<table border="0" width="100%" class="Default">
<tr>
<td bgcolor="#cbcbcb"><strong>Scenario 1 - Redundant Monitoring</strong></td>
</tr>
</table>
</p>
<p>
<strong><u>Introduction</u></strong>
</p>
<p>
This is an easy (and naive) method of implementing redundant monitoring hosts on your network and it will only protect against a limited number of failures. More complex setups are necessary in order to provide smarter redundancy, better redundancy across different network segments, etc.
</p>
<p>
<strong><u>Goals</u></strong>
</p>
<p>
The goal of this type of redundancy implementation is simple. Both the "master" and "slave" hosts monitor the same hosts and service on the network. Under normal circumstances only the "master" host will be sending out notifications to contacts about problems. We want the "slave" host running Nagios to take over the job of notifying contacts about problems if:
</p>
<p>
<ol>
<li>The "master" host that runs Nagios is down or..
<li>The Nagios process on the "master" host stops running for some reason
</ol>
</p>
<p>
<strong><u>Network Layout Diagram</u></strong>
</p>
<p>
The diagram below shows a very simple network setup. For this scenario I will be assuming that hosts A and E are both running Nagios and are monitoring all the hosts shown. Host A will be considered the "master" host and host E will be considered the "slave" host.
</p>
<p>
<table border="1" class="Default">
<tr>
<td><img src="images/redundancy.png" border=0></td>
</tr>
</table>
</p>
<p>
<strong><u>Initial Program Settings</u></strong>
</p>
<p>
The slave host (host E) has its initial <a href="configmain.html#enable_notifications">enable_notifications</a> directive disabled, thereby preventing it from sending out any host or service notifications. You also want to make sure that the slave host has its <a href="configmain.html#check_external_commands">check_external_commands</a> directive enabled. That was easy enough...
</p>
<p>
<strong><u>Initial Configuration</u></strong>
</p>
<p>
Next we need to consider the differences between the <a href="configobject.html">object configuration file(s)</a> on the master and slave hosts...
</p>
<p>
I will assume that you have the master host (host A) setup to monitor services on all hosts shown in the diagram above. The slave host (host E) should be setup to monitor the same services and hosts, with the following additions in the configuration file...
</p>
<p>
<ul>
<li>The host definition for host A (in the host E configuration file) should have a host <a href="eventhandlers.html">event handler</a> defined. Lets say the name of the host event handler is <font color="red">handle-master-host-event</font>.
<li>The configuration file on host E should have a service defined to check the status of the Nagios process on host A. Lets assume that you define this service check to run the <i>check_nagios</i> plugin on host A. This can be done by using one of the methods described in <a href="faqs.html#remote_host_monitoring">this FAQ</a>.
<li>The service definition for the Nagios process check on host A should have an <a href="eventhandlers.html">event handler</a> defined. Lets say the name of the service event handler is <font color="red">handle-master-proc-event</font>.
</ul>
</p>
<p>
It is important to note that host A (the master host) has no knowledge of host E (the slave host). In this scenario it simply doesn't need to. Of course you may be monitoring services on host E from host A, but that has nothing to do with the implementation of redundancy...
</p>
<p>
<strong><u>Event Handler Command Definitions</u></strong>
</p>
<p>
We need to stop for a minute and describe what the command definitions for the event handlers on the slave host look like. Here is an example...
</p>
<p>
<strong>
<font color="red">
<pre>
define command{
command_name handle-master-host-event
command_line /usr/local/nagios/libexec/eventhandlers/handle-master-host-event $HOSTSTATE$ $HOSTSTATETYPE$
}
define command{
command_name handle-master-proc-event
command_line /usr/local/nagios/libexec/eventhandlers/handle-master-proc-event $SERVICESTATE$ $SERVICESTATETYPE$
}
</pre>
</font>
</strong>
</p>
<p>
This assumes that you have placed the event handler scripts in the <i>/usr/local/nagios/libexec/eventhandlers</i> directory. You may place them anywhere you wish, but you'll need to modify the examples I've given here.
</p>
<p>
<strong><u>Event Handler Scripts</u></strong>
</p>
<p>
Okay, now lets take a look at what the event handler scripts look like...
</p>
<p>
Host Event Handler (<font color="red">handle-master-host-event</font>):
</p>
<p>
<pre>
#!/bin/sh
# Only take action on hard host states...
case "$2" in
HARD)
case "$1" in
DOWN)
# The master host has gone down!
# We should now become the master host and take
# over the responsibilities of monitoring the
# network, so enable notifications...
/usr/local/nagios/libexec/eventhandlers/enable_notifications
;;
UP)
# The master host has recovered!
# We should go back to being the slave host and
# let the master host do the monitoring, so
# disable notifications...
/usr/local/nagios/libexec/eventhandlers/disable_notifications
;;
esac
;;
esac
exit 0
</pre>
</p>
<p>
Service Event Handler (<font color="red">handle-master-proc-event</font>):
</p>
<p>
<pre>
#!/bin/sh
# Only take action on hard service states...
case "$2" in
HARD)
case "$1" in
CRITICAL)
# The master Nagios process is not running!
# We should now become the master host and
# take over the responsibility of monitoring
# the network, so enable notifications...
/usr/local/nagios/libexec/eventhandlers/enable_notifications
;;
WARNING)
UNKNOWN)
# The master Nagios process may or may not
# be running.. We won't do anything here, but
# to be on the safe side you may decide you
# want the slave host to become the master in
# these situations...
;;
OK)
# The master Nagios process running again!
# We should go back to being the slave host,
# so disable notifications...
/usr/local/nagios/libexec/eventhandlers/disable_notifications
;;
esac
;;
esac
exit 0
</pre>
</p>
<p>
<strong><u>What This Does For Us</u></strong>
</p>
<p>
The slave host (host E) initially has notifications disabled, so it won't send out any host or service notifications while the Nagios process on the master host (host A) is still running.
</p>
<p>
The Nagios process on the slave host (host E) becomes the master host when...
</p>
<p>
<ul>
<li>The master host (host A) goes down and the <i><font color="red">handle-master-host-event</font></i> host event handler is executed.
<li>The Nagios process on the master host (host A) stops running and the <i><font color="red">handle-master-proc-event</font></i> service event handler is executed.
</ul>
</p>
<p>
When the Nagios process on the slave host (host E) has notifications enabled, it will be able to send out notifications about any service or host problems or recoveries. At this point host E has effectively taken over the responsibility of notifying contacts of host and service problems!
</p>
<p>
The Nagios process on host E returns to being the slave host when...
</p>
<p>
<ul>
<li>Host A recovers and the <i><font color="red">handle-master-host-event</font></i> host event handler is executed.
<li>The Nagios process on host A recovers and the <i><font color="red">handle-master-proc-event</font></i> service event handler is executed.
</ul>
</p>
<p>
When the Nagios process on host E has notifications disabled, it will not send out notifications about any service or host problems or recoveries. At this point host E has handed over the responsibilities of notifying contacts of problems to the Nagios process on host A. Everything is now as it was when we first started!
</p>
<p>
<strong><u>Time Lags</u></strong>
</p>
<p>
Redundancy in Nagios is by no means perfect. One of the more obvious problems is the lag time between the master host failing and the slave host taking over. This is affected by the following...
</p>
<p>
<ul>
<li>The time between a failure of the master host and the first time the slave host detects a problem
<li>The time needed to verify that the master host really does have a problem (using service or host check retries on the slave host)
<li>The time between the execution of the event handler and the next time that Nagios checks for external commands
</ul>
</p>
<p>
You can minimize this lag by...
</p>
<p>
<ul>
<li>Ensuring that the Nagios process on host E (re)checks one or more services at a high frequency. This is done by using the <i>check_interval</i> and <i>retry_interval</i> arguments in each service definition.
<li>Ensuring that the number of host rechecks for host A (on host E) allow for fast detection of host problems. This is done by using the <i>max_check_attempts</i> argument in the host definition.
<li>Increase the frequency of <a href="extcommands.html">external command</a> checks on host E. This is done by modifying the <a href="configmain.html#command_check_interval">command_check_interval</a> option in the main configuration file.
</ul>
</p>
<p>
When Nagios recovers on the host A, there is also some lag time before host E returns to being a slave host. This is affected by the following...
</p>
<p>
<ul>
<li>The time between a recovery of host A and the time the Nagios process on host E detects the recovery
<li>The time between the execution of the event handler on host B and the next time the Nagios process on host E checks for external commands
</ul>
</p>
<p>
The exact lag times between the transfer of monitoring responsibilities will vary depending on how many services you have defined, the interval at which services are checked, and a lot of pure chance. At any rate, its definitely better than nothing.
</p>
<p>
<strong><u>Special Cases</u></strong>
</p>
<p>
Here is one thing you should be aware of... If host A goes down, host E will have notifications enabled and take over the responsibilities of notifying contacts of problems. When host A recovers, host E will have notifications disabled. If - when host A recovers - the Nagios process on host A does not start up properly, there will be a period of time when neither host is notifying contacts of problems! Fortunately, the service check logic in Nagios accounts for this. The next time the Nagios process on host E checks the status of the Nagios process on host A, it will find that it is not running. Host E will then have notifications enabled again and take over all responsibilities of notifying contacts of problems.
</p>
<p>
The exact amount of time that neither host is monitoring the network is hard to determine. Obviously, this period can be minimized by increasing the frequency of service checks (on host E) of the Nagios process on host A. The rest is up to pure chance, but the total "blackout" time shouldn't be too bad.
</p>
<p>
<a name="scenario_2"></a>
<table border="0" width="100%" class="Default">
<tr>
<td bgcolor="#cbcbcb"><strong>Scenario 2 - Failover Monitoring</strong></td>
</tr>
</table>
</p>
<p>
<strong><u>Introduction</u></strong>
</p>
<p>
Failover monitoring is similiar to, but slightly different than redundant monitoring (as discussed above in <a href="#scenario_1">scenario 1</a>).
</p>
<p>
<strong><u>Goals</u></strong>
</p>
<p>
The basic goal of failover monitoring is to have the Nagios process on the slave host sit idle while the Nagios process on the master host is running. If the process on the master host stops running (or if the host goes down), the Nagios process on the slave host starts monitoring everything.
</p>
<p>
While the method described in <a href="#scenario_1">scenario 1</a> will allow you to continue receive notifications if the master monitoring hosts goes down, it does have some pitfalls. The biggest problem is that the slave host is monitoring the same hosts and servers as the master <i>at the same time as the master</i>! This can cause problems with excessive traffic and load on the machines being monitored if you have a lot of services defined. Here's how you can get around that problem...
</p>
<p>
<strong><u>Initial Program Settings</u></strong>
</p>
<p>
Disable active service checks and notifications on the slave host using the <a href="configmain.html#execute_service_checks">execute_service_checks</a> and <a href="configmain.html#enable_notifications">enable_notifications</a> directives. This will prevent the slave host from monitoring hosts and services and sending out notifications while the Nagios process on the master host is still up and running. Make sure you also have the <a href="configmain.html#check_external_commands">check_external_commands</a> directive enabled on the slave host.
</p>
<p>
<strong><u>Master Process Check</u></strong>
</p>
<p>
Set up a cron job on the slave host that periodically (say every minute) runs a script that checks the staus of the Nagios process on the master host (using the <i>check_nrpe</i> plugin on the slave host and the <a href="addons.html#nrpe">nrpe daemon</a> and <i>check_nagios</i> plugin on the master host). The script should check the return code of the <i>check_nrpe plugin</i> . If it returns a non-OK state, the script should send the appropriate commands to the <a href="configmain.html#command_file">external command file</a> to enable both notifications and active service checks. If the plugin returns an OK state, the script should send commands to the external command file to disable both notifications and active checks.
</p>
<p>
By doing this you end up with only one process monitoring hosts and services at a time, which is much more efficient that monitoring everything twice.
</p>
<p>
Also of note, you <i>don't</i> need to define host and service handlers as mentioned in <a href="#scenario_1">scenario 1</a> because things are handled differently.
</p>
<p>
<strong><u>Additional Issues</u></strong>
</p>
<p>
At this point, you have implemented a very basic failover monitoring setup. However, there is one more thing you should consider doing to make things work smoother.
</p>
<p>
The big problem with the way things have been setup thus far is the fact that the slave host doesn't have the current status of any services or hosts at the time it takes over the job of monitoring. One way to solve this problem is to enable the <a href="configmain.html#ocsp_command">ocsp command</a> on the master host and have it send all service check results to the slave host using the <a href="addons.html#nsca">nsca addon</a>. The slave host will then have up-to-date status information for all services at the time it takes over the job of monitoring things. Since active service checks are not enabled on the slave host, it will not actively run any service checks. However, it will execute host checks if necessary. This means that both the master and slave hosts will be executing host checks as needed, which is not really a big deal since the majority of monitoring deals with service checks.
</p>
<p>
That's pretty much it as far as setup goes.
</p>
<hr>
</body>
</html>
|