1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>Distributed Monitoring</title>
<STYLE type="text/css">
<!--
.Default { font-family: verdana,arial,serif; font-size: 8pt; }
.PageTitle { font-family: verdana,arial,serif; font-size: 12pt; font-weight: bold; }
-->
</STYLE>
</head>
<body bgcolor="#FFFFFF" text="black" class="Default">
<p>
<div align="center">
<h2 class="PageTitle">Distributed Monitoring</h2>
</div>
</p>
<hr>
<p>
<strong><u>Introduction</u></strong>
</p>
<p>
Nagios can be configured to support distributed monitoring of network services and resources. I'll try to briefly explan how this can be accomplished...
</p>
<p>
<strong><u>Goals</u></strong>
</p>
<p>
The goal in the distributed monitoring environment that I will describe is to offload the overhead (CPU usage, etc.) of performing service checks from a "central" server onto one or more "distributed" servers. Most small to medium sized shops will not have a real need for setting up such an environment. However, when you want to start monitoring hundreds or even thousands of <i>hosts</i> (and several times that many services) using Nagios, this becomes quite important.
</p>
<p>
<strong><u>Reference Diagram</u></strong>
</p>
<p>
The diagram below should help give you a general idea of how distributed monitoring works with Nagios. I'll be referring to the items shown in the diagram as I explain things...
</p>
<p>
<a href="images/distributed.png"><img src="images/distributed.png" border=1 width="200" height="250"></a>
</p>
<p>
<strong><u>Central Server vs. Distributed Servers</u></strong>
</p>
<p>
When setting up a distributed monitoring environment with Nagios, there are differences in the way the central and distributed servers are configured. I'll show you how to configure both types of servers and explain what effects the changes being made have on the overall monitoring. For starters, lets describe the purpose of the different types of servers...
</p>
<p>
The function of a <i>distributed server</i> is to actively perform checks all the services you define for a "cluster" of hosts. I use the term "cluster" loosely - it basically just mean an arbitrary group of hosts on your network. Depending on your network layout, you may have several cluters at one physical location, or each cluster may be separated by a WAN, its own firewall, etc. The important thing to remember to that for each cluster of hosts (however you define that), there is one distributed server that runs Nagios and monitors the services on the hosts in the cluster. A distributed server is usually a bare-bones installation of Nagios. It doesn't have to have the web interface installed, send out notifications, run event handler scripts, or do anything other than execute service checks if you don't want it to. More detailed information on configuring a distributed server comes later...
</p>
<p>
The purpose of the <i>central server</i> is to simply listen for service check results from one or more distributed servers. Even though services are occassionally actively checked from the central server, the active checks are only performed in dire circumstances, so lets just say that the central server only accepts passive check for now. Since the central server is obtaining <a href="passivechecks.html">passive service check</a> results from one or more distributed servers, it serves as the focal point for all monitoring logic (i.e. it sends out notifications, runs event handler scripts, determines host states, has the web interface installed, etc).
</p>
<p>
<strong><u>Obtaining Service Check Information From Distributed Monitors</u></strong>
</p>
<p>
Okay, before we go jumping into configuration detail we need to know how to send the service check results from the distributed servers to the central server. I've already discussed how to submit passive check results to Nagios from same host that Nagios is running on (as described in the documentation on <a href="passivechecks.html">passive checks</a>), but I haven't given any info on how to submit passive check results from other hosts.
</p>
<p>
In order to facilitate the submission of passive check results to a remote host, I've written the <a href="addons.html#nsca">nsca addon</a>. The addon consists of two pieces. The first is a client program (send_nsca) which is run from a remote host and is used to send the service check results to another server. The second piece is the nsca daemon (nsca) which either runs as a standalone daemon or under inetd and listens for connections from client programs. Upon receiving service check information from a client, the daemon will sumbit the check information to Nagios (on the central server) by inserting a <i>PROCESS_SVC_CHECK_RESULT</i> command into the <a href="configmain.html#command_file">external command file</a>, along with the check results. The next time Nagios checks for <a href="extcommands.html">external commands</a>, it will find the passive service check information that was sent from the distributed server and process it. Easy, huh?
</p>
<p>
<strong><u>Distributed Server Configuration</u></strong>
</p>
<p>
So how exactly is Nagios configured on a distributed server? Basically, its just a bare-bones installation. You don't need to install the web interface or have notifications sent out from the server, as this will all be handled by the central server.
</p>
<p>
Key configuration changes:
</p>
<p>
<ul>
<li>Only those services and hosts which are being monitored directly by the distributed server are defined in the <a href="configobject.html">object configuration file</a>.
<li>The distributed server has its <a href="configmain.html#enable_notifications">enable_notifications</a> directive set to 0. This will prevent any notifications from being sent out by the server.
<li>The distributed server is configured to <a href="configmain.html#obsess_over_services">obsess over services</a>.
<li>The distributed server has an <a href="configmain.html#ocsp_command">ocsp command</a> defined (as described below).
</ul>
</p>
<p>
In order to make everything come together and work properly, we want the distributed server to report the results of <i>all</i> service checks to Nagios. We could use <a href="eventhandlers.html">event handlers</a> to report <i>changes</i> in the state of a service, but that just doesn't cut it. In order to force the distributed server to report all service check results, you must enabled the <a href="configmain.html#obsess_over_services">obsess_over_services</a> option in the main configuration file and provide a <a href="configmain.html#ocsp_command">ocsp_command</a> to be run after every service check. We will use the ocsp command to send the results of all service checks to the central server, making use of the send_nsca client and nsca daemon (as described above) to handle the tranmission.
</p>
<p>
In order to accomplish this, you'll need to define an ocsp command like this:
</p>
<p>
<strong><font color="red">ocsp_command=submit_check_result</font></strong>
</p>
<p>
The command definition for the <i>submit_check_result</i> command looks something like this:
</p>
<p>
<strong>
<font color="red">
<pre>
define command{
command_name submit_check_result
command_line /usr/local/nagios/libexec/eventhandlers/submit_check_result $HOSTNAME$ '$SERVICEDESC$' $SERVICESTATEID$ '$SERVICEOUTPUT$'
}
</pre>
</font>
</strong>
</p>
<p>
The <i>submit_check_result</i> shell scripts looks something like this (replace <i>central_server</i> with the IP address of the central server):
</p>
<p>
<pre>
#!/bin/sh
# Arguments:
# $1 = host_name (Short name of host that the service is
# associated with)
# $2 = svc_description (Description of the service)
# $3 = state_string (A string representing the status of
# the given service - "OK", "WARNING", "CRITICAL"
# or "UNKNOWN")
# $4 = plugin_output (A text string that should be used
# as the plugin output for the service checks)
#
# Convert the state string to the corresponding return code
return_code=-1
case "$3" in
OK)
return_code=0
;;
WARNING)
return_code=1
;;
CRITICAL)
return_code=2
;;
UNKNOWN)
return_code=-1
;;
esac
# pipe the service check info into the send_nsca program, which
# in turn transmits the data to the nsca daemon on the central
# monitoring server
/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" | /usr/local/nagios/bin/send_nsca <i>central_server</i> -c /usr/local/nagios/etc/send_nsca.cfg
</pre>
</p>
<p>
The script above assumes that you have the send_nsca program and it configuration file (send_nsca.cfg) located in the <i>/usr/local/nagios/bin/</i> and <i>/usr/local/nagios/etc/</i> directories, respectively.
</p>
<p>
That's it! We've sucessfully configured a remote host running Nagios to act as a distributed monitoring server. Let's go over exactly what happens with the distributed server and how it sends service check results to Nagios (the steps outlined below correspond to the numbers in the reference diagram above):
</p>
<p>
<ol>
<li>After the distributed server finishes executing a service check, it executes the command you defined by the <a href="configmain.html#ocsp_command">ocsp_command</a> variable. In our example, this is the <i>/usr/local/nagios/libexec/eventhandlers/submit_check_result</i> script. Note that the definition for the <i>submit_check_result</i> command passed four pieces of information to the script: the name of the host the service is associated with, the service description, the return code from the service check, and the plugin output from the service check.
<li>The <i>submit_check_result</i> script pipes the service check information (host name, description, return code, and output) to the <i>send_nsca</i> client program.
<li>The <i>send_nsca</i> program transmits the service check information to the <i>nsca</i> daemon on the central monitoring server.
<li>The <i>nsca</i> daemon on the central server takes the service check information and writes it to the external command file for later pickup by Nagios.
<li>The Nagios process on the central server reads the external command file and processes the passive service check information that originated from the distributed monitoring server.
</ol>
</p>
<p>
<strong><u>Central Server Configuration</u></strong>
</p>
<p>
We've looked at hot distributed monitoring servers should be configured, so let's turn to the central server. For all intensive purposes, the central is configured as you would normally configure a standalone server. It is setup as follows:
</p>
<p>
<ul>
<li>The central server has the web interface installed (optional, but recommended)
<li>The central server has its <a href="configmain.html#enable_notifications">enable_notifications</a> directive set to 1. This will enable notifications. (optional, but recommended)
<li>The central server has <a href="configmain.html#execute_service_checks">active service checks</a> disabled (optional, but recommended - see notes below)
<li>The central server has <a href="configmain.html#check_external_commands">external command checks</a> enabled (required)
<li>The central server has <a href="configmain.html#accept_passive_service_checks">passive service checks</a> enabled (required)
</ul>
</p>
<p>
There are three other very important things that you need to keep in mind when configuring the central server:
</p>
<p>
<ul>
<li>The central server must have service definitions for <i>all services</i> that are being monitored by all the distributed servers. Nagios will ignore passive check results if they do not correspond to a service that has been defined.
<li>If you're only using the central server to process services whose results are going to be provided by distributed hosts, you can simply disable all active service checks on a program-wide basis by setting the <a href="configmain.html#execute_service_checks">execute_service_checks</a> directive to 0. If you're using the central server to actively monitor a few services on its own (without the aid of distributed servers), the <i>enable_active_checks</i> option of the defintions for service being monitored by distributed servers should be set to 0. This will prevent Nagios from actively checking those services.
</ul>
</p>
<p>
It is important that you either disable all service checks on a program-wide basis or disable the <i>enable_active_checks</i> option in the definitions for each service that is monitored by a distributed server. This will ensure that active service checks are never executed under normal circumstances. The services will keep getting rescheduled at their normal check intervals (3 minutes, 5 minutes, etc...), but the won't actually be executed. This rescheduling loop will just continue all the while Nagios is running. I'll explain why this is done in a bit...
</p>
<p>
That's it! Easy, huh?
</p>
<p>
<strong><u>Problems With Passive Checks</u></strong>
</p>
<p>
For all intensive purposes we can say that the central server is relying solely on passive checks for monitoring. The main problem with relying completely on passive checks for monitoring is the fact that Nagios must rely on something else to provide the monitoring data. What if the remote host that is sending in passive check results goes down or becomes unreachable? If Nagios isn't actively checking the services on the host, how will it know that there is a problem?
</p>
<p>
Fortunately, there is a way we can handle these types of problems...
</p>
<p>
<strong><u>Freshness Checking</u></strong>
</p>
<p>
Nagios supports a feature that does "freshness" checking on the results of service checks. More information freshness checking can be found <a href="freshness.html">here</a>. This features gives some protection against situations where remote hosts may stop sending passive service checks into the central monitoring server. The purpose of "freshness" checking is to ensure that service checks are either being provided passively by distributed servers on a regular basis or performed actively by the central server if the need arises. If the service check results provided by the distributed servers get "stale", Nagios can be configured to force active checks of the service from the central monitoring host.
</p>
<p>
So how do you do this? On the central monitoring server you need to configure services that are being monitoring by distributed servers as follows...
</p>
<p>
<ul>
<li>The <i>check_freshness</i> option in the service definitions should be set to 1. This enables "freshness" checking for the services.
<li>The <i>freshness_threshold</i> option in the service definitions should be set to a value (in seconds) which reflects how "fresh" the results for the services (provided by the distributed servers) should be.
<li>The <i>check_command</i> option in the service definitions should reflect valid commands that can be used to actively check the service from the central monitoring server.
</ul>
</p>
<p>
Nagios periodically checks the "freshness" of the results for all services that have freshness checking enabled. The <i>freshness_threshold</i> option in each service definition is used to determine how "fresh" the results for each service should be. For example, if you set this value to 300 for one of your services, Nagios will consider the service results to be "stale" if they're older than 5 minutes (300 seconds). If you do not specify a value for the <i>freshness_threshold</i> option, Nagios will automatically calculate a "freshness" threshold by looking at either the <i>normal_check_interval</i> or <i>retry_check_interval</i> options (depending on what <a href="statetypes.html">type of state</a> the service is in). If the service results are found to be "stale", Nagios will run the service check command specified by the <i>check_command</i> option in the service definition, thereby actively checking the service.
</p>
<p>
Remember that you have to specify a <i>check_command</i> option in the service definitions that can be used to actively check the status of the service from the central monitoring server. Under normal circumstances, this check command is never executed (because active checks were disabled on a program-wide basis or for the specific services). When freshness checking is enabled, Nagios will run this command to actively check the status of the service <i>even if active checks are disabled on a program-wide or service-specific basis</i>.
</p>
<p>
If you are unable to define commands to actively check a service from the central monitoring host (or if turns out to be a major pain), you could simply define all your services with the <i>check_command</i> option set to run a dummy script that returns a critical status. Here's an example... Let's assume you define a command called 'service-is-stale' and use that command name in the <i>check_command</i> option of your services. Here's what the definition would look like...
</p>
<p>
<strong>
<font color="red">
<pre>
define command{
command_name service-is-stale
command_line /usr/local/nagios/libexec/staleservice.sh
}
</pre>
</font>
</strong>
</p>
<p>
The <b>staleservice.sh</b> script in your <i>/usr/local/nagios/libexec</i> directory might look something like this:
</p>
<p>
<dir>
<table border=1>
<tr>
<td>
<pre>
#!/bin/sh
/bin/echo "CRITICAL: Service results are stale!"
exit 2
</pre>
</td>
</tr>
</table>
</dir>
</p>
<p>
When Nagios detects that the service results are stale and runs the <b>service-is-stale</b> command, the <i>/usr/local/nagios/libexec/staleservice.sh</i> script is executed and the service will go into a critical state. This would likely cause notifications to be sent out, so you'll know that there's a problem.
</p>
<p>
<strong><u>Performing Host Checks</u></strong>
</p>
<p>
At this point you know how to obtain service check results passivly from distributed servers. This means that the central server is not actively checking services on its own. But what about host checks? You still need to do them, so how?
</p>
<p>
Since host checks usually compromise a small part of monitoring activity (they aren't done unless absolutely necessary), I'd recommend that you perform host checks actively from the central server. That means that you define host checks on the central server the same way that you do on the distributed servers (and the same way you would in a normal, non-distributed setup).
</p>
<p>
Passive host checks are available (read <a href="passivechecks.html">here</a>), so you could use them in your distributed monitoring setup, but they suffer from a few problems. The biggest problem is that Nagios does not translate passive host check problem states (DOWN and UNREACHABLE) when they are processed. This means that if your monitoring servers have a different parent/child host structure (and they will, if you monitoring servers are in different locations), the central monitoring server will have an inaccurate view of host states.
</p>
<p>
If you do want to send passive host checks to a central server in your distributed monitoring setup, make sure:
</p>
<p>
<ul>
<li>The central server has <a href="configmain.html#accept_passive_host_checks">passive host checks</a> enabled (required)
<li>The distributed server is configured to <a href="configmain.html#obsess_over_hosts">obsess over hosts</a>.
<li>The distributed server has an <a href="configmain.html#ochp_command">ochp command</a> defined.
</ul>
</p>
<p>
The ochp command, which is used for processing host check results, works in a similiar manner to the ocsp command, which is used for processing service check results (see documentation above). In order to make sure passive host check results are up to date, you'll want to enable <a href="freshness.html">freshness checking</a> for hosts (similiar to what is described above for services).
</p>
<hr>
</body>
</html>
|