<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Detection and Handling of State Flapping</title>
<meta name="generator" content="DocBook XSL Stylesheets V1.75.1">
<meta name="keywords" content="Supervision, Icinga, Nagios, Linux">
<link rel="home" href="index.html" title="Icinga Version 1.0.2 Documentation">
<link rel="up" href="ch06.html" title="Chapter 6. Advanced Topics">
<link rel="prev" href="redundancy.html" title="Redundant and Failover Network Monitoring">
<link rel="next" href="escalations.html" title="Notification Escalations">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<CENTER><IMG src="../images/logofullsize.png" border="0" alt="Icinga" title="Icinga"></CENTER>
<div class="navheader">
<table width="100%" summary="Navigation header">
<tr><th colspan="3" align="center">Detection and Handling of State Flapping</th></tr>
<tr>
<td width="20%" align="left">
<a accesskey="p" href="redundancy.html">Prev</a> </td>
<th width="60%" align="center">Chapter 6. Advanced Topics</th>
<td width="20%" align="right"> <a accesskey="n" href="escalations.html">Next</a>
</td>
</tr>
</table>
<hr>
</div>
<div class="section" title="Detection and Handling of State Flapping">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="flapping"></a><a name="state_flapping"></a>Detection and Handling of State Flapping</h2></div></div></div>
<p><span class="bold"><strong>Introduction</strong></span></p>
<p>Icinga supports optional detection of hosts and services that are "flapping". Flapping occurs when a service or
host changes state too frequently, resulting in a storm of problem and recovery notifications. Flapping can be indicative of
configuration problems (e.g. thresholds set too low), troublesome services, or real network problems.</p>
<p><span class="bold"><strong>How Flap Detection Works</strong></span></p>
<p>Before we get into this, it is worth saying that flap detection was somewhat difficult to implement. How exactly
does one determine what "too frequently" means with regard to state changes for a particular host or service? When Ethan Galstad
first started thinking about implementing flap detection, he tried to find some information on how flapping could or should be
detected. He couldn't find any information about what others were using (were they using any?), so he decided to settle on
what seemed to him to be a reasonable solution...</p>
<p>Whenever Icinga checks the status of a host or service, it will check to see if it has started or stopped flapping.
It does this by:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
<p>Storing the results of the last 21 checks of the host or service</p>
</li>
<li class="listitem">
<p>Analyzing the historical check results and determining where state changes/transitions occur</p>
</li>
<li class="listitem">
<p>Using the state transitions to determine a percent state change value (a measure of change) for the host or
service</p>
</li>
<li class="listitem">
<p>Comparing the percent state change value against low and high flapping thresholds</p>
</li>
</ul></div>
<p>A host or service is determined to have <span class="emphasis"><em>started</em></span> flapping when its percent state change first exceeds
a <span class="emphasis"><em>high</em></span> flapping threshold.</p>
<p>A host or service is determined to have <span class="emphasis"><em>stopped</em></span> flapping when its percent state change goes below a
<span class="emphasis"><em>low</em></span> flapping threshold (assuming that it was previously flapping).</p>
<p><span class="bold"><strong>Example</strong></span></p>
<p>Let's describe in more detail how flap detection works with services...</p>
<p>The image below shows a chronological history of service states from the most recent 21 service checks. OK states are
shown in green, WARNING states in yellow, CRITICAL states in red, and UNKNOWN states in orange.</p>
<p><span class="inlinemediaobject"><img src="../images/statetransitions.png"></span></p>
<p>The historical service check results are examined to determine where state changes/transitions occur. State changes occur
when an archived state is different from the archived state that immediately precedes it chronologically. Since we keep the
results of the last 21 service checks in the array, there is a possibility of having at most 20 state changes. In this example
there are 7 state changes, indicated by blue arrows in the image above.</p>
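<p>The counting step can be sketched in a few lines of Python. This is only an illustration: the 21-entry history below is
hypothetical, invented so that the changes fall at the same positions as in the example, and the function name is not part
of Icinga.</p>

```python
# Hypothetical history of 21 check results, oldest first. The concrete states
# are invented; only the positions of the changes follow the example.
history = [
    "OK", "OK", "OK",                     # entries 0-2
    "WARNING",                            # change at t3
    "CRITICAL",                           # change at t4
    "OK", "OK", "OK", "OK",               # change at t5
    "CRITICAL", "CRITICAL", "CRITICAL",   # change at t9
    "OK", "OK", "OK", "OK",               # change at t12
    "UNKNOWN", "UNKNOWN", "UNKNOWN",      # change at t16
    "OK", "OK",                           # change at t19
]


def state_change_positions(states):
    """Return each position i where states[i] differs from states[i - 1]."""
    return [i for i in range(1, len(states)) if states[i] != states[i - 1]]


changes = state_change_positions(history)
possible_changes = len(history) - 1       # 21 results allow at most 20 changes
unweighted_percent = 100 * len(changes) / possible_changes

print(changes)             # [3, 4, 5, 9, 12, 16, 19]
print(unweighted_percent)  # 35.0
```

<p>Each entry is compared only with the entry immediately before it, so a history of 21 results yields at most 20
comparisons.</p>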
<p>The flap detection logic uses the state changes to determine an overall percent state change for the service. This is a
measure of volatility/change for the service. Services that never change state will have a 0% state change value, while services
that change state each time they're checked will have 100% state change. Most services will have a percent state change
somewhere in between.</p>
<p>When calculating the percent state change for the service, the flap detection algorithm will give more weight to new state
changes compared to older ones. Specifically, the flap detection routines are currently designed to make the newest possible state
change carry 50% more weight than the oldest possible state change. The image below shows how recent state changes are given
more weight than older state changes when calculating the overall or total percent state change for a particular service.</p>
<p><span class="inlinemediaobject"><img src="../images/statetransitions2.png"></span></p>
<p>Using the images above, let's do a calculation of percent state change for the service. You will notice that there are a
total of 7 state changes (at t<sub>3</sub>, t<sub>4</sub>, t<sub>5</sub>,
t<sub>9</sub>, t<sub>12</sub>, t<sub>16</sub>, and t<sub>19</sub>). Without any
weighting of the state changes over time, this would give us a total state change of 35%:</p>
<p>(7 observed state changes / possible 20 state changes) * 100 = 35 %</p>
<p>Since the flap detection logic will give newer state changes more weight than older state changes, the actual calculated
percent state change will be slightly less than 35% in this example. Let's say that the weighted percent of state change turned
out to be 31%...</p>
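<p>A weighted calculation might look like the following sketch. The weights are an assumption: they rise linearly from 0.8
for the oldest possible state change to 1.2 for the newest, which gives the newest change 50% more weight while keeping the
average weight at 1. The constants Icinga actually uses may differ, so the result is not expected to match the illustrative
31% used in this example.</p>

```python
def weighted_percent_state_change(change_positions, possible_changes=20,
                                  oldest_weight=0.8, newest_weight=1.2):
    """Weighted percent state change for changes at positions 1..possible_changes.

    Position 1 is the oldest possible change, position `possible_changes` the
    newest. The linear 0.8-to-1.2 weighting is an assumption for illustration.
    """
    step = (newest_weight - oldest_weight) / (possible_changes - 1)
    total = sum(oldest_weight + (pos - 1) * step for pos in change_positions)
    return 100 * total / possible_changes


# The seven state changes from the example above.
example = [3, 4, 5, 9, 12, 16, 19]
print(round(weighted_percent_state_change(example), 1))  # 34.4
```

<p>Because most of the changes in this example sit in the older half of the history, the weighted value comes out slightly
below the unweighted 35%.</p>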
<p>The calculated percent state change for the service (31%) will then be compared against flapping thresholds to see what
should happen:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
<p>If the service was <span class="emphasis"><em>not</em></span> previously flapping and 31% is <span class="emphasis"><em>equal to or greater
than</em></span> the high flap threshold, Icinga considers the service to have just started flapping.</p>
</li>
<li class="listitem">
<p>If the service <span class="emphasis"><em>was</em></span> previously flapping and 31% is <span class="emphasis"><em>less than</em></span> the low flap
threshold, Icinga considers the service to have just stopped flapping.</p>
</li>
</ul></div>
<p>If neither of those two conditions is met, the flap detection logic won't do anything else with the service, since it is
either not currently flapping or it is still flapping.</p>
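<p>The two rules above amount to a simple hysteresis check, sketched below. The function name and the threshold values (20
and 30) are hypothetical and chosen only for the demonstration.</p>

```python
def update_flapping(was_flapping, percent_change, low_threshold, high_threshold):
    """Return (is_flapping, event), where event is 'started', 'stopped' or None."""
    if not was_flapping and percent_change >= high_threshold:
        return True, "started"
    if was_flapping and percent_change < low_threshold:
        return False, "stopped"
    # Neither rule applies: the flapping state stays as it was.
    return was_flapping, None


# Hypothetical thresholds: low = 20%, high = 30%.
print(update_flapping(False, 31.0, 20.0, 30.0))  # (True, 'started')
print(update_flapping(True, 25.0, 20.0, 30.0))   # (True, None)
print(update_flapping(True, 12.0, 20.0, 30.0))   # (False, 'stopped')
```

<p>Using two thresholds rather than one keeps a service whose percent state change hovers around a single value from
repeatedly entering and leaving the flapping state.</p>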
<p><span class="bold"><strong>Flap Detection for Services</strong></span></p>
<p>Icinga checks to see if a service is flapping whenever the service is checked (either actively or
passively).</p>
<p>The flap detection logic for services works as described in the example above.</p>
<p><span class="bold"><strong>Flap Detection for Hosts</strong></span></p>
<p>Host flap detection works in a similar manner to service flap detection, with one important difference: Icinga
will attempt to check to see if a host is flapping whenever:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
<p>The host is checked (actively or passively)</p>
</li>
<li class="listitem">
<p>Sometimes when a service associated with that host is checked. More specifically, when at least <span class="emphasis"><em>x</em></span>
amount of time has passed since the flap detection was last performed, where <span class="emphasis"><em>x</em></span> is equal to the average
check interval of all services associated with the host.</p>
</li>
</ul></div>
<p>Why is this done? With services we know that the minimum amount of time between consecutive flap detection routines is
going to be equal to the service check interval. However, you might not be monitoring hosts on a regular basis, so there might
not be a host check interval that can be used in the flap detection logic. Also, it makes sense that checking a service should
count towards the detection of host flapping. Services are, after all, attributes of (or things associated with) hosts. At any
rate, that was the best method the original author could come up with for determining how often flap detection could be
performed on a host.</p>
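<p>The timing rule for service-triggered host flap checks can be sketched as follows. The function and parameter names are
hypothetical, all times are assumed to be in seconds, and the behaviour for a host without services is an assumption.</p>

```python
def host_flap_check_due(now, last_flap_check, service_check_intervals):
    """Decide whether a service check should also trigger host flap detection.

    The check is due once at least the average check interval of the host's
    services has passed since flap detection was last performed for the host.
    """
    if not service_check_intervals:
        return False  # assumption: nothing to trigger on without services
    average_interval = sum(service_check_intervals) / len(service_check_intervals)
    return now - last_flap_check >= average_interval


# Three services checked every 300, 300 and 600 seconds: average 400 seconds.
print(host_flap_check_due(1000, 700, [300, 300, 600]))  # False (300 s elapsed)
print(host_flap_check_due(1200, 700, [300, 300, 600]))  # True (500 s elapsed)
```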
<p><span class="bold"><strong>Flap Detection Thresholds</strong></span></p>
<p>Icinga uses several variables to determine the percent state change thresholds it uses for flap detection. For both
hosts and services, there are <span class="emphasis"><em>global</em></span> high and low thresholds and <span class="emphasis"><em>host-</em></span> or
<span class="emphasis"><em>service-specific</em></span> thresholds that you can configure. Icinga will use the global thresholds for flap
detection if you do not specify host- or service-specific thresholds.</p>
<p>The table below shows the global and host- or service-specific variables that control the various thresholds used in flap
detection.</p>
<div class="informaltable">
<table border="1">
<colgroup>
<col>
<col>
<col>
</colgroup>
<tbody>
<tr>
<td><p> <span class="bold"><strong>Object Type</strong></span> </p></td>
<td><p> <span class="bold"><strong>Global Variables</strong></span> </p></td>
<td><p> <span class="bold"><strong>Object-Specific Variables</strong></span> </p></td>
</tr>
<tr>
<td><p>Host</p></td>
<td>
<p>
<a class="link" href="configmain.html#configmain-low_host_flap_threshold">low_host_flap_threshold</a>
</p> <p>
<a class="link" href="configmain.html#configmain-high_host_flap_threshold">high_host_flap_threshold</a>
</p>
</td>
<td>
<p>
<a class="link" href="objectdefinitions.html#objectdefinitions-host">low_flap_threshold</a>
</p> <p>
<a class="link" href="objectdefinitions.html#objectdefinitions-host">high_flap_threshold</a>
</p>
</td>
</tr>
<tr>
<td><p>Service</p></td>
<td>
<p>
<a class="link" href="configmain.html#configmain-low_service_flap_threshold">low_service_flap_threshold</a>
</p> <p>
<a class="link" href="configmain.html#configmain-high_service_flap_threshold">high_service_flap_threshold</a>
</p>
</td>
<td>
<p>
<a class="link" href="objectdefinitions.html#objectdefinitions-service">low_flap_threshold</a>
</p> <p>
<a class="link" href="objectdefinitions.html#objectdefinitions-service">high_flap_threshold</a>
</p>
</td>
</tr>
</tbody>
</table>
</div>
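<p>The fallback from object-specific to global thresholds can be pictured with the sketch below. It is an illustration only:
<span class="emphasis"><em>None</em></span> stands in for "not specified in the object definition", the function name is hypothetical, and the
numeric values are made up.</p>

```python
def effective_thresholds(global_low, global_high, object_low=None, object_high=None):
    """Pick the thresholds to apply to one host or service.

    An object-specific value wins when it is given; otherwise the global
    value is used. None means 'not specified' in this sketch.
    """
    low = global_low if object_low is None else object_low
    high = global_high if object_high is None else object_high
    return low, high


# Hypothetical global service thresholds of 20% and 30%.
print(effective_thresholds(20.0, 30.0))              # (20.0, 30.0)
print(effective_thresholds(20.0, 30.0, 10.0, 25.0))  # (10.0, 25.0)
```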
<p><span class="bold"><strong>States Used For Flap Detection</strong></span></p>
<p>Normally Icinga will track the results of the last 21 checks of a host or service, regardless of the check result
(host/service state), for use in the flap detection logic.</p>
<div class="tip" title="Tip" style="margin-left: 0.5in; margin-right: 0.5in;"><table border="0" summary="Tip">
<tr>
<td rowspan="2" align="center" valign="top" width="25"><img alt="[Tip]" src="../images/tip.png"></td>
<th align="left">Tip</th>
</tr>
<tr><td align="left" valign="top">
<p>You can exclude certain host or service states from use in flap detection logic by using the
<span class="emphasis"><em>flap_detection_options</em></span> directive in your host or service definitions. This directive allows you to
specify what host or service states (e.g. "UP", "DOWN", "OK", "CRITICAL") you want to use for flap detection. If you don't use
this directive, all host or service states are used in flap detection.</p>
</td></tr>
</table></div>
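<p>One way to picture the effect of excluding states is the sketch below. It assumes that a check result whose state is not
selected is simply not recorded in the 21-entry history; the actual handling inside Icinga may differ, and the function name
and state selection are hypothetical.</p>

```python
from collections import deque

MAX_HISTORY = 21  # number of check results kept for flap detection


def record_check_result(history, state, flap_detection_states):
    """Record `state` only if it is selected for flap detection.

    Returns True when the state was appended to the history.
    """
    if state not in flap_detection_states:
        return False
    history.append(state)  # a full deque drops its oldest entry automatically
    return True


history = deque(maxlen=MAX_HISTORY)
selected = {"OK", "CRITICAL"}  # hypothetical selection for one service
for state in ["OK", "WARNING", "CRITICAL", "UNKNOWN", "OK"]:
    record_check_result(history, state, selected)

print(list(history))  # ['OK', 'CRITICAL', 'OK']
```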
<p><span class="bold"><strong>Flap Handling</strong></span></p>
<p>When a service or host is first detected as flapping, Icinga will:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">
<p>Log a message indicating that the service or host is flapping.</p>
</li>
<li class="listitem">
<p>Add a non-persistent comment to the host or service indicating that it is flapping.</p>
</li>
<li class="listitem">
<p>Send a "flapping start" notification for the host or service to appropriate contacts.</p>
</li>
<li class="listitem">
<p>Suppress other notifications for the service or host (this is one of the filters in the <a class="link" href="notifications.html" title="Notifications">notification logic</a>).</p>
</li>
</ol></div>
<p>When a service or host stops flapping, Icinga will:</p>
<div class="orderedlist"><ol class="orderedlist" type="1">
<li class="listitem">
<p>Log a message indicating that the service or host has stopped flapping.</p>
</li>
<li class="listitem">
<p>Delete the comment that was originally added to the service or host when it started flapping.</p>
</li>
<li class="listitem">
<p>Send a "flapping stop" notification for the host or service to appropriate contacts.</p>
</li>
<li class="listitem">
<p>Remove the block on notifications for the service or host (notifications will still be bound to the normal <a class="link" href="notifications.html" title="Notifications">notification logic</a>).</p>
</li>
</ol></div>
<p><span class="bold"><strong>Enabling Flap Detection</strong></span></p>
<p>In order to enable the flap detection features in Icinga, you'll need to:</p>
<div class="itemizedlist"><ul class="itemizedlist" type="disc">
<li class="listitem">
<p>Set the <a class="link" href="configmain.html#configmain-enable_flap_detection">enable_flap_detection</a> directive to 1.</p>
</li>
<li class="listitem">
<p>Set the <span class="emphasis"><em>flap_detection_enabled</em></span> directive in your host and service definitions to 1.</p>
</li>
</ul></div>
<p>If you want to disable flap detection on a global basis, set the <a class="link" href="configmain.html#configmain-enable_flap_detection">enable_flap_detection</a> directive to 0.</p>
<p>If you would like to disable flap detection for just a few hosts or services, use the
<span class="emphasis"><em>flap_detection_enabled</em></span> directive in the host and/or service definitions to do so.</p>
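<p>Taken together, flap detection applies to a given host or service only when both the global and the object-level
directive are set to 1. The following sketch expresses that rule; the function name is hypothetical and the parameters simply
mirror the two directives.</p>

```python
def flap_detection_in_effect(enable_flap_detection, flap_detection_enabled):
    """True only when the global and the object-level directive are both 1."""
    return enable_flap_detection == 1 and flap_detection_enabled == 1


print(flap_detection_in_effect(1, 1))  # True
print(flap_detection_in_effect(0, 1))  # False: disabled globally
print(flap_detection_in_effect(1, 0))  # False: disabled for this object
```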
<a class="indexterm" name="id1994783"></a>
</div>
<div class="navfooter">
<hr>
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left">
<a accesskey="p" href="redundancy.html">Prev</a> </td>
<td width="20%" align="center"><a accesskey="u" href="ch06.html">Up</a></td>
<td width="40%" align="right"> <a accesskey="n" href="escalations.html">Next</a>
</td>
</tr>
<tr>
<td width="40%" align="left" valign="top">Redundant and Failover Network Monitoring </td>
<td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td>
<td width="40%" align="right" valign="top"> Notification Escalations</td>
</tr>
</table>
</div>
<P class="copyright">© 2009-2010 Icinga Development Team, http://www.icinga.org</P>
</body>
</html>