1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227
|
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>Critical Systems</title>
</head>
<body>
<h1>Critical Systems</h1>
<p>If you are monitoring lots of hosts, getting an overview of which
hosts need attention can be difficult. Most likely you've split the
hosts among several pages, and the "All non-green" view is
just cramped full with systems where a logfile is showing some errors,
a filesystem needs cleaning up etc.</p>
<p>The "Critical Systems" view lets you define exactly which
tests on what hosts need attention. In other words, this is the view
your Operations Center will be using to decide whether to call out
people in the middle of the night. It might look like this:
<img src="critview-disk.jpg"></p>
<p>This document describes how you configure the Critical Systems
view, and how it works for your operators. By "operators"
I mean the people who are doing the 24x7 monitoring. Where I work,
these people normally do not resolve the issues - they just raise
the trouble-tickets and assign them to the "engineer" on
duty. It may be different in your organization.</p>
<h1>The Critical Systems editor</h1>
<p>To configure what goes on the
Critical Systems view, you use a dedicated editor.</p>
<p>The default Xymon setup has nothing on the critical systems
view. So to use it, you must configure some of your systems and
tests to be included on this view. From the <b>Administration</b>
menu, pick the <b>Edit Critical Systems</b> item. This is usually
in the password-protected area of Xymon, so you will need to
authenticate yourself before you are allowed access. If you haven't
set this up yet, look at the <a href="install.html#htpasswd">installation guide</a>
to see how you do that.</p>
<p>After authenticating, you are presented with the editor page.
<img src="editor-main.jpg">
<h3>The editor form</h3>
Let me explain what the various fields are for:
<ul>
<li>The <b>Host</b> and <b>Test</b> fields are text entry fields.
This is where you enter the name of the host and test you want to
configure. If you would rather not type too much, you can enter
just the beginning of the hostname and use the <b>Search</b> and
<b>Next</b> buttons to walk through the currently configured tests.</li>
<li>The <b>Priority</b> field defines how important this test is.
By default you have three priorities: 1, 2 and 3. Priority 3 is
the lowest - things you must fix, but it can wait until you've had
lunch or finished the department meeting. Priority 2 is for more
important things, like one of your RAID systems running in degraded
mode. Priority 1 is the highest priority - the kind of problem
where you want to get a phonecall at 3 AM in the morning.</li>
<li>Then there is a group of time-related settings. The <b>Monitoring time</b>
defines when this test should show up on the Critical Systems view.
By default, that will be 24 hours a day, 7 days a week. But you
probably have some systems that don't need attention during week-ends,
or perhaps you only want to support a server during normal work-hours.
Then you can use this setting to make sure it will only show up on the
Critical Systems view during those periods. If you are migrating from
the old "NK" settings in the hosts.cfg file, this is the
equivalent of the "NKTIME" setting.<br>
The <b>Start monitoring</b> and <b>Stop monitoring</b> settings are
used if you have systems that go into production at a certain date,
or which are de-commisioned at a certain date. Instead of having
to update your Critical Systems configuration exactly when that
happens, you can configure the dates when monitoring of the systems
should begin or end.</li>
<li>The <b>Resolver group</b> is a text field. You can use it for
your operations people to see which group of engineers the should
call about this problem. If you have multiple groups handling
different parts of your IT systems, use this to let the operations
staff know whether to call the Unix admins, the DBA's or one of
your Webmasters.</li>
<li>The <b>Instruction</b> is a text entry field, where you can place
a brief instruction to the operators handling the problem:
If there is a simple thing that the operations people can try to fix the
problem before calling the on-duty engineer, then you can place instructions
here - e.g. perhaps the issue is with an external partner, so they
just need to call them and let them know there is an issue. You can use
HTML tags in this field, so if it's a long story then just put in an
HTML link to another document.</li>
<li>The <b>Clone</b> fields at the bottom of the form (not visible in the
screenshot) are described later</li>
</ul>
<h3>Setting up a disk status</h3>
<p>Right now, there is a yellow disk status on my system.<br>
<img src="mainview.jpg"></p>
<p>But it is not on the Critical Systems view, and I want it to be.
It is a priority 3 event, and I only want it monitored between 7AM and
8PM on weekdays. Most likely it is just some logfiles that are filling up,
so the operators can try and clean out the /var/log/ directory - if
that doesn't solve the problem, then they must escalate it to the Unix admins.</p>
<p>So on the Critical Systems editor, I enter the hostname <b>localhost</b>
and the test <b>disk</b>, then hit the <b>Search</b> button. I get this
warning:<br><center><img src="editor-nohost.jpg"></center><br>
telling me that there is nothing
configured yet for this host+test combination. If there had been any
previous configuration, it would have shown up on the form.</p>
<p>So I fill out the fields of the form and hit the <b>Update</b> button.
The form changes to look like this: <img src="editor-diskchanged.jpg">
As you can see, there is now a <b>Last update</b> text showing who has
changed this configuration, and when it was last done.</p>
<p>If I now go back to the Critical Systems view - from the menu, pick
<b>Views</b> and <b>Critical Systems view</b> - you will see that the
status is now showing up:<br><img src="critview-disk.jpg"></p>
<h3>Template definitions - cloning records</h3>
<p>If you have many hosts that share a common setup on the Critical
Systems view, then editing all of them can be tiresome. Instead,
you should define a template and then <b>clone</b> it to all of
the hosts.</p>
<p>NOTE: A cloned definition is not a copy of the original definition.
It is in fact a pointer back to the original definition, so if you
change the original definition <i>after</i> you performed the cloning,
then the clone definition will <i>also</i> change.</p>
<p>Defining a template is just like defining the Critical Systems
view for a host. Just call the host something that looks like a
template - "<b>Standard Unix</b>", for instance. So here
is a definition for a <b>Unix cpu</b> template.<br><img src="editor-clonemaster.jpg"></p>
<p>Now we have created the template (if you haven't pushed the <b>Update</b>
button to save the template, do it now). To apply this template to a host,
scroll down to the bottom of the editor form, and enter the hostname that
you want to apply the template to, then hit the <b>Add/remove clones</b>
button:<br><img src="editor-makeclone.jpg"></p>
<p>After it has updated, you can see that "localhost" is now listed
in the scrollbox showing the clones.<br><img src="editor-showclone.jpg"></p>
<p>NOTE: Cloning happens at the <b>host</b> level, so even though we
did the cloning from a <b>cpu</b> test definition, it will also affect
all the other definitions we have for the <b>Standard Unix</b> host.</p>
<h1>The Critical Systems view</h1>
<p>The critical systems view lets the operators filter active
alerts in several ways. It might look like this:<br>
<img src="critview-disk.jpg"></p>
<h3>Filtering the Critical Systems view</h3>
<p>The drop-down boxes lets the operators filter the alerts that
show up on the page.
<ul>
<li>The <b>Priority</b> limits alerts so that only those
with a matching priority get displayed.</li>
<li>The <b>Color</b> removes those alerts that have an
unwanted color</li>
<li>The <b>Age</b> limit can be used to only see the most
recent events.</li>
<li>The <b>Acked</b> selection can be used to toggle the view
of events that have been acknowledged by the operators.
</ul>
<p><b>Tip:</b> If you have a preferred default setting for these,
then you can bookmark it in your browser - the settings are part
of the URL, so your bookmark will include the current settings.</p>
<h3>The detailed status view</h3>
<p>When looking at the status of one of the items shown on the
Critical Systems view, a number of additional items show up. On
the example Critical Systems view above, you will notice that the
instructions we entered about what to do with the disk status is shown
here, so they are available to the operators. There are links to the host
documentation and host information. There is also an acknowledge
function, so that the operators can acknowledge an alert right away.</p>
<h3>Critical Systems acknowledgment</h3>
<p>From the detailed status view, the operator can <b>acknowledge</b>
an alert, after he has assigned the problem to an engineer or has
handled it in some other way. This serves two purposes: First, it
removes the status from the Critical Systems view, so the operator
can concentrate on the new problems that appear. And second, it
lets everyone else see that the problem has been noticed and is
being handled by someone.</p>
<p>When acknowledging an alert, the operator can add information
about what the problem is, or who is handling it, and when it
is expected to be resolved. E.g. like this:<br>
<img src="critview-detail-ackform.jpg"></p>
<p>The <b>Host-ack</b> checkbox lets the operator acknowledge all
current alerts for a given host, e.g. a full disk could easily
trigger alerts for both the disk-, msgs- and procs-statuses -
a Host ack lets him handle all of those.</p>
<p>After the operator has acknowledged the status, the acknowledgment
will be visible on the Critical systems status view:<br>
<img src="critview-detail-acked.jpg"></p>
<p>(If you are wondering why this image says it is a "Level 1"
acknowledgement, then the answer is that a <u>future</u> release of Xymon
will allow multiple acknowledgments by different groups. Level 1
is the operator who sees the alert on the Critical Systems view.
Level 2 could be the engineer who gets paged by a Xymon alert
going out).</p>
<h3>How acknowledgements are visible to everyone</h3>
<p>The acknowledgments that the operator enters from the status
page will show up on the status visible to everyone. E.g. here is
how the overview page will appear to a normal user: Note that the
"disk" status has a yellow checkmark, indicating that it
has been acknowledged:<br> <img src="mainview-acked.jpg"></p>
<p>And the detailed status page also includes the acknowledgment
information:<br><img src="stdview-detail-acked.jpg"></p>
</body>
</html>
|