<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
	<head>
		<title>Critical Systems</title>
	</head>

	<body>
		<h1>Critical Systems</h1>
		<p>If you are monitoring lots of hosts, getting an overview of which 
		hosts need attention can be difficult. Most likely you've split the
		hosts among several pages, and the &quot;All non-green&quot; view is
		crammed full of systems where a logfile is showing some errors,
		a filesystem needs cleaning up, and so on.</p>

		<p>The &quot;Critical Systems&quot; view lets you define exactly which
		tests on what hosts need attention. In other words, this is the view
		your Operations Center will be using to decide whether to call out
		people in the middle of the night. It might look like this:
		<img src="critview-disk.jpg"></p>

		<p>This document describes how you configure the Critical Systems
		view, and how it works for your operators. By &quot;operators&quot;
		I mean the people who are doing the 24x7 monitoring. Where I work,
		these people normally do not resolve the issues - they just raise
		the trouble tickets and assign them to the &quot;engineer&quot; on 
		duty. It may be different in your organization.</p>

		<h1>The Critical Systems editor</h1>

		<p>To configure what goes on the
		Critical Systems view, you use a dedicated editor.</p>

		<p>The default Xymon setup has nothing on the Critical Systems
		view. So to use it, you must configure some of your systems and
		tests to be included on this view. From the <b>Administration</b>
		menu, pick the <b>Edit Critical Systems</b> item. This is usually
		in the password-protected area of Xymon, so you will need to
		authenticate yourself before you are allowed access. If you haven't
		set this up yet, look at the <a href="install.html#htpasswd">installation guide</a>
		to see how you do that.</p>

		<p>After authenticating, you are presented with the editor page.
		<img src="editor-main.jpg"></p>

		<h3>The editor form</h3>
		<p>Let me explain what the various fields are for:</p>
		<ul>
			<li>The <b>Host</b> and <b>Test</b> fields are text entry fields.
			This is where you enter the name of the host and test you want to
			configure. If you would rather not type too much, you can enter
			just the beginning of the hostname and use the <b>Search</b> and
			<b>Next</b> buttons to walk through the currently configured tests.</li>

			<li>The <b>Priority</b> field defines how important this test is.
			By default you have three priorities: 1, 2 and 3. Priority 3 is
			the lowest - things you must fix, but it can wait until you've had
			lunch or finished the department meeting. Priority 2 is for more
			important things, like one of your RAID systems running in degraded
			mode. Priority 1 is the highest priority - the kind of problem
			where you want to get a phone call at 3 AM.</li>

			<li>Then there is a group of time-related settings. The <b>Monitoring time</b>
			defines when this test should show up on the Critical Systems view.
			By default, that will be 24 hours a day, 7 days a week. But you
			probably have some systems that don't need attention during weekends,
			or perhaps you only want to support a server during normal working hours.
			Then you can use this setting to make sure it will only show up on the 
			Critical Systems view during those periods. If you are migrating from
			the old &quot;NK&quot; settings in the hosts.cfg file, this is the
			equivalent of the &quot;NKTIME&quot; setting.<br>
			The <b>Start monitoring</b> and <b>Stop monitoring</b> settings are
			used if you have systems that go into production at a certain date,
			or which are decommissioned at a certain date. Instead of having
			to update your Critical Systems configuration exactly when that
			happens, you can configure the dates when monitoring of the systems
			should begin or end.</li>

			<li>The <b>Resolver group</b> is a text field. It tells your
			operations people which group of engineers they should
			call about this problem. If you have multiple groups handling 
			different parts of your IT systems, use this to let the operations 
			staff know whether to call the Unix admins, the DBAs or one of 
			your Webmasters.</li>

			<li>The <b>Instruction</b> is a text entry field, where you can place
			a brief instruction to the operators handling the problem.
			If there is a simple thing that the operations people can try to fix the
			problem before calling the on-duty engineer, then you can place instructions
			here. For example, if the issue is with an external partner, the operators may
			just need to call them and let them know there is an issue. You can use 
			HTML tags in this field, so if it's a long story then just put in an 
			HTML link to another document.</li>

			<li>The <b>Clone</b> fields at the bottom of the form (not visible in the
			screenshot) are described later.</li>
		</ul>
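
		<p>As a rough illustration of the <b>Instruction</b> field, the text below
		is the kind of thing you might enter. The wording and the link target
		(<tt>wiki.example.com</tt>) are made up for this example; point the link
		at whatever document your organization actually keeps.</p>
		<pre>
Try clearing old files from /var/log/ first.
&lt;a href="https://wiki.example.com/ops/disk-full.html"&gt;Full procedure&lt;/a&gt;
		</pre>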

		<h3>Setting up a disk status</h3>
		<p>Right now, there is a yellow disk status on my system.<br>
		<img src="mainview.jpg"></p>
		<p>But it is not on the Critical Systems view, and I want it to be. 
		It is a priority 3 event, and I only want it monitored between 7 AM and 
		8 PM on weekdays. Most likely it is just some logfiles that are filling up, 
		so the operators can try to clean out the /var/log/ directory. If 
		that doesn't solve the problem, then they must escalate it to the Unix admins.</p>
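
		<p>For readers coming from the old &quot;NK&quot; settings: assuming the
		<tt>day:starttime:endtime</tt> form that hosts.cfg uses for NKTIME, with
		<tt>W</tt> standing for Monday through Friday, the monitoring period in this
		example would correspond to something like the line below. In the editor, the
		same period is set with the <b>Monitoring time</b> setting.</p>
		<pre>
NKTIME=W:0700:2000
		</pre>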

		<p>So on the Critical Systems editor, I enter the hostname <b>localhost</b>
		and the test <b>disk</b>, then hit the <b>Search</b> button. I get this
		warning:<br><center><img src="editor-nohost.jpg"></center><br>
		telling me that there is nothing
		configured yet for this host+test combination. If there had been any
		previous configuration, it would have shown up on the form.</p>

		<p>So I fill out the fields of the form and hit the <b>Update</b> button.
		The form changes to look like this: <img src="editor-diskchanged.jpg">
		As you can see, there is now a <b>Last update</b> text showing who has 
		changed this configuration, and when it was last done.</p>

		<p>If I now go back to the Critical Systems view - from the menu, pick
		<b>Views</b> and <b>Critical Systems view</b> - I can see that the
		status is now showing up:<br><img src="critview-disk.jpg"></p>

		<h3>Template definitions - cloning records</h3>
		<p>If you have many hosts that share a common setup on the Critical
		Systems view, then editing all of them can be tiresome. Instead,
		you should define a template and then <b>clone</b> it to all of 
		the hosts.</p>

		<p>NOTE: A cloned definition is not a copy of the original definition.
		It is in fact a pointer back to the original definition, so if you
		change the original definition <i>after</i> you performed the cloning,
		then the clone definition will <i>also</i> change.</p>

		<p>Defining a template is just like defining the Critical Systems
		view for a host. Just call the host something that looks like a
		template - &quot;<b>Standard Unix</b>&quot;, for instance. So here
		is a definition for the <b>cpu</b> test on the <b>Standard Unix</b> template.<br><img src="editor-clonemaster.jpg"></p>

		<p>Now we have created the template (if you haven't pushed the <b>Update</b>
		button to save the template, do it now). To apply this template to a host, 
		scroll down to the bottom of the editor form, and enter the hostname that
		you want to apply the template to, then hit the <b>Add/remove clones</b>
		button:<br><img src="editor-makeclone.jpg"></p>

		<p>After it has updated, you can see that &quot;localhost&quot; is now listed
		in the scrollbox showing the clones.<br><img src="editor-showclone.jpg"></p>

		<p>NOTE: Cloning happens at the <b>host</b> level, so even though we
		did the cloning from a <b>cpu</b> test definition, it will also affect
		all the other definitions we have for the <b>Standard Unix</b> host.</p>


		<h1>The Critical Systems view</h1>
		<p>The Critical Systems view lets the operators filter active
		alerts in several ways. It might look like this:<br>
		<img src="critview-disk.jpg"></p>

		<h3>Filtering the Critical Systems view</h3>
		<p>The drop-down boxes let the operators filter the alerts that
		show up on the page.</p>
		
		<ul>
			<li>The <b>Priority</b> filter limits alerts so that only those 
			with a matching priority get displayed.</li>
			<li>The <b>Color</b> filter removes those alerts that have an
			unwanted color.</li>
			<li>The <b>Age</b> limit can be used to only see the most
			recent events.</li>
			<li>The <b>Acked</b> selection can be used to toggle the view
			of events that have been acknowledged by the operators.</li>
		</ul>

		<p><b>Tip:</b> If you have a preferred default setting for these,
		then you can bookmark it in your browser - the settings are part
		of the URL, so your bookmark will include the current settings.</p>

		<h3>The detailed status view</h3>
		<p>When looking at the status of one of the items shown on the
		Critical Systems view, a number of additional items show up. On
		the example Critical Systems view above, you will notice that the 
		instructions we entered about what to do with the disk status are shown 
		here, so they are available to the operators. There are links to the host 
		documentation and host information. There is also an acknowledge
		function, so that the operators can acknowledge an alert right away.</p>

		<h3>Critical Systems acknowledgment</h3>
		<p>From the detailed status view, the operator can <b>acknowledge</b>
		an alert after assigning the problem to an engineer or
		handling it in some other way. This serves two purposes: First, it
		removes the status from the Critical Systems view, so the operator 
		can concentrate on the new problems that appear. And second, it
		lets everyone else see that the problem has been noticed and is
		being handled by someone.</p>

		<p>When acknowledging an alert, the operator can add information
		about what the problem is, or who is handling it, and when it
		is expected to be resolved. For example:<br>
		<img src="critview-detail-ackform.jpg"></p>

		<p>The <b>Host-ack</b> checkbox lets the operator acknowledge all
		current alerts for a given host. For example, a full disk could easily
		trigger alerts for the disk, msgs and procs statuses; 
		a host ack lets the operator handle all of those at once.</p>

		<p>After the operator has acknowledged the status, the acknowledgment
		will be visible on the Critical Systems status view:<br>
		<img src="critview-detail-acked.jpg"></p>

		<p>(If you are wondering why this image says it is a &quot;Level 1&quot;
		acknowledgment, the answer is that a <u>future</u> release of Xymon
		will allow multiple acknowledgments by different groups. Level 1
		is the operator who sees the alert on the Critical Systems view. 
		Level 2 could be the engineer who gets paged by a Xymon alert 
		going out.)</p>

		<h3>How acknowledgments are visible to everyone</h3>
		<p>The acknowledgments that the operator enters from the status 
		page will show up on the status visible to everyone. For example, here is
		how the overview page will appear to a normal user. Note that the
		&quot;disk&quot; status has a yellow checkmark, indicating that it
		has been acknowledged:<br> <img src="mainview-acked.jpg"></p>

		<p>And the detailed status page also includes the acknowledgment
		information:<br><img src="stdview-detail-acked.jpg"></p>
	</body>
</html>