-implement trap delivery for "redistribute" in the mon server itself as an
option. retain the "call script" behavior, but maybe specify internal
trap delivery via "redistribute -h hostname [hostname...]". also allow
multiple redistribute lines to build a list of scripts to call
-deliver traps with acknowledgement via tcp
-add protocol commands to dump entire status + configuration in one operation
to reduce latency (not so many serialized get/response operations just to
get status)
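 a sketch of what the client side could look like (the "dump" command and the
 END marker are only a proposal, nothing mon speaks today; 2583 is mon's usual port):
    use IO::Socket::INET;

    my $s = IO::Socket::INET->new(PeerAddr => 'monhost', PeerPort => 2583,
                                  Proto => 'tcp') or die "connect: $!";
    print $s "dump status config\n";   # proposed command, not implemented today
    {
        local $/ = "END\n";            # proposed end-of-dump marker
        my $everything = <$s>;
        # one read, then parse opstatus + config locally
    }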
-no alerts for n mins
-better cookbook of examples, including some pre-fab m4 defines for templates
with focus on the ability to quickly configure mon out-of-the-box for
the most common setups
-period "templates"
> like I have to repeat my period definitions all 260 times, one for
> each watch. we should have templates in the Mon config file for any
> kind of object so it can be reused.
so do you mean a way to define a "template" for a period so that
you don't need to keep rewriting "wd {Sun-Sat}", or so that it'll use
some default period if you don't specify one, or what? i can see this
working a bunch of different ways.
like this?
define period-template xyz
period wd {Sun-Sat}
alert mail.alert mis@domain.com
alert page.alert mis-pagers@domain.com
alertevery 1h
watch something
service something
period template(xyz)
watch somethingelse
service something
period template(xyz)
# override the 1h
alertevery 2h
-my recent thoughts on config management are that the parsing should be
all modularized (keeping the config parsing code in a separate
perl module to be reused by other apps),
and there should be a way to turn the resulting data
structure into xml and import the same back, not so you can write
your config by hand in xml, but so you can use some generic xml editing
tool to mess around with the config, to get one type of gui.
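 something like this, maybe (a rough sketch using XML::Simple as one possible
 serializer; the shape of the config hash is made up):
    use XML::Simple qw(XMLout XMLin);

    # whatever the modularized parser hands back, as a nested perl structure
    my $cfg = {
        watch => { routers => { service => { ping => { interval => '5m' } } } },
    };

    my $xml  = XMLout($cfg, RootName => 'moncfg', NoAttr => 1);  # edit with any xml tool
    my $back = XMLin($xml, NoAttr => 1);                         # and import it again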
-the most common things should be the easiest to do, regardless of whether
the config is a gui or a text file. that is what makes stuff "easy". however,
i don't think more complicated setups lend themselves to guis as much,
and in complicated setups you have to invest a lot of time to learn how
the tool works, and a fancy gui in that case is less of a payoff.
this is for configuration, i mean. fancy guis for reporting and stuff
are good, no doubt.
-global alert definitions with their own squelches (alertevery, etc.)
> also, alarms need to be collated so pagers and cell phones don't get
> buried with large numbers of alerts. I have a custom solution that I
> wrote for this, but it's a lousy solution since it essentially implements
> its own paging system.
i could see how it would be good to be able to define some alert
destinations *outside* of the period definitions, then refer to them
in the period definitions, then you can do "collation" that way. like
this:
define global-alert xyz mail.alert xyz@lmnop.com
alertevery 1h
watch
service
period
globalalert xyz <---collated globally
watch
service
period
globalalert xyz <---collated globally
alert mail.alert pdq@lmnop.com <---not collated
that would be quite easy to do and i think very useful. you could
apply all the same squelch knobs (alertevery, etc.) to the global ones.
-----
(from mon-1.2.0)
$Id: TODO,v 1.2.2.1 2007/06/27 11:51:17 trockij Exp $
-add a short "radius howto" to the doc/ directory.
-make traps authenticate via the same scheme used to obscure
the password in RADIUS packets
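 for reference, a rough sketch of that obscuring scheme (rfc 2865 style) in perl;
 it assumes a shared secret and a random 16-byte authenticator carried with the trap:
    use Digest::MD5 qw(md5);

    # obscure($secret, $authenticator, $cleartext) -- rfc 2865 style hiding.
    # $authenticator is a random 16-byte string sent along with the trap.
    sub obscure {
        my ($secret, $auth, $clear) = @_;
        $clear .= "\0" x (16 - length($clear) % 16) if length($clear) % 16;
        my ($out, $prev) = ('', $auth);
        for (my $i = 0; $i < length($clear); $i += 16) {
            my $block = substr($clear, $i, 16) ^ md5($secret . $prev);
            $out  .= $block;
            $prev  = $block;
        }
        return $out;
    }
    # the receiver runs the same walk against the blocks it received,
    # xoring each one with md5($secret . previous-block) to recover the text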
-descriptions defined in mon.cf should be 'quoted'
-document command section and trap section in authfile
-finish support for receiving snmp traps
-output to client should be buffered and incorporated into the I/O loop.
There is the danger that a sock_write to a client will block the server.
-finish muxpect
-make "chainable" alerts
?? i don't recall who asked for this or how it would work
-make alerts nonblocking, and handle them in a similar fashion to
monitors. i.e., serialize per-service (or per-period) alerts.
-document "clear" client command
-Document trap authentication.
-Document traps.
-Make monitors parallelize their tasks, similar to fping.monitor. This
is an important scalability problem.
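 Something along these lines inside a monitor would do it (a sketch only; the
 tcp connect check is a stand-in for the real per-host test):
    use IO::Socket::INET;

    # one child per host, the way fping.monitor parallelizes internally.
    # the tcp connect on port 25 is just a stand-in for the real test.
    sub check_host {
        my ($host) = @_;
        return defined IO::Socket::INET->new(PeerAddr => $host, PeerPort => 25,
                                             Proto => 'tcp', Timeout => 10);
    }

    my %kid;
    for my $host (@ARGV) {
        my $pid = fork;
        die "fork: $!" unless defined $pid;
        if ($pid == 0) { exit(check_host($host) ? 0 : 1) }
        $kid{$pid} = $host;
    }

    my @failed;
    while ((my $pid = wait) != -1) {
        my $host = delete $kid{$pid} or next;
        push @failed, $host if $? != 0;
    }
    print "@failed\n" if @failed;   # failed hosts on the summary line
    exit(@failed ? 1 : 0);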
-re-vamp the host disabling. 1) store them in a table with a timeout
on each so that they can automatically re-enable themselves so
people don't forget to re-enable them manually. 2) don't do
the disabling by "commenting" them out of the host groups.
We still want them to be tested for failure, but just disable
alerts that have to do with the disabled hosts.
When a host is disabled, accept a "reason" field that
is later accessible so that you can tell why someone disabled
the host.
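 Roughly, the table could look like this (the names are illustrative, not
 existing mon internals):
    # host => { until => epoch, who => ..., reason => ... }
    my %disabled;

    sub disable_host {
        my ($host, $secs, $who, $reason) = @_;
        $disabled{$host} = { until => time + $secs, who => $who, reason => $reason };
    }

    sub host_alerts_enabled {
        my ($host) = @_;
        my $d = $disabled{$host} or return 1;
        if (time > $d->{until}) { delete $disabled{$host}; return 1 }  # auto re-enable
        return 0;   # still being tested, but alerts for it are squelched
    }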
-allow checking a service at a particular time of day, maybe using
inPeriod.
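 Time::Period, which mon already uses for alert periods, would cover this.
 A sketch, with the period spec made up:
    use Time::Period;

    # only schedule this service's monitor between 2am and 4am
    if (inPeriod(time, "hr {2am-4am}") == 1) {
        # queue the monitor run as usual
    }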
-maybe make a command that will disable an alert for a certain amount
of time
-make it possible to disable just one of multiple alarms in a service
-make a logging facility which forks and execs external logging
daemons and writes to them via some ipc such as a unix domain socket.
mon should be sure that one of each type of these loggers is running
at all times. configure the logging either globally or for each
service. write both the success and failure status to the log in
some "list opstatus" type format. each logger can do as it wishes
with the data (e.g. stuff it into rrdtool, mysql, cat it to a file, etc.)
# global setting
logger = file
watch stuff
service http
logger file -p _LOGDIR_
...
service fping
# this will use the global logger setting
...
service
# this will override the global logger setting
logger none
...
common options to loggers:
-d dir       path to the logging dir
-f file      name of the log file
-g group     group name
-s service   service name
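a rough sketch of the fork/exec plus socketpair plumbing (the logger path and
the line format are only illustrative):
    use Socket;
    use IO::Handle;

    socketpair(my $to_logger, my $from_mon, AF_UNIX, SOCK_STREAM, PF_UNSPEC)
        or die "socketpair: $!";
    $to_logger->autoflush(1);

    my $pid = fork;
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {
        close $to_logger;
        open STDIN, '<&', $from_mon or die "dup: $!";
        # path and arguments are illustrative
        exec '/usr/lib/mon/loggers/file.logger', '-d', '/var/log/mon'
            or die "exec: $!";
    }
    close $from_mon;

    # one "list opstatus"-style line per check result (format illustrative)
    print $to_logger "group=stuff service=http opstatus=1 exitval=0\n";
    # mon would watch $pid and re-fork the logger if it ever exits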
-----------
notes on a v2 protocol redesign from trockij
- Configuring on a hostgroup scheme works very well. In the beginning, mon was
never intended to get this complex(tm); it was intended to be a tool
where it was easy to whip up custom monitoring scripts and alert scripts
and plug them into a framework which allowed them all to connect to each
other, and to have a way to easily build custom clients and report
generators as well.
- However, per host status is needed now.
- This requires changes to both mon itself and also the monitors / alerts.
Backward compatibility is important, and KISS is very important to
retain the ease with which one can whip up a new monitor or alert or reporting
client.
- There will be a new protocol for communicating with the monitors / alerts,
which will be masked by a Mon::Monitor / Mon::Alert module in Perl.
Appropriate shell functions will be provided by the first one who asks.
See below for the protocol.
- We still want to retain the benefits of the old behaviour, but extend
some alert management features, such as the ability to liberate
alert definitions from the service periods so they can be used globally.
- The server code might be broken up into multiple files (I/O routines, config
parser, related parts, etc)
- monitors can communicate better with the alerts (see below). For example,
the monitor might hint to mail.alert (using "a_mail_list") where else
to send a warning when a user's home dir goes over quota.
(Attention should be paid to privacy so that we don't accidentally inform
all users that /home/foo/how-i-will-destroy-western-civilization/
is consuming 1GB too much space ;)
- Associations: these allow monitors to communicate details
about failures back to the server which can be used to specify who
to alert.
The associations are based on key/value pairs specified in the
association config file, and are expanded on the alert command line
(or possibly within the alert protocol) if "@assoc-*" is in the
configuration. If a host assoc. is needed, an alert spec will look like:
alert mail.alert admin@xyz.com @assoc-host
There are two association types (possibly more in the future): host
associations, and user-defined associations. Host associations use the
"assoc-host" specifier, and map one or more username to an individual
host. User-defined associations are just that, and begin with the
"assoc-u-" specifier.
Monitors return associations via the "assoc-*" key in the monitor
protocol.
Alerts obtain association information either via command-line arguments
which were expanded by the server from "@assoc-*" in the config file,
or via the "assoc-*" key in the alert protocol.
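The expansion itself is simple. A sketch (the association contents here
are invented):
    # as returned by the monitor for the failing host
    my %assoc = ( 'assoc-host' => 'joe,fred' );

    my @alert_cmd = ('mail.alert', 'admin@xyz.com', '@assoc-host');
    @alert_cmd = map { /^\@(assoc-.*)$/ ? split(/,/, $assoc{$1} || '') : $_ } @alert_cmd;
    # -> ('mail.alert', 'admin@xyz.com', 'joe', 'fred')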
- Metrics are only passed to the mon server for "monitoring" purposes, but can
be marked up in such a way that they could be easily piped to a logging
utility, one which is not part of the mon process itself.
monitors are _encouraged_ to collect and report performance data.
"Failures" are basically just a conclusion based upon performance data and
it makes no sense to collect the data twice, e.g. if you have mon polling
ifInOctets.0 on a system, why should mrtg have to poll on its own?
It may be desirable to propose a "unified logging system" which all
monitors can easily use, something which is pluggable and extensible.
- The hostgroup syntax is going to be extended to add per host options. (which
will be passed to the monitors / alerts using the new protocol)
ns1.teuto.net( fs="/(80%,90%)",mail_list="lmb@teuto.net" )
would be passed as "h_fs=/(80%,90%)" and "h_mail_list=lmb@teuto.net".
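Parsing that into "h_" keys is straightforward; a sketch:
    my $entry = 'ns1.teuto.net( fs="/(80%,90%)",mail_list="lmb@teuto.net" )';

    my ($host, $opts) = $entry =~ /^(\S+?)\(\s*(.*?)\s*\)$/;
    my %h;
    while ($opts =~ /(\w+)="([^"]*)"/g) {
        $h{"h_$1"} = $2;
    }
    # %h is now ( h_fs => '/(80%,90%)', h_mail_list => 'lmb@teuto.net' )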
FLOATING MONITORS
A floating monitor is started by mon and remains running for the entire time.
If it dies, it is automatically restarted.
The server forks off a separate process for fping and communicates with
it via some IPC, like a named pipe or a socket or something. The floating
monitor sits there waiting for a message from the server that says "start
checking now". The server then adds this descriptor to %fhandles and %running
and treats it similarly to other forked monitors. When the floating monitor is
done, it spits its output back to the server and then goes dormant again,
awaiting another message from the server. Floating monitors are started
when mon starts, and are restarted if mon notices that they go away. This
is a way to save on fork() overhead, but also to avoid paying the monitor's
startup and initialization cost on every polling interval.
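The monitor side of that conversation could be as simple as this (a sketch;
the "start"/"done" framing is only a placeholder, and run_checks() stands in
for the real work):
    $| = 1;    # unbuffered replies back to the server

    # stand-in for the monitor's real work
    sub run_checks { return ("all hosts ok", 0) }

    while (defined(my $line = <STDIN>)) {
        chomp $line;
        next unless $line eq 'start';        # server says "start checking now"
        my ($summary, $exitval) = run_checks();
        print "$summary\n";
        print "done $exitval\n";             # placeholder end-of-result marker
    }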
PROTOCOL
The protocol will be simple and ASCII based, in the form of "key=value". Line
continuation will be provided by prefixing following lines with a ">". A "\n"
on a line by itself indicates the start of a new block.
The order of the keys should not be important.
The first block will always contain metadata further defining the following
blocks. The "version" key is always present.
The current protocol version is "1".
(In the examples, everything after a "#" is a comment and should be cut out)
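A sketch of a reader for that framing, to make the rules concrete:
    sub read_blocks {
        my ($fh) = @_;
        my (@blocks, %cur, $lastkey);
        while (defined(my $line = <$fh>)) {
            chomp $line;
            if ($line eq '') {                      # blank line: block boundary
                push @blocks, {%cur} if %cur;
                %cur = ();
                undef $lastkey;
            } elsif ($line =~ /^>(.*)/) {           # ">" continues the last value
                $cur{$lastkey} .= "\n$1" if defined $lastkey;
            } elsif ($line =~ /^([^=]+)=(.*)/) {    # plain key=value
                $cur{$1} = $2;
                $lastkey = $1;
            }
        }
        push @blocks, {%cur} if %cur;
        return @blocks;    # the first block is the metadata (version=...)
    }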
KEY CONVENTIONS
Keys private to monitors will be prefixed with an "m_". In the same
vein, keys private to alerts will be prefixed with an "a_", and additional
host option keys specified in the mon.cf file will be prefixed with an "h_"
before being passed to monitors/alerts.
By convention, flags pertaining only to a specific alert will embed that name
in the key name too - i.e. keys only pertaining to "mail.alert" will start with
"a_mail_".
The key/value pairs will be passed to all processes for a specific service.
"h_" keys are static between invocations, as they come from the mon.cf file. "m_"
keys will be preserved between multiple monitor executions. "a_" keys will be
passed from the monitor to the alert script.
MONITOR PROTOCOL (monitor -> mon)
The metadata block is followed by a block describing the overall hostgroup
status, followed by a detailed status for each host.
The following keys are defined for the blocks:
"summary" = contains a one line short summary of the status.
"status" = up, fail, ignore
"metric_1" = an opaque floating point number which can be referenced for
triggering alerts. May try to give an "operational percentage".
More than one metric may be returned.
(Ping rtt, packet loss, disk space etc)
"description" = longer elaborate description of the current status.
"host" = hostgroup member to which this status applies. The overall
hostgroup status does not include this field.
"assoc-host" = host association
"assoc-u-*" = user-defined association
Here is an example for a hypothetical hostgroup with 2 hosts and the ping
service.
###
version=1
summary=Still alive.
metric_1=50 # Packetloss
metric_2=20.23 # rtt times
description=1 out of 2 hosts still responding.
> Whatever else one might want to say about the status. It is difficult to
> come up with a good text here so I will just babble.
status=up
host=foo.bar.com
metric_1=100
metric_2=0 # 100% packet loss makes rtt measurements difficult ;)
summary=ICMP unreachable from 2.2.2.2
status=fail
description=PING 2.2.2.2 (2.2.2.2): 56 data bytes
>
>--- 2.2.2.2 ping statistics ---
>23 packets transmitted, 0 packets received, 100% packet loss
metric_1=0
metric_2=52.1
summary=ICMP echo reply received ok
status=up
description=64 bytes from 212.8.197.2: icmp_seq=0 ttl=60 time=110.0 ms
>64 bytes from 212.8.197.2: icmp_seq=1 ttl=60 time=32.3 ms
>64 bytes from 212.8.197.2: icmp_seq=2 ttl=60 time=32.8 ms
>64 bytes from 212.8.197.2: icmp_seq=3 ttl=60 time=33.4 ms
>
>--- ns1.teuto.net ping statistics ---
>4 packets transmitted, 4 packets received, 0% packet loss
>round-trip min/avg/max = 32.3/52.1/110.0 ms
host=baz.bar.com
######
Points still open:
- mon -> monitor communication
- mon <-> alert communication
- the new trap protocol
- muxpect
- a unified logging proposal