I'm sure the Gold App Dev Request form for all of this is just delayed
in Campus Mail...
This email note is an epic. Ed in the role of the protagonist, Doug in
the role of the antagonist (of course). Save this note, it's the only
documentation you will get for a while...
Here is a summary of the changes that I made to spong. The new version
of Spong was put into production this morning. I came down and walked
the day shift through the new features. I'll leave a note for the 2nd
and 3rd shift. The changes that I made should not significantly affect
how they use it.
But anyway - first the single, lone, isolated, only, request that I
didn't do.
> --------------------------------------------------
> (*) Change "spong" name to "pong" or something doesn't get confused with
> "sponge",
> --------------------------------------------------
Changing the name would just cause additional confusion, and requires
more effort than it is worth IMNSHO. Feel free to make fun of me for
being attached to the name if it makes you feel better...
> --------------------------------------------------
> (*) Change "procs" name to "jobs" to lessen confusion about what is down,
> --------------------------------------------------
Done. The real fix, however, will require the following:
1. You will need to push out a new copy of spong-client to all of the
machines and restart spong-client.
2. Once you have that done everywhere, you will need to clean out the
spong database on dim, so all records of the old procs service are
removed.
But since I know how much you will whine if I try to make you do that,
I also went ahead and added a few little hacks in the spong-scripts that
check for "procs" in various places and replace the string with "jobs"
when they find it. So you can do steps 1 and 2 above at your leisure...
> --------------------------------------------------
> (*) Improve flexibility of setting down times for various services,
> --------------------------------------------------
Basically, I just fixed the way it was supposed to work before. In the
spong.hosts file you can configure a host like the following:
    'strobe.weeg.uiowa.edu' => { services => 'ftp pop3 http',
                                 contact  => 'unix-staff',
                                 group    => 'unix-all',
                                 down     => [ '*:12:00-14:00',
                                               '1:4:00-6:00' ] };
This definition says that strobe is down every day from noon until 2pm,
and is also down on Monday morning from 4-6am. You can include as many
different downtimes as you want. The first part of the field is the day
of the week (0=Sunday), or * indicating that it happens every day. The
second part is the time range in 24-hour format.
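To make the format concrete, here is a rough Python sketch of how a downtime entry could be matched against the current time. This is not the code in Spong (which is Perl); the function names are made up for illustration, and treating the end of the range as exclusive is an assumption.

```python
from datetime import datetime, time


def parse_down(spec):
    """Split a 'DAY:HH:MM-HH:MM' entry into (day, start, end).

    day is None when the entry uses '*' (every day).
    """
    day_part, span = spec.split(":", 1)
    start_s, end_s = span.split("-")

    def to_time(text):
        hours, minutes = text.split(":")
        return time(int(hours), int(minutes))

    day = None if day_part == "*" else int(day_part)
    return day, to_time(start_s), to_time(end_s)


def is_down(specs, now):
    """Return True if 'now' falls inside any of the downtime entries."""
    # The config counts Sunday as 0; Python's weekday() counts Monday as 0.
    config_day = (now.weekday() + 1) % 7
    for spec in specs:
        day, start, end = parse_down(spec)
        if day is not None and day != config_day:
            continue
        # Assumption: the end time is exclusive, and ranges do not cross midnight.
        if start <= now.time() < end:
            return True
    return False


strobe_down = ["*:12:00-14:00", "1:4:00-6:00"]
print(is_down(strobe_down, datetime(2024, 1, 1, 5, 0)))  # a Monday at 5am
```

A range that runs past midnight would need two entries under this sketch, one for each day.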
This wasn't working in the version of spong that you are using, so I
have fixed it. Basically during the times that you define, all of the
services on the machine are "acknowledged" - well not actually, what
really happens is that their services just have a "blue" state, so that
they are not reported as problems to the operators. I could make them
have a unique color if you want, but I think 5 colors is plenty...
Now, this isn't a full-blown cron-type thing, but now that it works I
think it will handle most of the cases. It doesn't handle the silicon
case where just ftp is down at certain times, but that is addressed later...
> --------------------------------------------------
> (*) Include option (default?) that just shows "down" services,
> --------------------------------------------------
I believe that the main reason that you wanted to do this is just
because it was so damn slow. Hopefully I've fixed that problem so this
is less of an issue.
If you still want to see just the hosts with problems, basically you
just need to jump to the URL that loads the left frame. That URL is:
http://dim.weeg.uiowa.edu/cgi-bin/www-spong/problems/all
> --------------------------------------------------
> (*) Provide clarity in messages so operators page with exact phrasing,
> --------------------------------------------------
I've done the following. There are contact links in both the Problem
List frame (the name of the group/person to contact) and on the Host and
Service page (the link that says "Contact Staff"). Those links are
now "smart", meaning that when you click on them, they take you to the
Operator Paging Page, with the name of the person who is on call for
that machine already selected and a message indicating the problem
already filled out in the message box. So in most cases the operator
just has to press the send button.
This means that there is an additional file you have to maintain with
the paging cgi script. There is a file called "hosts" in the directory
with the rest of the paging stuff. That file contains a list of hosts,
and the group or individual that is responsible for that machine.
The default message contains the host name, and if there is a single
problem it lists the service name and summary information (which is
sometimes redundant). If there are multiple problems then it will just
list the services that are red.
Again, the operators will still have the ability to change the message
if they want.
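As a rough illustration of the rule described above (one problem gets the service name plus summary, several problems just get the list of red services), here is a small Python sketch. The function name and the exact wording and punctuation of the message are assumptions, not what the paging script actually emits.

```python
def default_page_message(host, problems):
    """Build a default paging message.

    problems is a list of (service, summary) pairs for the services that
    are currently red on the host.
    """
    if len(problems) == 1:
        service, summary = problems[0]
        # Single problem: include the summary, even if it is a bit redundant.
        return f"{host}: {service} - {summary}"
    # Several problems: just name the red services.
    services = " ".join(service for service, _ in problems)
    return f"{host}: {services}"


print(default_page_message("strobe.weeg.uiowa.edu", [("ftp", "ftp is down")]))
```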
> --------------------------------------------------
> Improve loading performance.
> --------------------------------------------------
Well... Here is what I've done. I believe it's significantly faster
from what some of you have said (but then again I thought the old one
was just fine).
Got rid of the little dot images, and replaced them with little squares
that are just tables with their background set to different colors.
This seems to get around the problem that Netscape has displaying all
the images. So once the page gets to your browser, it should display
much faster.
If some day, Netscape gets fixed and you are feeling nostalgic for the
dots, then you can turn them back on in the spong.conf file.
$WWW_USE_IMAGES = 1;
Also, the exact color of the various boxes can be adjusted as well. I
have played with a number of variations, and I do think what I have is
pretty good, but if you want to change it, then change the following in
the spong.conf file
$WWW_COLOR{"red"} = "#cc0000";
$WWW_COLOR{"yellow"} = "#ffff00";
$WWW_COLOR{"green"} = "#339900";
$WWW_COLOR{"purple"} = "#990099";
$WWW_COLOR{"blue"} = "#0000ff";
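For anyone curious what "a table with its background set to a color" amounts to, here is a minimal Python sketch that emits such a square. The color values are the ones listed above; the function name, the 12-pixel size, and the exact markup are assumptions and will not match what www-spong really generates.

```python
WWW_COLOR = {
    "red": "#cc0000",
    "yellow": "#ffff00",
    "green": "#339900",
    "purple": "#990099",
    "blue": "#0000ff",
}


def status_square(color, size=12):
    """Return a tiny one-cell table whose background shows the status color."""
    return (
        '<table border="0" cellspacing="0" cellpadding="0"><tr>'
        f'<td bgcolor="{WWW_COLOR[color]}" width="{size}" height="{size}">'
        "&nbsp;</td></tr></table>"
    )


print(status_square("red"))
```

The point is only that the browser has nothing to fetch: the color travels in the page itself.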
> --------------------------------------------------
> Allow others (e.g., Help Desk) to access spong. This will require
> access control (e.g., passwords) as well as configurable "refresh rate"
> option so that only operators and us are allowed auto refresh.
> --------------------------------------------------
Ok, I'll meet you half way on this one. The access control should come
through the standard web server mechanism. There are 3 parts to spong:
    www-spong     - which just allows you to view the spong database
    www-spong-ack - which allows you to update the database
    page.cgi      - which allows the operators to page us
I would suggest that you basically open up the www-spong program to the
people that need it, and tighten down the ack and page programs so that
you don't have the help desk acking problems or paging you.
The way that you want to set up this access is up to you (host-based vs.
user names and passwords, etc...). You are quite familiar with the issues
of maintaining the password files for this type of thing.
Ok, now that I've passed the buck on that, here is what I'll do. I've
included two new variables in the spong.conf file. They are:
@WWW_REFRESH_ALLOW = ( '.*' );
@WWW_REFRESH_DENY = ( 'edhill', '128\.255\.51\.\d+', 'traitor.*' );
These lists contain regular expressions that are checked. ALLOW
is checked first, followed by DENY. The following pieces of information
are checked against these regexps - REMOTE_USER (the username if you
protect www-spong using user authentication), REMOTE_HOST (the hostname
of the person connecting), and REMOTE_ADDR (the IP address of the person
connecting).
If no regular expression is matched from either list, then the
auto-refresh is not included in the output.
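Here is one way to read that rule, sketched in Python rather than the Perl that Spong uses. The patterns are modeled on the example above (with the dots escaped). It is an assumption that a match on the DENY list overrides a match on the ALLOW list, and that the patterns are unanchored; the function name is made up.

```python
import re

WWW_REFRESH_ALLOW = [r".*"]
WWW_REFRESH_DENY = [r"edhill", r"128\.255\.51\.\d+", r"traitor.*"]


def refresh_allowed(remote_user, remote_host, remote_addr,
                    allow=WWW_REFRESH_ALLOW, deny=WWW_REFRESH_DENY):
    """Decide whether the auto-refresh should be included in the page."""
    # Only check the values the web server actually supplied.
    values = [v for v in (remote_user, remote_host, remote_addr) if v]

    def matches(patterns):
        return any(re.search(p, v) for p in patterns for v in values)

    # Assumption: ALLOW is consulted first, then DENY can take it away.
    # If nothing matches either list, the refresh is left out.
    return matches(allow) and not matches(deny)


print(refresh_allowed(None, "dim.weeg.uiowa.edu", "128.255.60.1"))
```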
----------------------------------------------------------------------------
And over the course of trying to do these things, instead of hearing the
kind words of encouragement that I'm used to over in this building, I
instead heard the following additional whining and tried to address it.
These comments have not been through Rex's tone modifier like the
requests above.
> --------------------------------------------------
> I don't know why we use this program anyway, the spong-server seems to
> crash for no good reason at random times. We try to hack in things
> like getting spong to restart spong child when they die, but if think
> was just written well to begin with, we wouldn't have to do that crap.
> --------------------------------------------------
I have found a repeatable case where I could get spong to crash. If a
client (either updating or querying spong) timed out (because spong
was too slow), that client would just shut down the connection.
Well, the spong-server was still trying to write to a pipe that was now
closed. spong-server was told this with a SIGPIPE, but I wasn't
handling that signal, and if you don't handle that signal and you
get it, the program exits (that is the default action for SIGPIPE).
So I'm now handling it, and spong-server no longer seems to go away. I've
left your child restarting code just in case (but as you probably know -
that won't help the case where the parent dies - either could happen
with the case that I fixed.)
There could of course be other problems than this, but there is now one
less thing that will make spong die.
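The failure mode is easy to reproduce outside of Spong. The Python sketch below is only an illustration of the idea (ignore SIGPIPE, then treat the failed write as "the client went away"); it is not the actual fix in spong-server, and the function name is invented.

```python
import os
import signal

# With SIGPIPE ignored, a write to a closed pipe fails with an error
# instead of killing the process. Python already does this at startup;
# it is spelled out here because it is the whole point of the example.
if hasattr(signal, "SIGPIPE"):
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)


def write_reply(fd, data):
    """Write to a client; return False if the client has hung up."""
    try:
        os.write(fd, data)
        return True
    except BrokenPipeError:
        return False


# Simulate a client that timed out and closed its end of the pipe.
read_end, write_end = os.pipe()
os.close(read_end)
print(write_reply(write_end, b"status"))
os.close(write_end)
```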
> --------------------------------------------------
> The history feature is useless, I can't believe it's actually slower
> then the spong summary. It's so slow, it never even comes back before
> netscape times it out. It's so slow, I went off and wrote my own
> little spong-history command because your a bad programmer and blah
> blah blah...
> --------------------------------------------------
Ok, I wrote a script called spong-cleanup. It should be run every
night. It does the following:
* Cleans out any history older than 7 days. It moves the old history
for each host into the /local/www/docs/spong/archive directory.
If you don't think you would ever want to get at that history, then
you can just change the script so that it is deleted.
* Removes any acknowledgments that are no longer valid.
* Removes any services that don't seem to be reported any more (if
you stop monitoring something on a machine - the old entry will
still hang around and show up as purple).
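The history step boils down to splitting entries at a 7-day cutoff. Here is a minimal Python sketch of that split. The in-memory (timestamp, text) representation is an assumption made for illustration and is not the on-disk history format; the function name is made up, and spong-cleanup itself is not written this way.

```python
import time

SEVEN_DAYS = 7 * 24 * 60 * 60


def split_history(entries, now=None, max_age=SEVEN_DAYS):
    """Split history entries into (keep, archive) lists.

    entries is a list of (epoch_seconds, text) pairs.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age
    keep = [entry for entry in entries if entry[0] >= cutoff]
    archive = [entry for entry in entries if entry[0] < cutoff]
    return keep, archive


now = 898185233
history = [(now - 8 * 86400, "ftp red"), (now - 3600, "ftp green")]
keep, archive = split_history(history, now=now)
print(len(keep), len(archive))
```

Whatever lands in the archive list is what would be moved to the archive directory (or deleted, if you change the script).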
Cleaning out the old history brought the load time down from 90 seconds
to about 4 seconds. That combined with the colored squares instead of
gif images should make retrieving the history via the web usable again.
On another note, I actually have a command line program called "spong"
which I have now included in the spong depot package. This is a
client/server program that basically reports all the same things that
the www-spong program does (including history). So you could install
"spong" on your desktop and run:
spong --history
or if you just want to see the Unix machines:
spong --history unix-all
It also allows you to view the summary table, problem hosts, individual
hosts, etc... Type:
spong --help
So if you wanted to do some type of automatic monitoring of the
automatic monitoring system from your desktop, you can use the command
line program to do it.
> --------------------------------------------------
> The acknowledgment mechanism sucks, we want web-ARS. Hey, tie spong
> into the directory server. Hey, why you're at it, build us a directory
> server.
> --------------------------------------------------
Ok, you now have the ability to delete existing acknowledgments via the
web interface. When you click on a host, in the Acknowledgment section
next to the descriptions there is a new link which allows you to delete
an Ack.
If you click on just the generic "Ack" menu item, it will take you to
the same screen as before (allowing you to add a new acknowledgment),
but at the top of the screen is a list of all of the pending Acks, so
you can click on them and delete them.
Updating an acknowledgment can be done through this interface now: 1)
first you delete the old one, and 2) then you add a new one that is
similar to but different from the old one. Voila, updates 8-) (I ran out
of time...)
I'll assume that the interface to all of this is self-explanatory.
Ok, now to the lame solution to the "ftp is down on silicon" problem.
As with the spong command line program, there is also a spong-ack
client/server command line program. In your script on silicon that
disables ftp, you could add the following command
spong-ack silicon.weeg.uiowa.edu ftp '+18h' 'its all Taos baby'
You can of course do this in any place that you have a script which is
going to down a service for a period of time.
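If the script that downs the service happens to be something other than a shell script, the same call can be made from it. The Python sketch below just wraps the command shown above; the argument order is copied from that example, the function names are made up, and it assumes spong-ack is on the PATH.

```python
import subprocess


def spong_ack_command(host, service, duration, message):
    """Build the spong-ack command line in the order used in the example."""
    return ["spong-ack", host, service, duration, message]


def acknowledge(host, service, duration, message):
    """Run spong-ack and raise if it exits with an error."""
    subprocess.run(spong_ack_command(host, service, duration, message),
                   check=True)


# Printing instead of running, since spong-ack may not be installed here.
print(spong_ack_command("silicon.weeg.uiowa.edu", "ftp", "+18h",
                        "ftp disabled by nightly job"))
```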
You can also delete an acknowledgment through this command line
interface, but it is a little more convoluted. You would run the
following command
spong-ack --delete silicon.weeg.uiowa.edu-ftp-898185233
The little funky looking thing after the delete is the ack id. No, you
probably don't normally know what the ID of your acknowledgment is, but
you can find it with the following
spong --brief --acks
Yeah, deleting is not the cleanest via the command line, but you get
what you pay for (and no, there is no "updating" via the command line,
because in reality there is no updating, period - it's all an illusion).
> --------------------------------------------------
> Well what else is broken with this piece of crap package...
> --------------------------------------------------
Here is one I had not heard of, but noticed this week. If you had a
host which had a problem with, say, its disk, and you acknowledged that
service, but then the host had another problem with, say, jobs, the host
would not show up in the "Problems list" on the left side of the frame,
so I'm sure the operators would probably not call you about the second
problem.
I fixed that bug...
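The corrected behavior can be stated as a one-line rule: a host belongs in the problem list if any of its red services is not acknowledged. A small Python sketch of that rule follows; the data structures and the function name are assumptions for illustration, not how Spong stores its state.

```python
def has_unacked_problem(services, acked):
    """Return True if any red service on the host lacks an acknowledgment.

    services maps service name -> color; acked is the set of service
    names that currently have an acknowledgment.
    """
    return any(color == "red" and name not in acked
               for name, color in services.items())


host_services = {"disk": "red", "jobs": "red", "ftp": "green"}
# disk is acknowledged, jobs is not, so the host should still be listed.
print(has_unacked_problem(host_services, acked={"disk"}))
```

The old behavior was closer to "hide the host if it has any acknowledgment at all", which is what hid the second problem.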
> --------------------------------------------------
> Well, I bet you screwed up all the patches that we have had to make to
> duct tape this software together since you last worked on it a long
> long time ago.
> --------------------------------------------------
Well, I incorporated all the changes that I could find, including the
following:
* The paging space addition to spong-client
* Dan's connection to the LDAP server for machine info
* Dave's various fixes for things like PID file removal, etc...
Basically I incorporated everything that I could find from the History
file or by diffing the various programs.
> --------------------------------------------------
> Why would you get rid of frames you dufus, it was the only thing you
> did right.
> --------------------------------------------------
Jeeze, I was just trying to make things faster. Relax, it's back to the
frames version...
> --------------------------------------------------
> Why do you need root access on dim? How about we just give you access
> to the cp and ls commands. All other commands that you want to execute
> will need to be placed in a file called "duh" and submitted for our
> approval.
> --------------------------------------------------
My initial response: Grrr...
My response after accidentally chowning /tmp: Yes sir, whatever is best
sir...
If there are any problems with any of this, I'm sure you will let me
know.