File: changes-2.1


I'm sure the Gold App Dev Request form for all of this is just delayed
in Campus Mail...

This email note is an epic.  Ed in the role of the protagonist, Doug in
the role of the antagonist (of course).  Save this note, it's the only
documentation you will get for a while...


Here is a summary of the changes that I made to spong.  The new version
of Spong was put into production this morning.  I came down and walked
the day shift through the new features.  I'll leave a note for the 2nd
and 3rd shift.  The changes that I made should not significantly affect
how they use it.

But anyway - first the single, lone, isolated, only, request that I
didn't do.

> --------------------------------------------------
> (*) Change "spong" name to "pong" or something that doesn't get confused
> with "sponge",
> --------------------------------------------------

Changing the name would just cause additional confusion, and requires
more effort than it is worth IMNSHO.  Feel free to make fun of me for
being attached to the name if it makes you feel better...


> --------------------------------------------------
> (*) Change "procs" name to "jobs" to lessen confusion about what is down,
> --------------------------------------------------

Done.  The real fix, however, will require the following:

  1. You will need to push out a new copy of spong-client to all of the
     machines and restart spong-client.

  2. Once you have that done everywhere, you will need to clean out the
     spong database on dim, so all records of the old procs service are
     removed.

But since I know how much you will whine if I try to make you do that,
I also went ahead and added a few little hacks in the spong scripts that
check for "procs" in various places and replace the string with "jobs"
when it finds it.  So you can do steps 1 and 2 above at your leisure...
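
The rename hack can be pictured roughly like this.  This is only a
minimal sketch: the helper name normalize_service is made up for
illustration and is not the actual code in the spong scripts.

```perl
use strict;
use warnings;

# Hypothetical helper (not the real spong code): map the old service
# name to the new one, so status from clients that have not been
# upgraded yet still shows up under "jobs".
sub normalize_service {
    my ($service) = @_;
    return $service eq 'procs' ? 'jobs' : $service;
}

print normalize_service('procs'), "\n";
print normalize_service('ftp'), "\n";
```

Once every client reports "jobs" itself, a mapping like this becomes a
no-op and can be dropped.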


> --------------------------------------------------
> (*) Improve flexibility of setting down times for various services,
> --------------------------------------------------

Basically, I just fixed the way it was supposed to work before.  In the
spong.hosts file you can configure a host like the following:

 'strobe.weeg.uiowa.edu' =>  	{ services => 'ftp pop3 http',
				  contact  => 'unix-staff',
				  group    => 'unix-all',
				  down     => [ '*:12:00-14:00',
					        '1:4:00-6:00' ] };

This definition says that strobe is down every day from noon until 2pm,
and is also down on Monday morning from 4-6am.  You can include as many
different downtimes as you want.  The first part of the field is the day
of the week (0=sunday), or * indicating that it happens every day.  The
second part is the time range in 24-hour format.
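
To make the format concrete, here is a rough sketch of how a downtime
entry could be matched against the current day and time.  The function
name in_downtime is hypothetical, and treating the end time as exclusive
is an assumption; the real spong code may differ.  Ranges that cross
midnight are not handled here.

```perl
use strict;
use warnings;

# Hypothetical sketch: return 1 if the weekday (0=Sunday) and time fall
# inside any spec of the form "DAY:HH:MM-HH:MM", where DAY is 0-6 or "*".
sub in_downtime {
    my ($specs, $wday, $hour, $min) = @_;
    my $now = $hour * 60 + $min;            # minutes since midnight
    for my $spec (@$specs) {
        my ($day, $sh, $sm, $eh, $em) =
            $spec =~ /^(\*|[0-6]):(\d{1,2}):(\d{2})-(\d{1,2}):(\d{2})$/
            or next;                        # skip malformed entries
        next unless $day eq '*' || $day == $wday;
        my $start = $sh * 60 + $sm;
        my $end   = $eh * 60 + $em;
        # Assumption: the end of the range is exclusive.
        return 1 if $now >= $start && $now < $end;
    }
    return 0;
}

my @down = ('*:12:00-14:00', '1:4:00-6:00');
print in_downtime(\@down, 3, 12, 30) ? "down\n" : "up\n";
```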

This wasn't working in the version of spong that you are using, so I
have fixed it.  Basically during the times that you define, all of the
services on the machine are "acknowledged" - well not actually, what
really happens is that their services just have a "blue" state, so that
they are not reported as problems to the operators.  I could make them
have a unique color if you want, but I think 5 colors is plenty...

Now, this isn't a full-blown cron type thing, but when working I think it
will handle most of the cases.  It doesn't handle the silicon case where
just ftp is down at certain times, but that is addressed later...


> --------------------------------------------------
> (*) Include option (default?) that just shows "down" services,
> --------------------------------------------------

I believe that the main reason that you wanted to do this is just
because it was so damn slow.  Hopefully I've fixed that problem so this
is less of an issue.

If you still want to see just the hosts with problems, basically you
just need to jump to the URL that loads the left frame.  That URL is:

	http://dim.weeg.uiowa.edu/cgi-bin/www-spong/problems/all


> --------------------------------------------------
> (*) Provide clarity in messages so operators page with exact phrasing,
> --------------------------------------------------

I've done the following.  There is a link in the Problem List frame
(the name of the group/person to contact) and one on the Host and
Service page (it says "Contact Staff").  Those links are now "smart",
meaning that when you click on one, it takes you to the Operator Paging
Page with the name of the person who is on call for that machine
already selected and a message indicating the problem already filled
out in the message box.  So in most cases the operator just has to
press the send button.

This means that there is an additional file you have to maintain with
the paging cgi script.  There is a file called "hosts" in the directory
with the rest of the paging stuff.  That file contains a list of hosts,
and the group or individual that is responsible for that machine.

The default message contains the host name, and if there is a single
problem it lists the service name and summary information (which is
sometimes redundant).  If there are multiple problems then it will just
list the services that are red.
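
As a rough illustration of that rule, the default message could be built
along these lines.  The function name, the data layout, and the exact
wording of the message are all assumptions for this sketch, not the
actual paging script.

```perl
use strict;
use warnings;

# Hypothetical sketch: $problems is a list of { service, summary }
# hashes for the services that are currently red on $host.
sub default_page_message {
    my ($host, $problems) = @_;
    if (@$problems == 1) {
        # Single problem: service name plus its summary line.
        my $p = $problems->[0];
        return sprintf('%s: %s - %s', $host, $p->{service}, $p->{summary});
    }
    # Multiple problems: just list the services that are red.
    return sprintf('%s: %s', $host,
                   join(', ', map { $_->{service} } @$problems));
}

print default_page_message('strobe.weeg.uiowa.edu',
    [ { service => 'ftp', summary => 'ftp is down' } ]), "\n";
```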

Again, the operators will still have the ability to change the message
if they want.


> --------------------------------------------------
> Improve loading performance.
> --------------------------------------------------

Well...  Here is what I've done.  I believe it's significantly faster
from what some of you have said (but then again I thought the old one
was just fine).

Got rid of the little dot images, and replaced them with little squares
that are just tables with their background set to different colors.
This seems to get around the problem that Netscape has displaying all
the images.  So once the page gets to your browser, it should display
much faster.

If some day, Netscape gets fixed and you are feeling nostalgic for the
dots, then you can turn them back on in the spong.conf file.

    $WWW_USE_IMAGES = 1;

Also, the exact color of the various boxes can be adjusted as well.  I
have played with a number of variations, and I do think what I have is
pretty good, but if you want to change it, then change the following in
the spong.conf file

    $WWW_COLOR{"red"}    = "#cc0000";
    $WWW_COLOR{"yellow"} = "#ffff00";
    $WWW_COLOR{"green"}  = "#339900";
    $WWW_COLOR{"purple"} = "#990099";
    $WWW_COLOR{"blue"}   = "#0000ff";

> --------------------------------------------------
> Allow others (e.g., Help Desk) to access spong. This will require
> access control (e.g., passwords) as well as configurable "refresh rate"
> option so that only operators and us are allowed auto refresh.
> --------------------------------------------------

Ok, I'll meet you half way on this one.  The access control should come
through the standard web server mechanism.  There are 3 parts to spong:

   www-spong      - which just allows you to view the spong database
   www-spong-ack  - which allows you to update the database
   page.cgi       - which allows the operators to page us

I would suggest that you basically open up the www-spong program to the
people that need it, and tighten down the ack and page programs so that
you don't have the help desk acking problems or paging you.  

The way that you want to set up this access is up to you (host-based vs
user names and passwords, etc...).  You are quite familiar with the issues
of maintaining the password files for this type of thing, etc...

Ok, now that I've passed the buck on that, here is what I'll do.  I've
included two new variables in the spong.conf file.  They are:

      @WWW_REFRESH_ALLOW = ( '.*' );
      @WWW_REFRESH_DENY  = ( 'edhill', '128\.255\.51\.\d+', 'traitor.*' );

These lists would contain regular expressions that are checked.  ALLOW
is checked first, followed by DENY.  The following pieces of information
are checked against these regexps - REMOTE_USER (the username if you
protect www-spong using user authentication), REMOTE_HOST (the hostname
of the person connecting), and REMOTE_ADDR (the IP address of the person
connecting).

If no regular expression is matched from either list, then the
auto-refresh is not included in the output.
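
One plausible reading of that check, as a sketch only: refresh is
granted when some ALLOW pattern matches and no DENY pattern does.  The
function name refresh_allowed and the unanchored matching are
assumptions; the real www-spong logic may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical sketch: @ids holds whatever is known about the caller,
# e.g. REMOTE_USER, REMOTE_HOST and REMOTE_ADDR (undefined ones skipped).
sub refresh_allowed {
    my ($allow, $deny, @ids) = @_;
    @ids = grep { defined($_) && length($_) } @ids;

    # True if any pattern in the list matches any of the identifiers.
    my $matches = sub {
        my ($patterns) = @_;
        for my $re (@$patterns) {
            return 1 if grep { /$re/ } @ids;
        }
        return 0;
    };

    return 0 unless $matches->($allow);   # ALLOW is checked first
    return 0 if $matches->($deny);        # then DENY can veto
    return 1;
}

my @allow = ('.*');
my @deny  = ('edhill', '128\.255\.51\.\d+', 'traitor.*');
print refresh_allowed(\@allow, \@deny, 'doug', 'dim.weeg.uiowa.edu',
                      '128.255.60.1') ? "refresh\n" : "no refresh\n";
```

With empty lists nothing matches, so the refresh is left out, which
agrees with the behavior described above.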

----------------------------------------------------------------------------

And over the course of trying to do these things, instead of hearing the
kind words of encouragement that I'm used to over in this building, I
instead heard the following additional whining and tried to address it.
These comments have not been through Rex's tone modifier like the
requests above.

> --------------------------------------------------
> I don't know why we use this program anyway, the spong-server seems to
> crash for no good reason at random times.  We try to hack in things
> like getting spong to restart spong children when they die, but if the
> thing was just written well to begin with, we wouldn't have to do that crap.
> --------------------------------------------------

I have found a repeatable case where I could get spong to crash.  If a
client (either updating or querying spong) would time out (because spong
was too slow) - that client would just shut down the connection.  

Well, the spong-server was still trying to write to a pipe that was now
closed.  spong-server was told this with a SIGPIPE, but I wasn't
handling that signal, and if you don't handle or ignore SIGPIPE, its
default action is to terminate the program.

So I'm now checking for it, and it no longer seems to go away.  I've
left your child restarting code just in case (but as you probably know -
that won't help the case where the parent dies - either could happen
with the case that I fixed.)

There could of course be other problems than this, but there is now one
less thing that will make spong die.
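
For the curious, the failure mode and one way around it can be
reproduced in a few lines.  This is a sketch, not the actual
spong-server change: it simply ignores SIGPIPE (installing a handler is
the other common approach), and the helper name send_reply is made up.

```perl
use strict;
use warnings;

# Ignore SIGPIPE so that writing to a closed connection returns an
# error (EPIPE) instead of killing the whole process.
$SIG{PIPE} = 'IGNORE';

# Hypothetical helper: try to send a reply, report whether it worked.
sub send_reply {
    my ($fh, $text) = @_;
    my $written = syswrite($fh, $text);
    return 1 if defined $written;
    warn "client went away: $!\n" if $!{EPIPE};
    return 0;
}

# Simulate a client that timed out and closed its end of the connection.
pipe(my $reader, my $writer) or die "pipe failed: $!";
close $reader;
send_reply($writer, "status\n")
    or print "reply dropped, server still running\n";
```

Without the $SIG{PIPE} line, the same write would terminate the process
with the default SIGPIPE action.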


> --------------------------------------------------
> The history feature is useless, I can't believe it's actually slower
> than the spong summary.  It's so slow, it never even comes back before
> netscape times it out.  It's so slow, I went off and wrote my own
> little spong-history command because you're a bad programmer and blah
> blah blah...
> --------------------------------------------------

Ok, I wrote a script called spong-cleanup.  It should be run every
night.  It does the following:

   * Cleans out any history older than 7 days.  It moves the old history
     for each host into the /local/www/docs/spong/archive directory.
     If you don't think you would ever want to get at that history, then
     you can just change the script so that it is deleted.

   * Removes any acknowledgments that are no longer valid.

   * Removes any services that don't seem to be reported any more (if
     you stop monitoring something on a machine - the old entry will
     still hang around and show up as purple).

Cleaning out the old history brought the load time down from 90 seconds
to about 4 seconds.  That combined with the colored squares instead of
gif images should make retrieving the history via the web usable again.
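
The history part of the cleanup boils down to splitting entries at a
7-day cutoff.  A minimal sketch, assuming each history entry carries a
Unix timestamp in a "time" field; the function name and the data layout
are hypothetical and not taken from spong-cleanup itself.

```perl
use strict;
use warnings;

# Hypothetical sketch: separate history entries into those to keep and
# those to archive, based on their age in days.
sub split_history {
    my ($entries, $now, $max_age_days) = @_;
    my $cutoff = $now - $max_age_days * 24 * 60 * 60;
    my (@keep, @archive);
    for my $entry (@$entries) {
        if ($entry->{time} < $cutoff) {
            push @archive, $entry;    # older than the cutoff
        } else {
            push @keep, $entry;
        }
    }
    return (\@keep, \@archive);
}

my $now = time;
my ($keep, $archive) = split_history(
    [ { time => $now - 3600 }, { time => $now - 10 * 86400 } ], $now, 7);
printf "%d kept, %d archived\n", scalar @$keep, scalar @$archive;
```

The archived entries would then be written to the archive directory (or
simply dropped, if nobody wants them).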


On another note, I actually have a command line program called "spong"
which I have now included in the spong depot package.  This is a
client/server program that basically reports all the same things that
the www-spong program does (including history).  So you could install
"spong" on your desktop and run:

	spong --history

or if you just want to see the Unix machines:

	spong --history unix-all

It also allows you to view the summary table, problem hosts, individual
hosts, etc...  Type:

	spong --help

So if you wanted to do some type of automatic monitoring of the
automatic monitoring system from your desktop, you can use the command
line program to do it.


> --------------------------------------------------
> The acknowledgment mechanism sucks, we want web-ARS.  Hey, tie spong
> into the directory server. Hey, while you're at it, build us a directory
> server.
> --------------------------------------------------

Ok, you now have the ability to delete existing acknowledgments via the
web interface.  When you click on a host, in the Acknowledgment section
next to the descriptions there is a new link which allows you to delete
an Ack.

If you click on just the generic "Ack" menu item, it will take you to
the same screen as before (allowing you to add a new acknowledgment),
but at the top of the screen is a list of all the pending Acks, so
you can click on them and delete them.

Updating an acknowledgment can be done through this interface now.  1)
first you delete the old one, and 2) then you add a new one that is
similar but different than the old one - Voila, updates 8-) (I ran out
of time...)

I'll assume that the interface to all of this is self-explanatory.

Ok, now to the lame solution to the "ftp is down on silicon" problem.
As with the spong command line program, there is also a spong-ack
client/server command line program.  In your script on silicon that
disables ftp, you could add the following command

    spong-ack silicon.weeg.uiowa.edu ftp '+18h' 'its all Taos baby'

You can of course do this in any place that you have a script which is
going to down a service for a period of time.

You can also delete an acknowledgment through this command line
interface, but it is a little more convoluted.  You would run the
following command

    spong-ack --delete silicon.weeg.uiowa.edu-ftp-898185233

The little funky looking thing after the delete is the ack id.  No, you
probably don't normally know what the ID of your acknowledgment is, but
you can find it with the following

    spong --brief --acks

Yeah, deleting is not the cleanest via the command line, but you get
what you pay for (and no, there is no "updating" via the command line,
because in reality there is no updating, period - it's all an illusion).


> --------------------------------------------------
> Well what else is broken with this piece of crap package...
> --------------------------------------------------

Here is one I had not heard of, but noticed this week.  If you had a
host which had a problem with say its disk, and you acknowledged that
service, but then the host had another problem with say jobs. The host
would not show up in the "Problems list" on the left side of the frame,
so I'm sure the operators would probably not call you about the second
problem.

I fixed that bug...


> --------------------------------------------------
> Well, I bet you screwed up all the patches that we have had to make to
> duct tape this software together since you last worked on it a long
> long time ago.
> --------------------------------------------------

Well, I incorporated all the changes that I could find, including the
following:
	
    * The paging space addition to spong-client
    * Dan's connection to the LDAP server for machine info
    * Dave's various fixes for things like PID file removal, etc...

Basically I incorporated everything that I could find in the History
file or by diffing the various programs.


> --------------------------------------------------
> Why would you get rid of frames you dufus, it was the only thing you
> did right.
> --------------------------------------------------

Jeeze, I was just trying to make things faster.  Relax, it's back to the
frames version...


> --------------------------------------------------
> Why do you need root access on dim?  How about we just give you access
> to the cp and ls commands. All other commands that you want to execute
> will need to be placed in a file called "duh" and submitted for our
> approval.
> --------------------------------------------------

My initial response:  Grrr...

My response after accidentally chowning /tmp:  Yes sir, whatever is best
sir...


If there are any problems with any of this, I'm sure you will let me
know.