File: tag_lockups.html

package info (click to toggle)
lg-issue30 3-2
  • links: PTS
  • area: main
  • in suites: potato
  • size: 2,076 kB
  • ctags: 80
  • sloc: makefile: 36; sh: 4
file content (270 lines) | stat: -rw-r--r-- 12,170 bytes parent folder | download | duplicates (3)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
<!--startcut =======================================================  -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<META NAME="generator" CONTENT="lgazmail v1.1pre8">
<TITLE>The Answer Guy 30: Hardware Lockups due to Graphics Load</TITLE>
</head>

<BODY BGCOLOR="#FFFFFF" TEXT="#000000" LINK="#0000FF" VLINK="#A000A0"
ALINK="#FF0000">
<!--endcut =========================================================  -->
<H4>"Linux Gazette...<I>making Linux just a little more fun!</I>"
</H4>
<P> <hr> <P>

<!-- ===============================================================  -->
<H1 align="center"><A NAME="answer">
<img src="../gx/dennis/qbubble.gif" alt="" border="0" align="middle">
<a href="./lg_toc30.html">The Answer Guy</a>
<img src="../gx/dennis/bbubble.gif" alt="" border="0" align="middle">
</A></H1> <BR>
<H4 align="center">By James T. Dennis,
<a href="mailto:answerguy@ssc.com">answerguy@ssc.com</a><BR>
Starshine Technical Services,
<A HREF="http://www.starshine.org/">http://www.starshine.org/</A> </H4>
<p><hr><p>
<H3><img src="../gx/dennis/qbub.gif" alt="(?)" width="50" height="28"
	align="left" border="0">Hardware Lockups due to Graphics Load</H3>

<p><strong>From Brad Alexander on 30 May 1998

<!-- begin body -->
<BR><BR>
 Hi Jim,

<br><br>
 This isn't Linux-specific, but I'm having a problem and I'm hoping you can
 help me come up with a workaround that isn't going to cost a lot of money.

<br><br>
 I have an Intel P-100 on an Amptron AM-7900 board with 64MB of EDO RAM (2
 32MB sticks), a gob of hard drives (a 2.2MB Quantum Fireball IDE and a
 FutureDomain SCSI controller with a 420MB Conner, a 1GB Seagate, 1GB
 Micropolis and 1GB Quantum Empire), a Diamond Stealth 64 with 2MB DRAM, and
 a SoundBlaster 16 Plug'n'Pray.

<br><br>
 I'm running a heavily modified RedHat 5.0 machine with an 800MB DOS
 partition on <TT>/dev/hda1</TT> and a 200MB win95 partition on
 <TT>/dev/hda3</TT> (Linux's <TT>/+/usr</TT> is on <TT>/dev/hda2</TT>).

<br><br>
 I have been seeing system lockups for quite a while now. I noticed them
 when running xlock in random mode initially, then noticed that I was also
 starting to have problems with some of my dos apps, like Jane's Longbow and
 Duke Nukem locking up. Under Linux, I settled on using <TT>xlock</TT> in
 galaxy mode, and the lockups dropped to every couple of weeks. (Note that
 during this time, I upgraded memory from 4 8MB sticks to 2 32s.)

<br><br>
 Everything went all right until I upgraded to RedHat 5.0, with XFree86
 3.3.1. The lockups increased to about every 2 days. Once I upgraded to
 XFree86 3.3.2, they dropped back down to about once a week.

<br><br>
 I'm basically using you as a sounding board to see if I might have missed
 something. I'm thinking its hardware, but where? The stealth? The lockups
 seem to occur during graphics app use, <TT>xlock</TT>, or the <TT>gimp</TT>.
 The motherboard? The chip? What can I start replacing without sinking a whole
 bunch of money into it?

<br><br>
 Thanks in advance,
 <br>--Brad Alexander
</strong></p>

<blockquote><img src="../gx/dennis/bbub.gif"width="50" height="28" alt="(!)"
align="left" border="0">
	Well, the first thought would be to try a different video card.
	I don't have too much confidence that the problem is truly
	related to the video card's activity --- so it's just a
	diagnostics start.

<br><br>
	To see if this really is related to graphics, boot up the
	system in text mode (don't run X, change your runlevel or
	initdefault to one of the non-xdm modes if necessary).  Now you
	can run a couple of kernel builds on it (that's usually a
	pretty good stress test.  Try '<TT>make -j</TT>' to work it harder.

<br><br>
	It would also be helpful to know what sort of lockup you're
	getting.  It may be that you could still login via a serial
	port (using a null modem and a laptop or any other nearby
	computer or terminal).  Do do this simply add a line like

<blockquote><code>t1:23:respawn:/sbin/agetty -L 38400,19200,9600,2400,1200 ttyS1 vt100
</code></blockquote>

	... to your <TT>/etc/inittab</TT>.  This should allow you to use one of
	your serial lines to login.  It is possible for the Linux X
	Windows system and console to be dead while the kernel and
	other processes are still up and running.  Another test is to
	ping it from another system (if you have an ethernet LAN
	connected to this machine).  Even if telnet doesn't work you
	want to ping it to see if the kernel is still responding.

<br><br>
	It's also probably worth trying the software watchdog timer
	code in the newer kernels.  These allow you to configure a
	kernel module to emulate a hardware watchdog timer card.  These
	WDT devices are basically a "dead man's switch" for your
	system.  If the timer isn't periodically updated by the kernel
	(or by some <EM>other thread in the kernel</EM>, in the case of the
	emulated WDT) then the WDT triggers a system reset.

<br><br>
	Obviously a software emulation of this isn't quite as reliable
	as a hardware WDT --- since a completely hung kernel will
	never get around to calling on that module's thread of
	execution.  However, it isn't too unlikely that the hang is in
	some specific kernel thread and that some other thread
	continues to execute after other parts have died.

<br><br>
	Frankly I'm not sure what the difference between the kernel
	watchdog emulation code and the boot "<TT>panic=</TT>" parameter.  But
	that's definitely another thing to try (just add something
	like <TT>panic=60</TT> to your lilo "<TT>append=</TT>" directive, or
	manually when you boot up your system).  I guess that the difference
	would be that there may be some conditions under which the
	kernel could get into a comatose or unresponsive state without
	panic'ing (if it got tricked into some really long timeout wait
	or something).  The <TT>panic=</TT> option forces the Linux kernel to
	reboot after a "panic" (a critical error condition detected by
	the kernel, usually a corrupted table that fails its consistency and
	integrity checks).

<br><br>
	Normally the kernel would just display a "panic" message and
	sit there waiting for human intervention.  These are very
	rare (other than the old "VFS kernel panic, unable to mount
	root" that occurs when you have your kernel misconfigured for
	your arrangement of hard drives --- or when you change the
	hardware setting of your disk drives without updating your
	kernel (with the '<TT>rdev</TT>' command to set the root device flags)
	and/or without updating your <TT>LILO</TT> or <TT>LOADLIN</TT> commands
	(which are usually used to pass these flags to your kernel to
	over-ride the compiled in defaults).

<br><br>
	Other than that common case I think I've only seen one or two
	Linux kernel panics in the last 6 years.  I've only had about a
	half dozen unexplained system lockups over that period --- and
	that's on about fifty Linux machines that I've managed during
	various portions of that time.  These lockups might have been
	panics in situations that were so bad the kernel couldn't even
	display an error message, there's no way to know).

<br><br>
	I've only had to reboot unresponsive Linux boxes about a dozen
	or so times in all the years I've used it.  This was only a
	problem in the late .99 and early 1.0x kernels when I was
	running a very busy FTP/Web server that was simply overloaded
	-- the TCP/IP stack would get so congested that the system
	would timeout between my login name and password --- at the
	console (I'd've loved a working SAK --- secure attention key
	back then).  I was glad to see the major TCP/IP re-write in
	between 1.2 and 2.x.

<br><br>
	I'm not trying to tout Linux' horn here --- (well, maybe a
	little).  The point is that I don't get panics and lockups
	often enough to see how the <TT>panic=</TT> parameter and the
	softdog/watchdog code would work in those situations.

<br><br>
	However, if you enabled the <TT>panic=</TT> and/or the softdog
	kernel option, you may see that the machine reboots without a minute
	or two after your lockup (wait for ten or fifteen).  This tells
	you that some part of the kernel was still running (and that
	the hardware isn't completely wigged out).

<br><br>
	Beyond that the things to do are to take out all non-essential
	hardware (the sound card would be a great choice --- and the
	SCSI card, since you mention that your Linux partitions are on
	the IDE drives.  As with most technical computing issues the it
	eventually boils down to a matter of cost.  You mentioned a
	couple of times how you don't want to spend money on solving
	this problem.  Ultimately the time you spend fighting with it
	translates to money --- and you'll have to eventually ask what
	your time is worth.

<br><br>
	(The deeper part of this question is that you may find that
	your home machine isn't worth the time <EM>or the money</EM> and you
	may content yourself to just use any machines that you
	encounter at work, or whatever.  Strange as that sounds I've
	had friends who refuse to keep a computer around the house
	specifically because they "spend enough time with them at work"
	and feels that "home is for family time").

<br><br>
	At the same time I don't recommend throwing replacement
	components at the problem without understanding the nature of
	the problem.  However, it may be that the best solution is to
	replace the motherboard and/or the video card and/or the RAM.

<br><br>
	Troubleshooting computers is difficult work.  Whole books have
	been devoted to the subject (I like the Win L. Rosch Hardware
	Bible personally --- read it years ago and should probably get
	an updated copy).  There are also parts of the process that
	can't be gained from any book --- that you must learn by
	experience and figure out through some combination of analysis
	and intuition.  As our computers become more sophisticated the
	balance seems to lean more for the intuition.
</blockquote>

<!-- end body -->
<!--================================================================-->
<P> <hr> <P>
<H5 align="center"><a href="http://www.linuxgazette.com/ssc.copying.html"
	>Copyright &copy;</a> 1998, James T. Dennis <BR>
Published in <I>Linux Gazette</I> Issue 30 July 1998</H5>
<P> <hr> <P>
<!--================================================================-->
<table width="98%"><tr valign="center" align="center">
<td rowspan="3"><A HREF="../lg_answer30.html"><IMG
	SRC="../gx/dennis/answernew.gif"
	ALT="[ Answer Guy Index ]"></A></td>
<td><A HREF="tag_SCOkeys.html">SCOkeys</A></td>
<td><A HREF="tag_chroot.html">chroot</A></td>
<td><A HREF="tag_dosemu-db.html">dosemu-db</A></td>
<td><A HREF="tag_NTauth.html">NTauth</A></td>
<td><A HREF="tag_cdr.html">cdr</A></td>
<td><A HREF="tag_3270.html">3270</A></td>
<td><A HREF="tag_comport.html">comport</A></td>
</tr><tr valign="center" align="center">
<td><A HREF="tag_lilostop.html">lilostop</A></td>
<td><A HREF="tag_emulate.html">emulate</A></td>
<td><A HREF="tag_ppadrivers.html">ppadrivers</A></td>
<td><A HREF="tag_database.html">database</A></td>
<td><A HREF="tag_vacation.html">vacation</A></td>
<td><A HREF="tag_nullmodem.html">nullmodem</A></td>
<td><A HREF="tag_lockups.html">lockups</A></td>
</tr><tr valign="center" align="center">
<td><A HREF="tag_gzipC.html">gzipC</A></td>
<td><A HREF="tag_newlook.html">newlook</A></td>
<td><A HREF="tag_c500.html">c500</A></td>
<td><A HREF="tag_solprint.html">solprint</A></td>
<td><A HREF="tag_vc1shell.html">vc1shell</A></td>
<td><A HREF="tag_memleak.html">memleak</A></td>
<td><A HREF="tag_tvcard.html">tvcard</A></td>
</tr></table>
<P> <hr> <P>
<!--================================================================-->
<A HREF="./lg_toc30.html"><IMG SRC="../gx/indexnew.gif"
        ALT="[ Table Of Contents ]"></A>
<A HREF="../index.html"><IMG SRC="../gx/homenew.gif"
        ALT="[ Front Page ]"></A>
<A HREF="lg_bytes30.html"><IMG SRC="../gx/back2.gif"
        ALT="[ Previous Section ]"></A>
<A HREF="./vrenios.html"><IMG SRC="../gx/fwd.gif"
        ALT="[ Next Section ]"></A>
<!--startcut =======================================================  -->
</body>
</html>
<!--endcut =========================================================  -->