File: FAQ-collectl.html

package info (click to toggle)
collectl 3.7.4-1
  • links: PTS
  • area: main
  • in suites: jessie, jessie-kfreebsd
  • size: 1,624 kB
  • ctags: 119
  • sloc: perl: 14,928; sh: 429; makefile: 11
file content (582 lines) | stat: -rw-r--r-- 34,589 bytes parent folder | download | duplicates (4)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
<html>
<title>collectl Freuqently Asked Questions</title>
<body>

<h1>Collectl Frequently Asked Questions</h1>
<h2>General Questions</h2>
<ul>
  <li><a href="#gen1">What is the difference between collectl and sar?</a></li>
  <li><a href="#gen2">Isn't a default monitoring frequency of 1 second going to kill my system?</a></li>
  <li><a href="#gen3">What is the best monitoring frequency?</a></li>
  <li><a href="#gen4">Why so many switches?</a></li>
  <li><a href="#gen5">Why doesn't <i>--top</i> show as much data as the <i>top</i> command?</a></li>
  <li><a href="#gen6">What does <i>collectl</i> stand for?</a></li>
  <li><a href="#gen7">How do you pronounce <i>collectl</i></a></li>
  <li><a href="#gen8">Why is the default socket port 2655?</a></li>
</ul>
<h2>Running collectl</h2></ul>
<ul>
  <li><a href="#run1">How do I get started?</a></li>
  <li><a href="#run2">How do I make a plot?</a></li>
  <li><a href="#run3">How do I drill down to get a closer look at what's going on?</a></li>
  <li><a href="#run4">I want to look at detail data but forgot to specify it when 
                      I collected the data.  Now what?</a></li>
  <li><a href="#run5">How do I configure collectl to run all the time as a service?</a></li>
  <li><a href="#run6">How do I change the monitoring parameters for the 'service'?</a></li>
  <li><a href="#run7">What are the differences between --rawtoo, --lexpr and --sexpr</a></li>
  <li><a href="#run8">How can I pass collectl data to other applications?</a></li>
</ul>

<h2>Gottchas - what you don't know <i>can</i> hurt you<img src=new.gif></h2>
<ul>
  <li><a href="#got1">Round-off vs normalization</a></li>
</ul>
</h2>

<h2>Operational Problems</h2>
<ul>
  <li><a href="#ops1">Why won't collectl run as a service?</a></li>
  <li><a href="#ops2">Why is my 'raw' file so big?</a></li>
  <li><a href="#ops3">Playing back multiple files to terminal doesn't show file names</a></li>
  <li><a href="#ops4">Why don't the averages/totals produced in brief mode look correct?</a></li>
  <li><a href="#ops6">What does <i>New slab created after logging started</i> mean?</a></li>
  <li><a href="#ops7">Why does collectl say <i>waiting for 60 second sample...</i> but doesn't?</a></li>
  <li><a href="#ops8">Why am I not seeing exceptions only with -ox?</a></li>
  <li><a href="#ops9">I'm seeing a bogus data point!</a></li>
  <li><a href="#ops10">What does the error <i>-sj or -sJ with -P also requires CPU details so add C or remove J.</i> mean?</a></li>
  <li><a href="#ops11">Why can't I see Process I/O statistics?</a></li>
  <li><a href="#ops12">I'm getting an error that <i>formatit.ph</i> can't be found</a></li>
  <li><a href="#ops13">When I use an interval >4 seconds I'm getting non-uniform sample times</a></li>
  <li><a href="#ops14">I'm getting <i>settimer</i> messages on the console and in dmesg</a></li>
  <li><a href="#ops15">Why is a set of process data missing during playback?</a></li>
  <li><a href="#ops16">I played back all summary data with -P but got a cpu detail file.  Why?</a></li>
  <li><a href="#ops17">What does this message mean: <i>-sj or -sJ with -P also requires CPU details so adding -sC</i></a></li>
  <li><a href="#ops18">Why are some process usernames being reported as numbers?</a></li>
  <li><a href="#ops19">Why don't I get a prc file with -P and --rawtoo?</a></li>
  <li><a href="#ops21">What happened to -sy in brief mode?</a></li>
  <li><a href="#ops22">Why do I get an empty file during play back with -sY/sZ and -P -f?</a></li>
  <li><a href="#ops23">Why don't I get slb/prc files with -P, -sY/Z and --rawtoo but witout -rawtoo I do?</a></li>
  <li><a href="#ops24">I'm getting an error from RPM: <i>perl >= 0:5.008000 is needed by collectl...</i></a></li>
  <li><a href="#ops25">What does this mean: <i>New network device found: xxx?</i></a></li>
  <li><a href="#ops26">What does this mean: <i>NumNets in header is 'x' but only 'y' listed...?</i></a></li>
  <li><a href="#ops27">Why is -sC showing CPUs with no load and no idle?</a></li>
  <li><a href="#ops28">When did playing back a file start taking so long?</a></li>
  <li><a href="#ops29">What does the message <i>-D requires -f OR -A server</i> mean?</a></li>
</ul>

<h1>General Questions</h1>

<a name="gen1"></a>
<h2>What is the difference between collectl and sar?</h2>
At the highest level, both collectl and sar provide lightweight collection of 
device performance information.  However, when used in a diagnostic mode sar
falls short on a number of points, though admittedly some could be addressed
by wrapping it with scripts that reformat the data:
<ul>
  <li>sar plays back data by device/subsystem, sorted by time</li>
  <li>sar does not deal with sub-second time</li>
  <li>sar output cannot be directly fed into a plotting tool</li>
  <li>sar does not provide nfs, lustre or interconnect statistics
  <li>sar does not provide for the collection of slab data
  <li>sar's process monitoring is limited in that if cannot save
    process data in a file, cannot monitor threads, cannot select processes
    to monitor other than ALL or by pid (so cannot selectively discover new 
    processes) and in interactive mode is limited to 256 processes</li>
</ul>

<a name="gen2"></a>
<h2>Isn't a default monitoring frequency of 1 second going to kill my system?</h2>
Running collectl interactively at a 1 second interval has been shown to provide
minimal load.  However, for running collectl for long periods of time it is 
recommended to use a default monitoring period of 10 second and in fact is the 
default when collectl is run as a daemon and started using the 'service start
collectl command'.
<br>
A lot of effort has gone into making collectl very efficient in spite of the fact
that it's written in an interpretive language like perl, which by the way is known
for its efficiency.  collectl has been measured to use less than 0.01% of the cpu
on most systems at an interval of 10 seconds.  
To measure collectl's load on your own system you can use the command
"time collectl -i0 -c8640 -s??? -f." to see the load of collecting a day's worth 
of data for the specific subsystems included with the -s switch.

<a name="gen3"></a>
<h2>What is the best monitoring frequency?</h2>
There really isn't a 'best' per se.  In general collecting counter data every 10 seconds
and process/slab data every minute has been observed to produce a maximum amount of data
with a minimal load.  When this granularity isn't sufficient there have been uses for 
collecting data as 0.1 second intervals!  There have even been times when wanting to 
verify a short lived process really does start that doing process monitoring by name at 
an interval of 0.01 seconds has been found to be useful.

<a name="gen4"></a>
<h2>Why so many switches?</h2>
In general, most people will not need most switches and that's the main reason
for 'basic' vs 'extended' help.  However, it's also possible that there may be
an extended switch that provides some specific piece of functionality not there
with the basic ones and it is recommended that once you feel more comfortable
with the basic operations that you spend a little time looking at them too.

<a name="gen5"></a>
<h2>Why doesn't <i>--top</i> show as much data as the <i>top</i> command?</h2>
The simple answer is because this is <i>collectl</i>, not <i>top</i>.
Actually I thought of that and then decided with all the different switches
and options, the easiest thing to do is just run
a second instance of collectl in another window, 
showing whatever else you want to see in whatever
format you like.  You can even pick different monitoring intervals.

<a name="gen6"></a>
<h2>What does <i>collectl</i> stand for?</h2>
Collectl is based on  the very popular <i>collect</i> tool written by Rob Urban which
was distributed as with DEC's Tru64 Unix Operating System and therefore stands for
<i>collect for linux</i>.

<a name="gen7"></a>
<h2>How do you pronounce <i>collectl</i>?</h2>
It rhymes with pterodactyl.

<a name="gen8"></a>
<h2>Why is the default socket port 2655??</h2>
Those are the first 4 digits of <i>collectl</i> on a telephone numeric key pad.

<h1>Running collectl</h1>

<a name="run1"></a>
<h2>How do I get started?</h2>
The easiest way to get started is to just type 'collectl'.  It will report
summary statistics on cpu, disk and network once a second.  
If you want to change the subsystems being reported on use -s and to change
the interval use -i.  More verbose information can be displayed with --verbose.  See
the man pages for more detail.

<a name="run2"></a>
<h2>How do I make a plot?</h2>
Collectl supports saving data in plot format - space separated fields - through
the use of the -P switch.  The resultant output can then be easily plotted 
using gnuplot, excel or any other packages that understand this format.  You
can redirect collectl's output to a file OR it's much easier to just use the
-f switch to specify a location to write the data.

<a name="run3"></a>
<h2>How do I drill down to get a closer look at what's going on?</h2>
The first order of business is to familiarize yourself with the types of data 
collectl is capable of collecting.  This is best done by looking at the data
produced by all the different settings for -s, both lower and upper case as 
there is some detail data that is not visible at the summary level.  
Take a look at -sd and -sD.  If you still
don't see something it might actually be written in -P format.  See -sT for an
example.
<br>
Next, run collectl and instruct it to log everything (or at least as much as you 
think you'll need) to a file.  When you believe you've collected enough data - and
this could span multiple days - identify times of interest or just plot everything
(see the -P switch).
Visually inspecting the plotted data can often show times of unusually heavy resource
loads.  Often times there is a strong time delineation between good and bad.
<br>
It you want to see the actual numbers in the data as opposed to plots,
play back the data using the --from switch to 
select a begin time, usually a few samples prior to the time when things started to
go bad.  To reduce the amount of output you can also use --thru to set the end time for
the collection.  You can also start selecting specific subsystems to look at as well
as individual devices.  For example, if you've discovered that at 11:03 there was 
an unusal network load, try 'collectl -p filename --from 11:02 --thru 11:05 -sN' to see
the activity at each NIC.
<br>
And don't forget process and/or slab activity if either has been collected.  You can
also play back this data at specific time intervals too.

<a name="run4"></a>
<h2>I want to look at detail data but forgot to specify it when I collected the 
data.  Now what?</h2>
Good news!  With the exception of CPU data, collectl always collects
detail data whether you ask for it or not - that's how it generates the 
summaries.  When you extract data into plot format, by default it extracts the
data based on the switches you used when you collected it.  So, if you 
specified -sd you'll only see summary data when you extract it.  BUT if you
include -s+D during the generation of plotting data you WILL generate disk
details as well.

<a name="run5"></a>
<h2>How do I configure collectl to run all the time as a service?</h2>
Use the chkconfig to change collectl's setting to 'on'.  On boot, collectl will
be automatically started.  To start collectl immediately, type 
'service collectl start'.

<a name="run6"></a>
<h2>How do I change the monitoring parameters for the 'service'?</h2>
Edit /etc/collectl.conf and add any switches you like to the 'DaemonCommands'
line.  To verify these are indeed compatible (some switches aren't), cut/paste
that line into a collectl run command to make sure they work before trying to
start the service.

<a name="run7"></a>
<h2>What are the differences between --rawtoo, --lexpr and --sexpr</h2>
Looking at --lexpr and --sexpr first, these will cause the contents of most
counters to be written as a list or s-expression to either a file or socket
based on whether -f or -A is specified.  If both are specified the data will
be sent over the socket and written locally as a raw file.
Adding -P will cause the local file to be written in plot format while adding
--rawtoo will cause both plot and raw files to be written locally.
<p>
For more information also see <a href=Logging.html>Logging</a>.
<a name="run8">
</a><h2>How can I pass collectl data to other applications?</h2>
You actually have several choices, all of which are based on
<i>--export</i> in which you specify a routine that will export collectl's
output in some other format that your application 
may prefer to see it in.  There are currently 2 such routines:
<ul>
<li>lexpr.ph formats data as a list</li>
<li>sexpr.ph formats data as an s-expression</li>
</ul>
If you don't like either, you're free to write your own.
<p>
The next thing you need to decide is whether you simply want to write a
data snapshot to a local file which some other program/script can retrieve
OR send the data over a socket to your application.  Clearly using the socket
is more efficient, but the choice is all yours.
<p>
An example parser script called <i>readS</i> had been provided as a convenient
way to parse a file that is an s-expression, but just keep in mind that it is 
written in perl and every invocation involved starting up the perl interpreter
which may be a little heavier-weight than you wish to use.
<p>
If you do choose to use it, the arguments to readS take the following form:
<p><tt>dir category variable [instance [divisor]]</tt>
<ul>
  <li>dir - directory in which the s-expression S exists, often
      /var/log/collectl
  <li>category - category of data element, such asc <i>cputotals</i> or <i>diskinfo</i>
  <li>variable - the name of a summary item such as <i>nice</i> for cpu data 
      or the name of the detail category such as <i>diskinfo</i>.
  <li>instance - for detail data this is the instance name
  <li>divisor - divide the results by this number, noting for summary data 
      <i>instance</i> needs to be a null string
</ul>

Detailed customization instructions for use of data returned by <i>readS</i> 
is beyond the scope of this FAQ.

<h1>Gottchas</h1>
This section is for those things people might otherwise miss in the FAQ and Shouldn't, because
they can cause misunderstanding of what is being reported.

<a name="got1"></a>
<h2>Round-off vs normalization</h2>

Since most of the numbers collectl reports are large, certainly greater than the number of
seconds in an interval,  they are not affected and so this is rarely an issue.  However there 
are values that can be quite low such as the number of processes created/second or network errors.
This can also be the case for many of the nfs metadata operations counters, some of which report
values as low as 1 over many minutes.
The issue is if you're looking at data at some interval other than 1 second, the result of dividing
these numbers by the interval size can result in values less than 0.5 and since collectl rounds off,
those values would be reported as 0 and you could miss a critical data point.  If you
tell collectl not to normalize the data, in other words NOT to divide by the interval, you'll
always see the actual numbers.  However, since most people prefer values/sec, it's easy to
forget.  Try not to...

<h1>Operational Problems</h1>

<a name="ops1"></a>
<h2>Why won't collectl run as a service?</h2>
As configured, collectl will write its date/time named log files to 
/var/log/collectl, rolling them every day just after midnight and retaining 
one week's worth.  In addition it also maintains a 'message log' file named for
the host, year and month, eg hostname-200508.log - the creation of the message log is
driven off the -m switch in DaemonCommands.  Check this log for any messages
that should explain what is going on.

<a name="ops2"></a>
<h2>Why is my 'raw' file so big?</h2>
By default, collectl will collect a lot of data - as much as 10 or more MB/day!
If the perl-Compress library is installed, these logs will automatically be 
compressed and are typically less than 2MB/day.
<br>
The output file size is also affected by the number of devices being monitored.
In general, even on large systems the number network interfaces is small and
shouldn't matter, but if the number of disks gets very high, say in the dozens
or more, this can begin to have an effect on the file size.  The other big
variable is the number of processes when collecting process data.  As this 
number grows to the many hundreds (or more), you will see the size of the data
file grow.
<br>
Finally the other parameter that effects size is the monitoring interval.  The
aforementioned sizes are based on the defaults which are process/slab 
monitoring once every 60 seconds and device monitoring once every 10 seconds.
Did you override these and make them too small?

<a name="ops3"></a>
<h2>Playing back multiple files to terminal doesn't show file names</h2>
By design, collectl is expected to be used in multiple ways and a lot of 
flexibility in the output format has been provided.  The most common 
way of using playback mode is to play back a single file and therefore the name
of the file is not displayed.  The -m switch will provide the file names as
they are processed.

<a name="ops4"></a>
<h2>Why don't the averages/totals produced in <i>brief</i> mode look correct?</h2>
There may be two reasons for this, the most obvious being that by default the 
intermediate numbers are normalized into a /sec rate and the averages/totals
are based on the raw numbers.  If the monitoring interval is 1 sec or you use
-on to suppress normalization, the results will be very close.
<br>
The other point to consider is that numbers are often stored at a higher 
resolution than displayed and so there is less round-off error with the averages
and totals.

<a name="ops5"></a>

<a name="ops6"></a>
<h2>What does <i>New slab created after logging started</i> mean?</h2>
When collectl first starts, it builds a list of all the existing slabs.
As the message states, collectl has discovered a new slab and adds it to
its list.  This is relatively rare but can also indicate collection was
started too soon, possibly before system processes or applications
have allocated system data structures.  It is really just an informational
message and can safely be ignored.

<a name="ops7"></a>
<h2>Why does collectl say <i>waiting for 60 second sample...</i> but doesn't?</h2>
This is very rare as it will only happen when collecting a small number of process
or slab data samples, but it is also worth understanding what is happening
because it gets into the internal mechanics of data collection.  In addition to the
normal counter collectl uses to collect most data, it also maintains a second
one for coarser samples such as process and slab data.  When reporting how long
collectl is going to wait for a sample, it uses a number based on the type of
data being collected.  In almost all cases this is the value of the fine-grained
counter, but if only collecting process or slab data, it reports the second
counter whose default is 60 seconds.
<p>
Collection of counters, such as disk traffic or cpu load, always requires 2
samples since it's their different that represent the actual value.  Other
data such as memory in use or process data only require a single sample 
<i>but</i> in order to synchronize all the values being reported, collectl
always uses its first sampling interval to collectl a base sample and doesn't
actually report anything until the second sample is taken which is why it
reports the <i>waiting...</i> message even if it isn't being asked to report
any counters.
<p>
Finally, the -c switch which specifies the number of samples to collect
applies to the finer-grained counter.  This means if you try to collect a number
of samples that will cause the -c switch limit to be reached because any data
is actually collected, you will see collectl exit without reporting anything!
The best example of this would be the command <i>collectl -sZ -c1</i>.
Since the default interactive sample counters are 1 and 60 seconds respectively
<i>and</i> collectl has to actually take 2 samples, collectl will only run long
enough for one tick of the fine-grained counter or 1 second and immediately
exit with no output.  Therefore to 
collect 1 process sample you will actually need to use -c60 but will also have
to wait 60 seconds to see anything.  
Alternatively you could set the fine-grained sample counter to the same as
the process sample counter
and so the command <i>collectl -i60:60 -sZ -c1</i> would also report 1 sample
after waiting for 60 seconds.  If you want to collect a sample after just 1
second, you should use <i>collectl -i:1 -sZ -c1</i>.

<a name="ops8"></a>
<h2>Why am I not seeing exceptions only with -ox?</h2>
Exception processing requires --verbose.  Did you forget to include it?

<a name="ops9"></a>
<h2>I'm seeing a bogus data point!</h2>
This message means collectl has read a corrupted network statistics record
and is ignoring it.  It also turns out this has been attributed to some
bnx2 chips and a workaround has been generated for newer drivers.  If you
want a little more information your can read about it
<a href=http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commit;h=02537b0676930b1bd9aff2139e0e645c79986931>
here.</a>
<p>
The way collectl determines a record is bogus is to look at the transmit and receive 
rates for each interface and compare them to the speed of that interface
from /sys/devices OR if no entry found uses the value of <i>DefNetSpeed</i>
which can be overridden in collectl.conf).
It either exceeds twice the interface rate, the record is considered bogus
and ignored.  This will cause collectl to report the previous rate
for this interval.  While not foolproof, it is hoped this will reduce the
frequency of this type of data.

<a name="ops10"></a>
<h2>What does the error <i>-sj or -sJ with -P also requires CPU details so add C or remove J.</i> mean?</h2>
Interrupt reporting has a unique property in that summary data provides CPU specific data
while detail data provides data about individual interrupts and you will get this error
if you request interrupt plot data but not CPU detail data.  The most common place this
can happen is if you run collectl V2.5.0 as a daemon because it collectl interrupt data
but not CPU detail data.
<p>
In order to play back plot data from a file that did not specify CPU details be 
collected, you can either tell collectl not to include interrupts by the command
<i>collectl -s-j...</i> or tell it to also include CPU details with the command
<i>collectl -s+C...</i>.
<p>
In order to make this less confusing with future releases, and until I think of simpler
way to do this, the collectl daemon will be set up to include CPU details, noting this
has no impact on data collection but only on the playback.  You can always request CPU
detail data not be generated on playback but will now also have to request interrupts
not be included as well by <i>collectl -s-jC...</i>.

<a name="ops11"></a>
<h2>Why can't I see Process I/O statistics?</h2>
You need to be running version 2.6.22 of the kernel or greater <i>and</i> it must have
process I/O statistics enabled.  The easiest way to check is to see if <i>/proc/self/io</i>
exists.  If not, you don't have them enabled and will need to rebuild your kernel and the
instructions for doing so are beyond the scope of this FAQ.  If you do rebuild, make sure
you have the following symbols enabled: <i>CONFIG_TASKSTATS, CONFIG_TASK_XACCT</i> and
<i>CONFIG_TASK_IO_ACCOUNTING</i>.

<a name="ops12"></a>
<h2>I'm getting an error that <i>formatit.ph</i> can't be found</h2>
This component of collectl must be in the same directory as collectl itself.  On startup
collectl looks at the command used to start it and from there determines its location by
following as many links as may be associated with that command.  It then extracts its 
directory name from the last link (if any) in the chain.  If one has set up a set of links
such that the last one uses a relative path, when collectl prepends that path to formatit.ph
it's likely not to find it and hence this message.  To fix the problem
simply specify the complete path in the final link.

<a name="ops13"></a>
<h2>When I use an interval >4 seconds I'm getting non-uniform sample times</h2>
Awhile back I found a problem on a SuSE 10 system that was running with a new version
of glibc that changed the granularity of timers from micro-seconds to nanoseconds and
therefore went from 32 to 64 bits.  Guess what, 4.3 seconds is > 32 bits!  Once I
reported this to the author of HiRes he immediately (within hours) release version
1.91 which addressed the problem.  A newer version of HiRes should be the remedy.

<a name="ops14"></a>
<h2>I'm getting <i>settimer</i> messages on the console and in dmesg</h2>
This problem is actually another form of the previous one and is related to version 2.5 of
glibc.  See more details on what this means and how to correct it <a href=HiResTime.html>here</a>.

<a name="ops15"></a>
<h2>Why is a set of process data missing during playback?</h2>
This actually applies to slab data too.  Both types of data contain counters and in order to
display them as rates, collectl needs to read a sample as a baselevel - that's why you see the
<i>waiting for n second sample</i> message when collectl first starts.   This applies to
playback too and that's why when you playback data from a file you never see the first
sample.  Since process/slab data are usually collected at a different frequency, collectl
has to read more intervals to get to the proper baselevel.  As an example, consider data collected
at 00:01AM.  When you play it back, collectl is able to include processes/slabs in its baselevel
since this data is typically collected on an intergral minute boundary.  It will therefore start playing back non-process
data at 00:01:10 and process data at 00:02:00.  In fact, if you play back data from 00:01:10 
collectl reads data from the previous interval and will still report process data for 00:02:00.
However if you play back data from 00:01:20 it will set its baselevel to 00:01:10 which does not
contain process data and so will not report process data until 00:03:00.
<p>
Therefore, if you're not seeing process data for the time interval you've chosen use an earlier
value for <i>--from</i>.

<a name="ops16"></a>
<h2>I played back all summary data with -P but got a cpu detail file.  Why?</h2>
This will happen when you try to convert interrupt data to plot format file.  In fact if you include
-m you will see a <a href=FAQ-collectl.html#ops17>message</a> that this is automatically being done for you.  The basic problem is that by
definition summary data is written in a fixed format to help ease the automated processing of the tab
file.  However interrupt data by definition is variable since there is a counter for each CPU.  By
forcing a CPU detail file, there is now a place to put the interrupt data.  To get rid of the message
simply include cpu detail when in playback mode.

<a name="ops17"></a>
<h2>What does this message mean: <i>-sj or -sJ with -P also requires CPU details so adding -sC</i></h2>
This message is displayed whenever you try to convert interrupt data to a file in plot format and have not included
CPU detail data as explained in the previous section.  This message is only displayed when -m is included.

<a name="ops18"></a>
<h2>Why are some process usernames being reported as numbers?</h2>
Collectl uses the usernames from /etc/passwd to associate with a process's UID.  If it can't find that UID in
the file there is no username to associated with it and so collectl
will report the UID instead.  This can happen when using NIS in which case username information is stored
remotely OR if playing back a file on a different host than where the raw data was collected.  To get around this
problem, copy (or create) the passwd file with the correct relationships to the local system and point 
collectl to it with <i>--passwd</i> OR change the <i>Passwd</i> entry in collectl.conf to point to that file if
you get tired of using <i>--passwd</i>.

<a name="ops19"></a>
<h2>Why don't I get a prc file with -P and --rawtoo?</h2>
The data in the raw file is essentially identical to that in the prc file so the theory is why repeat it
when you can easily create it from the raw file.  Furthermore the raw file is already compressed and so
is a lot smaller.

<a name="ops21"><a><h2>
<h2>What happened to -sy in brief mode?</h2>
The brief form was always a hack because slab data is typically collected at a different frequency
than everything else.  With the 2.6 kernel, slab usage was included in the brief memory display 
anyway and therefore less important via the -sy switch.  With the move to supporting slab data in
a separate file with -G/--group, that become an even better reason to drop it so I did.

<a name="ops22"><a>
<h2>Why do I get an empty file during play back with -sY/-sZ and -P -f?</h2>
Trying to get this right was just considered too much effort for too rare a case.

<a name="ops23"><a>
<h2>Why don't I get slb/prc files with -P, -sY/Z and --rawtoo but witout -rawtoo I do?</h2>
Process data in raw format is almost identical to process data in plot format and it felt like a waste of
space to replicate the data.  Since slab data tends to be treated the same, at least in rawp files, it just
came along for the ride.  You can always force the generation of the slb/prc files by playing back the raw
file with -sY/Z for a second pass.

<i>perl >= 0:5.008000 is needed by collectl...</i></a></li>
</ul>

<a name="ops24"></a>
<h2>I'm getting an error from RPM: <i>perl >= 0:5.008000 is needed by collectl...</i></h2>
This seems to be a problem with some older versions of RPM.  Include <i>--nodeps</i> to
override the dependency check and collectl will install and run just fine.

<a name="ops25"></a>
<h2>What does this mean: New network device found: xxx?</h2>
When collectl was first developed, the assumption had been that to add new devices, especially a
network device, one would have to power off the machine and reboot, making the possibility of a
new device showing up after boot unlikely.  Therefore, when collectl first starts up it identifies 
all the networks on the system and if logging to a file writes this information into the header
so that during playback it can all the network names without having to look into the file.  If a 
new network device does becomes visible after collectl started or during playback, you'll see this message.
<p>
However, this message has recently been seen when running with InfiniBand, so a devices obviously
showed up after collectl started.  This message also means that the new device has been added to the 
end of the list of known networks.  If it in fact should have been inserted in the middle of that 
list the results are unpredictable and will probably be wrong.  Be sure to let me know if this occurs.
<p>
Clearly as newer versions of linux/hardware emerge, the types of hardware changes that can occur
without shutting down the systems will no doubt increase and so perhaps more of these types of 
situations will occur.

<a name="ops26"></a>
<h2>What does this mean: NumNets in header is 'x' but only 'y' listed...?</h2>
This message only occurs during playback.  Versions of collectl older than 3.5.1 were 
updating the internal number of network devices when a new one was encountered in real-time or during 
playback, but they were forgetting to update the names used the header as well.  This meant when a 
new logfile was created after midnight (or whenever it was rotated), the new value for the number of
networks was changed in the header but not their names.  If you see this message it's simply pointing 
out that fact and you WILL later see <i> New network device found: xxx</i> messages as those new 
networks are found in the raw file during playback.  This should not occur with raw files created after
V3.5.1.

<a name="ops27"></a>
<h2>Why is -sC showing CPUs with no load and no idle?</h2>
Somewhere along the kernel started incorrectly updating the CPU usage stats in /proc/stat and if that's
the case with your kernel, you'll see this behavior.  There is really nothing collectl can do about it
as it relies on the kernel for accurate stats.  Furthermore, before you run out and try a different tool
remember that all tools use /proc/stat to report CPU usage and so that won't help.

<a name="ops28"></a>
<h2>When did playing back a file start taking so long?</h2>
This problem was first discovered when a user upgraded to RHEL6.2, but it turns out it has nothing to do
with that distro but rather the underlying compression library that collectl relies on, namely Compress::Zlib.
It turns out when that compression library went from 1.42 (last rev of the 1.0 architecure) this broke so
any versions in the 2.0 generation will show this problem.  You can see which version of zlib you're using
by running 'collectl -v'.  If the playback times are too long for you, just do what I've been doing for 
years now and run collectl with -P --rawtoo and you're get both a raw file as well as a plot file with 
not a lot of extra overhead.

<a name="ops29"></a>
<h2>What does the message -D requires -f OR -A server mean?</h2>
First of all, this should only be seen by someone who has modified <i>/etc/collectl.conf</i> because it already
specifies -f.  The point of the message is that when you run collectl as a deamon, it needs to know what 
to do with its output since there is no terminal available.  That destination must be a file OR collectl
must be run as a server, sending its output out over a socket.

<table width=100%><tr><td align=right><i>updated July 30, 2013</i></td></tr></colgroup></table>

</body>
</html>