<html>
<head>
<link rel=stylesheet href="style.css" type="text/css">
<title>collectl - Slow Proc Access?</title>
</head>

<body>
<center><h2>Is Your Kernel Reading /proc Too Slowly?</h2></center>
<p>
<h3>Introduction</h3>
The 2.6.32 release of the kernel introduced a regression that causes /proc to be read significantly
more slowly and <i>with significantly more overhead</i> on systems with high core counts.  In fact, this
overhead has been measured at over a factor of 50 when reading /proc/stat on a system with 8 sockets
and 48 cores.
<p>
The good news is that newer RedHat and SUSE distros have been updated to mitigate this problem, specifically
RHEL 6.2 and SLES SP1.  As for other distros, I just don't have access to verify everything, so if you
are running a different distro and can verify this problem has been resolved, I'd appreciate hearing
which specific version addresses it so I can publish the news here.
<p>
So why should you even care about this?  You may have high core counts and be running a kernel that has
not yet been patched.  This probably won't have any impact on your running applications, but do
you ever run top?  iostat?  sar?  Or any other monitoring tools?  If you're reading this you probably run
collectl.  Most monitoring tools are fairly lightweight, and for a good reason - if you're trying to measure
something you don't want the tool's overhead to get in the way.  <i>Unfortunately, with this regression
it now will!</i>
<p>
<h3>The Analysis</h3>
Whenever a monitoring change is made to collectl I measure its overhead, just to make
sure it's being done efficiently.  The way I do this is to simulate a day's worth of sampling at an
interval of 0 seconds and time it.  In this case I had recently added numa monitoring and initially found
this problem reading <i>/sys/devices/system/node</i>, but with more testing found that it appears
elsewhere as well.
<p>
In the following example, you can see that monitoring CPU data takes about 3 seconds to read almost
9K samples and write them to a file on a 2-socket/dual-core system.  Very efficient!
<p>
<div class=terminal><pre>
time collectl -sc -i0 -c 8640 -f/tmp
real    0m2.879s
user    0m1.908s
sys     0m0.913s
</pre></div>
Next I ran the same command on an 8-socket/48-core system - look at the difference.  The overhead
was so high and took so long that I only collected 1/10th the number of samples (I get impatient).  This
system is running Fedora 14, which uses a 2.6.35 kernel, and even this shortened run shows over 5 times
the overhead of the previous example; normalized to a full day's worth of samples, that would be over
50 times the load:

<div class=terminal><pre>
time collectl -sc -i0 -c 864 -f/tmp
real    0m16.783s
user    0m3.003s
sys     0m13.523s
</pre></div>

<h3>How can you tell if <i>your</i> system has this problem?</h3>
Before you panic, there are 2 things to keep in mind:
<ul>
<li>This only seems to affect systems with higher core counts, on the order of 32 or more; there
may be some impact at the 16-core level, but it's not very significant.  A quick way to count your
cores is shown right after this list.</li>
<li>You must be running a kernel with the problematic code from the 2.6.32 release.</li>
</ul>
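<p>
If you're not sure how many cores your system has, counting the <i>processor</i> entries in
/proc/cpuinfo is a quick way to find out; on the 48-core system described here this would
report 48:
<div class=terminal><pre>
grep -c ^processor /proc/cpuinfo
48
</pre></div>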
<p>
Since a simple <i>uname</i> command will tell you your kernel version, you might think that's all
it takes, but nothing is ever that simple; most vendors patch their kernels, so you can't
always be sure what code your kernel is actually running.
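<p>
For example, a stock RHEL 6.2 kernel still identifies itself as 2.6.32-based even though it
contains the fix, which is exactly why the version string alone can't be trusted (the exact
string will of course vary with distro and update level):
<div class=terminal><pre>
uname -r
2.6.32-220.el6.x86_64
</pre></div>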
<p>
One simple way to
tell for sure is to run the test below, which times a read of /proc/stat (which seems to be
the most heavily affected) by using <i>strace</i> to see how much time is spent in the actual <i>read</i>.
<p>
The following is on my 2-socket/dual-core system:

<div class=terminal><pre>
strace -c cat /proc/stat > /dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000251         251         1           execve
  0.00    0.000000           0         3           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0         4           open
  0.00    0.000000           0         5           close
  0.00    0.000000           0         5           fstat
  0.00    0.000000           0         8           mmap
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.000251                    37         1 total
</pre></div>

while the following is on the 8-socket/48-core system:
<div class=terminal><pre>
strace -c cat /proc/stat >/dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.014997        4999         3           read
  0.00    0.000000           0         1           write
  0.00    0.000000           0        20        16 open
  0.00    0.000000           0         6           close
  0.00    0.000000           0        12        10 stat
  0.00    0.000000           0         5           fstat
  0.00    0.000000           0         8           mmap
  0.00    0.000000           0         3           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         4           brk
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.014997                    66        27 total
</pre></div>

As you can see, the differences are dramatic!  On my 4-core machine virtually no time is spent doing
the read, while on the 48-core machine almost 15 msec is spent reading, <i>and that's only reading
/proc/stat one time!</i>  Also remember that monitoring tools typically read a lot of different
/proc structures.  Perhaps now you can get a better appreciation of how significant this problem
really is.
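<p>
If you'd rather see the cost of each individual call than a summary, <i>strace</i>'s <i>-T</i>
switch appends the time spent in each system call, so a command like the one below will show
the individual /proc/stat reads taking milliseconds apiece on an affected system:
<div class=terminal><pre>
strace -T -e trace=read cat /proc/stat > /dev/null
</pre></div>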
<p>
<h3>What is collectl doing about this?</h3>
While there is nothing collectl can do to reduce this overhead, starting with the next release it will
include code that, at startup, times the reading of /proc on newer kernels with core counts of 32 or
more and warns users if their systems exhibit this problem.  It will also skip this test for RHEL 6.2
and SLES SP1, and as I hear of other distros that have also fixed this problem,
I'll exclude them from the test as well.
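<p>
To give a feel for the idea, here is a minimal shell sketch of such a check.  To be clear, this is
<i>not</i> collectl's actual code; the iteration count and the 1 millisecond warning threshold are
purely illustrative, and it assumes a GNU <i>date</i> that supports nanoseconds:
<div class=terminal><pre>
#!/bin/sh
# Hypothetical startup check, NOT collectl's actual code: time N reads
# of /proc/stat and warn if the average cost per read looks high.
N=100
START=$(date +%s%N)                   # wall time in nanoseconds (GNU date)
i=0
while [ "$i" -lt "$N" ]; do
    # read the whole file without forking a new process on each pass
    while read -r line; do :; done &lt; /proc/stat
    i=$((i + 1))
done
END=$(date +%s%N)
USEC=$(( (END - START) / N / 1000 ))  # average microseconds per read
echo "average /proc/stat read: ${USEC} usec"
# the 1000 usec threshold is an illustrative guess, not collectl's value
if [ "$USEC" -gt 1000 ]; then
    echo "WARNING: /proc/stat reads look slow on this system"
fi
</pre></div>
Run against the two systems above, this would report a few microseconds per read on the healthy
box and several thousand on the affected one.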

<p><table width=100%><tr><td align=right><i>updated Jan 18, 2012</i></td></tr></table>

</body>
</html>