1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227
|
<html>
<head>
<link rel=stylesheet href="style.css" type="text/css">
<title>Tutorial - Lustre</title>
</head>
<body>
<body>
<center>
<h1>Tutorial - Lustre</h1>
</center>
<p>
<h3>Introduction</h3>
This tutorial is intended to help you get started monitoring a system on which the
<a href=http://wiki.lustre.org>lustre</a> filesystem has been installed. This is <i>not</i>
intended to be a tutorial on how to use such a filesystem. In terms of the basics,
there's really not much to say other then tell collectl to monitor lustre by
specifying an <i>l</i> with <i>-s</i> along with any other subsystems you may want to monitor.
<p>
As you should be aware of by now, collectl will try to display everything you've
selected in <i>brief</i> format if it can, writing all your data out on a single
line for each sampling interval.
This can get quite wide depending on how many subsystems you choose to monitor and
while you can cerainly specify <i>collectl -s+l</i> to add lustre to the default
subsystems, the output width is too cumbersome for this tutorial.
Therfore I'll use some other subsystems and switches in the examples
to mix things up, showing that there are a lot of possible combinations. In this
first example, we see what happens when you run collectl on a lustre client and
request cpu and memory data along with lustre.
<div class=terminal-wide14><pre>
$ collectl -scml
#<--------CPU--------><-----------Memory----------><-------Lustre Client------>
#cpu sys inter ctxsw free buff cach inac slab map Reads KBRead Writes KBWrite
0 0 101 26 3G 6M 29M 12M 43M 65M 0 0 0 0
</pre></div>
<p>
Collectl is actually very intelligent about dealing with lustre because if you
were to run the identical command on an OST, it would recognize that too and
change what it shows accordingly as you can see below.
<div class=terminal-wide14><pre>
$ collectl -scdl
#<--------CPU--------><-----------Disks-----------><--------Lustre OST------->
#cpu sys inter ctxsw KBRead Reads KBWrit Writes KBRead Reads KBWrit Writes
0 0 100 28 0 0 0 0 0 0 0 0
</pre></div>
In fact, if you're also running a client on an OST it will show both!
I've also included time stamps to make the output a little more intersting.
<div class=terminal-wide14><pre>
$ collectl -scl -oT
# <--------CPU--------><--------Lustre OST-------><-------Lustre Client------>
#Time cpu sys inter ctxsw KBRead Reads KBWrit Writes Reads KBRead Writes KBWrite
14:35:32 0 0 103 24 0 0 0 0 0 0 0 0
14:35:33 0 0 123 53 0 0 0 0 0 0 0 0
</pre></div>
As expected, you can run collectl on a system that has any combination of MDS, OST and
client services and it will show you what you want to see. For more detail on some of
the more advanced concepts beyond the scope of this tutorial you can read more about
how collectl deals with Lustre <a href=Lustre.html>here</a>.
<p>
And finally, don't forget any of this data can be written to a file for continuous
logging, played back later and even converted to a format suitable for plotting.
In fact collectl is configured to monitor lustre by default when run as a daemon
so all you need to do to begin collecting the basic data shown above is
<i>service collectl start</i>.
<p>
<h3>Beyond the Basics</h3>
For many users of collectl, you are now sufficiently equipped to tackle most lustre
monitoring tasks, but for those more exotic situations there's so much more you can
do.
<p><B>CLIENTS</b><p>
Let's start off by looking more closely at client data. As with all other collectl
data, one can switch between summary and detail data by simply entering an
upper case <i>L</i> instead of a lower case one as I've done below.
The actual content of what is
displayed will depend on whether you are on a client, OSS (there is currently no
detail data for an MDS), but as you can see for clients, this data is broken down
by filesystem and just to make the display a little more interesting I decided to
show the timestamps in milli-seconds. Naturally if there is only one filesystem,
the data with match that displayed in summary mode.
<div class=terminal-wide14><pre>
$ collectl -sL -oTm
# LUSTRE CLIENT DETAIL
# Filsys Reads ReadKB Writes WriteKB
15:10:06.009 spfs1 0 0 0 0
15:10:06.009 spfs2 0 0 0 0
</pre></div>
Just to take it one step futher, it turns out that lustre actually tracks client I/O by
individual OSTs and sometimes that is more interesting, so there is a special option
for lustre clients, <i>--lustopts O</i>.
<div class=terminal-wide14><pre>
$ collectl -sL --lustopts O -oTm
# LUSTRE CLIENT DETAIL
# Filsys Ost Reads ReadKB Writes WriteKB
15:19:17.007 spfs1 OST0000 0 0 0 0
15:19:17.007 spfs1 OST0001 0 0 0 0
15:19:17.007 spfs2 OST0000 0 0 0 0
15:19:17.007 spfs2 OST0001 0 0 0 0
</pre></div>
So what else can we look at? Lots! Lustre tracks <i>readahead</i> data which you can
show in brief format like this by using the <i>R</i> option noting this time we're
also requesting date/time stamps be included. Here we see the cache hits/misses
added to the brief display for the client.
<div class=terminal-wide14><pre>
$ collectl -sl --lustopts R -oD
# <-------------Lustre Client-------------->
#Date Time Reads KBRead Writes KBWrite Hits Misses
20080319 15:20:12 0 0 0 0 0 0
20080319 15:20:13 0 0 0 0 0 0
</pre></div>
or you can look at it in verbose format like this:
<div class=terminal-wide14><pre>
$ collectl -sl --lustopts R --verbose
# LUSTRE CLIENT SUMMARY: READAHEAD
# Reads ReadKB Writes WriteKB Pend Hits Misses NotCon MisWin LckFal Discrd ZFile ZerWin RA2Eof HitMax
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
</pre></div>
Lustre also tracks client metadata, which you can request by specifying <i>--lustopts M</i> and
BRW stats which are selected by <i>--lustopts B</i>. However both are so wide they only
have a verbose form and the following shows both being displayed at the same time.
<div class=terminal-wide14><pre>
$ collectl -sl --lustopts BM
# LUSTRE CLIENT SUMMARY: RPC-BUFFERS (pages) METADATA
#Rds RdK 1P 2P 4P 8P 16P 32P 64P 128P 256P Wrts WrtK 1P 2P 4P 8P 16P 32P 64P 128P 256P
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# Reads ReadKB Writes WriteKB Open Close GAttr SAttr Seek Fsynk DrtHit DrtMis
0 0 0 0 0 0 0 0 0 0 0 0
</pre></div>
And if that's not enough, these even have a detailed mode of display and you can look
at them by filesystem or OST. You can specify any combinations of B,M and R with --lustopts
to show the associated data in both summary or detail formats, though only <i>readahead</i>
appears in the brief display. All others will force verbose format.
<p><b>Object Storage Server</b><p>
If you haven't figured it out yet, it's also possible to display OST level information
on an OSS by simply using the uppercase subsystem specification as you can see here,
noting just to be different I've chosen a second form of the date/timestamp.
<div class=terminal-wide14><pre>
$ collectl -sL -od
# LUSTRE FILESYSTEM SINGLE OST STATISTICS
# Ost Read Ops Read KB Write Ops Write KB
03/19 15:32:14 spfs1-OST0000 0 0 0 0
03/19 15:32:14 spfs2-OST0000 0 0 0 0
03/19 15:32:15 spfs1-OST0000 0 0 0 0
03/19 15:32:15 spfs2-OST0000 0 0 0 0
</pre></div>
You can also show BRW stats as both summary and detail data as well and since they
look identical to the way they're displayed for a client, I won't bother repeating
those forms here.
<p><b>Metadata Server</b><p>
As mentioned earlier, there is no detail data for an MDS nor are there any other
types of data other than that which can be displayed in summary mode.
<p>
<h3>Summary</h3>
As you have seen, there is a wealth of data here and often you may not even know what
you want to look for. When such is the case, just collect as much as you can and save
it in a file. Then you can play it back later and display it in multiple formats.
<p>
I just want to close with an example of a problem with readaheads and Lustre 1.4
which demonstrates the power of a tool
like collectl that can show multiple types of data at once because I know a lot of
people may still not be convinced. Consider the following sample which was collected while
doing random 32KB reads of a large file. See anything wrong? Do you know why the
network bandwidth is so much higher than the lustre client read rate? How long would it
have taken to even realize there is a problem?
<div class=terminal-wide14><pre>
$ collectl -snl -oT
# <----------Network----------><-------------Lustre Client-------------->
#Time netKBi pkt-in netKBo pkt-out Reads KBRead Writes KBWrite Hits Misses
08:14:18 41776 28310 1065 14786 50 200 0 0 30 20
08:14:19 38328 25987 1032 14078 62 248 0 0 35 19
08:14:20 44763 30337 1167 16114 58 232 0 0 30 20
08:14:21 43666 29596 1137 15632 46 184 0 0 30 16
08:14:22 33777 22905 891 12191 58 232 0 0 35 23
</pre></div>
It turned out the old algorithm triggered a readahead after 2 consecutive pages were read
and every 32KB read (which is 8 pages)
resulted in 1MB being read over the network, loaded in to cache and was then discarded!
If the value of <i>max_read_ahead_mb</i> is set to 0 for the associated filesystem on
<i>each</i> client involved, you see 2 things - the network rate drops to track the
client read rate and the <i>misses</i> goes to zero.
<div class=terminal-wide14><pre>
$ collectl -snl -oT
# <----------Network----------><-------------Lustre Client-------------->
#Time netKBi pkt-in netKBo pkt-out Reads KBRead Writes KBWrite Hits Misses
08:21:22 0 3 0 3 59 236 0 0 0 0
08:21:23 317 335 58 298 91 364 0 0 0 0
08:21:24 442 457 77 388 107 428 0 0 0 0
08:21:25 442 457 77 383 97 388 0 0 0 0
08:21:26 432 446 75 373 89 356 0 0 0 0
</pre></div>
<i>Caution - changing the value of the readable variable in /proc to 0 will eliminate
all readahead for </i>all readers<i> meaning any applications that do sequential
reads could have significant performance problems. In other words, this is <i>not</i> a
recommendation to turn off readahead but rather to be aware of what it does and if you do
choose to turn it off to be aware of the consequences</i>
<p>
<i><b>Note: This readahead algorithm has changed with V1.6 and lustre no longer triggers readahead
on the third page read but rather on the third sequential read.</b></i>
<table width=100%><tr><td align=right><i>updated Mar 25, 2010</i></td></tr></colgroup></table>
</body>
</html>
|