1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302
|
<!DOCTYPE html
PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<!-- saved from url=(0014)about:internet -->
<html xmlns:MSHelp="http://www.microsoft.com/MSHelp/" lang="en-us" xml:lang="en-us"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="DC.Type" content="topic">
<meta name="DC.Title" content="Controlling Chunking">
<meta name="DC.subject" content="Controlling Chunking">
<meta name="keywords" content="Controlling Chunking">
<meta name="DC.Relation" scheme="URI" content="../tbb_userguide/parallel_for.htm">
<meta name="DC.Relation" scheme="URI" content="Automatic_Chunking.htm#tutorial_Automatic_Chunking">
<meta name="DC.Relation" scheme="URI" content="Bandwidth_and_Cache_Affinity.htm#tutorial_Bandwidth_and_Cache_Affinity">
<meta name="DC.Format" content="XHTML">
<meta name="DC.Identifier" content="tutorial_Controlling_Chunking">
<link rel="stylesheet" type="text/css" href="../intel_css_styles.css">
<title>Controlling Chunking</title>
<xml>
<MSHelp:Attr Name="DocSet" Value="Intel"></MSHelp:Attr>
<MSHelp:Attr Name="Locale" Value="kbEnglish"></MSHelp:Attr>
<MSHelp:Attr Name="TopicType" Value="kbReference"></MSHelp:Attr>
</xml>
</head>
<body id="tutorial_Controlling_Chunking">
<!-- ==============(Start:NavScript)================= -->
<script src="..\NavScript.js" language="JavaScript1.2" type="text/javascript"></script>
<script language="JavaScript1.2" type="text/javascript">WriteNavLink(1);</script>
<!-- ==============(End:NavScript)================= -->
<a name="tutorial_Controlling_Chunking"><!-- --></a>
<h1 class="topictitle1">Controlling Chunking </h1>
<div>
<p>Chunking is controlled by a
<em>partitioner</em> and a
<em>grainsize.</em>To gain the most control over chunking, you specify
both.
</p>
<ul type="disc">
<li>
<p>Specify
<samp class="codeph">simple_partitioner()</samp> as the third argument to
<samp class="codeph">parallel_for</samp>. Doing so turns off automatic chunking.
</p>
</li>
<li>
<p>Specify the grainsize when constructing the range. The thread
argument form of the constructor is
<samp class="codeph">blocked_range<<var>T</var>>(<em>begin</em>,<em>end</em>,<em>grainsize</em>)</samp>.
The default value of
<var>grainsize</var> is 1. It is in units of loop iterations
per chunk.
</p>
</li>
</ul>
<p>If the chunks are too small, the overhead may exceed the performance
advantage.
</p>
<p>The following code is the last example from
<span class="keyword">parallel_for</span>, modified to use an explicit grainsize
<samp class="codeph">G</samp>. Additions are shown in
<samp class="codeph"><span style="color:blue"><strong>bold font</strong></span></samp>.
</p>
<pre>#include "tbb/tbb.h"
void ParallelApplyFoo( float a[], size_t n ) {
parallel_for(blocked_range<size_t>(0,n<span style="color:blue"><strong>,G</strong></span>), ApplyFoo(a)<span style="color:blue">,</span>
simple_partitioner());
}</pre>
<p>The grainsize sets a minimum threshold for parallelization. The
<samp class="codeph">parallel_for</samp> in the example invokes
<samp class="codeph">ApplyFoo::operator()</samp> on chunks, possibly of different
sizes. Let
<em>chunksize</em> be the number of iterations in a chunk. Using
<samp class="codeph">simple_partitioner</samp> guarantees that
<span class="eqsymbol">⌈</span>G/2<span class="eqsymbol">⌉</span>
<span class="eqsymbol">≤</span>
<em>chunksize</em>
<span class="eqsymbol">≤</span> G.
</p>
<p>There is also an intermediate level of control where you specify the
grainsize for the range, but use an
<samp class="codeph">auto_partitioner</samp> and
<samp class="codeph">affinity_partitioner</samp>. An
<samp class="codeph">auto_partitioner</samp> is the default partitioner. Both
partitioners implement the automatic grainsize heuristic described in
<strong>Automatic Chunking</strong>. An
<samp class="codeph">affinity_partitioner</samp> implies an additional hint, as
explained later in Section
<strong>Bandwidth and Cache Affinity</strong>. Though these partitioners may cause
chunks to have more than G iterations, they never generate chunks with less
than
<span class="eqsymbol">⌈</span>G/2<span class="eqsymbol">⌉</span>
iterations. Specifying a range with an explicit grainsize may occasionally be
useful to prevent these partitioners from generating wastefully small chunks if
their heuristics fail.
</p>
<p>Because of the impact of grainsize on parallel loops, it is worth
reading the following material even if you rely on
<samp class="codeph">auto_partitioner</samp> and
<samp class="codeph">affinity_partitioner</samp> to choose the grainsize
automatically.
</p>
<div class="tablenoborder"><a name="fig1"><!-- --></a><table cellpadding="4" summary="" id="fig1" frame="void" border="1" rules="none" cellspacing="2"><caption><span class="tablecap">Packaging Overhead Versus Grainsize</span></caption>
<tbody>
<tr>
<td class="noborder" align="center" valign="middle" width="NaN%"><br><img width="161" height="163" src="Images/image002.jpg"><br>
</td>
<td class="noborder" align="center" valign="middle" width="NaN%"><br><img width="157" height="144" src="Images/image004.jpg"><br>
</td>
</tr>
<tr>
<td class="noborder" align="center" valign="top" width="NaN%">
<p>Case A
</p>
</td>
<td class="noborder" align="center" valign="top" width="NaN%">
<p>Case B
</p>
</td>
</tr>
</tbody>
</table>
</div>
<p>The above figure illustrates the impact of grainsize by showing the
useful work as the gray area inside a brown border that represents overhead.
Both Case A and Case B have the same total gray area. Case A shows how too
small a grainsize leads to a relatively high proportion of overhead. Case B
shows how a large grainsize reduces this proportion, at the cost of reducing
potential parallelism. The overhead as a fraction of useful work depends upon
the grainsize, not on the number of grains. Consider this relationship and not
the total number of iterations or number of processors when setting a
grainsize.
</p>
<p>A rule of thumb is that
<samp class="codeph">grainsize</samp> iterations of
<samp class="codeph">operator()</samp> should take at least 100,000 clock cycles to
execute. For example, if a single iteration takes 100 clocks, then the
<samp class="codeph">grainsize</samp> needs to be at least 1000 iterations. When in
doubt, do the following experiment:
</p>
<ol>
<li>
<p>Set the
<samp class="codeph">grainsize</samp> parameter higher than necessary. The
grainsize is specified in units of loop iterations. If you have no idea of how
many clock cycles an iteration might take, start with
<samp class="codeph">grainsize</samp>=100,000. The rationale is that each
iteration normally requires at least one clock per iteration. In most cases,
step 3 will guide you to a much smaller value.
</p>
</li>
<li>
<p>Run your algorithm.
</p>
</li>
<li>
<p>Iteratively halve the
<var>grainsize</var> parameter and see how much the algorithm
slows down or speeds up as the value decreases.
</p>
</li>
</ol>
<p>A drawback of setting a grainsize too high is that it can reduce
parallelism. For example, if the grainsize is 1000 and the loop has 2000
iterations, the
<samp class="codeph">parallel_for</samp> distributes the loop across only two
processors, even if more are available. However, if you are unsure, err on the
side of being a little too high instead of a little too low, because too low a
value hurts serial performance, which in turns hurts parallel performance if
there is other parallelism available higher up in the call tree.
</p>
<div class="Note"><h3 class="NoteTipHead">
Tip</h3>
<p>You do not have to set the grainsize too precisely.
</p>
</div>
<p>The next figure shows the typical "bathtub curve" for execution time
versus grainsize, based on the floating point
<samp class="codeph">a[i]=b[i]*c</samp> computation over a million indices. There is
little work per iteration. The times were collected on a four-socket machine
with eight hardware threads.
</p>
<div class="fignone" id="fig2"><a name="fig2"><!-- --></a><span class="figcap">Wall Clock Time Versus Grainsize </span>
<br><img width="462" height="193" src="Images/image006.jpg"><br>
</div>
<p>The scale is logarithmic. The downward slope on the left side indicates
that with a grainsize of one, most of the overhead is parallel scheduling
overhead, not useful work. An increase in grainsize brings a proportional
decrease in parallel overhead. Then the curve flattens out because the parallel
overhead becomes insignificant for a sufficiently large grainsize. At the end
on the right, the curve turns up because the chunks are so large that there are
fewer chunks than available hardware threads. Notice that a grainsize over the
wide range 100-100,000 works quite well.
</p>
<div class="Note"><h3 class="NoteTipHead">
Tip</h3>
<p>A general rule of thumb for parallelizing loop nests is to parallelize
the outermost one possible. The reason is that each iteration of an outer loop
is likely to provide a bigger grain of work than an iteration of an inner loop.
</p>
</div>
<p>
<div class="tablenoborder"><table cellpadding="4" summary="" frame="border" border="1" cellspacing="0" rules="all">
<thead align="left">
<tr>
<th class="cellrowborder" align="left" valign="top" width="100%" id="d130143e300">
<p>Optimization Notice
</p>
</th>
</tr>
</thead>
<tbody>
<tr>
<td class="bgcolor(#ccecff)" bgcolor="#ccecff" valign="top" width="100%" headers="d130143e300 ">
Intel's compilers may or may not optimize to the same degree for non-Intel
microprocessors for optimizations that are not unique to Intel microprocessors.
These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other
optimizations. Intel does not guarantee the availability, functionality, or
effectiveness of any optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use
with Intel microprocessors. Certain optimizations not specific to Intel
microarchitecture are reserved for Intel microprocessors. Please refer to the
applicable product User and Reference Guides for more information regarding the
specific instruction sets covered by this notice.
<p>Notice revision #20110804
</p>
</td>
</tr>
</tbody>
</table>
</div>
</p>
</div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="../tbb_userguide/parallel_for.htm">parallel_for</a></div>
</div>
<div class="See Also">
<h2>See Also</h2>
<div class="linklist">
<div><a href="Automatic_Chunking.htm#tutorial_Automatic_Chunking">Automatic Chunking
</a></div>
<div><a href="Bandwidth_and_Cache_Affinity.htm#tutorial_Bandwidth_and_Cache_Affinity">Bandwidth and Cache Affinity
</a></div></div>
</div>
</body>
</html>
|