<!DOCTYPE html
PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<!-- saved from url=(0014)about:internet -->
<html xmlns:MSHelp="http://www.microsoft.com/MSHelp/" lang="en-us" xml:lang="en-us"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="DC.Type" content="topic">
<meta name="DC.Title" content="Bandwidth and Cache Affinity">
<meta name="DC.subject" content="Bandwidth and Cache Affinity">
<meta name="keywords" content="Bandwidth and Cache Affinity">
<meta name="DC.Relation" scheme="URI" content="../tbb_userguide/parallel_for.htm">
<meta name="DC.Format" content="XHTML">
<meta name="DC.Identifier" content="tutorial_Bandwidth_and_Cache_Affinity">
<link rel="stylesheet" type="text/css" href="../intel_css_styles.css">
<title>Bandwidth and Cache Affinity</title>
<xml>
<MSHelp:Attr Name="DocSet" Value="Intel"></MSHelp:Attr>
<MSHelp:Attr Name="Locale" Value="kbEnglish"></MSHelp:Attr>
<MSHelp:Attr Name="TopicType" Value="kbReference"></MSHelp:Attr>
</xml>
</head>
<body id="tutorial_Bandwidth_and_Cache_Affinity">
<!-- ==============(Start:NavScript)================= -->
<script src="..\NavScript.js" language="JavaScript1.2" type="text/javascript"></script>
<script language="JavaScript1.2" type="text/javascript">WriteNavLink(1);</script>
<!-- ==============(End:NavScript)================= -->
<a name="tutorial_Bandwidth_and_Cache_Affinity"><!-- --></a>
<h1 class="topictitle1">Bandwidth and Cache Affinity</h1>
<div>
<p>For a sufficiently simple function
<samp class="codeph">Foo</samp>, the examples might not show good speedup when
written as parallel loops. The cause could be insufficient system bandwidth
between the processors and memory. In that case, you may have to rethink your
algorithm to take better advantage of cache. Restructuring to better utilize
the cache usually benefits the parallel program as well as the serial program.
</p>
<p>An alternative to restructuring that works in some cases is
<samp class="codeph">affinity_partitioner</samp>. It not only automatically chooses
the grainsize, but also optimizes for cache affinity. Using
<samp class="codeph">affinity_partitioner</samp> can significantly improve
performance when:
</p>
<ul type="disc">
<li>
<p>The computation does only a few operations per data access.
</p>
</li>
<li>
<p>The data acted upon by the loop fits in cache.
</p>
</li>
<li>
<p>The loop, or a similar loop, is re-executed over the same data.
</p>
</li>
<li>
<p>There are more than two hardware threads available. If only two
threads are available, the default scheduling in Intel® Threading Building
Blocks (Intel® TBB) usually provides sufficient cache affinity.
</p>
</li>
</ul>
<p>The following code shows how to use
<samp class="codeph">affinity_partitioner</samp>.
</p>
<pre>#include "tbb/tbb.h"
using namespace tbb;

void ParallelApplyFoo( float a[], size_t n ) {
    // static: the partitioner must outlive each call to keep its affinity history
    static affinity_partitioner ap;
    parallel_for(blocked_range&lt;size_t&gt;(0,n), ApplyFoo(a), ap);
}

void TimeStepFoo( float a[], size_t n, int steps ) {
    for( int t=0; t&lt;steps; ++t )
        ParallelApplyFoo( a, n );
}</pre>
<p>In the example, the
<samp class="codeph">affinity_partitioner</samp> object
<samp class="codeph">ap</samp> persists between executions of the parallel loop. It
remembers where iterations of the loop ran, so that each iteration can be handed to
the same thread that executed it before. The example code gets the lifetime of the
partitioner right by declaring the
<samp class="codeph">affinity_partitioner</samp> as a local static object. Another
approach would be to declare it at a scope outside the iterative loop in
<samp class="codeph">TimeStepFoo</samp>, and pass it down the call chain to
<samp class="codeph">parallel_for</samp>.
</p>
<p>If the data set does not fit across the system’s caches, affinity may provide
little benefit. The following figure illustrates how the benefit depends on the size
of the data set relative to the cache.
</p>
<div class="fignone" id="fig3"><a name="fig3"><!-- --></a><span class="figcap">Benefit of Affinity Determined by Relative Size of Data Set and
Cache</span>
<br><img src="Images/image007.jpg" width="453" height="178"><br>
</div>
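<p>Whether a data set fits across the caches can be estimated with simple
arithmetic. The following sketch is only a rough guide: it assumes each core has a
private cache of equal size, and the cache size and core count are values you supply,
not values queried from the hardware. The function names are hypothetical.
</p>

```cpp
#include <cstddef>

// Bytes touched per sweep by a loop over `arrays` arrays of `n` elements each.
constexpr std::size_t WorkingSetBytes(std::size_t n, std::size_t arrays,
                                      std::size_t element_size) {
    return n * arrays * element_size;
}

// True when the working set fits in the combined cache of all cores.
// cache_bytes_per_core and cores are caller-supplied assumptions.
constexpr bool FitsAcrossCaches(std::size_t working_set_bytes,
                                std::size_t cache_bytes_per_core,
                                std::size_t cores) {
    return working_set_bytes <= cache_bytes_per_core * cores;
}
```

<p>For example, a loop over two arrays of 1,000,000 floats touches 8,000,000 bytes
per sweep. Under these assumptions, that would not fit in four caches of 1,048,576
bytes each (4,194,304 bytes in total), so affinity would be expected to help little.
</p>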
<p>The next figure shows how parallel speedup might vary with the size of a
data set. The computation for the example is
<samp class="codeph">A[i]+=B[i]</samp> for
<samp class="codeph">i</samp> in the range [0,N). It was chosen for dramatic effect;
you are unlikely to see quite this much variation in your code. The graph shows
little improvement at the extremes. For small N, parallel scheduling overhead
dominates, resulting in little speedup. For large N, the data set is too large
to be carried in cache between loop invocations. The peak in the middle is the
sweet spot for affinity. Hence
<samp class="codeph">affinity_partitioner</samp> should be considered a tool, not a
cure-all, when there is a low ratio of computations to memory accesses.
</p>
<div class="fignone" id="fig4"><a name="fig4"><!-- --></a><span class="figcap">Improvement from Affinity Dependent on Array Size</span>
<br><img src="Images/image008.jpg" width="551" height="192"><br>
</div>
<p>
<div class="tablenoborder"><table cellpadding="4" summary="" frame="border" border="1" cellspacing="0" rules="all">
<thead align="left">
<tr>
<th class="cellrowborder" align="left" valign="top" width="100%" id="d128272e134">
<p>Optimization Notice
</p>
</th>
</tr>
</thead>
<tbody>
<tr>
<td class="bgcolor(#ccecff)" bgcolor="#ccecff" valign="top" width="100%" headers="d128272e134 ">
Intel's compilers may or may not optimize to the same degree for non-Intel
microprocessors for optimizations that are not unique to Intel microprocessors.
These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other
optimizations. Intel does not guarantee the availability, functionality, or
effectiveness of any optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use
with Intel microprocessors. Certain optimizations not specific to Intel
microarchitecture are reserved for Intel microprocessors. Please refer to the
applicable product User and Reference Guides for more information regarding the
specific instruction sets covered by this notice.
<p>Notice revision #20110804
</p>
</td>
</tr>
</tbody>
</table>
</div>
</p>
</div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="../tbb_userguide/parallel_for.htm">parallel_for</a></div>
</div>
<div></div>
</body>
</html>