File: Bandwidth_and_Cache_Affinity.htm

<!DOCTYPE html
  PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<!-- saved from url=(0014)about:internet -->
<html xmlns:MSHelp="http://www.microsoft.com/MSHelp/" lang="en-us" xml:lang="en-us"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

<meta name="DC.Type" content="topic">
<meta name="DC.Title" content="Bandwidth and Cache Affinity">
<meta name="DC.subject" content="Bandwidth and Cache Affinity">
<meta name="keywords" content="Bandwidth and Cache Affinity">
<meta name="DC.Relation" scheme="URI" content="../tbb_userguide/parallel_for.htm">
<meta name="DC.Format" content="XHTML">
<meta name="DC.Identifier" content="tutorial_Bandwidth_and_Cache_Affinity">
<link rel="stylesheet" type="text/css" href="../intel_css_styles.css">
<title>Bandwidth and Cache Affinity</title>
<xml>
<MSHelp:Attr Name="DocSet" Value="Intel"></MSHelp:Attr>
<MSHelp:Attr Name="Locale" Value="kbEnglish"></MSHelp:Attr>
<MSHelp:Attr Name="TopicType" Value="kbReference"></MSHelp:Attr>
</xml>
</head>
<body id="tutorial_Bandwidth_and_Cache_Affinity">
 <!-- ==============(Start:NavScript)================= -->
 <script src="..\NavScript.js" language="JavaScript1.2" type="text/javascript"></script>
 <script language="JavaScript1.2" type="text/javascript">WriteNavLink(1);</script>
 <!-- ==============(End:NavScript)================= -->
<a name="tutorial_Bandwidth_and_Cache_Affinity"><!-- --></a>

 
  <h1 class="topictitle1">Bandwidth and Cache Affinity</h1>
 
   
  <div> 
	 <p>For a sufficiently simple function 
		<samp class="codeph">Foo</samp>, the examples might not show good speedup when
		written as parallel loops. The cause could be insufficient system bandwidth
		between the processors and memory. In that case, you may have to rethink your
		algorithm to take better advantage of cache. Restructuring to better utilize
		the cache usually benefits the parallel program as well as the serial program. 
	 </p>
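 
	 <p>As a simple illustration (a hypothetical sketch, not part of the
		original example), fusing two separate passes over the same array into
		one pass halves the memory traffic, because each element is updated
		while it is still in cache. Restructuring of this kind helps the serial
		code and the parallel code alike. 
	 </p>
 
	 <pre>#include &lt;cstddef&gt;
&nbsp;
// Before: two passes, each of which streams the whole array through memory.
void TwoPasses( float a[], std::size_t n ) {
    for( std::size_t i=0; i&lt;n; ++i ) a[i] = 2*a[i];        // pass 1
    for( std::size_t i=0; i&lt;n; ++i ) a[i] = a[i] + 1.0f;   // pass 2
}
&nbsp;
// After: one fused pass touches each element only once while it is in cache.
void OnePass( float a[], std::size_t n ) {
    for( std::size_t i=0; i&lt;n; ++i ) a[i] = 2*a[i] + 1.0f;
}</pre> 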
 
	 <p>An alternative to restructuring that works in some cases is 
		<samp class="codeph">affinity_partitioner</samp>. It not only
		automatically chooses the grainsize, but also optimizes for cache
		affinity. Using 
		<samp class="codeph">affinity_partitioner</samp> can significantly
		improve performance when: 
	 </p>
 
	 <ul type="disc"> 
		<li> 
		  <p>The computation does only a few operations per data access. 
		  </p>
 
		</li>
 
		<li> 
		  <p>The data acted upon by the loop fits in cache. 
		  </p>
 
		</li>
 
		<li> 
		  <p>The loop, or a similar loop, is re-executed over the same data. 
		  </p>
 
		</li>
 
		<li> 
		  <p>There are more than two hardware threads available. If only two
			 threads are available, the default scheduling in Intel&reg; Threading Building
			 Blocks (Intel&reg; TBB) usually provides sufficient cache affinity. 
		  </p>
 
		</li>
 
	 </ul>
 
	 <p>The following code shows how to use 
		<samp class="codeph">affinity_partitioner</samp>. 
	 </p>
 
	 <pre>#include "tbb/tbb.h"
&nbsp;
void ParallelApplyFoo( float a[], size_t n ) {
    static affinity_partitioner ap;
    parallel_for(blocked_range&lt;size_t&gt;(0,n), ApplyFoo(a), ap);
}
&nbsp;
void TimeStepFoo( float a[], size_t n, int steps ) {    
    for( int t=0; t&lt;steps; ++t )
        ParallelApplyFoo( a, n );
}</pre> 
	 <p>In the example, the 
		<samp class="codeph">affinity_partitioner</samp> object 
		<samp class="codeph">ap</samp> lives between loop iterations. It remembers where
		iterations of the loop ran, so that each iteration can be handed to the same
		thread that executed it before. The example code gets the lifetime of the
		partitioner right by declaring the 
		<samp class="codeph">affinity_partitioner</samp> as a local static object. Another
		approach would be to declare it at a scope outside the iterative loop in 
		<samp class="codeph">TimeStepFoo</samp>, and hand it down the call chain to 
		<samp class="codeph">parallel_for</samp>. 
	 </p>
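 
	 <p>That alternative might look like the following sketch. It is a
		hypothetical variant of the example above, not code from the original
		text: the partitioner is created once in 
		<samp class="codeph">TimeStepFoo</samp> and passed by reference down the
		call chain to 
		<samp class="codeph">parallel_for</samp>, so its state persists across
		the time steps. 
	 </p>
 
	 <pre>// Sketch: the affinity_partitioner is owned by TimeStepFoo and handed
// down by reference, so it outlives every invocation of the inner loop.
void ParallelApplyFoo( float a[], size_t n, affinity_partitioner&amp; ap ) {
    parallel_for(blocked_range&lt;size_t&gt;(0,n), ApplyFoo(a), ap);
}
&nbsp;
void TimeStepFoo( float a[], size_t n, int steps ) {
    affinity_partitioner ap;   // lives for the whole sequence of time steps
    for( int t=0; t&lt;steps; ++t )
        ParallelApplyFoo( a, n, ap );
}</pre> 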
 
	 <p>If the data does not fit across the system’s caches, there may be
		little benefit from affinity. The following figure illustrates the
		possible situations. 
	 </p>
 
	 <div class="fignone" id="fig3"><a name="fig3"><!-- --></a><span class="figcap">Benefit of Affinity Determined by Relative Size of Data Set and
		  Cache</span> 
		<br><img src="Images/image007.jpg" width="453" height="178"><br> 
	 </div>
 
	 <p>The next figure shows how parallel speedup might vary with the size of
		a data set. The computation for the example is 
		<samp class="codeph">A[i]+=B[i]</samp> for 
		<samp class="codeph">i</samp> in the range [0,N). It was chosen for
		dramatic effect; you are unlikely to see quite this much variation in
		your own code. The graph shows little improvement at the extremes. For
		small N, parallel scheduling overhead dominates, resulting in little
		speedup. For large N, the data set is too large to be carried in cache
		between loop invocations. The peak in the middle is the sweet spot for
		affinity. Hence 
		<samp class="codeph">affinity_partitioner</samp> should be considered a
		tool, not a cure-all, when there is a low ratio of computations to
		memory accesses. 
	 </p>
 
	 <div class="fignone" id="fig4"><a name="fig4"><!-- --></a><span class="figcap">Improvement from Affinity Dependent on Array Size</span> 
		<br><img src="Images/image008.jpg" width="551" height="192"><br> 
	 </div>
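 
	 <p>A minimal sketch of the kernel behind the preceding figure is shown
		below. The array names 
		<samp class="codeph">A</samp> and 
		<samp class="codeph">B</samp> and the function name are illustrative,
		not the original benchmark code; the point is that the same 
		<samp class="codeph">affinity_partitioner</samp> is reused each time the
		loop re-executes over the same data. 
	 </p>
 
	 <pre>#include "tbb/tbb.h"
using namespace tbb;
&nbsp;
// Illustrative sketch of the A[i] += B[i] kernel.  The static partitioner
// is reused on every call, so chunks of the arrays tend to return to the
// thread that touched them in the previous invocation.
void AddInto( float* A, const float* B, size_t N ) {
    static affinity_partitioner ap;
    parallel_for( blocked_range&lt;size_t&gt;(0,N),
                  [=]( const blocked_range&lt;size_t&gt;&amp; r ) {
                      for( size_t i=r.begin(); i!=r.end(); ++i )
                          A[i] += B[i];
                  },
                  ap );
}</pre> 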
 
<div class="tablenoborder"><table cellpadding="4" summary="" frame="border" border="1" cellspacing="0" rules="all"> 
		   
		  <thead align="left">
			 <tr>
				<th class="cellrowborder" align="left" valign="top" width="100%" id="d128272e134">
				  <p>Optimization Notice
				  </p>

				</th>

			 </tr>
</thead>
 
		  <tbody> 
			 <tr> 
				<td class="bgcolor(#ccecff)" bgcolor="#ccecff" valign="top" width="100%" headers="d128272e134 ">
				  Intel's compilers may or may not optimize to the same degree for non-Intel
				  microprocessors for optimizations that are not unique to Intel microprocessors.
				  These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other
				  optimizations. Intel does not guarantee the availability, functionality, or
				  effectiveness of any optimization on microprocessors not manufactured by Intel.
				  Microprocessor-dependent optimizations in this product are intended for use
				  with Intel microprocessors. Certain optimizations not specific to Intel
				  microarchitecture are reserved for Intel microprocessors. Please refer to the
				  applicable product User and Reference Guides for more information regarding the
				  specific instruction sets covered by this notice. 
				  <p>Notice revision #20110804 
				  </p>

				</td>
 
			 </tr>
 
		  </tbody>
 
		</table>
</div>
 
 
  </div>
 

<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong>&nbsp;<a href="../tbb_userguide/parallel_for.htm">parallel_for</a></div>
</div>
<div></div>

</body>
</html>