File: Reduction.htm

package info (click to toggle)
tbb 4.2~20140122-5
  • links: PTS
  • area: main
  • in suites: jessie, jessie-kfreebsd
  • size: 21,492 kB
  • ctags: 21,278
  • sloc: cpp: 92,813; ansic: 9,775; asm: 1,070; makefile: 1,057; sh: 351; java: 226; objc: 98; pascal: 71; xml: 41
file content (267 lines) | stat: -rwxr-xr-x 10,242 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
<!DOCTYPE html
  PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<!-- saved from url=(0014)about:internet -->
<html xmlns:MSHelp="http://www.microsoft.com/MSHelp/" lang="en-us" xml:lang="en-us"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

<meta name="DC.Type" content="topic">
<meta name="DC.Title" content="Reduction">
<meta name="DC.subject" content="Reduction">
<meta name="keywords" content="Reduction">
<meta name="DC.Relation" scheme="URI" content="../../tbb_userguide/Design_Patterns/Design_Patterns.htm">
<meta name="DC.Relation" scheme="URI" content="Agglomeration.htm#Agglomeration">
<meta name="DC.Relation" scheme="URI" content="Elementwise.htm#Elementwise">
<meta name="DC.Relation" scheme="URI" content="Divide_and_Conquer.htm#Divide_and_Conquer">
<meta name="DC.Format" content="XHTML">
<meta name="DC.Identifier" content="Reduction">
<link rel="stylesheet" type="text/css" href="../../intel_css_styles.css">
<title>Reduction</title>
<xml>
<MSHelp:Attr Name="DocSet" Value="Intel"></MSHelp:Attr>
<MSHelp:Attr Name="Locale" Value="kbEnglish"></MSHelp:Attr>
<MSHelp:Attr Name="TopicType" Value="kbReference"></MSHelp:Attr>
</xml>
</head>
<body id="Reduction">
 <!-- ==============(Start:NavScript)================= -->
 <script src="..\..\NavScript.js" language="JavaScript1.2" type="text/javascript"></script>
 <script language="JavaScript1.2" type="text/javascript">WriteNavLink(2);</script>
 <!-- ==============(End:NavScript)================= -->
<a name="Reduction"><!-- --></a>

 
  <h1 class="topictitle1">Reduction</h1>
 
   
  <div> 
	 <div class="section"><h2 class="sectiontitle">Problem</h2> 
		 
		<p>Perform an associative reduction operation across a data set. 
		</p>
 
	 </div>
 
	 <div class="section"><h2 class="sectiontitle">Context</h2> 
		 
		<p>Many serial algorithms sweep over a set of items to collect summary
		  information. 
		</p>
 
	 </div>
 
	 <div class="section"><h2 class="sectiontitle">Forces</h2> 
		 
		<p>The summary can be expressed as an associative operation over the data
		  set, or at least is close enough to associative that reassociation does not
		  matter. 
		</p>
 
	 </div>
 
	 <div class="section"><h2 class="sectiontitle">Solution</h2> 
		 
		<p>Two solutions exist in Intel&reg; Threading Building Blocks (Intel&reg; TBB).
		  The choice on which to use depends upon several considerations: 
		</p>
 
		<ul type="disc"> 
		  <li> 
			 <p>Is the operation commutative as well as associative? 
			 </p>
 
		  </li>
 
		  <li> 
			 <p>Are instances of the reduction type expensive to construct and
				destroy. For example, a floating point number is inexpensive to construct. A
				sparse floating-point matrix might be very expensive to construct. 
			 </p>
 
		  </li>
 
		</ul>
 
		<p>Use 
		  <samp class="codeph">tbb::parallel_reduce</samp> when the objects are inexpensive
		  to construct. It works even if the reduction operation is not commutative. The
		  Intel&reg; TBB Tutorial describes how to use 
		  <samp class="codeph">tbb::parallel_reduce</samp> for basic reductions. 
		</p>
 
		<p>Use 
		  <samp class="codeph">tbb::parallel_for</samp> and 
		  <samp class="codeph">tbb::combinable</samp> if the reduction operation is
		  commutative and instances of the type are expensive. 
		</p>
 
		<p>If the operation is not precisely associative but a precisely
		  deterministic result is required, use recursive reduction and parallelize it
		  using 
		  <samp class="codeph">tbb::parallel_invoke</samp>. 
		</p>
 
	 </div>
 
	 <div class="section"><h2 class="sectiontitle">Examples</h2> 
		 
		<p>The examples presented here illustrate the various solutions and some
		  tradeoffs. 
		</p>
 
		<p>The first example uses
		  <samp class="codeph">tbb::parallel_reduce</samp> to do a + reduction over sequence
		  of type <samp class="codeph">T</samp>. The sequence is defined by a half-open interval [first,last). 
		</p>
 
		<pre>T AssocReduce( const T* first, const T* last, T identity ) {
   return tbb::parallel_reduce(
       // Index range for reduction
       tbb::blocked_range&lt;const T*&gt;(first,last),
       // Identity element
       identity,
       // Reduce a subrange and partial sum
       [&amp;]( tbb::blocked_range&lt;const T*&gt; r, T partial_sum )-&gt;float {
           return std::accumulate( r.begin(), r.end(), partial_sum );
       },
       // Reduce two partial sums
       std::plus&lt;T&gt;()
   );
} </pre> 
		<p>The third and fourth arguments to this form of <samp class="codeph">parallel_reduce</samp> are a
		  built in form of the agglomeration pattern. If there is an elementwise action
		  to be performed before the reduction, incorporating it into the third argument
		  (reduction of a subrange) may improve performance because of better locality of
		  reference. 
		</p>
 
		<p>The second example assumes the + is commutative on <samp class="codeph">T</samp>. It is a good
		  solution when <samp class="codeph">T</samp> objects are expensive to construct. 
		</p>
 
		<pre>T CombineReduce( const T* first, const T* last, T identity ) {
   tbb::combinable&lt;T&gt; sum(identity);
   tbb::parallel_for(
       tbb::blocked_range&lt;const T*&gt;(first,last),
       [&amp;]( tbb::blocked_range&lt;const T*&gt; r ) {
           sum.local() += std::accumulate(r.begin(), r.end(), identity);
       }
   );
   return sum.combine( []( const T&amp; x, const T&amp; y ) {return x+y;} );
}</pre> 
		<p>Sometimes it is desirable to destructively use the partial results to
		  generate the final result. For example, if the partial results are lists, they
		  can be spliced together to form the final result. In that case use class 
		  <samp class="codeph">tbb::enumerable_thread_specific</samp> instead of 
		  <samp class="codeph">combinable</samp>. The 
		  <samp class="codeph">ParallelFindCollisions</samp>
		  example in <em>Divide amd Conquer</em> demonstrates the technique. 
		</p>
 
		<p>Floating-point addition and multiplication are almost associative.
		  Reassociation can cause changes because of rounding effects. The techniques
		  shown so far reassociate terms non-deterministically. Fully deterministic
		  parallel reduction for a not quite associative operation requires using
		  deterministic reassociation. The code below demonstrates this in the form of a
		  template that does a + reduction over a sequence of values of type <samp class="codeph">T</samp>. 
		</p>
 
		<pre>template&lt;typename T&gt;
T RepeatableReduce( const T* first, const T* last, T identity ) {
   if( last-first&lt;=1000 ) {
       // Use serial reduction
       return std::accumulate( first, last, identity );
   } else {
       // Do parallel divide-and-conquer reduction
       const T* mid = first+(last-first)/2;
       T left, right;
       tbb::parallel_invoke(
           [&amp;]{left=RepeatableReduce(first,mid,identity);},
           [&amp;]{right=RepeatableReduce(mid,last,identity);} 
       );
       return left+right;
   }
}</pre> 
		<p>The outer if-else is an instance of the agglomeration pattern for
		  recursive computations. The reduction graph, though not a strict binary tree,
		  is fully deterministic. Thus the result will always be the same for a given
		  input sequence, assuming all threads do identical floating-point rounding. 
		</p>
 
		<p>The final example shows how a problem that typically is not viewed as
		  a reduction can be parallelized by viewing it as a reduction. The problem is
		  retrieving floating-point exception flags for a computation across a data set.
		  The serial code might look something like: 
		</p>
 
		<pre>   feclearexcept(FE_ALL_EXCEPT); 
   for( int i=0; i&lt;N; ++i )
       C[i]=A[i]*B[i];
   int flags = fetestexcept(FE_ALL_EXCEPT);
   if (flags &amp; FE_DIVBYZERO) ...;
   if (flags &amp; FE_OVERFLOW) ...;
   ...</pre> 
		<p>The code can be parallelized by computing chunks of the loop
		  separately, and merging floating-point flags from each chunk. To do this with 
		  <samp class="codeph">tbb:parallel_reduce</samp>, first define a
		  "body" type, as shown below. 
		</p>
 
		<pre>struct ComputeChunk {
   int flags;          // Holds floating-point exceptions seen so far.
   void reset_fpe() {
       flags=0;
       feclearexcept(FE_ALL_EXCEPT);
   }
   ComputeChunk () {
       reset_fpe();
   }
   // "Splitting constructor"called by parallel_reduce when splitting a range into subranges.
   ComputeChunk ( const ComputeChunk&amp;, tbb::split ) {
       reset_fpe();
   }
   // Operates on a chunk and collects floating-point exception state into flags member.
   void operator()( tbb::blocked_range&lt;int&gt; r ) {
       int end=r.end();
       for( int i=r.begin(); i!=end; ++i )
           C[i] = A[i]/B[i];
       // It is critical to do |= here, not =, because otherwise we
       // might lose earlier exceptions from the same thread.
       flags |= fetestexcept(FE_ALL_EXCEPT);
   }
   // Called by parallel_reduce when joining results from two subranges.
   void join( Body&amp; other ) {
       flags |= other.flags;
   }
};</pre> 
		<p>Then invoke it as follows: 
		</p>
 
		<pre>// Construction of cc implicitly resets FP exception state.
   ComputeChunk cc;
   tbb::parallel_reduce( tbb::blocked_range&lt;int&gt;(0,N), cc );
   if (cc.flags &amp; FE_DIVBYZERO) ...;
   if (cc.flags &amp; FE_OVERFLOW) ...;
   ...</pre> 
		<p>&nbsp; 
		</p>
 
	 </div>
 
  </div>
 
  
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong>&nbsp;<a href="../../tbb_userguide/Design_Patterns/Design_Patterns.htm">Design Patterns</a></div>
</div>
<div class="See Also">
<h2>See Also</h2>
<div class="linklist">
<div><a href="Agglomeration.htm#Agglomeration">Agglomeration 
		  </a></div>
<div><a href="Elementwise.htm#Elementwise">Elementwise 
		  </a></div>
<div><a href="Divide_and_Conquer.htm#Divide_and_Conquer">Divide and Conquer 
		  </a></div></div>
</div> 

</body>
</html>