File: parallel_reduce.htm

package info (click to toggle)
tbb 4.2~20140122-5
  • links: PTS
  • area: main
  • in suites: jessie, jessie-kfreebsd
  • size: 21,492 kB
  • ctags: 21,278
  • sloc: cpp: 92,813; ansic: 9,775; asm: 1,070; makefile: 1,057; sh: 351; java: 226; objc: 98; pascal: 71; xml: 41
file content (230 lines) | stat: -rwxr-xr-x 8,971 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
<!DOCTYPE html
  PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<!-- saved from url=(0014)about:internet -->
<html xmlns:MSHelp="http://www.microsoft.com/MSHelp/" lang="en-us" xml:lang="en-us"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

<meta name="DC.Type" content="topic">
<meta name="DC.Title" content="parallel_reduce">
<meta name="DC.subject" content="parallel_reduce">
<meta name="keywords" content="parallel_reduce">
<meta name="DC.Relation" scheme="URI" content="../tbb_userguide/Parallelizing_Simple_Loops.htm">
<meta name="DC.Relation" scheme="URI" content="parallel_for.htm#tutorial_parallel_for">
<meta name="DC.Format" content="XHTML">
<meta name="DC.Identifier" content="tutorial_parallel_reduce">
<link rel="stylesheet" type="text/css" href="../intel_css_styles.css">
<title>parallel_reduce</title>
<xml>
<MSHelp:Attr Name="DocSet" Value="Intel"></MSHelp:Attr>
<MSHelp:Attr Name="Locale" Value="kbEnglish"></MSHelp:Attr>
<MSHelp:Attr Name="TopicType" Value="kbReference"></MSHelp:Attr>
</xml>
</head>
<body id="tutorial_parallel_reduce">
 <!-- ==============(Start:NavScript)================= -->
 <script src="..\NavScript.js" language="JavaScript1.2" type="text/javascript"></script>
 <script language="JavaScript1.2" type="text/javascript">WriteNavLink(1);</script>
 <!-- ==============(End:NavScript)================= -->
<a name="tutorial_parallel_reduce"><!-- --></a>

 
  <h1 class="topictitle1">parallel_reduce</h1>
 
   
  <div> 
	 <p>A loop can do a reduction, as in this summation: 
	 </p>
 
	 <pre>float SerialSumFoo( float a[], size_t n ) {
    float sum = 0;
    for( size_t i=0; i!=n; ++i )
        sum += Foo(a[i]);
    return sum;
}</pre> 
	 <p>If the iterations are independent, you can parallelize this loop using
		the template class 
		<samp class="codeph">parallel_reduce</samp> as follows: 
	 </p>
 
	 <pre>float ParallelSumFoo( const float a[], size_t n ) {
    SumFoo sf(a);
    parallel_reduce( blocked_range&lt;size_t&gt;(0,n), sf );
    return <span style="color:blue"><strong>sf.my_</strong></span>sum;
}</pre> 
	 <p>The class 
		<samp class="codeph">SumFoo</samp> specifies details of the reduction, such as how
		to accumulate subsums and combine them. Here is the definition of class 
		<samp class="codeph">SumFoo</samp>: 
	 </p>
 
	 <pre>class SumFoo {
    float* my_a;
public:
    float my_sum; 
    void operator()( const blocked_range&lt;size_t&gt;&amp; r ) {
        float *a = my_a;
        float sum = my_sum;
        size_t end = r.end();
        for( size_t i=r.begin(); i!=end; ++i ) 
            sum += Foo(a[i]); 
        <span style="color:blue"><strong>my_sum = sum;</strong></span>    
    <span style="color:blue"><strong>}</strong></span>
&nbsp;
    SumFoo( SumFoo&amp; x, split ) : my_a(x.my_a), my_sum(0) {}
&nbsp;
    void join( const SumFoo&amp; y ) {my_sum+=y.my_sum;}
             
    SumFoo(float a[] ) :
        my_a(a), my_sum(0)
    {}
};</pre> 
	 <p>Note the differences with class 
		<samp class="codeph">ApplyFoo</samp> from 
		<span class="keyword">parallel_for</span>. First, 
		<samp class="codeph">operator()</samp> is 
		<em>not</em> 
		<samp class="codeph">const</samp>. This is because it must update 
	 <span class="option">SumFoo::my_sum</span>. Second, 
	 <samp class="codeph">SumFoo</samp> has a 
	 <em>splitting constructor</em> and a method 
	 <samp class="codeph">join</samp> that must be present for 
	 <samp class="codeph">parallel_reduce</samp> to work. The splitting constructor takes
	 as arguments a reference to the original object, and a dummy argument of type 
	 <samp class="codeph">split</samp>, which is defined by the library. The dummy argument
	 distinguishes the splitting constructor from a copy constructor. 
	 </p>
 
	 <div class="Note"><h3 class="NoteTipHead">
					Tip</h3> 
		<p>In the example, the definition of 
		  <samp class="codeph">operator()</samp> uses local temporary variables
		  (<samp class="codeph">a</samp>, 
		  <samp class="codeph">sum</samp>, 
		  <samp class="codeph">end</samp>) for scalar values accessed inside the loop. This
		  technique can improve performance by making it obvious to the compiler that the
		  values can be held in registers instead of memory. If the values are too large
		  to fit in registers, or have their address taken in a way the compiler cannot
		  track, the technique might not help. With a typical optimizing compiler, using
		  local temporaries for only written variables (such as 
		  <samp class="codeph">sum</samp> in the example) can suffice, because then the
		  compiler can deduce that the loop does not write to any of the other locations,
		  and hoist the other reads to outside the loop. 
		</p>
 
	 </div> 
	 <p>When a worker thread is available, as decided by the task scheduler, 
		<samp class="codeph">parallel_reduce</samp> invokes the splitting constructor to
		create a subtask for the worker. When the subtask completes, 
		<samp class="codeph">parallel_reduce</samp> uses method 
		<samp class="codeph">join</samp> to accumulate the result of the subtask. The graph
		at the top of the following figure shows the split-join sequence that happens
		when a worker is available: 
	 </p>
 
	 <div class="fignone" id="fig5"><a name="fig5"><!-- --></a><span class="figcap">Graph of the Split-join Sequence</span> 
		<br><img src="Images/image009.jpg" width="512" height="438"><br> 
	 </div>
 
	 <p>An arc in the above figure indicates order in time. The splitting
		constructor might run concurrently while object x is being used for the first
		half of the reduction. Therefore, all actions of the splitting constructor that
		creates y must be made thread safe with respect to x. So if the splitting
		constructor needs to increment a reference count shared with other objects, it
		should use an atomic increment. 
	 </p>
 
	 <p>If a worker is not available, the second half of the iteration is
		reduced using the same body object that reduced the first half. That is the
		reduction of the second half starts where reduction of the first half finished.
		
	 </p>
 
	 <div class="Note"><h3 class="NoteTipHead">
					Caution</h3> 
		<p>Because split/join are not used if workers are unavailable, 
		  <samp class="codeph">parallel_reduce</samp> does not necessarily do recursive
		  splitting. 
		</p>
 
	 </div> 
	 <div class="Note"><h3 class="NoteTipHead">
					Caution</h3> 
		<p>Because the same body might be used to accumulate multiple subranges,
		  it is critical that 
		  <samp class="codeph">operator()</samp> not discard earlier accumulations. The code
		  below shows an incorrect definition of 
		  <samp class="codeph">SumFoo::operator()</samp>. 
		</p>
 
	 </div> 
	 <pre>class SumFoo {
    ...
public:
    float my_sum; 
    void operator()( const blocked_range&lt;size_t&gt;&amp; r ) {
        ...
        float <span style="color:blue"><strong>sum = 0;</strong></span>  // WRONG – should be 'sum = <span style="color:blue"><strong>my_sum</strong></span>".
        ...
        for( ... ) 
            sum += Foo(a[i]); 
        my_sum = sum;   
    }
    ...
};</pre> 
	 <p>With the mistake, the body returns a partial sum for the last subrange
		instead of all subranges to which 
		<samp class="codeph">parallel_reduce</samp> applies it. 
	 </p>
 
	 <p>The rules for partitioners and grain sizes for 
		<samp class="codeph">parallel_reduce</samp> are the same as for 
		<samp class="codeph">parallel_for</samp>. 
	 </p>
 
	 <p><samp class="codeph">parallel_reduce</samp> generalizes to any associative
		operation. In general, the splitting constructor does two things: 
	 </p>
 
	 <ul type="disc"> 
		<li> 
		  <p>Copy read-only information necessary to run the loop body. 
		  </p>
 
		</li>
 
		<li> 
		  <p>Initialize the reduction variable(s) to the identity element of the
			 operation(s). 
		  </p>
 
		</li>
 
	 </ul>
 
	 <p>The join method should do the corresponding merge(s). You can do more
		than one reduction at the same time: you can gather the min and max with a
		single 
		<samp class="codeph">parallel_reduce</samp>. 
	 </p>
 
	 <div class="Note"><h3 class="NoteTipHead">
					Note</h3> 
		<p>The reduction operation can be non-commutative. The example still
		  works if floating-point addition is replaced by string concatenation. 
		</p>
 
	 </div> 
  </div>
 
  
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong>&nbsp;<a href="../tbb_userguide/Parallelizing_Simple_Loops.htm">Parallelizing Simple Loops</a></div>
</div>
<div class="See Also">
<h2>See Also</h2>
<div class="linklist">
<div><a href="parallel_for.htm#tutorial_parallel_for">parallel_for 
		  </a></div></div>
</div> 

</body>
</html>