---
title: "R, Rcpp and Parallel Computing"
author: "Dirk Eddelbuettel and JJ Allaire"
date: "Jan 26-27, 2015\\newline Workshop for Distributed Computing in R\\newline HP Research, Palo Alto, CA"
output:
  beamer_presentation:
    includes:
      in_header: header.tex
    keep_tex: yes
    theme: Warsaw
fontsize: 12pt
subtitle: Notes from our Rcpp Experience
classoption: compress
---

# Intro

## One View on Parallel Computing

> The whole "let's parallelize" thing is a huge waste of everybody's
> time. There's this huge body of "knowledge" that parallel is somehow more
> efficient, and that whole huge body is pure and utter garbage. Big caches
> are efficient. Parallel stupid small cores without caches are horrible
> unless you have a very specific load that is hugely regular (ie graphics). 
>   
> [...]  
>   
> Give it up. The whole "parallel computing is the future" is a bunch of crock.


[Linus Torvalds, Dec 2014](http://www.realworldtech.com/forum/?threadid=146066&curpostid=146227)


## Another View on Big Data
\framesubtitle{Imagine a \texttt{gsub("DBMs", "", tweet)} to complement further...}

\centering{\includegraphics[width=\textwidth,height=0.8\textheight,keepaspectratio]{images/big-data-big-machine-tweet.png}}

# R

## CRAN Task View on HPC

\framesubtitle{\texttt{http://cran.r-project.org/web/views/HighPerformanceComputing.html}}

Things R does well:

\medskip

- Package snow by Tierney et al was a trailblazer
- Package Rmpi by Yu is equally important
- multicore / snow / parallel even work on Windows
- Hundreds of applications
- _It just works_ for data-parallel tasks

# Rcpp

## Rcpp: Early Days

In the fairly early days of Rcpp, we also put out RInside as a simple C++
class wrapper around the R-embedding API.

It got one clever patch taking this (i.e., R wrapped in C++ with its own
`main()` function) and encapsulating it within MPI.

HP Vertica also uses Rcpp and RInside in 
[DistributedR](https://github.com/vertica/DistributedR/tree/master/third_party).

## Rcpp: More recently

Rcpp is now easy to deploy; Rcpp Attributes played a key role:

```cpp
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double piSugar(const int N) {
    NumericVector x = runif(N);
    NumericVector y = runif(N);
    NumericVector d = sqrt(x*x + y*y);
    return 4.0 * sum(d < 1.0) / N;
}
```

## Rcpp: Extensions

Rcpp Attributes also support "plugins".

OpenMP is easy to use and widely
supported (on suitable OS / compiler combinations).

So we added support via a plugin. Its use is still not as widespread.

A common source of errors: calling back into R from threaded code.
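
As a rough sketch of how the plugin can be used (the `sumSqrt` function below is just an illustration compiled via `Rcpp::sourceCpp()`, not code from any package), the attribute requests the OpenMP flags and the parallel region stays clear of calls back into R:

```cpp
#include <Rcpp.h>
#include <cmath>

// request OpenMP compiler/linker flags via the plugin
// [[Rcpp::plugins(openmp)]]

// [[Rcpp::export]]
double sumSqrt(Rcpp::NumericVector x) {
    int n = x.size();
    double total = 0.0;
    // the parallel region only touches plain doubles -- no calls back into R
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += std::sqrt(x[i]);
    return total;
}
```

The plugin only adds the compiler and linker flags; keeping R API calls out of the parallel region remains the author's job, which is exactly where the errors mentioned above come from.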


# RcppParallel

## Parallel Programming for Rcpp Users

\framesubtitle{NOT like this...}

```cpp
using namespace boost;

mutex mtx;   // shared mutex protecting the task's data

void task()
{
   lock_guard<mutex> lock(mtx);
   // etc...
}

threadpool::pool tp(thread::hardware_concurrency());
for (int i = 0; i < slices; i++)
   tp.schedule(&task);
```

## Parallel Programming for Rcpp Users

Goals:

* Encapsulate threading and locking
* Provide high level constructs for common parallel tasks
* High performance: locality, work stealing, etc.
* Safe access to R data structures on multiple threads


## Parallel Programming Alternatives
   
\footnotesize

|                            |     TBB     |     OMP     |     RAW     |
|----------------------------|:-----------:|:-----------:|:-----------:|
| Task level parallelism     | \textbullet | \textbullet |             |
| Data decomposition support | \textbullet | \textbullet |             |
| Non loop parallel patterns | \textbullet |             |             |
| Generic parallel patterns  | \textbullet |             |             |
| Nested parallelism support | \textbullet |             |             |
| Built in load balancing    | \textbullet | \textbullet |             |
| Affinity support           |             | \textbullet | \textbullet |
| Static scheduling          |             | \textbullet |             |
| Concurrent data structures | \textbullet |             |             |
| Scalable memory allocator  | \textbullet |             |             |

## TBB vs. OpenMP vs. Threads

* Raw threads shift too much of the burden of parallelization onto the developer (error-prone and not performant)
* OpenMP is excellent for parallelizing existing loops where the iterations are independent (R already has some support for OpenMP)
* TBB fares better when there is more potential interaction between threads (e.g. more complex loops, simulations, or where concurrent containers are required).
* RcppParallel enables use of TBB with R to complement the existing OpenMP stack.

## Win32 Platform Complications

* TBB supports MinGW on Win32; however, we haven't (yet) sorted out how to build it with Rtools
* As a result we use [TinyThread](http://tinythreadpp.bitsnbites.eu/) on Win32
* This requires that we create a layer to abstract over TBB and TinyThread (thus limiting the expressiveness of code that wants to be portable to Windows).
* Developers are still free to use all of TBB if they are content targeting only Linux and OSX
* Would love to see TBB working on Win32 (pull requests welcome!)

## R Concurrency Complications

R itself is single-threaded, and [Writing R Extensions](http://cran.r-project.org/doc/manuals/r-release/R-exts.html) includes this warning when discussing the use of OpenMP:

> Calling any of the R API from threaded code is ‘for experts only’: they will need to read the source code to determine if it is thread-safe. In particular, code which makes use of the stack-checking mechanism must not be called from threaded code.

However, we don't really want to force Rcpp users to read the Rcpp and R source code just to assess thread-safety issues.

## RcppParallel Threadsafe Accessors

Since R vectors and matrices are just raw contiguous arrays, it's easy to create threadsafe C++ wrappers for them:

* `RVector<T>` is a very thin wrapper over a C array.

* `RMatrix<T>` is the same but also provides `Row<T>` and `Column<T>` accessors/iterators.

The implementations of these classes are extremely lightweight and never call into Rcpp or the R API (so they are always threadsafe).
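
As a small, purely illustrative sketch (the `sumOfRow` helper is ours, not part of the package), once an R matrix is wrapped the data can be read entirely through the wrapper, with no calls back into Rcpp or the R API:

```cpp
#include <Rcpp.h>
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
#include <numeric>

// [[Rcpp::export]]
double sumOfRow(Rcpp::NumericMatrix m, int i) {
   RcppParallel::RMatrix<double> mat(m);                // thin wrapper, no copy
   RcppParallel::RMatrix<double>::Row row = mat.row(i); // row accessor
   return std::accumulate(row.begin(), row.end(), 0.0);
}
```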

## RcppParallel Operations

Two high-level operations are provided (with TBB and TinyThread implementations of each):

* `parallelFor` -- Convert the work of a standard serial "for" loop into a parallel one

* `parallelReduce` -- Used for accumulating aggregate or other values.

Not surprisingly, the TBB versions of these operations perform roughly 50% better than the "naive" parallel implementation provided by TinyThread.

## Basic Mechanics: Create a Worker

Create a `Worker` class with `operator()` that RcppParallel uses to operate on discrete slices of the input data on different threads:

```cpp
class MyWorker : public RcppParallel::Worker {
public:

   // called on each thread with a distinct slice of the input
   void operator()(std::size_t begin, std::size_t end) {
      // do some work from begin to end
      // within the input data
   }

};
```

## Basic Mechanics: Call the Worker

A Worker would typically take input and output data in its constructor and then save them as members (for reading/writing within `operator()`):

```cpp
NumericMatrix matrixSqrt(NumericMatrix x) {
  
  NumericMatrix output(x.nrow(), x.ncol());

  SquareRootWorker worker(x, output);
  
  parallelFor(0, x.length(), worker);
  
  return output;
}
```
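
For reference, a minimal sketch of what `SquareRootWorker` might look like (a reconstruction following this pattern; the Rcpp Gallery article linked a few slides later has the canonical version):

```cpp
#include <Rcpp.h>
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
#include <algorithm>
#include <cmath>
using namespace Rcpp;
using namespace RcppParallel;

struct SquareRootWorker : public Worker {

   // input and output matrices (threadsafe wrappers, stored as members)
   const RMatrix<double> input;
   RMatrix<double> output;

   // take the source and destination matrices in the constructor
   SquareRootWorker(const NumericMatrix input, NumericMatrix output)
      : input(input), output(output) {}

   // write the square roots of the elements in [begin, end)
   void operator()(std::size_t begin, std::size_t end) {
      std::transform(input.begin() + begin, input.begin() + end,
                     output.begin() + begin, ::sqrt);
   }
};
```

RcppParallel's `parallelFor()` also accepts an optional grain-size argument as its fourth parameter, which is the per-application tuning knob mentioned later.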

## Basic Mechanics: Join Function

For `parallelReduce` you need to specify how data is to be combined. Typically you accumulate a value in a member within `operator()`, and then fuse it with that of another `Worker` instance in the `join` function.

```cpp
class SumWorker : public RcppParallel::Worker {
public:
   // value accumulated by operator() on this thread
   double value;

   // join my value with that of another SumWorker
   void join(const SumWorker& rhs) {
      value += rhs.value;
   }
};
```
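
As a hedged, fuller sketch in the style of the parallel-vector-sum Rcpp Gallery article referenced later: a complete `SumWorker` also stores the input vector, accumulates into `value` inside `operator()`, and provides a second, "splitting" constructor used when the work is divided across threads:

```cpp
#include <Rcpp.h>
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
#include <numeric>
using namespace Rcpp;
using namespace RcppParallel;

struct SumWorker : public Worker {
   const RVector<double> input;   // source vector (threadsafe wrapper)
   double value;                  // accumulated value

   SumWorker(const NumericVector input) : input(input), value(0) {}
   // "splitting" constructor: called when the work is divided across threads
   SumWorker(const SumWorker& sum, Split) : input(sum.input), value(0) {}

   void operator()(std::size_t begin, std::size_t end) {
      value += std::accumulate(input.begin() + begin,
                               input.begin() + end, 0.0);
   }
   void join(const SumWorker& rhs) { value += rhs.value; }
};

// [[Rcpp::export]]
double parallelVectorSum(NumericVector x) {
   SumWorker sum(x);
   parallelReduce(0, x.length(), sum);
   return sum.value;
}
```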

## What does all of this buy us?

* Developers just write pieces of code that are called at the correct time by an intelligent parallel supervisor
* In most cases no locking or explicit thread management required!
* Supervisor does some intelligent optimization around:
    - Grain size (which affects locality of reference and therefore cache hit rates). Note that grain size can also be tuned directly per-application.
    - Work stealing (detecting idle threads and pushing work to them)
* In the case of TBB, high performance concurrent containers are available if necessary

## Examples

* All available on the Rcpp Gallery <http://gallery.rcpp.org>

* Tested with 4 cores on a 2.6GHz Haswell MacBook Pro

* Note that benchmarks will be 30-50% slower on Windows because we aren't using the more sophisticated scheduling of TBB

## Example: Transforming a Matrix in Parallel

\framesubtitle{\texttt{http://gallery.rcpp.org/articles/parallel-matrix-transform}}

```cpp
 void operator()(size_t begin, size_t end) {
      std::transform(input.begin() + begin, 
                     input.begin() + end, 
                     output.begin() + begin, 
                     ::sqrt);
   }
```

```
                   test replications elapsed relative
2 parallelMatrixSqrt(m)          100   0.294    1.000
1         matrixSqrt(m)          100   0.755    2.568
```

## Example: Summing a Vector in Parallel

\framesubtitle{\texttt{http://gallery.rcpp.org/articles/parallel-vector-sum}}


```cpp
void operator()(size_t begin, size_t end) {
   value += std::accumulate(input.begin() + begin, 
                            input.begin() + end, 
                            0.0);
}    
void join(const Sum& rhs) { 
   value += rhs.value; 
}
```
```
                  test replications elapsed relative
2 parallelVectorSum(v)          100   0.182    1.000
1         vectorSum(v)          100   0.857    4.709
```

## Example: Parallel Distance Matrix Calculation

\framesubtitle{\texttt{http://gallery.rcpp.org/articles/parallel-distance-matrix}}

```
                       test reps elapsed relative
3 rcpp_parallel_distance(m)    3   0.110    1.000
2          rcpp_distance(m)    3   0.618    5.618
1               distance(m)    3  35.560  323.273
```

* Rcpp + RcppParallel = 323x over R implementation!
* Unbalanced workload benefits from work stealing

## The Rest of TBB

* Advanced algorithms: `parallel_scan`, `parallel_while`, `parallel_do`, `parallel_pipeline`, `parallel_sort`
* Containers: `concurrent_queue`, `concurrent_priority_queue`, `concurrent_vector`, `concurrent_hash_map`
* Mutual exclusion: `mutex`, `spin_mutex`, `queuing_mutex`, `spin_rw_mutex`, `queuing_rw_mutex`, `recursive_mutex`
* Atomic operations: `fetch_and_add`, `fetch_and_increment`, `fetch_and_decrement`, `compare_and_swap`, `fetch_and_store`
* Timing: portable fine grained global time stamp
* Task Scheduler: direct access to control the creation and activation of tasks
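
A small taste of a few of these, as a standalone C++ sketch against the classic TBB API (assuming TBB headers and library are available, e.g. as bundled by RcppParallel on Linux / OS X; this is not RcppParallel code):

```cpp
#include <tbb/concurrent_vector.h>
#include <tbb/atomic.h>
#include <tbb/parallel_sort.h>
#include <vector>

int main() {
    // container that tolerates concurrent push_back without external locking
    tbb::concurrent_vector<double> values;
    values.push_back(3.0);
    values.push_back(1.0);

    // lock-free counter with fetch_and_add
    tbb::atomic<int> counter;
    counter = 0;
    counter.fetch_and_add(1);

    // parallel sort of an ordinary container
    std::vector<double> v(values.begin(), values.end());
    tbb::parallel_sort(v.begin(), v.end());

    return 0;
}
```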

## Open Issues

* Additional (portable to Win32 via TinyThread) wrappers for other TBB constructs?

* Alternatively, sort out Rtools configuration issues required to get TBB working on Windows.

* Education: Parallel Programming is _hard_. 
    
* Simple `parallelFor` and `parallelReduce` are reasonably easy to grasp, but more advanced idioms aren't trivial to learn and use; for some applications, though, they have enough upside to be worth the effort.