---
title: "R, Rcpp and Parallel Computing"
author: "Dirk Eddelbuettel and JJ Allaire"
date: "Jan 26-27, 2015\\newline Workshop for Distributed Computing in R\\newline HP Research, Palo Alto, CA"
output:
  beamer_presentation:
    includes:
      in_header: header.tex
    keep_tex: yes
    theme: Warsaw
fontsize: 12pt
subtitle: "Notes from our Rcpp Experience"
classoption: compress
---
# Intro
## One View on Parallel Computing
> The whole "let's parallelize" thing is a huge waste of everybody's
> time. There's this huge body of "knowledge" that parallel is somehow more
> efficient, and that whole huge body is pure and utter garbage. Big caches
> are efficient. Parallel stupid small cores without caches are horrible
> unless you have a very specific load that is hugely regular (ie graphics).
>
> [...]
>
> Give it up. The whole "parallel computing is the future" is a bunch of crock.
[Linus Torvalds, Dec 2014](http://www.realworldtech.com/forum/?threadid=146066&curpostid=146227)
## Another View on Big Data
\framesubtitle{Imagine a \texttt{gsub("DBMs", "", tweet)} to complement further...}
\centering{\includegraphics[width=\textwidth,height=0.8\textheight,keepaspectratio]{images/big-data-big-machine-tweet.png}}
# R
## CRAN Task View on HPC
\framesubtitle{\texttt{http://cran.r-project.org/web/views/HighPerformanceComputing.html}}
Things R does well:
\medskip
- Package snow by Tierney et al. was a trailblazer
- Package Rmpi by Yu is equally important
- multicore / snow / parallel even work on Windows
- Hundreds of applications
- _It just works_ for data-parallel tasks
# Rcpp
## Rcpp: Early Days
In the fairly early days of Rcpp, we also put out RInside as a simple C++
class wrapper around the R-embedding API.
It received one clever patch that took this (i.e., R wrapped in C++ with its
own `main()` function) and encapsulated it within MPI.
HP Vertica also uses Rcpp and RInside in
[DistributedR](https://github.com/vertica/DistributedR/tree/master/third_party).
## Rcpp: More recently
Rcpp is now easy to deploy; Rcpp Attributes played a key role:
```cpp
#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
double piSugar(const int N) {
    NumericVector x = runif(N);
    NumericVector y = runif(N);
    NumericVector d = sqrt(x*x + y*y);
    return 4.0 * sum(d < 1.0) / N;
}
```
## Rcpp: Extensions
Rcpp Attributes also support "plugins".

OpenMP is easy to use and widely supported (on suitable OS / compiler
combinations), so we added support via a plugin. Its use is still not as
widespread as it could be.

The errors we do see have one thing in common: calling back into R from
threaded code.
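As a minimal illustration (not from the original deck) of the plugin in use,
here is a hypothetical function squaring a vector with an OpenMP loop; the
function name and loop body are made up, and the key point is that the loop
only touches plain doubles and never calls back into R:

```cpp
#include <Rcpp.h>

// [[Rcpp::plugins(openmp)]]

// [[Rcpp::export]]
Rcpp::NumericVector squareOMP(Rcpp::NumericVector x) {
    int n = x.size();
    Rcpp::NumericVector out(n);
    // parallel loop over contiguous doubles; no R API calls inside
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = x[i] * x[i];
    return out;
}
```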
# RcppParallel
## Parallel Programming for Rcpp Users
\framesubtitle{NOT like this...}
```cpp
using namespace boost;

void task()
{
   lock_guard<boost::mutex> lock(mutex);
   // etc...
}

threadpool::pool tp(thread::hardware_concurrency());

for (int i = 0; i < slices; i++)
   tp.schedule(&task);
```
## Parallel Programming for Rcpp Users
Goals:
* Encapsulate threading and locking
* Provide high level constructs for common parallel tasks
* High performance: locality, work stealing, etc.
* Safe access to R data structures on multiple threads
## Parallel Programming Alternatives
\footnotesize
|                            |     TBB     |     OMP     |     RAW     |
|----------------------------|:-----------:|:-----------:|:-----------:|
| Task level parallelism     | \textbullet | \textbullet |             |
| Data decomposition support | \textbullet | \textbullet |             |
| Non-loop parallel patterns | \textbullet |             |             |
| Generic parallel patterns  | \textbullet |             |             |
| Nested parallelism support | \textbullet |             |             |
| Built-in load balancing    | \textbullet | \textbullet |             |
| Affinity support           |             | \textbullet | \textbullet |
| Static scheduling          |             | \textbullet |             |
| Concurrent data structures | \textbullet |             |             |
| Scalable memory allocator  | \textbullet |             |             |
## TBB vs. OpenMP vs. Threads
* Raw threads shift too much burden for parallelization onto the developer (error prone and not performant)
* OpenMP is excellent for parallelizing existing loops where the iterations are independent (R already has some support for OpenMP)
* TBB fares better when there is more potential interaction between threads (e.g. more complex loops, simulations, or where concurrent containers are required).
* RcppParallel: enables use of TBB with R to complement the existing OpenMP stack.
## Win32 Platform Complications
* TBB supports MinGW on Win32; however, we haven't (yet) sorted out how to build it with Rtools
* As a result we use [TinyThread](http://tinythreadpp.bitsnbites.eu/) on Win32
* This requires that we create a layer to abstract over TBB and TinyThread (thus limiting the expressiveness of code that wants to be portable to Windows).
* Developers are still free to use all of TBB if they are content targeting only Linux and OSX
* Would love to see TBB working on Win32 (pull requests welcome!)
## R Concurrency Complications
R is single-threaded and includes this warning in [Writing R Extensions](http://cran.r-project.org/doc/manuals/r-release/R-exts.html) when discussing the use of OpenMP:
> Calling any of the R API from threaded code is ‘for experts only’: they will need to read the source code to determine if it is thread-safe. In particular, code which makes use of the stack-checking mechanism must not be called from threaded code.
However, we don't really want to force Rcpp users to resort to reading the Rcpp and R source code to assess thread-safety issues.
## RcppParallel Threadsafe Accessors
Since R vectors and matrices are just raw contiguous arrays, it's easy to create threadsafe C++ wrappers for them:
* `RVector<T>` is a very thin wrapper over a C array.
* `RMatrix<T>` is the same but also provides `Row<T>` and `Column<T>` accessors/iterators.
The implementations of these classes are extremely lightweight and never call into Rcpp or the R API (so are always threadsafe).
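A minimal sketch of how these wrappers are constructed from Rcpp objects
before any threads are launched (the function and variable names are
illustrative only):

```cpp
#include <Rcpp.h>
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>

void wrapForThreads(Rcpp::NumericVector v, Rcpp::NumericMatrix m) {
    // thin wrappers over the underlying contiguous arrays; they never
    // call the R API, so worker threads can read them safely
    RcppParallel::RVector<double> rv(v);
    RcppParallel::RMatrix<double> rm(m);

    double first = rv[0];                                  // element access
    RcppParallel::RMatrix<double>::Row row0 = rm.row(0);   // row accessor
    (void)first; (void)row0;
}
```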
## RcppParallel Operations
Two high-level operations are provided (with TBB and TinyThread implementations of each):
* `parallelFor` -- Convert the work of a standard serial "for" loop into a parallel one
* `parallelReduce` -- Used for accumulating aggregate or other values.
Not surprisingly, the TBB versions of these operations perform ~50% better than the "naive" parallel implementation provided by TinyThread.
## Basic Mechanics: Create a Worker
Create a `Worker` class with `operator()` that RcppParallel uses to operate on discrete slices of the input data on different threads:
```cpp
class MyWorker : public RcppParallel::Worker {
public:
   void operator()(std::size_t begin, std::size_t end) {
      // do some work from begin to end
      // within the input data
   }
};
```
## Basic Mechanics: Call the Worker
The Worker would typically take input and output data in its constructor and save them as members (for reading/writing within `operator()`):
```cpp
NumericMatrix matrixSqrt(NumericMatrix x) {
   NumericMatrix output(x.nrow(), x.ncol());
   SquareRootWorker worker(x, output);
   parallelFor(0, x.length(), worker);
   return output;
}
```
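For completeness, a sketch of what `SquareRootWorker` itself might look like
(adapted from the Rcpp Gallery matrix-transform article referenced later):
the constructor stores the input and output as threadsafe `RMatrix` members
that `operator()` then reads and writes:

```cpp
#include <algorithm>
#include <cmath>

struct SquareRootWorker : public RcppParallel::Worker {
   // threadsafe wrappers around the input and output matrices
   const RcppParallel::RMatrix<double> input;
   RcppParallel::RMatrix<double> output;

   // the constructor converts the Rcpp objects into RMatrix members
   SquareRootWorker(const Rcpp::NumericMatrix input,
                    Rcpp::NumericMatrix output)
      : input(input), output(output) {}

   // take the square root of the range of elements assigned to us
   void operator()(std::size_t begin, std::size_t end) {
      std::transform(input.begin() + begin, input.begin() + end,
                     output.begin() + begin, ::sqrt);
   }
};
```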
## Basic Mechanics: Join Function
For `parallelReduce` you need to specify how data is to be combined. Typically you save data in a member within `operator()` then fuse it with another `Worker` instance in the `join` function.
```cpp
class SumWorker : public RcppParallel::Worker {
public:
   double value;   // accumulated within operator()

   // join my value with that of another SumWorker
   void join(const SumWorker& rhs) {
      value += rhs.value;
   }
};
```
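A sketch of the calling side (names are illustrative, not from the slide).
Note that `parallelReduce` also requires the worker to provide a "split"
constructor, e.g. `SumWorker(const SumWorker& other, RcppParallel::Split)`,
which creates a fresh accumulator for a new chunk of work:

```cpp
// [[Rcpp::export]]
double vectorSum(Rcpp::NumericVector x) {
   // SumWorker is assumed to store the input and start with value = 0
   SumWorker sum(x);
   RcppParallel::parallelReduce(0, x.length(), sum);
   return sum.value;
}
```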
## What does all of this buy us?
* Developers just write pieces of code that are called at the correct time by an intelligent parallel supervisor
* In most cases no locking or explicit thread management required!
* Supervisor does some intelligent optimization around:
    - Grain size (which affects locality of reference and therefore cache hit rates). Grain size can also be tuned directly per-application; see the sketch after this list.
- Work stealing (detecting idle threads and pushing work to them)
* In the case of TBB, high performance concurrent containers are available if necessary
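As a concrete illustration of tuning the grain size directly, `parallelFor`
(and `parallelReduce`) accepts an optional grain-size argument; the value of
1000 below is made up for the example:

```cpp
// schedule work in chunks of roughly 1000 elements per task
RcppParallel::parallelFor(0, x.length(), worker, 1000);
```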
## Examples
* All available on the Rcpp Gallery <http://gallery.rcpp.org>
* Tested with 4 cores on a 2.6GHz Haswell MacBook Pro
* Note that benchmarks will be 30-50% slower on Windows because we aren't using the more sophisticated scheduling of TBB
## Example: Transforming a Matrix in Parallel
\framesubtitle{\texttt{http://gallery.rcpp.org/articles/parallel-matrix-transform}}
```cpp
void operator()(size_t begin, size_t end) {
   std::transform(input.begin() + begin,
                  input.begin() + end,
                  output.begin() + begin,
                  ::sqrt);
}
```
```
                   test replications elapsed relative
2 parallelMatrixSqrt(m)          100   0.294    1.000
1         matrixSqrt(m)          100   0.755    2.568
```
## Example: Summing a Vector in Parallel
\framesubtitle{\texttt{http://gallery.rcpp.org/articles/parallel-vector-sum}}
```cpp
void operator()(size_t begin, size_t end) {
   value += std::accumulate(input.begin() + begin,
                            input.begin() + end,
                            0.0);
}

void join(const Sum& rhs) {
   value += rhs.value;
}
```
```
                  test replications elapsed relative
2 parallelVectorSum(v)          100   0.182    1.000
1         vectorSum(v)          100   0.857    4.709
```
## Example: Parallel Distance Matrix Calculation
\framesubtitle{\texttt{http://gallery.rcpp.org/articles/parallel-distance-matrix}}
```
                       test reps elapsed relative
3 rcpp_parallel_distance(m)    3   0.110    1.000
2          rcpp_distance(m)    3   0.618    5.618
1               distance(m)    3  35.560  323.273
```
* Rcpp + RcppParallel = 323x over R implementation!
* Unbalanced workload benefits from work stealing (see the sketch below)
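The gallery code computes distances between all pairs of matrix rows. To
illustrate why the workload is unbalanced, here is a simplified sketch
(Euclidean distance, not the gallery's metric) of the per-thread loop,
where `rmat` and `rres` are assumed `RMatrix<double>` members holding the
input and the output matrix:

```cpp
void operator()(std::size_t begin, std::size_t end) {
   for (std::size_t i = begin; i < end; i++) {
      // the inner loop runs i times, so later rows carry more work;
      // this imbalance is what work stealing smooths out
      for (std::size_t j = 0; j < i; j++) {
         double d = 0.0;
         for (std::size_t k = 0; k < rmat.ncol(); k++) {
            double diff = rmat(i, k) - rmat(j, k);
            d += diff * diff;
         }
         rres(i, j) = rres(j, i) = ::sqrt(d);
      }
   }
}
```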
## The Rest of TBB
* Advanced algorithms: `parallel_scan`, `parallel_while`, `parallel_do`, `parallel_pipeline`, `parallel_sort` (a `parallel_sort` sketch follows below)
* Containers: `concurrent_queue`, `concurrent_priority_queue`, `concurrent_vector`, `concurrent_hash_map`
* Mutual exclusion: `mutex`, `spin_mutex`, `queuing_mutex`, `spin_rw_mutex`, `queuing_rw_mutex`, `recursive_mutex`
* Atomic operations: `fetch_and_add`, `fetch_and_increment`, `fetch_and_decrement`, `compare_and_swap`, `fetch_and_store`
* Timing: portable fine grained global time stamp
* Task Scheduler: direct access to control the creation and activation of tasks
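A minimal, purely illustrative sketch of calling TBB's `parallel_sort`
directly (not through the RcppParallel wrappers), assuming the build links
against the TBB headers and library:

```cpp
#include <vector>
#include <tbb/parallel_sort.h>

// sort a vector of doubles using all available cores
void sortValues(std::vector<double>& v) {
   tbb::parallel_sort(v.begin(), v.end());
}
```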
## Open Issues
* Additional (portable to Win32 via TinyThread) wrappers for other TBB constructs?
* Alternatively, sort out Rtools configuration issues required to get TBB working on Windows.
* Education: Parallel Programming is _hard_.
* Simple `parallelFor` and `parallelReduce` are reasonably easy to grasp, but more advanced idioms aren't trivial to learn and use (though for some applications they have enough upside to be worth the effort).