---
title: "Configuring a developer environment"
description: >
  Learn how to configure your environment to allow you to contribute
  to the arrow package
output: rmarkdown::html_vignette
---
```{r setup-options, include=FALSE}
knitr::opts_chunk$set(error = TRUE, eval = FALSE)
# Get environment variables describing what to evaluate
run <- tolower(Sys.getenv("RUN_DEVDOCS", "false")) == "true"
macos <- tolower(Sys.getenv("DEVDOCS_MACOS", "false")) == "true"
ubuntu <- tolower(Sys.getenv("DEVDOCS_UBUNTU", "false")) == "true"
windows <- tolower(Sys.getenv("DEVDOCS_WINDOWS", "false")) == "true"
sys_install <- tolower(Sys.getenv("DEVDOCS_SYSTEM_INSTALL", "false")) == "true"
# Update the source knit_hook to save the chunk (if it is marked to be saved)
knit_hooks_source <- knitr::knit_hooks$get("source")
knitr::knit_hooks$set(source = function(x, options) {
  # Extra paranoia about when this will write the chunks to the script, we will
  # only save when:
  #   * CI is true
  #   * RUN_DEVDOCS is true
  #   * options$save is TRUE (and a check that not NULL won't crash it)
  if (as.logical(Sys.getenv("CI", FALSE)) && run && !is.null(options$save) && options$save)
    cat(x, file = "script.sh", append = TRUE, sep = "\n")
  # but hide the blocks we want hidden:
  if (!is.null(options$hide) && options$hide) {
    return(NULL)
  }
  knit_hooks_source(x, options)
})
```
```{bash, save=run, hide=TRUE}
# Stop on failure, echo input as we go
set -e
set -x
```
The Arrow R package is unique compared to other R packages that you may have
contributed to because it builds on top of the large and feature-rich Arrow C++
implementation. Because the R package integrates tightly with Arrow C++,
it typically requires a dedicated copy of the library (i.e., it is usually
not possible to link to a system version of libarrow during development).
## Option 1: Using nightly libarrow binaries
On Linux, macOS, and Windows you can use the same workflow you might use for another
package that contains compiled code (e.g., `R CMD INSTALL .` from
a terminal, `devtools::load_all()` from an R prompt, or `Install & Restart` from
RStudio). If the `arrow/r/libarrow` directory is not populated, the configure script will
attempt to download the latest nightly libarrow binary, extract it to the
`arrow/r/libarrow` directory (macOS, Linux) or `arrow/r/windows`
directory (Windows), and continue building the R package as usual.
Most of the time, you won't need to update your version of libarrow because
the R package rarely changes with updates to the C++ library; however, if you
start to get errors when rebuilding the R package, you may have to remove the
`libarrow` directory (macOS, Linux) or `windows` directory (Windows)
and do a "clean" rebuild. You can do this from a terminal with
`R CMD INSTALL . --preclean`, from RStudio using the "Clean and Install"
option from the "Build" tab, or using `make clean` if you are using the `Makefile`
located in the root of the R package.
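The clean rebuild described above could be sketched in a shell as follows. This is a minimal sketch, not part of the arrow repository: `clean_libarrow` is a hypothetical helper name, and it assumes you pass it the path to the R package directory (`arrow/r`).

```bash
# Hypothetical helper (not part of the arrow repository): remove the downloaded
# libarrow binaries so that the next install fetches a fresh nightly build.
# $1 is the path to the R package directory (arrow/r).
clean_libarrow() {
  # Refuse to run without an argument, so we never touch /libarrow or /windows
  if [ -z "$1" ]; then
    echo "usage: clean_libarrow <path-to-arrow/r>" >&2
    return 1
  fi
  rm -rf "$1/libarrow" "$1/windows"
}

# Usage, from the arrow/r directory:
#   clean_libarrow .
#   R CMD INSTALL . --preclean
```

Only the two directories named in this section are removed; everything else in the checkout is left alone.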
## Option 2: Use a local Arrow C++ development build
If you need to alter both libarrow and the R package code, or if you can't get a binary version of the latest libarrow elsewhere, you'll need to build it from source. This section discusses how to set up a C++ libarrow build configured to work with the R package. For more general resources, see the [Arrow C++ developer guide](https://arrow.apache.org/docs/developers/cpp/building.html).
There are four major steps to the process.
### Step 1 - Install dependencies
When building libarrow, by default, system dependencies will be used if suitable versions are found. If system dependencies are not present, libarrow will build them during its own build process. The only dependencies that you need to install _outside_ of the build process are [cmake](https://cmake.org/) (for configuring the build) and [openssl](https://www.openssl.org/) if you are building with S3 support.
For a faster build, you may choose to pre-install more C++ library dependencies (such as [lz4](http://lz4.github.io/lz4/), [zstd](https://facebook.github.io/zstd/), etc.) on the system so that they don't need to be built from source in the libarrow build.
#### Ubuntu
```{bash, save=run & ubuntu}
sudo apt install -y cmake libcurl4-openssl-dev libssl-dev
```
#### macOS
```{bash, save=run & macos}
brew install cmake openssl
```
### Step 2 - Configure the libarrow build
We recommend that you configure libarrow to be built to a user-level directory rather than a system directory for your development work. This is so that the development version you are using doesn't overwrite a released version of libarrow you may already have installed, and so that you are also able to work with more than one version of libarrow (by using different `ARROW_HOME` directories for the different versions).
In the example below, libarrow is installed to a directory called `dist` that has the same parent directory as the arrow checkout. Your installation of the Arrow R package can point to any directory with any name, though we recommend *not* placing it inside of the arrow git checkout directory, as unwanted changes could stop it from working properly.
```{bash, save=run & !sys_install}
export ARROW_HOME=$(pwd)/dist
mkdir $ARROW_HOME
```
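If you want to check the recommendation about keeping `ARROW_HOME` outside the git checkout, a rough sketch follows. `is_inside` is a hypothetical helper and `ARROW_CHECKOUT` is an assumed variable holding the path to your arrow clone; neither is defined by the arrow build. The check is a plain string comparison, so it does not resolve symlinks or relative paths.

```bash
# Hypothetical helper: succeed if path $1 is the same as, or inside, directory $2.
# This is a pure string comparison; it does not resolve symlinks or "..".
is_inside() {
  case "$1/" in
    "$2"/*) return 0 ;;
    *) return 1 ;;
  esac
}

# ARROW_CHECKOUT is an assumed variable pointing at your arrow git clone.
ARROW_CHECKOUT="${ARROW_CHECKOUT:-$(pwd)/arrow}"
ARROW_HOME="${ARROW_HOME:-$(pwd)/dist}"

if is_inside "$ARROW_HOME" "$ARROW_CHECKOUT"; then
  echo "Warning: ARROW_HOME ($ARROW_HOME) is inside the git checkout" >&2
fi
```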
_Special instructions on Linux:_ You will need to set `LD_LIBRARY_PATH` to the `lib` directory that is under where you set `$ARROW_HOME`, before launching R and using arrow. One way to do this is to add it to your profile (we use `~/.bash_profile` here, but you might need to put this in a different file depending on your setup, e.g. if you use a shell other than `bash`). On macOS you do not need to do this because the macOS shared library paths are hardcoded to their locations during build time.
```{bash, save=run & ubuntu & !sys_install}
export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH
echo "export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH" >> ~/.bash_profile
```
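Running the `echo ... >> ~/.bash_profile` line more than once appends a duplicate entry each time. One way to avoid that, sketched here under the assumption that an exact-line match is good enough, is a small helper that appends only when the line is missing. `append_once` is a hypothetical name, not something provided by arrow.

```bash
# Hypothetical helper: append line $1 to file $2 unless an identical line exists.
append_once() {
  touch "$2"
  # -x matches whole lines, -F treats the pattern as a fixed string
  grep -qxF -e "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# Usage (ARROW_HOME must already be set, as in the step above):
#   append_once "export LD_LIBRARY_PATH=$ARROW_HOME/lib:\$LD_LIBRARY_PATH" ~/.bash_profile
```

In the usage line, `$ARROW_HOME` is expanded when the line is written, while the escaped `\$LD_LIBRARY_PATH` is written literally and expanded each time the profile is sourced. The `echo` command above instead expands both variables at write time.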
Start by navigating in a terminal to the arrow repository. You will need to create a directory into which the C++ build will put its contents. We recommend that you make a `build` directory inside of the `cpp` directory of the Arrow git repository (it is git-ignored, so you won't accidentally check it in). Next, change directories to be inside `cpp/build`:
```{bash, save=run & !sys_install}
pushd arrow
mkdir -p cpp/build
pushd cpp/build
```
You'll first call `cmake` to configure the build and then `make install`. For the R package, you'll need to enable several features in libarrow using `-D` flags:
#### {.tabset}
##### Linux / macOS
```{bash, save=run & !sys_install & !windows}
cmake \
  -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
  -DCMAKE_INSTALL_LIBDIR=lib \
  -DARROW_COMPUTE=ON \
  -DARROW_CSV=ON \
  -DARROW_DATASET=ON \
  -DARROW_EXTRA_ERROR_CONTEXT=ON \
  -DARROW_FILESYSTEM=ON \
  -DARROW_INSTALL_NAME_RPATH=OFF \
  -DARROW_JEMALLOC=ON \
  -DARROW_JSON=ON \
  -DARROW_PARQUET=ON \
  -DARROW_WITH_SNAPPY=ON \
  -DARROW_WITH_ZLIB=ON \
  ..
```
#### {-}
`..` refers to the C++ source directory: you're in `cpp/build` and the source is in `cpp`.
```{bash, save=run & !sys_install, hide=TRUE}
# For testing purposes, build with only shared libraries
cmake \
  -DARROW_BUILD_SHARED=ON \
  -DARROW_BUILD_STATIC=OFF \
  ..
```
#### Enabling more Arrow features
To enable optional features, including S3 and Google Cloud Storage support, an alternative memory allocator, and additional compression libraries, add some or all of these flags to your call to `cmake` (the trailing `\` makes them easier to paste into a bash shell on a new line):
```bash
-DARROW_GCS=ON \
-DARROW_MIMALLOC=ON \
-DARROW_S3=ON \
-DARROW_WITH_BROTLI=ON \
-DARROW_WITH_BZ2=ON \
-DARROW_WITH_LZ4=ON \
-DARROW_WITH_SNAPPY=ON \
-DARROW_WITH_ZSTD=ON \
```
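As an alternative to pasting backslash-terminated lines, the optional flags can be generated from a list of feature names. The following is only a sketch: `arrow_feature_flags` is a hypothetical helper, and the feature names are the ones listed above.

```bash
# Hypothetical helper: print one -D<FEATURE>=ON flag per argument.
arrow_feature_flags() {
  for feature in "$@"; do
    printf '%s\n' "-D${feature}=ON"
  done
}

# Print the flags for a few of the optional features listed above
arrow_feature_flags ARROW_S3 ARROW_GCS ARROW_WITH_ZSTD

# Usage, from cpp/build, together with the required flags shown earlier.
# The command substitution is deliberately unquoted so each flag becomes its own argument:
#   cmake <required -D flags> $(arrow_feature_flags ARROW_S3 ARROW_WITH_ZSTD) ..
```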
Other flags that may be useful:
* `-DBoost_SOURCE=BUNDLED` and `-DThrift_SOURCE=BUNDLED`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source.
* `-DCMAKE_BUILD_TYPE=debug` or `-DCMAKE_BUILD_TYPE=relwithdebinfo` can be useful for debugging. You probably don't want to do this generally because a debug build is much slower at runtime than the default `release` build.
* `-DARROW_BUILD_STATIC=ON` and `-DARROW_BUILD_SHARED=OFF` if you want to use static libraries instead of dynamic libraries. With static libraries there isn't a risk of the R package linking to the wrong library, but it does mean if you change the C++ code you have to recompile both the C++ libraries and the R package. Compilers typically will link to static libraries only if the dynamic ones are not present, which is why we need to set `-DARROW_BUILD_SHARED=OFF`. If you are switching after compiling and installing previously, you may need to remove the `.dll` or `.so` files from `$ARROW_HOME/dist/bin`.
_Note:_ `cmake` is particularly sensitive to whitespace. If you see errors, check that you don't have any errant whitespace.
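One common form of errant whitespace is a space or tab after a line-continuation backslash, which ends the command early. If you keep your `cmake` call in a script, a check along these lines may help find such lines. `find_broken_continuations` is a hypothetical helper name, and the sketch assumes the command is saved in a file.

```bash
# Hypothetical helper: print (with line numbers) every line of file $1 in which
# a backslash is followed by trailing whitespace instead of the end of the line.
# Returns non-zero when no such line is found.
find_broken_continuations() {
  grep -nE '\\[[:space:]]+$' "$1"
}

# Usage, assuming your cmake call is saved in a file named configure.sh:
#   find_broken_continuations configure.sh
```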
### Step 3 - Building libarrow
You can add `-j#` at the end of the build command to speed up compilation by running in parallel (where `#` is the number of cores you have available).
```{bash, save=run & !(sys_install & ubuntu)}
cmake --build . --target install -j8
```
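Rather than hard-coding the job count, it can be detected. This is a rough sketch: `detect_cores` is a hypothetical helper, and it assumes `nproc` (GNU coreutils) is available on Linux and `sysctl -n hw.ncpu` on macOS, falling back to a single job otherwise.

```bash
# Hypothetical helper: print the number of available CPU cores, falling back to 1.
detect_cores() {
  if command -v nproc >/dev/null 2>&1; then
    nproc
  elif [ "$(uname)" = "Darwin" ]; then
    sysctl -n hw.ncpu
  else
    echo 1
  fi
}

JOBS=$(detect_cores)
# Print the build command with the detected job count; run it from cpp/build
echo "cmake --build . --target install -j${JOBS}"
```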
### Step 4 - Build the Arrow R package
Once you've built libarrow, you can install the R package and its dependencies, along with additional dev dependencies, from the git checkout as shown below. You may need to choose a repository interactively, or you can pass one to `install.packages()` with `repos = "https://cloud.r-project.org"`.
```{bash, save=run}
popd # To go back to the root directory of the project, from cpp/build
pushd r
R -e 'install.packages("remotes"); remotes::install_deps(dependencies = TRUE)'
R CMD INSTALL --no-multiarch .
```
The `--no-multiarch` flag makes it only compile on the "main" architecture. This will compile for the architecture that the R in your path corresponds to. If you compile on one architecture and then switch to another, make sure to pass the `--preclean` flag so that the R package code is recompiled for the new architecture. Otherwise, you may see errors like `LoadLibrary failure: %1 is not a valid Win32 application`.
#### Compilation flags
If you need to set any compilation flags while building the C++
extensions, you can use the `ARROW_R_CXXFLAGS` environment variable. For
example, if you are using `perf` to profile the R extensions, you may
need to set
```bash
export ARROW_R_CXXFLAGS=-fno-omit-frame-pointer
```
#### Recompiling the C++ code
With the setup described here, you should not need to rebuild the Arrow library or even the C++ source in the R package as you iterate and work on the R package. The only time those should need to be rebuilt is if you have changed the C++ in the R package (and even then, `R CMD INSTALL .` should only need to recompile the files that have changed) _or_ if the libarrow C++ has changed and there is a mismatch between libarrow and the R package. If you find yourself rebuilding either or both each time you install the package or run tests, something is probably wrong with your setup.
<details>
<summary>For a full build: a `cmake` command with all of the R-relevant optional dependencies turned on. Development with other languages might require different flags as well. For example, to develop Python, you would need to also add `-DARROW_PYTHON=ON` (though all of the other flags used for Python are already included here).</summary>
<p>
```bash
cmake \
  -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
  -DCMAKE_INSTALL_LIBDIR=lib \
  -DARROW_COMPUTE=ON \
  -DARROW_CSV=ON \
  -DARROW_DATASET=ON \
  -DARROW_EXTRA_ERROR_CONTEXT=ON \
  -DARROW_FILESYSTEM=ON \
  -DARROW_GCS=ON \
  -DARROW_INSTALL_NAME_RPATH=OFF \
  -DARROW_JEMALLOC=ON \
  -DARROW_JSON=ON \
  -DARROW_MIMALLOC=ON \
  -DARROW_PARQUET=ON \
  -DARROW_S3=ON \
  -DARROW_WITH_BROTLI=ON \
  -DARROW_WITH_BZ2=ON \
  -DARROW_WITH_LZ4=ON \
  -DARROW_WITH_SNAPPY=ON \
  -DARROW_WITH_ZLIB=ON \
  -DARROW_WITH_ZSTD=ON \
  ..
```
</p>
</details>
## Installing a version of the R package with a specific git reference
If you need an arrow installation from a specific repository or git reference, on most platforms except Windows, you can run:
```{r}
remotes::install_github("apache/arrow/r", build = FALSE)
```
The `build = FALSE` argument is important so that the installation can access the
C++ source in the `cpp/` directory in `apache/arrow`.
As with other installation methods, setting the environment variables `LIBARROW_MINIMAL=false` and `ARROW_R_DEV=true` will provide a more full-featured version of Arrow and provide more verbose output, respectively.
For example, to install from the (fictional) branch `bugfix` from `apache/arrow` you could run:
```r
Sys.setenv(LIBARROW_MINIMAL="false")
remotes::install_github("apache/arrow/r@bugfix", build = FALSE)
```
Developers may wish to use this method of installing a specific commit
separate from another Arrow development environment or system installation
(e.g. we use this in [arrowbench](https://github.com/ursacomputing/arrowbench)
to install development versions of libarrow isolated from the system install). If
you already have libarrow installed system-wide, you may need to set
some additional variables in order to isolate this build from your system libraries:
* Setting the environment variable `FORCE_BUNDLED_BUILD` to `true` will skip the `pkg-config` search for libarrow and attempt to build from the same source at the repository+ref given.
* You may also need to set the Makevars `CPPFLAGS` and `LDFLAGS` to `""` in order to prevent the installation process from attempting to link to already installed system versions of libarrow. One way to do this temporarily is wrapping your `remotes::install_github()` call like so:
```{r}
withr::with_makevars(list(CPPFLAGS = "", LDFLAGS = ""), remotes::install_github(...))
```
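The pieces above can also be combined and driven from a terminal. The following is a sketch under stated assumptions: `arrow_install_expr` is a hypothetical helper that only prints the R expression, and the usage line assumes that `Rscript` is on your path with the `remotes` and `withr` packages installed.

```bash
# Hypothetical helper: print an R expression that installs arrow from git ref $1
# with the CPPFLAGS and LDFLAGS Makevars cleared.
arrow_install_expr() {
  printf 'withr::with_makevars(list(CPPFLAGS = "", LDFLAGS = ""), remotes::install_github("apache/arrow/r@%s", build = FALSE))\n' "$1"
}

# Usage (assumes R with the remotes and withr packages installed):
#   FORCE_BUNDLED_BUILD=true LIBARROW_MINIMAL=false ARROW_R_DEV=true \
#     Rscript -e "$(arrow_install_expr bugfix)"
```

Printing the expression first makes it easy to inspect what will run before handing it to `Rscript`.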
# Summary of environment variables
* See the user-facing [article on installation](../install.html) for a large number of
environment variables that determine how the build works and what features
get built.
* `ARROW_OFFLINE_BUILD`: When set to `true`, the build script will not download
the prebuilt C++ library binary or, if needed, `cmake`.
It will turn off any features that require a download, unless they're available
in `ARROW_THIRDPARTY_DEPENDENCY_DIR` or the `tools/thirdparty_download/` subfolder.
`create_package_with_all_dependencies()` creates that subfolder.
# Troubleshooting
Note that after any change to libarrow, you must reinstall it and
run `make clean` or `git clean -fdx .` to remove any cached object code
in the `r/src/` directory before reinstalling the R package. This is
only necessary if you make changes to libarrow source; you do not
need to manually purge object files if you are only editing R or C++
code inside `r/`.
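The sequence described above (reinstall libarrow, clean the cached object code, reinstall the R package) might be scripted as follows. This is a sketch, not an official script: `run` and `rebuild_after_libarrow_change` are hypothetical helpers, the build directory is assumed to be `cpp/build` as configured in Step 2, and `make clean` relies on the `Makefile` in the root of the R package.

```bash
# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1 is the path to the arrow git checkout.
# Assumes cpp/build was already configured with cmake as in Step 2.
rebuild_after_libarrow_change() {
  run cmake --build "$1/cpp/build" --target install &&
    run make -C "$1/r" clean &&
    run R CMD INSTALL "$1/r"
}

# Usage: preview the commands first, then run them for real.
#   DRY_RUN=1 rebuild_after_libarrow_change ~/arrow
#   rebuild_after_libarrow_change ~/arrow
```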
## Arrow library - R package mismatches
If libarrow and the R package have diverged, you will see errors like:
```
Error: package or namespace load failed for ‘arrow' in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so':
dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so, 6): Symbol not found: __ZN5arrow2io16RandomAccessFile9ReadAsyncERKNS0_9IOContextExx
Referenced from: /Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so
Expected in: flat namespace
in /Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so
Error: loading failed
Execution halted
ERROR: loading failed
```
To resolve this, try [rebuilding the Arrow library](#step-3---building-libarrow).
## Multiple versions of libarrow
If you are installing from a user-level directory, and you already have a
previous installation of libarrow in a system directory, you may get
errors like the following when you install the R package:
```
Error: package or namespace load failed for ‘arrow' in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so':
dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: /usr/local/lib/libarrow.400.dylib
Referenced from: /usr/local/lib/libparquet.400.dylib
Reason: image not found
```
If this happens, you need to make sure that you don't let R link to your system
library when building arrow. You can do this a number of different ways:
* Setting the `MAKEFLAGS` environment variable to `"LDFLAGS="` (see below for an example). This is the recommended way to accomplish this.
* Using {withr}'s `with_makevars(list(LDFLAGS = ""), ...)`.
* Adding `LDFLAGS=` to your `~/.R/Makevars` file (the least recommended way, though it is a common debugging approach suggested online).
```{bash, save=run & !sys_install & macos, hide=TRUE}
# Setup troubleshooting section
# install a system-level arrow on macOS
brew install apache-arrow
```
```{bash, save=run & !sys_install & ubuntu, hide=TRUE}
# Setup troubleshooting section
# install a system-level arrow on Ubuntu
sudo apt update
sudo apt install -y -V ca-certificates lsb-release wget
wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
sudo apt update
sudo apt install -y -V libarrow-dev
```
```{bash, save=run & !sys_install & macos}
MAKEFLAGS="LDFLAGS=" R CMD INSTALL .
```
## `rpath` issues
If the package fails to install/load with an error like this:
```
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for 'arrow' in dyn.load(file, DLLpath = DLLpath, ...):
unable to load shared object '/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so':
dlopen(/Users/you/R/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: @rpath/libarrow.14.dylib
```
ensure that `-DARROW_INSTALL_NAME_RPATH=OFF` was passed (this is important on
macOS to prevent problems at link time and is a no-op on other platforms).
Alternatively, try setting the environment variable `R_LD_LIBRARY_PATH` to
wherever Arrow C++ was put in `make install`, e.g. `export
R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package.
When installing from source, if the R and C++ library versions do not
match, installation may fail. If you've previously installed the
libraries and want to upgrade the R package, you'll need to update the
Arrow C++ library first.
For any other build/configuration challenges, see the [C++ developer
guide](https://arrow.apache.org/docs/developers/cpp/building.html).
## Other installation issues
There are a number of scripts that are triggered when the arrow R package is installed. For package users who are not interacting with the underlying code, these should all just work without configuration and pull in the most complete pieces (e.g. official binaries that we host). However, knowing about these scripts can help package developers troubleshoot if things go wrong in them or things go wrong in an install. See [the article on R package installation](./install_details.html) for more information.