diff --git a/r/_pkgdown.yml b/r/_pkgdown.yml index a76345e1abf..08578baad53 100644 --- a/r/_pkgdown.yml +++ b/r/_pkgdown.yml @@ -80,6 +80,8 @@ navbar: href: articles/developers/workflow.html - text: Debugging href: articles/developers/debugging.html + - text: Package Installation Details + href: articles/developers/install_details.html reference: - title: Multi-file datasets contents: diff --git a/r/vignettes/developers/install_details.Rmd b/r/vignettes/developers/install_details.Rmd new file mode 100644 index 00000000000..aef6420ca38 --- /dev/null +++ b/r/vignettes/developers/install_details.Rmd @@ -0,0 +1,124 @@ +--- +title: "How the R package is installed - advanced" +--- + +This document is intended specifically for arrow _developers_ who wish to know +more about the scripts that run during installation. If you are an arrow _user_ looking for help with +installing arrow, please see [the installation guide](../install.html). + +The arrow R package requires the Arrow C++ library (also known as libarrow) to +be installed in order to work properly. There are a number of different ways +in which libarrow could be installed: + +* as part of the R package installation process +* a system package +* a library you've built yourself outside of the context of installing the R package + +Below, we discuss each of these setups in turn. + +# Installing libarrow during R package installation + +There are a number of scripts that are triggered +when `R CMD INSTALL .` is run. For Arrow users, these should all just work +without configuration and pull in the most complete pieces (e.g. official +binaries that we host). One of the jobs of these scripts is to work out +if libarrow is installed, and if not, install it. + +An overview of these scripts is shown below: + +* `configure` and `configure.win` - these scripts are triggered during +`R CMD INSTALL .` on non-Windows and Windows platforms, respectively.
They +handle finding libarrow, setting up the necessary build variables, and +writing the package Makevars file that is used to compile the C++ code in the R +package. + +* `tools/nixlibs.R` - this script is sometimes called by `configure` on Linux +(or on any non-Windows OS with the environment variable +`FORCE_BUNDLED_BUILD=true`) if an existing libarrow installation cannot be found. +This sets up the build process for our bundled builds (which are the default on +Linux) and checks for binaries or downloads libarrow from source depending on +dependency availability and build configuration. + +* `tools/winlibs.R` - this script is sometimes called by `configure.win` on Windows +when the environment variable `ARROW_HOME` is not set. It looks for an existing libarrow +installation, and if it can't find one, downloads an appropriate libarrow binary. + +* `inst/build_arrow_static.sh` - called by `tools/nixlibs.R` when libarrow +needs to be built. It builds libarrow for a bundled, static build, and +mirrors the steps described in the ["Arrow R Developer Guide" vignette](./setup.html). +This build script is also what is used to generate our prebuilt binaries. + +The actions taken by these scripts to resolve dependencies and install the +correct components are described below. + +## How the R package finds libarrow + +### Windows + +The diagram below shows how the R package finds a libarrow installation on Windows. + +```{r, echo=FALSE, out.width="70%"} +knitr::include_graphics("./install_diagram_windows.png") +``` + +### Linux + +The diagram below shows how the R package finds a libarrow installation on non-Windows systems. + +```{r, echo=FALSE, out.width="70%"} +knitr::include_graphics("./install_nix.png") +``` + +More information about these steps can be found below.
+ +#### Using pkg-config + +When you install the arrow R package on Linux, if no environment variables +relating to the location of an existing libarrow installation have already been +set, the installation code will attempt to find libarrow on +your system using the `pkg-config` command. + +This will find either installed system packages or libraries you've built yourself. +In order for `install.packages("arrow")` to work with these system packages, +you'll need to install them before installing the R package. + +#### Prebuilt binaries + +If libarrow is not found on the system, the R package installation +script will next attempt to download prebuilt libarrow binaries +that match both your local operating system and arrow R package version. +The libarrow binaries will only be retrieved if you have set the environment variable +`LIBARROW_BINARY` or `NOT_CRAN`. + +If found, they will be downloaded and bundled when your R package compiles. +For a list of supported distributions and versions, +see the [arrow-r-nightly](https://github.com/ursa-labs/arrow-r-nightly/blob/master/README.md) project. + +#### Building from source + +If no libarrow binary is found, the installation script will attempt to build libarrow locally. +First, it will look to see if you are in a checkout of the `apache/arrow` +git repository and thus have the libarrow source files there. +Otherwise, it builds from the source files included in the package. +Depending on your system, building libarrow from source may be slow. If +libarrow is built from source, `inst/build_arrow_static.sh` is executed.
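
To see what the `pkg-config` step described above would find on a given machine, a check along the following lines may help. This is only a sketch: it assumes libarrow registers itself with `pkg-config` under the module name `arrow`, and `check_libarrow` is a hypothetical helper that is not part of the package's install scripts.

```shell
# Hypothetical helper: report whether pkg-config can see a libarrow install.
# Assumption: the pkg-config module for libarrow is named "arrow".
check_libarrow() {
  if ! command -v pkg-config >/dev/null 2>&1; then
    # Without pkg-config, this lookup step cannot succeed at all.
    echo "pkg-config not available"
  elif pkg-config --exists arrow; then
    # Print the version so it can be compared with the R package version.
    echo "libarrow found: $(pkg-config --modversion arrow)"
  else
    echo "libarrow not found"
  fi
}

check_libarrow
```

If this reports that libarrow was not found, the installation would be expected to move on to the prebuilt binary and source build steps described above.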
+ +# Using the R package with libarrow installed as a system package + +If you are authorized to install system packages and you're installing a CRAN release, +you may want to use the official Apache Arrow release packages corresponding to +the R package version via software distribution tools such as `apt` or `yum` +(though there are some drawbacks: see the +["Troubleshooting" section in the main installation docs](../install.html)). +See the [Arrow project installation page](https://arrow.apache.org/install/) +to find pre-compiled binary packages for some common Linux distributions, +including Debian, Ubuntu, and CentOS. + +Generally, we do not recommend this method of using libarrow with the R +package unless you have a specific reason to do so. + +# Using the R package with an existing libarrow build + +This setup is much more common for arrow developers, who may need to make +changes to both the R package and libarrow source code. See +the [developer setup docs](./setup.html) for more information. diff --git a/r/vignettes/developers/install_diagram_windows.png b/r/vignettes/developers/install_diagram_windows.png new file mode 100644 index 00000000000..9fc50ea9330 Binary files /dev/null and b/r/vignettes/developers/install_diagram_windows.png differ diff --git a/r/vignettes/developers/install_nix.png b/r/vignettes/developers/install_nix.png new file mode 100644 index 00000000000..e8ddef94c19 Binary files /dev/null and b/r/vignettes/developers/install_nix.png differ diff --git a/r/vignettes/install.Rmd b/r/vignettes/install.Rmd index 7c813f85c32..1ef8a35b6ef 100644 --- a/r/vignettes/install.Rmd +++ b/r/vignettes/install.Rmd @@ -7,49 +7,249 @@ vignette: > %\VignetteEncoding{UTF-8} --- -On macOS and Windows, when you `install.packages("arrow")`, -you get a binary package that contains Arrow’s C++ dependencies along with it.
-On Linux, `install.packages()` retrieves a source package that has to be compiled locally, -and C++ dependencies need to be resolved as well. +The Apache Arrow project is implemented in multiple languages, and the R package depends on the Arrow C++ library (referred to from here on as libarrow). This means that when you install arrow, you need both the R and C++ versions. If you install arrow from CRAN on a machine running Windows or macOS, when you call `install.packages("arrow")`, a precompiled binary containing both the R package and libarrow will be downloaded. However, CRAN does not host R package binaries for Linux, and so you must choose one of the alternative approaches. -On linux we recommend one of the following for the quickest and easiest -installation: +This vignette outlines the recommended approaches to installing arrow on Linux, starting from the simplest and least customisable and progressing to the most complex, which offers more flexibility to customise your installation. -* Set the environment variable `NOT_CRAN=true` before installing, which will both - check for compatible Apache binaries and use those and if those aren't available - set a more fully-featured build than default. -* Using [RStudio's public package manager](https://packagemanager.rstudio.com/client/#/) - which includes pre-built binaries +The intended audience for this document is arrow R package _users_ on Linux, and not Arrow _developers_. +If you're contributing to the Arrow project, see `vignette("developing", package = "arrow")` for +resources to help you set up your development environment. You can also find +a more detailed discussion of the code run during the installation process in the +[developers' installation docs](https://arrow.apache.org/docs/r/articles/developers/install_details.html). -Our goal is to make `install.packages("arrow")` "just work" for as many Linux distributions, -versions, and configurations as possible with the above options. +> Having trouble installing arrow?
See the "Troubleshooting" section below. -This rest of this document describes how it works and the options for fine-tuning Linux installation. -The intended audience for this document is `arrow` R package users on Linux, not Arrow developers. -If you're contributing to the Arrow project, see `vignette("developing", package = "arrow") for guidance on setting up your development environment. +# Installing a release version (the easy way) + +## Method 1 - Installation with a precompiled libarrow binary + +As mentioned above, on macOS and Windows, when you run `install.packages("arrow")`, and install arrow from CRAN, you get an R binary package that contains a precompiled version of libarrow, though CRAN does not host binary packages for Linux. This means that the default behaviour when you run `install.packages()` on Linux is to retrieve the source version of the R package that has to be compiled locally, including building libarrow from source. See method 2 below for details of this. + +For a faster installation, we recommend that you instead use one of the methods below for installing arrow with a precompiled libarrow binary. + +### Method 1a - Binary R package containing libarrow binary via RSPM/conda + +```{r, echo=FALSE, out.width="30%"} +knitr::include_graphics("./r_binary_libarrow_binary.png") +``` + +If you want a quicker installation process, and by default a more fully-featured build, you could install arrow from [RStudio's public package manager](https://packagemanager.rstudio.com/client/#/), which hosts binaries for both Windows and Linux. + +For example, if you are using Ubuntu 20.04 (Focal): + +```{r, eval = FALSE} +install.packages("arrow", repos = "https://packagemanager.rstudio.com/all/__linux__/focal/latest") +``` + +For other Linux distributions, to get the relevant URL, you can visit +[the RSPM site](https://packagemanager.rstudio.com/client/#/repos/1/overview), +click on 'binary', and select your preferred distribution. 
+ +Similarly, if you use `conda` to manage your R environment, you can get the +latest official release of the R package including libarrow via: + +```shell +conda install -c conda-forge --strict-channel-priority r-arrow +``` + +### Method 1b - R source package with libarrow binary + +```{r, echo=FALSE, out.width="50%"} +knitr::include_graphics("./r_source_libarrow_binary.png") +``` + +Another way of achieving faster installation with all key features enabled is to use our self-hosted libarrow binaries. You can do this by setting the `NOT_CRAN` environment variable before you call `install.packages()`: + +```{r, eval = FALSE} +Sys.setenv("NOT_CRAN" = TRUE) +install.packages("arrow") +``` + +This installs the source version of the R package, but during the installation process will check for compatible libarrow binaries that we host and use those if available. If no binary is available or one can't be found, then this option falls back to method 2 below, but results in a more fully-featured build than default. + +# Installing libarrow dependencies + +When you install libarrow, its dependencies will be automatically downloaded. +The environment variable `ARROW_DEPENDENCY_SOURCE` controls whether the libarrow +installation also downloads or installs all dependencies (when set to `BUNDLED`), +uses only system-installed dependencies (when set to `SYSTEM`) or checks +system-installed dependencies first and only installs dependencies which aren't +already present (when set to `AUTO`). + +These dependencies vary by platform; however, if you wish to install these +yourself prior to libarrow installation, we recommend that you take a look at +the [docker file for whichever of our CI builds](https://github.com/apache/arrow/tree/master/ci/docker) +(the ones ending in "cpp" are for building Arrow's C++ libraries aka libarrow) +corresponds most closely to your setup. This will contain the most up-to-date +information about dependencies and minimum versions.
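
As a small illustration of the `ARROW_DEPENDENCY_SOURCE` setting described above, the sketch below selects the `AUTO` behaviour in the shell before R is started. The choice of `AUTO` is only an example; `BUNDLED` and `SYSTEM` are set the same way.

```shell
# Check system-installed dependencies first and install only the missing ones.
# Use BUNDLED or SYSTEM here to select one of the other behaviours instead.
export ARROW_DEPENDENCY_SOURCE=AUTO

# Confirm the value that a subsequently started R session will inherit.
echo "ARROW_DEPENDENCY_SOURCE=${ARROW_DEPENDENCY_SOURCE}"
```

An R session started from the same shell inherits the variable, so a later `install.packages("arrow")` that builds libarrow from source would pick it up.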
+ +## Dependencies for S3 support + +The arrow package allows you to work with data in AWS S3 or in other cloud +storage systems that emulate S3. However, support for working with S3 is not +enabled in the default build, and it has additional system requirements. To +enable it, set the environment variable `LIBARROW_MINIMAL=false` or +`NOT_CRAN=true` to choose the full-featured build, or more selectively set +`ARROW_S3=ON`. You also need the following system dependencies: + +* `gcc` >= 4.9 or `clang` >= 3.3; note that the default compiler on CentOS 7 is gcc 4.8.5, which is not sufficient +* CURL: install `libcurl-devel` (rpm) or `libcurl4-openssl-dev` (deb) +* OpenSSL >= 1.0.2: install `openssl-devel` (rpm) or `libssl-dev` (deb) + +The prebuilt libarrow binaries come with S3 support enabled, so you will need to meet these system requirements in order to use them--the package will not install without them (and will error with a message that explains this). If you're building everything from source, the install script will check for the presence of these dependencies and turn off S3 support in the build if the prerequisites are not met--installation will succeed but without S3 functionality. If afterwards you install the missing system requirements, you'll need to reinstall the package in order to enable S3 support. + +# Installing a release version (the less easy way) + +## Method 2 - Installing an R source package and building libarrow from source + +```{r, echo=FALSE, out.width="50%"} +knitr::include_graphics("./r_source_libarrow_source.png") +``` Generally compiling and installing R packages with C++ dependencies, requires either installing system packages, which you may not have privileges to do, or building the C++ dependencies separately, which introduces all sorts of -additional ways for things to go wrong. +additional ways for things to go wrong, which is why we recommend method 1 above.
-Note also that if you use `conda` to manage your R environment, this document does not apply. -You can `conda install -c conda-forge --strict-channel-priority r-arrow` and you'll get the latest official -release of the R package along with any C++ dependencies. +However, if you wish to fine-tune or customise your Linux installation, the +instructions in this section explain how to do that. -> Having trouble installing `arrow`? See the "Troubleshooting" section below. +### Basic configuration for building from source with fully featured installation -# Installation basics +If you wish to install libarrow from source instead of looking for pre-compiled +binaries, you can set the `LIBARROW_BINARY` variable. -Install the latest release of `arrow` from CRAN with +```{r, eval = FALSE} +Sys.setenv("LIBARROW_BINARY" = FALSE) +``` -```r -Sys.setenv(NOT_CRAN = TRUE) +By default, this is set to `TRUE`, and so libarrow will only be built from +source if this environment variable is set to `FALSE` or no compatible binary +for your OS can be found. + +When compiling libarrow from source, you have the power to really fine-tune +which features to install. You can set the environment variable +`LIBARROW_MINIMAL` to `FALSE` to enable a more full-featured build including S3 support +and alternative memory allocators. + +```{r, eval = FALSE} +Sys.setenv("LIBARROW_MINIMAL" = FALSE) +``` + +By default this variable is unset; if set to `TRUE` a trimmed-down version of +arrow is installed with many features disabled. + +Note that in this guide, you will have seen us mention the environment variable +`NOT_CRAN` - this is a convenience variable, which when set to `TRUE`, +automatically sets `LIBARROW_MINIMAL` to `FALSE` and `LIBARROW_BINARY` to `TRUE`. + +Building libarrow from source requires more time and resources than installing +a binary. 
We recommend that you set the environment variable `ARROW_R_DEV` to +`TRUE` for more verbose output during the installation process if anything goes +wrong. + +```{r, eval = FALSE} +Sys.setenv("ARROW_R_DEV" = TRUE) +``` + +Once you have set these variables, call `install.packages()` to install arrow +using this configuration. + +```{r, eval = FALSE} install.packages("arrow") ``` -Daily development builds, which are not official releases, -can be installed from the Ursa Labs repository: +The section below discusses environment variables you can set before calling +`install.packages("arrow")` to build from source and customise your configuration. + +### Advanced configuration for building from source + +In this section, we describe how to fine-tune your installation at a more granular level. + +#### libarrow configuration + +Some features are optional when you build Arrow from source - you can configure +whether these components are built via the use of environment variables. The +names of the environment variables which control these features and their +default values are shown below. 
+ +| Name | Description | Default Value | +| ---| --- | :-: | +| `ARROW_S3` | S3 support (if dependencies are met)* | `OFF` | +| `ARROW_JEMALLOC` | The `jemalloc` memory allocator | `ON` | +| `ARROW_MIMALLOC` | The `mimalloc` memory allocator | `ON` | +| `ARROW_PARQUET` | | `ON` | +| `ARROW_DATASET` | | `ON` | +| `ARROW_JSON` | The JSON parsing library | `ON` | +| `ARROW_WITH_RE2` | The RE2 regular expression library, used in some string compute functions | `ON` | +| `ARROW_WITH_UTF8PROC` | The UTF8Proc string library, used in many other string compute functions | `ON` | +| `ARROW_WITH_BROTLI` | Compression algorithm | `ON` | +| `ARROW_WITH_BZ2` | Compression algorithm | `ON` | +| `ARROW_WITH_LZ4` | Compression algorithm | `ON` | +| `ARROW_WITH_SNAPPY` | Compression algorithm | `ON` | +| `ARROW_WITH_ZLIB` | Compression algorithm | `ON` | +| `ARROW_WITH_ZSTD` | Compression algorithm | `ON` | + +#### R package configuration + +There are a number of other variables that affect the `configure` script and +the bundled build script. All boolean variables are case-insensitive. + +| Name | Description | Default | +| --- | --- | :-: | +| `LIBARROW_BUILD` | Allow building from source | `true` | +| `LIBARROW_BINARY` | Try to install `libarrow` binary instead of building from source | `true` | +| `LIBARROW_MINIMAL` | Build with minimal features enabled | (unset) | +| `NOT_CRAN` | Set `LIBARROW_BINARY=true` and `LIBARROW_MINIMAL=false` | `false` | +| `ARROW_R_DEV` | More verbose messaging and regenerates some code | `false` | +| `ARROW_USE_PKG_CONFIG` | Use `pkg-config` to search for `libarrow` install | `true` | +| `LIBARROW_DEBUG_DIR` | Directory to save source build logs | (unset) | +| `CMAKE` | Alternative CMake path | (unset) | + +See below for more in-depth explanations of these environment variables. + +* `LIBARROW_BINARY` : If set to `true`, the script will try to download a binary + C++ library built for your operating system. 
You may also set it to some other string, a related "distro-version" that has binaries built that work for your OS. See the [distro map](https://raw.githubusercontent.com/ursa-labs/arrow-r-nightly/master/linux/distro-map.csv) for compatible binaries and OSs. If no binary is found, installation will fall back to building C++ dependencies from source. +* `LIBARROW_BUILD` : If set to `false`, the build script + will not attempt to build the C++ library from source. This means you will only get + a working arrow R package if a prebuilt binary is found. + Use this if you want to avoid compiling the C++ library, which may be slow + and resource-intensive, and ensure that you only use a prebuilt binary. +* `LIBARROW_MINIMAL` : If set to `false`, the build script + will enable some optional features, including S3 + support and additional alternative memory allocators. This will increase the + source build time but result in a more fully functional library. If set to + `true`, it turns off Parquet, Datasets, compression libraries, and other optional + features. This is not commonly used but may be helpful if you need to compile + on a platform that does not support these features, e.g. Solaris. +* `NOT_CRAN` : If this variable is set to `true`, as the `devtools` package does, + the build script will set `LIBARROW_BINARY=true` and `LIBARROW_MINIMAL=false` + unless those environment variables are already set. This provides for a more + complete and fast installation experience for users who already have + `NOT_CRAN=true` as part of their workflow, without requiring additional + environment variables to be set. +* `ARROW_R_DEV` : If set to `true`, more verbose messaging will be printed + in the build script. `arrow::install_arrow(verbose = TRUE)` sets this. + This variable is also needed if you're modifying C++ + code in the package: see the developer guide vignette.
+* `ARROW_USE_PKG_CONFIG`: If set to `false`, the configure script won't look for +Arrow libraries on your system and instead will look to download/build them. + Use this if you have a version mismatch between installed system libraries and + the version of the R package you're installing. +* `LIBARROW_DEBUG_DIR` : If the C++ library building from source fails (`cmake`), + there may be messages telling you to check some log file in the build directory. + However, when the library is built during R package installation, + that location is in a temp directory that is already deleted. + To capture those logs, set this variable to an absolute (not relative) path + and the log files will be copied there. + The directory will be created if it does not exist. +* `CMAKE` : When building the C++ library from source, you can specify a + `/path/to/cmake` to use a different version than whatever is found on the `$PATH`. + +# Install the nightly build + +Daily development builds, which are not official releases, can be installed +from the Ursa Labs repository: ```r Sys.setenv(NOT_CRAN = TRUE) @@ -62,6 +262,8 @@ or for conda users via: conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow ``` +# Install from git repo + You can also install the R package from a git checkout: ```shell @@ -70,57 +272,61 @@ cd arrow/r R CMD INSTALL . ``` -If you don't already have the Arrow C++ libraries on your system, +If you don't already have libarrow on your system, when installing the R package from source, it will also download and build -the Arrow C++ libraries for you. To speed installation up, you can set +libarrow for you. See the section above on build environment +variables for options for configuring the build source and enabled features. 
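
The build environment variables described above can be combined with an install from a git checkout. The following is a minimal sketch rather than a recommended configuration: the log directory `/tmp/arrow-build-logs` and the helper name `install_arrow_from_checkout` are illustrative assumptions, and the function is only defined here, not run.

```shell
# Request the full-featured build (S3 support, alternative memory allocators).
export LIBARROW_MINIMAL=false
# Print more verbose output during installation.
export ARROW_R_DEV=true
# Keep the source build logs; the path must be absolute (example location).
export LIBARROW_DEBUG_DIR="/tmp/arrow-build-logs"

# Hypothetical helper: install the R package from the r/ directory of a checkout.
# $1 is the path to that directory, for example "arrow/r".
install_arrow_from_checkout() {
  ( cd "$1" && R CMD INSTALL . )
}
```

Calling `install_arrow_from_checkout arrow/r` from the directory that contains the clone would then run the install with these settings applied.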
-```shell -export LIBARROW_BINARY=true +# Installation using install_arrow() + +The previous instructions are useful for a fresh arrow installation, but arrow +provides the function `install_arrow()`, which you can use if you: + +* already have arrow installed and want to upgrade to a different version +* want to install a development build +* want to try to reinstall and fix issues with Linux C++ binaries + +`install_arrow()` provides some convenience wrappers around the various +environment variables described above. + +Although this function is part of the arrow package, it is also available as +a standalone script, so you can access it for convenience without first installing the package: + +```r +source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R") +``` -to look for C++ binaries prebuilt for your Linux distribution/version. -Alternatively, you can set +## Install the latest release -```shell -export LIBARROW_MINIMAL=false +```r +install_arrow() +``` -to build the Arrow libraries from source with optional features such as compression libraries -enabled. This will increase the build time but provides many useful features. -Prebuilt binaries are built with this flag enabled, so you get the full -functionality by using them as well. +## Install the nightly build -Both of these variables are also set this way if you have the `NOT_CRAN=true` -environment variable set. +```r +install_arrow(nightly = TRUE) +``` -## Helper function: install_arrow() +## Install with more verbose output for debugging errors -If you already have `arrow` installed and want to upgrade to a different version, -install a development build, or try to reinstall and fix issues with Linux -C++ binaries, you can call `install_arrow()`. -`install_arrow()` provides some convenience wrappers around the various -environment variables described below.
-This function is part of the `arrow` package, -and it is also available as a standalone script, so you can -access it for convenience without first installing the package: ```r -source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R") +install_arrow(verbose = TRUE) ``` -`install_arrow()` will install from CRAN, -while `install_arrow(nightly = TRUE)` will give you a development build. `install_arrow()` does not require environment variables to be set in order to satisfy C++ dependencies. -> Note that, unlike packages like `tensorflow`, `blogdown`, and others that require external dependencies, you do not need to run `install_arrow()` after a successful `arrow` installation. +> Note that, unlike packages like `tensorflow`, `blogdown`, and others that require external dependencies, you do not need to run `install_arrow()` after a successful arrow installation. -## Offline installation +# Offline installation The `install-arrow.R` file also includes the `create_package_with_all_dependencies()` function. Normally, when installing on a computer with internet access, the build process will download third-party dependencies as needed. This function provides a way to download them in advance. + Doing so may be useful when installing Arrow on a computer without internet access. Note that Arrow _can_ be installed on a computer without internet access without doing this, but many useful features will be disabled, as they depend on third-party components. @@ -141,14 +347,14 @@ make a source bundle with this function, make sure to set the first repo in `options("repos")` to be a mirror that contains source packages (that is: something other than the RSPM binary mirror URLs). 
-### Using a computer with internet access, pre-download the dependencies: -* Install the `arrow` package _or_ run +### Step 1 - Using a computer with internet access, pre-download the dependencies: +* Install the arrow package _or_ run `source("https://raw.githubusercontent.com/apache/arrow/master/r/R/install-arrow.R")` * Run `create_package_with_all_dependencies("my_arrow_pkg.tar.gz")` * Copy the newly created `my_arrow_pkg.tar.gz` to the computer without internet access -### On the computer without internet access, install the prepared package: -* Install the `arrow` package from the copied file +### Step 2 - On the computer without internet access, install the prepared package: +* Install the arrow package from the copied file * `install.packages("my_arrow_pkg.tar.gz", dependencies = c("Depends", "Imports", "LinkingTo"))` * This installation will build from source, so `cmake` must be available * Run `arrow_info()` to check installed capabilities @@ -157,109 +363,7 @@ something other than the RSPM binary mirror URLs). * Download the dependency files (`cpp/thirdparty/download_dependencies.sh` may be helpful) * Copy the directory of dependencies to the offline computer * Create the environment variable `ARROW_THIRDPARTY_DEPENDENCY_DIR` on the offline computer, pointing to the copied directory. -* Install the `arrow` package as usual. - -## S3 support - -The `arrow` package allows you to work with data in AWS S3 or in other cloud -storage system that emulate S3. However, support for working with S3 is not -enabled in the default build, and it has additional system requirements. To -enable it, set the environment variable `LIBARROW_MINIMAL=false` or -`NOT_CRAN=true` to choose the full-featured build, or more selectively set -`ARROW_S3=ON`. 
You also need the following system dependencies: - -* `gcc` >= 4.9 or `clang` >= 3.3; note that the default compiler on CentOS 7 is gcc 4.8.5, which is not sufficient -* CURL: install `libcurl-devel` (rpm) or `libcurl4-openssl-dev` (deb) -* OpenSSL >= 1.0.2: install `openssl-devel` (rpm) or `libssl-dev` (deb) - -The prebuilt C++ binaries come with S3 support enabled, so you will need to meet -these system requirements in order to use them--the package will not install -without them. If you're building everything from source, the install script -will check for the presence of these dependencies and turn off S3 support in the -build if the prerequisites are not met--installation will succeed but without -S3 functionality. If afterwards you install the missing system requirements, -you'll need to reinstall the package in order to enable S3 support. - -# How dependencies are resolved - -In order for the `arrow` R package to work, it needs the Arrow C++ library. -There are a number of ways you can get it: a system package; a library you've -built yourself outside of the context of installing the R package; -or, if you don't already have it, the R package will attempt to resolve it -automatically when it installs. - -If you are authorized to install system packages and you're installing a CRAN release, -you may want to use the official Apache Arrow release packages corresponding to the R package version (though there are some drawbacks: see "Troubleshooting" below). -See the [Arrow project installation page](https://arrow.apache.org/install/) -to find pre-compiled binary packages for some common Linux distributions, -including Debian, Ubuntu, and CentOS. -You'll need to install `libparquet-dev` on Debian and Ubuntu, or `parquet-devel` on CentOS. -This will also automatically install the Arrow C++ library as a dependency. 
- -When you install the `arrow` R package on Linux, -it will first attempt to find the Arrow C++ libraries on your system using -the `pkg-config` command. -This will find either installed system packages or libraries you've built yourself. -In order for `install.packages("arrow")` to work with these system packages, -you'll need to install them before installing the R package. - -If no Arrow C++ libraries are found on the system, -the R package installation script will next attempt to download -prebuilt static Arrow C++ libraries -that match your both your local operating system and `arrow` R package version. -C++ binaries will only be retrieved if you have set the environment variable -`LIBARROW_BINARY` or `NOT_CRAN`. -If found, they will be downloaded and bundled when your R package compiles. -For a list of supported distributions and versions, -see the [arrow-r-nightly](https://github.com/ursa-labs/arrow-r-nightly/blob/master/README.md) project. - -If no C++ library binary is found, it will attempt to build it locally. -First, it will also look to see if you are in -a checkout of the `apache/arrow` git repository and thus have the C++ source there. -Otherwise, it builds from the C++ files included in the package. -Depending on your system, building Arrow C++ from source may be slow. - -For the specific mechanics of how all this works, see the R package `configure` script, -which calls `tools/nixlibs.R`. - -If the C++ library is built from source, `inst/build_arrow_static.sh` is executed. -This build script is also what is used to generate the prebuilt binaries. - -## How the package is installed - advanced - -This subsection contains information which is likely to be most relevant mostly -to Arrow developers and is not necessary for Arrow users to install Arrow. - -There are a number of scripts that are triggered when `R CMD INSTALL .` is run. -For Arrow users, these should all just work without configuration and pull in -the most complete pieces (e.g. 
official binaries that we host). - -An overview of these scripts is shown below: - -* `configure` and `configure.win` - these scripts are triggered during -`R CMD INSTALL .` on non-Windows and Windows platforms, respectively. They -handle finding the Arrow library, setting up the build variables necessary, and -writing the package Makevars file that is used to compile the C++ code in the R -package. - -* `tools/nixlibs.R` - this script is sometimes called by `configure` on Linux -(or on any non-windows OS with the environment variable -`FORCE_BUNDLED_BUILD=true`). This sets up the build process for our bundled -builds (which is the default on linux). The operative logic is at the end of -the script, but it will do the following (and it will stop with the first one -that succeeds and some of the steps are only checked if they are enabled via an -environment variable): - * Check if there is an already built libarrow in `arrow/r/libarrow-{version}`, - use that to link against if it exists. - * Check if a binary is available from our hosted unofficial builds. - * Download the Arrow source and build the Arrow Library from source. - * `*** Proceed without C++` dependencies (this is an error and the package - will not work, but if you see this message you know the previous steps have - not succeeded/were not enabled) - -* `inst/build_arrow_static.sh` - called by `tools/nixlibs.R` when the Arrow -library is being built. It builds Arrow for a bundled, static build, and -mirrors the steps described in the ["Arrow R Developer Guide" vignette](./developing.html) +* Install the arrow package as usual. # Troubleshooting @@ -279,16 +383,14 @@ See https://arrow.apache.org/docs/r/articles/install.html ``` in the output when the package fails to install, -that means that installation failed to retrieve or build C++ libraries +that means that installation failed to retrieve or build the libarrow version compatible with the current version of the R package. 
-It is expected that C++ dependencies should be built successfully -on all Linux distributions, so you should not see this message. If you do, -please check the "Known installation issues" below to see if any apply. -If none apply, set the environment variable `ARROW_R_DEV=TRUE` -so that details on what failed are shown, and try installing again. Then, +Please check the "Known installation issues" below to see if any apply, and if +none apply, set the environment variable `ARROW_R_DEV=TRUE` for more verbose +output and try installing again. Then, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues) -and include the full verbose installation output. +and include the full installation output. ## Using system libraries @@ -296,39 +398,39 @@ If a system library or other installed Arrow is found but it doesn't match the R (for example, you have libarrow 1.0.0 on your system and are installing R package 2.0.0), it is likely that the R bindings will fail to compile. Because the Apache Arrow project is under active development, -is it essential that versions of the C++ and R libraries match. -When `install.packages("arrow")` has to download the C++ libraries, -the install script ensures that you fetch the C++ libraries that correspond to your R package version. -However, if you are using Arrow libraries already on your system, version match isn't guaranteed. +it is essential that the versions of libarrow and the R package match. +When `install.packages("arrow")` has to download libarrow, +the install script ensures that you fetch the libarrow version that corresponds to your R package version. +However, if you are using a version of libarrow already on your system, version match isn't guaranteed.
-To fix version mismatch, you can either update your system packages to match the R package version, +To fix version mismatch, you can either update your libarrow system packages to match the R package version, or set the environment variable `ARROW_USE_PKG_CONFIG=FALSE` -to tell the configure script not to look for system Arrow packages. +to tell the configure script not to look for a system version of libarrow. (The latter is the default of `install_arrow()`.) -System packages are available corresponding to all CRAN releases +System libarrow versions are available corresponding to all CRAN releases but not for nightly or dev versions, so depending on the R package version you're installing, -system packages may not be an option. +a system libarrow version may not be an option. Note also that once you have a working R package installation based on system (shared) libraries, -if you update your system Arrow, you'll need to reinstall the R package to match its version. -Similarly, if you're using Arrow system libraries, running `update.packages()` -after a new release of the `arrow` package will likely fail unless you first -update the system packages. +if you update your system libarrow installation, you'll need to reinstall the R package to match its version. +Similarly, if you're using libarrow system libraries, running `update.packages()` +after a new release of the arrow package will likely fail unless you first +update the libarrow system packages. ## Using prebuilt binaries -If the R package finds and downloads a prebuilt binary of the C++ library, -but then the `arrow` package can't be loaded, perhaps with "undefined symbols" errors, +If the R package finds and downloads a prebuilt binary of libarrow, +but then the arrow package can't be loaded, perhaps with "undefined symbols" errors, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues).
This is likely a compiler mismatch and may be resolvable by setting some -environment variables to instruct R to compile the packages to match the C++ library. +environment variables to instruct R to compile the packages to match libarrow. A workaround would be to set the environment variable `LIBARROW_BINARY=FALSE` -and retry installation: this value instructs the package to build the C++ library from source +and retry installation: this value instructs the package to build libarrow from source instead of downloading the prebuilt binary. That should guarantee that the compiler settings match. -If a prebuilt binary wasn't found for your operating system but you think it should have been, +If a prebuilt libarrow binary wasn't found for your operating system but you think it should have been, check the logs for a message that says `*** Unable to identify current OS/version`, or a message that says `*** No C++ binaries found for` an invalid OS. If you see either, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues). @@ -347,12 +449,12 @@ This table is checked during the installation process and tells the script to use binaries built on a different operating system/version because they're known to work. -## Building C++ from source +## Building libarrow from source -If building the C++ library from source fails, check the error message. +If building libarrow from source fails, check the error message. (If you don't see an error message, only the `----- NOTE -----`, set the environment variable `ARROW_R_DEV=TRUE` to increase verbosity and retry installation.) -The install script should work everywhere, so if the C++ library fails to compile, +The install script should work everywhere, so if libarrow fails to compile, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues) so that we can improve the script. 
@@ -364,80 +466,21 @@ For CentOS 7 and above, both the Arrow system packages and the C++ binaries for R are built with the default system compilers. If you want to use either of these and you have a `devtoolset` installed, set `CC=/usr/bin/gcc CXX=/usr/bin/g++` to use the system compilers instead of the `devtoolset`. -Alternatively, if you want to build `arrow` with the newer `devtoolset` compilers, +Alternatively, if you want to build arrow with the newer `devtoolset` compilers, set both `ARROW_USE_PKG_CONFIG` and `LIBARROW_BINARY` to `false` so that you build the Arrow C++ from source using those compilers. Compiler mismatch between the arrow system libraries and the R -package may cause R to segfault when `arrow` package functions are used. +package may cause R to segfault when arrow package functions are used. See discussions [here](https://issues.apache.org/jira/browse/ARROW-8586) and [here](https://issues.apache.org/jira/browse/ARROW-10780). * If you have multiple versions of `zstd` installed on your system, -installation by building the C++ from source may fail with an undefined symbols +installation by building libarrow from source may fail with an "undefined symbols" error. Workarounds include (1) setting `LIBARROW_BINARY` to use a C++ binary; (2) setting `ARROW_WITH_ZSTD=OFF` to build without `zstd`; or (3) uninstalling the conflicting `zstd`. See discussion [here](https://issues.apache.org/jira/browse/ARROW-8556). -## Summary of build environment variables - -Some features are optional when you build Arrow from source. With the exception of `ARROW_S3`, these are all `ON` by default in the bundled C++ build, but you can set them to `OFF` to disable them. 
- -* `ARROW_S3`: If set to `ON` S3 support will be built as long as the - dependencies are met; if they are not met, the build script will turn this `OFF` -* `ARROW_JEMALLOC` for the `jemalloc` memory allocator -* `ARROW_MIMALLOC` for the `mimalloc` memmory allocator -* `ARROW_PARQUET` -* `ARROW_DATASET` -* `ARROW_JSON` for the JSON parsing library -* `ARROW_WITH_RE2` for the RE2 regular expression library, used in some string compute functions -* `ARROW_WITH_UTF8PROC` for the UTF8Proc string library, used in many other string compute functions -* `ARROW_JSON` for JSON parsing -* `ARROW_WITH_BROTLI`, `ARROW_WITH_BZ2`, `ARROW_WITH_LZ4`, `ARROW_WITH_SNAPPY`, `ARROW_WITH_ZLIB`, and `ARROW_WITH_ZSTD` for various compression algorithms - - -There are a number of other variables that affect the `configure` script and the bundled build script. -By default, these are all unset. All boolean variables are case-insensitive. - -* `ARROW_USE_PKG_CONFIG`: If set to `false`, the configure script - won't look for Arrow libraries on your system and instead will look to download/build them. - Use this if you have a version mismatch between installed system libraries - and the version of the R package you're installing. -* `LIBARROW_BINARY`: If set to `true`, the script will try to download a binary - C++ library built for your operating system. - You may also set it to some other string, - a related "distro-version" that has binaries built that work for your OS. - If no binary is found, installation will fall back to building C++ - dependencies from source. -* `LIBARROW_BUILD`: If set to `false`, the build script - will not attempt to build the C++ from source. This means you will only get - a working `arrow` R package if a prebuilt binary is found. - Use this if you want to avoid compiling the C++ library, which may be slow - and resource-intensive, and ensure that you only use a prebuilt binary. 
-* `LIBARROW_MINIMAL`: If set to `false`, the build script - will enable some optional features, including compression libraries, S3 - support, and additional alternative memory allocators. This will increase the - source build time but results in a more fully functional library. -* `NOT_CRAN`: If this variable is set to `true`, as the `devtools` package does, - the build script will set `LIBARROW_BINARY=true` and `LIBARROW_MINIMAL=false` - unless those environment variables are already set. This provides for a more - complete and fast installation experience for users who already have - `NOT_CRAN=true` as part of their workflow, without requiring additional - environment variables to be set. -* `ARROW_R_DEV`: If set to `true`, more verbose messaging will be printed - in the build script. `arrow::install_arrow(verbose = TRUE)` sets this. - This variable also is needed if you're modifying C++ - code in the package: see the developer guide vignette. -* `LIBARROW_DEBUG_DIR`: If the C++ library building from source fails (`cmake`), - there may be messages telling you to check some log file in the build directory. - However, when the library is built during R package installation, - that location is in a temp directory that is already deleted. - To capture those logs, set this variable to an absolute (not relative) path - and the log files will be copied there. - The directory will be created if it does not exist. -* `CMAKE`: When building the C++ library from source, you can specify a - `/path/to/cmake` to use a different version than whatever is found on the `$PATH` - # Contributing As mentioned above, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues) @@ -448,7 +491,7 @@ Docker images should be minimal, containing only R and the dependencies it requires. (For reference, see the images that [R-hub](https://github.com/r-hub/rhub-linux-builders) uses.) 
-You can test the `arrow` R package installation using the `docker-compose` +You can test the arrow R package installation using the `docker-compose` setup included in the `apache/arrow` git repository. For example, ``` @@ -456,6 +499,6 @@ R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose build r R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose run r ``` -installs the `arrow` R package, including the C++ source build, on the +installs the arrow R package, including libarrow, on the [rhub/ubuntu-gcc-release](https://hub.docker.com/r/rhub/ubuntu-gcc-release) image. diff --git a/r/vignettes/r_binary_libarrow_binary.png b/r/vignettes/r_binary_libarrow_binary.png new file mode 100644 index 00000000000..d1e968e8acf Binary files /dev/null and b/r/vignettes/r_binary_libarrow_binary.png differ diff --git a/r/vignettes/r_source_libarrow_binary.png b/r/vignettes/r_source_libarrow_binary.png new file mode 100644 index 00000000000..3017b7d6f50 Binary files /dev/null and b/r/vignettes/r_source_libarrow_binary.png differ diff --git a/r/vignettes/r_source_libarrow_source.png b/r/vignettes/r_source_libarrow_source.png new file mode 100644 index 00000000000..47d7b3fa1d6 Binary files /dev/null and b/r/vignettes/r_source_libarrow_source.png differ
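The troubleshooting text in the diff above names three environment variables for working around a version or compiler mismatch: `ARROW_USE_PKG_CONFIG=false` (ignore a system libarrow), `LIBARROW_BINARY=false` (build libarrow from source instead of downloading a prebuilt binary), and `ARROW_R_DEV=true` (more verbose output). A minimal sketch of gathering them in one place might look like this; the function name `arrow_source_build_env` is hypothetical and not part of the arrow package or its install scripts.

```shell
#!/bin/sh
# Sketch only: collect the environment variables that the troubleshooting
# text suggests for forcing a from-source libarrow build.
# The function name is hypothetical, not part of the arrow package.
arrow_source_build_env() {
  # Do not look for a system libarrow via pkg-config.
  printf '%s\n' "ARROW_USE_PKG_CONFIG=false"
  # Do not download a prebuilt libarrow binary; build from source instead.
  printf '%s\n' "LIBARROW_BINARY=false"
  # Ask the install scripts for more verbose output.
  printf '%s\n' "ARROW_R_DEV=true"
}

# Print the settings so they can be inspected before use.
arrow_source_build_env
```

Assuming a checkout of the R package directory, the printed settings could then be passed to a single installation run with something like `env $(arrow_source_build_env) R CMD INSTALL .`, which leaves the surrounding shell environment unchanged.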