From 06c06e2df2c17e39dccf2986bd801f52dd3903c1 Mon Sep 17 00:00:00 2001 From: Neal Richardson Date: Thu, 25 Jul 2019 12:25:41 -0700 Subject: [PATCH 1/6] First draft of R package release announcement --- site/_posts/2019-08-01-r-cran-release.md | 87 ++++++++++++++++++++++++ 1 file changed, 87 insertions(+) create mode 100644 site/_posts/2019-08-01-r-cran-release.md diff --git a/site/_posts/2019-08-01-r-cran-release.md b/site/_posts/2019-08-01-r-cran-release.md new file mode 100644 index 00000000000..4a9b2c77919 --- /dev/null +++ b/site/_posts/2019-08-01-r-cran-release.md @@ -0,0 +1,87 @@ +--- +layout: post +title: "Apache Arrow R Package CRAN Release" +date: "2019-08-01 00:00:00 -0600" +author: npr +categories: [application] +--- + + +We are very excited to announce that the `arrow` R package is now available on CRAN. + +[Apache Arrow](https://arrow.apache.org/) is a cross-language development platform for in-memory data that specifies a standardized columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. The `arrow` package provides an R interface to the Arrow C++ library, including support for working with Parquet and Feather files, as well as lower-level access to Arrow memory and messages. + +You can install the package from CRAN with + +```r +install.packages("arrow") +``` + +On macOS and Windows, installing a binary package from CRAN will handle Arrow’s C++ dependencies for you. On Linux, you’ll need to first install the C++ library. See the [Arrow project installation page](https://arrow.apache.org/install/) for a list of PPAs from which you can obtain it. If you install the `arrow` package from source and the C++ library is not found, the R package functions will notify you that Arrow is not available. Call + +```r +arrow::install_arrow() +``` + +for version- and platform-specific guidance on installing the Arrow C++ +library. 
+ +## Parquet files + +This release introduces read and write support for the [Parquet](https://parquet.apache.org/) columnar data file format. Prior to this release, options for accessing Parquet data in R were limited; the most common recommendation was to use Spark. The `arrow` package greatly simplifies this access and lets you go from a Parquet file to a `data.frame` and back easily, without having to set up a database. + +```r +library(arrow) +df <- read_parquet("path/to/file.parquet") +``` + +This function, along with the other readers in the package, takes an optional `col_select` argument, inspired by the [`vroom`](https://vroom.r-lib.org/reference/vroom.html) package. This argument lets you use the ["tidyselect" helper functions](https://tidyselect.r-lib.org/reference/select_helpers.html), as you can do in `dplyr::select()`, to specify that you only want to keep certain columns. By narrowing your selection at read time, you can load a `data.frame` with less memory overhead. + +For example, suppose you had written the `iris` dataset to Parquet. You could read a `data.frame` with only the columns `c("Sepal.Length", "Sepal.Width")` by doing + +```r +df <- read_parquet("iris.parquet", col_select = starts_with("Sepal")) +``` + +Just as you can read, you can write Parquet files: + +```r +write_parquet(df, "path/to/different_file.parquet") +``` + +## Feather files + +This release also includes full support for the Feather file format, providing `read_feather()` and `write_feather()`. [Feather](https://github.com/wesm/feather) was one of the initial products coming out of the Arrow project, providing an efficient, common file format language-agnostic data frame storage, along with implementations in R and Python. + +As Arrow progressed, development of Feather moved to the [`apache/arrow`](https://github.com/apache/arrow) project, and for the last two years, the Python implementation of Feather has just been a wrapper around `pyarrow`. 
This meant that as Arrow progressed and bugs were fixed, the Python version of Feather got the improvements but sadly R did not. + +With this release, the R implementation of Feather catches up and now depends on the same underlying C++ library as the Python version does. This should result in more reliable and consistent behavior across the two languages. + +We encourage all R users of `feather` to switch to using `arrow::read_feather()` and `arrow::write_feather()`. + +Note that both Feather and Parquet are columnar data formats that allow sharing data frames across R, Pandas, and other tools. When should you use Feather and when should you use Parquet? We currently recommend Parquet for long-term storage, as well as for cases where the size on disk matters because Parquet supports various compression formats. Feather, on the other hand, may be faster to read in because it matches the in-memory format and doesn't require deserialization, and it also allows for memory mapping so that you can access data that is larger than can fit into memory. See the [Arrow project FAQ](https://arrow.apache.org/faq/) for more. + +## Other capabilities + +In addition to these readers and writers, the `arrow` package has wrappers for other readers in the C++ library; see `?read_csv_arrow` and `?read_json_arrow`. It also provides many lower-level bindings to the C++ library, which enable you to access and manipulate Arrow objects. You can use these to build connectors to other applications and services that use Arrow. One example is Spark: the [`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to move data to and from Spark, yielding [significant performance gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/). + +## Acknowledgements + +In addition to the work on wiring the R package up to the Arrow Parquet C++ library, a lot of effort went into building and packaging Arrow for R users, ensuring its ease of installation across platforms. 
We'd like to thank the support of Jeroen Ooms, Javier Luraschi, JJ Allaire, Davis Vaughan, the CRAN team, and many others in the Apache Arrow community for helping us get to this point. From ddb1857f15b22f7331392ee8f88e0cb118115ade Mon Sep 17 00:00:00 2001 From: Neal Richardson Date: Fri, 26 Jul 2019 09:08:49 -0700 Subject: [PATCH 2/6] Add self to contributors.yml; remove thoughtcrime from post title --- site/_data/contributors.yml | 3 +++ site/_posts/2019-08-01-r-cran-release.md | 2 +- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/site/_data/contributors.yml b/site/_data/contributors.yml index 3d86a48f4ea..e70d9afad1f 100644 --- a/site/_data/contributors.yml +++ b/site/_data/contributors.yml @@ -46,4 +46,7 @@ apacheId: agrove githubId: andygrove role: PMC +- name: Neal Richardson + apacheId: npr # Not a real apacheId + githubId: nealrichardson # End contributors.yml diff --git a/site/_posts/2019-08-01-r-cran-release.md b/site/_posts/2019-08-01-r-cran-release.md index 4a9b2c77919..ab92c366250 100644 --- a/site/_posts/2019-08-01-r-cran-release.md +++ b/site/_posts/2019-08-01-r-cran-release.md @@ -1,6 +1,6 @@ --- layout: post -title: "Apache Arrow R Package CRAN Release" +title: "Apache Arrow R Package On CRAN" date: "2019-08-01 00:00:00 -0600" author: npr categories: [application] From c5dd6fad440b8d43bf26ec7f980d1c9b59361a62 Mon Sep 17 00:00:00 2001 From: Neal Richardson Date: Wed, 31 Jul 2019 13:48:44 -0700 Subject: [PATCH 3/6] Incorporate Wes's revisions --- r/README.Rmd | 2 +- r/README.md | 7 ++++--- ...ase.md => 2019-08-01-r-package-on-cran.md} | 20 ++++++++++++------- 3 files changed, 18 insertions(+), 11 deletions(-) rename site/_posts/{2019-08-01-r-cran-release.md => 2019-08-01-r-package-on-cran.md} (57%) diff --git a/r/README.Rmd b/r/README.Rmd index 586d4bc0497..e66e30205db 100644 --- a/r/README.Rmd +++ b/r/README.Rmd @@ -30,7 +30,7 @@ Install the latest release of `arrow` from CRAN with install.packages("arrow") ``` -On macOS and Windows, 
installing a binary package from CRAN will handle Arrow's C++ dependencies for you. On Linux, you'll need to first install the C++ library. See the [Arrow project installation page](https://arrow.apache.org/install/) for a list of PPAs from which you can obtain it. +On macOS and Windows, installing a binary package from CRAN will handle Arrow's C++ dependencies for you. On Linux, you'll need to first install the C++ library. See the [Arrow project installation page](https://arrow.apache.org/install/) to find pre-compiled binary packages for some common Linux distributions, such as Debian, Ubuntu, CentOS, and Fedora. Other Linux distributions must install the C++ library from source. If you install the `arrow` package from source and the C++ library is not found, the R package functions will notify you that Arrow is not available. Call diff --git a/r/README.md b/r/README.md index c2754bb3aa1..1d4ddcd96c4 100644 --- a/r/README.md +++ b/r/README.md @@ -30,8 +30,9 @@ install.packages("arrow") On macOS and Windows, installing a binary package from CRAN will handle Arrow’s C++ dependencies for you. On Linux, you’ll need to first install the C++ library. See the [Arrow project installation -page](https://arrow.apache.org/install/) for a list of PPAs from which -you can obtain it. +page](https://arrow.apache.org/install/) to find pre-compiled binary packages +for some common Linux distributions, such as Debian, Ubuntu, CentOS, and +Fedora. Other Linux distributions must install the C++ library from source. 
If you install the `arrow` package from source and the C++ library is not found, the R package functions will notify you that Arrow is not @@ -57,7 +58,7 @@ set.seed(24) tab <- arrow::table(x = 1:10, y = rnorm(10)) tab$schema -#> arrow::Schema +#> arrow::Schema #> x: int32 #> y: double tab diff --git a/site/_posts/2019-08-01-r-cran-release.md b/site/_posts/2019-08-01-r-package-on-cran.md similarity index 57% rename from site/_posts/2019-08-01-r-cran-release.md rename to site/_posts/2019-08-01-r-package-on-cran.md index ab92c366250..fa54f83daf3 100644 --- a/site/_posts/2019-08-01-r-cran-release.md +++ b/site/_posts/2019-08-01-r-package-on-cran.md @@ -24,7 +24,7 @@ limitations under the License. {% endcomment %} --> -We are very excited to announce that the `arrow` R package is now available on CRAN. +We are very excited to announce that the `arrow` R package is now available on [CRAN](https://cran.r-project.org/). [Apache Arrow](https://arrow.apache.org/) is a cross-language development platform for in-memory data that specifies a standardized columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. The `arrow` package provides an R interface to the Arrow C++ library, including support for working with Parquet and Feather files, as well as lower-level access to Arrow memory and messages. @@ -34,7 +34,9 @@ You can install the package from CRAN with install.packages("arrow") ``` -On macOS and Windows, installing a binary package from CRAN will handle Arrow’s C++ dependencies for you. On Linux, you’ll need to first install the C++ library. See the [Arrow project installation page](https://arrow.apache.org/install/) for a list of PPAs from which you can obtain it. If you install the `arrow` package from source and the C++ library is not found, the R package functions will notify you that Arrow is not available. 
Call +On macOS and Windows, installing a binary package from CRAN will handle Arrow’s C++ dependencies for you. On Linux, you’ll need to first install the C++ library. See the [Arrow project installation page](https://arrow.apache.org/install/) to find pre-compiled binary packages for some common Linux distributions, such as Debian, Ubuntu, CentOS, and Fedora. Other Linux distributions must install the C++ library from source. + +If you install the `arrow` R package from source and the C++ library is not found, the R package functions will notify you that Arrow is not available. Call ```r arrow::install_arrow() @@ -45,7 +47,7 @@ library. ## Parquet files -This release introduces read and write support for the [Parquet](https://parquet.apache.org/) columnar data file format. Prior to this release, options for accessing Parquet data in R were limited; the most common recommendation was to use Spark. The `arrow` package greatly simplifies this access and lets you go from a Parquet file to a `data.frame` and back easily, without having to set up a database. +This release introduces basic read and write support for the [Apache Parquet](https://parquet.apache.org/) columnar data file format. Prior to this release, options for accessing Parquet data in R were limited; the most common recommendation was to use Apache Spark. The `arrow` package greatly simplifies this access and lets you go from a Parquet file to a `data.frame` and back easily, without having to set up a database. ```r library(arrow) @@ -66,21 +68,25 @@ Just as you can read, you can write Parquet files: write_parquet(df, "path/to/different_file.parquet") ``` +Note that this read and write support for Parquet files in R is in its early stages of development. The Python Arrow library ([pyarrow](https://arrow.apache.org/docs/python/)) still has much richer support for Parquet files, including working with multi-file datasets. In the coming months, we hope to bring the R package towards feature equivalency. 
+ ## Feather files -This release also includes full support for the Feather file format, providing `read_feather()` and `write_feather()`. [Feather](https://github.com/wesm/feather) was one of the initial products coming out of the Arrow project, providing an efficient, common file format language-agnostic data frame storage, along with implementations in R and Python. +This release also includes a much faster and robust implementation of the Feather file format, providing `read_feather()` and `write_feather()`. [Feather](https://github.com/wesm/feather) was one of the initial applications of Apache Arrow for Python and R, providing an efficient, common file format language-agnostic data frame storage, along with implementations in R and Python. As Arrow progressed, development of Feather moved to the [`apache/arrow`](https://github.com/apache/arrow) project, and for the last two years, the Python implementation of Feather has just been a wrapper around `pyarrow`. This meant that as Arrow progressed and bugs were fixed, the Python version of Feather got the improvements but sadly R did not. -With this release, the R implementation of Feather catches up and now depends on the same underlying C++ library as the Python version does. This should result in more reliable and consistent behavior across the two languages. +With this release, the R implementation of Feather catches up and now depends on the same underlying C++ library as the Python version does. This should result in more reliable and consistent behavior across the two languages, as well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/). We encourage all R users of `feather` to switch to using `arrow::read_feather()` and `arrow::write_feather()`. -Note that both Feather and Parquet are columnar data formats that allow sharing data frames across R, Pandas, and other tools. When should you use Feather and when should you use Parquet? 
We currently recommend Parquet for long-term storage, as well as for cases where the size on disk matters because Parquet supports various compression formats. Feather, on the other hand, may be faster to read in because it matches the in-memory format and doesn't require deserialization, and it also allows for memory mapping so that you can access data that is larger than can fit into memory. See the [Arrow project FAQ](https://arrow.apache.org/faq/) for more. +Note that both Feather and Parquet are columnar data formats that allow sharing data frames across R, Pandas, and other tools. When should you use Feather and when should you use Parquet? Parquet is optimized to create small files and as a result can be more expensive to read locally, but it performs very well with remote storage like HDFS or Amazon S3. Feather is designed for fast local reads, particularly with solid-state drives, and is not intended for use with remote storage systems. Feather files can be memory-mapped and read in Arrow format without any deserialization while Parquet files always must be decompressed and decoded. See the [Arrow project FAQ](https://arrow.apache.org/faq/) for more. ## Other capabilities -In addition to these readers and writers, the `arrow` package has wrappers for other readers in the C++ library; see `?read_csv_arrow` and `?read_json_arrow`. It also provides many lower-level bindings to the C++ library, which enable you to access and manipulate Arrow objects. You can use these to build connectors to other applications and services that use Arrow. One example is Spark: the [`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to move data to and from Spark, yielding [significant performance gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/). +In addition to these readers and writers, the `arrow` package has wrappers for other readers in the C++ library; see `?read_csv_arrow` and `?read_json_arrow`. 
These readers are being developed to optimize for the memory layout of the Arrow columnar format and are not intended as a direct replacement for existing R CSV readers (`base::read.csv`, `readr::read_csv`, `data.table::fread`) that return an R `data.frame`. + +It also provides many lower-level bindings to the C++ library, which enable you to access and manipulate Arrow objects. You can use these to build connectors to other applications and services that use Arrow. One example is Spark: the [`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to move data to and from Spark, yielding [significant performance gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/). ## Acknowledgements From fe98d6a54d99f5081475afcb0f76f977f37c536d Mon Sep 17 00:00:00 2001 From: Neal Richardson Date: Thu, 8 Aug 2019 08:53:18 -0700 Subject: [PATCH 4/6] Add macOS R installation warning --- r/README.Rmd | 2 +- site/_posts/2019-08-01-r-package-on-cran.md | 6 +++++- 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/r/README.Rmd b/r/README.Rmd index 8ef3c6052da..2dd1eb55096 100644 --- a/r/README.Rmd +++ b/r/README.Rmd @@ -36,7 +36,7 @@ install.packages("arrow") > `install.packages("arrow", type = "source")`. We hope to have this resolved > in the next release. -On macOS and Windows, installing a binary package from CRAN will handle Arrow's C++ dependencies for you. On Linux, you'll need to first install the C++ library. See the [Arrow project installation page](https://arrow.apache.org/install/) to find pre-compiled binary packages for some common Linux distributions, including Debian, Ubuntu, and CentOS. You'll need to install `libparquet-dev` on Debian and Ubuntu, or `parquet-devel` on CentOS. This will also automatically install the Arrow C++ library as a dependency. +On macOS and Windows, installing a binary package from CRAN will handle Arrow's C++ dependencies for you. On Linux, you'll need to first install the C++ library. 
See the [Arrow project installation page](https://arrow.apache.org/install/) to find pre-compiled binary packages for some common Linux distributions, including Debian, Ubuntu, and CentOS. You'll need to install `libparquet-dev` on Debian and Ubuntu, or `parquet-devel` on CentOS. This will also automatically install the Arrow C++ library as a dependency. Other Linux distributions must install the C++ library from source. If you install the `arrow` package from source and the C++ library is not found, the R package functions will notify you that Arrow is not available. Call diff --git a/site/_posts/2019-08-01-r-package-on-cran.md b/site/_posts/2019-08-01-r-package-on-cran.md index fa54f83daf3..1f2910ae342 100644 --- a/site/_posts/2019-08-01-r-package-on-cran.md +++ b/site/_posts/2019-08-01-r-package-on-cran.md @@ -34,7 +34,11 @@ You can install the package from CRAN with install.packages("arrow") ``` -On macOS and Windows, installing a binary package from CRAN will handle Arrow’s C++ dependencies for you. On Linux, you’ll need to first install the C++ library. See the [Arrow project installation page](https://arrow.apache.org/install/) to find pre-compiled binary packages for some common Linux distributions, such as Debian, Ubuntu, CentOS, and Fedora. Other Linux distributions must install the C++ library from source. +On macOS and Windows, installing a binary package from CRAN will generally handle Arrow’s C++ dependencies for you. However, the macOS CRAN binaries are unfortunately incomplete for this version, so to install 0.14.1, you’ll first need to use Homebrew to get the Arrow C++ library (`brew install apache-arrow`), and then from R you can `install.packages("arrow", type = "source")`. + +Windows binaries are not yet available on CRAN but should be published soon. + +On Linux, you’ll need to first install the C++ library. 
See the [Arrow project installation page](https://arrow.apache.org/install/) to find pre-compiled binary packages for some common Linux distributions, including Debian, Ubuntu, and CentOS. You'll need to install `libparquet-dev` on Debian and Ubuntu, or `parquet-devel` on CentOS. This will also automatically install the Arrow C++ library as a dependency. Other Linux distributions must install the C++ library from source. If you install the `arrow` R package from source and the C++ library is not found, the R package functions will notify you that Arrow is not available. Call From 3b06bb43af189094b7b65038380088cfa5826106 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 8 Aug 2019 12:10:19 -0500 Subject: [PATCH 5/6] Update date, small language tweaks --- site/_posts/2019-08-01-r-package-on-cran.md | 120 ++++++++++++++++---- 1 file changed, 98 insertions(+), 22 deletions(-) diff --git a/site/_posts/2019-08-01-r-package-on-cran.md b/site/_posts/2019-08-01-r-package-on-cran.md index 1f2910ae342..d3d817266fe 100644 --- a/site/_posts/2019-08-01-r-package-on-cran.md +++ b/site/_posts/2019-08-01-r-package-on-cran.md @@ -1,7 +1,7 @@ --- layout: post title: "Apache Arrow R Package On CRAN" -date: "2019-08-01 00:00:00 -0600" +date: "2019-08-08 06:00:00 -0600" author: npr categories: [application] --- @@ -24,9 +24,15 @@ limitations under the License. {% endcomment %} --> -We are very excited to announce that the `arrow` R package is now available on [CRAN](https://cran.r-project.org/). +We are very excited to announce that the `arrow` R package is now available on +[CRAN](https://cran.r-project.org/). -[Apache Arrow](https://arrow.apache.org/) is a cross-language development platform for in-memory data that specifies a standardized columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. 
The `arrow` package provides an R interface to the Arrow C++ library, including support for working with Parquet and Feather files, as well as lower-level access to Arrow memory and messages. +[Apache Arrow](https://arrow.apache.org/) is a cross-language development +platform for in-memory data that specifies a standardized columnar memory +format for flat and hierarchical data, organized for efficient analytic +operations on modern hardware. The `arrow` package provides an R interface to +the Arrow C++ library, including support for working with Parquet and Feather +files, as well as lower-level access to Arrow memory and messages. You can install the package from CRAN with @@ -34,13 +40,26 @@ You can install the package from CRAN with install.packages("arrow") ``` -On macOS and Windows, installing a binary package from CRAN will generally handle Arrow’s C++ dependencies for you. However, the macOS CRAN binaries are unfortunately incomplete for this version, so to install 0.14.1, you’ll first need to use Homebrew to get the Arrow C++ library (`brew install apache-arrow`), and then from R you can `install.packages("arrow", type = "source")`. +On macOS and Windows, installing a binary package from CRAN will generally +handle Arrow's C++ dependencies for you. However, the macOS CRAN binaries are +unfortunately incomplete for this version, so to install 0.14.1, you'll first +need to use Homebrew to get the Arrow C++ library (`brew install +apache-arrow`), and then from R you can `install.packages("arrow", type = +"source")`. Windows binaries are not yet available on CRAN but should be published soon. -On Linux, you’ll need to first install the C++ library. See the [Arrow project installation page](https://arrow.apache.org/install/) to find pre-compiled binary packages for some common Linux distributions, including Debian, Ubuntu, and CentOS. You'll need to install `libparquet-dev` on Debian and Ubuntu, or `parquet-devel` on CentOS. 
This will also automatically install the Arrow C++ library as a dependency. Other Linux distributions must install the C++ library from source. +On Linux, you'll need to first install the C++ library. See the [Arrow project +installation page](https://arrow.apache.org/install/) to find pre-compiled +binary packages for some common Linux distributions, including Debian, Ubuntu, +and CentOS. You'll need to install `libparquet-dev` on Debian and Ubuntu, or +`parquet-devel` on CentOS. This will also automatically install the Arrow C++ +library as a dependency. Other Linux distributions must install the C++ library +from source. -If you install the `arrow` R package from source and the C++ library is not found, the R package functions will notify you that Arrow is not available. Call +If you install the `arrow` R package from source and the C++ library is not +found, the R package functions will notify you that Arrow is not +available. Call ```r arrow::install_arrow() @@ -51,16 +70,30 @@ library. ## Parquet files -This release introduces basic read and write support for the [Apache Parquet](https://parquet.apache.org/) columnar data file format. Prior to this release, options for accessing Parquet data in R were limited; the most common recommendation was to use Apache Spark. The `arrow` package greatly simplifies this access and lets you go from a Parquet file to a `data.frame` and back easily, without having to set up a database. +This release introduces basic read and write support for the [Apache +Parquet](https://parquet.apache.org/) columnar data file format. Prior to this +release, options for accessing Parquet data in R were limited; the most common +recommendation was to use Apache Spark. The `arrow` package greatly simplifies +this access and lets you go from a Parquet file to a `data.frame` and back +easily, without having to set up a database. 
```r library(arrow) df <- read_parquet("path/to/file.parquet") ``` -This function, along with the other readers in the package, takes an optional `col_select` argument, inspired by the [`vroom`](https://vroom.r-lib.org/reference/vroom.html) package. This argument lets you use the ["tidyselect" helper functions](https://tidyselect.r-lib.org/reference/select_helpers.html), as you can do in `dplyr::select()`, to specify that you only want to keep certain columns. By narrowing your selection at read time, you can load a `data.frame` with less memory overhead. +This function, along with the other readers in the package, takes an optional +`col_select` argument, inspired by the +[`vroom`](https://vroom.r-lib.org/reference/vroom.html) package. This argument +lets you use the ["tidyselect" helper +functions](https://tidyselect.r-lib.org/reference/select_helpers.html), as you +can do in `dplyr::select()`, to specify that you only want to keep certain +columns. By narrowing your selection at read time, you can load a `data.frame` +with less memory overhead. -For example, suppose you had written the `iris` dataset to Parquet. You could read a `data.frame` with only the columns `c("Sepal.Length", "Sepal.Width")` by doing +For example, suppose you had written the `iris` dataset to Parquet. You could +read a `data.frame` with only the columns `c("Sepal.Length", "Sepal.Width")` by +doing ```r df <- read_parquet("iris.parquet", col_select = starts_with("Sepal")) @@ -72,26 +105,69 @@ Just as you can read, you can write Parquet files: write_parquet(df, "path/to/different_file.parquet") ``` -Note that this read and write support for Parquet files in R is in its early stages of development. The Python Arrow library ([pyarrow](https://arrow.apache.org/docs/python/)) still has much richer support for Parquet files, including working with multi-file datasets. In the coming months, we hope to bring the R package towards feature equivalency. 
+Note that this read and write support for Parquet files in R is in its early +stages of development. The Python Arrow library +([pyarrow](https://arrow.apache.org/docs/python/)) still has much richer +support for Parquet files, including working with multi-file datasets. We +intend to reach feature equivalency between the R and Python packages in the +future. ## Feather files -This release also includes a much faster and robust implementation of the Feather file format, providing `read_feather()` and `write_feather()`. [Feather](https://github.com/wesm/feather) was one of the initial applications of Apache Arrow for Python and R, providing an efficient, common file format language-agnostic data frame storage, along with implementations in R and Python. - -As Arrow progressed, development of Feather moved to the [`apache/arrow`](https://github.com/apache/arrow) project, and for the last two years, the Python implementation of Feather has just been a wrapper around `pyarrow`. This meant that as Arrow progressed and bugs were fixed, the Python version of Feather got the improvements but sadly R did not. - -With this release, the R implementation of Feather catches up and now depends on the same underlying C++ library as the Python version does. This should result in more reliable and consistent behavior across the two languages, as well as [improved performance](https://wesmckinney.com/blog/feather-arrow-future/). - -We encourage all R users of `feather` to switch to using `arrow::read_feather()` and `arrow::write_feather()`. - -Note that both Feather and Parquet are columnar data formats that allow sharing data frames across R, Pandas, and other tools. When should you use Feather and when should you use Parquet? Parquet is optimized to create small files and as a result can be more expensive to read locally, but it performs very well with remote storage like HDFS or Amazon S3. 
Feather is designed for fast local reads, particularly with solid-state drives, and is not intended for use with remote storage systems. Feather files can be memory-mapped and read in Arrow format without any deserialization while Parquet files always must be decompressed and decoded. See the [Arrow project FAQ](https://arrow.apache.org/faq/) for more. +This release also includes a faster and more robust implementation of the +Feather file format, providing `read_feather()` and +`write_feather()`. [Feather](https://github.com/wesm/feather) was one of the +initial applications of Apache Arrow for Python and R, providing an efficient, +common file format language-agnostic data frame storage, along with +implementations in R and Python. + +As Arrow progressed, development of Feather moved to the +[`apache/arrow`](https://github.com/apache/arrow) project, and for the last two +years, the Python implementation of Feather has just been a wrapper around +`pyarrow`. This meant that as Arrow progressed and bugs were fixed, the Python +version of Feather got the improvements but sadly R did not. + +With this release, the R implementation of Feather catches up and now depends +on the same underlying C++ library as the Python version does. This should +result in more reliable and consistent behavior across the two languages, as +well as [improved +performance](https://wesmckinney.com/blog/feather-arrow-future/). + +We encourage all R users of `feather` to switch to using +`arrow::read_feather()` and `arrow::write_feather()`. + +Note that both Feather and Parquet are columnar data formats that allow sharing +data frames across R, Pandas, and other tools. When should you use Feather and +when should you use Parquet? Parquet balances space-efficiency with +deserialization costs, making it an ideal choice for remote storage systems +like HDFS or Amazon S3. 
Feather is designed for fast local reads, particularly +with solid-state drives, and is not intended for use with remote storage +systems. Feather files can be memory-mapped and accessed as Arrow columnar data +in memory without any deserialization, while Parquet files must always be +decompressed and decoded. See the [Arrow project +FAQ](https://arrow.apache.org/faq/) for more. ## Other capabilities -In addition to these readers and writers, the `arrow` package has wrappers for other readers in the C++ library; see `?read_csv_arrow` and `?read_json_arrow`. These readers are being developed to optimize for the memory layout of the Arrow columnar format and are not intended as a direct replacement for existing R CSV readers (`base::read.csv`, `readr::read_csv`, `data.table::fread`) that return an R `data.frame`. +In addition to these readers and writers, the `arrow` package has wrappers for +other readers in the C++ library; see `?read_csv_arrow` and +`?read_json_arrow`. These readers are being developed to optimize for the +memory layout of the Arrow columnar format and are not intended as a direct +replacement for existing R CSV readers (`base::read.csv`, `readr::read_csv`, +`data.table::fread`) that return an R `data.frame`. -It also provides many lower-level bindings to the C++ library, which enable you to access and manipulate Arrow objects. You can use these to build connectors to other applications and services that use Arrow. One example is Spark: the [`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to move data to and from Spark, yielding [significant performance gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/). +It also provides many lower-level bindings to the C++ library, which enable you +to access and manipulate Arrow objects. You can use these to build connectors +to other applications and services that use Arrow.
One example is Spark: the +[`sparklyr`](https://spark.rstudio.com/) package has support for using Arrow to +move data to and from Spark, yielding [significant performance +gains](http://arrow.apache.org/blog/2019/01/25/r-spark-improvements/). ## Acknowledgements -In addition to the work on wiring the R package up to the Arrow Parquet C++ library, a lot of effort went into building and packaging Arrow for R users, ensuring its ease of installation across platforms. We'd like to thank the support of Jeroen Ooms, Javier Luraschi, JJ Allaire, Davis Vaughan, the CRAN team, and many others in the Apache Arrow community for helping us get to this point. +In addition to the work on wiring the R package up to the Arrow Parquet C++ +library, a lot of effort went into building and packaging Arrow for R users, +ensuring its ease of installation across platforms. We'd like to thank Jeroen +Ooms, Javier Luraschi, JJ Allaire, Davis Vaughan, the CRAN team, and many +others in the Apache Arrow community for their support in helping us get to +this point. From 7c8254b2b9921979290efc0b3cf8c82f18c6119b Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 8 Aug 2019 12:16:23 -0500 Subject: [PATCH 6/6] Add note about nokogiri requirements --- site/README.md | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/site/README.md b/site/README.md index 651b013c512..8dddbf074fc 100644 --- a/site/README.md +++ b/site/README.md @@ -47,6 +47,13 @@ such cases the following configuration option may help: bundle config build.nokogiri --use-system-libraries ``` +`nokogiri` depends on the `libxml2` and `libxslt1` libraries, which can be +installed on Debian-like systems with + +``` +apt-get install libxml2-dev libxslt1-dev +``` + If you are planning to publish the website, you must clone the arrow-site git repository. Run this command from the `site` directory so that `asf-site` is a subdirectory of `site`.