From 9929e55a22e7cd2f6aada6acce89cfe6af5c70cc Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 12:29:24 +0100 Subject: [PATCH 01/24] Remove backticks to make easier to read --- r/vignettes/dataset.Rmd | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index b5e17578b29..e05939bd3d9 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -8,13 +8,14 @@ vignette: > --- Apache Arrow lets you work efficiently with large, multi-file datasets. -The `arrow` R package provides a `dplyr` interface to Arrow Datasets, +The arrow R package provides a `dplyr` interface to Arrow Datasets, as well as other tools for interactive exploration of Arrow data. This vignette introduces Datasets and shows how to use `dplyr` to analyze them. It describes both what is possible to do with Arrow now and what is on the immediate development roadmap. + ## Example: NYC taxi data The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) From fbe74668114901ae6e6bd43dfe5fac4a4fc2dc68 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 12:51:50 +0100 Subject: [PATCH 02/24] Grammarly suggestions and adding some subheadings --- r/vignettes/dataset.Rmd | 59 +++++++++++++++++++---------------------- 1 file changed, 27 insertions(+), 32 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index e05939bd3d9..11ab986e2d7 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -8,30 +8,29 @@ vignette: > --- Apache Arrow lets you work efficiently with large, multi-file datasets. -The arrow R package provides a `dplyr` interface to Arrow Datasets, -as well as other tools for interactive exploration of Arrow data. +The arrow R package provides a dplyr interface to Arrow Datasets, +and other tools for interactive exploration of Arrow data. This vignette introduces Datasets and shows how to use `dplyr` to analyze them. It describes both what is possible to do with Arrow now and what is on the immediate development roadmap. - ## Example: NYC taxi data The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) is widely used in big data exercises and competitions. For demonstration purposes, we have hosted a Parquet-formatted version -of about 10 years of the trip data in a public Amazon S3 bucket. +of about ten years of the trip data in a public Amazon S3 bucket. The total file size is around 37 gigabytes, even in the efficient Parquet file format. That's bigger than memory on most people's computers, so we can't just read it all in and stack it into a single data frame. In Windows and macOS binary packages, S3 support is included. -On Linux when installing from source, S3 support is not enabled by default, +On Linux, when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details. -To see if your `arrow` installation has S3 support, run +To see if your `arrow` installation has S3 support, run: ```{r} arrow::arrow_with_s3() @@ -120,9 +119,9 @@ have file paths like ``` By providing a character vector to `partitioning`, we're saying that the first -path segment gives the value for `year` and the second segment is `month`. +path segment gives the value for `year`, and the second segment is `month`. 
Every row in `2009/01/data.parquet` has a value of 2009 for `year` -and 1 for `month`, even though those columns may not actually be present in the file. +and 1 for `month`, even though those columns may not be present in the file. Indeed, when we look at the dataset, we see that in addition to the columns present in every file, there are also columns `year` and `month`. @@ -185,9 +184,8 @@ In the current release, `arrow` supports the dplyr verbs `mutate()`, or other verbs with aggregate functions, use `collect()` to pull the selected subset of the data into an in-memory R data frame. -If you attempt to call unsupported `dplyr` verbs or unimplemented functions in -your query on an Arrow Dataset, the `arrow` package raises an error. However, -for `dplyr` queries on `Table` objects (which are typically smaller in size) the +Suppose you attempt to call unsupported `dplyr` verbs or unimplemented functions in your query on an Arrow Dataset. In that case, the `arrow` package raises an error. However, +for `dplyr` queries on `Table` objects (typically smaller in size than Datasets), the package automatically calls `collect()` before processing that `dplyr` verb. Here's an example. Suppose I was curious about tipping behavior among the @@ -230,11 +228,10 @@ cat(" ``` We just selected a subset out of a dataset with around 2 billion rows, computed -a new column, and aggregated on it in under 2 seconds on my laptop. How does +a new column, and aggregated it in under 2 seconds on my laptop. How does this work? -First, -`mutate()`/`transmute()`, `select()`/`rename()`/`relocate()`, `filter()`, +First, `mutate()`/`transmute()`, `select()`/`rename()`/`relocate()`, `filter()`, `group_by()`, and `arrange()` record their actions but don't evaluate on the data until you run `collect()`. @@ -260,7 +257,7 @@ See $.data for the source Arrow object ") ``` -This returns instantly and shows the manipulations you've made, without +This code returns an output instantly and shows the manipulations you've made, without loading data from the files. Because the evaluation of these queries is deferred, you can build up a query that selects down to a small subset without generating intermediate datasets that would potentially be large. @@ -268,8 +265,7 @@ intermediate datasets that would potentially be large. Second, all work is pushed down to the individual data files, and depending on the file format, chunks of data within the files. As a result, we can select a subset of data from a much larger dataset by collecting the -smaller slices from each file--we don't have to load the whole dataset in memory -in order to slice from it. +smaller slices from each file--we don't have to load the whole dataset in memory to slice from it. Third, because of partitioning, we can ignore some files entirely. In this example, by filtering `year == 2015`, all files corresponding to other years @@ -281,27 +277,26 @@ avoid scanning because they have no rows where `total_amount > 100`. ## More dataset options There are a few ways you can control the Dataset creation to adapt to special use cases. -For one, if you are working with a single file or a set of files that are not -all in the same directory, you can provide a file path or a vector of multiple -file paths to `open_dataset()`. This is useful if, for example, you have a -single CSV file that is too big to read into memory. 
You could pass the file -path to `open_dataset()`, use `group_by()` to partition the Dataset into -manageable chunks, then use `write_dataset()` to write each chunk to a separate -Parquet file---all without needing to read the full CSV file into R. - -You can specify a `schema` argument to `open_dataset()` to declare the columns -and their data types. This is useful if you have data files that have different -storage schema (for example, a column could be `int32` in one and `int8` in another) -and you want to ensure that the resulting Dataset has a specific type. -To be clear, it's not necessary to specify a schema, even in this example of -mixed integer types, because the Dataset constructor will reconcile differences like these. -The schema specification just lets you declare what you want the result to be. + +### Working with files in a directory + +If you are working with a single file or a set of files that are not all in the same directory, you can provide a file path or a vector of multiple file paths to `open_dataset()`. This is useful if, for example, you have a single CSV file that is too big to read into memory. You could pass the file path to `open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks, then use `write_dataset()` to write each chunk to a separate Parquet file---all without needing to read the full CSV file into R. + +### Explicitly declare column names and data types + +You can specify a `schema` argument to `open_dataset()` to declare the columns and their data types. This is useful if you have data files that have different storage schema (for example, a column could be `int32` in one and `int8` in another) and you want to ensure that the resulting Dataset has a specific type. + +To be clear, it's not necessary to specify a schema, even in this example of mixed integer types, because the Dataset constructor will reconcile differences like these. The schema specification just lets you declare what you want the result to be. + +### Explicitly declare partition format Similarly, you can provide a Schema in the `partitioning` argument of `open_dataset()` in order to declare the types of the virtual columns that define the partitions. This would be useful, in our taxi dataset example, if you wanted to keep `month` as a string instead of an integer for some reason. +### Work with multiple data sources + Another feature of Datasets is that they can be composed of multiple data sources. That is, you may have a directory of partitioned Parquet files in one location, and in another directory, files that haven't been partitioned. From 9726952c95bace81df8801d8f3f928cbe809223c Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 13:33:00 +0100 Subject: [PATCH 03/24] "we" -> "you" --- r/vignettes/dataset.Rmd | 74 ++++++++++++++++++++--------------------- 1 file changed, 36 insertions(+), 38 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 11ab986e2d7..d04195a22c9 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -23,7 +23,7 @@ For demonstration purposes, we have hosted a Parquet-formatted version of about ten years of the trip data in a public Amazon S3 bucket. The total file size is around 37 gigabytes, even in the efficient Parquet file -format. That's bigger than memory on most people's computers, so we can't just +format. That's bigger than memory on most people's computers, so you can't just read it all in and stack it into a single data frame. 
In Windows and macOS binary packages, S3 support is included. @@ -77,7 +77,7 @@ feel free to grab only a year or two of data. If you don't have the taxi data downloaded, the vignette will still run and will yield previously cached output for reference. To be explicit about which version -is running, let's check whether we're running with live data: +is running, let's check whether you're running with live data: ```{r} dir.exists("nyc-taxi") @@ -87,29 +87,29 @@ dir.exists("nyc-taxi") Because `dplyr` is not necessary for many Arrow workflows, it is an optional (`Suggests`) dependency. So, to work with Datasets, -we need to load both `arrow` and `dplyr`. +you need to load both `arrow` and `dplyr`. ```{r} library(arrow, warn.conflicts = FALSE) library(dplyr, warn.conflicts = FALSE) ``` -The first step is to create our Dataset object, pointing at the directory of data. +The first step is to create a Dataset object, pointing at the directory of data. ```{r, eval = file.exists("nyc-taxi")} ds <- open_dataset("nyc-taxi", partitioning = c("year", "month")) ``` -The default file format for `open_dataset()` is Parquet; if we had a directory -of Arrow format files, we could include `format = "arrow"` in the call. +The default file format for `open_dataset()` is Parquet; if you had a directory +of Arrow format files, you could include `format = "arrow"` in the call. Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"` for generic text-delimited files. For text files, you can pass any parsing options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise pass to `read_csv_arrow()`. -The `partitioning` argument lets us specify how the file paths provide information -about how the dataset is chunked into different files. Our files in this example +The `partitioning` argument lets you specify how the file paths provide information +about how the dataset is chunked into different files. The files in this example have file paths like ``` @@ -118,12 +118,12 @@ have file paths like ... ``` -By providing a character vector to `partitioning`, we're saying that the first +By providing a character vector to `partitioning`, you're saying that the first path segment gives the value for `year`, and the second segment is `month`. Every row in `2009/01/data.parquet` has a value of 2009 for `year` and 1 for `month`, even though those columns may not be present in the file. -Indeed, when we look at the dataset, we see that in addition to the columns present +Indeed, when you look at the dataset, you can see that in addition to the columns present in every file, there are also columns `year` and `month`. ```{r, eval = file.exists("nyc-taxi")} @@ -159,7 +159,7 @@ See $metadata for additional Schema metadata The other form of partitioning currently supported is [Hive](https://hive.apache.org/)-style, in which the partition variable names are included in the path segments. -If we had saved our files in paths like +If you had saved your files in paths like ``` year=2009/month=01/data.parquet @@ -167,15 +167,15 @@ year=2009/month=02/data.parquet ... ``` -we would not have had to provide the names in `partitioning`: -we could have just called `ds <- open_dataset("nyc-taxi")` and the partitions +you would not have had to provide the names in `partitioning`: +you could have just called `ds <- open_dataset("nyc-taxi")` and the partitions would have been detected automatically. 
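As an aside, the parsing options and the Hive-style path detection described above can be combined. A minimal sketch (the `nyc-taxi-text` directory, its pipe delimiter, and its `year=.../month=...` layout are all hypothetical, not part of the hosted taxi data) might be:

```r
# Hypothetical directory of pipe-delimited text files laid out Hive-style,
# e.g. nyc-taxi-text/year=2009/month=01/data.txt
ds_text <- open_dataset(
  "nyc-taxi-text",
  format = "text",
  delim = "|"
)
```

Because the paths are Hive-style, `year` and `month` would again be picked up as partition columns without a `partitioning` argument.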
## Querying the dataset -Up to this point, we haven't loaded any data: we have walked directories to find -files, we've parsed file paths to identify partitions, and we've read the -headers of the Parquet files to inspect their schemas so that we can make sure +Up to this point, you haven't loaded any data: you have walked directories to find +files, you've parsed file paths to identify partitions, and you've read the +headers of the Parquet files to inspect their schemas so that you can make sure they all line up. In the current release, `arrow` supports the dplyr verbs `mutate()`, @@ -227,7 +227,7 @@ cat(" ") ``` -We just selected a subset out of a dataset with around 2 billion rows, computed +You just selected a subset out of a dataset with around 2 billion rows, computed a new column, and aggregated it in under 2 seconds on my laptop. How does this work? @@ -264,14 +264,14 @@ intermediate datasets that would potentially be large. Second, all work is pushed down to the individual data files, and depending on the file format, chunks of data within the files. As a result, -we can select a subset of data from a much larger dataset by collecting the -smaller slices from each file--we don't have to load the whole dataset in memory to slice from it. +you can select a subset of data from a much larger dataset by collecting the +smaller slices from each file--you don't have to load the whole dataset in memory to slice from it. -Third, because of partitioning, we can ignore some files entirely. +Third, because of partitioning, you can ignore some files entirely. In this example, by filtering `year == 2015`, all files corresponding to other years -are immediately excluded: we don't have to load them in order to find that no +are immediately excluded: you don't have to load them in order to find that no rows match the filter. Relatedly, since Parquet files contain row groups with -statistics on the data within, there may be entire chunks of data we can +statistics on the data within, there may be entire chunks of data you can avoid scanning because they have no rows where `total_amount > 100`. ## More dataset options @@ -292,8 +292,8 @@ To be clear, it's not necessary to specify a schema, even in this example of mix Similarly, you can provide a Schema in the `partitioning` argument of `open_dataset()` in order to declare the types of the virtual columns that define the partitions. -This would be useful, in our taxi dataset example, if you wanted to keep -`month` as a string instead of an integer for some reason. +This would be useful, in the taxi dataset example, if you wanted to keep +`month` as a string instead of an integer. ### Work with multiple data sources @@ -309,27 +309,25 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d As you can see, querying a large dataset can be made quite fast by storage in an efficient binary columnar format like Parquet or Feather and partitioning based on -columns commonly used for filtering. However, we don't always get our data delivered -to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data -is cleaning is up and reshaping it into a more usable form. +columns commonly used for filtering. However, data isn't always stored that way. Sometimes you might start with one giant CSV. The first step in analyzing data is cleaning is up and reshaping it into a more usable form. 
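For example, the starting point for that kind of cleanup might be nothing more than pointing `open_dataset()` at the one oversized file (the file name here is hypothetical):

```r
# A single CSV that is too large to read into memory all at once
big_csv <- open_dataset("giant-file.csv", format = "csv")
```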
The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files. -Assume we have a version of the NYC Taxi data as CSV: +Assume you have a version of the NYC Taxi data as CSV: ```r ds <- open_dataset("nyc-taxi/csv/", format = "csv") ``` -We can write it to a new location and translate the files to the Feather format +You can write it to a new location and translate the files to the Feather format by calling `write_dataset()` on it: ```r write_dataset(ds, "nyc-taxi/feather", format = "feather") ``` -Next, let's imagine that the `payment_type` column is something we often filter -on, so we want to partition the data by that variable. By doing so we ensure +Next, let's imagine that the `payment_type` column is something you often filter +on, so you want to partition the data by that variable. By doing so you ensure that a filter like `payment_type == "Cash"` will touch only a subset of files where `payment_type` is always `"Cash"`. @@ -363,14 +361,14 @@ system("tree nyc-taxi/feather") Note that the directory names are `payment_type=Cash` and similar: this is the Hive-style partitioning described above. This means that when -we call `open_dataset()` on this directory, we don't have to declare what the +you call `open_dataset()` on this directory, you don't have to declare what the partitions are because they can be read from the file paths. (To instead write bare values for partition segments, i.e. `Cash` rather than `payment_type=Cash`, call `write_dataset()` with `hive_style = FALSE`.) -Perhaps, though, `payment_type == "Cash"` is the only data we ever care about, -and we just want to drop the rest and have a smaller working set. -For this, we can `filter()` them out when writing: +Perhaps, though, `payment_type == "Cash"` is the only data you ever care about, +and you just want to drop the rest and have a smaller working set. +For this, you can `filter()` them out when writing: ```r ds %>% @@ -378,9 +376,9 @@ ds %>% write_dataset("nyc-taxi/feather", format = "feather") ``` -The other thing we can do when writing datasets is select a subset of and/or reorder -columns. Suppose we never care about `vendor_id`, and being a string column, -it can take up a lot of space when we read it in, so let's drop it: +The other thing you can do when writing datasets is select a subset of and/or reorder +columns. Suppose you never care about `vendor_id`, and being a string column, +it can take up a lot of space when you read it in, so let's drop it: ```r ds %>% From e13543dafca351969aa513cb1289f34dbdd0a1ee Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 13:42:30 +0100 Subject: [PATCH 04/24] Use bold instead of backticks to make package names more readable --- r/vignettes/dataset.Rmd | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index d04195a22c9..32c041f1730 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -8,10 +8,10 @@ vignette: > --- Apache Arrow lets you work efficiently with large, multi-file datasets. -The arrow R package provides a dplyr interface to Arrow Datasets, +The __arrow__ R package provides a __dplyr__ interface to Arrow Datasets, and other tools for interactive exploration of Arrow data. -This vignette introduces Datasets and shows how to use `dplyr` to analyze them. 
+This vignette introduces Datasets and shows how to use __dplyr__ to analyze them. It describes both what is possible to do with Arrow now and what is on the immediate development roadmap. @@ -30,7 +30,7 @@ In Windows and macOS binary packages, S3 support is included. On Linux, when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details. -To see if your `arrow` installation has S3 support, run: +To see if your __arrow__ installation has S3 support, run: ```{r} arrow::arrow_with_s3() @@ -41,13 +41,13 @@ machine is located in the same AWS region as the data. So, for this vignette, we assume that the NYC taxi dataset has been downloaded locally in a "nyc-taxi" directory. -If your `arrow` build has S3 support, you can sync the data locally with: +If your __arrow__ build has S3 support, you can sync the data locally with: ```{r, eval = FALSE} arrow::copy_files("s3://ursa-labs-taxi-data", "nyc-taxi") ``` -If your `arrow` build doesn't have S3 support, you can download the files +If your __arrow__ build doesn't have S3 support, you can download the files with some additional code: ```{r, eval = FALSE} @@ -85,9 +85,9 @@ dir.exists("nyc-taxi") ## Getting started -Because `dplyr` is not necessary for many Arrow workflows, +Because __dplyr__ is not necessary for many Arrow workflows, it is an optional (`Suggests`) dependency. So, to work with Datasets, -you need to load both `arrow` and `dplyr`. +you need to load both __arrow__ and __dplyr__. ```{r} library(arrow, warn.conflicts = FALSE) @@ -178,15 +178,15 @@ files, you've parsed file paths to identify partitions, and you've read the headers of the Parquet files to inspect their schemas so that you can make sure they all line up. -In the current release, `arrow` supports the dplyr verbs `mutate()`, +In the current release, __arrow__ supports the dplyr verbs `mutate()`, `transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and `arrange()`. Aggregation is not yet supported, so before you call `summarise()` or other verbs with aggregate functions, use `collect()` to pull the selected subset of the data into an in-memory R data frame. -Suppose you attempt to call unsupported `dplyr` verbs or unimplemented functions in your query on an Arrow Dataset. In that case, the `arrow` package raises an error. However, -for `dplyr` queries on `Table` objects (typically smaller in size than Datasets), the -package automatically calls `collect()` before processing that `dplyr` verb. +Suppose you attempt to call unsupported __dplyr__ verbs or unimplemented functions in your query on an Arrow Dataset. In that case, the __arrow__ package raises an error. However, +for __dplyr__ queries on `Table` objects (typically smaller in size than Datasets), the +package automatically calls `collect()` before processing that __dplyr__ verb. Here's an example. Suppose I was curious about tipping behavior among the longest taxi rides. 
Let's find the median tip percentage for rides with From a9414440bd07b4bb8b0614d41d7202c0b39bc4ff Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 14:38:53 +0100 Subject: [PATCH 05/24] Split paragraph into bullets --- r/vignettes/dataset.Rmd | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 32c041f1730..34df15c591a 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -102,11 +102,12 @@ ds <- open_dataset("nyc-taxi", partitioning = c("year", "month")) The default file format for `open_dataset()` is Parquet; if you had a directory of Arrow format files, you could include `format = "arrow"` in the call. -Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather -v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"` -for generic text-delimited files. For text files, you can pass any parsing -options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise -pass to `read_csv_arrow()`. +Other supported formats include: + +* `"feather"` (an alias for `"arrow"`, as Feather v2 is the Arrow file format) +* `"csv"`, `"tsv"` (for tab-delimited), and `"text"` for generic text-delimited files. + +For text files, you can pass any parsing options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise pass to `read_csv_arrow()`. The `partitioning` argument lets you specify how the file paths provide information about how the dataset is chunked into different files. The files in this example @@ -167,13 +168,13 @@ year=2009/month=02/data.parquet ... ``` -you would not have had to provide the names in `partitioning`: +you would not have had to provide the names in `partitioning`; you could have just called `ds <- open_dataset("nyc-taxi")` and the partitions would have been detected automatically. ## Querying the dataset -Up to this point, you haven't loaded any data: you have walked directories to find +Up to this point, you haven't loaded any data. You have walked directories to find files, you've parsed file paths to identify partitions, and you've read the headers of the Parquet files to inspect their schemas so that you can make sure they all line up. From 02ffc4c8cf021e6050e04965c9854d62f14694d4 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 14:57:36 +0100 Subject: [PATCH 06/24] Breaking sections down and tweaks --- r/vignettes/dataset.Rmd | 60 ++++++++++++++++++++++++++++------------- 1 file changed, 42 insertions(+), 18 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 34df15c591a..31d3a2dca69 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -41,6 +41,8 @@ machine is located in the same AWS region as the data. So, for this vignette, we assume that the NYC taxi dataset has been downloaded locally in a "nyc-taxi" directory. +### Retrieving data from a public Amazon S3 bucket + If your __arrow__ build has S3 support, you can sync the data locally with: ```{r, eval = FALSE} @@ -100,14 +102,18 @@ The first step is to create a Dataset object, pointing at the directory of data. ds <- open_dataset("nyc-taxi", partitioning = c("year", "month")) ``` -The default file format for `open_dataset()` is Parquet; if you had a directory -of Arrow format files, you could include `format = "arrow"` in the call. +The file format for `open_dataset()` is controlled by the `format` parameter, +which has a default value of `"parquet"`. 
If you had a directory +of Arrow format files, you could instead specify `format = "arrow"` in the call. + Other supported formats include: -* `"feather"` (an alias for `"arrow"`, as Feather v2 is the Arrow file format) -* `"csv"`, `"tsv"` (for tab-delimited), and `"text"` for generic text-delimited files. +* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format) +* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files) +* `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use) -For text files, you can pass any parsing options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise pass to `read_csv_arrow()`. +For text files, you can pass any parsing options (`delim`, `quote`, etc.) to +`open_dataset()` that you would otherwise pass to `read_csv_arrow()`. The `partitioning` argument lets you specify how the file paths provide information about how the dataset is chunked into different files. The files in this example @@ -174,10 +180,10 @@ would have been detected automatically. ## Querying the dataset -Up to this point, you haven't loaded any data. You have walked directories to find +Up to this point, you haven't loaded any data. You've walked directories to find files, you've parsed file paths to identify partitions, and you've read the headers of the Parquet files to inspect their schemas so that you can make sure -they all line up. +they all are as expected. In the current release, __arrow__ supports the dplyr verbs `mutate()`, `transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and @@ -185,11 +191,12 @@ In the current release, __arrow__ supports the dplyr verbs `mutate()`, or other verbs with aggregate functions, use `collect()` to pull the selected subset of the data into an in-memory R data frame. -Suppose you attempt to call unsupported __dplyr__ verbs or unimplemented functions in your query on an Arrow Dataset. In that case, the __arrow__ package raises an error. However, +Suppose you attempt to call unsupported __dplyr__ verbs or unimplemented functions +in your query on an Arrow Dataset. In that case, the __arrow__ package raises an error. However, for __dplyr__ queries on `Table` objects (typically smaller in size than Datasets), the package automatically calls `collect()` before processing that __dplyr__ verb. -Here's an example. Suppose I was curious about tipping behavior among the +Here's an example. Suppose that you are curious about tipping behavior among the longest taxi rides. Let's find the median tip percentage for rides with fares greater than $100 in 2015, broken down by the number of passengers: @@ -228,8 +235,8 @@ cat(" ") ``` -You just selected a subset out of a dataset with around 2 billion rows, computed -a new column, and aggregated it in under 2 seconds on my laptop. How does +You've just selected a subset out of a dataset with around 2 billion rows, computed +a new column, and aggregated it in under 2 seconds on most modern laptops. How does this work? First, `mutate()`/`transmute()`, `select()`/`rename()`/`relocate()`, `filter()`, @@ -266,7 +273,8 @@ intermediate datasets that would potentially be large. Second, all work is pushed down to the individual data files, and depending on the file format, chunks of data within the files. As a result, you can select a subset of data from a much larger dataset by collecting the -smaller slices from each file--you don't have to load the whole dataset in memory to slice from it. 
+smaller slices from each file--you don't have to load the whole dataset in +memory to slice from it. Third, because of partitioning, you can ignore some files entirely. In this example, by filtering `year == 2015`, all files corresponding to other years @@ -281,13 +289,25 @@ There are a few ways you can control the Dataset creation to adapt to special us ### Working with files in a directory -If you are working with a single file or a set of files that are not all in the same directory, you can provide a file path or a vector of multiple file paths to `open_dataset()`. This is useful if, for example, you have a single CSV file that is too big to read into memory. You could pass the file path to `open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks, then use `write_dataset()` to write each chunk to a separate Parquet file---all without needing to read the full CSV file into R. +If you are working with a single file or a set of files that are not all in the +same directory, you can provide a file path or a vector of multiple file paths +to `open_dataset()`. This is useful if, for example, you have a single CSV file +that is too big to read into memory. You could pass the file path to +`open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks, +then use `write_dataset()` to write each chunk to a separate Parquet file---all +without needing to read the full CSV file into R. ### Explicitly declare column names and data types -You can specify a `schema` argument to `open_dataset()` to declare the columns and their data types. This is useful if you have data files that have different storage schema (for example, a column could be `int32` in one and `int8` in another) and you want to ensure that the resulting Dataset has a specific type. +You can specify a `schema` argument to `open_dataset()` to declare the columns +and their data types. This is useful if you have data files that have different +storage schema (for example, a column could be `int32` in one and `int8` in +another) and you want to ensure that the resulting Dataset has a specific type. -To be clear, it's not necessary to specify a schema, even in this example of mixed integer types, because the Dataset constructor will reconcile differences like these. The schema specification just lets you declare what you want the result to be. +To be clear, it's not necessary to specify a schema, even in this example of +mixed integer types, because the Dataset constructor will reconcile differences +like these. The schema specification just lets you declare what you want the +result to be. ### Explicitly declare partition format @@ -310,11 +330,15 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d As you can see, querying a large dataset can be made quite fast by storage in an efficient binary columnar format like Parquet or Feather and partitioning based on -columns commonly used for filtering. However, data isn't always stored that way. Sometimes you might start with one giant CSV. The first step in analyzing data is cleaning is up and reshaping it into a more usable form. +columns commonly used for filtering. However, data isn't always stored that way. +Sometimes you might start with one giant CSV. The first step in analyzing data +is cleaning is up and reshaping it into a more usable form. 
-The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files. +The `write_dataset()` function allows you to take a Dataset or another tabular +data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write +it to a different file format, partitioned into multiple files. -Assume you have a version of the NYC Taxi data as CSV: +Assume that you have a version of the NYC Taxi data as CSV: ```r ds <- open_dataset("nyc-taxi/csv/", format = "csv") From ae9b685aede7ed7acdd16cef16988b1edd756e35 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 14:58:56 +0100 Subject: [PATCH 07/24] Rename section heading --- r/vignettes/dataset.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 31d3a2dca69..5b4b0ed7524 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -85,7 +85,7 @@ is running, let's check whether you're running with live data: dir.exists("nyc-taxi") ``` -## Getting started +## Opening the dataset Because __dplyr__ is not necessary for many Arrow workflows, it is an optional (`Suggests`) dependency. So, to work with Datasets, From 2c4f9f424709d915a154bddce3c24dd514acece5 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 16:10:04 +0100 Subject: [PATCH 08/24] Specify Windows version for S3, minor tweaks --- r/vignettes/dataset.Rmd | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 5b4b0ed7524..56826b5fbd1 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -26,7 +26,7 @@ The total file size is around 37 gigabytes, even in the efficient Parquet file format. That's bigger than memory on most people's computers, so you can't just read it all in and stack it into a single data frame. -In Windows and macOS binary packages, S3 support is included. +In Windows (for R > 3.6) and macOS binary packages, S3 support is included. On Linux, when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details. @@ -36,7 +36,7 @@ To see if your __arrow__ installation has S3 support, run: arrow::arrow_with_s3() ``` -Even with S3 support enabled network, speed will be a bottleneck unless your +Even with an S3 support enabled network, speed will be a bottleneck unless your machine is located in the same AWS region as the data. So, for this vignette, we assume that the NYC taxi dataset has been downloaded locally in a "nyc-taxi" directory. @@ -287,7 +287,7 @@ avoid scanning because they have no rows where `total_amount > 100`. There are a few ways you can control the Dataset creation to adapt to special use cases. -### Working with files in a directory +### Work with files in a directory If you are working with a single file or a set of files that are not all in the same directory, you can provide a file path or a vector of multiple file paths @@ -299,7 +299,7 @@ without needing to read the full CSV file into R. ### Explicitly declare column names and data types -You can specify a `schema` argument to `open_dataset()` to declare the columns +You can specify the `schema` argument to `open_dataset()` to declare the columns and their data types. 
This is useful if you have data files that have different storage schema (for example, a column could be `int32` in one and `int8` in another) and you want to ensure that the resulting Dataset has a specific type. @@ -401,8 +401,8 @@ ds %>% write_dataset("nyc-taxi/feather", format = "feather") ``` -The other thing you can do when writing datasets is select a subset of and/or reorder -columns. Suppose you never care about `vendor_id`, and being a string column, +The other thing you can do when writing datasets is select a subset of columns +or reorder them. Suppose you never care about `vendor_id`, and being a string column, it can take up a lot of space when you read it in, so let's drop it: ```r From 5fabe6602a4c11c6ff9f30b30b49739edd1527ff Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 18:24:52 +0100 Subject: [PATCH 09/24] More signposting, remove unnecessary words/sentences, split things out into bullet points. --- r/vignettes/developing.Rmd | 100 ++++++++++++++++++++----------------- 1 file changed, 54 insertions(+), 46 deletions(-) diff --git a/r/vignettes/developing.Rmd b/r/vignettes/developing.Rmd index d6e31392056..d58cfe0a922 100644 --- a/r/vignettes/developing.Rmd +++ b/r/vignettes/developing.Rmd @@ -40,18 +40,26 @@ set -e set -x ``` -If you're looking to contribute to `arrow`, this document can help you set up a development environment that will enable you to write code and run tests locally. It outlines how to build the various components that make up the Arrow project and R package, as well as some common troubleshooting and workflows developers use. Many contributions can be accomplished with the instructions in [R-only development](#r-only-development). But if you're working on both the C++ library and the R package, the [Developer environment setup](#-developer-environment-setup) section will guide you through setting up a developer environment. +If you're looking to contribute to __arrow__, this vignette can help you set up a development environment that will enable you to write code and run tests locally. It outlines: +* how to build the components that make up the Arrow project and R package +* some common troubleshooting and workflows that developers use + +Many contributions can be accomplished with the instructions in [R-only development](#r-only-development), but if you're working on both the C++ library and the R package, the [Developer environment setup](#-developer-environment-setup) section will guide you through setting up a developer environment. This document is intended only for developers of Apache Arrow or the Arrow R package. Users of the package in R do not need to do any of this setup. If you're looking for how to install Arrow, see [the instructions in the readme](https://arrow.apache.org/docs/r/#installation); Linux users can find more details on building from source at `vignette("install", package = "arrow")`. -This document is a work in progress and will grow + change as the Apache Arrow project grows and changes. We have tried to make these steps as robust as possible (in fact, we even test exactly these instructions on our nightly CI to ensure they don't become stale!), but certain custom configurations might conflict with these instructions and there are differences of opinion across developers about if and what the one true way to set up development environments like this is. We also solicit any feedback you have about things that are confusing or additions you would like to see here. 
Please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues) if there you see anything that is confusing, odd, or just plain wrong. +This document is a work in progress and will grow and change as the Apache Arrow project grows and changes. We have tried to make these steps as robust as possible (in fact, we even test exactly these instructions on our nightly CI to ensure they don't become stale!), but custom configurations might conflict with these instructions and there are differences of opinion across developers about how to set up development environments like this is. + +We welcome any feedback you have about things that are confusing or additions you would like to see here. Please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues) if there you see anything that is confusing, odd, or just plain wrong. -## R-only development +# R-only developer environment setup Windows and macOS users who wish to contribute to the R package and -don’t need to alter the Arrow C++ library may be able to obtain a -recent version of the library without building from source. On macOS, -you may install the C++ library using [Homebrew](https://brew.sh/): +don't need to alter the Arrow C++ library may be able to obtain a +recent version of the library without building from source. + +## macOS +On macOS, you may install the C++ library using [Homebrew](https://brew.sh/): ``` shell # For the released version: @@ -60,11 +68,12 @@ brew install apache-arrow brew install apache-arrow --HEAD ``` +## Windows and Linux + On Windows and Linux, you can download a .zip file with the arrow dependencies from the nightly repository. Windows users then can set the `RWINLIB_LOCAL` environment variable to point to that -zip file before installing the `arrow` R package. On Linux, you'll need to create a `libarrow` directory inside the R package directory and unzip that file into it. Version numbers in that -repository correspond to dates, and you will likely want the most recent. +zip file before installing the `arrow` R package. On Linux, you'll need to create a `libarrow` directory inside the R package directory and unzip that file into it. Version numbers in that repository correspond to dates. To see what nightlies are available, you can use Arrow's (or any other S3 client's) S3 listing functionality to see what is in the bucket `s3://arrow-r-nightly/libarrow/bin`: @@ -73,41 +82,41 @@ nightly <- s3_bucket("arrow-r-nightly") nightly$ls("libarrow/bin") ``` -## Developer environment setup +# R and C++ developer environment setup -If you need to alter both the Arrow C++ library and the R package code, or if you can’t get a binary version of the latest C++ library elsewhere, you’ll need to build it from source too. This section discusses how to set up a C++ build configured to work with the R package. For more general resources, see the [Arrow C++ developer -guide](https://arrow.apache.org/docs/developers/cpp/building.html). +If you need to alter both the Arrow C++ library and the R package code, or if you can't get a binary version of the latest C++ library elsewhere, you'll need to build it from source too. This section discusses how to set up a C++ build configured to work with the R package. For more general resources, see the [Arrow C++ developer guide](https://arrow.apache.org/docs/developers/cpp/building.html). 
-There are four major steps to the process — the first three are relevant to all Arrow developers, and the last one is specific to the R bindings: +There are five major steps to the process — the first three are relevant to all Arrow developers, and the last one is specific to the R bindings: -1. Configuring the Arrow library build (using `cmake`) — this specifies how you want the build to go, what features to include, etc. -2. Building the Arrow library — this actually compiles the Arrow library -3. Install the Arrow library — this organizes and moves the compiled Arrow library files into the location specified in the configuration -4. Building the R package — this builds the C++ code in the R package, and installs the R package for you +1. Install dependencies +2. Configuring the Arrow library build (using `cmake`) — this specifies how you want the build to go, what features to include, etc. +3. Building the Arrow library — this actually compiles the Arrow library +4. Install the Arrow library — this organizes and moves the compiled Arrow library files into the location specified in the configuration +5. Building the R package — this builds the C++ code in the R package, and installs the R package for you -### Install dependencies {.tabset} +## Step 1 - Install dependencies The Arrow C++ library will by default use system dependencies if suitable versions are found; if they are not present, it will build them during its own build process. The only dependencies that one needs to install outside of the build process are `cmake` (for configuring the build) and `openssl` if you are building with S3 support. For a faster build, you may choose to install on the system more C++ library dependencies (such as `lz4`, `zstd`, etc.) so that they don't need to be built from source in the Arrow build. This is optional. -#### macOS +### macOS ```{bash, save=run & macos} brew install cmake openssl ``` -#### Ubuntu +### Ubuntu ```{bash, save=run & ubuntu} sudo apt install -y cmake libcurl4-openssl-dev libssl-dev ``` -### Configure the Arrow build {.tabset} +## Step 2 - Configure the Arrow build {.tabset} You can choose to build and then install the Arrow library into a user-defined directory or into a system-level directory. You only need to do one of these two options. It is recommended that you install the arrow library to a user-level directory to be used in development. This is so that the development version you are using doesn't overwrite a released version of Arrow you may have installed. You are also able to have more than one version of the Arrow library to link to with this approach (by using different `ARROW_HOME` directories for the different versions). This approach also matches the recommendations for other Arrow bindings like [Python](http://arrow.apache.org/docs/developers/python.html). -#### Configure for installing to a user directory +### Configure for installing to a user directory In this example we will install it to a directory called `dist` that has the same parent as our `arrow` checkout, but it could be named or located anywhere you would like. However, note that your installation of the Arrow R package will point to this directory and need it to remain intact for the package to continue to work. This is one reason we recommend *not* placing it inside of the arrow git checkout. @@ -131,7 +140,7 @@ mkdir -p cpp/build pushd cpp/build ``` -You’ll first call `cmake` to configure the build and then `make install`. 
For the R package, you’ll need to enable several features in the C++ library using `-D` flags: +You'll first call `cmake` to configure the build and then `make install`. For the R package, you'll need to enable several features in the C++ library using `-D` flags: ```{bash, save=run & !sys_install} cmake \ @@ -153,7 +162,7 @@ cmake \ `..` refers to the C++ source directory: we're in `cpp/build`, and the source is in `cpp`. -#### Configure to install to a system directory +### Configure to install to a system directory If you would like to install Arrow as a system library you can do that as well. This is in some respects simpler, but if you already have Arrow libraries installed there, it would disrupt them and possibly require `sudo` permissions. @@ -165,7 +174,7 @@ mkdir -p cpp/build pushd cpp/build ``` -You’ll first call `cmake` to configure the build and then `make install`. For the R package, you’ll need to enable several features in the C++ library using `-D` flags: +You'll first call `cmake` to configure the build and then `make install`. For the R package, you'll need to enable several features in the C++ library using `-D` flags: ```{bash, save=run & sys_install} cmake \ @@ -185,7 +194,7 @@ cmake \ `..` refers to the C++ source directory: we're in `cpp/build`, and the source is in `cpp`. -### More Arrow features +## More Arrow features To enable optional features including: S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags (the trailing `\` makes them easier to paste into a bash shell on a new line): @@ -206,7 +215,7 @@ Other flags that may be useful: _Note_ `cmake` is particularly sensitive to whitespacing, if you see errors, check that you don't have any errant whitespace around -### Build Arrow +## Step 3 - Building Arrow You can add `-j#` between `make` and `install` here too to speed up compilation by running in parallel (where `#` is the number of cores you have available). @@ -221,10 +230,9 @@ need to use `sudo`: sudo make install ``` +## Step 4 - Build the Arrow R package -### Build the Arrow R package - -Once you’ve built the C++ library, you can install the R package and its +Once you've built the C++ library, you can install the R package and its dependencies, along with additional dev dependencies, from the git checkout: @@ -290,7 +298,7 @@ The documentation for the R package uses features of `roxygen2` that haven't yet remotes::install_github("r-lib/roxygen2") ``` -## Troubleshooting +# Troubleshooting Note that after any change to the C++ library, you must reinstall it and run `make clean` or `git clean -fdx .` to remove any cached object code @@ -299,12 +307,12 @@ only necessary if you make changes to the C++ library source; you do not need to manually purge object files if you are only editing R or C++ code inside `r/`. 
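For example, one possible clean-rebuild cycle after editing the C++ sources might look like this (a sketch only: the paths assume the user-directory layout shown above, and `make clean` uses the helper Makefile in `r/` mentioned later in this vignette):

``` shell
pushd cpp/build
make install        # recompile and reinstall libarrow
popd
cd r
make clean          # or: git clean -fdx ., to drop cached object code
R CMD INSTALL .     # then reinstall the R package
```

If you configured a system-level install, the `make install` step needs `sudo`, as noted above.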
-### Arrow library-R package mismatches +## Arrow library-R package mismatches If the Arrow library and the R package have diverged, you will see errors like: ``` -Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath = DLLpath, ...): +Error: package or namespace load failed for ‘arrow' in dyn.load(file, DLLpath = DLLpath, ...): unable to load shared object '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so': dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so, 6): Symbol not found: __ZN5arrow2io16RandomAccessFile9ReadAsyncERKNS0_9IOContextExx Referenced from: /Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so @@ -322,7 +330,7 @@ To resolve this, try rebuilding the Arrow library from [Building Arrow above](#b If rebuilding the Arrow library doesn't work and you are [installing from a user-level directory](#installing-to-another-directory) and you already have a previous installation of libarrow in a system directory or you get you may get errors like the following when you install the R package: ``` -Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath = DLLpath, ...): +Error: package or namespace load failed for ‘arrow' in dyn.load(file, DLLpath = DLLpath, ...): unable to load shared object '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so': dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: /usr/local/lib/libarrow.400.dylib Referenced from: /usr/local/lib/libparquet.400.dylib @@ -376,15 +384,15 @@ wherever Arrow C++ was put in `make install`, e.g. `export R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package. When installing from source, if the R and C++ library versions do not -match, installation may fail. If you’ve previously installed the -libraries and want to upgrade the R package, you’ll need to update the +match, installation may fail. If you've previously installed the +libraries and want to upgrade the R package, you'll need to update the Arrow C++ library first. For any other build/configuration challenges, see the [C++ developer guide](https://arrow.apache.org/docs/developers/cpp/building.html). -## Using `remotes::install_github(...)` +# Using `remotes::install_github(...)` If you need an Arrow installation from a specific repository or at a specific ref, `remotes::install_github("apache/arrow/r", build = FALSE)` @@ -408,7 +416,7 @@ separate from another Arrow development environment or system installation * Setting the environment variable `FORCE_BUNDLED_BUILD` to `true` will skip the `pkg-config` search for Arrow libraries and attempt to build from the same source at the repository+ref given. * You may also need to set the Makevars `CPPFLAGS` and `LDFLAGS` to `""` in order to prevent the installation process from attempting to link to already installed system versions of Arrow. One way to do this temporarily is wrapping your `remotes::install_github()` call like so: `withr::with_makevars(list(CPPFLAGS = "", LDFLAGS = ""), remotes::install_github(...))`. -## What happens when you `R CMD INSTALL`? +# What happens when you `R CMD INSTALL`? There are a number of scripts that are triggered when `R CMD INSTALL .`. For Arrow users, these should all just work without configuration and pull in the most complete pieces (e.g. 
official binaries that we host) so the installation process is easy. However knowing about these scripts can help troubleshoot if things go wrong in them or things go wrong in an install: @@ -418,12 +426,12 @@ There are a number of scripts that are triggered when `R CMD INSTALL .`. For Arr * Check if a binary is available from our hosted unofficial builds. * Download the Arrow source and build the Arrow Library from source. * `*** Proceed without C++` dependencies (this is an error and the package will not work, but if you see this message you know the previous steps have not succeeded/were not enabled) -* `inst/build_arrow_static.sh` this script builds Arrow for a bundled, static build. It is called by `tools/nixlibs.R` when the Arrow library is being built. (If you're looking at this script, and you've gotten this far, it should look _incredibly_ familiar: it's basically the contents of this guide in script form — with a few important changes) +* `inst/build_arrow_static.sh` this script builds Arrow for a bundled, static build. It is called by `tools/nixlibs.R` when the Arrow library is being built. (If you're looking at this script, and you've gotten this far, it might look incredibly familiar: it's basically the contents of this guide in script form — with a few important changes) -## Editing C++ code in the R package +# Editing C++ code in the R package The `arrow` package uses some customized tools on top of `cpp11` to prepare its -C++ code in `src/`. This is because we have some features that are only enabled +C++ code in `src/`. This is because there are some features that are only enabled and built conditionally during build time. If you change C++ code in the R package, you will need to set the `ARROW_R_DEV` environment variable to `true` (optionally, add it to your `~/.Renviron` file to persist across sessions) so @@ -448,7 +456,7 @@ Fix any style issues before committing with ``` The lint script requires Python 3 and `clang-format-8`. If the command -isn’t found, you can explicitly provide the path to it like +isn't found, you can explicitly provide the path to it like `CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get this by installing LLVM via Homebrew and running the script as `CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh` @@ -460,7 +468,7 @@ _Note_ that the lint script requires Python 3 and the Python dependencies * flake8 * cmake_format==0.5.2 -## Running tests +# Running tests Some tests are conditionally enabled based on the availability of certain features in the package build (S3 support, compression libraries, etc.). @@ -481,7 +489,7 @@ variables or other settings: settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and `MINIO_PORT` to override the defaults. -## Github workflows +# Github workflows On a pull request, there are some actions you can trigger by commenting on the PR. We have additional CI checks that run nightly and can be requested on demand using an internal tool called [crosssbow](https://arrow.apache.org/docs/developers/crossbow.html). 
A few important GitHub comment commands include: @@ -490,7 +498,7 @@ On a pull request, there are some actions you can trigger by commenting on the P * `@github-actions autotune` will run and fix lint c++ linting errors + run R documentation (among other cleanup tasks) and commit them to the branch -## Useful functions for Arrow developers +# Useful functions for Arrow developers Within an R session, these can help with package development: @@ -518,10 +526,10 @@ covr::package_coverage() ``` Any of those can be run from the command line by wrapping them in `R -e -'$COMMAND'`. There’s also a `Makefile` to help with some common tasks +'$COMMAND'`. There's also a `Makefile` to help with some common tasks from the command line (`make test`, `make doc`, `make clean`, etc.) -### Full package validation +## Full package validation ``` shell R CMD build . From c788f302bf7e02bd20b5677021e882621ea5a109 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Mon, 26 Jul 2021 13:52:50 +0100 Subject: [PATCH 10/24] Add a note about no Windows C++ dev --- r/vignettes/developing.Rmd | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/r/vignettes/developing.Rmd b/r/vignettes/developing.Rmd index d58cfe0a922..8adf7bbf62b 100644 --- a/r/vignettes/developing.Rmd +++ b/r/vignettes/developing.Rmd @@ -40,7 +40,7 @@ set -e set -x ``` -If you're looking to contribute to __arrow__, this vignette can help you set up a development environment that will enable you to write code and run tests locally. It outlines: +If you're looking to contribute to arrow, this vignette can help you set up a development environment that will enable you to write code and run tests locally. It outlines: * how to build the components that make up the Arrow project and R package * some common troubleshooting and workflows that developers use @@ -73,7 +73,7 @@ brew install apache-arrow --HEAD On Windows and Linux, you can download a .zip file with the arrow dependencies from the nightly repository. Windows users then can set the `RWINLIB_LOCAL` environment variable to point to that -zip file before installing the `arrow` R package. On Linux, you'll need to create a `libarrow` directory inside the R package directory and unzip that file into it. Version numbers in that repository correspond to dates. +zip file before installing the arrow R package. On Linux, you'll need to create a `libarrow` directory inside the R package directory and unzip that file into it. Version numbers in that repository correspond to dates. To see what nightlies are available, you can use Arrow's (or any other S3 client's) S3 listing functionality to see what is in the bucket `s3://arrow-r-nightly/libarrow/bin`: @@ -110,6 +110,10 @@ brew install cmake openssl sudo apt install -y cmake libcurl4-openssl-dev libssl-dev ``` +### Windows + +Currently, the R package cannot be made to work with a locally-built Arrow C++ library. This will be resolved in a future release. + ## Step 2 - Configure the Arrow build {.tabset} You can choose to build and then install the Arrow library into a user-defined directory or into a system-level directory. You only need to do one of these two options. @@ -430,7 +434,7 @@ There are a number of scripts that are triggered when `R CMD INSTALL .`. For Arr # Editing C++ code in the R package -The `arrow` package uses some customized tools on top of `cpp11` to prepare its +The arrow package uses some customized tools on top of `cpp11` to prepare its C++ code in `src/`. 
This is because there are some features that are only enabled and built conditionally during build time. If you change C++ code in the R package, you will need to set the `ARROW_R_DEV` environment variable to `true` From 85495f2105c9714a8dde307112799e4a39ccfbea Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 12:29:24 +0100 Subject: [PATCH 11/24] Remove backticks to make easier to read --- r/vignettes/dataset.Rmd | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index b5e17578b29..e05939bd3d9 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -8,13 +8,14 @@ vignette: > --- Apache Arrow lets you work efficiently with large, multi-file datasets. -The `arrow` R package provides a `dplyr` interface to Arrow Datasets, +The arrow R package provides a `dplyr` interface to Arrow Datasets, as well as other tools for interactive exploration of Arrow data. This vignette introduces Datasets and shows how to use `dplyr` to analyze them. It describes both what is possible to do with Arrow now and what is on the immediate development roadmap. + ## Example: NYC taxi data The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) From c573c49cbec5e1690b6d3c67906d16c6439a5377 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 12:51:50 +0100 Subject: [PATCH 12/24] Grammarly suggestions and adding some subheadings --- r/vignettes/dataset.Rmd | 59 +++++++++++++++++++---------------------- 1 file changed, 27 insertions(+), 32 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index e05939bd3d9..11ab986e2d7 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -8,30 +8,29 @@ vignette: > --- Apache Arrow lets you work efficiently with large, multi-file datasets. -The arrow R package provides a `dplyr` interface to Arrow Datasets, -as well as other tools for interactive exploration of Arrow data. +The arrow R package provides a dplyr interface to Arrow Datasets, +and other tools for interactive exploration of Arrow data. This vignette introduces Datasets and shows how to use `dplyr` to analyze them. It describes both what is possible to do with Arrow now and what is on the immediate development roadmap. - ## Example: NYC taxi data The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) is widely used in big data exercises and competitions. For demonstration purposes, we have hosted a Parquet-formatted version -of about 10 years of the trip data in a public Amazon S3 bucket. +of about ten years of the trip data in a public Amazon S3 bucket. The total file size is around 37 gigabytes, even in the efficient Parquet file format. That's bigger than memory on most people's computers, so we can't just read it all in and stack it into a single data frame. In Windows and macOS binary packages, S3 support is included. -On Linux when installing from source, S3 support is not enabled by default, +On Linux, when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details. 
-To see if your `arrow` installation has S3 support, run +To see if your `arrow` installation has S3 support, run: ```{r} arrow::arrow_with_s3() @@ -120,9 +119,9 @@ have file paths like ``` By providing a character vector to `partitioning`, we're saying that the first -path segment gives the value for `year` and the second segment is `month`. +path segment gives the value for `year`, and the second segment is `month`. Every row in `2009/01/data.parquet` has a value of 2009 for `year` -and 1 for `month`, even though those columns may not actually be present in the file. +and 1 for `month`, even though those columns may not be present in the file. Indeed, when we look at the dataset, we see that in addition to the columns present in every file, there are also columns `year` and `month`. @@ -185,9 +184,8 @@ In the current release, `arrow` supports the dplyr verbs `mutate()`, or other verbs with aggregate functions, use `collect()` to pull the selected subset of the data into an in-memory R data frame. -If you attempt to call unsupported `dplyr` verbs or unimplemented functions in -your query on an Arrow Dataset, the `arrow` package raises an error. However, -for `dplyr` queries on `Table` objects (which are typically smaller in size) the +Suppose you attempt to call unsupported `dplyr` verbs or unimplemented functions in your query on an Arrow Dataset. In that case, the `arrow` package raises an error. However, +for `dplyr` queries on `Table` objects (typically smaller in size than Datasets), the package automatically calls `collect()` before processing that `dplyr` verb. Here's an example. Suppose I was curious about tipping behavior among the @@ -230,11 +228,10 @@ cat(" ``` We just selected a subset out of a dataset with around 2 billion rows, computed -a new column, and aggregated on it in under 2 seconds on my laptop. How does +a new column, and aggregated it in under 2 seconds on my laptop. How does this work? -First, -`mutate()`/`transmute()`, `select()`/`rename()`/`relocate()`, `filter()`, +First, `mutate()`/`transmute()`, `select()`/`rename()`/`relocate()`, `filter()`, `group_by()`, and `arrange()` record their actions but don't evaluate on the data until you run `collect()`. @@ -260,7 +257,7 @@ See $.data for the source Arrow object ") ``` -This returns instantly and shows the manipulations you've made, without +This code returns an output instantly and shows the manipulations you've made, without loading data from the files. Because the evaluation of these queries is deferred, you can build up a query that selects down to a small subset without generating intermediate datasets that would potentially be large. @@ -268,8 +265,7 @@ intermediate datasets that would potentially be large. Second, all work is pushed down to the individual data files, and depending on the file format, chunks of data within the files. As a result, we can select a subset of data from a much larger dataset by collecting the -smaller slices from each file--we don't have to load the whole dataset in memory -in order to slice from it. +smaller slices from each file--we don't have to load the whole dataset in memory to slice from it. Third, because of partitioning, we can ignore some files entirely. In this example, by filtering `year == 2015`, all files corresponding to other years @@ -281,27 +277,26 @@ avoid scanning because they have no rows where `total_amount > 100`. ## More dataset options There are a few ways you can control the Dataset creation to adapt to special use cases. 
-For one, if you are working with a single file or a set of files that are not -all in the same directory, you can provide a file path or a vector of multiple -file paths to `open_dataset()`. This is useful if, for example, you have a -single CSV file that is too big to read into memory. You could pass the file -path to `open_dataset()`, use `group_by()` to partition the Dataset into -manageable chunks, then use `write_dataset()` to write each chunk to a separate -Parquet file---all without needing to read the full CSV file into R. - -You can specify a `schema` argument to `open_dataset()` to declare the columns -and their data types. This is useful if you have data files that have different -storage schema (for example, a column could be `int32` in one and `int8` in another) -and you want to ensure that the resulting Dataset has a specific type. -To be clear, it's not necessary to specify a schema, even in this example of -mixed integer types, because the Dataset constructor will reconcile differences like these. -The schema specification just lets you declare what you want the result to be. + +### Working with files in a directory + +If you are working with a single file or a set of files that are not all in the same directory, you can provide a file path or a vector of multiple file paths to `open_dataset()`. This is useful if, for example, you have a single CSV file that is too big to read into memory. You could pass the file path to `open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks, then use `write_dataset()` to write each chunk to a separate Parquet file---all without needing to read the full CSV file into R. + +### Explicitly declare column names and data types + +You can specify a `schema` argument to `open_dataset()` to declare the columns and their data types. This is useful if you have data files that have different storage schema (for example, a column could be `int32` in one and `int8` in another) and you want to ensure that the resulting Dataset has a specific type. + +To be clear, it's not necessary to specify a schema, even in this example of mixed integer types, because the Dataset constructor will reconcile differences like these. The schema specification just lets you declare what you want the result to be. + +### Explicitly declare partition format Similarly, you can provide a Schema in the `partitioning` argument of `open_dataset()` in order to declare the types of the virtual columns that define the partitions. This would be useful, in our taxi dataset example, if you wanted to keep `month` as a string instead of an integer for some reason. +### Work with multiple data sources + Another feature of Datasets is that they can be composed of multiple data sources. That is, you may have a directory of partitioned Parquet files in one location, and in another directory, files that haven't been partitioned. From 83862d5c32225db9dd2d53c1551eab9f7d93dde3 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 13:33:00 +0100 Subject: [PATCH 13/24] "we" -> "you" --- r/vignettes/dataset.Rmd | 74 ++++++++++++++++++++--------------------- 1 file changed, 36 insertions(+), 38 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 11ab986e2d7..d04195a22c9 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -23,7 +23,7 @@ For demonstration purposes, we have hosted a Parquet-formatted version of about ten years of the trip data in a public Amazon S3 bucket. 
The total file size is around 37 gigabytes, even in the efficient Parquet file -format. That's bigger than memory on most people's computers, so we can't just +format. That's bigger than memory on most people's computers, so you can't just read it all in and stack it into a single data frame. In Windows and macOS binary packages, S3 support is included. @@ -77,7 +77,7 @@ feel free to grab only a year or two of data. If you don't have the taxi data downloaded, the vignette will still run and will yield previously cached output for reference. To be explicit about which version -is running, let's check whether we're running with live data: +is running, let's check whether you're running with live data: ```{r} dir.exists("nyc-taxi") @@ -87,29 +87,29 @@ dir.exists("nyc-taxi") Because `dplyr` is not necessary for many Arrow workflows, it is an optional (`Suggests`) dependency. So, to work with Datasets, -we need to load both `arrow` and `dplyr`. +you need to load both `arrow` and `dplyr`. ```{r} library(arrow, warn.conflicts = FALSE) library(dplyr, warn.conflicts = FALSE) ``` -The first step is to create our Dataset object, pointing at the directory of data. +The first step is to create a Dataset object, pointing at the directory of data. ```{r, eval = file.exists("nyc-taxi")} ds <- open_dataset("nyc-taxi", partitioning = c("year", "month")) ``` -The default file format for `open_dataset()` is Parquet; if we had a directory -of Arrow format files, we could include `format = "arrow"` in the call. +The default file format for `open_dataset()` is Parquet; if you had a directory +of Arrow format files, you could include `format = "arrow"` in the call. Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"` for generic text-delimited files. For text files, you can pass any parsing options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise pass to `read_csv_arrow()`. -The `partitioning` argument lets us specify how the file paths provide information -about how the dataset is chunked into different files. Our files in this example +The `partitioning` argument lets you specify how the file paths provide information +about how the dataset is chunked into different files. The files in this example have file paths like ``` @@ -118,12 +118,12 @@ have file paths like ... ``` -By providing a character vector to `partitioning`, we're saying that the first +By providing a character vector to `partitioning`, you're saying that the first path segment gives the value for `year`, and the second segment is `month`. Every row in `2009/01/data.parquet` has a value of 2009 for `year` and 1 for `month`, even though those columns may not be present in the file. -Indeed, when we look at the dataset, we see that in addition to the columns present +Indeed, when you look at the dataset, you can see that in addition to the columns present in every file, there are also columns `year` and `month`. ```{r, eval = file.exists("nyc-taxi")} @@ -159,7 +159,7 @@ See $metadata for additional Schema metadata The other form of partitioning currently supported is [Hive](https://hive.apache.org/)-style, in which the partition variable names are included in the path segments. -If we had saved our files in paths like +If you had saved your files in paths like ``` year=2009/month=01/data.parquet @@ -167,15 +167,15 @@ year=2009/month=02/data.parquet ... 
``` -we would not have had to provide the names in `partitioning`: -we could have just called `ds <- open_dataset("nyc-taxi")` and the partitions +you would not have had to provide the names in `partitioning`: +you could have just called `ds <- open_dataset("nyc-taxi")` and the partitions would have been detected automatically. ## Querying the dataset -Up to this point, we haven't loaded any data: we have walked directories to find -files, we've parsed file paths to identify partitions, and we've read the -headers of the Parquet files to inspect their schemas so that we can make sure +Up to this point, you haven't loaded any data: you have walked directories to find +files, you've parsed file paths to identify partitions, and you've read the +headers of the Parquet files to inspect their schemas so that you can make sure they all line up. In the current release, `arrow` supports the dplyr verbs `mutate()`, @@ -227,7 +227,7 @@ cat(" ") ``` -We just selected a subset out of a dataset with around 2 billion rows, computed +You just selected a subset out of a dataset with around 2 billion rows, computed a new column, and aggregated it in under 2 seconds on my laptop. How does this work? @@ -264,14 +264,14 @@ intermediate datasets that would potentially be large. Second, all work is pushed down to the individual data files, and depending on the file format, chunks of data within the files. As a result, -we can select a subset of data from a much larger dataset by collecting the -smaller slices from each file--we don't have to load the whole dataset in memory to slice from it. +you can select a subset of data from a much larger dataset by collecting the +smaller slices from each file--you don't have to load the whole dataset in memory to slice from it. -Third, because of partitioning, we can ignore some files entirely. +Third, because of partitioning, you can ignore some files entirely. In this example, by filtering `year == 2015`, all files corresponding to other years -are immediately excluded: we don't have to load them in order to find that no +are immediately excluded: you don't have to load them in order to find that no rows match the filter. Relatedly, since Parquet files contain row groups with -statistics on the data within, there may be entire chunks of data we can +statistics on the data within, there may be entire chunks of data you can avoid scanning because they have no rows where `total_amount > 100`. ## More dataset options @@ -292,8 +292,8 @@ To be clear, it's not necessary to specify a schema, even in this example of mix Similarly, you can provide a Schema in the `partitioning` argument of `open_dataset()` in order to declare the types of the virtual columns that define the partitions. -This would be useful, in our taxi dataset example, if you wanted to keep -`month` as a string instead of an integer for some reason. +This would be useful, in the taxi dataset example, if you wanted to keep +`month` as a string instead of an integer. ### Work with multiple data sources @@ -309,27 +309,25 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d As you can see, querying a large dataset can be made quite fast by storage in an efficient binary columnar format like Parquet or Feather and partitioning based on -columns commonly used for filtering. However, we don't always get our data delivered -to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data -is cleaning is up and reshaping it into a more usable form. 
+columns commonly used for filtering. However, data isn't always stored that way. Sometimes you might start with one giant CSV. The first step in analyzing data is cleaning is up and reshaping it into a more usable form. The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files. -Assume we have a version of the NYC Taxi data as CSV: +Assume you have a version of the NYC Taxi data as CSV: ```r ds <- open_dataset("nyc-taxi/csv/", format = "csv") ``` -We can write it to a new location and translate the files to the Feather format +You can write it to a new location and translate the files to the Feather format by calling `write_dataset()` on it: ```r write_dataset(ds, "nyc-taxi/feather", format = "feather") ``` -Next, let's imagine that the `payment_type` column is something we often filter -on, so we want to partition the data by that variable. By doing so we ensure +Next, let's imagine that the `payment_type` column is something you often filter +on, so you want to partition the data by that variable. By doing so you ensure that a filter like `payment_type == "Cash"` will touch only a subset of files where `payment_type` is always `"Cash"`. @@ -363,14 +361,14 @@ system("tree nyc-taxi/feather") Note that the directory names are `payment_type=Cash` and similar: this is the Hive-style partitioning described above. This means that when -we call `open_dataset()` on this directory, we don't have to declare what the +you call `open_dataset()` on this directory, you don't have to declare what the partitions are because they can be read from the file paths. (To instead write bare values for partition segments, i.e. `Cash` rather than `payment_type=Cash`, call `write_dataset()` with `hive_style = FALSE`.) -Perhaps, though, `payment_type == "Cash"` is the only data we ever care about, -and we just want to drop the rest and have a smaller working set. -For this, we can `filter()` them out when writing: +Perhaps, though, `payment_type == "Cash"` is the only data you ever care about, +and you just want to drop the rest and have a smaller working set. +For this, you can `filter()` them out when writing: ```r ds %>% @@ -378,9 +376,9 @@ ds %>% write_dataset("nyc-taxi/feather", format = "feather") ``` -The other thing we can do when writing datasets is select a subset of and/or reorder -columns. Suppose we never care about `vendor_id`, and being a string column, -it can take up a lot of space when we read it in, so let's drop it: +The other thing you can do when writing datasets is select a subset of and/or reorder +columns. Suppose you never care about `vendor_id`, and being a string column, +it can take up a lot of space when you read it in, so let's drop it: ```r ds %>% From 8b45e8982abc500c1483bcc69cb69650219bf5da Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 13:42:30 +0100 Subject: [PATCH 14/24] Use bold instead of backticks to make package names more readable --- r/vignettes/dataset.Rmd | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index d04195a22c9..32c041f1730 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -8,10 +8,10 @@ vignette: > --- Apache Arrow lets you work efficiently with large, multi-file datasets. 
-The arrow R package provides a dplyr interface to Arrow Datasets, +The __arrow__ R package provides a __dplyr__ interface to Arrow Datasets, and other tools for interactive exploration of Arrow data. -This vignette introduces Datasets and shows how to use `dplyr` to analyze them. +This vignette introduces Datasets and shows how to use __dplyr__ to analyze them. It describes both what is possible to do with Arrow now and what is on the immediate development roadmap. @@ -30,7 +30,7 @@ In Windows and macOS binary packages, S3 support is included. On Linux, when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details. -To see if your `arrow` installation has S3 support, run: +To see if your __arrow__ installation has S3 support, run: ```{r} arrow::arrow_with_s3() @@ -41,13 +41,13 @@ machine is located in the same AWS region as the data. So, for this vignette, we assume that the NYC taxi dataset has been downloaded locally in a "nyc-taxi" directory. -If your `arrow` build has S3 support, you can sync the data locally with: +If your __arrow__ build has S3 support, you can sync the data locally with: ```{r, eval = FALSE} arrow::copy_files("s3://ursa-labs-taxi-data", "nyc-taxi") ``` -If your `arrow` build doesn't have S3 support, you can download the files +If your __arrow__ build doesn't have S3 support, you can download the files with some additional code: ```{r, eval = FALSE} @@ -85,9 +85,9 @@ dir.exists("nyc-taxi") ## Getting started -Because `dplyr` is not necessary for many Arrow workflows, +Because __dplyr__ is not necessary for many Arrow workflows, it is an optional (`Suggests`) dependency. So, to work with Datasets, -you need to load both `arrow` and `dplyr`. +you need to load both __arrow__ and __dplyr__. ```{r} library(arrow, warn.conflicts = FALSE) @@ -178,15 +178,15 @@ files, you've parsed file paths to identify partitions, and you've read the headers of the Parquet files to inspect their schemas so that you can make sure they all line up. -In the current release, `arrow` supports the dplyr verbs `mutate()`, +In the current release, __arrow__ supports the dplyr verbs `mutate()`, `transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and `arrange()`. Aggregation is not yet supported, so before you call `summarise()` or other verbs with aggregate functions, use `collect()` to pull the selected subset of the data into an in-memory R data frame. -Suppose you attempt to call unsupported `dplyr` verbs or unimplemented functions in your query on an Arrow Dataset. In that case, the `arrow` package raises an error. However, -for `dplyr` queries on `Table` objects (typically smaller in size than Datasets), the -package automatically calls `collect()` before processing that `dplyr` verb. +Suppose you attempt to call unsupported __dplyr__ verbs or unimplemented functions in your query on an Arrow Dataset. In that case, the __arrow__ package raises an error. However, +for __dplyr__ queries on `Table` objects (typically smaller in size than Datasets), the +package automatically calls `collect()` before processing that __dplyr__ verb. Here's an example. Suppose I was curious about tipping behavior among the longest taxi rides. 
Let's find the median tip percentage for rides with From b6865dbc4fa624f139f6ab659df0ccd136605aa5 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 14:38:53 +0100 Subject: [PATCH 15/24] Split paragraph into bullets --- r/vignettes/dataset.Rmd | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 32c041f1730..34df15c591a 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -102,11 +102,12 @@ ds <- open_dataset("nyc-taxi", partitioning = c("year", "month")) The default file format for `open_dataset()` is Parquet; if you had a directory of Arrow format files, you could include `format = "arrow"` in the call. -Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather -v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"` -for generic text-delimited files. For text files, you can pass any parsing -options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise -pass to `read_csv_arrow()`. +Other supported formats include: + +* `"feather"` (an alias for `"arrow"`, as Feather v2 is the Arrow file format) +* `"csv"`, `"tsv"` (for tab-delimited), and `"text"` for generic text-delimited files. + +For text files, you can pass any parsing options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise pass to `read_csv_arrow()`. The `partitioning` argument lets you specify how the file paths provide information about how the dataset is chunked into different files. The files in this example @@ -167,13 +168,13 @@ year=2009/month=02/data.parquet ... ``` -you would not have had to provide the names in `partitioning`: +you would not have had to provide the names in `partitioning`; you could have just called `ds <- open_dataset("nyc-taxi")` and the partitions would have been detected automatically. ## Querying the dataset -Up to this point, you haven't loaded any data: you have walked directories to find +Up to this point, you haven't loaded any data. You have walked directories to find files, you've parsed file paths to identify partitions, and you've read the headers of the Parquet files to inspect their schemas so that you can make sure they all line up. From 66307ebbfba1f8e5be7ba65ea2f639226ff04d86 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 14:57:36 +0100 Subject: [PATCH 16/24] Breaking sections down and tweaks --- r/vignettes/dataset.Rmd | 60 ++++++++++++++++++++++++++++------------- 1 file changed, 42 insertions(+), 18 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 34df15c591a..31d3a2dca69 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -41,6 +41,8 @@ machine is located in the same AWS region as the data. So, for this vignette, we assume that the NYC taxi dataset has been downloaded locally in a "nyc-taxi" directory. +### Retrieving data from a public Amazon S3 bucket + If your __arrow__ build has S3 support, you can sync the data locally with: ```{r, eval = FALSE} @@ -100,14 +102,18 @@ The first step is to create a Dataset object, pointing at the directory of data. ds <- open_dataset("nyc-taxi", partitioning = c("year", "month")) ``` -The default file format for `open_dataset()` is Parquet; if you had a directory -of Arrow format files, you could include `format = "arrow"` in the call. +The file format for `open_dataset()` is controlled by the `format` parameter, +which has a default value of `"parquet"`. 
If you had a directory +of Arrow format files, you could instead specify `format = "arrow"` in the call. + Other supported formats include: -* `"feather"` (an alias for `"arrow"`, as Feather v2 is the Arrow file format) -* `"csv"`, `"tsv"` (for tab-delimited), and `"text"` for generic text-delimited files. +* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format) +* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files) +* `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use) -For text files, you can pass any parsing options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise pass to `read_csv_arrow()`. +For text files, you can pass any parsing options (`delim`, `quote`, etc.) to +`open_dataset()` that you would otherwise pass to `read_csv_arrow()`. The `partitioning` argument lets you specify how the file paths provide information about how the dataset is chunked into different files. The files in this example @@ -174,10 +180,10 @@ would have been detected automatically. ## Querying the dataset -Up to this point, you haven't loaded any data. You have walked directories to find +Up to this point, you haven't loaded any data. You've walked directories to find files, you've parsed file paths to identify partitions, and you've read the headers of the Parquet files to inspect their schemas so that you can make sure -they all line up. +they all are as expected. In the current release, __arrow__ supports the dplyr verbs `mutate()`, `transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and @@ -185,11 +191,12 @@ In the current release, __arrow__ supports the dplyr verbs `mutate()`, or other verbs with aggregate functions, use `collect()` to pull the selected subset of the data into an in-memory R data frame. -Suppose you attempt to call unsupported __dplyr__ verbs or unimplemented functions in your query on an Arrow Dataset. In that case, the __arrow__ package raises an error. However, +Suppose you attempt to call unsupported __dplyr__ verbs or unimplemented functions +in your query on an Arrow Dataset. In that case, the __arrow__ package raises an error. However, for __dplyr__ queries on `Table` objects (typically smaller in size than Datasets), the package automatically calls `collect()` before processing that __dplyr__ verb. -Here's an example. Suppose I was curious about tipping behavior among the +Here's an example. Suppose that you are curious about tipping behavior among the longest taxi rides. Let's find the median tip percentage for rides with fares greater than $100 in 2015, broken down by the number of passengers: @@ -228,8 +235,8 @@ cat(" ") ``` -You just selected a subset out of a dataset with around 2 billion rows, computed -a new column, and aggregated it in under 2 seconds on my laptop. How does +You've just selected a subset out of a dataset with around 2 billion rows, computed +a new column, and aggregated it in under 2 seconds on most modern laptops. How does this work? First, `mutate()`/`transmute()`, `select()`/`rename()`/`relocate()`, `filter()`, @@ -266,7 +273,8 @@ intermediate datasets that would potentially be large. Second, all work is pushed down to the individual data files, and depending on the file format, chunks of data within the files. As a result, you can select a subset of data from a much larger dataset by collecting the -smaller slices from each file--you don't have to load the whole dataset in memory to slice from it. 
+smaller slices from each file--you don't have to load the whole dataset in +memory to slice from it. Third, because of partitioning, you can ignore some files entirely. In this example, by filtering `year == 2015`, all files corresponding to other years @@ -281,13 +289,25 @@ There are a few ways you can control the Dataset creation to adapt to special us ### Working with files in a directory -If you are working with a single file or a set of files that are not all in the same directory, you can provide a file path or a vector of multiple file paths to `open_dataset()`. This is useful if, for example, you have a single CSV file that is too big to read into memory. You could pass the file path to `open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks, then use `write_dataset()` to write each chunk to a separate Parquet file---all without needing to read the full CSV file into R. +If you are working with a single file or a set of files that are not all in the +same directory, you can provide a file path or a vector of multiple file paths +to `open_dataset()`. This is useful if, for example, you have a single CSV file +that is too big to read into memory. You could pass the file path to +`open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks, +then use `write_dataset()` to write each chunk to a separate Parquet file---all +without needing to read the full CSV file into R. ### Explicitly declare column names and data types -You can specify a `schema` argument to `open_dataset()` to declare the columns and their data types. This is useful if you have data files that have different storage schema (for example, a column could be `int32` in one and `int8` in another) and you want to ensure that the resulting Dataset has a specific type. +You can specify a `schema` argument to `open_dataset()` to declare the columns +and their data types. This is useful if you have data files that have different +storage schema (for example, a column could be `int32` in one and `int8` in +another) and you want to ensure that the resulting Dataset has a specific type. -To be clear, it's not necessary to specify a schema, even in this example of mixed integer types, because the Dataset constructor will reconcile differences like these. The schema specification just lets you declare what you want the result to be. +To be clear, it's not necessary to specify a schema, even in this example of +mixed integer types, because the Dataset constructor will reconcile differences +like these. The schema specification just lets you declare what you want the +result to be. ### Explicitly declare partition format @@ -310,11 +330,15 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d As you can see, querying a large dataset can be made quite fast by storage in an efficient binary columnar format like Parquet or Feather and partitioning based on -columns commonly used for filtering. However, data isn't always stored that way. Sometimes you might start with one giant CSV. The first step in analyzing data is cleaning is up and reshaping it into a more usable form. +columns commonly used for filtering. However, data isn't always stored that way. +Sometimes you might start with one giant CSV. The first step in analyzing data +is cleaning is up and reshaping it into a more usable form. 
-The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files. +The `write_dataset()` function allows you to take a Dataset or another tabular +data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write +it to a different file format, partitioned into multiple files. -Assume you have a version of the NYC Taxi data as CSV: +Assume that you have a version of the NYC Taxi data as CSV: ```r ds <- open_dataset("nyc-taxi/csv/", format = "csv") From abe4a617840ec88a6128382066881215ce191f23 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 14:58:56 +0100 Subject: [PATCH 17/24] Rename section heading --- r/vignettes/dataset.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 31d3a2dca69..5b4b0ed7524 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -85,7 +85,7 @@ is running, let's check whether you're running with live data: dir.exists("nyc-taxi") ``` -## Getting started +## Opening the dataset Because __dplyr__ is not necessary for many Arrow workflows, it is an optional (`Suggests`) dependency. So, to work with Datasets, From 61f39cf72c361806d4acb76c18915d9946379064 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 16:10:04 +0100 Subject: [PATCH 18/24] Specify Windows version for S3, minor tweaks --- r/vignettes/dataset.Rmd | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 5b4b0ed7524..56826b5fbd1 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -26,7 +26,7 @@ The total file size is around 37 gigabytes, even in the efficient Parquet file format. That's bigger than memory on most people's computers, so you can't just read it all in and stack it into a single data frame. -In Windows and macOS binary packages, S3 support is included. +In Windows (for R > 3.6) and macOS binary packages, S3 support is included. On Linux, when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details. @@ -36,7 +36,7 @@ To see if your __arrow__ installation has S3 support, run: arrow::arrow_with_s3() ``` -Even with S3 support enabled network, speed will be a bottleneck unless your +Even with an S3 support enabled network, speed will be a bottleneck unless your machine is located in the same AWS region as the data. So, for this vignette, we assume that the NYC taxi dataset has been downloaded locally in a "nyc-taxi" directory. @@ -287,7 +287,7 @@ avoid scanning because they have no rows where `total_amount > 100`. There are a few ways you can control the Dataset creation to adapt to special use cases. -### Working with files in a directory +### Work with files in a directory If you are working with a single file or a set of files that are not all in the same directory, you can provide a file path or a vector of multiple file paths @@ -299,7 +299,7 @@ without needing to read the full CSV file into R. ### Explicitly declare column names and data types -You can specify a `schema` argument to `open_dataset()` to declare the columns +You can specify the `schema` argument to `open_dataset()` to declare the columns and their data types. 
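As an illustration only (the column list below is a placeholder, not the full taxi schema), an explicit schema might be declared like this:

```r
# A sketch: declare the columns you want in the resulting Dataset and the
# types they should be read as, rather than letting them be inferred.
ds <- open_dataset(
  "nyc-taxi/2009/01",
  schema = schema(
    passenger_count = int16(),
    total_amount = float64()
  )
)
```
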
This is useful if you have data files that have different storage schema (for example, a column could be `int32` in one and `int8` in another) and you want to ensure that the resulting Dataset has a specific type. @@ -401,8 +401,8 @@ ds %>% write_dataset("nyc-taxi/feather", format = "feather") ``` -The other thing you can do when writing datasets is select a subset of and/or reorder -columns. Suppose you never care about `vendor_id`, and being a string column, +The other thing you can do when writing datasets is select a subset of columns +or reorder them. Suppose you never care about `vendor_id`, and being a string column, it can take up a lot of space when you read it in, so let's drop it: ```r From a2ed71e9237a940b1240dbe672f505fad9699045 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 18:24:52 +0100 Subject: [PATCH 19/24] More signposting, remove unnecessary words/sentences, split things out into bullet points. --- r/vignettes/developing.Rmd | 100 ++++++++++++++++++++----------------- 1 file changed, 54 insertions(+), 46 deletions(-) diff --git a/r/vignettes/developing.Rmd b/r/vignettes/developing.Rmd index d6e31392056..d58cfe0a922 100644 --- a/r/vignettes/developing.Rmd +++ b/r/vignettes/developing.Rmd @@ -40,18 +40,26 @@ set -e set -x ``` -If you're looking to contribute to `arrow`, this document can help you set up a development environment that will enable you to write code and run tests locally. It outlines how to build the various components that make up the Arrow project and R package, as well as some common troubleshooting and workflows developers use. Many contributions can be accomplished with the instructions in [R-only development](#r-only-development). But if you're working on both the C++ library and the R package, the [Developer environment setup](#-developer-environment-setup) section will guide you through setting up a developer environment. +If you're looking to contribute to __arrow__, this vignette can help you set up a development environment that will enable you to write code and run tests locally. It outlines: +* how to build the components that make up the Arrow project and R package +* some common troubleshooting and workflows that developers use + +Many contributions can be accomplished with the instructions in [R-only development](#r-only-development), but if you're working on both the C++ library and the R package, the [Developer environment setup](#-developer-environment-setup) section will guide you through setting up a developer environment. This document is intended only for developers of Apache Arrow or the Arrow R package. Users of the package in R do not need to do any of this setup. If you're looking for how to install Arrow, see [the instructions in the readme](https://arrow.apache.org/docs/r/#installation); Linux users can find more details on building from source at `vignette("install", package = "arrow")`. -This document is a work in progress and will grow + change as the Apache Arrow project grows and changes. We have tried to make these steps as robust as possible (in fact, we even test exactly these instructions on our nightly CI to ensure they don't become stale!), but certain custom configurations might conflict with these instructions and there are differences of opinion across developers about if and what the one true way to set up development environments like this is. We also solicit any feedback you have about things that are confusing or additions you would like to see here. 
Please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues) if there you see anything that is confusing, odd, or just plain wrong. +This document is a work in progress and will grow and change as the Apache Arrow project grows and changes. We have tried to make these steps as robust as possible (in fact, we even test exactly these instructions on our nightly CI to ensure they don't become stale!), but custom configurations might conflict with these instructions and there are differences of opinion across developers about how to set up development environments like this is. + +We welcome any feedback you have about things that are confusing or additions you would like to see here. Please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues) if there you see anything that is confusing, odd, or just plain wrong. -## R-only development +# R-only developer environment setup Windows and macOS users who wish to contribute to the R package and -don’t need to alter the Arrow C++ library may be able to obtain a -recent version of the library without building from source. On macOS, -you may install the C++ library using [Homebrew](https://brew.sh/): +don't need to alter the Arrow C++ library may be able to obtain a +recent version of the library without building from source. + +## macOS +On macOS, you may install the C++ library using [Homebrew](https://brew.sh/): ``` shell # For the released version: @@ -60,11 +68,12 @@ brew install apache-arrow brew install apache-arrow --HEAD ``` +## Windows and Linux + On Windows and Linux, you can download a .zip file with the arrow dependencies from the nightly repository. Windows users then can set the `RWINLIB_LOCAL` environment variable to point to that -zip file before installing the `arrow` R package. On Linux, you'll need to create a `libarrow` directory inside the R package directory and unzip that file into it. Version numbers in that -repository correspond to dates, and you will likely want the most recent. +zip file before installing the `arrow` R package. On Linux, you'll need to create a `libarrow` directory inside the R package directory and unzip that file into it. Version numbers in that repository correspond to dates. To see what nightlies are available, you can use Arrow's (or any other S3 client's) S3 listing functionality to see what is in the bucket `s3://arrow-r-nightly/libarrow/bin`: @@ -73,41 +82,41 @@ nightly <- s3_bucket("arrow-r-nightly") nightly$ls("libarrow/bin") ``` -## Developer environment setup +# R and C++ developer environment setup -If you need to alter both the Arrow C++ library and the R package code, or if you can’t get a binary version of the latest C++ library elsewhere, you’ll need to build it from source too. This section discusses how to set up a C++ build configured to work with the R package. For more general resources, see the [Arrow C++ developer -guide](https://arrow.apache.org/docs/developers/cpp/building.html). +If you need to alter both the Arrow C++ library and the R package code, or if you can't get a binary version of the latest C++ library elsewhere, you'll need to build it from source too. This section discusses how to set up a C++ build configured to work with the R package. For more general resources, see the [Arrow C++ developer guide](https://arrow.apache.org/docs/developers/cpp/building.html). 
-There are four major steps to the process — the first three are relevant to all Arrow developers, and the last one is specific to the R bindings: +There are five major steps to the process — the first three are relevant to all Arrow developers, and the last one is specific to the R bindings: -1. Configuring the Arrow library build (using `cmake`) — this specifies how you want the build to go, what features to include, etc. -2. Building the Arrow library — this actually compiles the Arrow library -3. Install the Arrow library — this organizes and moves the compiled Arrow library files into the location specified in the configuration -4. Building the R package — this builds the C++ code in the R package, and installs the R package for you +1. Install dependencies +2. Configuring the Arrow library build (using `cmake`) — this specifies how you want the build to go, what features to include, etc. +3. Building the Arrow library — this actually compiles the Arrow library +4. Install the Arrow library — this organizes and moves the compiled Arrow library files into the location specified in the configuration +5. Building the R package — this builds the C++ code in the R package, and installs the R package for you -### Install dependencies {.tabset} +## Step 1 - Install dependencies The Arrow C++ library will by default use system dependencies if suitable versions are found; if they are not present, it will build them during its own build process. The only dependencies that one needs to install outside of the build process are `cmake` (for configuring the build) and `openssl` if you are building with S3 support. For a faster build, you may choose to install on the system more C++ library dependencies (such as `lz4`, `zstd`, etc.) so that they don't need to be built from source in the Arrow build. This is optional. -#### macOS +### macOS ```{bash, save=run & macos} brew install cmake openssl ``` -#### Ubuntu +### Ubuntu ```{bash, save=run & ubuntu} sudo apt install -y cmake libcurl4-openssl-dev libssl-dev ``` -### Configure the Arrow build {.tabset} +## Step 2 - Configure the Arrow build {.tabset} You can choose to build and then install the Arrow library into a user-defined directory or into a system-level directory. You only need to do one of these two options. It is recommended that you install the arrow library to a user-level directory to be used in development. This is so that the development version you are using doesn't overwrite a released version of Arrow you may have installed. You are also able to have more than one version of the Arrow library to link to with this approach (by using different `ARROW_HOME` directories for the different versions). This approach also matches the recommendations for other Arrow bindings like [Python](http://arrow.apache.org/docs/developers/python.html). -#### Configure for installing to a user directory +### Configure for installing to a user directory In this example we will install it to a directory called `dist` that has the same parent as our `arrow` checkout, but it could be named or located anywhere you would like. However, note that your installation of the Arrow R package will point to this directory and need it to remain intact for the package to continue to work. This is one reason we recommend *not* placing it inside of the arrow git checkout. @@ -131,7 +140,7 @@ mkdir -p cpp/build pushd cpp/build ``` -You’ll first call `cmake` to configure the build and then `make install`. 
For the R package, you’ll need to enable several features in the C++ library using `-D` flags: +You'll first call `cmake` to configure the build and then `make install`. For the R package, you'll need to enable several features in the C++ library using `-D` flags: ```{bash, save=run & !sys_install} cmake \ @@ -153,7 +162,7 @@ cmake \ `..` refers to the C++ source directory: we're in `cpp/build`, and the source is in `cpp`. -#### Configure to install to a system directory +### Configure to install to a system directory If you would like to install Arrow as a system library you can do that as well. This is in some respects simpler, but if you already have Arrow libraries installed there, it would disrupt them and possibly require `sudo` permissions. @@ -165,7 +174,7 @@ mkdir -p cpp/build pushd cpp/build ``` -You’ll first call `cmake` to configure the build and then `make install`. For the R package, you’ll need to enable several features in the C++ library using `-D` flags: +You'll first call `cmake` to configure the build and then `make install`. For the R package, you'll need to enable several features in the C++ library using `-D` flags: ```{bash, save=run & sys_install} cmake \ @@ -185,7 +194,7 @@ cmake \ `..` refers to the C++ source directory: we're in `cpp/build`, and the source is in `cpp`. -### More Arrow features +## More Arrow features To enable optional features including: S3 support, an alternative memory allocator, and additional compression libraries, add some or all of these flags (the trailing `\` makes them easier to paste into a bash shell on a new line): @@ -206,7 +215,7 @@ Other flags that may be useful: _Note_ `cmake` is particularly sensitive to whitespacing, if you see errors, check that you don't have any errant whitespace around -### Build Arrow +## Step 3 - Building Arrow You can add `-j#` between `make` and `install` here too to speed up compilation by running in parallel (where `#` is the number of cores you have available). @@ -221,10 +230,9 @@ need to use `sudo`: sudo make install ``` +## Step 4 - Build the Arrow R package -### Build the Arrow R package - -Once you’ve built the C++ library, you can install the R package and its +Once you've built the C++ library, you can install the R package and its dependencies, along with additional dev dependencies, from the git checkout: @@ -290,7 +298,7 @@ The documentation for the R package uses features of `roxygen2` that haven't yet remotes::install_github("r-lib/roxygen2") ``` -## Troubleshooting +# Troubleshooting Note that after any change to the C++ library, you must reinstall it and run `make clean` or `git clean -fdx .` to remove any cached object code @@ -299,12 +307,12 @@ only necessary if you make changes to the C++ library source; you do not need to manually purge object files if you are only editing R or C++ code inside `r/`. 
-### Arrow library-R package mismatches +## Arrow library-R package mismatches If the Arrow library and the R package have diverged, you will see errors like: ``` -Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath = DLLpath, ...): +Error: package or namespace load failed for ‘arrow' in dyn.load(file, DLLpath = DLLpath, ...): unable to load shared object '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so': dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so, 6): Symbol not found: __ZN5arrow2io16RandomAccessFile9ReadAsyncERKNS0_9IOContextExx Referenced from: /Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so @@ -322,7 +330,7 @@ To resolve this, try rebuilding the Arrow library from [Building Arrow above](#b If rebuilding the Arrow library doesn't work and you are [installing from a user-level directory](#installing-to-another-directory) and you already have a previous installation of libarrow in a system directory or you get you may get errors like the following when you install the R package: ``` -Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath = DLLpath, ...): +Error: package or namespace load failed for ‘arrow' in dyn.load(file, DLLpath = DLLpath, ...): unable to load shared object '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so': dlopen(/Library/Frameworks/R.framework/Versions/4.0/Resources/library/00LOCK-r/00new/arrow/libs/arrow.so, 6): Library not loaded: /usr/local/lib/libarrow.400.dylib Referenced from: /usr/local/lib/libparquet.400.dylib @@ -376,15 +384,15 @@ wherever Arrow C++ was put in `make install`, e.g. `export R_LD_LIBRARY_PATH=/usr/local/lib`, and retry installing the R package. When installing from source, if the R and C++ library versions do not -match, installation may fail. If you’ve previously installed the -libraries and want to upgrade the R package, you’ll need to update the +match, installation may fail. If you've previously installed the +libraries and want to upgrade the R package, you'll need to update the Arrow C++ library first. For any other build/configuration challenges, see the [C++ developer guide](https://arrow.apache.org/docs/developers/cpp/building.html). -## Using `remotes::install_github(...)` +# Using `remotes::install_github(...)` If you need an Arrow installation from a specific repository or at a specific ref, `remotes::install_github("apache/arrow/r", build = FALSE)` @@ -408,7 +416,7 @@ separate from another Arrow development environment or system installation * Setting the environment variable `FORCE_BUNDLED_BUILD` to `true` will skip the `pkg-config` search for Arrow libraries and attempt to build from the same source at the repository+ref given. * You may also need to set the Makevars `CPPFLAGS` and `LDFLAGS` to `""` in order to prevent the installation process from attempting to link to already installed system versions of Arrow. One way to do this temporarily is wrapping your `remotes::install_github()` call like so: `withr::with_makevars(list(CPPFLAGS = "", LDFLAGS = ""), remotes::install_github(...))`. -## What happens when you `R CMD INSTALL`? +# What happens when you `R CMD INSTALL`? There are a number of scripts that are triggered when `R CMD INSTALL .`. For Arrow users, these should all just work without configuration and pull in the most complete pieces (e.g. 
official binaries that we host) so the installation process is easy. However knowing about these scripts can help troubleshoot if things go wrong in them or things go wrong in an install: @@ -418,12 +426,12 @@ There are a number of scripts that are triggered when `R CMD INSTALL .`. For Arr * Check if a binary is available from our hosted unofficial builds. * Download the Arrow source and build the Arrow Library from source. * `*** Proceed without C++` dependencies (this is an error and the package will not work, but if you see this message you know the previous steps have not succeeded/were not enabled) -* `inst/build_arrow_static.sh` this script builds Arrow for a bundled, static build. It is called by `tools/nixlibs.R` when the Arrow library is being built. (If you're looking at this script, and you've gotten this far, it should look _incredibly_ familiar: it's basically the contents of this guide in script form — with a few important changes) +* `inst/build_arrow_static.sh` this script builds Arrow for a bundled, static build. It is called by `tools/nixlibs.R` when the Arrow library is being built. (If you're looking at this script, and you've gotten this far, it might look incredibly familiar: it's basically the contents of this guide in script form — with a few important changes) -## Editing C++ code in the R package +# Editing C++ code in the R package The `arrow` package uses some customized tools on top of `cpp11` to prepare its -C++ code in `src/`. This is because we have some features that are only enabled +C++ code in `src/`. This is because there are some features that are only enabled and built conditionally during build time. If you change C++ code in the R package, you will need to set the `ARROW_R_DEV` environment variable to `true` (optionally, add it to your `~/.Renviron` file to persist across sessions) so @@ -448,7 +456,7 @@ Fix any style issues before committing with ``` The lint script requires Python 3 and `clang-format-8`. If the command -isn’t found, you can explicitly provide the path to it like +isn't found, you can explicitly provide the path to it like `CLANG_FORMAT=$(which clang-format-8) ./lint.sh`. On macOS, you can get this by installing LLVM via Homebrew and running the script as `CLANG_FORMAT=$(brew --prefix llvm@8)/bin/clang-format ./lint.sh` @@ -460,7 +468,7 @@ _Note_ that the lint script requires Python 3 and the Python dependencies * flake8 * cmake_format==0.5.2 -## Running tests +# Running tests Some tests are conditionally enabled based on the availability of certain features in the package build (S3 support, compression libraries, etc.). @@ -481,7 +489,7 @@ variables or other settings: settings, you can set `MINIO_ACCESS_KEY`, `MINIO_SECRET_KEY`, and `MINIO_PORT` to override the defaults. -## Github workflows +# Github workflows On a pull request, there are some actions you can trigger by commenting on the PR. We have additional CI checks that run nightly and can be requested on demand using an internal tool called [crosssbow](https://arrow.apache.org/docs/developers/crossbow.html). 
A few important GitHub comment commands include: @@ -490,7 +498,7 @@ On a pull request, there are some actions you can trigger by commenting on the P * `@github-actions autotune` will run and fix lint c++ linting errors + run R documentation (among other cleanup tasks) and commit them to the branch -## Useful functions for Arrow developers +# Useful functions for Arrow developers Within an R session, these can help with package development: @@ -518,10 +526,10 @@ covr::package_coverage() ``` Any of those can be run from the command line by wrapping them in `R -e -'$COMMAND'`. There’s also a `Makefile` to help with some common tasks +'$COMMAND'`. There's also a `Makefile` to help with some common tasks from the command line (`make test`, `make doc`, `make clean`, etc.) -### Full package validation +## Full package validation ``` shell R CMD build . From 9353a2441764368e1a05719d338fbb908f7bd38b Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Mon, 26 Jul 2021 13:52:50 +0100 Subject: [PATCH 20/24] Add a note about no Windows C++ dev --- r/vignettes/developing.Rmd | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/r/vignettes/developing.Rmd b/r/vignettes/developing.Rmd index d58cfe0a922..8adf7bbf62b 100644 --- a/r/vignettes/developing.Rmd +++ b/r/vignettes/developing.Rmd @@ -40,7 +40,7 @@ set -e set -x ``` -If you're looking to contribute to __arrow__, this vignette can help you set up a development environment that will enable you to write code and run tests locally. It outlines: +If you're looking to contribute to arrow, this vignette can help you set up a development environment that will enable you to write code and run tests locally. It outlines: * how to build the components that make up the Arrow project and R package * some common troubleshooting and workflows that developers use @@ -73,7 +73,7 @@ brew install apache-arrow --HEAD On Windows and Linux, you can download a .zip file with the arrow dependencies from the nightly repository. Windows users then can set the `RWINLIB_LOCAL` environment variable to point to that -zip file before installing the `arrow` R package. On Linux, you'll need to create a `libarrow` directory inside the R package directory and unzip that file into it. Version numbers in that repository correspond to dates. +zip file before installing the arrow R package. On Linux, you'll need to create a `libarrow` directory inside the R package directory and unzip that file into it. Version numbers in that repository correspond to dates. To see what nightlies are available, you can use Arrow's (or any other S3 client's) S3 listing functionality to see what is in the bucket `s3://arrow-r-nightly/libarrow/bin`: @@ -110,6 +110,10 @@ brew install cmake openssl sudo apt install -y cmake libcurl4-openssl-dev libssl-dev ``` +### Windows + +Currently, the R package cannot be made to work with a locally-built Arrow C++ library. This will be resolved in a future release. + ## Step 2 - Configure the Arrow build {.tabset} You can choose to build and then install the Arrow library into a user-defined directory or into a system-level directory. You only need to do one of these two options. @@ -430,7 +434,7 @@ There are a number of scripts that are triggered when `R CMD INSTALL .`. For Arr # Editing C++ code in the R package -The `arrow` package uses some customized tools on top of `cpp11` to prepare its +The arrow package uses some customized tools on top of `cpp11` to prepare its C++ code in `src/`. 
This is because there are some features that are only enabled and built conditionally during build time. If you change C++ code in the R package, you will need to set the `ARROW_R_DEV` environment variable to `true` From 10ca0f4dc3e08a1c0898c492ec983c3caac18a85 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Mon, 26 Jul 2021 14:12:42 +0100 Subject: [PATCH 21/24] Resey changes to dataset --- r/vignettes/dataset.Rmd | 187 ++++++++++++++++++---------------------- 1 file changed, 84 insertions(+), 103 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 56826b5fbd1..b5e17578b29 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -8,10 +8,10 @@ vignette: > --- Apache Arrow lets you work efficiently with large, multi-file datasets. -The __arrow__ R package provides a __dplyr__ interface to Arrow Datasets, -and other tools for interactive exploration of Arrow data. +The `arrow` R package provides a `dplyr` interface to Arrow Datasets, +as well as other tools for interactive exploration of Arrow data. -This vignette introduces Datasets and shows how to use __dplyr__ to analyze them. +This vignette introduces Datasets and shows how to use `dplyr` to analyze them. It describes both what is possible to do with Arrow now and what is on the immediate development roadmap. @@ -20,36 +20,34 @@ and what is on the immediate development roadmap. The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) is widely used in big data exercises and competitions. For demonstration purposes, we have hosted a Parquet-formatted version -of about ten years of the trip data in a public Amazon S3 bucket. +of about 10 years of the trip data in a public Amazon S3 bucket. The total file size is around 37 gigabytes, even in the efficient Parquet file -format. That's bigger than memory on most people's computers, so you can't just +format. That's bigger than memory on most people's computers, so we can't just read it all in and stack it into a single data frame. -In Windows (for R > 3.6) and macOS binary packages, S3 support is included. -On Linux, when installing from source, S3 support is not enabled by default, +In Windows and macOS binary packages, S3 support is included. +On Linux when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details. -To see if your __arrow__ installation has S3 support, run: +To see if your `arrow` installation has S3 support, run ```{r} arrow::arrow_with_s3() ``` -Even with an S3 support enabled network, speed will be a bottleneck unless your +Even with S3 support enabled network, speed will be a bottleneck unless your machine is located in the same AWS region as the data. So, for this vignette, we assume that the NYC taxi dataset has been downloaded locally in a "nyc-taxi" directory. -### Retrieving data from a public Amazon S3 bucket - -If your __arrow__ build has S3 support, you can sync the data locally with: +If your `arrow` build has S3 support, you can sync the data locally with: ```{r, eval = FALSE} arrow::copy_files("s3://ursa-labs-taxi-data", "nyc-taxi") ``` -If your __arrow__ build doesn't have S3 support, you can download the files +If your `arrow` build doesn't have S3 support, you can download the files with some additional code: ```{r, eval = FALSE} @@ -79,44 +77,39 @@ feel free to grab only a year or two of data. 
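
If your build has S3 support and you only want a slice of the data, one hedged option is to copy a single year; the `2019` sub-path below is an assumption based on the `year/month` directory layout described later in this vignette.

```r
# Copy only the 2019 files, preserving the year/month layout locally
arrow::copy_files("s3://ursa-labs-taxi-data/2019", "nyc-taxi/2019")
```
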
If you don't have the taxi data downloaded, the vignette will still run and will yield previously cached output for reference. To be explicit about which version -is running, let's check whether you're running with live data: +is running, let's check whether we're running with live data: ```{r} dir.exists("nyc-taxi") ``` -## Opening the dataset +## Getting started -Because __dplyr__ is not necessary for many Arrow workflows, +Because `dplyr` is not necessary for many Arrow workflows, it is an optional (`Suggests`) dependency. So, to work with Datasets, -you need to load both __arrow__ and __dplyr__. +we need to load both `arrow` and `dplyr`. ```{r} library(arrow, warn.conflicts = FALSE) library(dplyr, warn.conflicts = FALSE) ``` -The first step is to create a Dataset object, pointing at the directory of data. +The first step is to create our Dataset object, pointing at the directory of data. ```{r, eval = file.exists("nyc-taxi")} ds <- open_dataset("nyc-taxi", partitioning = c("year", "month")) ``` -The file format for `open_dataset()` is controlled by the `format` parameter, -which has a default value of `"parquet"`. If you had a directory -of Arrow format files, you could instead specify `format = "arrow"` in the call. - -Other supported formats include: - -* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format) -* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files) -* `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use) +The default file format for `open_dataset()` is Parquet; if we had a directory +of Arrow format files, we could include `format = "arrow"` in the call. +Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather +v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"` +for generic text-delimited files. For text files, you can pass any parsing +options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise +pass to `read_csv_arrow()`. -For text files, you can pass any parsing options (`delim`, `quote`, etc.) to -`open_dataset()` that you would otherwise pass to `read_csv_arrow()`. - -The `partitioning` argument lets you specify how the file paths provide information -about how the dataset is chunked into different files. The files in this example +The `partitioning` argument lets us specify how the file paths provide information +about how the dataset is chunked into different files. Our files in this example have file paths like ``` @@ -125,12 +118,12 @@ have file paths like ... ``` -By providing a character vector to `partitioning`, you're saying that the first -path segment gives the value for `year`, and the second segment is `month`. +By providing a character vector to `partitioning`, we're saying that the first +path segment gives the value for `year` and the second segment is `month`. Every row in `2009/01/data.parquet` has a value of 2009 for `year` -and 1 for `month`, even though those columns may not be present in the file. +and 1 for `month`, even though those columns may not actually be present in the file. -Indeed, when you look at the dataset, you can see that in addition to the columns present +Indeed, when we look at the dataset, we see that in addition to the columns present in every file, there are also columns `year` and `month`. 
```{r, eval = file.exists("nyc-taxi")} @@ -166,7 +159,7 @@ See $metadata for additional Schema metadata The other form of partitioning currently supported is [Hive](https://hive.apache.org/)-style, in which the partition variable names are included in the path segments. -If you had saved your files in paths like +If we had saved our files in paths like ``` year=2009/month=01/data.parquet @@ -174,29 +167,29 @@ year=2009/month=02/data.parquet ... ``` -you would not have had to provide the names in `partitioning`; -you could have just called `ds <- open_dataset("nyc-taxi")` and the partitions +we would not have had to provide the names in `partitioning`: +we could have just called `ds <- open_dataset("nyc-taxi")` and the partitions would have been detected automatically. ## Querying the dataset -Up to this point, you haven't loaded any data. You've walked directories to find -files, you've parsed file paths to identify partitions, and you've read the -headers of the Parquet files to inspect their schemas so that you can make sure -they all are as expected. +Up to this point, we haven't loaded any data: we have walked directories to find +files, we've parsed file paths to identify partitions, and we've read the +headers of the Parquet files to inspect their schemas so that we can make sure +they all line up. -In the current release, __arrow__ supports the dplyr verbs `mutate()`, +In the current release, `arrow` supports the dplyr verbs `mutate()`, `transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and `arrange()`. Aggregation is not yet supported, so before you call `summarise()` or other verbs with aggregate functions, use `collect()` to pull the selected subset of the data into an in-memory R data frame. -Suppose you attempt to call unsupported __dplyr__ verbs or unimplemented functions -in your query on an Arrow Dataset. In that case, the __arrow__ package raises an error. However, -for __dplyr__ queries on `Table` objects (typically smaller in size than Datasets), the -package automatically calls `collect()` before processing that __dplyr__ verb. +If you attempt to call unsupported `dplyr` verbs or unimplemented functions in +your query on an Arrow Dataset, the `arrow` package raises an error. However, +for `dplyr` queries on `Table` objects (which are typically smaller in size) the +package automatically calls `collect()` before processing that `dplyr` verb. -Here's an example. Suppose that you are curious about tipping behavior among the +Here's an example. Suppose I was curious about tipping behavior among the longest taxi rides. Let's find the median tip percentage for rides with fares greater than $100 in 2015, broken down by the number of passengers: @@ -235,11 +228,12 @@ cat(" ") ``` -You've just selected a subset out of a dataset with around 2 billion rows, computed -a new column, and aggregated it in under 2 seconds on most modern laptops. How does +We just selected a subset out of a dataset with around 2 billion rows, computed +a new column, and aggregated on it in under 2 seconds on my laptop. How does this work? -First, `mutate()`/`transmute()`, `select()`/`rename()`/`relocate()`, `filter()`, +First, +`mutate()`/`transmute()`, `select()`/`rename()`/`relocate()`, `filter()`, `group_by()`, and `arrange()` record their actions but don't evaluate on the data until you run `collect()`. 
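
As a small sketch of that deferral (reusing the `ds` object and the same columns as the query above), you can chain several verbs and nothing is read from disk until the final `collect()`:

```r
# Each verb only records the operation; no Parquet files are scanned here
q <- ds %>%
  filter(total_amount > 100, year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = 100 * tip_amount / total_amount)

# Data is read and pulled into an R data frame only at this step
tips <- collect(q)
```
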
@@ -265,58 +259,47 @@ See $.data for the source Arrow object ") ``` -This code returns an output instantly and shows the manipulations you've made, without +This returns instantly and shows the manipulations you've made, without loading data from the files. Because the evaluation of these queries is deferred, you can build up a query that selects down to a small subset without generating intermediate datasets that would potentially be large. Second, all work is pushed down to the individual data files, and depending on the file format, chunks of data within the files. As a result, -you can select a subset of data from a much larger dataset by collecting the -smaller slices from each file--you don't have to load the whole dataset in -memory to slice from it. +we can select a subset of data from a much larger dataset by collecting the +smaller slices from each file--we don't have to load the whole dataset in memory +in order to slice from it. -Third, because of partitioning, you can ignore some files entirely. +Third, because of partitioning, we can ignore some files entirely. In this example, by filtering `year == 2015`, all files corresponding to other years -are immediately excluded: you don't have to load them in order to find that no +are immediately excluded: we don't have to load them in order to find that no rows match the filter. Relatedly, since Parquet files contain row groups with -statistics on the data within, there may be entire chunks of data you can +statistics on the data within, there may be entire chunks of data we can avoid scanning because they have no rows where `total_amount > 100`. ## More dataset options There are a few ways you can control the Dataset creation to adapt to special use cases. - -### Work with files in a directory - -If you are working with a single file or a set of files that are not all in the -same directory, you can provide a file path or a vector of multiple file paths -to `open_dataset()`. This is useful if, for example, you have a single CSV file -that is too big to read into memory. You could pass the file path to -`open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks, -then use `write_dataset()` to write each chunk to a separate Parquet file---all -without needing to read the full CSV file into R. - -### Explicitly declare column names and data types - -You can specify the `schema` argument to `open_dataset()` to declare the columns -and their data types. This is useful if you have data files that have different -storage schema (for example, a column could be `int32` in one and `int8` in -another) and you want to ensure that the resulting Dataset has a specific type. - -To be clear, it's not necessary to specify a schema, even in this example of -mixed integer types, because the Dataset constructor will reconcile differences -like these. The schema specification just lets you declare what you want the -result to be. - -### Explicitly declare partition format +For one, if you are working with a single file or a set of files that are not +all in the same directory, you can provide a file path or a vector of multiple +file paths to `open_dataset()`. This is useful if, for example, you have a +single CSV file that is too big to read into memory. You could pass the file +path to `open_dataset()`, use `group_by()` to partition the Dataset into +manageable chunks, then use `write_dataset()` to write each chunk to a separate +Parquet file---all without needing to read the full CSV file into R. 
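
A minimal sketch of that CSV-to-Parquet workflow follows; the file name, the destination directory, and the `year` grouping column are all hypothetical.

```r
# Treat one oversized CSV as a Dataset, then rewrite it as Parquet files
# partitioned by the grouping column, without reading it all into memory
open_dataset("huge_file.csv", format = "csv") %>%
  group_by(year) %>%
  write_dataset("huge_file_parquet", format = "parquet")
```
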
+ +You can specify a `schema` argument to `open_dataset()` to declare the columns +and their data types. This is useful if you have data files that have different +storage schema (for example, a column could be `int32` in one and `int8` in another) +and you want to ensure that the resulting Dataset has a specific type. +To be clear, it's not necessary to specify a schema, even in this example of +mixed integer types, because the Dataset constructor will reconcile differences like these. +The schema specification just lets you declare what you want the result to be. Similarly, you can provide a Schema in the `partitioning` argument of `open_dataset()` in order to declare the types of the virtual columns that define the partitions. -This would be useful, in the taxi dataset example, if you wanted to keep -`month` as a string instead of an integer. - -### Work with multiple data sources +This would be useful, in our taxi dataset example, if you wanted to keep +`month` as a string instead of an integer for some reason. Another feature of Datasets is that they can be composed of multiple data sources. That is, you may have a directory of partitioned Parquet files in one location, @@ -330,29 +313,27 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d As you can see, querying a large dataset can be made quite fast by storage in an efficient binary columnar format like Parquet or Feather and partitioning based on -columns commonly used for filtering. However, data isn't always stored that way. -Sometimes you might start with one giant CSV. The first step in analyzing data +columns commonly used for filtering. However, we don't always get our data delivered +to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data is cleaning is up and reshaping it into a more usable form. -The `write_dataset()` function allows you to take a Dataset or another tabular -data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write -it to a different file format, partitioned into multiple files. +The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files. -Assume that you have a version of the NYC Taxi data as CSV: +Assume we have a version of the NYC Taxi data as CSV: ```r ds <- open_dataset("nyc-taxi/csv/", format = "csv") ``` -You can write it to a new location and translate the files to the Feather format +We can write it to a new location and translate the files to the Feather format by calling `write_dataset()` on it: ```r write_dataset(ds, "nyc-taxi/feather", format = "feather") ``` -Next, let's imagine that the `payment_type` column is something you often filter -on, so you want to partition the data by that variable. By doing so you ensure +Next, let's imagine that the `payment_type` column is something we often filter +on, so we want to partition the data by that variable. By doing so we ensure that a filter like `payment_type == "Cash"` will touch only a subset of files where `payment_type` is always `"Cash"`. @@ -386,14 +367,14 @@ system("tree nyc-taxi/feather") Note that the directory names are `payment_type=Cash` and similar: this is the Hive-style partitioning described above. 
This means that when -you call `open_dataset()` on this directory, you don't have to declare what the +we call `open_dataset()` on this directory, we don't have to declare what the partitions are because they can be read from the file paths. (To instead write bare values for partition segments, i.e. `Cash` rather than `payment_type=Cash`, call `write_dataset()` with `hive_style = FALSE`.) -Perhaps, though, `payment_type == "Cash"` is the only data you ever care about, -and you just want to drop the rest and have a smaller working set. -For this, you can `filter()` them out when writing: +Perhaps, though, `payment_type == "Cash"` is the only data we ever care about, +and we just want to drop the rest and have a smaller working set. +For this, we can `filter()` them out when writing: ```r ds %>% @@ -401,9 +382,9 @@ ds %>% write_dataset("nyc-taxi/feather", format = "feather") ``` -The other thing you can do when writing datasets is select a subset of columns -or reorder them. Suppose you never care about `vendor_id`, and being a string column, -it can take up a lot of space when you read it in, so let's drop it: +The other thing we can do when writing datasets is select a subset of and/or reorder +columns. Suppose we never care about `vendor_id`, and being a string column, +it can take up a lot of space when we read it in, so let's drop it: ```r ds %>% From e5586a2cd30be310ddcefaa49a5ec04b83a5b529 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Mon, 26 Jul 2021 14:41:42 +0100 Subject: [PATCH 22/24] Restructure the R-only developer environment setup section --- r/vignettes/developing.Rmd | 24 ++++++++++++++++++------ 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/r/vignettes/developing.Rmd b/r/vignettes/developing.Rmd index 8adf7bbf62b..14509bf93c0 100644 --- a/r/vignettes/developing.Rmd +++ b/r/vignettes/developing.Rmd @@ -59,7 +59,7 @@ don't need to alter the Arrow C++ library may be able to obtain a recent version of the library without building from source. ## macOS -On macOS, you may install the C++ library using [Homebrew](https://brew.sh/): +On macOS, you can install the C++ library using [Homebrew](https://brew.sh/): ``` shell # For the released version: @@ -68,12 +68,10 @@ brew install apache-arrow brew install apache-arrow --HEAD ``` -## Windows and Linux +## Windows or Linux On Windows and Linux, you can download a .zip file with the arrow dependencies from the nightly repository. -Windows users then can set the `RWINLIB_LOCAL` environment variable to point to that -zip file before installing the arrow R package. On Linux, you'll need to create a `libarrow` directory inside the R package directory and unzip that file into it. Version numbers in that repository correspond to dates. To see what nightlies are available, you can use Arrow's (or any other S3 client's) S3 listing functionality to see what is in the bucket `s3://arrow-r-nightly/libarrow/bin`: @@ -81,6 +79,15 @@ To see what nightlies are available, you can use Arrow's (or any other S3 client nightly <- s3_bucket("arrow-r-nightly") nightly$ls("libarrow/bin") ``` +Version numbers in that repository correspond to dates. + +### Windows + +Windows users then can set the `RWINLIB_LOCAL` environment variable to point to the zip file containing the arrow dependencies before installing the arrow R package. + +### Linux + +On Linux, you'll need to create a `libarrow` directory inside the R package directory and unzip the zip file containing the arrow dependencies into it. 
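
As a rough sketch of both options, run from the top of an arrow git checkout; the zip file name is a placeholder for whichever build you downloaded from the listing above.

```r
# Windows: point RWINLIB_LOCAL at the downloaded zip before installing arrow
Sys.setenv(RWINLIB_LOCAL = normalizePath("arrow-nightly.zip"))

# Linux: unzip the dependencies into a libarrow/ directory inside the R package
dir.create("r/libarrow", showWarnings = FALSE)
unzip("arrow-nightly.zip", exdir = "r/libarrow")
```
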
# R and C++ developer environment setup @@ -114,9 +121,14 @@ sudo apt install -y cmake libcurl4-openssl-dev libssl-dev Currently, the R package cannot be made to work with a locally-built Arrow C++ library. This will be resolved in a future release. -## Step 2 - Configure the Arrow build {.tabset} +## Step 2 - Configure the Arrow build + +There are two different ways that you can choose to build and then install the Arrow library: + +1. into a user-defined directory +2. into a system-level directory -You can choose to build and then install the Arrow library into a user-defined directory or into a system-level directory. You only need to do one of these two options. +You only need to do one of these two options. It is recommended that you install the arrow library to a user-level directory to be used in development. This is so that the development version you are using doesn't overwrite a released version of Arrow you may have installed. You are also able to have more than one version of the Arrow library to link to with this approach (by using different `ARROW_HOME` directories for the different versions). This approach also matches the recommendations for other Arrow bindings like [Python](http://arrow.apache.org/docs/developers/python.html). From e75c0113805de6ed3998b9d12bde8d61c8d337be Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Mon, 26 Jul 2021 14:54:57 +0100 Subject: [PATCH 23/24] Rephrase for clarity --- r/vignettes/developing.Rmd | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/r/vignettes/developing.Rmd b/r/vignettes/developing.Rmd index 14509bf93c0..5c9eb4c9226 100644 --- a/r/vignettes/developing.Rmd +++ b/r/vignettes/developing.Rmd @@ -68,7 +68,7 @@ brew install apache-arrow brew install apache-arrow --HEAD ``` -## Windows or Linux +## Windows and Linux On Windows and Linux, you can download a .zip file with the arrow dependencies from the nightly repository. @@ -91,9 +91,9 @@ On Linux, you'll need to create a `libarrow` directory inside the R package dire # R and C++ developer environment setup -If you need to alter both the Arrow C++ library and the R package code, or if you can't get a binary version of the latest C++ library elsewhere, you'll need to build it from source too. This section discusses how to set up a C++ build configured to work with the R package. For more general resources, see the [Arrow C++ developer guide](https://arrow.apache.org/docs/developers/cpp/building.html). +If you need to alter both the Arrow C++ library and the R package code, or if you can't get a binary version of the latest C++ library elsewhere, you'll need to build it from source. This section discusses how to set up a C++ build configured to work with the R package. For more general resources, see the [Arrow C++ developer guide](https://arrow.apache.org/docs/developers/cpp/building.html). -There are five major steps to the process — the first three are relevant to all Arrow developers, and the last one is specific to the R bindings: +There are five major steps to the process — the first four are relevant to all Arrow developers, and the last one is specific to developers making changes to the R package: 1. Install dependencies 2. Configuring the Arrow library build (using `cmake`) — this specifies how you want the build to go, what features to include, etc. 
@@ -103,9 +103,9 @@ There are five major steps to the process — the first three are relevant to al ## Step 1 - Install dependencies -The Arrow C++ library will by default use system dependencies if suitable versions are found; if they are not present, it will build them during its own build process. The only dependencies that one needs to install outside of the build process are `cmake` (for configuring the build) and `openssl` if you are building with S3 support. +The Arrow C++ library will by default use system dependencies if suitable versions are found. If system dependencies are not present, the Arrow C++ library will build them during its own build process. The only dependencies that you need to install _outside_ of the build process are `cmake` (for configuring the build) and `openssl` if you are building with S3 support. -For a faster build, you may choose to install on the system more C++ library dependencies (such as `lz4`, `zstd`, etc.) so that they don't need to be built from source in the Arrow build. This is optional. +For a faster build, you may choose to pre-install more C++ library dependencies (such as `lz4`, `zstd`, etc.) on the system so that they don't need to be built from source in the Arrow build. ### macOS ```{bash, save=run & macos} @@ -134,7 +134,7 @@ It is recommended that you install the arrow library to a user-level directory t ### Configure for installing to a user directory -In this example we will install it to a directory called `dist` that has the same parent as our `arrow` checkout, but it could be named or located anywhere you would like. However, note that your installation of the Arrow R package will point to this directory and need it to remain intact for the package to continue to work. This is one reason we recommend *not* placing it inside of the arrow git checkout. +In this example we will install the Arrow C++ library to a directory called `dist` that has the same parent directory as our `arrow` checkout but your installation of the Arrow R package can point to any directory with any name. However, we recommend *not* placing it inside of the arrow git checkout directory as unwanted changes could stop it working properly. ```{bash, save=run & !sys_install} export ARROW_HOME=$(pwd)/dist From cea84ff5ba039e25124afe06526afcea708d36c6 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Mon, 26 Jul 2021 15:04:00 +0100 Subject: [PATCH 24/24] Remove words to simplify --- r/vignettes/developing.Rmd | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/r/vignettes/developing.Rmd b/r/vignettes/developing.Rmd index 5c9eb4c9226..c8061c1647d 100644 --- a/r/vignettes/developing.Rmd +++ b/r/vignettes/developing.Rmd @@ -123,16 +123,18 @@ Currently, the R package cannot be made to work with a locally-built Arrow C++ l ## Step 2 - Configure the Arrow build +### Build location + There are two different ways that you can choose to build and then install the Arrow library: 1. into a user-defined directory 2. into a system-level directory -You only need to do one of these two options. +You only need to do one of these options. It is recommended that you install the arrow library to a user-level directory to be used in development. This is so that the development version you are using doesn't overwrite a released version of Arrow you may have installed. You are also able to have more than one version of the Arrow library to link to with this approach (by using different `ARROW_HOME` directories for the different versions). 
This approach also matches the recommendations for other Arrow bindings like [Python](http://arrow.apache.org/docs/developers/python.html). -### Configure for installing to a user directory +#### Configure for installing to a user directory In this example we will install the Arrow C++ library to a directory called `dist` that has the same parent directory as our `arrow` checkout but your installation of the Arrow R package can point to any directory with any name. However, we recommend *not* placing it inside of the arrow git checkout directory as unwanted changes could stop it working properly. @@ -141,14 +143,14 @@ export ARROW_HOME=$(pwd)/dist mkdir $ARROW_HOME ``` -_Special instructions on Linux:_ You will need to set `LD_LIBRARY_PATH` to the `lib` directory that is under where we set `$ARROW_HOME`, before launching R and using Arrow. One way to do this is to add it to your profile (we use `~/.bash_profile` here, but you might need to put this in a different file depending on your setup, e.g. if you use a shell other than `bash`). On macOS we do not need to do this because the macOS shared library paths are hardcoded to their locations during build time. +_Special instructions on Linux:_ You will need to set `LD_LIBRARY_PATH` to the `lib` directory that is under where you set `$ARROW_HOME`, before launching R and using Arrow. One way to do this is to add it to your profile (we use `~/.bash_profile` here, but you might need to put this in a different file depending on your setup, e.g. if you use a shell other than `bash`). On macOS you do not need to do this because the macOS shared library paths are hardcoded to their locations during build time. ```{bash, save=run & ubuntu & !sys_install} export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH echo "export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH" >> ~/.bash_profile ``` -Now we can move into the arrow repository to start the build process. You will need to create a directory into which the C++ build will put its contents. It is recommended to make a `build` directory inside of the `cpp` directory of the Arrow git repository (it is git-ignored, so you won't accidentally check it in). And then, change directories to be inside `cpp/build`: +Now you can move into the arrow repository to start the build process. You will need to create a directory into which the C++ build will put its contents. It is recommended to make a `build` directory inside of the `cpp` directory of the Arrow git repository (it is git-ignored, so you won't accidentally check it in). And then, change directories to be inside `cpp/build`: ```{bash, save=run & !sys_install} pushd arrow @@ -178,11 +180,11 @@ cmake \ `..` refers to the C++ source directory: we're in `cpp/build`, and the source is in `cpp`. -### Configure to install to a system directory +#### Configure to install to a system directory If you would like to install Arrow as a system library you can do that as well. This is in some respects simpler, but if you already have Arrow libraries installed there, it would disrupt them and possibly require `sudo` permissions. -Now we can move into the arrow repository to start the build process. You will need to create a directory into which the C++ build will put its contents. It is recommended to make a `build` directory inside of the `cpp` directory of the Arrow git repository (it is git-ignored, so you won't accidentally check it in). 
And then, change directories to be inside `cpp/build`: +Now you can move into the arrow repository to start the build process. You will need to create a directory into which the C++ build will put its contents. We recommend that you make a `build` directory inside of the `cpp` directory of the Arrow git repository (it is git-ignored, so you won't accidentally check it in). And then, change directories to be inside `cpp/build`: ```{bash, save=run & sys_install} pushd arrow @@ -227,9 +229,10 @@ To enable optional features including: S3 support, an alternative memory allocat Other flags that may be useful: * `-DBoost_SOURCE=BUNDLED` and `-DThrift_SOURCE=bundled`, for example, or any other dependency `*_SOURCE`, if you have a system version of a C++ dependency that doesn't work correctly with Arrow. This tells the build to compile its own version of the dependency from source. + * `-DCMAKE_BUILD_TYPE=debug` or `-DCMAKE_BUILD_TYPE=relwithdebinfo` can be useful for debugging. You probably don't want to do this generally because a debug build is much slower at runtime than the default `release` build. -_Note_ `cmake` is particularly sensitive to whitespacing, if you see errors, check that you don't have any errant whitespace around +_Note_ `cmake` is particularly sensitive to whitespacing, if you see errors, check that you don't have any errant whitespace ## Step 3 - Building Arrow
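
Once the C++ library and the R package have both been rebuilt, a quick sanity check from R is to confirm which optional features actually made it into the build; `arrow_info()` and `arrow_with_s3()` are exported by the arrow R package.

```r
# Reports the linked Arrow C++ version and which optional capabilities
# (S3 support, compression codecs, and so on) were compiled in
arrow::arrow_info()
arrow::arrow_with_s3()
```
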