From 9929e55a22e7cd2f6aada6acce89cfe6af5c70cc Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 12:29:24 +0100 Subject: [PATCH 01/33] Remove backticks to make easier to read --- r/vignettes/dataset.Rmd | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index b5e17578b29..e05939bd3d9 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -8,13 +8,14 @@ vignette: > --- Apache Arrow lets you work efficiently with large, multi-file datasets. -The `arrow` R package provides a `dplyr` interface to Arrow Datasets, +The arrow R package provides a `dplyr` interface to Arrow Datasets, as well as other tools for interactive exploration of Arrow data. This vignette introduces Datasets and shows how to use `dplyr` to analyze them. It describes both what is possible to do with Arrow now and what is on the immediate development roadmap. + ## Example: NYC taxi data The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) From fbe74668114901ae6e6bd43dfe5fac4a4fc2dc68 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 12:51:50 +0100 Subject: [PATCH 02/33] Grammarly suggestions and adding some subheadings --- r/vignettes/dataset.Rmd | 59 +++++++++++++++++++---------------------- 1 file changed, 27 insertions(+), 32 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index e05939bd3d9..11ab986e2d7 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -8,30 +8,29 @@ vignette: > --- Apache Arrow lets you work efficiently with large, multi-file datasets. -The arrow R package provides a `dplyr` interface to Arrow Datasets, -as well as other tools for interactive exploration of Arrow data. +The arrow R package provides a dplyr interface to Arrow Datasets, +and other tools for interactive exploration of Arrow data. This vignette introduces Datasets and shows how to use `dplyr` to analyze them. It describes both what is possible to do with Arrow now and what is on the immediate development roadmap. - ## Example: NYC taxi data The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) is widely used in big data exercises and competitions. For demonstration purposes, we have hosted a Parquet-formatted version -of about 10 years of the trip data in a public Amazon S3 bucket. +of about ten years of the trip data in a public Amazon S3 bucket. The total file size is around 37 gigabytes, even in the efficient Parquet file format. That's bigger than memory on most people's computers, so we can't just read it all in and stack it into a single data frame. In Windows and macOS binary packages, S3 support is included. -On Linux when installing from source, S3 support is not enabled by default, +On Linux, when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details. -To see if your `arrow` installation has S3 support, run +To see if your `arrow` installation has S3 support, run: ```{r} arrow::arrow_with_s3() @@ -120,9 +119,9 @@ have file paths like ``` By providing a character vector to `partitioning`, we're saying that the first -path segment gives the value for `year` and the second segment is `month`. +path segment gives the value for `year`, and the second segment is `month`. 
Every row in `2009/01/data.parquet` has a value of 2009 for `year` -and 1 for `month`, even though those columns may not actually be present in the file. +and 1 for `month`, even though those columns may not be present in the file. Indeed, when we look at the dataset, we see that in addition to the columns present in every file, there are also columns `year` and `month`. @@ -185,9 +184,8 @@ In the current release, `arrow` supports the dplyr verbs `mutate()`, or other verbs with aggregate functions, use `collect()` to pull the selected subset of the data into an in-memory R data frame. -If you attempt to call unsupported `dplyr` verbs or unimplemented functions in -your query on an Arrow Dataset, the `arrow` package raises an error. However, -for `dplyr` queries on `Table` objects (which are typically smaller in size) the +Suppose you attempt to call unsupported `dplyr` verbs or unimplemented functions in your query on an Arrow Dataset. In that case, the `arrow` package raises an error. However, +for `dplyr` queries on `Table` objects (typically smaller in size than Datasets), the package automatically calls `collect()` before processing that `dplyr` verb. Here's an example. Suppose I was curious about tipping behavior among the @@ -230,11 +228,10 @@ cat(" ``` We just selected a subset out of a dataset with around 2 billion rows, computed -a new column, and aggregated on it in under 2 seconds on my laptop. How does +a new column, and aggregated it in under 2 seconds on my laptop. How does this work? -First, -`mutate()`/`transmute()`, `select()`/`rename()`/`relocate()`, `filter()`, +First, `mutate()`/`transmute()`, `select()`/`rename()`/`relocate()`, `filter()`, `group_by()`, and `arrange()` record their actions but don't evaluate on the data until you run `collect()`. @@ -260,7 +257,7 @@ See $.data for the source Arrow object ") ``` -This returns instantly and shows the manipulations you've made, without +This code returns an output instantly and shows the manipulations you've made, without loading data from the files. Because the evaluation of these queries is deferred, you can build up a query that selects down to a small subset without generating intermediate datasets that would potentially be large. @@ -268,8 +265,7 @@ intermediate datasets that would potentially be large. Second, all work is pushed down to the individual data files, and depending on the file format, chunks of data within the files. As a result, we can select a subset of data from a much larger dataset by collecting the -smaller slices from each file--we don't have to load the whole dataset in memory -in order to slice from it. +smaller slices from each file--we don't have to load the whole dataset in memory to slice from it. Third, because of partitioning, we can ignore some files entirely. In this example, by filtering `year == 2015`, all files corresponding to other years @@ -281,27 +277,26 @@ avoid scanning because they have no rows where `total_amount > 100`. ## More dataset options There are a few ways you can control the Dataset creation to adapt to special use cases. -For one, if you are working with a single file or a set of files that are not -all in the same directory, you can provide a file path or a vector of multiple -file paths to `open_dataset()`. This is useful if, for example, you have a -single CSV file that is too big to read into memory. 
You could pass the file -path to `open_dataset()`, use `group_by()` to partition the Dataset into -manageable chunks, then use `write_dataset()` to write each chunk to a separate -Parquet file---all without needing to read the full CSV file into R. - -You can specify a `schema` argument to `open_dataset()` to declare the columns -and their data types. This is useful if you have data files that have different -storage schema (for example, a column could be `int32` in one and `int8` in another) -and you want to ensure that the resulting Dataset has a specific type. -To be clear, it's not necessary to specify a schema, even in this example of -mixed integer types, because the Dataset constructor will reconcile differences like these. -The schema specification just lets you declare what you want the result to be. + +### Working with files in a directory + +If you are working with a single file or a set of files that are not all in the same directory, you can provide a file path or a vector of multiple file paths to `open_dataset()`. This is useful if, for example, you have a single CSV file that is too big to read into memory. You could pass the file path to `open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks, then use `write_dataset()` to write each chunk to a separate Parquet file---all without needing to read the full CSV file into R. + +### Explicitly declare column names and data types + +You can specify a `schema` argument to `open_dataset()` to declare the columns and their data types. This is useful if you have data files that have different storage schema (for example, a column could be `int32` in one and `int8` in another) and you want to ensure that the resulting Dataset has a specific type. + +To be clear, it's not necessary to specify a schema, even in this example of mixed integer types, because the Dataset constructor will reconcile differences like these. The schema specification just lets you declare what you want the result to be. + +### Explicitly declare partition format Similarly, you can provide a Schema in the `partitioning` argument of `open_dataset()` in order to declare the types of the virtual columns that define the partitions. This would be useful, in our taxi dataset example, if you wanted to keep `month` as a string instead of an integer for some reason. +### Work with multiple data sources + Another feature of Datasets is that they can be composed of multiple data sources. That is, you may have a directory of partitioned Parquet files in one location, and in another directory, files that haven't been partitioned. From 9726952c95bace81df8801d8f3f928cbe809223c Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 13:33:00 +0100 Subject: [PATCH 03/33] "we" -> "you" --- r/vignettes/dataset.Rmd | 74 ++++++++++++++++++++--------------------- 1 file changed, 36 insertions(+), 38 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 11ab986e2d7..d04195a22c9 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -23,7 +23,7 @@ For demonstration purposes, we have hosted a Parquet-formatted version of about ten years of the trip data in a public Amazon S3 bucket. The total file size is around 37 gigabytes, even in the efficient Parquet file -format. That's bigger than memory on most people's computers, so we can't just +format. That's bigger than memory on most people's computers, so you can't just read it all in and stack it into a single data frame. 
In Windows and macOS binary packages, S3 support is included. @@ -77,7 +77,7 @@ feel free to grab only a year or two of data. If you don't have the taxi data downloaded, the vignette will still run and will yield previously cached output for reference. To be explicit about which version -is running, let's check whether we're running with live data: +is running, let's check whether you're running with live data: ```{r} dir.exists("nyc-taxi") @@ -87,29 +87,29 @@ dir.exists("nyc-taxi") Because `dplyr` is not necessary for many Arrow workflows, it is an optional (`Suggests`) dependency. So, to work with Datasets, -we need to load both `arrow` and `dplyr`. +you need to load both `arrow` and `dplyr`. ```{r} library(arrow, warn.conflicts = FALSE) library(dplyr, warn.conflicts = FALSE) ``` -The first step is to create our Dataset object, pointing at the directory of data. +The first step is to create a Dataset object, pointing at the directory of data. ```{r, eval = file.exists("nyc-taxi")} ds <- open_dataset("nyc-taxi", partitioning = c("year", "month")) ``` -The default file format for `open_dataset()` is Parquet; if we had a directory -of Arrow format files, we could include `format = "arrow"` in the call. +The default file format for `open_dataset()` is Parquet; if you had a directory +of Arrow format files, you could include `format = "arrow"` in the call. Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"` for generic text-delimited files. For text files, you can pass any parsing options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise pass to `read_csv_arrow()`. -The `partitioning` argument lets us specify how the file paths provide information -about how the dataset is chunked into different files. Our files in this example +The `partitioning` argument lets you specify how the file paths provide information +about how the dataset is chunked into different files. The files in this example have file paths like ``` @@ -118,12 +118,12 @@ have file paths like ... ``` -By providing a character vector to `partitioning`, we're saying that the first +By providing a character vector to `partitioning`, you're saying that the first path segment gives the value for `year`, and the second segment is `month`. Every row in `2009/01/data.parquet` has a value of 2009 for `year` and 1 for `month`, even though those columns may not be present in the file. -Indeed, when we look at the dataset, we see that in addition to the columns present +Indeed, when you look at the dataset, you can see that in addition to the columns present in every file, there are also columns `year` and `month`. ```{r, eval = file.exists("nyc-taxi")} @@ -159,7 +159,7 @@ See $metadata for additional Schema metadata The other form of partitioning currently supported is [Hive](https://hive.apache.org/)-style, in which the partition variable names are included in the path segments. -If we had saved our files in paths like +If you had saved your files in paths like ``` year=2009/month=01/data.parquet @@ -167,15 +167,15 @@ year=2009/month=02/data.parquet ... ``` -we would not have had to provide the names in `partitioning`: -we could have just called `ds <- open_dataset("nyc-taxi")` and the partitions +you would not have had to provide the names in `partitioning`: +you could have just called `ds <- open_dataset("nyc-taxi")` and the partitions would have been detected automatically. 
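
For example, here is a minimal sketch of the two partitioning styles. The Hive-style calls assume the files had been written under `year=.../month=...` paths, unlike the copy of the taxi data used above; `hive_partition()` also lets you declare the partition column types yourself.

```r
# Directory-name partitioning: map path segments to column names yourself
ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))

# Hive-style partitioning: names are read from the paths, so nothing to declare
ds_hive <- open_dataset("nyc-taxi")

# Hive-style, declaring the partition column types explicitly
ds_hive <- open_dataset("nyc-taxi",
                        partitioning = hive_partition(year = int32(), month = int32()))
```

Either way, the resulting Dataset behaves the same in the queries that follow.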
## Querying the dataset -Up to this point, we haven't loaded any data: we have walked directories to find -files, we've parsed file paths to identify partitions, and we've read the -headers of the Parquet files to inspect their schemas so that we can make sure +Up to this point, you haven't loaded any data: you have walked directories to find +files, you've parsed file paths to identify partitions, and you've read the +headers of the Parquet files to inspect their schemas so that you can make sure they all line up. In the current release, `arrow` supports the dplyr verbs `mutate()`, @@ -227,7 +227,7 @@ cat(" ") ``` -We just selected a subset out of a dataset with around 2 billion rows, computed +You just selected a subset out of a dataset with around 2 billion rows, computed a new column, and aggregated it in under 2 seconds on my laptop. How does this work? @@ -264,14 +264,14 @@ intermediate datasets that would potentially be large. Second, all work is pushed down to the individual data files, and depending on the file format, chunks of data within the files. As a result, -we can select a subset of data from a much larger dataset by collecting the -smaller slices from each file--we don't have to load the whole dataset in memory to slice from it. +you can select a subset of data from a much larger dataset by collecting the +smaller slices from each file--you don't have to load the whole dataset in memory to slice from it. -Third, because of partitioning, we can ignore some files entirely. +Third, because of partitioning, you can ignore some files entirely. In this example, by filtering `year == 2015`, all files corresponding to other years -are immediately excluded: we don't have to load them in order to find that no +are immediately excluded: you don't have to load them in order to find that no rows match the filter. Relatedly, since Parquet files contain row groups with -statistics on the data within, there may be entire chunks of data we can +statistics on the data within, there may be entire chunks of data you can avoid scanning because they have no rows where `total_amount > 100`. ## More dataset options @@ -292,8 +292,8 @@ To be clear, it's not necessary to specify a schema, even in this example of mix Similarly, you can provide a Schema in the `partitioning` argument of `open_dataset()` in order to declare the types of the virtual columns that define the partitions. -This would be useful, in our taxi dataset example, if you wanted to keep -`month` as a string instead of an integer for some reason. +This would be useful, in the taxi dataset example, if you wanted to keep +`month` as a string instead of an integer. ### Work with multiple data sources @@ -309,27 +309,25 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d As you can see, querying a large dataset can be made quite fast by storage in an efficient binary columnar format like Parquet or Feather and partitioning based on -columns commonly used for filtering. However, we don't always get our data delivered -to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data -is cleaning is up and reshaping it into a more usable form. +columns commonly used for filtering. However, data isn't always stored that way. Sometimes you might start with one giant CSV. The first step in analyzing data is cleaning is up and reshaping it into a more usable form. 
The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files. -Assume we have a version of the NYC Taxi data as CSV: +Assume you have a version of the NYC Taxi data as CSV: ```r ds <- open_dataset("nyc-taxi/csv/", format = "csv") ``` -We can write it to a new location and translate the files to the Feather format +You can write it to a new location and translate the files to the Feather format by calling `write_dataset()` on it: ```r write_dataset(ds, "nyc-taxi/feather", format = "feather") ``` -Next, let's imagine that the `payment_type` column is something we often filter -on, so we want to partition the data by that variable. By doing so we ensure +Next, let's imagine that the `payment_type` column is something you often filter +on, so you want to partition the data by that variable. By doing so you ensure that a filter like `payment_type == "Cash"` will touch only a subset of files where `payment_type` is always `"Cash"`. @@ -363,14 +361,14 @@ system("tree nyc-taxi/feather") Note that the directory names are `payment_type=Cash` and similar: this is the Hive-style partitioning described above. This means that when -we call `open_dataset()` on this directory, we don't have to declare what the +you call `open_dataset()` on this directory, you don't have to declare what the partitions are because they can be read from the file paths. (To instead write bare values for partition segments, i.e. `Cash` rather than `payment_type=Cash`, call `write_dataset()` with `hive_style = FALSE`.) -Perhaps, though, `payment_type == "Cash"` is the only data we ever care about, -and we just want to drop the rest and have a smaller working set. -For this, we can `filter()` them out when writing: +Perhaps, though, `payment_type == "Cash"` is the only data you ever care about, +and you just want to drop the rest and have a smaller working set. +For this, you can `filter()` them out when writing: ```r ds %>% @@ -378,9 +376,9 @@ ds %>% write_dataset("nyc-taxi/feather", format = "feather") ``` -The other thing we can do when writing datasets is select a subset of and/or reorder -columns. Suppose we never care about `vendor_id`, and being a string column, -it can take up a lot of space when we read it in, so let's drop it: +The other thing you can do when writing datasets is select a subset of and/or reorder +columns. Suppose you never care about `vendor_id`, and being a string column, +it can take up a lot of space when you read it in, so let's drop it: ```r ds %>% From e13543dafca351969aa513cb1289f34dbdd0a1ee Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 13:42:30 +0100 Subject: [PATCH 04/33] Use bold instead of backticks to make package names more readable --- r/vignettes/dataset.Rmd | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index d04195a22c9..32c041f1730 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -8,10 +8,10 @@ vignette: > --- Apache Arrow lets you work efficiently with large, multi-file datasets. -The arrow R package provides a dplyr interface to Arrow Datasets, +The __arrow__ R package provides a __dplyr__ interface to Arrow Datasets, and other tools for interactive exploration of Arrow data. -This vignette introduces Datasets and shows how to use `dplyr` to analyze them. 
+This vignette introduces Datasets and shows how to use __dplyr__ to analyze them. It describes both what is possible to do with Arrow now and what is on the immediate development roadmap. @@ -30,7 +30,7 @@ In Windows and macOS binary packages, S3 support is included. On Linux, when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details. -To see if your `arrow` installation has S3 support, run: +To see if your __arrow__ installation has S3 support, run: ```{r} arrow::arrow_with_s3() @@ -41,13 +41,13 @@ machine is located in the same AWS region as the data. So, for this vignette, we assume that the NYC taxi dataset has been downloaded locally in a "nyc-taxi" directory. -If your `arrow` build has S3 support, you can sync the data locally with: +If your __arrow__ build has S3 support, you can sync the data locally with: ```{r, eval = FALSE} arrow::copy_files("s3://ursa-labs-taxi-data", "nyc-taxi") ``` -If your `arrow` build doesn't have S3 support, you can download the files +If your __arrow__ build doesn't have S3 support, you can download the files with some additional code: ```{r, eval = FALSE} @@ -85,9 +85,9 @@ dir.exists("nyc-taxi") ## Getting started -Because `dplyr` is not necessary for many Arrow workflows, +Because __dplyr__ is not necessary for many Arrow workflows, it is an optional (`Suggests`) dependency. So, to work with Datasets, -you need to load both `arrow` and `dplyr`. +you need to load both __arrow__ and __dplyr__. ```{r} library(arrow, warn.conflicts = FALSE) @@ -178,15 +178,15 @@ files, you've parsed file paths to identify partitions, and you've read the headers of the Parquet files to inspect their schemas so that you can make sure they all line up. -In the current release, `arrow` supports the dplyr verbs `mutate()`, +In the current release, __arrow__ supports the dplyr verbs `mutate()`, `transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and `arrange()`. Aggregation is not yet supported, so before you call `summarise()` or other verbs with aggregate functions, use `collect()` to pull the selected subset of the data into an in-memory R data frame. -Suppose you attempt to call unsupported `dplyr` verbs or unimplemented functions in your query on an Arrow Dataset. In that case, the `arrow` package raises an error. However, -for `dplyr` queries on `Table` objects (typically smaller in size than Datasets), the -package automatically calls `collect()` before processing that `dplyr` verb. +Suppose you attempt to call unsupported __dplyr__ verbs or unimplemented functions in your query on an Arrow Dataset. In that case, the __arrow__ package raises an error. However, +for __dplyr__ queries on `Table` objects (typically smaller in size than Datasets), the +package automatically calls `collect()` before processing that __dplyr__ verb. Here's an example. Suppose I was curious about tipping behavior among the longest taxi rides. 
Let's find the median tip percentage for rides with From a9414440bd07b4bb8b0614d41d7202c0b39bc4ff Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 14:38:53 +0100 Subject: [PATCH 05/33] Split paragraph into bullets --- r/vignettes/dataset.Rmd | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 32c041f1730..34df15c591a 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -102,11 +102,12 @@ ds <- open_dataset("nyc-taxi", partitioning = c("year", "month")) The default file format for `open_dataset()` is Parquet; if you had a directory of Arrow format files, you could include `format = "arrow"` in the call. -Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather -v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"` -for generic text-delimited files. For text files, you can pass any parsing -options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise -pass to `read_csv_arrow()`. +Other supported formats include: + +* `"feather"` (an alias for `"arrow"`, as Feather v2 is the Arrow file format) +* `"csv"`, `"tsv"` (for tab-delimited), and `"text"` for generic text-delimited files. + +For text files, you can pass any parsing options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise pass to `read_csv_arrow()`. The `partitioning` argument lets you specify how the file paths provide information about how the dataset is chunked into different files. The files in this example @@ -167,13 +168,13 @@ year=2009/month=02/data.parquet ... ``` -you would not have had to provide the names in `partitioning`: +you would not have had to provide the names in `partitioning`; you could have just called `ds <- open_dataset("nyc-taxi")` and the partitions would have been detected automatically. ## Querying the dataset -Up to this point, you haven't loaded any data: you have walked directories to find +Up to this point, you haven't loaded any data. You have walked directories to find files, you've parsed file paths to identify partitions, and you've read the headers of the Parquet files to inspect their schemas so that you can make sure they all line up. From 02ffc4c8cf021e6050e04965c9854d62f14694d4 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 14:57:36 +0100 Subject: [PATCH 06/33] Breaking sections down and tweaks --- r/vignettes/dataset.Rmd | 60 ++++++++++++++++++++++++++++------------- 1 file changed, 42 insertions(+), 18 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 34df15c591a..31d3a2dca69 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -41,6 +41,8 @@ machine is located in the same AWS region as the data. So, for this vignette, we assume that the NYC taxi dataset has been downloaded locally in a "nyc-taxi" directory. +### Retrieving data from a public Amazon S3 bucket + If your __arrow__ build has S3 support, you can sync the data locally with: ```{r, eval = FALSE} @@ -100,14 +102,18 @@ The first step is to create a Dataset object, pointing at the directory of data. ds <- open_dataset("nyc-taxi", partitioning = c("year", "month")) ``` -The default file format for `open_dataset()` is Parquet; if you had a directory -of Arrow format files, you could include `format = "arrow"` in the call. +The file format for `open_dataset()` is controlled by the `format` parameter, +which has a default value of `"parquet"`. 
If you had a directory +of Arrow format files, you could instead specify `format = "arrow"` in the call. + Other supported formats include: -* `"feather"` (an alias for `"arrow"`, as Feather v2 is the Arrow file format) -* `"csv"`, `"tsv"` (for tab-delimited), and `"text"` for generic text-delimited files. +* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format) +* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files) +* `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use) -For text files, you can pass any parsing options (`delim`, `quote`, etc.) to `open_dataset()` that you would otherwise pass to `read_csv_arrow()`. +For text files, you can pass any parsing options (`delim`, `quote`, etc.) to +`open_dataset()` that you would otherwise pass to `read_csv_arrow()`. The `partitioning` argument lets you specify how the file paths provide information about how the dataset is chunked into different files. The files in this example @@ -174,10 +180,10 @@ would have been detected automatically. ## Querying the dataset -Up to this point, you haven't loaded any data. You have walked directories to find +Up to this point, you haven't loaded any data. You've walked directories to find files, you've parsed file paths to identify partitions, and you've read the headers of the Parquet files to inspect their schemas so that you can make sure -they all line up. +they all are as expected. In the current release, __arrow__ supports the dplyr verbs `mutate()`, `transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and @@ -185,11 +191,12 @@ In the current release, __arrow__ supports the dplyr verbs `mutate()`, or other verbs with aggregate functions, use `collect()` to pull the selected subset of the data into an in-memory R data frame. -Suppose you attempt to call unsupported __dplyr__ verbs or unimplemented functions in your query on an Arrow Dataset. In that case, the __arrow__ package raises an error. However, +Suppose you attempt to call unsupported __dplyr__ verbs or unimplemented functions +in your query on an Arrow Dataset. In that case, the __arrow__ package raises an error. However, for __dplyr__ queries on `Table` objects (typically smaller in size than Datasets), the package automatically calls `collect()` before processing that __dplyr__ verb. -Here's an example. Suppose I was curious about tipping behavior among the +Here's an example. Suppose that you are curious about tipping behavior among the longest taxi rides. Let's find the median tip percentage for rides with fares greater than $100 in 2015, broken down by the number of passengers: @@ -228,8 +235,8 @@ cat(" ") ``` -You just selected a subset out of a dataset with around 2 billion rows, computed -a new column, and aggregated it in under 2 seconds on my laptop. How does +You've just selected a subset out of a dataset with around 2 billion rows, computed +a new column, and aggregated it in under 2 seconds on most modern laptops. How does this work? First, `mutate()`/`transmute()`, `select()`/`rename()`/`relocate()`, `filter()`, @@ -266,7 +273,8 @@ intermediate datasets that would potentially be large. Second, all work is pushed down to the individual data files, and depending on the file format, chunks of data within the files. As a result, you can select a subset of data from a much larger dataset by collecting the -smaller slices from each file--you don't have to load the whole dataset in memory to slice from it. 
+smaller slices from each file--you don't have to load the whole dataset in +memory to slice from it. Third, because of partitioning, you can ignore some files entirely. In this example, by filtering `year == 2015`, all files corresponding to other years @@ -281,13 +289,25 @@ There are a few ways you can control the Dataset creation to adapt to special us ### Working with files in a directory -If you are working with a single file or a set of files that are not all in the same directory, you can provide a file path or a vector of multiple file paths to `open_dataset()`. This is useful if, for example, you have a single CSV file that is too big to read into memory. You could pass the file path to `open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks, then use `write_dataset()` to write each chunk to a separate Parquet file---all without needing to read the full CSV file into R. +If you are working with a single file or a set of files that are not all in the +same directory, you can provide a file path or a vector of multiple file paths +to `open_dataset()`. This is useful if, for example, you have a single CSV file +that is too big to read into memory. You could pass the file path to +`open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks, +then use `write_dataset()` to write each chunk to a separate Parquet file---all +without needing to read the full CSV file into R. ### Explicitly declare column names and data types -You can specify a `schema` argument to `open_dataset()` to declare the columns and their data types. This is useful if you have data files that have different storage schema (for example, a column could be `int32` in one and `int8` in another) and you want to ensure that the resulting Dataset has a specific type. +You can specify a `schema` argument to `open_dataset()` to declare the columns +and their data types. This is useful if you have data files that have different +storage schema (for example, a column could be `int32` in one and `int8` in +another) and you want to ensure that the resulting Dataset has a specific type. -To be clear, it's not necessary to specify a schema, even in this example of mixed integer types, because the Dataset constructor will reconcile differences like these. The schema specification just lets you declare what you want the result to be. +To be clear, it's not necessary to specify a schema, even in this example of +mixed integer types, because the Dataset constructor will reconcile differences +like these. The schema specification just lets you declare what you want the +result to be. ### Explicitly declare partition format @@ -310,11 +330,15 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d As you can see, querying a large dataset can be made quite fast by storage in an efficient binary columnar format like Parquet or Feather and partitioning based on -columns commonly used for filtering. However, data isn't always stored that way. Sometimes you might start with one giant CSV. The first step in analyzing data is cleaning is up and reshaping it into a more usable form. +columns commonly used for filtering. However, data isn't always stored that way. +Sometimes you might start with one giant CSV. The first step in analyzing data +is cleaning is up and reshaping it into a more usable form. 
-The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files. +The `write_dataset()` function allows you to take a Dataset or another tabular +data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write +it to a different file format, partitioned into multiple files. -Assume you have a version of the NYC Taxi data as CSV: +Assume that you have a version of the NYC Taxi data as CSV: ```r ds <- open_dataset("nyc-taxi/csv/", format = "csv") From ae9b685aede7ed7acdd16cef16988b1edd756e35 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 14:58:56 +0100 Subject: [PATCH 07/33] Rename section heading --- r/vignettes/dataset.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 31d3a2dca69..5b4b0ed7524 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -85,7 +85,7 @@ is running, let's check whether you're running with live data: dir.exists("nyc-taxi") ``` -## Getting started +## Opening the dataset Because __dplyr__ is not necessary for many Arrow workflows, it is an optional (`Suggests`) dependency. So, to work with Datasets, From 2c4f9f424709d915a154bddce3c24dd514acece5 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Wed, 21 Jul 2021 16:10:04 +0100 Subject: [PATCH 08/33] Specify Windows version for S3, minor tweaks --- r/vignettes/dataset.Rmd | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 5b4b0ed7524..56826b5fbd1 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -26,7 +26,7 @@ The total file size is around 37 gigabytes, even in the efficient Parquet file format. That's bigger than memory on most people's computers, so you can't just read it all in and stack it into a single data frame. -In Windows and macOS binary packages, S3 support is included. +In Windows (for R > 3.6) and macOS binary packages, S3 support is included. On Linux, when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details. @@ -36,7 +36,7 @@ To see if your __arrow__ installation has S3 support, run: arrow::arrow_with_s3() ``` -Even with S3 support enabled network, speed will be a bottleneck unless your +Even with an S3 support enabled network, speed will be a bottleneck unless your machine is located in the same AWS region as the data. So, for this vignette, we assume that the NYC taxi dataset has been downloaded locally in a "nyc-taxi" directory. @@ -287,7 +287,7 @@ avoid scanning because they have no rows where `total_amount > 100`. There are a few ways you can control the Dataset creation to adapt to special use cases. -### Working with files in a directory +### Work with files in a directory If you are working with a single file or a set of files that are not all in the same directory, you can provide a file path or a vector of multiple file paths @@ -299,7 +299,7 @@ without needing to read the full CSV file into R. ### Explicitly declare column names and data types -You can specify a `schema` argument to `open_dataset()` to declare the columns +You can specify the `schema` argument to `open_dataset()` to declare the columns and their data types. 
This is useful if you have data files that have different storage schema (for example, a column could be `int32` in one and `int8` in another) and you want to ensure that the resulting Dataset has a specific type. @@ -401,8 +401,8 @@ ds %>% write_dataset("nyc-taxi/feather", format = "feather") ``` -The other thing you can do when writing datasets is select a subset of and/or reorder -columns. Suppose you never care about `vendor_id`, and being a string column, +The other thing you can do when writing datasets is select a subset of columns +or reorder them. Suppose you never care about `vendor_id`, and being a string column, it can take up a lot of space when you read it in, so let's drop it: ```r From 52ec5f6e0e113456b787be50863dc6183b87197f Mon Sep 17 00:00:00 2001 From: Nic Date: Wed, 21 Jul 2021 20:07:48 +0000 Subject: [PATCH 09/33] Update r/vignettes/dataset.Rmd Co-authored-by: Neal Richardson --- r/vignettes/dataset.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 56826b5fbd1..a75ceb5b556 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -36,7 +36,7 @@ To see if your __arrow__ installation has S3 support, run: arrow::arrow_with_s3() ``` -Even with an S3 support enabled network, speed will be a bottleneck unless your +Even with S3 support enabled, network speed will be a bottleneck unless your machine is located in the same AWS region as the data. So, for this vignette, we assume that the NYC taxi dataset has been downloaded locally in a "nyc-taxi" directory. From 254771fbf9f35d0813110ce1823edd8065ccaea1 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Mon, 26 Jul 2021 13:28:58 +0100 Subject: [PATCH 10/33] Remove highlighting of arrow/dplyr to make easier to read and more like tidyverse package vignettes --- r/vignettes/dataset.Rmd | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index a75ceb5b556..b2c7709aabf 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -8,10 +8,10 @@ vignette: > --- Apache Arrow lets you work efficiently with large, multi-file datasets. -The __arrow__ R package provides a __dplyr__ interface to Arrow Datasets, +The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets, and other tools for interactive exploration of Arrow data. -This vignette introduces Datasets and shows how to use __dplyr__ to analyze them. +This vignette introduces Datasets and shows how to use dplyr to analyze them. It describes both what is possible to do with Arrow now and what is on the immediate development roadmap. @@ -30,7 +30,7 @@ In Windows (for R > 3.6) and macOS binary packages, S3 support is included. On Linux, when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details. -To see if your __arrow__ installation has S3 support, run: +To see if your arrow installation has S3 support, run: ```{r} arrow::arrow_with_s3() @@ -43,13 +43,13 @@ directory. 
### Retrieving data from a public Amazon S3 bucket -If your __arrow__ build has S3 support, you can sync the data locally with: +If your arrow build has S3 support, you can sync the data locally with: ```{r, eval = FALSE} arrow::copy_files("s3://ursa-labs-taxi-data", "nyc-taxi") ``` -If your __arrow__ build doesn't have S3 support, you can download the files +If your arrow build doesn't have S3 support, you can download the files with some additional code: ```{r, eval = FALSE} @@ -87,9 +87,9 @@ dir.exists("nyc-taxi") ## Opening the dataset -Because __dplyr__ is not necessary for many Arrow workflows, +Because dplyr is not necessary for many Arrow workflows, it is an optional (`Suggests`) dependency. So, to work with Datasets, -you need to load both __arrow__ and __dplyr__. +you need to load both arrow and dplyr. ```{r} library(arrow, warn.conflicts = FALSE) @@ -166,7 +166,7 @@ See $metadata for additional Schema metadata The other form of partitioning currently supported is [Hive](https://hive.apache.org/)-style, in which the partition variable names are included in the path segments. -If you had saved your files in paths like +If you had saved your files in paths like: ``` year=2009/month=01/data.parquet @@ -185,16 +185,16 @@ files, you've parsed file paths to identify partitions, and you've read the headers of the Parquet files to inspect their schemas so that you can make sure they all are as expected. -In the current release, __arrow__ supports the dplyr verbs `mutate()`, +In the current release, arrow supports the dplyr verbs `mutate()`, `transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and `arrange()`. Aggregation is not yet supported, so before you call `summarise()` or other verbs with aggregate functions, use `collect()` to pull the selected subset of the data into an in-memory R data frame. -Suppose you attempt to call unsupported __dplyr__ verbs or unimplemented functions -in your query on an Arrow Dataset. In that case, the __arrow__ package raises an error. However, -for __dplyr__ queries on `Table` objects (typically smaller in size than Datasets), the -package automatically calls `collect()` before processing that __dplyr__ verb. +Suppose you attempt to call unsupported dplyr verbs or unimplemented functions +in your query on an Arrow Dataset. In that case, the arrow package raises an error. However, +for dplyr queries on `Table` objects (typically smaller in size than Datasets), the +package automatically calls `collect()` before processing that dplyr verb. Here's an example. Suppose that you are curious about tipping behavior among the longest taxi rides. Let's find the median tip percentage for rides with From ff090548fd66e192b10891d2cb58fd19616eb039 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Mon, 26 Jul 2021 13:33:55 +0100 Subject: [PATCH 11/33] Remove extra backticks on Arrow object names --- r/vignettes/dataset.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index b2c7709aabf..7e693a872d7 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -193,7 +193,7 @@ subset of the data into an in-memory R data frame. Suppose you attempt to call unsupported dplyr verbs or unimplemented functions in your query on an Arrow Dataset. In that case, the arrow package raises an error. 
However, -for dplyr queries on `Table` objects (typically smaller in size than Datasets), the +for dplyr queries on Arrow Table objects (typically smaller in size than Arrow Datasets), the package automatically calls `collect()` before processing that dplyr verb. Here's an example. Suppose that you are curious about tipping behavior among the @@ -335,7 +335,7 @@ Sometimes you might start with one giant CSV. The first step in analyzing data is cleaning is up and reshaping it into a more usable form. The `write_dataset()` function allows you to take a Dataset or another tabular -data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write +data object - an Arrow Table or RecordBatch, or an R data frame - and write it to a different file format, partitioned into multiple files. Assume that you have a version of the NYC Taxi data as CSV: From 4ee3b4295506354a0c9a1b2030ce23533938521d Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Mon, 26 Jul 2021 13:37:10 +0100 Subject: [PATCH 12/33] Remove unnecessary "Arrow" --- r/vignettes/dataset.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 7e693a872d7..cf93c65c242 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -193,7 +193,7 @@ subset of the data into an in-memory R data frame. Suppose you attempt to call unsupported dplyr verbs or unimplemented functions in your query on an Arrow Dataset. In that case, the arrow package raises an error. However, -for dplyr queries on Arrow Table objects (typically smaller in size than Arrow Datasets), the +for dplyr queries on Arrow Table objects (typically smaller in size than Datasets), the package automatically calls `collect()` before processing that dplyr verb. Here's an example. Suppose that you are curious about tipping behavior among the From 9df7b5c569a88e44f901a504a49f48193f2dc682 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Mon, 26 Jul 2021 13:43:32 +0100 Subject: [PATCH 13/33] Remove unnecessary hyphens --- r/vignettes/dataset.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index cf93c65c242..bb1c9d4d7de 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -273,7 +273,7 @@ intermediate datasets that would potentially be large. Second, all work is pushed down to the individual data files, and depending on the file format, chunks of data within the files. As a result, you can select a subset of data from a much larger dataset by collecting the -smaller slices from each file--you don't have to load the whole dataset in +smaller slices from each file - you don't have to load the whole dataset in memory to slice from it. Third, because of partitioning, you can ignore some files entirely. @@ -294,7 +294,7 @@ same directory, you can provide a file path or a vector of multiple file paths to `open_dataset()`. This is useful if, for example, you have a single CSV file that is too big to read into memory. You could pass the file path to `open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks, -then use `write_dataset()` to write each chunk to a separate Parquet file---all +then use `write_dataset()` to write each chunk to a separate Parquet file - all without needing to read the full CSV file into R. 
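
A rough sketch of that workflow, where the file name `huge-file.csv` and its `state` column are hypothetical:

```r
# Open the single large CSV lazily as a Dataset (no full read into memory)
huge_csv <- open_dataset("huge-file.csv", format = "csv")

# Group by a column to define the chunks, then write them out as Parquet;
# each distinct value of `state` becomes its own partition directory
huge_csv %>%
  group_by(state) %>%
  write_dataset("huge-file-parquet", format = "parquet")
```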
### Explicitly declare column names and data types From 5cb9469a5ff208dc8699f6b81d6bf98ac8b57ae0 Mon Sep 17 00:00:00 2001 From: Nic Date: Thu, 29 Jul 2021 11:27:27 +0000 Subject: [PATCH 14/33] "a" -> "an" Co-authored-by: Jonathan Keane --- r/vignettes/dataset.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index bb1c9d4d7de..7585f527da6 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -38,7 +38,7 @@ arrow::arrow_with_s3() Even with S3 support enabled, network speed will be a bottleneck unless your machine is located in the same AWS region as the data. So, for this vignette, -we assume that the NYC taxi dataset has been downloaded locally in a "nyc-taxi" +we assume that the NYC taxi dataset has been downloaded locally in an "nyc-taxi" directory. ### Retrieving data from a public Amazon S3 bucket From 40860cfc6ee1bc01bbdeb30d653b6e1f1e00972c Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Thu, 29 Jul 2021 14:23:29 +0100 Subject: [PATCH 15/33] Make explanation more explicit --- r/vignettes/dataset.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 7585f527da6..47a10ec9be4 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -125,13 +125,13 @@ have file paths like ... ``` -By providing a character vector to `partitioning`, you're saying that the first +By providing `c("year", "month")` to the `partitioning` argument, you're saying that the first path segment gives the value for `year`, and the second segment is `month`. Every row in `2009/01/data.parquet` has a value of 2009 for `year` and 1 for `month`, even though those columns may not be present in the file. Indeed, when you look at the dataset, you can see that in addition to the columns present -in every file, there are also columns `year` and `month`. +in every file, there are also columns `year` and `month` even though they are not present in the files themselves. ```{r, eval = file.exists("nyc-taxi")} ds From df69b6dc06780da9acdf00e623ab44f97a716c24 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Thu, 29 Jul 2021 14:24:21 +0100 Subject: [PATCH 16/33] Stylistic change --- r/vignettes/dataset.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 47a10ec9be4..85971157da9 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -196,7 +196,7 @@ in your query on an Arrow Dataset. In that case, the arrow package raises an err for dplyr queries on Arrow Table objects (typically smaller in size than Datasets), the package automatically calls `collect()` before processing that dplyr verb. -Here's an example. Suppose that you are curious about tipping behavior among the +Here's an example: suppose that you are curious about tipping behavior among the longest taxi rides. 
Let's find the median tip percentage for rides with fares greater than $100 in 2015, broken down by the number of passengers: From 5c415c3f5e3a34cf803fe9bff587db76258a072c Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Thu, 29 Jul 2021 17:57:41 +0100 Subject: [PATCH 17/33] Add STYLE.md --- r/.Rbuildignore | 2 ++ r/STYLE.md | 9 +++++++++ 2 files changed, 11 insertions(+) create mode 100644 r/STYLE.md diff --git a/r/.Rbuildignore b/r/.Rbuildignore index cf4b7ce31ba..4de043a709b 100644 --- a/r/.Rbuildignore +++ b/r/.Rbuildignore @@ -24,3 +24,5 @@ clang_format.sh ^apache-arrow.rb$ ^.*\.Rhistory$ ^extra-tests +STYLE.md + diff --git a/r/STYLE.md b/r/STYLE.md new file mode 100644 index 00000000000..e55c3269266 --- /dev/null +++ b/r/STYLE.md @@ -0,0 +1,9 @@ +# Style + +This is a style guide for writing documentation for [arrow](https://arrow.apache.org/docs/r/). + +* Please use the [tidyverse coding style](https://style.tidyverse.org/). + +* When referring to packages, link to the package at the first mention, and subsequently refer to it in plain text (without backticks). + +* When referring the concept, write use the phrase "data frame", whereas when referring to an object of that class, or when the class is important, write `data.frame` (e.g. "the `head` method for a `data.frame` object..."). From f8ab48ac9cb0559f43a34cd011d05024b41c4f20 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Thu, 29 Jul 2021 18:36:26 +0100 Subject: [PATCH 18/33] Update style file to make clearer --- r/STYLE.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/r/STYLE.md b/r/STYLE.md index e55c3269266..3dce3deed33 100644 --- a/r/STYLE.md +++ b/r/STYLE.md @@ -4,6 +4,6 @@ This is a style guide for writing documentation for [arrow](https://arrow.apache * Please use the [tidyverse coding style](https://style.tidyverse.org/). -* When referring to packages, link to the package at the first mention, and subsequently refer to it in plain text (without backticks). +* When referring to packages, include a link to the package at the first mention, and subsequently refer to it in plain text (e.g. "The arrow package implements multiple [dplyr](https://dplyr.tidyverse.org/) verbs and allows similar syntax to dplyr.") -* When referring the concept, write use the phrase "data frame", whereas when referring to an object of that class, or when the class is important, write `data.frame` (e.g. "the `head` method for a `data.frame` object..."). +* When referring the concept, write use the phrase "data frame", whereas when referring to an object of that class, or when the class is important, write `data.frame` (e.g. "Similar concepts to the `head` method for a `data.frame` object can be found in other data frame implementations."). From c7052f90103042b1062d8dc70cde837eeeff2372 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Fri, 30 Jul 2021 11:02:53 +0100 Subject: [PATCH 19/33] Explicitly reference supported parsing options --- r/vignettes/dataset.Rmd | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 85971157da9..8e7973998a1 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -112,8 +112,15 @@ Other supported formats include: * `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files) * `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use) -For text files, you can pass any parsing options (`delim`, `quote`, etc.) 
to -`open_dataset()` that you would otherwise pass to `read_csv_arrow()`. +For text files, you can pass the following parsing options to `open_dataset()`: + +* `delim` +* `quote` +* `escape_double` +* `escape_backslash` +* `skip_empty_rows` + +For more information on the usage of these parameters, see `?read_delim_arrow()`. The `partitioning` argument lets you specify how the file paths provide information about how the dataset is chunked into different files. The files in this example From f065df20c24e426bbb11493b4a1cc1878fb8a572 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Fri, 30 Jul 2021 11:17:01 +0100 Subject: [PATCH 20/33] Tweak STYLE.md and its examples --- r/STYLE.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/r/STYLE.md b/r/STYLE.md index 3dce3deed33..83e31811f29 100644 --- a/r/STYLE.md +++ b/r/STYLE.md @@ -4,6 +4,6 @@ This is a style guide for writing documentation for [arrow](https://arrow.apache * Please use the [tidyverse coding style](https://style.tidyverse.org/). -* When referring to packages, include a link to the package at the first mention, and subsequently refer to it in plain text (e.g. "The arrow package implements multiple [dplyr](https://dplyr.tidyverse.org/) verbs and allows similar syntax to dplyr.") +* When referring to external packages, include a link to the package at the first mention, and subsequently refer to it in plain text (e.g. "The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets. This vignette introduces Datasets and shows how to use dplyr to analyze them.") -* When referring the concept, write use the phrase "data frame", whereas when referring to an object of that class, or when the class is important, write `data.frame` (e.g. "Similar concepts to the `head` method for a `data.frame` object can be found in other data frame implementations."). +* When referring to the concept, use the phrase "data frame", whereas when referring to an object of that class or when the class is important, write `data.frame` (e.g. "You can call `write_dataset()` on tabular data objects such as Arrow Tables or RecordBatchs, or R data frames. If working with data frames you might want to use a `tibble` instead of a `data.frame` to take advantage of the default behaviour of partitioning data based on grouped variables." From c150bbcad40e1326f19ad2c9a51200dbdba2c801 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Fri, 30 Jul 2021 11:20:34 +0100 Subject: [PATCH 21/33] Use headings --- r/STYLE.md | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/r/STYLE.md b/r/STYLE.md index 83e31811f29..fc55eeeb76d 100644 --- a/r/STYLE.md +++ b/r/STYLE.md @@ -1,9 +1,18 @@ # Style -This is a style guide for writing documentation for [arrow](https://arrow.apache.org/docs/r/). +This is a style guide for writing documentation for arrow. -* Please use the [tidyverse coding style](https://style.tidyverse.org/). +## Coding style -* When referring to external packages, include a link to the package at the first mention, and subsequently refer to it in plain text (e.g. "The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets. This vignette introduces Datasets and shows how to use dplyr to analyze them.") +Please use the [tidyverse coding style](https://style.tidyverse.org/). -* When referring to the concept, use the phrase "data frame", whereas when referring to an object of that class or when the class is important, write `data.frame` (e.g. 
"You can call `write_dataset()` on tabular data objects such as Arrow Tables or RecordBatchs, or R data frames. If working with data frames you might want to use a `tibble` instead of a `data.frame` to take advantage of the default behaviour of partitioning data based on grouped variables." +## Referring to external packages + +When referring to external packages, include a link to the package at the first mention, and subsequently refer to it in plain text, e.g. + +* "The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets. This vignette introduces Datasets and shows how to use dplyr to analyze them." + +## Data frames +When referring to the concept, use the phrase "data frame", whereas when referring to an object of that class or when the class is important, write `data.frame`, e.g. + +* "You can call `write_dataset()` on tabular data objects such as Arrow Tables or RecordBatchs, or R data frames. If working with data frames you might want to use a `tibble` instead of a `data.frame` to take advantage of the default behaviour of partitioning data based on grouped variables." From 88b63aede68d98e85d84fd49580ba24be3dc52f3 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Fri, 30 Jul 2021 11:21:34 +0100 Subject: [PATCH 22/33] "for" -> "to" --- r/STYLE.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/STYLE.md b/r/STYLE.md index fc55eeeb76d..d8ea29e928d 100644 --- a/r/STYLE.md +++ b/r/STYLE.md @@ -1,6 +1,6 @@ # Style -This is a style guide for writing documentation for arrow. +This is a style guide to writing documentation for arrow. ## Coding style From b5f75b7989ddff6525715b52e1bab5ab8ba69262 Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Fri, 30 Jul 2021 11:21:51 +0100 Subject: [PATCH 23/33] Pedantry --- r/STYLE.md | 1 + 1 file changed, 1 insertion(+) diff --git a/r/STYLE.md b/r/STYLE.md index d8ea29e928d..c2460f7aa9e 100644 --- a/r/STYLE.md +++ b/r/STYLE.md @@ -13,6 +13,7 @@ When referring to external packages, include a link to the package at the first * "The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets. This vignette introduces Datasets and shows how to use dplyr to analyze them." ## Data frames + When referring to the concept, use the phrase "data frame", whereas when referring to an object of that class or when the class is important, write `data.frame`, e.g. * "You can call `write_dataset()` on tabular data objects such as Arrow Tables or RecordBatchs, or R data frames. If working with data frames you might want to use a `tibble` instead of a `data.frame` to take advantage of the default behaviour of partitioning data based on grouped variables." From 1bb3c880fee61c19e3168362adebdbcaa10cc2dc Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Tue, 3 Aug 2021 18:01:48 +0100 Subject: [PATCH 24/33] add ASF header --- r/STYLE.md | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/r/STYLE.md b/r/STYLE.md index c2460f7aa9e..c305ab11d90 100644 --- a/r/STYLE.md +++ b/r/STYLE.md @@ -1,3 +1,22 @@ + + # Style This is a style guide to writing documentation for arrow. 
From be3e7bba44e385c9d36dbb1572d802141c4ab8ac Mon Sep 17 00:00:00 2001 From: Nic Crane Date: Tue, 3 Aug 2021 18:02:58 +0100 Subject: [PATCH 25/33] Delete sentence --- r/vignettes/dataset.Rmd | 2 -- 1 file changed, 2 deletions(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 8e7973998a1..bef0368fd86 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -12,8 +12,6 @@ The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface t and other tools for interactive exploration of Arrow data. This vignette introduces Datasets and shows how to use dplyr to analyze them. -It describes both what is possible to do with Arrow now -and what is on the immediate development roadmap. ## Example: NYC taxi data From feed3b5de1629f84f0e2cae26e2be4c25e41e457 Mon Sep 17 00:00:00 2001 From: Nic Date: Tue, 3 Aug 2021 18:36:58 +0100 Subject: [PATCH 26/33] Update .Rbuildignore --- r/.Rbuildignore | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/.Rbuildignore b/r/.Rbuildignore index 428726c5321..3f67ef7cf3c 100644 --- a/r/.Rbuildignore +++ b/r/.Rbuildignore @@ -25,4 +25,4 @@ clang_format.sh ^.*\.Rhistory$ ^extra-tests STYLE.md -^.lintr \ No newline at end of file +^.lintr From 6666c13610508a578be6bfa801c0db685ab8911a Mon Sep 17 00:00:00 2001 From: Nic Date: Wed, 4 Aug 2021 13:21:33 +0000 Subject: [PATCH 27/33] Update r/STYLE.md Co-authored-by: Neal Richardson --- r/STYLE.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/STYLE.md b/r/STYLE.md index c305ab11d90..760084936a4 100644 --- a/r/STYLE.md +++ b/r/STYLE.md @@ -35,4 +35,4 @@ When referring to external packages, include a link to the package at the first When referring to the concept, use the phrase "data frame", whereas when referring to an object of that class or when the class is important, write `data.frame`, e.g. -* "You can call `write_dataset()` on tabular data objects such as Arrow Tables or RecordBatchs, or R data frames. If working with data frames you might want to use a `tibble` instead of a `data.frame` to take advantage of the default behaviour of partitioning data based on grouped variables." +* "You can call `write_dataset()` on tabular data objects such as Arrow Tables or RecordBatches, or R data frames. If working with data frames you might want to use a `tibble` instead of a `data.frame` to take advantage of the default behaviour of partitioning data based on grouped variables." From 0cde04fcfe1b2db153a72ef3b80b77c0ee181dd7 Mon Sep 17 00:00:00 2001 From: Nic Date: Wed, 4 Aug 2021 13:21:51 +0000 Subject: [PATCH 28/33] Update r/vignettes/dataset.Rmd Co-authored-by: Neal Richardson --- r/vignettes/dataset.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index bef0368fd86..c7bd6608284 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -185,7 +185,7 @@ would have been detected automatically. ## Querying the dataset -Up to this point, you haven't loaded any data. You've walked directories to find +Up to this point, you haven't loaded any data. You've walked directories to find files, you've parsed file paths to identify partitions, and you've read the headers of the Parquet files to inspect their schemas so that you can make sure they all are as expected. 
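PATCH 28 above closes on the point that no data has been loaded yet: directories have been walked, partitions parsed out of the file paths, and schemas read from the Parquet file headers. A minimal sketch of what that looks like in practice, assuming the Dataset object opened earlier in the vignette is named `ds` and that `total_amount` and `passenger_count` are valid column names, is:

```r
library(arrow)
library(dplyr)

# Assumes `ds` is the taxi Dataset opened earlier; `total_amount` and
# `passenger_count` are assumed column names in the Parquet files.
ds$schema      # unified schema assembled from the Parquet file metadata

lazy <- ds %>%
  filter(year == 2015, total_amount > 100) %>%
  select(total_amount, passenger_count)

lazy           # prints the recorded query; no rows have been read yet
collect(lazy)  # only this step scans the files and pulls matching rows into R
```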
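The first sentence of this excerpt describes finding the median tip percentage for rides with fares over $100 in 2015, broken down by passenger count. A rough sketch of such a query is shown below; the object name `ds` and the column names `tip_amount`, `total_amount`, and `passenger_count` are assumptions rather than anything these patches define, and the aggregation runs after `collect()` rather than inside Arrow.

```r
library(arrow)
library(dplyr)

# A rough sketch; `ds`, `tip_amount`, `total_amount`, and `passenger_count`
# are assumed names, not something these patches spell out.
ds %>%
  filter(total_amount > 100, year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = 100 * tip_amount / total_amount) %>%
  group_by(passenger_count) %>%
  collect() %>%          # pull the filtered subset into R before aggregating
  summarise(
    median_tip_pct = median(tip_pct),
    n = n()
  )
```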
From 987380839dfd40845f52829b789c6c525fe38289 Mon Sep 17 00:00:00 2001 From: Nic Date: Wed, 4 Aug 2021 13:22:06 +0000 Subject: [PATCH 29/33] Update r/vignettes/dataset.Rmd Co-authored-by: Neal Richardson --- r/vignettes/dataset.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index c7bd6608284..71ebab1d4ef 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -198,7 +198,7 @@ subset of the data into an in-memory R data frame. Suppose you attempt to call unsupported dplyr verbs or unimplemented functions in your query on an Arrow Dataset. In that case, the arrow package raises an error. However, -for dplyr queries on Arrow Table objects (typically smaller in size than Datasets), the +for dplyr queries on Arrow Table objects (which are already in memory), the package automatically calls `collect()` before processing that dplyr verb. Here's an example: suppose that you are curious about tipping behavior among the From 52a66f8bcb5ce2394c659756f13eadb51ae7aed8 Mon Sep 17 00:00:00 2001 From: Nic Date: Wed, 4 Aug 2021 13:22:22 +0000 Subject: [PATCH 30/33] Update r/vignettes/dataset.Rmd Co-authored-by: Neal Richardson --- r/vignettes/dataset.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 71ebab1d4ef..96d778d554a 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -241,7 +241,7 @@ cat(" ``` You've just selected a subset out of a dataset with around 2 billion rows, computed -a new column, and aggregated it in under 2 seconds on most modern laptops. How does +a new column, and aggregated it in under 2 seconds on a modern laptop. How does this work? First, `mutate()`/`transmute()`, `select()`/`rename()`/`relocate()`, `filter()`, From 1a4ff1d03b404a7f430964e697340ffbf617433e Mon Sep 17 00:00:00 2001 From: Nic Date: Wed, 4 Aug 2021 13:22:31 +0000 Subject: [PATCH 31/33] Update r/vignettes/dataset.Rmd Co-authored-by: Neal Richardson --- r/vignettes/dataset.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 96d778d554a..0a3a282fecf 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -278,7 +278,7 @@ intermediate datasets that would potentially be large. Second, all work is pushed down to the individual data files, and depending on the file format, chunks of data within the files. As a result, you can select a subset of data from a much larger dataset by collecting the -smaller slices from each file - you don't have to load the whole dataset in +smaller slices from each file—you don't have to load the whole dataset in memory to slice from it. Third, because of partitioning, you can ignore some files entirely. From 8ba084c50dbad2494324f28604cc30fff3df1efb Mon Sep 17 00:00:00 2001 From: Nic Date: Wed, 4 Aug 2021 13:23:03 +0000 Subject: [PATCH 32/33] Update r/vignettes/dataset.Rmd Co-authored-by: Neal Richardson --- r/vignettes/dataset.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 0a3a282fecf..10a43a89b2f 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -299,7 +299,7 @@ same directory, you can provide a file path or a vector of multiple file paths to `open_dataset()`. This is useful if, for example, you have a single CSV file that is too big to read into memory. 
You could pass the file path to `open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks, -then use `write_dataset()` to write each chunk to a separate Parquet file - all +then use `write_dataset()` to write each chunk to a separate Parquet file—all without needing to read the full CSV file into R. ### Explicitly declare column names and data types From b4ea47ba8d63ee4d2f5015c53e97f04ea53e2858 Mon Sep 17 00:00:00 2001 From: Nic Date: Wed, 4 Aug 2021 13:23:26 +0000 Subject: [PATCH 33/33] Update r/vignettes/dataset.Rmd Co-authored-by: Neal Richardson --- r/vignettes/dataset.Rmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index 10a43a89b2f..3f33cbae47c 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -340,7 +340,7 @@ Sometimes you might start with one giant CSV. The first step in analyzing data is cleaning is up and reshaping it into a more usable form. The `write_dataset()` function allows you to take a Dataset or another tabular -data object - an Arrow Table or RecordBatch, or an R data frame - and write +data object—an Arrow Table or RecordBatch, or an R data frame—and write it to a different file format, partitioned into multiple files. Assume that you have a version of the NYC Taxi data as CSV:
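PATCH 32 and PATCH 33 describe a workflow for a CSV file that is too large to read into memory: open it as a Dataset, group it, and write it back out as partitioned Parquet files without ever materialising the CSV in R. A sketch of that workflow follows; the paths are hypothetical and `payment_type` merely stands in for whichever column you would actually group by.

```r
library(arrow)
library(dplyr)

# Hypothetical paths and partition column; none of these names come from the patches.
big_csv <- open_dataset("nyc-taxi-csv/huge_file.csv", format = "csv")

big_csv %>%
  group_by(payment_type) %>%   # assumed stand-in for the real grouping column
  write_dataset("nyc-taxi-parquet/", format = "parquet")
```

Because `write_dataset()` partitions on the grouping variables by default, each distinct value of the grouping column ends up in its own directory of Parquet files.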
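The parsing options listed in PATCH 19 (`delim`, `quote`, `escape_double`, `escape_backslash`, `skip_empty_rows`) also apply here if the oversized file is delimited text rather than a comma-separated CSV. In the sketch below only the argument names come from that patch; the directory path and the option values are illustrative, and `?read_delim_arrow` documents what each option does.

```r
library(arrow)

# Only the argument names come from PATCH 19; the path and values are
# illustrative, and format = "text" is used because the files are pipe-delimited.
ds_text <- open_dataset(
  "path/to/pipe-delimited-files/",
  format = "text",
  delim = "|",
  quote = "\"",
  escape_double = TRUE,
  escape_backslash = FALSE,
  skip_empty_rows = TRUE
)
```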