diff --git a/r/.Rbuildignore b/r/.Rbuildignore index 2f4cea9a34d..3f67ef7cf3c 100644 --- a/r/.Rbuildignore +++ b/r/.Rbuildignore @@ -24,4 +24,5 @@ clang_format.sh ^apache-arrow.rb$ ^.*\.Rhistory$ ^extra-tests +STYLE.md ^.lintr diff --git a/r/STYLE.md b/r/STYLE.md new file mode 100644 index 00000000000..760084936a4 --- /dev/null +++ b/r/STYLE.md @@ -0,0 +1,38 @@ + + +# Style + +This is a style guide to writing documentation for arrow. + +## Coding style + +Please use the [tidyverse coding style](https://style.tidyverse.org/). + +## Referring to external packages + +When referring to external packages, include a link to the package at the first mention, and subsequently refer to it in plain text, e.g. + +* "The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets. This vignette introduces Datasets and shows how to use dplyr to analyze them." + +## Data frames + +When referring to the concept, use the phrase "data frame", whereas when referring to an object of that class or when the class is important, write `data.frame`, e.g. + +* "You can call `write_dataset()` on tabular data objects such as Arrow Tables or RecordBatches, or R data frames. If working with data frames you might want to use a `tibble` instead of a `data.frame` to take advantage of the default behaviour of partitioning data based on grouped variables." diff --git a/r/vignettes/dataset.Rmd b/r/vignettes/dataset.Rmd index b5e17578b29..3f33cbae47c 100644 --- a/r/vignettes/dataset.Rmd +++ b/r/vignettes/dataset.Rmd @@ -8,46 +8,46 @@ vignette: > --- Apache Arrow lets you work efficiently with large, multi-file datasets. -The `arrow` R package provides a `dplyr` interface to Arrow Datasets, -as well as other tools for interactive exploration of Arrow data. +The arrow R package provides a [dplyr](https://dplyr.tidyverse.org/) interface to Arrow Datasets, +and other tools for interactive exploration of Arrow data. 
-This vignette introduces Datasets and shows how to use `dplyr` to analyze them. -It describes both what is possible to do with Arrow now -and what is on the immediate development roadmap. +This vignette introduces Datasets and shows how to use dplyr to analyze them. ## Example: NYC taxi data The [New York City taxi trip record data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) is widely used in big data exercises and competitions. For demonstration purposes, we have hosted a Parquet-formatted version -of about 10 years of the trip data in a public Amazon S3 bucket. +of about ten years of the trip data in a public Amazon S3 bucket. The total file size is around 37 gigabytes, even in the efficient Parquet file -format. That's bigger than memory on most people's computers, so we can't just +format. That's bigger than memory on most people's computers, so you can't just read it all in and stack it into a single data frame. -In Windows and macOS binary packages, S3 support is included. -On Linux when installing from source, S3 support is not enabled by default, +In Windows (for R > 3.6) and macOS binary packages, S3 support is included. +On Linux, when installing from source, S3 support is not enabled by default, and it has additional system requirements. See `vignette("install", package = "arrow")` for details. -To see if your `arrow` installation has S3 support, run +To see if your arrow installation has S3 support, run: ```{r} arrow::arrow_with_s3() ``` -Even with S3 support enabled network, speed will be a bottleneck unless your +Even with S3 support enabled, network speed will be a bottleneck unless your machine is located in the same AWS region as the data. So, for this vignette, -we assume that the NYC taxi dataset has been downloaded locally in a "nyc-taxi" +we assume that the NYC taxi dataset has been downloaded locally in an "nyc-taxi" directory. 
-If your `arrow` build has S3 support, you can sync the data locally with: +### Retrieving data from a public Amazon S3 bucket + +If your arrow build has S3 support, you can sync the data locally with: ```{r, eval = FALSE} arrow::copy_files("s3://ursa-labs-taxi-data", "nyc-taxi") ``` -If your `arrow` build doesn't have S3 support, you can download the files +If your arrow build doesn't have S3 support, you can download the files with some additional code: ```{r, eval = FALSE} @@ -77,39 +77,51 @@ feel free to grab only a year or two of data. If you don't have the taxi data downloaded, the vignette will still run and will yield previously cached output for reference. To be explicit about which version -is running, let's check whether we're running with live data: +is running, let's check whether you're running with live data: ```{r} dir.exists("nyc-taxi") ``` -## Getting started +## Opening the dataset -Because `dplyr` is not necessary for many Arrow workflows, +Because dplyr is not necessary for many Arrow workflows, it is an optional (`Suggests`) dependency. So, to work with Datasets, -we need to load both `arrow` and `dplyr`. +you need to load both arrow and dplyr. ```{r} library(arrow, warn.conflicts = FALSE) library(dplyr, warn.conflicts = FALSE) ``` -The first step is to create our Dataset object, pointing at the directory of data. +The first step is to create a Dataset object, pointing at the directory of data. ```{r, eval = file.exists("nyc-taxi")} ds <- open_dataset("nyc-taxi", partitioning = c("year", "month")) ``` -The default file format for `open_dataset()` is Parquet; if we had a directory -of Arrow format files, we could include `format = "arrow"` in the call. -Other supported formats include: `"feather"` (an alias for `"arrow"`, as Feather -v2 is the Arrow file format), `"csv"`, `"tsv"` (for tab-delimited), and `"text"` -for generic text-delimited files. For text files, you can pass any parsing -options (`delim`, `quote`, etc.) 
to `open_dataset()` that you would otherwise -pass to `read_csv_arrow()`. +The file format for `open_dataset()` is controlled by the `format` parameter, +which has a default value of `"parquet"`. If you had a directory +of Arrow format files, you could instead specify `format = "arrow"` in the call. + +Other supported formats include: + +* `"feather"` or `"ipc"` (aliases for `"arrow"`, as Feather v2 is the Arrow file format) +* `"csv"` (comma-delimited files) and `"tsv"` (tab-delimited files) +* `"text"` (generic text-delimited files - use the `delimiter` argument to specify which to use) + +For text files, you can pass the following parsing options to `open_dataset()`: -The `partitioning` argument lets us specify how the file paths provide information -about how the dataset is chunked into different files. Our files in this example +* `delim` +* `quote` +* `escape_double` +* `escape_backslash` +* `skip_empty_rows` + +For more information on the usage of these parameters, see `?read_delim_arrow()`. + +The `partitioning` argument lets you specify how the file paths provide information +about how the dataset is chunked into different files. The files in this example have file paths like ``` @@ -118,13 +130,13 @@ have file paths like ... ``` -By providing a character vector to `partitioning`, we're saying that the first -path segment gives the value for `year` and the second segment is `month`. +By providing `c("year", "month")` to the `partitioning` argument, you're saying that the first +path segment gives the value for `year`, and the second segment is `month`. Every row in `2009/01/data.parquet` has a value of 2009 for `year` -and 1 for `month`, even though those columns may not actually be present in the file. +and 1 for `month`, even though those columns may not be present in the file. -Indeed, when we look at the dataset, we see that in addition to the columns present -in every file, there are also columns `year` and `month`. 
+Indeed, when you look at the dataset, you can see that in addition to the columns present +in every file, there are also columns `year` and `month`, even though they are not stored in the files themselves. ```{r, eval = file.exists("nyc-taxi")} ds ``` @@ -159,7 +171,7 @@ See $metadata for additional Schema metadata The other form of partitioning currently supported is [Hive](https://hive.apache.org/)-style, in which the partition variable names are included in the path segments. -If we had saved our files in paths like +If you had saved your files in paths like: ``` year=2009/month=01/data.parquet @@ -167,29 +179,29 @@ year=2009/month=02/data.parquet ... ``` -we would not have had to provide the names in `partitioning`: -we could have just called `ds <- open_dataset("nyc-taxi")` and the partitions +you would not have had to provide the names in `partitioning`; +you could have just called `ds <- open_dataset("nyc-taxi")` and the partitions would have been detected automatically. ## Querying the dataset -Up to this point, we haven't loaded any data: we have walked directories to find -files, we've parsed file paths to identify partitions, and we've read the -headers of the Parquet files to inspect their schemas so that we can make sure -they all line up. +Up to this point, you haven't loaded any data. You've walked directories to find +files, you've parsed file paths to identify partitions, and you've read the +headers of the Parquet files to inspect their schemas so that you can make sure +they are all as expected. -In the current release, `arrow` supports the dplyr verbs `mutate()`, +In the current release, arrow supports the dplyr verbs `mutate()`, `transmute()`, `select()`, `rename()`, `relocate()`, `filter()`, and `arrange()`. Aggregation is not yet supported, so before you call `summarise()` or other verbs with aggregate functions, use `collect()` to pull the selected subset of the data into an in-memory R data frame.
-If you attempt to call unsupported `dplyr` verbs or unimplemented functions in -your query on an Arrow Dataset, the `arrow` package raises an error. However, -for `dplyr` queries on `Table` objects (which are typically smaller in size) the -package automatically calls `collect()` before processing that `dplyr` verb. +If you attempt to call unsupported dplyr verbs or unimplemented functions +in your query on an Arrow Dataset, the arrow package raises an error. However, +for dplyr queries on Arrow Table objects (which are already in memory), the +package automatically calls `collect()` before processing that dplyr verb. -Here's an example. Suppose I was curious about tipping behavior among the +Here's an example: suppose that you are curious about tipping behavior among the longest taxi rides. Let's find the median tip percentage for rides with fares greater than $100 in 2015, broken down by the number of passengers: @@ -228,12 +240,11 @@ cat(" ") ``` -We just selected a subset out of a dataset with around 2 billion rows, computed -a new column, and aggregated on it in under 2 seconds on my laptop. How does +You've just selected a subset out of a dataset with around 2 billion rows, computed +a new column, and aggregated on it in under 2 seconds on a modern laptop. How does this work? -First, -`mutate()`/`transmute()`, `select()`/`rename()`/`relocate()`, `filter()`, +First, `mutate()`/`transmute()`, `select()`/`rename()`/`relocate()`, `filter()`, `group_by()`, and `arrange()` record their actions but don't evaluate on the data until you run `collect()`. @@ -259,47 +270,58 @@ See $.data for the source Arrow object ") ``` -This returns instantly and shows the manipulations you've made, without +This code returns instantly and shows the manipulations you've made, without loading data from the files.
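As a minimal sketch of what this deferral looks like in practice (reusing the `ds` Dataset and the taxi columns from the example above), each step returns instantly because no data is read until the final `collect()`:

```r
# Refine the query step by step; each assignment only records the action
q <- filter(ds, year == 2015)
q <- filter(q, total_amount > 100)
q <- select(q, tip_amount, total_amount, passenger_count)

# Only now is data actually scanned, and only the slices that match
result <- collect(q)
```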
Because the evaluation of these queries is deferred, you can build up a query that selects down to a small subset without generating intermediate datasets that would potentially be large. Second, all work is pushed down to the individual data files, and depending on the file format, chunks of data within the files. As a result, -we can select a subset of data from a much larger dataset by collecting the -smaller slices from each file--we don't have to load the whole dataset in memory -in order to slice from it. +you can select a subset of data from a much larger dataset by collecting the +smaller slices from each file—you don't have to load the whole dataset in +memory to slice from it. -Third, because of partitioning, we can ignore some files entirely. +Third, because of partitioning, you can ignore some files entirely. In this example, by filtering `year == 2015`, all files corresponding to other years -are immediately excluded: we don't have to load them in order to find that no +are immediately excluded: you don't have to load them in order to find that no rows match the filter. Relatedly, since Parquet files contain row groups with -statistics on the data within, there may be entire chunks of data we can +statistics on the data within, there may be entire chunks of data you can avoid scanning because they have no rows where `total_amount > 100`. ## More dataset options There are a few ways you can control the Dataset creation to adapt to special use cases. -For one, if you are working with a single file or a set of files that are not -all in the same directory, you can provide a file path or a vector of multiple -file paths to `open_dataset()`. This is useful if, for example, you have a -single CSV file that is too big to read into memory. 
You could pass the file -path to `open_dataset()`, use `group_by()` to partition the Dataset into -manageable chunks, then use `write_dataset()` to write each chunk to a separate -Parquet file---all without needing to read the full CSV file into R. - -You can specify a `schema` argument to `open_dataset()` to declare the columns -and their data types. This is useful if you have data files that have different -storage schema (for example, a column could be `int32` in one and `int8` in another) -and you want to ensure that the resulting Dataset has a specific type. -To be clear, it's not necessary to specify a schema, even in this example of -mixed integer types, because the Dataset constructor will reconcile differences like these. -The schema specification just lets you declare what you want the result to be. + +### Work with files in a directory + +If you are working with a single file or a set of files that are not all in the +same directory, you can provide a file path or a vector of multiple file paths +to `open_dataset()`. This is useful if, for example, you have a single CSV file +that is too big to read into memory. You could pass the file path to +`open_dataset()`, use `group_by()` to partition the Dataset into manageable chunks, +then use `write_dataset()` to write each chunk to a separate Parquet file—all +without needing to read the full CSV file into R. + +### Explicitly declare column names and data types + +You can specify the `schema` argument to `open_dataset()` to declare the columns +and their data types. This is useful if you have data files that have different +storage schema (for example, a column could be `int32` in one and `int8` in +another) and you want to ensure that the resulting Dataset has a specific type. + +To be clear, it's not necessary to specify a schema, even in this example of +mixed integer types, because the Dataset constructor will reconcile differences +like these. 
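For instance, a minimal sketch of declaring a schema up front (the file names and column names here are hypothetical, not part of the taxi dataset):

```r
library(arrow)

# Hypothetical files in which column x is int32 in one and int8 in the other.
# Declaring the schema makes the resulting Dataset treat x as int32 everywhere.
ds <- open_dataset(
  c("part-1.parquet", "part-2.parquet"),
  schema = schema(x = int32(), y = utf8())
)
```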
The schema specification just lets you declare what you want the +result to be. + +### Explicitly declare partition format Similarly, you can provide a Schema in the `partitioning` argument of `open_dataset()` in order to declare the types of the virtual columns that define the partitions. -This would be useful, in our taxi dataset example, if you wanted to keep -`month` as a string instead of an integer for some reason. +This would be useful, in the taxi dataset example, if you wanted to keep +`month` as a string instead of an integer. + +### Work with multiple data sources Another feature of Datasets is that they can be composed of multiple data sources. That is, you may have a directory of partitioned Parquet files in one location, @@ -313,27 +335,29 @@ instead of a file path, or simply concatenate them like `big_dataset <- c(ds1, d As you can see, querying a large dataset can be made quite fast by storage in an efficient binary columnar format like Parquet or Feather and partitioning based on -columns commonly used for filtering. However, we don't always get our data delivered -to us that way. Sometimes we start with one giant CSV. Our first step in analyzing data +columns commonly used for filtering. However, data isn't always stored that way. +Sometimes you might start with one giant CSV. The first step in analyzing data is cleaning it up and reshaping it into a more usable form. -The `write_dataset()` function allows you to take a Dataset or other tabular data object---an Arrow `Table` or `RecordBatch`, or an R `data.frame`---and write it to a different file format, partitioned into multiple files. +The `write_dataset()` function allows you to take a Dataset or another tabular +data object—an Arrow Table or RecordBatch, or an R data frame—and write +it to a different file format, partitioned into multiple files.
-Assume we have a version of the NYC Taxi data as CSV: +Assume that you have a version of the NYC Taxi data as CSV: ```r ds <- open_dataset("nyc-taxi/csv/", format = "csv") ``` -We can write it to a new location and translate the files to the Feather format +You can write it to a new location and translate the files to the Feather format by calling `write_dataset()` on it: ```r write_dataset(ds, "nyc-taxi/feather", format = "feather") ``` -Next, let's imagine that the `payment_type` column is something we often filter -on, so we want to partition the data by that variable. By doing so we ensure +Next, let's imagine that the `payment_type` column is something you often filter +on, so you want to partition the data by that variable. By doing so you ensure that a filter like `payment_type == "Cash"` will touch only a subset of files where `payment_type` is always `"Cash"`. @@ -367,14 +391,14 @@ system("tree nyc-taxi/feather") Note that the directory names are `payment_type=Cash` and similar: this is the Hive-style partitioning described above. This means that when -we call `open_dataset()` on this directory, we don't have to declare what the +you call `open_dataset()` on this directory, you don't have to declare what the partitions are because they can be read from the file paths. (To instead write bare values for partition segments, i.e. `Cash` rather than `payment_type=Cash`, call `write_dataset()` with `hive_style = FALSE`.) -Perhaps, though, `payment_type == "Cash"` is the only data we ever care about, -and we just want to drop the rest and have a smaller working set. -For this, we can `filter()` them out when writing: +Perhaps, though, `payment_type == "Cash"` is the only data you ever care about, +and you just want to drop the rest and have a smaller working set. 
+For this, you can `filter()` them out when writing: ```r ds %>% @@ -382,9 +406,9 @@ ds %>% write_dataset("nyc-taxi/feather", format = "feather") ``` -The other thing we can do when writing datasets is select a subset of and/or reorder -columns. Suppose we never care about `vendor_id`, and being a string column, -it can take up a lot of space when we read it in, so let's drop it: +The other thing you can do when writing datasets is select a subset of columns +or reorder them. Suppose you never care about `vendor_id`, and being a string column, +it can take up a lot of space when you read it in, so let's drop it: ```r ds %>%