ARROW-13399: [R] Update dataset.Rmd vignette #10765

thisisnic · 2021-07-21T11:31:55Z

Various updates to dataset.Rmd including:

separating out dense text chunks
rephrasing based on suggestions by Grammarly to simplify phrasing
rephrasing "we" to "you"

github-actions · 2021-07-21T11:32:15Z

https://issues.apache.org/jira/browse/ARROW-13399

r/vignettes/dataset.Rmd

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>

…ke tidyverse package vignettes

thisisnic · 2021-07-26T12:42:35Z

I've also removed a lot of the backticks around "arrow" and "dplyr" as I think they make the text harder to scan. I took a look at the tidyverse packages and the convention in their vignettes seems to be to link to any external packages the first time they're mentioned, and then subsequently just treat the package names as if they are words.

jonkeane

This looks like a really nice enhancement for readability and style. I have a few comments (and a number of them are "we should standardise this!"), so feel free to take them, discard them, or punt on them.

jonkeane · 2021-07-28T13:19:05Z

r/vignettes/dataset.Rmd

-The `partitioning` argument lets us specify how the file paths provide information
-about how the dataset is chunked into different files. Our files in this example
+For text files, you can pass any parsing options (`delim`, `quote`, etc.) to 
+`open_dataset()` that you would otherwise pass to `read_csv_arrow()`.
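To make the option passing concrete, here is a minimal sketch, assuming the arrow package is installed. The scratch directory, file name, and columns are made up for illustration. As the discussion in this thread notes, not every `read_csv_arrow()` argument carries over, so the sketch sticks to `delim`.

```r
library(arrow)

# Hypothetical scratch directory holding one semicolon-delimited file
csv_dir <- tempfile("delim-example-")
dir.create(csv_dir)
writeLines(c("a;b", "1;x", "2;y"), file.path(csv_dir, "part-0.csv"))

# `delim` is a readr-style parsing option forwarded by open_dataset()
ds <- open_dataset(csv_dir, format = "csv", delim = ";")

# Column names come from the header row, split on ";"
col_names <- ds$schema$names
```

If the delimiter were not passed through, the header would be read as a single column rather than two.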


I could be misremembering (or remembering the issue but not that it was resolved), but I thought there were some options that weren't compatible with the dataset version of CSV reading. We don't have to list them here, but if that is true and those are documented elsewhere (like the docs for read_csv_arrow() or open_dataset()), maybe a link to wherever those caveats live would be nice.

I have no recollection of that but I'll have a search on JIRA and see what I can find

Had a look in the docs and I found this in open_dataset():

additional arguments passed to dataset_factory() when sources is a directory path/URI or vector of file paths/URIs

Is this what you meant, as in it only works for certain values of sources, or something else?

I was thinking about this bit: https://github.com/apache/arrow/blob/master/r/R/dataset-format.R#L135-L153

Running the code in the middle of that, I get:

    > # Catch any readr-style options specified with full option names that are
    > # supported by read_delim_arrow() (and its wrappers) but are not yet
    > # supported here
    > unsup_readr_opts <- setdiff(
    +   names(formals(read_delim_arrow)),
    +   names(formals(readr_to_csv_parse_options))
    + )
    > unsup_readr_opts
     [1] "file"              "schema"            "col_names"         "col_types"         "col_select"
     [6] "na"                "quoted_na"         "skip"              "parse_options"     "convert_options"
    [11] "read_options"      "as_data_frame"     "timestamp_parsers"

You may want to intersect that with names(formals(readr::read_delim)), since some of those are arrow function args rather than readr options.

I'm confused by @nealrichardson 's comment, please can you rephrase that?

In the meantime, it looks like there are only 5 parsing options supported and more unsupported, so I feel like I'm better off explicitly listing the ones that are supported.

Many of the args in unsup_readr_opts aren't actually from readr, they're just arguments to read_delim_arrow that aren't in readr_to_csv_parse_options. (In practice where this code is run, this doesn't matter because if you supplied as_data_frame as an argument, it will have matched that and not be in the ... passed in here.) We only care about the ones that are readr options, so:

    > intersect(unsup_readr_opts, names(formals(readr::read_delim)))
    [1] "file"       "col_names"  "col_types"  "col_select" "na"
    [6] "quoted_na"  "skip"
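The setdiff-then-intersect pattern used above can be tried in isolation with base R alone. The sketch below uses three hypothetical toy functions (`toy_reader`, `toy_parse_options`, `toy_upstream`) as stand-ins for `read_delim_arrow()`, `readr_to_csv_parse_options()`, and `readr::read_delim()`. Their argument lists are invented for illustration and do not match the real signatures.

```r
# Hypothetical stand-ins: a reader with many arguments, a helper that only
# understands a subset of parsing options, and an "upstream" reader
toy_reader <- function(file, delim = ",", quote = "\"", skip = 0, as_data_frame = TRUE) NULL
toy_parse_options <- function(delim = ",", quote = "\"") NULL
toy_upstream <- function(file, delim, quote = "\"", skip = 0) NULL

# Arguments the reader accepts but the helper does not handle
unsupported <- setdiff(
  names(formals(toy_reader)),
  names(formals(toy_parse_options))
)

# Keep only those that are also upstream-style options, dropping
# reader-specific arguments such as as_data_frame
unsupported_upstream <- intersect(unsupported, names(formals(toy_upstream)))
```

Here `unsupported` includes `as_data_frame`, which is specific to the toy reader, and the `intersect()` step filters it out, which is the same narrowing the comment above describes.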

Right, I get you, I think. Now that I'm explicitly listing the arguments that can be passed through rather than the ones that can't, I don't think I need to make any more changes here on account of the above comments?

jonkeane · 2021-07-28T13:21:04Z

r/vignettes/dataset.Rmd


-By providing a character vector to `partitioning`, we're saying that the first
-path segment gives the value for `year` and the second segment is `month`.
+By providing a character vector to `partitioning`, you're saying that the first


I wonder if it might be clearer / easier to wade through if we use c("year", "month") instead of "a character vector"? That way the values are right here, it will be obvious to many R users what that is, even if they don't have "character vector" in their vocabulary.

Great suggestion, being explicit about this can only help understanding.

jonkeane · 2021-07-28T13:28:10Z

r/vignettes/dataset.Rmd


-Indeed, when we look at the dataset, we see that in addition to the columns present
+Indeed, when you look at the dataset, you can see that in addition to the columns present
 in every file, there are also columns `year` and `month`.


This might be beating a dead horse, but maybe it would be good to repeat "even though they are not present in the files themselves" here too?

Not beating a dead horse at all, being explicit about these things makes it a lot easier to understand with less effort.
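A small self-contained sketch may help show what `partitioning = c("year", "month")` does. It assumes the arrow package is installed and builds a throwaway directory tree. The `fare` column and the directory layout are made up for illustration and are not the vignette's taxi data.

```r
library(arrow)

# Hypothetical layout: <root>/<year>/<month>/data.parquet, with no year or
# month column stored inside the files themselves
root <- tempfile("partition-example-")
for (yr in c(2009, 2010)) {
  for (mo in 1:2) {
    dir_path <- file.path(root, yr, mo)
    dir.create(dir_path, recursive = TRUE)
    write_parquet(data.frame(fare = c(5.5, 7.25)), file.path(dir_path, "data.parquet"))
  }
}

# First path segment supplies `year`, second supplies `month`
ds <- open_dataset(root, partitioning = c("year", "month"))

# Compare the columns in one file with the columns the dataset exposes
file_cols <- names(read_parquet(file.path(root, "2009", "1", "data.parquet")))
dataset_cols <- ds$schema$names
```

The individual file only contains `fare`, while the dataset also exposes `year` and `month`, which are derived from the directory names.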

jonkeane · 2021-07-28T13:29:17Z

r/vignettes/dataset.Rmd

+package automatically calls `collect()` before processing that dplyr verb.

-Here's an example. Suppose I was curious about tipping behavior among the
+Here's an example. Suppose that you are curious about tipping behavior among the


Suggested change

-Here's an example. Suppose that you are curious about tipping behavior among the
+Here's an example: Suppose that you are curious about tipping behavior among the

Minor, and stylistic, feel free to disregard

r/vignettes/dataset.Rmd

Co-authored-by: Jonathan Keane <jkeane@gmail.com>

thisisnic · 2021-07-30T10:29:57Z

r/vignettes/dataset.Rmd

 It describes both what is possible to do with Arrow now
 and what is on the immediate development roadmap.


Does it? I think we should delete this sentence but it'd be good to hear other people's thoughts on this first.

At one point this was true (the discussion at the end talked about what's not yet implemented but coming) but perhaps that's no longer accurate.

r/STYLE.md

nealrichardson

Some final nits; LGTM thanks!

github-actions bot added the Component: R label Jul 21, 2021

thisisnic added 7 commits July 21, 2021 15:00

Remove backticks to make easier to read

9929e55

Grammarly suggestions and adding some subheadings

fbe7466

"we" -> "you"

9726952

Use bold instead of backticks to make package names more readable

e13543d

Split paragraph into bullets

a941444

Breaking sections down and tweaks

02ffc4c

Rename section heading

ae9b685

thisisnic force-pushed the ARROW_13399_dataset_vignette branch from 46df6df to ae9b685 Compare July 21, 2021 14:01

Specify Windows version for S3, minor tweaks

2c4f9f4

thisisnic changed the title ~~ARROW-13399: [R] Update dataset.Rmd vignette [WIP]~~ ARROW-13399: [R] Update dataset.Rmd vignette Jul 21, 2021

thisisnic marked this pull request as ready for review July 21, 2021 15:10

nealrichardson reviewed Jul 21, 2021

r/vignettes/dataset.Rmd

thisisnic and others added 4 commits July 21, 2021 20:07

Update r/vignettes/dataset.Rmd

52ec5f6

Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>

Remove highlighting of arrow/dplyr to make easier to read and more li…

254771f

…ke tidyverse package vignettes

Remove extra backticks on Arrow object names

ff09054

Remove unnecessary "Arrow"

4ee3b42

Remove unnecessary hyphens

9df7b5c

jonkeane reviewed Jul 28, 2021

thisisnic and others added 9 commits July 29, 2021 11:27

"a" -> "an"

5cb9469

Co-authored-by: Jonathan Keane <jkeane@gmail.com>

Make explanation more explicit

40860cf

Stylistic change

df69b6d

Add STYLE.md

5c415c3

Update style file to make clearer

f8ab48a

Explicitly reference supported parsing options

c7052f9

Tweak STYLE.md and its examples

f065df2

Use headings

c150bbc

"for" -> "to"

88b63ae

Pedantry

b5f75b7

thisisnic commented Jul 30, 2021

nealrichardson reviewed Jul 30, 2021

r/STYLE.md

thisisnic added 4 commits August 3, 2021 18:01

add ASF header

1bb3c88

Delete sentence

be3e7bb

Merge branch 'master' into ARROW_13399_dataset_vignette

5c5b60a

Update .Rbuildignore

feed3b5

thisisnic requested review from jonkeane and nealrichardson August 3, 2021 17:37

nealrichardson reviewed Aug 4, 2021

r/STYLE.md