-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-13399: [R] Update dataset.Rmd vignette #10765
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
46df6df to
ae9b685
Compare
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
…ke tidyverse package vignettes
|
I've also removed a lot of the backticks around "arrow" and "dplyr" as I think they make the text harder to scan. I took a look at the tidyverse packages and the convention in their vignettes seems to be to link to any external packages the first time they're mentioned, and then subsequently just treat the package names as if they are words. |
jonkeane
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a really nice enhancement for readability and style. I have a few comments (and a number of them are "we should standardise this!"), so feel free to take them, discard them, or punt on them
r/vignettes/dataset.Rmd
Outdated
| The `partitioning` argument lets us specify how the file paths provide information | ||
| about how the dataset is chunked into different files. Our files in this example | ||
| For text files, you can pass any parsing options (`delim`, `quote`, etc.) to | ||
| `open_dataset()` that you would otherwise pass to `read_csv_arrow()`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could be misremembered (or remembering the issue but not that it was resolved), but I thought there were some options that weren't compatible with the dataset version of csv reading. We don't have to list them here, but if that is true + those are documented elsewhere (like the docs for read_csv_arrow() or open_dataset()), maybe a link to that elsewhere that contains those caveats would be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no recollection of that but I'll have a search on JIRA and see what I can find
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a look in the docs and I found this in open_dataset():
additional arguments passed to dataset_factory() when sources is a directory path/URI or vector of file paths/URIs
Is this what you meant, as in it only works for certain values of sources, or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking about https://github.com/apache/arrow/blob/master/r/R/dataset-format.R#L135-L153 bit.
Running the code in the middle of that, I get:
> # Catch any readr-style options specified with full option names that are
> # supported by read_delim_arrow() (and its wrappers) but are not yet
> # supported here
> unsup_readr_opts <- setdiff(
+ names(formals(read_delim_arrow)),
+ names(formals(readr_to_csv_parse_options))
+ )
> unsup_readr_opts
[1] "file" "schema" "col_names" "col_types" "col_select"
[6] "na" "quoted_na" "skip" "parse_options" "convert_options"
[11] "read_options" "as_data_frame" "timestamp_parsers"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you may want to intersect that with names(formals(readr::read_delim)) since some of those are arrow function args
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused by @nealrichardson 's comment, please can you rephrase that?
In the meantime, it looks like there are only 5 parsing options supported and more unsupported, so I feel like I'm better off explicitly listing the ones that are supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many of the args in unsup_readr_opts aren't actually from readr, they're just arguments to read_delim_arrow that aren't in readr_to_csv_parse_options. (In practice where this code is run, this doesn't matter because if you supplied as_data_frame as an argument, it will have matched that and not be in the ... passed in here.) We only care about the ones that are readr options, so:
> intersect(unsup_readr_opts, names(formals(readr::read_delim)))
[1] "file" "col_names" "col_types" "col_select" "na"
[6] "quoted_na" "skip"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I get you, I think. I think that now I'm explicitly specifying the arguments that can be passed through rather than can't, I don't think I need to make any more changes here on account of the above comments?
r/vignettes/dataset.Rmd
Outdated
|
|
||
| By providing a character vector to `partitioning`, we're saying that the first | ||
| path segment gives the value for `year` and the second segment is `month`. | ||
| By providing a character vector to `partitioning`, you're saying that the first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it might be clearer / easier to wade through if we use c("year", "month") instead of "a character vector"? That way the values are right here, it will be obvious to many R users what that is, even if they don't have "character vector" in their vocabulary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion, being explicit about this can only help understanding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
r/vignettes/dataset.Rmd
Outdated
|
|
||
| Indeed, when we look at the dataset, we see that in addition to the columns present | ||
| Indeed, when you look at the dataset, you can see that in addition to the columns present | ||
| in every file, there are also columns `year` and `month`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be beating a dead horse, but maybe it would be good to repeat "even though they are not present in the files themselves" here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not beating a dead horse at all, being explicit about these things makes it a lot easier to understand with less effort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
r/vignettes/dataset.Rmd
Outdated
| package automatically calls `collect()` before processing that dplyr verb. | ||
|
|
||
| Here's an example. Suppose I was curious about tipping behavior among the | ||
| Here's an example. Suppose that you are curious about tipping behavior among the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Here's an example. Suppose that you are curious about tipping behavior among the | |
| Here's an example: Suppose that you are curious about tipping behavior among the |
Minor, and stylistic, feel free to disregard
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
r/vignettes/dataset.Rmd
Outdated
| It describes both what is possible to do with Arrow now | ||
| and what is on the immediate development roadmap. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it? I think we should delete this sentence but it'd be good to hear other people's thoughts on this first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At one point this was true (the discussion at the end talked about what's not yet implemented but coming) but perhaps that's no longer accurate.
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
nealrichardson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some final nits; LGTM thanks!
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Various updates to dataset.Rmd including: