Member

@thisisnic thisisnic commented Jul 21, 2021

Various updates to dataset.Rmd including:

  • separating out dense text chunks
  • rephrasing based on suggestions by Grammarly to simplify phrasing
  • rephrasing "we" to "you"

@thisisnic thisisnic force-pushed the ARROW_13399_dataset_vignette branch from 46df6df to ae9b685 Compare July 21, 2021 14:01
@thisisnic thisisnic changed the title ARROW-13399: [R] Update dataset.Rmd vignette [WIP] ARROW-13399: [R] Update dataset.Rmd vignette Jul 21, 2021
@thisisnic thisisnic marked this pull request as ready for review July 21, 2021 15:10
@thisisnic
Member Author

I've also removed a lot of the backticks around "arrow" and "dplyr" as I think they make the text harder to scan. I took a look at the tidyverse packages and the convention in their vignettes seems to be to link to any external packages the first time they're mentioned, and then subsequently just treat the package names as if they are words.

Member

@jonkeane jonkeane left a comment

This looks like a really nice enhancement for readability and style. I have a few comments (and a number of them are "we should standardise this!"), so feel free to take them, discard them, or punt on them.

The `partitioning` argument lets us specify how the file paths provide information
about how the dataset is chunked into different files. Our files in this example
For text files, you can pass any parsing options (`delim`, `quote`, etc.) to
`open_dataset()` that you would otherwise pass to `read_csv_arrow()`.
Member

I could be misremembering (or remembering the issue but not that it was resolved), but I thought there were some options that weren't compatible with the dataset version of CSV reading. We don't have to list them here, but if that is true and those are documented elsewhere (like the docs for read_csv_arrow() or open_dataset()), a link to wherever those caveats live would be nice.

Member Author

I have no recollection of that but I'll have a search on JIRA and see what I can find

Member Author

Had a look in the docs and I found this in open_dataset():

"additional arguments passed to `dataset_factory()` when `sources` is a directory path/URI or vector of file paths/URIs"

Is this what you meant, as in it only works for certain values of sources, or something else?

Member

I was thinking about this bit: https://github.com/apache/arrow/blob/master/r/R/dataset-format.R#L135-L153

Running the code in the middle of that, I get:

>   # Catch any readr-style options specified with full option names that are
>   # supported by read_delim_arrow() (and its wrappers) but are not yet
>   # supported here
>   unsup_readr_opts <- setdiff(
+     names(formals(read_delim_arrow)),
+     names(formals(readr_to_csv_parse_options))
+   )
> unsup_readr_opts
 [1] "file"              "schema"            "col_names"         "col_types"         "col_select"       
 [6] "na"                "quoted_na"         "skip"              "parse_options"     "convert_options"  
[11] "read_options"      "as_data_frame"     "timestamp_parsers"

Member

you may want to intersect that with names(formals(readr::read_delim)) since some of those are arrow function args

Member Author

I'm confused by @nealrichardson's comment; please could you rephrase it?

In the meantime, it looks like there are only 5 parsing options supported and more unsupported, so I feel like I'm better off explicitly listing the ones that are supported.

Member

@nealrichardson nealrichardson Jul 30, 2021

Many of the args in unsup_readr_opts aren't actually from readr, they're just arguments to read_delim_arrow that aren't in readr_to_csv_parse_options. (In practice where this code is run, this doesn't matter because if you supplied as_data_frame as an argument, it will have matched that and not be in the ... passed in here.) We only care about the ones that are readr options, so:

> intersect(unsup_readr_opts, names(formals(readr::read_delim)))
[1] "file"       "col_names"  "col_types"  "col_select" "na"        
[6] "quoted_na"  "skip" 

Member Author

Right, I think I get you. Now that I'm explicitly listing the arguments that can be passed through, rather than the ones that can't, I don't think I need to make any more changes here on account of the above comments?


By providing a character vector to `partitioning`, we're saying that the first
path segment gives the value for `year` and the second segment is `month`.
By providing a character vector to `partitioning`, you're saying that the first
Member

I wonder if it might be clearer / easier to wade through if we use c("year", "month") instead of "a character vector"? That way the values are right here, it will be obvious to many R users what that is, even if they don't have "character vector" in their vocabulary.

Member Author

Great suggestion, being explicit about this can only help understanding.

Member Author

Done


Indeed, when we look at the dataset, we see that in addition to the columns present
Indeed, when you look at the dataset, you can see that in addition to the columns present
in every file, there are also columns `year` and `month`.
Member

This might be beating a dead horse, but maybe it would be good to repeat "even though they are not present in the files themselves" here too?

Member Author

Not beating a dead horse at all; being explicit about these things makes the text much easier to understand.

Member Author

Done

package automatically calls `collect()` before processing that dplyr verb.

Here's an example. Suppose I was curious about tipping behavior among the
Here's an example. Suppose that you are curious about tipping behavior among the
Member

Suggested change
Here's an example. Suppose that you are curious about tipping behavior among the
Here's an example: Suppose that you are curious about tipping behavior among the

Minor, and stylistic, feel free to disregard

Member Author

Done

Comment on lines 15 to 16
It describes both what is possible to do with Arrow now
and what is on the immediate development roadmap.
Member Author

Does it? I think we should delete this sentence but it'd be good to hear other people's thoughts on this first.

Member

At one point this was true (the discussion at the end talked about what's not yet implemented but coming) but perhaps that's no longer accurate.

thisisnic and others added 2 commits August 4, 2021 13:21
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Member

@nealrichardson nealrichardson left a comment

Some final nits; LGTM thanks!

thisisnic and others added 5 commits August 4, 2021 13:22
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>