-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-34437: [R] Use FetchNode and OrderByNode #34685
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
b90834d to
92e841d
Compare
r/R/arrow-info.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you might need a rebase? This was done in #34943 right?
thisisnic
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, just needs a rebase!
r/tests/testthat/test-dplyr-query.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
285824a to
2996779
Compare
|
rebased; will merge when green |
|
Benchmark runs are scheduled for baseline = 6b4e0f6 and contender = 47a602d. 47a602d is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
### Rationale for this change See also apache#32991. By using the new nodes, we're closer to having all dplyr query business happening inside the ExecPlan. Unfortunately, there are still two cases where we have to apply operations in R after running a query: * apache#34941: Taking head/tail on unordered data, which has non-deterministic results but that should be possible, in the case where the user wants to see a slice of the result, any slice * apache#34942: Implementing tail in the FetchNode or similar would enable removing more hacks and workarounds. Once those are resolved, we can simply further and then move to the new Declaration class. ### What changes are included in this PR? This removes the use of different SinkNodes and many R-specific workarounds to support sorting and head/tail, so *almost* everything we do in a query should be represented in an ExecPlan. ### Are these changes tested? Yes. This is mostly an internal refactor, but behavior changes are accompanied by test updates. ### Are there any user-facing changes? The `show_query()` method will print slightly different ExecPlans. In many cases, they will be more informative. `tail()` now actually returns the tail of the data in cases where the data has an implicit order (currently only in-memory tables). Previously it was non-deterministic (and would return the head or some other slice of the data). When printing query objects that include `summarize()` when the `arrow.summarize.sort = TRUE` option is set, the sorting is correctly printed. It's unclear if there should be changes in performance; running benchmarks would be good but it's also not clear that our benchmarks cover all affected scenarios. * Closes: apache#34437 * Closes: apache#31980 * Closes: apache#31982 Authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com>
…d) (apache#35074) This PR updates the decorator in a function introduced in apache#34685 which inadvertently referenced `arrow` and not `acero`. * Closes: apache#35073 Authored-by: Nic Crane <thisisnic@gmail.com> Signed-off-by: Jacob Wujciak-Jens <jacob@wujciak.de>
### Rationale for this change See also apache#32991. By using the new nodes, we're closer to having all dplyr query business happening inside the ExecPlan. Unfortunately, there are still two cases where we have to apply operations in R after running a query: * apache#34941: Taking head/tail on unordered data, which has non-deterministic results but that should be possible, in the case where the user wants to see a slice of the result, any slice * apache#34942: Implementing tail in the FetchNode or similar would enable removing more hacks and workarounds. Once those are resolved, we can simply further and then move to the new Declaration class. ### What changes are included in this PR? This removes the use of different SinkNodes and many R-specific workarounds to support sorting and head/tail, so *almost* everything we do in a query should be represented in an ExecPlan. ### Are these changes tested? Yes. This is mostly an internal refactor, but behavior changes are accompanied by test updates. ### Are there any user-facing changes? The `show_query()` method will print slightly different ExecPlans. In many cases, they will be more informative. `tail()` now actually returns the tail of the data in cases where the data has an implicit order (currently only in-memory tables). Previously it was non-deterministic (and would return the head or some other slice of the data). When printing query objects that include `summarize()` when the `arrow.summarize.sort = TRUE` option is set, the sorting is correctly printed. It's unclear if there should be changes in performance; running benchmarks would be good but it's also not clear that our benchmarks cover all affected scenarios. * Closes: apache#34437 * Closes: apache#31980 * Closes: apache#31982 Authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com>
…d) (apache#35074) This PR updates the decorator in a function introduced in apache#34685 which inadvertently referenced `arrow` and not `acero`. * Closes: apache#35073 Authored-by: Nic Crane <thisisnic@gmail.com> Signed-off-by: Jacob Wujciak-Jens <jacob@wujciak.de>
Rationale for this change
See also #32991. By using the new nodes, we're closer to having all dplyr query business happening inside the ExecPlan. Unfortunately, there are still two cases where we have to apply operations in R after running a query:
Once those are resolved, we can simply further and then move to the new Declaration class.
What changes are included in this PR?
This removes the use of different SinkNodes and many R-specific workarounds to support sorting and head/tail, so almost
everything we do in a query should be represented in an ExecPlan.
Are these changes tested?
Yes. This is mostly an internal refactor, but behavior changes are accompanied by test updates.
Are there any user-facing changes?
The
show_query()method will print slightly different ExecPlans. In many cases, they will be more informative.tail()now actually returns the tail of the data in cases where the data has an implicit order (currently only in-memory tables). Previously it was non-deterministic (and would return the head or some other slice of the data).When printing query objects that include
summarize()when thearrow.summarize.sort = TRUEoption is set, the sorting is correctly printed.It's unclear if there should be changes in performance; running benchmarks would be good but it's also not clear that our benchmarks cover all affected scenarios.