GH-34437: [R] Use FetchNode and OrderByNode #34685

nealrichardson · 2023-03-22T13:51:33Z

Rationale for this change

See also #32991. By using the new nodes, we're closer to having all dplyr query business happening inside the ExecPlan. Unfortunately, there are still two cases where we have to apply operations in R after running a query:

[C++] Non-deterministic FetchNode #34941: Taking head/tail on unordered data, which gives non-deterministic results but should still be possible for the case where the user wants to see a slice of the result, any slice.
[C++]: Support tail in FetchNode #34942: Implementing tail in the FetchNode or similar would enable removing more hacks and workarounds.

Once those are resolved, we can simplify further and then move to the new Declaration class.

What changes are included in this PR?

This removes the use of different SinkNodes and many R-specific workarounds to support sorting and head/tail, so almost everything we do in a query should be represented in an ExecPlan.

Are these changes tested?

Yes. This is mostly an internal refactor, but behavior changes are accompanied by test updates.

Are there any user-facing changes?

The show_query() method will print slightly different ExecPlans. In many cases, they will be more informative.
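
To see the plan for a query that sorts and then takes the head, something like the following could be used. This is only a sketch, assuming the arrow and dplyr packages are installed; the dataset (`mtcars`) and the chosen columns are illustrative and do not come from this PR, and the exact plan text depends on the installed arrow version.

```r
library(arrow)
library(dplyr)

# Illustrative query on an in-memory Table (mtcars ships with R).
query <- arrow_table(mtcars) %>%
  select(mpg, cyl) %>%
  arrange(desc(mpg)) %>%
  head(3)

# Print the ExecPlan that would be run for this query.
# The printed text varies by arrow version, so no output is shown here.
show_query(query)
```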

tail() now actually returns the tail of the data in cases where the data has an implicit order (currently only in-memory tables). Previously it was non-deterministic (and would return the head or some other slice of the data).
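
A small sketch of the case described above, assuming the arrow and dplyr packages are installed. The table and its `id` column are made up for illustration; the `filter()` step is only there so that `tail()` is applied to a query rather than directly to the Table.

```r
library(arrow)
library(dplyr)

# An in-memory Table keeps the row order it was created with (implicit order).
tbl <- arrow_table(id = 1:10)

# Per the PR description, tail() on a query over an in-memory Table should
# return the last rows of the data rather than an arbitrary slice.
last_rows <- tbl %>%
  filter(id > 2) %>%
  tail(3) %>%
  collect()

print(last_rows)
```

For sources without an implicit order, such as a multi-file dataset, the description above suggests the result of `tail()` is still not deterministic unless the query is sorted first.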

When the arrow.summarize.sort = TRUE option is set, printing query objects that include summarize() now correctly shows the sorting.

It's unclear if there should be changes in performance; running benchmarks would be good but it's also not clear that our benchmarks cover all affected scenarios.

Closes: [R] Use FetchNode and OrderByNode #34437
Closes: [C++] Support limit operation #31980
Closes: [C++] Support order by derived column #31982

github-actions · 2023-03-22T13:51:57Z

Closes: [R] Use FetchNode and OrderByNode #34437

thisisnic · 2023-04-11T08:24:58Z

r/R/arrow-info.R

I think you might need a rebase? This was done in #34943 right?

thisisnic

Looks great, just needs a rebase!

thisisnic · 2023-04-11T08:42:11Z

r/tests/testthat/test-dplyr-query.R

nealrichardson · 2023-04-11T15:02:42Z

rebased; will merge when green

…5074)

This PR updates the decorator in a function introduced in #34685 which inadvertently referenced `arrow` and not `acero`.

* Closes: #35073

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Jacob Wujciak-Jens <jacob@wujciak.de>

ursabot · 2023-04-13T00:17:26Z

Benchmark runs are scheduled for baseline = 6b4e0f6 and contender = 47a602d. 47a602d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 47a602db ec2-t3-xlarge-us-east-2
[Failed] 47a602db test-mac-arm
[Finished] 47a602db ursa-i9-9960x
[Finished] 47a602db ursa-thinkcentre-m75q
[Finished] 6b4e0f6d ec2-t3-xlarge-us-east-2
[Failed] 6b4e0f6d test-mac-arm
[Finished] 6b4e0f6d ursa-i9-9960x
[Finished] 6b4e0f6d ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

### Rationale for this change

See also apache#32991. By using the new nodes, we're closer to having all dplyr query business happening inside the ExecPlan. Unfortunately, there are still two cases where we have to apply operations in R after running a query:

* apache#34941: Taking head/tail on unordered data, which has non-deterministic results but that should be possible, in the case where the user wants to see a slice of the result, any slice
* apache#34942: Implementing tail in the FetchNode or similar would enable removing more hacks and workarounds.

Once those are resolved, we can simplify further and then move to the new Declaration class.

### What changes are included in this PR?

This removes the use of different SinkNodes and many R-specific workarounds to support sorting and head/tail, so *almost* everything we do in a query should be represented in an ExecPlan.

### Are these changes tested?

Yes. This is mostly an internal refactor, but behavior changes are accompanied by test updates.

### Are there any user-facing changes?

The `show_query()` method will print slightly different ExecPlans. In many cases, they will be more informative.

`tail()` now actually returns the tail of the data in cases where the data has an implicit order (currently only in-memory tables). Previously it was non-deterministic (and would return the head or some other slice of the data).

When printing query objects that include `summarize()` when the `arrow.summarize.sort = TRUE` option is set, the sorting is correctly printed.

It's unclear if there should be changes in performance; running benchmarks would be good but it's also not clear that our benchmarks cover all affected scenarios.

* Closes: apache#34437
* Closes: apache#31980
* Closes: apache#31982

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>


nealrichardson requested a review from westonpace March 22, 2023 13:51

github-actions bot added Component: R and awaiting review labels Mar 22, 2023

nealrichardson force-pushed the order-by-node branch 2 times, most recently from b90834d to 92e841d April 6, 2023 17:30

nealrichardson marked this pull request as ready for review April 6, 2023 19:01

nealrichardson requested review from paleolimbot and thisisnic as code owners April 6, 2023 19:01

thisisnic reviewed Apr 11, 2023

r/R/arrow-info.R Outdated

I think you might need a rebase? This was done in #34943 right?

github-actions bot added awaiting changes and removed awaiting review labels Apr 11, 2023

thisisnic approved these changes Apr 11, 2023

r/tests/testthat/test-dplyr-query.R Outdated

Comment on lines +522 to +528

Awesome!

thisisnic added this to the 12.0.0 milestone Apr 11, 2023

github-actions bot added awaiting merge and removed awaiting changes labels Apr 11, 2023

nealrichardson added 7 commits April 11, 2023 10:55

Add OrderByNode and FetchNode bindings and use where supported

f330220

Delete some code and patch up tests

393ddca

Add support for implicit order in head/tail where supported

71b20cc

Remove restriction on show_exec_plan: it's been safe to do for a while

830906b

Insert issue numbers in TODOs

5f04e1d

Move arrow.summarize.sort handling out of ExecPlan

0c29cfb

💅

2996779

nealrichardson force-pushed the order-by-node branch from 285824a to 2996779 April 11, 2023 14:55

thisisnic merged commit 47a602d into apache:main Apr 11, 2023

paleolimbot mentioned this pull request Apr 12, 2023

[R] Minimal build is failing (acero symbol not defined) #35073

Closed

thisisnic mentioned this pull request Apr 12, 2023

GH-35073: [R] Minimal build is failing (acero symbol not defined) #35074

Merged
