Member

nealrichardson commented Mar 22, 2023

Rationale for this change

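See also #32991. By using the new nodes, we're closer to having all dplyr query business happening inside the ExecPlan. Unfortunately, there are still two cases where we have to apply operations in R after running a query:

* #34941: Taking head/tail on unordered data, which has non-deterministic results but should still be possible when the user wants to see a slice of the result, any slice
* #34942: Implementing tail in the FetchNode or similar would enable removing more hacks and workarounds.

Once those are resolved, we can simplify further and then move to the new Declaration class.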

What changes are included in this PR?

This removes the use of different SinkNodes and many R-specific workarounds to support sorting and head/tail, so almost everything we do in a query should be represented in an ExecPlan.

Are these changes tested?

Yes. This is mostly an internal refactor, but behavior changes are accompanied by test updates.

Are there any user-facing changes?

The show_query() method will print slightly different ExecPlans. In many cases, they will be more informative.
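
As a rough illustration (not taken from the PR's diff), the sketch below builds a sorted, limited query and prints its plan. The table `tbl` and its columns are made up for the example, and it assumes an arrow release that includes this change (the 12.0.0 milestone) plus dplyr. The exact text printed by `show_query()` is not reproduced here because it varies by version.

```r
library(arrow)
library(dplyr)

# Hypothetical in-memory Arrow table used only for illustration
tbl <- arrow_table(x = 1:10, g = rep(c("a", "b"), 5))

# Build a lazy query: sort descending, then keep the first three rows
query <- tbl %>%
  arrange(desc(x)) %>%
  head(3)

# Print the ExecPlan that Arrow would run for this query
show_query(query)

# Run the query and bring the result into R
top3 <- collect(query)
```

With this change, the sort and the limit should both show up as nodes in the printed plan rather than being applied in R afterwards.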

tail() now actually returns the tail of the data in cases where the data has an implicit order (currently only in-memory tables). Previously it was non-deterministic (and would return the head or some other slice of the data).
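
A minimal sketch of the `tail()` behavior described above, assuming an arrow release that includes this change and a made-up in-memory table `tbl` (in-memory tables being the case with an implicit order):

```r
library(arrow)
library(dplyr)

# Hypothetical in-memory Arrow table; its rows have an implicit order
tbl <- arrow_table(x = 1:10)

# Lazy query: filter, then take the last three rows in table order
last_rows <- tbl %>%
  filter(x > 2) %>%
  tail(3) %>%
  collect()

# Inspect which rows came back
last_rows$x
```

On a dataset without an implicit order, the same query gives no such guarantee about which rows are returned.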

When printing query objects that include summarize() when the arrow.summarize.sort = TRUE option is set, the sorting is correctly printed.

It's unclear if there should be changes in performance; running benchmarks would be good but it's also not clear that our benchmarks cover all affected scenarios.

nealrichardson force-pushed the order-by-node branch 2 times, most recently from b90834d to 92e841d April 6, 2023 17:30
nealrichardson marked this pull request as ready for review April 6, 2023 19:01
r/R/arrow-info.R Outdated
Member

I think you might need a rebase? This was done in #34943, right?

github-actions bot added the awaiting changes label and removed the awaiting review label Apr 11, 2023
Member

thisisnic left a comment

Looks great, just needs a rebase!

Comment on lines +522 to +528
Member

Awesome!

thisisnic added this to the 12.0.0 milestone Apr 11, 2023
github-actions bot added the awaiting merge label and removed the awaiting changes label Apr 11, 2023
@nealrichardson
Member Author

rebased; will merge when green

thisisnic merged commit 47a602d into apache:main Apr 11, 2023
assignUser pushed a commit that referenced this pull request Apr 12, 2023
…5074)

This PR updates the decorator in a function introduced in #34685 which inadvertently referenced `arrow` and not `acero`.
* Closes: #35073

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Jacob Wujciak-Jens <jacob@wujciak.de>
ursabot commented Apr 13, 2023

Benchmark runs are scheduled for baseline = 6b4e0f6 and contender = 47a602d. 47a602d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 47a602db ec2-t3-xlarge-us-east-2
[Failed] 47a602db test-mac-arm
[Finished] 47a602db ursa-i9-9960x
[Finished] 47a602db ursa-thinkcentre-m75q
[Finished] 6b4e0f6d ec2-t3-xlarge-us-east-2
[Failed] 6b4e0f6d test-mac-arm
[Finished] 6b4e0f6d ursa-i9-9960x
[Finished] 6b4e0f6d ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
### Rationale for this change

See also apache#32991. By using the new nodes, we're closer to having all dplyr query business happening inside the ExecPlan. Unfortunately, there are still two cases where we have to apply operations in R after running a query:

* apache#34941: Taking head/tail on unordered data, which has non-deterministic results but should still be possible when the user wants to see a slice of the result, any slice
* apache#34942: Implementing tail in the FetchNode or similar would enable removing more hacks and workarounds.

Once those are resolved, we can simplify further and then move to the new Declaration class.

### What changes are included in this PR?

This removes the use of different SinkNodes and many R-specific workarounds to support sorting and head/tail, so *almost* 
everything we do in a query should be represented in an ExecPlan. 

### Are these changes tested?

Yes. This is mostly an internal refactor, but behavior changes are accompanied by test updates.

### Are there any user-facing changes?

The `show_query()` method will print slightly different ExecPlans. In many cases, they will be more informative. 

`tail()` now actually returns the tail of the data in cases where the data has an implicit order (currently only in-memory tables). Previously it was non-deterministic (and would return the head or some other slice of the data).

When printing query objects that include `summarize()` when the `arrow.summarize.sort = TRUE` option is set, the sorting is correctly printed.

It's unclear if there should be changes in performance; running benchmarks would be good but it's also not clear that our benchmarks cover all affected scenarios. 

* Closes: apache#34437
* Closes: apache#31980
* Closes: apache#31982

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this pull request May 11, 2023
…d) (apache#35074)

ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
…d) (apache#35074)

Development

Successfully merging this pull request may close these issues.

* [R] Use FetchNode and OrderByNode
* [C++] Support order by derived column
* [C++] Support limit operation
