@vibhatha
Contributor

@vibhatha vibhatha commented Aug 18, 2022

Adding emit feature for Substrait plan deserialization.

This PR covers emits for read, filter, project, join and aggregate operations.
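
For readers unfamiliar with emit: in Substrait, a relation's common section can carry an output mapping, a list of column indices that selects and reorders the relation's output. The following is only a minimal sketch of that idea on plain column names. `ApplyEmit` is a hypothetical helper written for illustration; it is not the function added by this PR and not an Arrow API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (assumption, not Arrow code): apply an emit-style
// output mapping to a list of column names. Each entry in `output_mapping`
// is an index into `columns`; the result has one column per mapping entry.
std::vector<std::string> ApplyEmit(const std::vector<std::string>& columns,
                                   const std::vector<int32_t>& output_mapping) {
  std::vector<std::string> out;
  out.reserve(output_mapping.size());
  for (int32_t index : output_mapping) {
    // Reject indices that do not refer to an existing input column.
    if (index < 0 || static_cast<std::size_t>(index) >= columns.size()) {
      throw std::out_of_range("emit index out of range");
    }
    out.push_back(columns[static_cast<std::size_t>(index)]);
  }
  assert(out.size() == output_mapping.size());
  return out;
}
```

With input columns `a, b, c` and a mapping of `2, 0`, this sketch yields `c, a`. A mapping may also repeat or drop columns.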

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@vibhatha vibhatha marked this pull request as ready for review August 21, 2022 05:38
@vibhatha
Contributor Author

cc @westonpace @jeroen please take a look.

Contributor

@jvanstraten jvanstraten left a comment

I only checked for inconsistencies with Substrait, so not for C++ or Acero-related problems or code quality. Emit handling looks good to me from that perspective, but I did find a few schema deduction problems.

Comment on lines +523 to +535
Contributor

Wrong order; keys come first.

> The list of distinct columns from each grouping set (ordered by their first appearance) followed by the list of measures in declaration order, [...]

https://substrait.io/relations/logical_relations/#aggregate-operation

Contributor Author

@jvanstraten I also noticed this, but I forgot to leave a comment about it. This is probably a separate JIRA because of the order used in aggregate_node.cc [1]. Please refer to the comment on this line and the two loops after it. The aggregate fields are appended first and then the key fields. One thing we can do is swap the response here.

cc @westonpace

[1].

// Aggregate fields come before key fields to match the behavior of GroupBy function

Contributor

Based on the comment this looks like intentional behavior of Arrow, so I don't think aggregate node is going to be adjusted to match Substrait. So that just means there should be a project node inserted behind the aggregate node that moves the columns around accordingly, right? I guess you could fix that in a separate JIRA/PR though. Maybe add a FIXME comment in that case?
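
The reordering discussed here can be sketched as a column-index permutation that a project step would apply. This is only an illustration under stated assumptions: a single grouping set, Acero emitting aggregate fields first and key fields second (as the quoted source comment says), and Substrait expecting keys first and measures second. `KeysFirstProjection` is a hypothetical name, not an Arrow function.

```cpp
#include <vector>

// Hypothetical helper (assumption, not Arrow code).
// Input layout (aggregate node output): [agg_0 .. agg_{m-1}, key_0 .. key_{k-1}]
// Output layout (Substrait order):      [key_0 .. key_{k-1}, agg_0 .. agg_{m-1}]
// Returns, for each output position, the index of the input column to take.
std::vector<int> KeysFirstProjection(int num_aggregates, int num_keys) {
  std::vector<int> indices;
  indices.reserve(static_cast<std::vector<int>::size_type>(num_aggregates + num_keys));
  // Keys sit after the aggregates in the input, so they start at num_aggregates.
  for (int i = 0; i < num_keys; ++i) {
    indices.push_back(num_aggregates + i);
  }
  // Aggregates occupy the first num_aggregates input positions.
  for (int i = 0; i < num_aggregates; ++i) {
    indices.push_back(i);
  }
  return indices;
}
```

For two aggregates and three keys, the sketch returns `2, 3, 4, 0, 1`. A project node built from field references at those positions would present the columns in the order Substrait describes.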

Contributor Author

Yes, the requirement in Acero may be static here.
We can use the project to swap things around and document it properly. Probably we can do it in this PR as well.

cc @westonpace

Contributor Author

On second thought, it would be better to solve this one in another PR, because I am not sure whether this would break the R test cases.

Member

Another PR is fine, but I wouldn't consider the output order from Acero to be too static. Fixing it up to output things in the order Substrait expects would be nice so we can at least avoid the project node in some cases (when the emit is direct). It'll be a breaking change and will probably cause some slight heartburn for our existing tests, but we should probably fix it while we still have the opportunity.

@westonpace westonpace self-requested a review August 22, 2022 16:31
Member

@westonpace westonpace left a comment

It seems I had left this review in the pending state. Apologies.

@vibhatha
Contributor Author

vibhatha commented Sep 7, 2022

@westonpace I updated the PR.

Member

@westonpace westonpace left a comment

A few naming suggestions and potential spots for cleanup but no complaints to the overall approach.

@vibhatha vibhatha requested a review from westonpace September 9, 2022 15:23
Member

@westonpace westonpace left a comment

Thanks again for putting this together.

@westonpace westonpace merged commit 7f77811 into apache:master Sep 13, 2022
@vibhatha
Contributor Author

Thank you for reviewing this one and keeping up with a few rounds of reviews.

@ursabot

ursabot commented Sep 13, 2022

Benchmark runs are scheduled for baseline = 9d65981 and contender = 7f77811. 7f77811 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.12% ⬆️0.17%] test-mac-arm
[Failed ⬇️0.28% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️3.55% ⬆️1.92%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 7f778113 ec2-t3-xlarge-us-east-2
[Failed] 7f778113 test-mac-arm
[Failed] 7f778113 ursa-i9-9960x
[Finished] 7f778113 ursa-thinkcentre-m75q
[Finished] 9d659810 ec2-t3-xlarge-us-east-2
[Failed] 9d659810 test-mac-arm
[Failed] 9d659810 ursa-i9-9960x
[Finished] 9d659810 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

zagto pushed a commit to zagto/arrow that referenced this pull request Oct 7, 2022
…e#13914)

Adding emit feature for Substrait plan deserialization. 

This PR covers emits for `read`, `filter`, `project`, `join` and `aggregate` operations.

Authored-by: Vibhatha Abeykoon <vibhatha@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>