-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-15584: [C++] Add support for Substrait's RelCommon::Emit #13914
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
|
cc @westonpace @jeroen please take a look. |
jvanstraten
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only checked for inconsistencies with Substrait, so not for C++ or Acero-related problems or code quality. Emit handling looks good to me from that perspective, but I did find a few schema deduction problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong order; keys come first.
The list of distinct columns from each grouping set (ordered by their first appearance) followed by the list of measures in declaration order, [...]
https://substrait.io/relations/logical_relations/#aggregate-operation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jvanstraten I also noticed this, but I forget to leave a comment about it. This is probably a separate JIRA because of the order used in the aggregate_node.cc[1]. Please refer to the comment in this line and the two loops after that. The aggregate fields appened first and then the key fields. One thing we can do is swap the response here.
cc @westonpace
[1].
| // Aggregate fields come before key fields to match the behavior of GroupBy function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the comment this looks like intentional behavior of Arrow, so I don't think aggregate node is going to be adjusted to match Substrait. So that just means there should be a project node inserted behind the aggregate node that moves the columns around accordingly, right? I guess you could fix that in a separate JIRA/PR though. Maybe add a FIXME comment in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the requirement in Acero may be static here.
We can use the project to swap things around and document it properly. Probably we can do it in this PR as well.
cc @westonpace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thoughts, it would be better to solve this one in another PR. Because I am not quite sure if this would break R test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another PR is fine but I wouldn't consider the output order from Acero to be too static. Fixing it up to output things in the order Substrait expects would be nice so we can at least avoid the project node in some cases (when a direct emit). It'll be a breaking change and probably cause some slight heartburn to our existing tests but we should probably fix it while we still have the opportunity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jira created here: https://issues.apache.org/jira/browse/ARROW-17656
f9f30f3 to
5050706
Compare
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems I had left this review in the pending state. Apologies.
5050706 to
9c099eb
Compare
|
@westonpace I updated the PR. |
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few naming suggestions and potential spots for cleanup but no complaints to the overall approach.
e185a2d to
19e49ed
Compare
westonpace
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again for putting this together.
|
Thank you for reviewing this one and keeping up with a few rounds of reviews. |
|
Benchmark runs are scheduled for baseline = 9d65981 and contender = 7f77811. 7f77811 is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
…e#13914) Adding emit feature for Substrait plan deserialization. This PR covers emits for `read`, `filter`, `project`, `join` and `aggregate` operations. Authored-by: Vibhatha Abeykoon <vibhatha@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
Adding emit feature for Substrait plan deserialization.
This PR covers emits for
read,filter,project,joinandaggregateoperations.