-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-9679: [Rust] [DataFusion] More efficient creation of final batch from HashAggregateExec #7936
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@alamb @jorgecarleitao @houqp fyi, since you've all been contributing to DataFusion lately |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed the logic carefully and I didn't see any issues. My only comment that is in the "really should fix before merging" category is removing the println!. The rest are stylistic.
Otherwise I think this PR is looking really good. Nice work @andygrove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the type of Max and Min should be the same as the input (as those functions are never going to overflow the value type of their input, the way Sum or Avg could.
I actually think the change results in cleaner code too, which is a nice bonus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the names $BUILDER_TY and $SCALAR_TY are easier to understand than the previous $TY:ident, $TY2:ty 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| println!("downcast {:#?} to {:#?}", $BUILDER, $VALUE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Some(_) => panic!(), | |
| Some(_) => Err(ExecutionError::ExecutionError("unexpected type when creating aggregate value".to_string()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| for i in 0..num_group_expr + num_aggr_expr { | |
| for i in 0..output_schema.fields().len() { |
This is just a style suggestion / defensive coding suggestion (as output_schema.field(i) is matched below).
If you wanted to get all rusty / functional, you could also think about rewriting this as a map over fields. Something like this (untested):
let builders = output_schema
.fields()
.iter()
.map(|f| { match f.data_type ... // the match statement below})
.collect::<Result<Vec<_>>?;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a quick look at doing this but I needed an explicit assignment still otherwise it complained about the match arms returning different types, so it ended up not being much cleaner really. However, I did change it to use for data_type in &output_types which is a little cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sure you have a good reason, but I didn't see quite why this match arm can't use append_aggr_value! as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue here was that ScalarValue::Utf8 contains String and the builder wants &str. In all other cases the scalar and builder types are the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that might be worth adding in a comment so future readers would not have to wonder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is another good change (don't panic if there is a thread error while merging) -- maybe worth a mention in the PR title (or maybe even its own PR).
|
Thanks for the review @alamb. I've addressed the formatting, removed debug printlns, and tests are now passing. |
|
@nevi-me @paddyhoran PTAL if you have time. |
|
LGTM! Thanks a lot for this, nice cleanup! I closed #7687 in favor of this one as the overall is too high to salvage. |
Instead of walking through the map containing accumulators multiple times (once per grouping expression and once per aggregate expression) let's just walk through it once! Safer and faster.
Other changes in this PR:
ErrMapStructstruct