Don't store hashes in GroupOrdering by tustvold · Pull Request #7029 · apache/datafusion

tustvold · 2023-07-19T15:26:03Z

Which issue does this PR close?

Closes #.

Rationale for this change

Storing hashes in GroupOrdering was causing merge conflicts for #7016 and is not actually necessary.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb

Looks great to me

FYI @mustafasrepo and @ozankabak -- this effectively should improve the speed of streamed / bounded group by

alamb · 2023-07-19T18:14:58Z

datafusion/core/src/physical_plan/aggregates/row_hash.rs

-                for (idx, &hash) in hashes.iter().enumerate() {
-                    self.map.insert(hash, (hash, idx), |(hash, _)| *hash);
+                self.group_ordering.remove_groups(n);
+                // SAFETY: self.map outlives iterator and is not modified concurrently


I double checked: https://docs.rs/hashbrown/latest/hashbrown/raw/struct.RawTable.html#method.iter 👍

alamb · 2023-07-19T18:19:06Z

datafusion/core/src/physical_plan/aggregates/row_hash.rs

+                unsafe {
+                    for bucket in self.map.iter() {
+                        match bucket.as_ref().1.checked_sub(n) {
+                            None => self.map.erase(bucket),
+                            Some(sub) => bucket.as_mut().1 = sub,
+                        }
+                    }


I think this is both wonderfully elegant as well as cryptic. How about some comments (this is so I don't have to refigure this out the next time I see this code):

Suggested change

-                unsafe {
-                    for bucket in self.map.iter() {
-                        match bucket.as_ref().1.checked_sub(n) {
-                            None => self.map.erase(bucket),
-                            Some(sub) => bucket.as_mut().1 = sub,
-                        }
-                    }
+                unsafe {
+                    for bucket in self.map.iter() {
+                        // decrement group index by n
+                        match bucket.as_ref().1.checked_sub(n) {
+                            // group index was < n, so remove from table
+                            None => self.map.erase(bucket),
+                            // group index was >= n, shift value down
+                            Some(sub) => bucket.as_mut().1 = sub,
+                        }
+                    }

I double checked https://docs.rs/hashbrown/latest/hashbrown/raw/struct.RawIter.html

You must not free the hash table while iterating (including via growing/shrinking). It is fine to erase a bucket that has been yielded by the iterator. Erasing a bucket that has not yet been yielded by the iterator may still result in the iterator yielding that bucket (unless reflect_remove is called). It is unspecified whether an element inserted after the iterator was created will be yielded by that iterator (unless reflect_insert is called). The order in which the iterator yields bucket is unspecified and may change in the future.

Which seems to be followed 👍
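For anyone re-reading this later, here is a minimal safe sketch of the same "drop the first n groups and renumber the rest" logic. It is not the code in this PR: it swaps hashbrown's `RawTable` for the standard library's `HashMap` and uses `retain` instead of unsafe bucket iteration, and the helper name `remove_first_n_groups` is hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of DataFusion): remove every entry whose
/// group index is below `n` and shift the remaining indexes down by `n`.
fn remove_first_n_groups<K>(map: &mut HashMap<K, usize>, n: usize) {
    map.retain(|_key, group_idx| match group_idx.checked_sub(n) {
        // group index was < n, so the subtraction underflows: drop the entry
        None => false,
        // group index was >= n: store the shifted index and keep the entry
        Some(shifted) => {
            *group_idx = shifted;
            true
        }
    });
}
```

With groups `a, b, c, d` at indexes `0..=3`, calling this with `n = 2` leaves only `c` and `d`, now at indexes 0 and 1. The `checked_sub` trick is the same as in the diff above: one call both decides whether the entry survives and computes its new index.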

Don't store hashes in GroupOrdering

ab1cc59

github-actions bot added the core Core DataFusion crate label Jul 19, 2023

Update group IDs

0a6f23b

alamb approved these changes Jul 19, 2023


Review feedback

99b3eeb

tustvold merged commit a3db191 into apache:main Jul 19, 2023

tustvold merged 3 commits into apache:main from tustvold:do-not-store-hash-group-ordering
