-
Notifications
You must be signed in to change notification settings - Fork 1.9k
fix: Sort with a lot of repetition values #2182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
cc @richox |
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn sort_with_lots_of_repetition_values() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ love the tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW I ran the test locally without the changes in this PR to confirm coverage:
cargo test -p datafusion --test sql_integration -- sort_with_lots_of_repetition_valuesThey failed with:
---- sql::order::sort_with_lots_of_repetition_values stdout ----
thread 'sql::order::sort_with_lots_of_repetition_values' panicked at 'assertion failed: `(left == right)`
left: `Some(2451809)`,
right: `Some(2451816)`', datafusion/core/tests/sql/order.rs:220:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
👍
| let mut last_batch_idx = 0; | ||
| let mut start_row_idx = 0; | ||
| let mut len = 0; | ||
| let mut indices_in_batch = vec![]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why using a Vec here is better. Aren't the indices a contiguous range?
I wonder what about using Option<std::ops::Range> instead https://doc.rust-lang.org/std/ops/struct.Range.html ?
Maybe as a follow on PR
| group_indices(last_batch_idx, indices, &mut slices); | ||
| last_batch_idx = ci.batch_idx; | ||
| start_row_idx = ci.row_idx; | ||
| len = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this the bug -- that len should be reset to 0 rather than 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, as described in the PR description, the bug comes from non continuous indexes, which is introduced by unstable lexsort. So it's possible we will see several disjoint ranges comes from one batch. The gap between the ranges are of same key but moved by unstable sort
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see 👍 -- that is a tricky one . So sad
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the test -- given this implementation works and the previous one has a bug I am approving the PR
I do worry about the performance implications
| } | ||
| } | ||
|
|
||
| /// Group continuous indices into a slice for better `extend` performance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand how positions can be non contiguous (doesn't this function get called each time batch_idx changes to a new batch)?
| } | ||
|
|
||
| #[tokio::test] | ||
| async fn sort_with_lots_of_repetition_values() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW I ran the test locally without the changes in this PR to confirm coverage:
cargo test -p datafusion --test sql_integration -- sort_with_lots_of_repetition_valuesThey failed with:
---- sql::order::sort_with_lots_of_repetition_values stdout ----
thread 'sql::order::sort_with_lots_of_repetition_values' panicked at 'assertion failed: `(left == right)`
left: `Some(2451809)`,
right: `Some(2451816)`', datafusion/core/tests/sql/order.rs:220:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
👍
| output: &mut Vec<CompositeSlice>, | ||
| ) { | ||
| // use sort instead of sort_unstable since it's likely nearly sorted. | ||
| positions.sort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This won't give different results, no? Why stable sort here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per doc for sort and sort_unstable, it seems sort is more suitable for the nearly sorted indices. After cargo test --release on the newly introduced sort_with_lots_of_repetition_values, unstable yields better performance even on nearly sorted values.
Changed to use sort_unstable, Thanks!
|
Good to have this fixed! I wonder if we could better switch to use stable sort in the arrow rs lexsort implementation instead (offering both stable/unstable sort), for performance. |
@jimexist added unstable sorting here: apache/arrow-rs#552 Maybe we could add the choice of which to use |
Which issue does this PR close?
Closes #.
Rationale for this change
While sorting by a column with a great many repetition values, we may get non-continuous indices from a batch to output. This fails the current assumption that indices are always continuous, and make
extendoutput wrong slices.For example, in our test on a 1TB TPC-DS dataset, sort by
inventory.inv_date_sk. we are witnessingusing start_row_idx, 4164 with length 6 is wrong, since rows with indices from 4165 to 4169 should be output elsewhere, and it's likely 4296 or 4297 are pointing to the next sort key.
What changes are included in this PR?
Bugfix along with the minimum parquet test file that could reproduce the bug.
Are there any user-facing changes?
No.