Skip to content

Conversation

@yjshen
Copy link
Member

@yjshen yjshen commented Apr 8, 2022

Which issue does this PR close?

Closes #.

Rationale for this change

While sorting by a column with a great many repetition values, we may get non-continuous indices from a batch to output. This fails the current assumption that indices are always continuous, and make extend output wrong slices.

For example, in our test on a 1TB TPC-DS dataset, sort by inventory.inv_date_sk. we are witnessing

CompositeIndex { batch_idx: 68, row_idx: 4164 }
CompositeIndex { batch_idx: 68, row_idx: 4293 }
CompositeIndex { batch_idx: 68, row_idx: 4294 }
CompositeIndex { batch_idx: 68, row_idx: 4295 }
CompositeIndex { batch_idx: 68, row_idx: 4296 }
CompositeIndex { batch_idx: 68, row_idx: 4297 }

using start_row_idx, 4164 with length 6 is wrong, since rows with indices from 4165 to 4169 should be output elsewhere, and it's likely 4296 or 4297 are pointing to the next sort key.

What changes are included in this PR?

Bugfix along with the minimum parquet test file that could reproduce the bug.

Are there any user-facing changes?

No.

@yjshen yjshen requested review from Dandandan and alamb April 8, 2022 17:58
@yjshen yjshen marked this pull request as ready for review April 8, 2022 17:58
@yjshen
Copy link
Member Author

yjshen commented Apr 8, 2022

cc @richox

}

#[tokio::test]
async fn sort_with_lots_of_repetition_values() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ love the tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I ran the test locally without the changes in this PR to confirm coverage:

cargo test -p datafusion --test sql_integration -- sort_with_lots_of_repetition_values

They failed with:


---- sql::order::sort_with_lots_of_repetition_values stdout ----
thread 'sql::order::sort_with_lots_of_repetition_values' panicked at 'assertion failed: `(left == right)`
  left: `Some(2451809)`,
 right: `Some(2451816)`', datafusion/core/tests/sql/order.rs:220:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

👍

let mut last_batch_idx = 0;
let mut start_row_idx = 0;
let mut len = 0;
let mut indices_in_batch = vec![];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why using a Vec here is better. Aren't the indices a contiguous range?

I wonder what about using Option<std::ops::Range> instead https://doc.rust-lang.org/std/ops/struct.Range.html ?

Maybe as a follow on PR

group_indices(last_batch_idx, indices, &mut slices);
last_batch_idx = ci.batch_idx;
start_row_idx = ci.row_idx;
len = 1;
Copy link
Contributor

@alamb alamb Apr 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this the bug -- that len should be reset to 0 rather than 1?

Copy link
Member Author

@yjshen yjshen Apr 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, as described in the PR description, the bug comes from non continuous indexes, which is introduced by unstable lexsort. So it's possible we will see several disjoint ranges comes from one batch. The gap between the ranges are of same key but moved by unstable sort

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see 👍 -- that is a tricky one . So sad

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the test -- given this implementation works and the previous one has a bug I am approving the PR

I do worry about the performance implications

}
}

/// Group continuous indices into a slice for better `extend` performance
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand how positions can be non contiguous (doesn't this function get called each time batch_idx changes to a new batch)?

}

#[tokio::test]
async fn sort_with_lots_of_repetition_values() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW I ran the test locally without the changes in this PR to confirm coverage:

cargo test -p datafusion --test sql_integration -- sort_with_lots_of_repetition_values

They failed with:


---- sql::order::sort_with_lots_of_repetition_values stdout ----
thread 'sql::order::sort_with_lots_of_repetition_values' panicked at 'assertion failed: `(left == right)`
  left: `Some(2451809)`,
 right: `Some(2451816)`', datafusion/core/tests/sql/order.rs:220:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

👍

output: &mut Vec<CompositeSlice>,
) {
// use sort instead of sort_unstable since it's likely nearly sorted.
positions.sort();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't give different results, no? Why stable sort here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per doc for sort and sort_unstable, it seems sort is more suitable for the nearly sorted indices. After cargo test --release on the newly introduced sort_with_lots_of_repetition_values, unstable yields better performance even on nearly sorted values.

Changed to use sort_unstable, Thanks!

@Dandandan
Copy link
Contributor

Good to have this fixed!

I wonder if we could better switch to use stable sort in the arrow rs lexsort implementation instead (offering both stable/unstable sort), for performance.

@Dandandan Dandandan merged commit a637514 into apache:master Apr 9, 2022
@alamb
Copy link
Contributor

alamb commented Apr 10, 2022

I wonder if we could better switch to use stable sort in the arrow rs lexsort implementation instead (offering both stable/unstable sort), for performance.

@jimexist added unstable sorting here: apache/arrow-rs#552

Maybe we could add the choice of which to use

@yjshen yjshen deleted the fix_sort_rep branch April 22, 2022 08:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants