-
Notifications
You must be signed in to change notification settings - Fork 4k
GH-15141: [C++] fix for unstable test due to unstable sort #15142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-15141: [C++] fix for unstable test due to unstable sort #15142
Conversation
|
|
|
|
|
@vibhatha would you be interested in providing a review / sanity check? |
Sure @westonpace, I will. |
vibhatha
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@westonpace make sense to me. It’s not possible to guarantee the outcome of this.
Just a question, is there a sort option which gives precedence to the index of the row and decide which comes first, when we have a tie like this?
That's called a "stable sort". The underlying sort kernel (SortIndices) is stable. However, if the plan is run in parallel, then there is no guarantee the batches will accumulate in the same order. So even if the sort kernel is stable the sort node is not. Once we add proper ordering we can add a stable option to the sort node which resequences the data before sorting so that the sort node can remain stable. However, now that I write this, I realize it might be best to only apply my change when testing the parallel case, and to use the old comparison in the non-parallel case. |
Yes, I also think it’s better that way. |
Hmm, I tried this but it turns out not to be so simple. I'm going to proceed with this how it is for now. We can worry about a full comparison later when we add a stable sort. I'll add a new issue requesting that. I'll merge this so it doesn't bother CI |
|
Benchmark runs are scheduled for baseline = db6c59d and contender = 5a57e6d. 5a57e6d is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
|
['Python', 'R'] benchmarks have high level of regressions. |
…che#15142) The sorting done by orderby is not stable. This means, given the input: a | b --- | --- 1 | false 1 | true the test could have generated both `[false, true]` and `[true, false]` for the `b` column. We likely did not encounter this before apache@498b645 because the entire thing was run serially (even though there was a `parallel` option it was not setup correctly). Now that things are properly running parallel the results are non-deterministic. We could remove the `b` column but I feel it is a better stress test to have at least one payload column. So I changed the test to only compare the key array and not the payload array. * Closes: apache#15141 Authored-by: Weston Pace <weston.pace@gmail.com> Signed-off-by: Weston Pace <weston.pace@gmail.com>
The sorting done by orderby is not stable. This means, given the input:
the test could have generated both
[false, true]and[true, false]for thebcolumn. We likely did not encounter this before 498b645 because the entire thing was run serially (even though there was aparalleloption it was not setup correctly).Now that things are properly running parallel the results are non-deterministic. We could remove the
bcolumn but I feel it is a better stress test to have at least one payload column. So I changed the test to only compare the key array and not the payload array.