ARROW-14911: [C++] arrow-compute-hash-join-node-test failed #12894
Conversation
… join node to mark itself finished too early while task scheduler tasks were still winding down, so attempts to access its own state would fail because the node had already been deleted.
CC @michalursa PTAL
Epic detective effort! Does this fix thread sanitizer too?
I wasn't getting TSAN errors on this test case (this is before bloom filter)
Is this ready to be merged? I'd like to rebase my bloom filter PR on it.
Benchmark runs are scheduled for baseline = b995284 and contender = 4f08a9b. 4f08a9b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
This builds on top of #13035 which is also important for avoiding segmentation faults. On top of that there were a few more problems:

* The python was using `SourceNodeOptions::FromTable`, which is a rather dangerous method (mainly useful for unit testing) as it doesn't share ownership of the input table (even worse, it takes a const ref). Python was not keeping the table alive and it was maybe possible for the table to be deleted out from under the plan (I'm not entirely sure this was causing issues but it seemed risky). I switched to TableSourceNode, which shares ownership of the table (and is a bit more efficient).
* Setting use_threads to False did nothing because `_perform_join` was not passing the arg on to `execplan`.
* When fixing the above and running with `use_threads=False` it was creating a single thread executor, but the current best practice is to pass in nullptr.
* Finally, the actual bug was my improper fix in #12894. I had still left a small window open for `End` to be called between `Submit` and `AddTask`, which would allow the task to be submitted but not participate in setting `finished` on the node.

Closes #13036 from westonpace/bugfix/ARROW-16417--segfault-in-python-join

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
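The ownership point in the first bullet is the key detail. Purely as an illustration (these are not Arrow's actual declarations, just a minimal sketch of the difference between an options type that borrows the table and one that shares ownership of it):

```cpp
#include <memory>

struct Table {};  // stand-in for arrow::Table

// Borrowing pattern: the caller must guarantee `table` outlives the plan;
// if Python drops its last reference, running exec nodes read freed memory.
struct BorrowedSourceOptions {
  explicit BorrowedSourceOptions(const Table& table) : table(table) {}
  const Table& table;
};

// Shared-ownership pattern: the plan keeps the table alive for as long as
// any node might still read from it, regardless of what the caller does.
struct SharedSourceOptions {
  explicit SharedSourceOptions(std::shared_ptr<Table> table)
      : table(std::move(table)) {}
  std::shared_ptr<Table> table;
};
```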
I identified and reproduced two possible ways this sort of segmentation fault could happen. The stack traces demonstrated that worker tasks were still running for a plan after the test case had considered the plan "finished" and moved on.
First, the test case was calling gtest asserts in a helper method that was called from a loop.
If the plan sometimes timed out then the assert could be triggered. A gtest assert simply returns from the helper immediately, but the failure would then get swept up into the next iteration of the loop. I changed the helper method to return a `Result` and put all asserts in the test case itself. That being said, I don't think this was the likely failure, as I would expect to have seen instances of this test case timing out along with instances where it had a segmentation fault.
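As a rough illustration of the assert-in-a-helper problem (the names below are invented for the example, and where the real fix returned an `arrow::Result`, this sketch uses gtest's `AssertionResult` so it stays self-contained):

```cpp
#include <gtest/gtest.h>

// Stand-in for "run the plan once and report whether it finished in time".
bool RunPlanOnce() { return true; }

// Anti-pattern: ASSERT_* inside a void helper only returns from the helper,
// so the caller's loop keeps iterating after a failure.
void CheckPlanHelper() { ASSERT_TRUE(RunPlanOnce()); }

// Fix: the helper reports success or failure and the assert lives in the
// test body, so a failure stops the loop immediately.
::testing::AssertionResult CheckPlan() {
  if (!RunPlanOnce()) {
    return ::testing::AssertionFailure() << "plan did not finish in time";
  }
  return ::testing::AssertionSuccess();
}

TEST(HashJoinNodeStress, ManyIterations) {
  for (int i = 0; i < 100; ++i) {
    ASSERT_TRUE(CheckPlan());  // a failure ends the test here, not mid-helper
  }
}
```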
The second possibility was a rather unique set of circumstances that I was only able to trigger reliably by inserting sleeps into the test at just the right spots.

Basically, the node has three task groups: `BuildHashTable`, `ProbeQueuedBatches`, and `ScanHashTable`. It is possible for `ProbeQueuedBatches` to have zero tasks. This means that when `StartTaskGroup` is called on the probe task group it will immediately finish and call the finish continuation. The finish continuation could then call `StartTaskGroup` on the scan hash table task group. If the scan hash table tasks finished quickly enough, they could trigger the finished callback of the exec node before the call to `StartTaskGroup` -> `OnTaskGroupFinished` for the probe task group had finished returning. That particular call returned `all_task_groups_finished=false` because it was the probe task group and the final task group was the scan task group. As a result it would try to call `this->ScheduleMore` (still inside `StartTaskGroup`), but by this point `this` had been deleted. (Actually, given the stack traces we have, it looks like the call to `ScheduleMore` started, which makes sense as it wasn't a virtual call, but the state of `this` was invalid.)

I spent some time trying to figure out how to fix `TaskScheduler` when I realized we already have a convenient fix for this problem. I added an `AsyncTaskGroup` at the node level to ensure that all thread tasks started by the node finish before the node is marked finished.
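To make that mechanism concrete, here is a minimal, self-contained sketch of the idea (the names and API are invented for illustration and are not Arrow's actual `AsyncTaskGroup` interface): the node counts in-flight tasks and only fires its finished callback once the count drains to zero, so no late task can observe a deleted node.

```cpp
#include <functional>
#include <mutex>
#include <utility>

class TaskGroupGuard {
 public:
  explicit TaskGroupGuard(std::function<void()> on_all_finished)
      : on_all_finished_(std::move(on_all_finished)) {}

  // Call before handing a task to the scheduler.
  void TaskStarted() {
    std::lock_guard<std::mutex> lock(mutex_);
    ++in_flight_;
  }

  // Call from the task when it completes (success or failure).
  void TaskFinished() {
    bool fire = false;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (--in_flight_ == 0 && end_requested_) fire = true;
    }
    if (fire) on_all_finished_();  // runs outside the lock
  }

  // Call once the node has submitted its last task.  The finished callback
  // fires only after every in-flight task has drained.
  void End() {
    bool fire = false;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      end_requested_ = true;
      if (in_flight_ == 0) fire = true;
    }
    if (fire) on_all_finished_();
  }

 private:
  std::mutex mutex_;
  int in_flight_ = 0;
  bool end_requested_ = false;
  std::function<void()> on_all_finished_;
};
```

The important property is that the "finished" signal is owned by the guard rather than by whichever task group happens to complete last, so nothing can delete the node while one of its calls is still on the stack.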