
Conversation

@westonpace
Member

I identified and reproduced two possible ways this sort of segmentation fault could happen. The stack traces demonstrated that worker tasks were still running for a plan after the test case had considered the plan "finished" and moved on.

First, the test case was calling gtest asserts in a helper method called from a loop:

void RunPlan(parameters) {
  Plan plan = MakePlan(parameters);
  // On failure, ASSERT_TRUE returns from RunPlan only; the caller's loop continues.
  ASSERT_TRUE(plan.FinishesInAReasonableTime());
}
void Test() {
  // ...
  for (int i = 0; i < kNumTrials; i++) {
    RunPlan(parameters);
  }
}

If the plan sometimes timed out, the assert could be triggered. A gtest assert simply returns from the enclosing function, so the failure would get swept up into the next iteration of the loop while the timed-out plan's tasks were still running. I changed the helper method to return a Result and put all asserts in the test case itself. That being said, I don't think this was the likely failure mode, as I would expect to have seen instances of this test case timing out alongside the instances where it had a segmentation fault.

The second possibility was a rather unique set of circumstances that I was only able to trigger reliably when inserting sleeps into the test at just the right spots.

Basically, the node has three task groups: BuildHashTable, ProbeQueuedBatches, and ScanHashTable. It is possible for ProbeQueuedBatches to have zero tasks. This means that when StartTaskGroup is called on the probe task group, it finishes immediately and invokes the finish continuation. The finish continuation could then call StartTaskGroup on the scan hash table task group. If the scan task group finished quickly, it could trigger the exec node's finished callback before the probe task group's call to StartTaskGroup->OnTaskGroupFinished had finished returning. That particular call returned all_task_groups_finished=false, because it was for the probe task group and the final task group was the scan task group. As a result it would try to call this->ScheduleMore (still inside StartTaskGroup), but by this point `this` had been deleted. In fact, given the stack traces we have, it looks like the call to ScheduleMore actually started (which makes sense, as it wasn't a virtual call), but the state of `this` was invalid.

I spent some time trying to figure out how to fix TaskScheduler before I realized we already have a convenient fix for this problem. I added an AsyncTaskGroup at the node level to ensure that all thread tasks started by the node finish before the node is marked finished.

… join node to mark itself finished too early when task scheduler tasks were still winding down; attempts to access its own state would then fail because the node had been deleted.
@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@westonpace
Member Author

CC @michalursa PTAL

@save-buffer
Contributor

Epic detective effort! Does this fix thread sanitizer too?

@westonpace
Member Author

I wasn't getting TSAN errors on this test case (this is before the bloom filter changes).

@save-buffer
Contributor

Is this ready to be merged? I'd like to rebase my bloom filter PR on it.

@ursabot

ursabot commented Apr 23, 2022

Benchmark runs are scheduled for baseline = b995284 and contender = 4f08a9b. 4f08a9b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.91% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.38% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.25% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/564| 4f08a9b6 ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/552| 4f08a9b6 test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/550| 4f08a9b6 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/562| 4f08a9b6 ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/563| b9952840 ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/551| b9952840 test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/549| b9952840 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/561| b9952840 ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

lidavidm added a commit that referenced this pull request May 2, 2022
This builds on top of #13035, which is also important for avoiding segmentation faults. On top of that, there were a few more problems:

 * The Python code was using `SourceNodeOptions::FromTable`, which is a rather dangerous method (mainly useful for unit testing) as it doesn't share ownership of the input table (even worse, it takes a const ref).  Python was not keeping the table alive, so it may have been possible for the table to be deleted out from under the plan (I'm not entirely sure this was causing issues, but it seemed risky).  I switched to TableSourceNode, which shares ownership of the table (and is a bit more efficient).
 * Setting use_threads to False did nothing because `_perform_join` was not passing the arg on to `execplan`.
 * When fixing the above and running with `use_threads=False`, it was creating a single-thread executor, but the current best practice is to pass in nullptr.
 * Finally, the actual bug was my improper fix in #12894.  I had still left a small window open for `End` to be called between `Submit` and `AddTask`, which would allow the task to be submitted but not participate in setting `finished` on the node.

Closes #13036 from westonpace/bugfix/ARROW-16417--segfault-in-python-join

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>