
Conversation

@westonpace
Member

I identified and reproduced two possible ways this sort of segmentation fault could happen. The stack traces demonstrated that worker tasks were still running for a plan after the test case had considered the plan "finished" and moved on.

First, the test case was calling gtest asserts in a helper method called from a loop:

void RunPlan(parameters) {
  Plan plan = MakePlan(parameters);
  // On failure, ASSERT_TRUE returns from RunPlan only; the caller's loop continues.
  ASSERT_TRUE(plan.FinishesInAReasonableTime());
}
void Test() {
  // ...
  for (int i = 0; i < kNumTrials; i++) {
    RunPlan(parameters);
  }
}

If the plan sometimes timed out, the assert could be triggered. A gtest assert simply returns from the enclosing function, so the failure would get swept up into the next iteration of the loop while the timed-out plan's tasks were still running. I changed the helper method to return a Result and put all asserts in the test case itself. That being said, I don't think this was the likely failure mode, as I would expect to have seen instances of this test case timing out alongside the instances where it had a segmentation fault.

The second possibility was a rather unique set of circumstances that I was only able to trigger reliably when inserting sleeps into the test at just the right spots.

Basically, the node has three task groups: BuildHashTable, ProbeQueuedBatches, and ScanHashTable. It is possible for ProbeQueuedBatches to have zero tasks. This means that when StartTaskGroup is called on the probe task group, it finishes immediately and invokes the finish continuation. The finish continuation could then call StartTaskGroup on the scan hash table task group. If the scan task group finished quickly, it could trigger the exec node's finished callback before the probe task group's call to StartTaskGroup->OnTaskGroupFinished had finished returning. That particular call returned all_task_groups_finished=false, because it was for the probe task group and the final task group was the scan task group. As a result it would try to call this->ScheduleMore (still inside StartTaskGroup), but by this point `this` had been deleted. In fact, given the stack traces we have, it looks like the call to ScheduleMore actually started (which makes sense, as it wasn't a virtual call), but the state of `this` was invalid.

I spent some time trying to figure out how to fix TaskScheduler before I realized we already have a convenient fix for this problem. I added an AsyncTaskGroup at the node level to ensure that all thread tasks started by the node finish before the node is marked finished.

… join node to mark itself finished too early when task scheduler tasks were still winding down; attempts to access its own state would then fail because the node had been deleted.
@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@westonpace
Member Author

CC @michalursa PTAL

@save-buffer
Contributor

Epic detective effort! Does this fix thread sanitizer too?

@westonpace
Member Author

I wasn't getting TSAN errors on this test case (this is before the bloom filter changes).

@save-buffer
Contributor

Is this ready to be merged? I'd like to rebase my bloom filter PR on it.

@ursabot

ursabot commented Apr 23, 2022

Benchmark runs are scheduled for baseline = b995284 and contender = 4f08a9b. 4f08a9b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.91% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.38% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.25% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/564| 4f08a9b6 ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/552| 4f08a9b6 test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/550| 4f08a9b6 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/562| 4f08a9b6 ursa-thinkcentre-m75q>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/563| b9952840 ec2-t3-xlarge-us-east-2>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/551| b9952840 test-mac-arm>
[Failed] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/549| b9952840 ursa-i9-9960x>
[Finished] <https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/561| b9952840 ursa-thinkcentre-m75q>
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

lidavidm added a commit that referenced this pull request May 2, 2022
This builds on top of #13035, which is also important for avoiding segmentation faults. On top of that, there were a few more problems:

 * The Python code was using `SourceNodeOptions::FromTable`, which is a rather dangerous method (mainly useful for unit testing) as it doesn't share ownership of the input table (even worse, it takes a const ref).  Python was not keeping the table alive, so it may have been possible for the table to be deleted out from under the plan (I'm not entirely sure this was causing issues, but it seemed risky).  I switched to TableSourceNode, which shares ownership of the table (and is a bit more efficient).
 * Setting use_threads to False did nothing because `_perform_join` was not passing the arg on to `execplan`.
 * When fixing the above and running with `use_threads=False`, it was creating a single-thread executor, but the current best practice is to pass in nullptr.
 * Finally, the actual bug was my improper fix in #12894.  I had still left a small window open for `End` to be called between `Submit` and `AddTask`, which would allow the task to be submitted but not participate in setting `finished` on the node.

Closes #13036 from westonpace/bugfix/ARROW-16417--segfault-in-python-join

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>