ARROW-16417: [C++][Python] Segfault in test_exec_plan.py / test_joins #13036

westonpace · 2022-04-30T01:16:55Z

This builds on top of #13035 which is also important for avoiding segmentation faults. On top of that there were a few more problems:

The python was using SourceNodeOptions::FromTable which is a rather dangerous method (mainly useful for unit testing) as it doesn't share ownership of the input table (even worse, it takes a const ref). Python was not keeping the table alive and it was maybe possible for the table to deleted out from under the plan (I'm not entirely sure this was causing issues but it seemed risky). I switched to TableSourceNode which shares ownership of the table (and is a bit more efficient).
Setting use_threads to False did nothing because _perform_join was not passing the arg on to execplan.
When fixing the above and running with use_threads=False it was creating a single thread executor but the current best practice is to pass in nullptr.
Finally, the actual bug was my improper fix in ARROW-14911: [C++] arrow-compute-hash-join-node-test failed #12894 . I had still left a small window open for End to be called between Submit and AddTask which would allow the task to be submitted but not participate in setting finished on the node.

…ing in NULL for executor if use_threads is false instead of short lived thread pool. Forwading use_threads from _perform_join to execplan. Fix bug in hash_join_node.cc that could allow bits of the plan to remain running after marking the plan finished.

github-actions · 2022-04-30T01:17:12Z

https://issues.apache.org/jira/browse/ARROW-16417

github-actions · 2022-04-30T01:17:14Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

lidavidm

Thanks for tracking this down! I'm curious, how did you debug it?

re: SourceNodeOptions::FromTable - maybe we should remove this method or change it to take shared_ptr<Table>?

westonpace · 2022-04-30T01:23:34Z

I was able to get it to reproduce in debug mode with stress -c 32. Then I got reasonable stack traces that showed the hash join node still working while the python was getting ready to start the next test. That was pretty similar to the debugging I did in #12894 which fortunately was still recent enough I remember roughly what was happening.

lidavidm · 2022-04-30T01:25:14Z

Aha.

Hmm, I wonder if it'd be useful to somehow have Pytest also assert that Arrow's thread pools are idle in between tests. (And frankly, Googletest as well.)

westonpace · 2022-04-30T01:28:23Z

Technically, we aren't quite there yet. The async generators in the scanner are allowed to run some cleanup after the exec plan runs. However, they strongly capture all of their state. Once #12468 merges then I think this might be a good idea.

ursabot · 2022-05-06T11:11:18Z

Benchmark runs are scheduled for baseline = 760ad20 and contender = 7809c6d. 7809c6d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️1.98% ⬆️0.31%] test-mac-arm
[Finished ⬇️0.71% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.18% ⬆️0.08%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 7809c6d7 ec2-t3-xlarge-us-east-2
[Finished] 7809c6d7 test-mac-arm
[Finished] 7809c6d7 ursa-i9-9960x
[Finished] 7809c6d7 ursa-thinkcentre-m75q
[Finished] 760ad200 ec2-t3-xlarge-us-east-2
[Finished] 760ad200 test-mac-arm
[Finished] 760ad200 ursa-i9-9960x
[Finished] 760ad200 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

lidavidm and others added 2 commits April 29, 2022 16:31

ARROW-16419: [Python] Properly wait for ExecPlan to finish

bbb49f2

github-actions bot added Component: C++ Component: Python labels Apr 30, 2022

lidavidm approved these changes Apr 30, 2022

View reviewed changes

ARROW-16417: python lint

85a736b

lidavidm mentioned this pull request May 2, 2022

ARROW-16419: [Python] Properly wait for ExecPlan to finish #13035

Closed

lidavidm closed this in 7809c6d May 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ARROW-16417: [C++][Python] Segfault in test_exec_plan.py / test_joins #13036

ARROW-16417: [C++][Python] Segfault in test_exec_plan.py / test_joins #13036

Uh oh!

westonpace commented Apr 30, 2022

Uh oh!

github-actions bot commented Apr 30, 2022

Uh oh!

github-actions bot commented Apr 30, 2022

Uh oh!

lidavidm left a comment •

edited

Loading

Uh oh!

westonpace commented Apr 30, 2022

Uh oh!

lidavidm commented Apr 30, 2022

Uh oh!

westonpace commented Apr 30, 2022

Uh oh!

ursabot commented May 6, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ARROW-16417: [C++][Python] Segfault in test_exec_plan.py / test_joins #13036

ARROW-16417: [C++][Python] Segfault in test_exec_plan.py / test_joins #13036

Uh oh!

Conversation

westonpace commented Apr 30, 2022

Uh oh!

github-actions bot commented Apr 30, 2022

Uh oh!

github-actions bot commented Apr 30, 2022

Uh oh!

lidavidm left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

westonpace commented Apr 30, 2022

Uh oh!

lidavidm commented Apr 30, 2022

Uh oh!

westonpace commented Apr 30, 2022

Uh oh!

ursabot commented May 6, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lidavidm left a comment •

edited

Loading