Skip to content

Conversation

@westonpace
Copy link
Member

This builds on top of #13035 which is also important for avoiding segmentation faults. On top of that there were a few more problems:

  • The python was using SourceNodeOptions::FromTable which is a rather dangerous method (mainly useful for unit testing) as it doesn't share ownership of the input table (even worse, it takes a const ref). Python was not keeping the table alive and it was maybe possible for the table to deleted out from under the plan (I'm not entirely sure this was causing issues but it seemed risky). I switched to TableSourceNode which shares ownership of the table (and is a bit more efficient).
  • Setting use_threads to False did nothing because _perform_join was not passing the arg on to execplan.
  • When fixing the above and running with use_threads=False it was creating a single thread executor but the current best practice is to pass in nullptr.
  • Finally, the actual bug was my improper fix in ARROW-14911: [C++] arrow-compute-hash-join-node-test failed #12894 . I had still left a small window open for End to be called between Submit and AddTask which would allow the task to be submitted but not participate in setting finished on the node.

lidavidm and others added 2 commits April 29, 2022 16:31
…ing in NULL for executor if use_threads is false instead of short lived thread pool. Forwading use_threads from _perform_join to execplan. Fix bug in hash_join_node.cc that could allow bits of the plan to remain running after marking the plan finished.
@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tracking this down! I'm curious, how did you debug it?

re: SourceNodeOptions::FromTable - maybe we should remove this method or change it to take shared_ptr<Table>?

@westonpace
Copy link
Member Author

I was able to get it to reproduce in debug mode with stress -c 32. Then I got reasonable stack traces that showed the hash join node still working while the python was getting ready to start the next test. That was pretty similar to the debugging I did in #12894 which fortunately was still recent enough I remember roughly what was happening.

@lidavidm
Copy link
Member

Aha.

Hmm, I wonder if it'd be useful to somehow have Pytest also assert that Arrow's thread pools are idle in between tests. (And frankly, Googletest as well.)

@westonpace
Copy link
Member Author

Technically, we aren't quite there yet. The async generators in the scanner are allowed to run some cleanup after the exec plan runs. However, they strongly capture all of their state. Once #12468 merges then I think this might be a good idea.

@ursabot
Copy link

ursabot commented May 6, 2022

Benchmark runs are scheduled for baseline = 760ad20 and contender = 7809c6d. 7809c6d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️1.98% ⬆️0.31%] test-mac-arm
[Finished ⬇️0.71% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️1.18% ⬆️0.08%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 7809c6d7 ec2-t3-xlarge-us-east-2
[Finished] 7809c6d7 test-mac-arm
[Finished] 7809c6d7 ursa-i9-9960x
[Finished] 7809c6d7 ursa-thinkcentre-m75q
[Finished] 760ad200 ec2-t3-xlarge-us-east-2
[Finished] 760ad200 test-mac-arm
[Finished] 760ad200 ursa-i9-9960x
[Finished] 760ad200 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants