ARROW-16713: [C++] Pull join accumulation outside of HashJoinImpl #13332

save-buffer · 2022-06-07T17:23:56Z

No description provided.

github-actions · 2022-06-07T17:24:20Z

https://issues.apache.org/jira/browse/ARROW-16713

michalursa · 2022-06-07T20:03:43Z

Looks good to me.

I have only a few comments:

1. I would like ProbeBatches to be implemented outside of HashJoinImpl by calling the ProbeSingleBatch method of HashJoinImpl. That way the implementation of ProbeBatches can be shared between the current HashJoinImpl and SwissJoin's implementation of it.
2. finished_mutex_ in hash_join.cc is probably not used anymore and could be removed.
3. I think I would prefer AccumulationQueue to be thread-safe rather than having HashJoinNode manage mutexes for accumulation queues. I see that in this implementation AccumulationQueue is sometimes used on a single thread or in a read-only way, where the mutex is not necessary, but I would still prefer to extract a vector of exec batches from AccumulationQueue in a thread-safe way and use that vector in such situations. Alternatively, AccumulationQueue could carry a state flag saying that it is read-only and doesn't require a mutex.
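To make the first suggestion concrete, here is a minimal sketch of a ProbeBatches driver that lives outside the join implementation and only relies on a ProbeSingleBatch method. The `Batch` struct, the `CountingJoinImpl` class, the `bool` return type, and the exact signatures are assumptions made for illustration; they are not the actual Arrow types or APIs (the real code would use ExecBatch and Status).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an exec batch; only the row count matters here.
struct Batch {
  std::size_t num_rows;
};

// Illustrative join implementation that exposes only a per-batch probe entry point.
struct CountingJoinImpl {
  std::size_t rows_probed = 0;

  // Returns false to signal an error (assumed convention for this sketch).
  bool ProbeSingleBatch(std::size_t /*thread_index*/, const Batch& batch) {
    rows_probed += batch.num_rows;
    return true;
  }
};

// Shared driver defined outside the implementation. Any type that provides
// ProbeSingleBatch(thread_index, batch) can reuse it.
template <typename JoinImpl>
bool ProbeBatches(JoinImpl& impl, std::size_t thread_index,
                  const std::vector<Batch>& batches) {
  for (const Batch& batch : batches) {
    // Stop at the first failing batch and propagate the failure.
    if (!impl.ProbeSingleBatch(thread_index, batch)) return false;
  }
  return true;
}
```

Because the driver is a template over the implementation type, a second implementation with the same ProbeSingleBatch shape could reuse it without changes.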

save-buffer · 2022-06-07T20:19:12Z

1. That's true. I did think it would be cleaner for batches that are being operated on to be owned by the data structure that's doing the work. For ProbeBatches specifically, I could make a special "probing accumulation queue" that the batches get moved into instead.
2. Good point, will remove.
3. I initially did have it be thread-safe, but then I realized that most of them would have locks managed externally anyway: for spilling we'll be using PartitionLocks, and for HashJoinNode we use the mutexes to protect stuff other than the AccumulationQueues, so in neither case will we actually ever use the AccumulationQueue's mutex.
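The external-locking argument can be sketched as follows: when one mutex already has to guard other node state, the queue is covered by that same lock and gains nothing from a mutex of its own. Everything here (`ProbeSideState`, the `hash_table_ready` flag, `QueueOrProbe`, `MarkReadyAndDrain`) is a hypothetical simplification, not the HashJoinNode code.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for an exec batch.
struct Batch {
  std::size_t num_rows;
};

// Hypothetical node state: a single mutex guards both the flag and the queue,
// so the queue itself needs no internal lock.
struct ProbeSideState {
  std::mutex mutex;
  bool hash_table_ready = false;
  std::vector<Batch> queued;
};

// Returns true if the caller should probe the batch right away,
// false if the batch was queued because the hash table is not ready yet.
bool QueueOrProbe(ProbeSideState& state, Batch batch) {
  std::lock_guard<std::mutex> lock(state.mutex);
  if (state.hash_table_ready) return true;
  state.queued.push_back(std::move(batch));
  return false;
}

// Marks the hash table as ready and hands back everything queued so far.
// The flag update and the drain happen under the same lock.
std::vector<Batch> MarkReadyAndDrain(ProbeSideState& state) {
  std::lock_guard<std::mutex> lock(state.mutex);
  state.hash_table_ready = true;
  return std::exchange(state.queued, {});
}
```

The flag check and the append must be atomic with respect to each other, which is why a lock inside the queue alone would not be enough in this sketch.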

michalursa · 2022-06-07T20:29:02Z

Re AccumulationQueue: then it becomes just a vector, and maybe it doesn't need to be promoted to a separate class in a separate file.

westonpace

Looks good. A few minor nits and some questions but otherwise I think this helps clear things up.

cpp/src/arrow/compute/exec/accumulation_queue.h

cpp/src/arrow/compute/exec/hash_join.cc

cpp/src/arrow/compute/exec/hash_join_node.cc

westonpace · 2022-06-08T16:50:30Z

> Re AccumulationQueue: then it becomes just a vector, and maybe it doesn't need to be promoted to a separate class in a separate file.

This is true. The only difference I see between the accumulation queue and a vector of batches at the moment is row_count() but, on second examination, it doesn't seem we are currently using row_count() anywhere.

save-buffer · 2022-06-08T18:53:23Z

row_count is used for the Bloom filter build (in only one spot). I kind of like AccumulationQueue because it also disallows copying, and the Append function is handy. Also, in the next PR I'll be adding a SpillingAccumulationQueue, so I kind of like having a name for these. I'm fine getting rid of it too, though.
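For readers who want to see what such a wrapper buys over a plain vector, here is a minimal sketch of a move-only queue with an Append function and a running row count. The class name `AccumulationQueueSketch`, the `Batch` struct, `batch_count()`, and the choice of `std::size_t` are assumptions for illustration; this is not the class from accumulation_queue.h.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for an exec batch.
struct Batch {
  std::size_t num_rows;
};

// Move-only accumulator: copying is disallowed so batches are never
// duplicated by accident, and the row count is maintained on Append.
class AccumulationQueueSketch {
 public:
  AccumulationQueueSketch() = default;
  AccumulationQueueSketch(const AccumulationQueueSketch&) = delete;
  AccumulationQueueSketch& operator=(const AccumulationQueueSketch&) = delete;

  AccumulationQueueSketch(AccumulationQueueSketch&& other) noexcept {
    *this = std::move(other);
  }
  AccumulationQueueSketch& operator=(AccumulationQueueSketch&& other) noexcept {
    if (this != &other) {
      batches_ = std::move(other.batches_);
      other.batches_.clear();  // leave the source in a well-defined empty state
      row_count_ = std::exchange(other.row_count_, 0);
    }
    return *this;
  }

  void Append(Batch batch) {
    row_count_ += batch.num_rows;
    batches_.push_back(std::move(batch));
  }

  std::size_t row_count() const { return row_count_; }
  std::size_t batch_count() const { return batches_.size(); }

 private:
  std::vector<Batch> batches_;
  std::size_t row_count_ = 0;
};

static_assert(!std::is_copy_constructible<AccumulationQueueSketch>::value,
              "the sketch is intentionally move-only");
```

A plain `std::vector<Batch>` would need the row count recomputed (or tracked separately) and would allow silent copies, which is the trade-off under discussion.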

westonpace

A few minor nits but this seems ready. Thanks for doing this, it's a good cleanup, above and beyond its potential utility for spillover. I find this easier to follow than the old implementation.

westonpace · 2022-06-09T10:41:23Z

cpp/src/arrow/compute/exec/accumulation_queue.h

It would be nice to get some consistency with how we use int64_t, uint64_t, and size_t within the engine. Do you have any convention suggestions?

I'd say we should probably be using size_t for most things that are non-negative. That guy on the mailing list had an epic struggle building for 32-bit because we were using int64_t where we should've used size_t.
BTW here I use size_t because that's what std::vector::size() returns.

I think row_count can never be negative either, so it should be a size_t. ExecBatch::length should probably be size_t as well.
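One hazard behind this convention question is that `int64_t` has a fixed width while `size_t` follows the platform, so conversions between the two can lose information on a 32-bit target. The helper below is a hypothetical illustration of a checked conversion (`CheckedToSize` is not an Arrow function), not a statement of what the engine does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Converts a signed 64-bit length to size_t.
// Returns false and leaves *out untouched if the value is negative or
// does not fit in size_t (possible where size_t is 32 bits wide).
bool CheckedToSize(std::int64_t value, std::size_t* out) {
  if (value < 0) return false;
  if (static_cast<std::uint64_t>(value) >
      std::numeric_limits<std::size_t>::max()) {
    return false;
  }
  *out = static_cast<std::size_t>(value);
  return true;
}
```

Picking one type for counts throughout would make this kind of boundary check unnecessary in most places, which is the consistency being asked about.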

cpp/src/arrow/compute/exec/accumulation_queue.h

cpp/src/arrow/compute/exec/hash_join_benchmark.cc

save-buffer · 2022-06-10T00:17:43Z

The only failures seem to be with S3FS (completely unrelated to hash join) and the tracing span thingy, which is being addressed in #13108.

westonpace · 2022-06-10T18:58:16Z

I wish I knew why this PR triggers the span thing to flare up. It doesn't seem that any of the code you have is touching spans. I suspect it is something rather subtle. At the moment appveyor passes on other PRs so if I merge this it will break things, even if there is nothing in this PR that is related.

I've proposed a solution on #13108 that should work. So hopefully we can get that merged pretty quick. Otherwise we'll have to figure out what change here is causing the span bug to trigger.

westonpace

Now that the span issue is resolved it seems MSVC is happy so let's merge.

github-actions bot added the Component: C++ label Jun 7, 2022

save-buffer force-pushed the sasha_spilling branch from 0ffad81 to 9810c0d Compare June 7, 2022 17:33

westonpace self-requested a review June 8, 2022 11:05

westonpace requested changes Jun 8, 2022

save-buffer force-pushed the sasha_spilling branch from 7cf02ea to 709084c Compare June 8, 2022 19:22

westonpace reviewed Jun 9, 2022

save-buffer force-pushed the sasha_spilling branch from 5ec4ff5 to b43cada Compare June 9, 2022 18:18

save-buffer force-pushed the sasha_spilling branch from b43cada to 8d29ddb Compare June 15, 2022 14:26

Move accumulation outside of HashJoinImpl

790089a

save-buffer force-pushed the sasha_spilling branch from 8d29ddb to 790089a Compare June 15, 2022 14:28

westonpace self-requested a review June 16, 2022 18:34

westonpace merged commit 8737123 into apache:master Jun 16, 2022

westonpace reviewed Jun 16, 2022

save-buffer deleted the sasha_spilling branch June 16, 2022 22:28

kou mentioned this pull request Dec 31, 2024

GH-45135: [C++] Remove useless "hash table ready" states in swiss join #45136

Merged

3 participants