[C++][Acero] Race condition in asof join causes execution to stall for large number of record batches

### Describe the bug, including details regarding any error messages, version, and platform.

* Version: Repro'd on `HEAD`, `v12.0.0`, and `v13.0.0`

I've encountered a subtle race condition in the asof join node. It shows up most often with large Parquet files that have many row groups:

1. The left-hand side of the asof join completes, so `InputFinished` proceeds as [expected](https://github.com/apache/arrow/blob/2455bc07e09cd5341d1fabdb293afbd07682f0b2/cpp/src/arrow/acero/asof_join_node.cc#L1323). So far so good.
2. The right hand table(s) of the join are a huge dataset scan. They're still streaming and can legally still call `AsofJoinNode::InputReceived` all they want ([doc ref](https://arrow.apache.org/docs/cpp/api/acero.html#_CPPv4N5arrow5acero8ExecNode13InputReceivedEP8ExecNode9ExecBatch))
3. Each input batch is blindly pushed to the `InputState`s, which in turn defer to `BackpressureHandler`s to decide whether to pause inputs. ([code pointer](https://github.com/apache/arrow/blob/2455bc07e09cd5341d1fabdb293afbd07682f0b2/cpp/src/arrow/acero/asof_join_node.cc#L1689))
4. If enough batches come in right after `EndFromProcessThread` is called, then we might exceed the [high_threshold](https://github.com/apache/arrow/blob/2455bc07e09cd5341d1fabdb293afbd07682f0b2/cpp/src/arrow/acero/asof_join_node.cc#L575) and tell the input node to pause via the [BackpressureController](https://github.com/apache/arrow/blob/2455bc07e09cd5341d1fabdb293afbd07682f0b2/cpp/src/arrow/acero/asof_join_node.cc#L540)
5. At this point, the asof join's process thread has already stopped, so the right-hand table(s) are never dequeued and `BackpressureController::Resume()` is never called. The paused input stays paused forever, which causes a [deadlock](https://arrow.apache.org/docs/cpp/api/acero.html#_CPPv4N5arrow5acero19BackpressureControl5PauseEv).
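To make the sequence above concrete, here is a toy, single-threaded model of steps 3 to 5. It is only a sketch and not Acero code: `ToySource`, `ToyInputQueue`, `SimulateStall`, and the threshold values are all hypothetical stand-ins. The point is just that once nothing pops the queue, a `Pause()` triggered by late batches is never matched by a `Resume()`.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>

// Hypothetical stand-in for an upstream node that honors pause/resume.
struct ToySource {
  bool paused = false;
  void Pause() { paused = true; }
  void Resume() { paused = false; }
};

// Hypothetical stand-in for one join input queue with backpressure thresholds.
class ToyInputQueue {
 public:
  ToyInputQueue(ToySource* source, std::size_t low, std::size_t high)
      : source_(source), low_(low), high_(high) {
    assert(low < high);
  }

  // Producer side: always enqueues, pauses the source at the high threshold.
  void Push(int batch) {
    queue_.push(batch);
    if (queue_.size() >= high_) source_->Pause();
  }

  // Consumer side: only here can the source ever be resumed.
  bool TryPop() {
    if (queue_.empty()) return false;
    queue_.pop();
    if (queue_.size() <= low_) source_->Resume();
    return true;
  }

 private:
  ToySource* source_;
  std::size_t low_;
  std::size_t high_;
  std::queue<int> queue_;
};

// Pushes `late_batches` batches (stopping once paused), then drains only if a
// process thread is still running. Returns true if the source is left paused.
bool SimulateStall(std::size_t late_batches, bool process_thread_running) {
  ToySource source;
  ToyInputQueue queue(&source, /*low=*/1, /*high=*/4);
  for (std::size_t i = 0; i < late_batches && !source.paused; ++i) {
    queue.Push(static_cast<int>(i));
  }
  while (process_thread_running && queue.TryPop()) {
  }
  return source.paused;
}
```

In this model, `SimulateStall(10, false)` leaves the source paused (the stall), while the same input with a live consumer, `SimulateStall(10, true)`, drains the queue and resumes it.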

I have hackily fixed this in a local checkout by storing an `atomic<bool>` that records whether `EndFromProcessThread` was called. Once it is `true`, [InputReceived](https://github.com/apache/arrow/blob/2455bc07e09cd5341d1fabdb293afbd07682f0b2/cpp/src/arrow/acero/asof_join_node.cc#L1682) short-circuits and returns `Status::OK()` without enqueueing the batch. In `EndFromProcessThread`, I also call `ResumeProducing` on all input nodes.

For good measure, I also call `StopProducing()` on all the inputs in `EndFromProcessThread`, though I don't know if it's necessary.
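Roughly, the shape of the workaround is something like the following. This is a simplified sketch rather than the actual patch: `ToyAsofJoin`, `ToyInput`, and the `bool` return value are hypothetical, and pausing/resuming is modeled as a plain flag instead of real calls into the input nodes. It also does not try to close the window between the flag check and the enqueue, which is why resuming every input at the end still matters.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stand-in for one join input: its queue plus a "paused" flag.
struct ToyInput {
  bool paused = false;
  std::deque<int> queue;
};

// Hypothetical, heavily simplified model of the join node with the workaround.
class ToyAsofJoin {
 public:
  ToyAsofJoin(std::size_t num_inputs, std::size_t high_threshold)
      : inputs_(num_inputs), high_(high_threshold) {}

  // Returns true if the batch was enqueued, false if it was dropped because
  // processing has already ended (the real code would return Status::OK()).
  bool InputReceived(std::size_t index, int batch) {
    assert(index < inputs_.size());
    if (process_ended_.load()) return false;  // short-circuit: do not enqueue
    ToyInput& input = inputs_[index];
    input.queue.push_back(batch);
    if (input.queue.size() >= high_) input.paused = true;  // backpressure
    return true;
  }

  // Marks processing as finished and resumes every input so that none is left
  // paused with nobody to drain its queue.
  void EndFromProcessThread() {
    process_ended_.store(true);
    for (ToyInput& input : inputs_) input.paused = false;
  }

  bool AnyPaused() const {
    for (const ToyInput& input : inputs_) {
      if (input.paused) return true;
    }
    return false;
  }

  std::size_t Queued(std::size_t index) const {
    return inputs_[index].queue.size();
  }

 private:
  std::vector<ToyInput> inputs_;
  std::size_t high_;
  std::atomic<bool> process_ended_{false};
};
```

With two inputs and a high threshold of 2, pushing two batches to input 1 pauses it; after `EndFromProcessThread()` nothing is paused, and any later `InputReceived` call is dropped without growing the queue or re-triggering a pause.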

Happy to submit a PR once I find bandwidth, but reporting this early in case others run into it.

### Component(s)

C++

[C++][Acero] Race condition in asof join causes execution to stall for large number of record batches #37796
