Description
A join is a pipeline breaker. I believe the proposed join operators assume that the data fits into memory, and they queue all incoming batches. For example, if I understand correctly, #11150 queues the right side until the left side has finished.
There are many clever and interesting ways that this can be optimized (divide & conquer, recursive query, prioritize reading the left side and pause the right side read). This issue is intentionally not clever or interesting.
Instead, I think it would be good to take advantage of this opportunity to start fleshing out our spillover capabilities. A very simplistic implementation could be a standalone node with 2 inputs and 2 outputs. The node queues up all incoming data on the "right" input and lets the "left" input pass through. Then, when the left input has finished, the node releases the right input.
This node could then implement a basic spillover mechanism (e.g. IPC to disk) and start to flesh out the abstractions that we will eventually want for handling different spillover strategies (abort on spill, spill to disk, and spill to S3 are all I can think of at the moment).
Reporter: Weston Pace / @westonpace
Assignee: Sasha Krassovsky / @save-buffer
Related issues:
- [C++] Support hash-join on larger than memory datasets (is superseded by)
Note: This issue was originally created as ARROW-14163. Please see the migration documentation for further details.