Closed
Description
Here is an overview of how I think we should implement support for equijoins, at least for the initial implementation.
- Read all batches from the left side of the join into a single Vec
- Create a map, something like HashMap<Vec, Vec<(usize,usize)>>, to map join keys to (batch, row) indices
- Iterate over this Vec and, for each row, add an entry to the hash map that maps the row's join keys to the index of the batch and row in the Vec
- For each input partition on the right side of the join, return an output partition that is an iterator/stream that:
  - For each input row, evaluates the join keys
  - Looks up those join keys in the hash map
  - If a match is found:
    - For each (batch, row) index, creates an output row that has the values from both the left and right rows and emits it
  - If no match is found:
    - Does not emit a row
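The steps above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the proposed implementation: `Row` and `Batch` are hypothetical stand-ins for Arrow record batches (rows of `i64` values rather than columnar arrays), `build_index` and `probe` are made-up names, and the join keys are given as column positions.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for record batches: a batch is a list of rows,
// and a row is a list of i64 values. Real batches are columnar.
type Row = Vec<i64>;
type Batch = Vec<Row>;

// Maps join-key values to the (batch index, row index) pairs that hold them.
type JoinIndex = HashMap<Vec<i64>, Vec<(usize, usize)>>;

/// Build side: scan every left batch and record where each key occurs.
fn build_index(left: &[Batch], left_keys: &[usize]) -> JoinIndex {
    let mut index = JoinIndex::new();
    for (b, batch) in left.iter().enumerate() {
        for (r, row) in batch.iter().enumerate() {
            let key: Vec<i64> = left_keys.iter().map(|&c| row[c]).collect();
            index.entry(key).or_default().push((b, r));
        }
    }
    index
}

/// Probe side: for one right-side partition, lazily yield one output row per
/// matching (batch, row) pair. Right rows without a match yield nothing.
fn probe<'a>(
    left: &'a [Batch],
    index: &'a JoinIndex,
    right: &'a Batch,
    right_keys: &'a [usize],
) -> impl Iterator<Item = Row> + 'a {
    right.iter().flat_map(move |right_row| {
        // Evaluate the join keys for this right-side row.
        let key: Vec<i64> = right_keys.iter().map(|&c| right_row[c]).collect();
        index
            .get(&key) // None when there is no match
            .into_iter()
            .flatten()
            .map(move |&(b, r)| {
                // Output row: left values followed by right values.
                let mut out = left[b][r].clone();
                out.extend_from_slice(right_row);
                out
            })
    })
}
```

Calling `build_index` once and then `probe` once per right-side partition mirrors the build/probe split described in the list. Because `probe` returns a lazy iterator, output rows are only produced as the caller consumes them.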
Reporter: Jorge Leitão / @jorgecarleitao
Assignee: Jorge Leitão / @jorgecarleitao
PRs and other links:
Note: This issue was originally created as ARROW-9555. Please see the migration documentation for further details.