@jorgecarleitao

Based on #8630, this adds support for StructArrays to the filter operation.

andygrove pushed a commit that referenced this pull request Nov 21, 2020
This PR is based on top of #8630 and contains a physical node to perform an inner join in DataFusion.

This is still a draft, but IMO the design is here and the two tests already pass.

This is co-authored with @andygrove, who contributed to the design of how to perform this operation in the context of DataFusion (see ARROW-9555 for details).

The API used for the computation of the join at the arrow level is briefly discussed in [this document](https://docs.google.com/document/d/1KKuBvfx7uKi-x-tWOL60R1FNjDW8B790zPAv6yAlYcU/edit).

There is still a lot to work on, but I thought it would be a good time to have a first round of discussions, and also to gauge timings with respect to the 3.0 release.

There are two main issues being addressed in this PR:

* How do we perform the join at the partition level: this PR collects all batches from the left, and then issues a stream per partition on the right. Each batch on that stream joins itself with all the batches from the left (N) via a hash. This allows us to compute the hash of each row only once (first for all rows on the left, then batch by batch on the right).

* How do we build an array from `N (left)` arrays and a set of indices (matching the hash from the right): this is done using the `MutableArrayData` being worked on in #8630, which incrementally memcopies slots from each of the N arrays based on the index. This implementation is useful because it works for all array types and does not require casting anything to Rust native types (i.e. it operates on `ArrayData`, not specific implementations).
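
The two points above can be sketched with plain standard-library types. This is only an illustration, not the code in this PR: `Vec<i64>` stands in for a key column of a record batch, `Vec<T>` stands in for an Arrow array, and the names `LeftRow`, `build_left_index`, `probe` and `gather` are hypothetical. In particular, `gather` copies typed values, whereas `MutableArrayData` memcopies slots at the `ArrayData` level.

```rust
use std::collections::HashMap;

/// Position of a row on the left side: (batch index, row index within that batch).
/// Hypothetical name, used only for this sketch.
type LeftRow = (usize, usize);

/// Build phase: visit every key of every left batch once and record where it lives.
fn build_left_index(left_batches: &[Vec<i64>]) -> HashMap<i64, Vec<LeftRow>> {
    let mut index: HashMap<i64, Vec<LeftRow>> = HashMap::new();
    for (batch_idx, batch) in left_batches.iter().enumerate() {
        for (row_idx, key) in batch.iter().enumerate() {
            index.entry(*key).or_default().push((batch_idx, row_idx));
        }
    }
    index
}

/// Probe phase: for a single right batch, look up each key once and
/// return every matching (left row, right row) pair of an inner join.
fn probe(index: &HashMap<i64, Vec<LeftRow>>, right_batch: &[i64]) -> Vec<(LeftRow, usize)> {
    let mut matches = Vec::new();
    for (right_row, key) in right_batch.iter().enumerate() {
        if let Some(left_rows) = index.get(key) {
            for left_row in left_rows {
                matches.push((*left_row, right_row));
            }
        }
    }
    matches
}

/// Build one output column from N left arrays and a list of (batch, row) indices.
/// Stand-in for the slot-copying role that `MutableArrayData` plays in the PR.
fn gather<T: Clone>(arrays: &[Vec<T>], indices: &[LeftRow]) -> Vec<T> {
    indices
        .iter()
        .map(|&(batch_idx, row_idx)| arrays[batch_idx][row_idx].clone())
        .collect()
}
```

In this sketch the left index is built once and can be shared by every stream on the right, so each right batch only pays for its own lookups.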

There are still some steps left to have a join in SQL, most notably the whole logical planning, the output_partition logic, the bindings to the SQL and DataFrame APIs, updating the optimizers to handle nodes with 2 children, and a whole battery of tests.

There is also a natural path for the other joins, as it will be a matter of incorporating the work already in PR #8689, which introduces the option to extend the `MutableArrayData` with nulls, the operation required for left and right joins.
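
To illustrate why extending with nulls is the missing piece for left and right joins, here is a minimal sketch using plain `Vec` and `Option` rather than the Arrow API. The function name `gather_nullable` is hypothetical and is not part of #8689; `None` in the index list stands for a row with no match on the other side.

```rust
/// Gather values by optional index: `Some(i)` copies slot `i`,
/// `None` emits a null slot (an unmatched row in a left or right join).
fn gather_nullable<T: Clone>(values: &[T], indices: &[Option<usize>]) -> Vec<Option<T>> {
    indices
        .iter()
        .map(|&idx| idx.map(|i| values[i].clone()))
        .collect()
}
```

With an inner join every index is `Some`, so the null case never arises; outer joins are the first place where the output must contain slots that come from neither input.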

Closes #8709 from jorgecarleitao/join2

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>
@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 25, 2020
@jorgecarleitao jorgecarleitao deleted the filter_struct branch December 2, 2020 17:17

Labels

Component: Rust needs-rebase A PR that needs to be rebased by the author
