Skip to content

Conversation

@jorgecarleitao
Copy link
Member

This is written on top of #7687, so we should merge the other first.

This does not include any optimization to actually use this operation. We need to work out in a future PR.

@github-actions
Copy link

@jorgecarleitao jorgecarleitao changed the title ARROW-9420 [Rust][DataFusion] Added repartion physical plan WIP - ARROW-9420 [Rust][DataFusion] Added repartion physical plan Jul 13, 2020
@jorgecarleitao jorgecarleitao changed the title WIP - ARROW-9420 [Rust][DataFusion] Added repartion physical plan ARROW-9420 [Rust][DataFusion] Added repartion physical plan Jul 13, 2020
@jorgecarleitao jorgecarleitao marked this pull request as draft July 13, 2020 15:03
@jorgecarleitao
Copy link
Member Author

One thing that is not clear to me yet is the idiom to handle RecordBatch and partitions. My understanding is that a Partition can be executed in parallel (thread), but a RecordBatch is generally executed on the same thread, i.e. we normally loop through each RecordBatch using the same thread.

Is the goal of RecordBatch to split a partition in smaller chunks of data to avoid too much memory usage?

In this PR, I have not merged all the RecordBatches within a given partition in a single batch, and instead kept them separate. I am not sure if this is the correct approach here.

@jorgecarleitao
Copy link
Member Author

Another point of contest here is that I have not tested what happens to rows with one key whose value is null.

@andygrove andygrove changed the title ARROW-9420 [Rust][DataFusion] Added repartion physical plan ARROW-9420: [Rust][DataFusion] Added repartion physical plan Jul 13, 2020
@jorgecarleitao jorgecarleitao marked this pull request as ready for review July 13, 2020 20:14
This reduces

* the runtime complexity of this operation from O(N*(1 + M)) to O(N*M)
(N=number of rows, M=number of aggregations),
* the memory footprint from O(N*M) acumulators to O(M) accumulators
* the code complexity via DRY.
This is currently not very performant as there is no threading
involved, but it is nevertheless an important plan to have.
@jorgecarleitao
Copy link
Member Author

I don't think we are ready to take repartitioning at this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants