Ballista: Implement scalable distributed joins #634
Merged
Dandandan merged 2 commits into apache:master on Jul 4, 2021
Member
Author
@edrevo fyi
Dandandan
reviewed
Jul 3, 2021
.with_repartition_joins(false)
.with_repartition_aggregations(false)
.with_physical_optimizer_rules(rules);
let config = ExecutionConfig::new().with_concurrency(2); // TODO: this is hack to enable partitioned joins
Contributor
What is the idea here for later? I guess the repartitioning needs to be applied with concurrency=1 too to avoid inefficient plans?
Member
Author
I filed https://github.com/apache/arrow-datafusion/issues/661 to discuss this
jorgecarleitao
approved these changes
Jul 3, 2021
Member
jorgecarleitao
left a comment
Ready to merge; very neat solution! 💯
Which issue does this PR close?
Closes #63.
This PR removes previous hacks around partitioning and now faithfully translates the DataFusion query plan, including RepartitionExec. I have tested with TPC-H query 12 and see consistent results between DataFusion and Ballista with the 100GB data set, where each table has 8 partitions. I have tested with multiple executors as well as with a single executor.
There is more work to do, but I think this is at a good point to merge since it fixes some correctness issues.
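For readers unfamiliar with what a hash repartition (shuffle) does, here is a minimal standalone sketch using only the Rust standard library. It is not the Ballista or DataFusion implementation; `partition_for` and `shuffle` are hypothetical helpers, and the choice of `DefaultHasher` is an assumption made for illustration only.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: choose the output partition for a key by hashing it.
fn partition_for<K: Hash>(key: &K, num_partitions: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % num_partitions as u64) as usize
}

/// Hypothetical helper: redistribute rows from any number of input partitions
/// into `num_partitions` output partitions, so that rows with equal keys
/// always end up in the same output partition.
fn shuffle<K: Hash + Clone, V: Clone>(
    inputs: &[Vec<(K, V)>],
    num_partitions: usize,
) -> Vec<Vec<(K, V)>> {
    let mut outputs = vec![Vec::new(); num_partitions];
    for input in inputs {
        for (key, value) in input {
            outputs[partition_for(key, num_partitions)].push((key.clone(), value.clone()));
        }
    }
    outputs
}

fn main() {
    // Two input partitions; key 1 appears in both of them.
    let inputs = vec![
        vec![(1u32, "a"), (2u32, "b")],
        vec![(1u32, "c"), (3u32, "d")],
    ];
    let outputs = shuffle(&inputs, 4);
    for (i, part) in outputs.iter().enumerate() {
        println!("output partition {}: {:?}", i, part);
    }
}
```

Because both sides of a join can be shuffled on the join key with the same function, each output partition can then be joined independently of the others.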
Rationale for this change
Without this change Ballista cannot scale well, because each partition currently loads the entire left side of the join into memory, so that work is duplicated across all partitions.
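The cost difference can be illustrated with a small standalone sketch (standard library only, not Ballista code). `join_collect_left` and `join_partitioned` are hypothetical names; each returns the number of matching pairs together with the total number of left-side rows loaded into hash tables across all tasks.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: map a join key to one of `n` partitions.
fn partition_of(key: u64, n: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() % n as u64) as usize
}

/// Every task (one per right-side partition) builds a hash table over the
/// whole left side. Returns (matching pairs, left rows loaded in total).
fn join_collect_left(left: &[(u64, u64)], right_parts: &[Vec<(u64, u64)>]) -> (usize, usize) {
    let (mut matches, mut loaded) = (0, 0);
    for part in right_parts {
        let mut table: HashMap<u64, usize> = HashMap::new();
        for (key, _) in left {
            *table.entry(*key).or_insert(0) += 1;
            loaded += 1;
        }
        for (key, _) in part {
            matches += table.get(key).copied().unwrap_or(0);
        }
    }
    (matches, loaded)
}

/// Both sides are partitioned on the join key; each task only builds a hash
/// table over its own slice of the left side.
fn join_partitioned(left: &[(u64, u64)], right: &[(u64, u64)], n: usize) -> (usize, usize) {
    let (mut matches, mut loaded) = (0, 0);
    for p in 0..n {
        let mut table: HashMap<u64, usize> = HashMap::new();
        for (key, _) in left.iter().filter(|(k, _)| partition_of(*k, n) == p) {
            *table.entry(*key).or_insert(0) += 1;
            loaded += 1;
        }
        for (key, _) in right.iter().filter(|(k, _)| partition_of(*k, n) == p) {
            matches += table.get(key).copied().unwrap_or(0);
        }
    }
    (matches, loaded)
}

fn main() {
    let left = vec![(1, 10), (2, 20), (3, 30), (4, 40)];
    let right = vec![(1, 100), (1, 101), (2, 200), (5, 500)];
    let right_parts = vec![right[..2].to_vec(), right[2..].to_vec()];
    println!("collect-left: {:?}", join_collect_left(&left, &right_parts));
    println!("partitioned:  {:?}", join_partitioned(&left, &right, 2));
}
```

Both strategies find the same matches, but in the first the number of left rows loaded grows with the number of tasks, while in the second each left row is loaded exactly once.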
What changes are included in this PR?
RepartitionExec nodes are kept in Ballista query plans and translated to shuffles.
Are there any user-facing changes?
Query plans will change.