Spark 3.2 and 3.3: Use Rebalance instead of Repartition for distribution in SparkWrite #7932
Conversation
It seems it is
@bowenliang123 Looks like REBALANCE_PARTITIONS_BY_COL does not have range partitioner support.
Sorry, I pasted the wrong one as a
Yes, you are right. RebalancePartitions only supports RoundRobinPartitioning and HashPartitioning. I initiated this PR as a workaround in my case to reduce the number of written data files dramatically (366 files -> 45 files). It might not be perfect for satisfying range partitioning support and the semantics of
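For illustration of the same idea outside this PR's diff, a minimal sketch using Spark's REBALANCE hint (available in Spark 3.2+ when AQE is enabled) is shown below. It assumes an active SparkSession named spark; the table and column names are hypothetical, not taken from this PR.

```scala
// Minimal sketch, not this PR's code: approximate the effect with the REBALANCE
// hint. Assumes an active SparkSession `spark`; table/column names are hypothetical.
spark.conf.set("spark.sql.adaptive.enabled", "true") // AQE must be on for coalescing

// REBALANCE produces REBALANCE_PARTITIONS_BY_COL in the physical plan, letting
// AQE merge tiny shuffle partitions (and split skewed ones) before the write.
spark.sql(
  """INSERT INTO db.target_table
    |SELECT /*+ REBALANCE(lab_numr, busi_date) */ *
    |FROM db.source_table""".stripMargin)
```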
@bowenliang123 @ConeyLiu I understand that REBALANCE_PARTITIONS_BY_COL adds an adaptive coalesce (AQE), which just coalesces the partitions local to an executor (hence reducing the number of files written). Is this effective if the partitions are spread across different workers, since the partitions won't be local anymore (for coalesce to work)?
Since RebalancePartitions introduces a shuffle read stage, I think it works for partitions across worker nodes. @namrathamyske
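A rough way to check this point is to look at the physical plan: the rebalance shows up as a shuffle (Exchange), which AQE then re-optimizes at runtime, so it is not limited to executor-local data. The sketch below assumes an active SparkSession named spark and a hypothetical table name.

```scala
// Sketch only: confirm that the rebalance appears as a shuffle Exchange in the
// physical plan. With AQE enabled, this exchange is re-optimized at runtime so
// small shuffle partitions are coalesced even when the input spans many workers.
val df = spark.sql("SELECT /*+ REBALANCE(busi_date) */ * FROM db.source_table")
df.explain(true) // look for: Exchange hashpartitioning(...), REBALANCE_PARTITIONS_BY_COL
```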
@bowenliang123 Can we merge this to master by having a flag called "strictDistributionRequired", similar to https://github.com/apache/spark/blob/453300b418bc03511ad9167bbaad49e0f1f1c090/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DistributionAndOrderingUtils.scala#L63, for the rebalance to be applied?
Yes, I have noticed these changes in Spark 3.4, and backporting them to 3.3 is an approach worth considering.
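A rough sketch (not this PR's diff) of what such a flag could look like when building the write distribution, mirroring the shape of Spark's DistributionAndOrderingUtils, is shown below. The object name, method signature, and the strictDistributionRequired flag here are assumptions for illustration only.

```scala
// Sketch only: a strictDistributionRequired-style switch between a strict
// repartition and a rebalance when applying a clustered write distribution.
// Names, signature, and flag are assumptions, not this PR's code.
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, RebalancePartitions, RepartitionByExpression}

object DistributionSketch {
  def applyClusteredDistribution(
      clustering: Seq[Expression],
      numShufflePartitions: Int,
      strictDistributionRequired: Boolean,
      query: LogicalPlan): LogicalPlan = {
    if (strictDistributionRequired) {
      // Strict: exactly the requested hash distribution and partition count.
      RepartitionByExpression(clustering, query, numShufflePartitions)
    } else {
      // Relaxed: RebalancePartitions lets AQE coalesce or split partitions,
      // which avoids writing many tiny data files.
      RebalancePartitions(clustering, query)
    }
  }
}
```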
I do not have a clue how to fix the failures in the GA tests, or where and why they fail. I may need some help with this.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |

Example execution plan for an insert into an Iceberg table:
explain EXTENDED insert into gfpersonas_platform.t_ptr_label_ice_bowen select * from gfpersonas_platform.t_ptr_label_ice;
Before: 326 data files written, 3 KB+ per file
After: 45 data files written, ~22 MB per file
Having REBALANCE_PARTITIONS_BY_COL in the exchange: Exchange hashpartitioning(lab_numr#44, busi_date#45, 200), REBALANCE_PARTITIONS_BY_COL
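One way to verify the file-count effect after a write (not part of this PR) is to query Iceberg's standard files metadata table. The sketch below assumes an active SparkSession named spark with the Iceberg catalog configured; the table name is hypothetical.

```scala
// Sketch only: count data files and compute their average size via the Iceberg
// `files` metadata table. Assumes `spark` is active; the table name is hypothetical.
spark.sql(
  """SELECT count(*)                AS data_files,
    |       avg(file_size_in_bytes) AS avg_bytes_per_file
    |FROM db.target_table.files""".stripMargin).show()
```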