[HUDI-6863] Revert auto-tuning of dedup parallelism #9722
Conversation
Let's revisit the problems #6802 was tackling. The main issue it was addressing is making our shuffle parallelism dynamic and relative to the incoming df's number of partitions. So, if someone is running 1000s of pipelines, they don't need to statically set the right value for shuffle parallelism for each of the 1000 pipelines. Can you help me understand what issue we are hitting that warrants reverting it?
This PR does not revert the dynamic determination of the shuffle parallelism. The decided target shuffle parallelism is passed in with "…".
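For context, here is a minimal sketch of the idea being discussed, assuming a hypothetical helper name (this is not the actual Hudi code path): derive the dedup/shuffle parallelism from the incoming DataFrame's partition count only when no explicit value is configured.

```scala
// Illustrative sketch only -- not the actual Hudi implementation.
// The idea behind #6802: when the user has not configured a shuffle
// parallelism, fall back to the incoming DataFrame's partition count
// instead of a hard-coded default. `resolveDedupParallelism` is a
// hypothetical helper name used for illustration.
import org.apache.spark.sql.DataFrame

def resolveDedupParallelism(df: DataFrame, configured: Option[Int]): Int =
  configured.filter(_ > 0).getOrElse(df.rdd.getNumPartitions)
```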
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
nsivabalan left a comment
1 minor comment; source code changes look good.

Change Logs
Before this PR, the auto-tuning logic for dedup parallelism dictates the write parallelism, so the user-configured `hoodie.upsert.shuffle.parallelism` is ignored. This PR reverts #6802 to fix the issue.
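As a usage illustration (a hedged sketch, not code from this PR): this is how a user would explicitly set `hoodie.upsert.shuffle.parallelism` on a Hudi datasource write. With the reverted auto-tuning in place, this value was ignored; after this PR it takes effect again. The table name, path, and field names below are placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical example: table name, base path, and field names are placeholders.
def upsertWithExplicitParallelism(df: DataFrame, basePath: String): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.table.name", "example_table")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    // Explicit shuffle parallelism for the upsert/dedup stage,
    // respected again once the auto-tuning is reverted.
    .option("hoodie.upsert.shuffle.parallelism", "200")
    .mode(SaveMode.Append)
    .save(basePath)
}
```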
Impact
Performance fix
Risk level
low
Documentation Update
N/A
Contributor's checklist