Rename fanout configs #1877
package org.apache.iceberg.spark;

public class SparkWriteOptions {
  public static final String FANOUT_ENABLED = "fanout-enabled";
Do we want to add the other write options in this PR or keep the refactor separate?
@rdblue @aokolnychyi, can you please help review the change?
/**
 * Spark DF write options
 */
public class SparkWriteOptions {
Thanks, I like adding a class for these.
@rdblue @karuppayya, shall we add SparkReadOptions and move all read and write options to those classes? I think we need a place for them. In the future, it would be great to build a utility class that takes into account table properties, SQL conf, and write/read options and provides the actual value to be used. That being said, I think we should start simple, and SparkReadOptions and SparkWriteOptions sound reasonable to me.
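To make the utility-class idea concrete, here is a minimal sketch of how such a resolver might look. It is not Iceberg code: the class name `WriteConfSketch`, the session conf key, the precedence order (write option, then session conf, then table property), and the `false` default are all assumptions made for illustration. Only the `fanout-enabled` option and the `write.spark.partitioned.fanout.enabled` table property come from this PR.

```java
import java.util.Map;

// Hypothetical helper, not part of Iceberg: resolves the effective value of a
// setting from write options, session conf, and table properties.
final class WriteConfSketch {
  // Names introduced by this PR.
  static final String FANOUT_OPTION = "fanout-enabled";
  static final String FANOUT_TABLE_PROPERTY = "write.spark.partitioned.fanout.enabled";
  // Assumption: an illustrative session conf key; no such key is defined in this PR.
  static final String FANOUT_SESSION_CONF = "example.session.fanout-enabled";

  private final Map<String, String> writeOptions;
  private final Map<String, String> sessionConf;
  private final Map<String, String> tableProperties;

  WriteConfSketch(
      Map<String, String> writeOptions,
      Map<String, String> sessionConf,
      Map<String, String> tableProperties) {
    this.writeOptions = writeOptions;
    this.sessionConf = sessionConf;
    this.tableProperties = tableProperties;
  }

  boolean fanoutEnabled() {
    // Assumption: fanout is off unless something enables it.
    return resolveBoolean(FANOUT_OPTION, FANOUT_SESSION_CONF, FANOUT_TABLE_PROPERTY, false);
  }

  // Assumed precedence: the most specific source wins.
  private boolean resolveBoolean(
      String optionKey, String confKey, String propertyKey, boolean defaultValue) {
    String value = writeOptions.get(optionKey);
    if (value == null) {
      value = sessionConf.get(confKey);
    }
    if (value == null) {
      value = tableProperties.get(propertyKey);
    }
    return value == null ? defaultValue : Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    // The write option disables fanout even though the table property enables it.
    WriteConfSketch conf = new WriteConfSketch(
        Map.of(FANOUT_OPTION, "false"),
        Map.of(),
        Map.of(FANOUT_TABLE_PROPERTY, "true"));
    System.out.println(conf.fanoutEnabled()); // prints "false"
  }
}
```

Keeping the precedence in one private method means every setting would be resolved the same way, which seems to be the main benefit of the proposed utility.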
spark/src/test/java/org/apache/iceberg/spark/source/TestSparkDataWrite.java
If you don't mind, this might be a good chance to update the guide on partitioned writes in Spark: https://github.com/apache/iceberg/blob/master/site/docs/spark.md#writing-against-partitioned-table The fanout writer can be an alternative to an explicit sort, so it would be good to mention it there. If you'd rather just finish the code change, I can take care of the docs.
Thanks @karuppayya! Looks good so I merged it. @HeartSaVioR, let's update the docs separately. I agree that it would probably be helpful, but I think we want to be careful about how we document this. This is primarily for streaming and I would not recommend people avoid adding a sort for batch work. If you're interested in working on the docs, that would be great! Thank you!
Yeah, I'm happy to document it. Thanks! One thing I wonder is how helpful, or even harmful, it is to have a table property for this, since we agree that it can be dangerous for batch queries if users don't notice the behavior (depending on the cardinality of partitions). I also commented on this in the previous PR, and it was merged without an answer. For now the docs don't explain what the fanout writer is, so users have no idea what it does and are wary of enabling it. But once we document the behavior without a proper warning, they may assume it is good for all cases and set the table property. (I guess this is why you're concerned about documenting it; do I understand correctly?) Personally, I'm not 100% sure we want a table property that is only good for a specific workload, but at least we could document this with a warning that it keeps a file open for every partition present in each task's data, so it is only recommended for streaming writes (without mentioning the table property here). WDYT?
@HeartSaVioR, do you happen to have a link to the open comment from the previous PR?
I am not against having a table property, but I think we should clearly document the concern of enabling this and discourage using fanout writers in batch jobs. The cost of a local sort is minimal and Spark does this implicitly in batch jobs for built-in V1 tables. I think the documentation should be clear about the potential consequences and give a good example.
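The trade-off discussed above can be illustrated with a toy model. This is only a sketch, not Iceberg's writer implementation: `OpenWriterSketch` and its methods are hypothetical, and partitions are represented as plain strings. It counts how many partition writers are open at once in a single task, first when every writer is kept open until the task ends (fanout), then when the input is sorted by partition so the previous writer can be closed as soon as the key changes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model (not Iceberg code) of how many partition writers a task holds open.
final class OpenWriterSketch {

  // Fanout: a writer is opened per distinct partition and stays open until the task ends,
  // so the peak equals the number of distinct partitions in the task's data.
  static int peakOpenFanout(List<String> partitionKeys) {
    Set<String> open = new HashSet<>();
    for (String key : partitionKeys) {
      open.add(key);
    }
    return open.size();
  }

  // Clustered: the input must be grouped by partition; the current writer is closed
  // when the key changes, so at most one writer is open at a time.
  static int peakOpenClustered(List<String> partitionKeys) {
    Set<String> closed = new HashSet<>();
    String current = null;
    int peak = 0;
    for (String key : partitionKeys) {
      if (!key.equals(current)) {
        if (closed.contains(key)) {
          // Seeing a partition again after closing it means the input was not sorted.
          throw new IllegalStateException("Input is not clustered by partition: " + key);
        }
        if (current != null) {
          closed.add(current);
        }
        current = key;
        peak = 1;
      }
    }
    return peak;
  }

  public static void main(String[] args) {
    List<String> unsorted = List.of("2020-12-01", "2020-12-02", "2020-12-01", "2020-12-03");
    List<String> sorted = new ArrayList<>(unsorted);
    Collections.sort(sorted); // stands in for the local sort

    System.out.println(peakOpenFanout(unsorted));  // prints 3
    System.out.println(peakOpenClustered(sorted)); // prints 1
  }
}
```

In this model the fanout peak grows with partition cardinality, which is the concern for batch jobs, while the sorted path stays at one open writer regardless of how many partitions the task touches.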
Thanks for working on this, @karuppayya!
@aokolnychyi I agree we should probably discourage using the option in the docs. Users would probably only want to set the table property when their table is written only by streaming queries.
Raised PR #1929 to document fanout.
Renaming the fanout configs:
- partitioned.fanout.enabled to fanout-enabled (similar to other Spark options)
- write.partitioned.fanout.enabled to write.spark.partitioned.fanout.enabled, as it applies only to Spark.
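For anyone carrying the old names in existing settings, the rename above could be applied mechanically. The following is a hypothetical sketch only: `FanoutKeyRenameSketch` is not part of Iceberg, and for brevity it treats write options and table properties as one flat map, which is a simplification. The old and new key names are the ones listed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Iceberg: rewrites the old fanout keys to the new names.
final class FanoutKeyRenameSketch {
  // Old name -> new name, as listed in this PR.
  static final Map<String, String> RENAMED = Map.of(
      "partitioned.fanout.enabled", "fanout-enabled",
      "write.partitioned.fanout.enabled", "write.spark.partitioned.fanout.enabled");

  static Map<String, String> rename(Map<String, String> settings) {
    Map<String, String> result = new LinkedHashMap<>(settings);
    for (Map.Entry<String, String> entry : RENAMED.entrySet()) {
      String oldValue = result.remove(entry.getKey());
      if (oldValue != null) {
        // Assumption: if the new key is already set, it takes priority over the old one.
        result.putIfAbsent(entry.getValue(), oldValue);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> updated = rename(Map.of("write.partitioned.fanout.enabled", "true"));
    System.out.println(updated); // prints {write.spark.partitioned.fanout.enabled=true}
  }
}
```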