Rename fanout configs #1877

karuppayya · 2020-12-04T17:38:38Z

Renaming the fanout configs.

Renamed the DataFrame write option from partitioned.fanout.enabled to fanout-enabled (similar to other Spark options).
Renamed the table property write.partitioned.fanout.enabled to write.spark.partitioned.fanout.enabled, as it applies only to Spark.
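
For anyone carrying the old keys in job configs, the rename above could be applied mechanically. The following is a minimal sketch, not code from this PR: `FanoutConfigRename` and its `rename` method are hypothetical names, and the only facts taken from the description are the four key strings.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Iceberg: moves a pre-rename key to its new name.
final class FanoutConfigRename {
  // Old and new names exactly as listed in the PR description.
  static final String OLD_WRITE_OPTION = "partitioned.fanout.enabled";
  static final String NEW_WRITE_OPTION = "fanout-enabled";
  static final String OLD_TABLE_PROPERTY = "write.partitioned.fanout.enabled";
  static final String NEW_TABLE_PROPERTY = "write.spark.partitioned.fanout.enabled";

  private FanoutConfigRename() {
  }

  // Returns a copy of conf with oldKey removed and its value stored under newKey.
  // If newKey is already set, that value is kept and the old one is dropped.
  static Map<String, String> rename(Map<String, String> conf, String oldKey, String newKey) {
    Map<String, String> result = new HashMap<>(conf);
    String oldValue = result.remove(oldKey);
    if (oldValue != null) {
      result.putIfAbsent(newKey, oldValue);
    }
    return result;
  }
}
```

Calling `FanoutConfigRename.rename(options, OLD_WRITE_OPTION, NEW_WRITE_OPTION)` leaves the input map untouched and returns the migrated copy; the same call with the table property constants covers the second rename.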

karuppayya · 2020-12-04T17:40:00Z

spark/src/main/java/org/apache/iceberg/spark/SparkWriteOptions.java

+package org.apache.iceberg.spark;
+
+public class SparkWriteOptions {
+    public static final String FANOUT_ENABLED = "fanout-enabled";


Do we want to add the other write options in this PR or keep the refactor separate?

karuppayya · 2020-12-05T00:27:16Z

@rdblue @aokolnychyi Can you please help review the change?

site/docs/configuration.md

rdblue · 2020-12-05T01:20:52Z

spark/src/main/java/org/apache/iceberg/spark/SparkWriteOptions.java

+/**
+ * Spark DF write options
+ */
+public class SparkWriteOptions {


Thanks, I like adding a class for these.

@rdblue @karuppayya, shall we add SparkReadOptions and move all read and write options to those classes? I think we need a place for them. In the future, it would be great to build a utility class that takes table properties, SQL conf, and write/read options into account and provides the actual value to use. That being said, I think we should start simple, and SparkReadOptions and SparkWriteOptions sound reasonable to me.
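
To make the utility-class idea concrete, here is a minimal sketch of how such a lookup might resolve the fanout setting. Everything about it is an assumption for illustration: the class name `FanoutSettingResolver`, the precedence (write option first, then table property), and the `false` default are not taken from this PR, and the SQL conf layer is left out to keep it short.

```java
import java.util.Map;

// Hypothetical resolver, not the Iceberg implementation.
final class FanoutSettingResolver {
  // Key names after the rename in this PR.
  static final String WRITE_OPTION = "fanout-enabled";
  static final String TABLE_PROPERTY = "write.spark.partitioned.fanout.enabled";
  // Assumed default: fanout stays off unless explicitly enabled.
  static final boolean DEFAULT_ENABLED = false;

  private FanoutSettingResolver() {
  }

  // Assumed precedence: a per-write option overrides the table property,
  // and the default applies only when neither is set.
  static boolean fanoutEnabled(Map<String, String> writeOptions, Map<String, String> tableProperties) {
    String fromOption = writeOptions.get(WRITE_OPTION);
    if (fromOption != null) {
      return Boolean.parseBoolean(fromOption);
    }
    String fromTable = tableProperties.get(TABLE_PROPERTY);
    if (fromTable != null) {
      return Boolean.parseBoolean(fromTable);
    }
    return DEFAULT_ENABLED;
  }
}
```

Keeping the precedence in one place like this would let a batch job override a table-level setting for a single write without touching the table.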

spark/src/test/java/org/apache/iceberg/spark/source/TestSparkDataWrite.java

HeartSaVioR · 2020-12-05T04:19:11Z

If you don't mind, this might be a good chance to update the guide on partitioned write in Spark: https://github.com/apache/iceberg/blob/master/site/docs/spark.md#writing-against-partitioned-table

The fanout writer can be an alternative to an explicit sort, so it would be good to add it there. If you'd rather just finalize the code change, I can take care of the docs.

rdblue · 2020-12-05T22:44:34Z

Thanks @karuppayya! Looks good so I merged it.

@HeartSaVioR, let's update the docs separately. I agree that it would probably be helpful, but I think we want to be careful about how we document this. This is primarily for streaming and I would not recommend people avoid adding a sort for batch work. If you're interested in working on the docs, that would be great! Thank you!

HeartSaVioR · 2020-12-06T00:47:27Z

Yeah I'm happy to document it. Thanks!

One thing I wonder is whether having a table property for this helps or actually hurts, since we agree that this could be dangerous for batch queries if users don't notice the behavior (depending on the cardinality of partitions). I also raised this on the previous PR, and it was merged without an answer. For now the docs don't explain what the fanout writer is, so users have no idea what it does and are afraid to enable it. But once we document the behavior without a proper warning, they may misread it as good for all cases and update the table property. (I guess this is why you're cautious about documenting it. Do I understand correctly?)

Personally I'm not 100% sure we want a table property that is only good for a specific workload, but at least we could document this with a warning that each task keeps one file open per partition present in its data, so it is only recommended for streaming writes (without mentioning the table property here). WDYT?
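
The concern can be illustrated with a toy model of a single write task. This is a simplified simulation under stated assumptions, not Iceberg's writer code: `OpenWriterSimulation` is a hypothetical class, the fanout model keeps one writer open per distinct partition until the task ends, and the clustered model closes the current writer whenever the partition key changes and rejects input that revisits a closed partition.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of open writers in one task; not Iceberg code.
final class OpenWriterSimulation {
  private OpenWriterSimulation() {
  }

  // Fanout model: every distinct partition key keeps a writer open until the task ends,
  // so the peak equals the number of distinct partitions in the task's data.
  static int maxOpenFanout(List<String> partitionKeys) {
    Set<String> open = new HashSet<>(partitionKeys);
    return open.size();
  }

  // Clustered model: only the current partition's writer is open.
  // Input must be grouped by partition (for example, by a local sort).
  static int maxOpenClustered(List<String> partitionKeys) {
    Set<String> closed = new HashSet<>();
    String current = null;
    int maxOpen = 0;
    for (String key : partitionKeys) {
      if (!key.equals(current)) {
        if (closed.contains(key)) {
          throw new IllegalStateException("Input is not clustered by partition: " + key);
        }
        if (current != null) {
          closed.add(current);
        }
        current = key;
        maxOpen = 1;
      }
    }
    return maxOpen;
  }
}
```

In this model the fanout peak grows with partition cardinality while the clustered peak stays at one, which is the trade-off behind recommending fanout mainly for streaming writes where adding a sort is not practical.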

aokolnychyi · 2020-12-07T11:07:32Z

@HeartSaVioR, do you happen to have a link to the open comment from the previous PR?

aokolnychyi · 2020-12-07T11:11:48Z

I am not against having a table property but I think we should clearly document the concern of enabling this and discourage using fanout writers in batch jobs. The cost of a local sort is minimal and Spark does this implicitly in batch jobs for built-in V1 tables.

I think the documentation should be clear about the potential consequences and give a good example.

aokolnychyi · 2020-12-07T11:12:36Z

Thanks for working on this, @karuppayya!

HeartSaVioR · 2020-12-07T12:18:21Z

@aokolnychyi
Here it is. #1774 (comment)

I agree we could discourage using the option in the docs. Users would probably only want to set the table property when their table is written exclusively by streaming queries.

HeartSaVioR · 2020-12-14T06:34:19Z

Raised a PR #1929 for documenting fanout.

Rename fanout configs

047affc

karuppayya commented Dec 4, 2020

Add code comments

9a80a58

github-actions bot added core docs spark labels Dec 4, 2020

Fix checkstyle

779e918

rdblue reviewed Dec 5, 2020

site/docs/configuration.md (outdated, resolved)

rdblue reviewed Dec 5, 2020

spark/src/test/java/org/apache/iceberg/spark/source/TestSparkDataWrite.java (outdated, resolved)

Address review comments

984d607

karuppayya requested a review from rdblue December 5, 2020 06:54

rdblue approved these changes Dec 5, 2020

rdblue merged commit 61702d1 into apache:master Dec 5, 2020

pvary pushed a commit to pvary/iceberg that referenced this pull request Dec 7, 2020

Spark: Rename fanout configs (apache#1877)

c0de0fc

4 participants