Skip to content

Conversation

@karuppayya
Copy link
Contributor

Renaming the fanout configs.

  1. Renamed DF write option from partitioned.fanout.enabled to fanout-enabled(similar to other Spark options)
  2. Renamed TableProperty write.partitioned.fanout.enabled to write.spark.partitioned.fanout.enabled, as it applies to only Spark.

package org.apache.iceberg.spark;

public class SparkWriteOptions {
public static final String FANOUT_ENABLED = "fanout-enabled";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add the other write options in this PR or keep the refactor separate?

@karuppayya
Copy link
Contributor Author

@rdblue @aokolnychyi Can you please help review the change?

/**
* Spark DF write options
*/
public class SparkWriteOptions {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I like adding a class for these.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue @karuppayya, shall we add SparkReadOptions and move all read and write options to those classes? I think we need a place for them. It would be great to build a utility class that would take into account table properties, SQL conf, write/read options and provide the actual value to be used in the future. That being said, I think we should start simple and SparkReadOptions and SparkWriteOptions sounds reasonable to me.

@HeartSaVioR
Copy link
Contributor

If you don't mind, this might be a good chance to update the guide on partitioned write in Spark: https://github.com/apache/iceberg/blob/master/site/docs/spark.md#writing-against-partitioned-table

Fanout writer can be an alternative to explicit sort, hence would be better to add it there. If you'd just like to finalize your work with the code change, I can take a step for that.

@karuppayya karuppayya requested a review from rdblue December 5, 2020 06:54
@rdblue rdblue merged commit 61702d1 into apache:master Dec 5, 2020
@rdblue
Copy link
Contributor

rdblue commented Dec 5, 2020

Thanks @karuppayya! Looks good so I merged it.

@HeartSaVioR, let's update the docs separately. I agree that it would probably be helpful, but I think we want to be careful about how we document this. This is primarily for streaming and I would not recommend people avoid adding a sort for batch work. If you're interested in working on the docs, that would be great! Thank you!

@HeartSaVioR
Copy link
Contributor

Yeah I'm happy to document it. Thanks!

One thing I wonder is how much it is helpful or even it hurts to have table property for this, as we are in consensus that this might be dangerous on batch query if they don't notice the behavior (depending on cardinality of partitions). I also commented on previous PR and it got merged without answering it. For now we don't explain what is fanout writer in the doc, so they have no idea and fear to enable this, but once we document the behavior without proper warn, they may be misunderstanding the behavior as good for all cases and update the table property. (I guess you're concerning about documenting this due to this, do I understand correctly?)

Personally I'm not 100% sure we'd like to add table property which is only good for specific workload, but at least we could document this with warning that this opens files for cardinality of partitions in the data in each task, so only recommended to use it in streaming write (not mentioning table property here). WDYT?

@aokolnychyi
Copy link
Contributor

@HeartSaVioR, do you happen to have a link to the open comment from the previous PR?

@aokolnychyi
Copy link
Contributor

I am not against having a table property but I think we should clearly document the concern of enabling this and discourage using fanout writers in batch jobs. The cost of a local sort is minimal and Spark does this implicitly in batch jobs for built-in V1 tables.

I think the documentation should be clear about the potential consequences and give a good example.

@aokolnychyi
Copy link
Contributor

Thanks for working on this, @karuppayya!

@HeartSaVioR
Copy link
Contributor

@aokolnychyi
Here it is. #1774 (comment)

I agree we could probably discourage using the option in doc. Probably they would only want to set the table property when their table is only written by streaming query.

pvary pushed a commit to pvary/iceberg that referenced this pull request Dec 7, 2020
@HeartSaVioR
Copy link
Contributor

Raised a PR #1929 for documenting fanout.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants