Skip to content

Support write mode for Parquet and Avro using table property#302

Closed
arina-ielchiieva wants to merge 1 commit intoapache:masterfrom
arina-ielchiieva:writeMode
Closed

Support write mode for Parquet and Avro using table property#302
arina-ielchiieva wants to merge 1 commit intoapache:masterfrom
arina-ielchiieva:writeMode

Conversation

@arina-ielchiieva
Copy link
Copy Markdown
Member

@arina-ielchiieva arina-ielchiieva commented Jul 21, 2019

Currently in Parquet and Avro writers write mode is hard-coded. For Parquet - overwrite, Avro - create.
This PR allows to set write mode using table properties for each of the above mentioned writers thus write mode can be controlled based on the needs of Iceberg table user.

public static final String PARQUET_COMPRESSION_DEFAULT = "gzip";

public static final String PARQUET_WRITE_MODE = "write.parquet.write-mode";
public static final String PARQUET_WRITE_MODE_DEFAULT = "overwrite";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 nice, having the "overwrite" default makes this change backwards compatible, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Parquet - yes, initially it was overwrite. For Avro - no, initially it was create. I made both overwrite to be the same but I can default Avro to create for backward compatibility if needed.

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Jul 24, 2019

Can you update the description with what this changes and why it is useful?

@arina-ielchiieva
Copy link
Copy Markdown
Member Author

@rdblue updated the description.

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Jul 24, 2019

@arina-ielchiieva, why is this needed? Shouldn't the mode always be overwrite?

@arina-ielchiieva
Copy link
Copy Markdown
Member Author

@rdblue
whether to overwrite file or fail when file already exists depends on application strategy:

  1. if you are confident that you always create unique files and then somehow collision has occurred, you would prefer to know that something went wrong rather than implicitly overwriting the file.
  2. on the other hand, when creating a several files: some files were created successfully, some were not, and you have retry strategy that attempts to create the same files one more time, in this case overwrite mode maybe more preferable.
    In my opinion, having options is more flexible than adjusting to some default behavior that's why I have made write modes configurable. If you think this is overhead, we can always chose one default behavior for these readers since unfortunately currently they behave differently.

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Jul 25, 2019

I think we should make a decision and have a single default behavior instead of adding table properties. I don't see much value in being able to configure this differently across tables.

@arina-ielchiieva
Copy link
Copy Markdown
Member Author

Closed in favor of PR #318.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants