
Synchronize write mode among writers #318

Merged
rdblue merged 3 commits into apache:master from arina-ielchiieva:commonWriteMode on Jul 30, 2019

Conversation

@arina-ielchiieva (Member) commented Jul 26, 2019

Currently, Iceberg has three writers, but they use different write modes:

| Writer  | Mode      |
|---------|-----------|
| Parquet | Overwrite |
| Avro    | Create    |
| Orc     | Create    |

This PR synchronizes the write modes across writers and defaults them to create, as discussed in PR #302.
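The builder API the discussion below converges on can be sketched roughly as follows. This is a minimal illustration, not Iceberg's actual classes: the class name `SketchWriteBuilder` and the in-memory map standing in for a file system are invented for the example. The key idea is a single `overwrite` flag that defaults to create semantics (fail if the file already exists):

```java
import java.util.HashMap;
import java.util.Map;

class SketchWriteBuilder {
    // false => create mode: writing to an existing path throws;
    // true  => overwrite the existing file
    private boolean overwrite = false;

    // stands in for a file system in this sketch
    private final Map<String, String> files = new HashMap<>();

    // explicit form: overwrite(true) / overwrite(false)
    SketchWriteBuilder overwrite(boolean enabled) {
        this.overwrite = enabled;
        return this;
    }

    // no-arg convenience form, so call sites avoid boolean constants
    SketchWriteBuilder overwrite() {
        return overwrite(true);
    }

    boolean isOverwrite() {
        return overwrite;
    }

    void write(String path, String content) {
        if (!overwrite && files.containsKey(path)) {
            throw new IllegalStateException("File already exists: " + path);
        }
        files.put(path, content);
    }
}
```

With the default, a second write to the same path fails; after calling `overwrite()`, it replaces the file.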

@arina-ielchiieva (Member, Author):

@rdblue please review.


```java
public WriteBuilder forTable(Table table) {
  schema(table.schema());
  setAll(table.properties());
```
Contributor:
I think this change is unrelated and should go in a separate PR. It's a good idea to add this, just not here.

Also, I think this should do some translation from table properties to ORC properties, like taking write.orc.compression-codec and setting the correct property for ORC.

Member Author:

Removed.
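Although the `forTable` change was dropped from this PR, the translation the reviewer suggests might look roughly like the sketch below. The helper class and the exact mapping are assumptions for illustration; `orc.compress` is ORC's standard compression configuration key, but which keys Iceberg would actually map is not specified here:

```java
import java.util.HashMap;
import java.util.Map;

class OrcPropertyTranslator {
    // Hypothetical mapping from Iceberg table properties to ORC writer keys;
    // only the compression codec mentioned in the review is shown.
    private static final Map<String, String> TABLE_TO_ORC = new HashMap<>();
    static {
        TABLE_TO_ORC.put("write.orc.compression-codec", "orc.compress");
    }

    // Copies only recognized table properties, renamed to their ORC keys.
    static Map<String, String> translate(Map<String, String> tableProps) {
        Map<String, String> orcConf = new HashMap<>();
        for (Map.Entry<String, String> e : tableProps.entrySet()) {
            String orcKey = TABLE_TO_ORC.get(e.getKey());
            if (orcKey != null) {
                orcConf.put(orcKey, e.getValue());
            }
        }
        return orcConf;
    }
}
```

Unrelated table properties pass through untouched rather than being forwarded blindly to the ORC writer.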

```diff
    Function<Schema, DatumWriter<?>> createWriterFunc,
    CodecFactory codec, Map<String, String> metadata) throws IOException {
-   this.stream = file.create();
+   this.stream = file.createOrOverwrite();
```
Contributor:

This will affect writes for all data and metadata files. Is it safe to do this everywhere?

Member Author:

Moved it to the Avro class and made it configurable.

```diff
    .setWriteSupport(getWriteSupport(type))
    .withCompressionCodec(codec())
-   .withWriteMode(ParquetFileWriter.Mode.OVERWRITE) // TODO: support modes
+   .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
```
Contributor:

I think this TODO item is still valid.

This comment refers to supporting modes passed into the Parquet write builder. That's different from #302 because it isn't a table property.

Member Author:

Made the write mode configurable via the write builder; I guess this is what the TODO meant.

Contributor:

Great! That's exactly what the TODO meant.

@arina-ielchiieva (Member, Author):

@rdblue thanks for the code review; addressed the comments.

```java
  return this;
}

public WriteBuilder overwrite(boolean newOverwrite) {
```
Contributor:

I suggest also adding overwrite() that sets this.overwrite = true. That way users can avoid passing a boolean constant.

Member Author:

Added.

```java
    .schema(ManifestFile.schema())
    .named("manifest_file")
    .meta(meta)
    .overwrite(false)
```
Contributor:

Let's default manifest lists and manifests to overwrite. These use UUID-based file names and should never conflict.

Member Author:

Done.

```java
  private Map<String, String> config = Maps.newHashMap();
  private Map<String, String> metadata = Maps.newLinkedHashMap();
  private Function<Schema, DatumWriter<?>> createWriterFunc = GenericAvroWriter::new;
  private boolean overwrite = true;
```
Contributor:

I think this should default to false.

Member Author:

Now all writers default to overwrite false.

```java
  return this;
}

public WriteBuilder writeMode(ParquetFileWriter.Mode newWriteMode) {
```
Contributor:

Instead of passing in a write mode, I think Parquet should use the same methods as the other file format helpers, overwrite() and overwrite(boolean enabled). That way, all of them have a similar API and we don't leak Parquet classes through the Iceberg API.

Member Author:

Agree, made the changes.
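The agreed-upon shape can be sketched like this: the builder exposes only the format-agnostic `overwrite()` / `overwrite(boolean)` methods and derives the Parquet-specific mode internally, so Parquet classes never leak through the public API. The class name is invented, and the nested `Mode` enum is a stand-in for the real `ParquetFileWriter.Mode` from parquet-hadoop:

```java
class ParquetBuilderSketch {
    // Stand-in for ParquetFileWriter.Mode; the real enum lives in parquet-hadoop
    enum Mode { CREATE, OVERWRITE }

    // default matches the review outcome: fail on existing files
    private boolean overwrite = false;

    ParquetBuilderSketch overwrite(boolean enabled) {
        this.overwrite = enabled;
        return this;
    }

    ParquetBuilderSketch overwrite() {
        return overwrite(true);
    }

    // The Parquet-specific mode is derived only when building the writer,
    // so callers see the same overwrite API as the Avro and ORC builders.
    Mode writeMode() {
        return overwrite ? Mode.OVERWRITE : Mode.CREATE;
    }
}
```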

```java
  private Map<String, String> config = Maps.newLinkedHashMap();
  private Function<MessageType, ParquetValueWriter<?>> createWriterFunc = null;
  private MetricsConfig metricsConfig = MetricsConfig.getDefault();
  private ParquetFileWriter.Mode writeMode = ParquetFileWriter.Mode.OVERWRITE;
```
Contributor:

Can this default to non-overwrite mode?

Member Author:

Now all writers default to overwrite false.

@rdblue (Contributor) commented Jul 30, 2019

Looks great! Thanks for fixing this, @arina-ielchiieva!

@rdblue rdblue merged commit f4fc8ff into apache:master Jul 30, 2019
danielcweeks pushed a commit that referenced this pull request Aug 1, 2019
* Add argument validation to HadoopTables#create (#298)

* Install source JAR when running install target (#310)

* Add projectStrict for Dates and Timestamps (#283)

* Correctly publish artifacts on JitPack (#321)

The Gradle install target produces invalid POM files that are missing
the dependencyManagement section and versions for some dependencies.
Instead, we directly tell JitPack to run the correct Gradle target.

* Add build info to README.md (#304)

* Convert Iceberg time type to Hive string type (#325)

* Add overwrite option to write builders (#318)

* Fix out of order Pig partition fields (#326)

* Add mapping to Iceberg for external name-based schemas (#338)

* Site: Fix broken link to Iceberg API (#333)

* Add forTable method for Avro WriteBuilder (#322)

* Remove multiple literal strings check rule for scala (#335)

* Fix invalid javadoc url in README.md (#336)

* Use UnicodeUtil.truncateString for Truncate transform. (#340)

This truncates by unicode codepoint instead of Java chars.

* Refactor metrics tests for reuse (#331)

* Spark: Add support for write-audit-publish workflows (#342)

* Avoid write failures if metrics mode is invalid (#301)

* Fix truncateStringMax in UnicodeUtil (#334)

Fixes #328, fixes #329.

Index to codePointAt should be the offset calculated by code points

* [Vectorization] Added batch sizing, switched to BufferAllocator, other minor style fixes.
rdblue pushed a commit to rdblue/iceberg that referenced this pull request Aug 7, 2019
rdblue pushed a commit to rdblue/iceberg that referenced this pull request Aug 22, 2019