Synchronize write mode among writers #318
Conversation
@rdblue please review.
public WriteBuilder forTable(Table table) {
  schema(table.schema());
  setAll(table.properties());
I think this change is unrelated and should go in a separate PR. It's a good idea to add this, just not here.
Also, I think this should do some translation from table properties to ORC properties, like taking write.orc.compression-codec and setting the correct property for ORC.
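The translation being suggested can be sketched as a small property-mapping step. This is a hedged illustration only: `write.orc.compression-codec` is the Iceberg table property named in the comment, and `orc.compress` is ORC's configuration key, but the mapping Iceberg actually ships may cover more keys and use different helper names.

```java
import java.util.HashMap;
import java.util.Map;

class OrcWriteProps {
    // Hypothetical translation from Iceberg table properties to ORC
    // writer configuration, in the spirit of the review comment.
    static Map<String, String> translate(Map<String, String> tableProps) {
        Map<String, String> orcConf = new HashMap<>();
        String codec = tableProps.get("write.orc.compression-codec");
        if (codec != null) {
            // ORC expects upper-case codec names for orc.compress
            orcConf.put("orc.compress", codec.toUpperCase());
        }
        return orcConf;
    }
}
```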
    Function<Schema, DatumWriter<?>> createWriterFunc,
    CodecFactory codec, Map<String, String> metadata) throws IOException {
-     this.stream = file.create();
+     this.stream = file.createOrOverwrite();
This will affect writes for all data and metadata files. Is it safe to do this everywhere?
Moved it to the Avro class and made it configurable.
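The contract under discussion can be sketched with a stand-in output file over `java.nio`. This is illustrative only, not Iceberg's real `OutputFile` interface: `create()` refuses to clobber an existing file, while `createOrOverwrite()` replaces it, which is why applying it everywhere needed a safety check.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class LocalOutputFile {
    private final Path path;

    LocalOutputFile(Path path) { this.path = path; }

    // Fails if the target already exists (CREATE_NEW throws
    // FileAlreadyExistsException, surfaced here unchecked).
    OutputStream create() {
        try {
            return Files.newOutputStream(path,
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Replaces any existing file at the target path.
    OutputStream createOrOverwrite() {
        try {
            return Files.newOutputStream(path,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Test helper: a pre-existing temp file to write to.
    static Path tempPath() {
        try {
            return Files.createTempFile("out", ".bin");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```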
    .setWriteSupport(getWriteSupport(type))
    .withCompressionCodec(codec())
-   .withWriteMode(ParquetFileWriter.Mode.OVERWRITE) // TODO: support modes
+   .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
I think this TODO item is still valid.
This comment refers to supporting modes passed into the Parquet write builder. That's different from #302 because it isn't a table property.
Made the write mode configurable via the write builder; I guess that's what the TODO meant.
Great! That's exactly what the TODO meant.
@rdblue thanks for the code review, I've addressed the review comments.
Force-pushed 2f2a442 to ac86145
Force-pushed ac86145 to df0853b
  return this;
}

public WriteBuilder overwrite(boolean newOverwrite) {
I suggest also adding overwrite() that sets this.overwrite = true. That way users can avoid passing a boolean constant.
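The suggested pair of methods can be sketched as below. The class name is a simplified stand-in for the real Iceberg builders; the point is that the no-arg `overwrite()` delegates to `overwrite(boolean)`, so callers never pass a boolean constant.

```java
class AvroWriteBuilder {
    // Defaults to non-overwrite, matching the review feedback.
    private boolean overwrite = false;

    // Convenience overload suggested in the review: reads better at
    // call sites than overwrite(true).
    AvroWriteBuilder overwrite() {
        return overwrite(true);
    }

    AvroWriteBuilder overwrite(boolean enabled) {
        this.overwrite = enabled;
        return this;
    }

    boolean isOverwrite() {
        return overwrite;
    }
}
```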
  .schema(ManifestFile.schema())
  .named("manifest_file")
  .meta(meta)
  .overwrite(false)
Let's default manifest lists and manifests to overwrite. These use UUID-based file names and should never conflict.
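The reasoning here is that a random UUID in the file name makes path collisions effectively impossible, so overwriting is safe. A sketch of UUID-based naming (the exact pattern is illustrative, not Iceberg's actual one):

```java
import java.util.UUID;

class ManifestNames {
    // Each call yields a unique path because it embeds a fresh random
    // UUID; two writers should never target the same manifest file.
    static String newManifestPath(String tableLocation) {
        return tableLocation + "/metadata/" + UUID.randomUUID() + "-m0.avro";
    }
}
```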
private Map<String, String> config = Maps.newHashMap();
private Map<String, String> metadata = Maps.newLinkedHashMap();
private Function<Schema, DatumWriter<?>> createWriterFunc = GenericAvroWriter::new;
private boolean overwrite = true;
I think this should default to false.
Now all writers default to overwrite false.
  return this;
}

public WriteBuilder writeMode(ParquetFileWriter.Mode newWriteMode) {
Instead of passing in a write mode, I think Parquet should use the same methods as the other file format helpers, overwrite() and overwrite(boolean enabled). That way, all of them have a similar API and we don't leak Parquet classes through the Iceberg API.
Agree, made the changes.
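The agreed-on shape can be sketched as follows: the builder exposes only `overwrite()`/`overwrite(boolean)` and translates the flag to Parquet's write mode internally, so the Parquet class never leaks through the API. `Mode` is a local stand-in for `org.apache.parquet.hadoop.ParquetFileWriter.Mode`.

```java
class ParquetWriteBuilder {
    // Stand-in for org.apache.parquet.hadoop.ParquetFileWriter.Mode.
    enum Mode { CREATE, OVERWRITE }

    private boolean overwrite = false;

    ParquetWriteBuilder overwrite() {
        return overwrite(true);
    }

    ParquetWriteBuilder overwrite(boolean enabled) {
        this.overwrite = enabled;
        return this;
    }

    // Resolved only when the writer is built; callers never see a
    // Parquet-specific type.
    Mode writeMode() {
        return overwrite ? Mode.OVERWRITE : Mode.CREATE;
    }
}
```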
private Map<String, String> config = Maps.newLinkedHashMap();
private Function<MessageType, ParquetValueWriter<?>> createWriterFunc = null;
private MetricsConfig metricsConfig = MetricsConfig.getDefault();
private ParquetFileWriter.Mode writeMode = ParquetFileWriter.Mode.OVERWRITE;
Can this default to non-overwrite mode?
Now all writers default to overwrite false.
Looks great! Thanks for fixing this, @arina-ielchiieva!
* Add argument validation to HadoopTables#create (#298)
* Install source JAR when running install target (#310)
* Add projectStrict for Dates and Timestamps (#283)
* Correctly publish artifacts on JitPack (#321)
  The Gradle install target produces invalid POM files that are missing the dependencyManagement section and versions for some dependencies. Instead, we directly tell JitPack to run the correct Gradle target.
* Add build info to README.md (#304)
* Convert Iceberg time type to Hive string type (#325)
* Add overwrite option to write builders (#318)
* Fix out of order Pig partition fields (#326)
* Add mapping to Iceberg for external name-based schemas (#338)
* Site: Fix broken link to Iceberg API (#333)
* Add forTable method for Avro WriteBuilder (#322)
* Remove multiple literal strings check rule for scala (#335)
* Fix invalid javadoc url in README.md (#336)
* Use UnicodeUtil.truncateString for Truncate transform (#340)
  This truncates by unicode codepoint instead of Java chars.
* Refactor metrics tests for reuse (#331)
* Spark: Add support for write-audit-publish workflows (#342)
* Avoid write failures if metrics mode is invalid (#301)
* Fix truncateStringMax in UnicodeUtil (#334)
  Fixes #328, fixes #329. The index passed to codePointAt should be the offset calculated by code points.
* [Vectorization] Added batch sizing, switched to BufferAllocator, other minor style fixes.
Currently Iceberg has three writers, but they use different write modes.
This PR synchronizes the write modes across writers and defaults them to
create, as discussed in PR #302.