Synchronize write mode among writers #318
Conversation
@rdblue please review.
public WriteBuilder forTable(Table table) {
  schema(table.schema());
  setAll(table.properties());
I think this change is unrelated and should go in a separate PR. It's a good idea to add this, just not here.
Also, I think this should do some translation from table properties to ORC properties, like taking write.orc.compression-codec and setting the correct property for ORC.
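The translation being suggested can be sketched as a small property-mapping step. This is a hedged illustration only: `write.orc.compression-codec` is the Iceberg table property named in the comment, and `orc.compress` is ORC's configuration key, but the mapping Iceberg actually ships may cover more keys and use different helper names.

```java
import java.util.HashMap;
import java.util.Map;

class OrcWriteProps {
    // Hypothetical translation from Iceberg table properties to ORC
    // writer configuration, in the spirit of the review comment.
    static Map<String, String> translate(Map<String, String> tableProps) {
        Map<String, String> orcConf = new HashMap<>();
        String codec = tableProps.get("write.orc.compression-codec");
        if (codec != null) {
            // ORC expects upper-case codec names for orc.compress
            orcConf.put("orc.compress", codec.toUpperCase());
        }
        return orcConf;
    }
}
```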
    Function<Schema, DatumWriter<?>> createWriterFunc,
    CodecFactory codec, Map<String, String> metadata) throws IOException {
-     this.stream = file.create();
+     this.stream = file.createOrOverwrite();
This will affect writes for all data and metadata files. Is it safe to do this everywhere?
Moved it to the Avro class and made it configurable.
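The contract under discussion can be sketched with a stand-in output file over `java.nio`. This is illustrative only, not Iceberg's real `OutputFile` interface: `create()` refuses to clobber an existing file, while `createOrOverwrite()` replaces it, which is why applying it everywhere needed a safety check.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class LocalOutputFile {
    private final Path path;

    LocalOutputFile(Path path) { this.path = path; }

    // Fails if the target already exists (CREATE_NEW throws
    // FileAlreadyExistsException, surfaced here unchecked).
    OutputStream create() {
        try {
            return Files.newOutputStream(path,
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Replaces any existing file at the target path.
    OutputStream createOrOverwrite() {
        try {
            return Files.newOutputStream(path,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Test helper: a pre-existing temp file to write to.
    static Path tempPath() {
        try {
            return Files.createTempFile("out", ".bin");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```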
    .setWriteSupport(getWriteSupport(type))
    .withCompressionCodec(codec())
-   .withWriteMode(ParquetFileWriter.Mode.OVERWRITE) // TODO: support modes
+   .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
I think this TODO item is still valid.
This comment refers to supporting modes passed into the Parquet write builder. That's different from #302 because it isn't a table property.
Made the write mode configurable via the write builder; I guess that's what the TODO meant.
Great! That's exactly what the TODO meant.
@rdblue thanks for the code review, I've addressed the review comments.
Force-pushed 2f2a442 to ac86145
Force-pushed ac86145 to df0853b
  return this;
}

public WriteBuilder overwrite(boolean newOverwrite) {
I suggest also adding overwrite() that sets this.overwrite = true. That way users can avoid passing a boolean constant.
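The suggested pair of methods can be sketched as below. The class name is a simplified stand-in for the real Iceberg builders; the point is that the no-arg `overwrite()` delegates to `overwrite(boolean)`, so callers never pass a boolean constant.

```java
class AvroWriteBuilder {
    // Defaults to non-overwrite, matching the review feedback.
    private boolean overwrite = false;

    // Convenience overload suggested in the review: reads better at
    // call sites than overwrite(true).
    AvroWriteBuilder overwrite() {
        return overwrite(true);
    }

    AvroWriteBuilder overwrite(boolean enabled) {
        this.overwrite = enabled;
        return this;
    }

    boolean isOverwrite() {
        return overwrite;
    }
}
```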
  .schema(ManifestFile.schema())
  .named("manifest_file")
  .meta(meta)
  .overwrite(false)
Let's default manifest lists and manifests to overwrite. These use UUID-based file names and should never conflict.
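The reasoning here is that a random UUID in the file name makes path collisions effectively impossible, so overwriting is safe. A sketch of UUID-based naming (the exact pattern is illustrative, not Iceberg's actual one):

```java
import java.util.UUID;

class ManifestNames {
    // Each call yields a unique path because it embeds a fresh random
    // UUID; two writers should never target the same manifest file.
    static String newManifestPath(String tableLocation) {
        return tableLocation + "/metadata/" + UUID.randomUUID() + "-m0.avro";
    }
}
```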
private Map<String, String> config = Maps.newHashMap();
private Map<String, String> metadata = Maps.newLinkedHashMap();
private Function<Schema, DatumWriter<?>> createWriterFunc = GenericAvroWriter::new;
private boolean overwrite = true;
I think this should default to false.
Now all writers default to overwrite false.
  return this;
}

public WriteBuilder writeMode(ParquetFileWriter.Mode newWriteMode) {
Instead of passing in a write mode, I think Parquet should use the same methods as the other file format helpers, overwrite() and overwrite(boolean enabled). That way, all of them have a similar API and we don't leak Parquet classes through the Iceberg API.
Agree, made the changes.
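The agreed-on shape can be sketched as follows: the builder exposes only `overwrite()`/`overwrite(boolean)` and translates the flag to Parquet's write mode internally, so the Parquet class never leaks through the API. `Mode` is a local stand-in for `org.apache.parquet.hadoop.ParquetFileWriter.Mode`.

```java
class ParquetWriteBuilder {
    // Stand-in for org.apache.parquet.hadoop.ParquetFileWriter.Mode.
    enum Mode { CREATE, OVERWRITE }

    private boolean overwrite = false;

    ParquetWriteBuilder overwrite() {
        return overwrite(true);
    }

    ParquetWriteBuilder overwrite(boolean enabled) {
        this.overwrite = enabled;
        return this;
    }

    // Resolved only when the writer is built; callers never see a
    // Parquet-specific type.
    Mode writeMode() {
        return overwrite ? Mode.OVERWRITE : Mode.CREATE;
    }
}
```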
private Map<String, String> config = Maps.newLinkedHashMap();
private Function<MessageType, ParquetValueWriter<?>> createWriterFunc = null;
private MetricsConfig metricsConfig = MetricsConfig.getDefault();
private ParquetFileWriter.Mode writeMode = ParquetFileWriter.Mode.OVERWRITE;
Can this default to non-overwrite mode?
Now all writers default to overwrite false.
Looks great! Thanks for fixing this, @arina-ielchiieva!
* Add argument validation to HadoopTables#create (#298)
* Install source JAR when running install target (#310)
* Add projectStrict for Dates and Timestamps (#283)
* Correctly publish artifacts on JitPack (#321)
  The Gradle install target produces invalid POM files that are missing the dependencyManagement section and versions for some dependencies. Instead, we directly tell JitPack to run the correct Gradle target.
* Add build info to README.md (#304)
* Convert Iceberg time type to Hive string type (#325)
* Add overwrite option to write builders (#318)
* Fix out of order Pig partition fields (#326)
* Add mapping to Iceberg for external name-based schemas (#338)
* Site: Fix broken link to Iceberg API (#333)
* Add forTable method for Avro WriteBuilder (#322)
* Remove multiple literal strings check rule for scala (#335)
* Fix invalid javadoc url in README.md (#336)
* Use UnicodeUtil.truncateString for Truncate transform (#340)
  This truncates by unicode codepoint instead of Java chars.
* Refactor metrics tests for reuse (#331)
* Spark: Add support for write-audit-publish workflows (#342)
* Avoid write failures if metrics mode is invalid (#301)
* Fix truncateStringMax in UnicodeUtil (#334)
  Fixes #328, fixes #329. The index passed to codePointAt should be the offset calculated by code points.
* [Vectorization] Added batch sizing, switched to BufferAllocator, other minor style fixes.
Currently Iceberg has three writers, but they use different write modes.
This PR synchronizes the write modes across writers and defaults them to
create, as discussed in PR #302.