Add mapping to Iceberg for external name-based schemas #338
rdblue merged 1 commit into apache:master
Conversation
public static final String DEFAULT_WRITE_METRICS_MODE = "write.metadata.metrics.default";
public static final String DEFAULT_WRITE_METRICS_MODE_DEFAULT = "truncate(16)";

public static final String DEFAULT_NAME_MAPPING = "schema.name-mapping.default";
We're calling this default, but it's not really a default, right? Shouldn't this just be schema.name-mapping?
The idea is that there may eventually be multiple mappings. If you have two streams of data written by Kafka, for example, you may want a different mapping from source name to Iceberg columns.
The current name defines a default mapping, to be used when a mapping is needed but there is no specified mapping for a file. Later, we can add schema.name-mapping.(name) properties to add more than one.
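To sketch how this could look in table properties — the named variants below are hypothetical, per the comment above, and are not part of this change:

```
# the single default mapping added by this change
schema.name-mapping.default = <mapping JSON>

# hypothetical future per-source mappings (schema.name-mapping.(name))
schema.name-mapping.kafka-events = <mapping JSON>
schema.name-mapping.kafka-clicks = <mapping JSON>
```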
if (node.has(NAMES)) {
  names = ImmutableSet.copyOf(JsonUtil.getStringList(NAMES, node));
} else {
  names = ImmutableSet.of();
}
Seems like we are allowing field IDs to exist without a mapping. Is there a reason for this, as opposed to requiring fields either to default to mapping by ID or to be mapped to an ID by name? I'm not clear what it means to have a field identified with no associated name.
No associated name would mean that there is no data mapped to a column, but the ID would exist as a place-holder.
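Such a placeholder entry might look like the snippet below. The `names` key matches the parser snippet above; the `field-id` key name is an assumption about the serialized form:

```
{ "field-id": 4, "names": [] }
```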
Thanks for reviewing @danielcweeks! @rdsr, I'm going to go ahead and merge this so we can start building on it. Feel free to add additional review comments and we can fix anything we need to in a follow-up.
* Add argument validation to HadoopTables#create (#298)
* Install source JAR when running install target (#310)
* Add projectStrict for Dates and Timestamps (#283)
* Correctly publish artifacts on JitPack (#321)
  The Gradle install target produces invalid POM files that are missing the dependencyManagement section and versions for some dependencies. Instead, we directly tell JitPack to run the correct Gradle target.
* Add build info to README.md (#304)
* Convert Iceberg time type to Hive string type (#325)
* Add overwrite option to write builders (#318)
* Fix out of order Pig partition fields (#326)
* Add mapping to Iceberg for external name-based schemas (#338)
* Site: Fix broken link to Iceberg API (#333)
* Add forTable method for Avro WriteBuilder (#322)
* Remove multiple literal strings check rule for scala (#335)
* Fix invalid javadoc url in README.md (#336)
* Use UnicodeUtil.truncateString for Truncate transform (#340)
  This truncates by unicode codepoint instead of Java chars.
* Refactor metrics tests for reuse (#331)
* Spark: Add support for write-audit-publish workflows (#342)
* Avoid write failures if metrics mode is invalid (#301)
* Fix truncateStringMax in UnicodeUtil (#334)
  Fixes #328, fixes #329. Index to codePointAt should be the offset calculated by code points.
* [Vectorization] Added batch sizing, switched to BufferAllocator, other minor style fixes.
This adds a mapping from external schema names to Iceberg type IDs. This will be used to add Iceberg IDs to data files written without those IDs, like Avro files.
This mapping is a multi-level mapping that matches the structure of Iceberg schemas. Each nested type (struct, list, and map) has a nested mapping for its nested fields. Each field, including list element, map key, and map value, is represented using
MappedField, which contains a set of names and an optional Iceberg ID. Mappings are updated as an Iceberg schema evolves: renaming a field will add a new name to its mapping, and adding new fields will add new field mappings.
NameMapping provides high-level methods to find mapped fields by ID or by qualified name.
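To make the lookup behavior concrete, here is a minimal self-contained sketch of resolving a qualified external name to a field ID through a nested mapping. This is not the Iceberg implementation; the `idFor` method and the `NameMappingSketch` class are invented for illustration, and only the MappedField shape (names plus optional ID plus nested fields) follows the description above:

```java
import java.util.List;

// Sketch of a mapped field: external name aliases, an optional Iceberg
// field ID (null acts as a placeholder with no data mapped), and a
// nested mapping for struct, list, and map types.
class MappedField {
  final Integer id;
  final List<String> names;
  final List<MappedField> fields;

  MappedField(Integer id, List<String> names, List<MappedField> fields) {
    this.id = id;
    this.names = names;
    this.fields = fields;
  }
}

class NameMappingSketch {
  // Resolve a dotted qualified name (e.g. "location.lat") to a field ID
  // by walking the nested mapping one name part at a time.
  static Integer idFor(List<MappedField> mapping, String qualifiedName) {
    String[] parts = qualifiedName.split("\\.", 2);
    for (MappedField field : mapping) {
      if (field.names.contains(parts[0])) {
        return parts.length == 1 ? field.id : idFor(field.fields, parts[1]);
      }
    }
    return null; // no mapping for this name
  }
}
```

Because each name lookup happens within one level of the mapping, renaming a field only needs to add an alias to that field's `names` set; all nested mappings below it stay valid.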