API: Add default value core api and schema serialization #4525

Conversation
build.gradle
Outdated
project(':iceberg-api') {
  dependencies {
    implementation project(path: ':iceberg-bundled-guava', configuration: 'shadow')
    implementation "com.fasterxml.jackson.core:jackson-databind"
Why is an additional Jackson library needed? I think we should be able to add default value serialization just like the other parsers.
}
if (field.writeDefaultValue() != null) {
  generator.writeFieldName(WRITE_DEFAULT);
  generator.writeRawValue(field.writeDefaultValue().toString());
Instead of using toString that produces JSON, I think that this should expose a ValueParser that converts values from the internal Iceberg representation into JSON values.
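As a rough illustration of that suggestion, here is a minimal stdlib-only sketch of emitting JSON literals from in-memory values rather than relying on toString; the class name, method name, and the set of handled types are hypothetical, and string escaping is omitted:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the suggested direction: convert an internal
// in-memory default value into a JSON literal, instead of calling
// Object.toString(). Type coverage here is illustrative only.
public class DefaultValueToJson {
  public static String toJson(Object value) {
    if (value instanceof Boolean || value instanceof Integer
        || value instanceof Long || value instanceof Double) {
      // JSON boolean and number literals match the Java toString form
      return value.toString();
    } else if (value instanceof CharSequence) {
      // JSON strings must be quoted (escaping omitted in this sketch)
      return "\"" + value + "\"";
    } else if (value instanceof ByteBuffer) {
      // binary/fixed defaults rendered as a hex string, e.g. "0x111f"
      ByteBuffer buf = ((ByteBuffer) value).duplicate();
      StringBuilder sb = new StringBuilder("\"0x");
      while (buf.hasRemaining()) {
        sb.append(String.format("%02x", buf.get() & 0xff));
      }
      return sb.append('"').toString();
    }
    throw new UnsupportedOperationException("Unsupported default: " + value);
  }
}
```

A real ValueParser would dispatch on the Iceberg Type rather than on the Java class, but the shape of the conversion is the same.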
{Types.DoubleType.get(), parseJsonStringToJsonNode("123.456")},
{Types.DateType.get(), parseJsonStringToJsonNode("3650")},
{Types.TimeType.get(), parseJsonStringToJsonNode("36000000000")},
{Types.TimestampType.withoutZone(), parseJsonStringToJsonNode("1649374911000000")},
Looking at this again, I think that date/time fields should be converted to (strict) ISO-8601 representations. That way they are readable in JSON.
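For illustration, the internal representations (days from the Unix epoch for dates, microseconds for times and timestamps) can be rendered as strict ISO-8601 strings with java.time; the class and method names below are hypothetical:

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.ZoneOffset;

// Sketch: convert Iceberg's internal date/time values (int days and
// long microseconds from the Unix epoch) into ISO-8601 strings.
public class IsoDefaults {
  public static String isoDate(int daysFromEpoch) {
    // LocalDate.toString() is the strict ISO-8601 date form, yyyy-MM-dd
    return LocalDate.ofEpochDay(daysFromEpoch).toString();
  }

  public static String isoTime(long microsFromMidnight) {
    return LocalTime.ofNanoOfDay(microsFromMidnight * 1000).toString();
  }

  public static String isoTimestamp(long microsFromEpoch) {
    return LocalDateTime.ofEpochSecond(
        Math.floorDiv(microsFromEpoch, 1_000_000L),
        (int) Math.floorMod(microsFromEpoch, 1_000_000L) * 1000,
        ZoneOffset.UTC).toString();
  }
}
```

With this, a date default of 3650 would serialize as "1979-12-30" rather than a bare day count.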
}

@SuppressWarnings("checkstyle:CyclomaticComplexity")
private static boolean isValidDefault(Type type, JsonNode defaultValue) {
Iceberg uses visitors to separate the logic for schema traversal from the logic for individual decisions. Can you please use a visitor?
I feel using a SchemaVisitor here would be overkill, since this logic only validates a NestedField. It is technically only part of the schema, and even for a complex type we only recurse downward, not upward. Readability-wise, I also think this recursive implementation is clean enough: it clearly validates each kind of default value case by case.
To me, SchemaVisitor is mostly used for holistic schema manipulations/transformations.
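The recursive, case-by-case style under discussion can be sketched in plain Java as follows; the TypeId enum and the value shapes are simplified stand-ins for Iceberg's Type and default-value representation, not the actual API:

```java
import java.util.List;
import java.util.Map;

// Stdlib-only sketch of recursive default-value validation without a
// SchemaVisitor: switch on the type, and recurse downward for nested types.
public class DefaultValidation {
  public enum TypeId { BOOLEAN, INT, LONG, STRING, LIST, MAP }

  public static boolean isValidDefault(TypeId type, Object defaultValue) {
    switch (type) {
      case BOOLEAN:
        return defaultValue instanceof Boolean;
      case INT:
        return defaultValue instanceof Integer;
      case LONG:
        return defaultValue instanceof Long;
      case STRING:
        return defaultValue instanceof CharSequence;
      case LIST:
        // recurse into each element; the element type is fixed to INT
        // here only to keep the sketch short
        return defaultValue instanceof List
            && ((List<?>) defaultValue).stream()
                .allMatch(e -> isValidDefault(TypeId.INT, e));
      case MAP:
        return defaultValue instanceof Map;
      default:
        return false;
    }
  }
}
```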
return doc;
}

public JsonNode initialDefaultValue() {
This API should not use Jackson classes. Instead, it should use the standard Iceberg in-memory representation:
- boolean - Java primitive boolean
- int - Java primitive int
- long - Java primitive long
- float - Java primitive float
- double - Java primitive double
- date - Java primitive int
- time - Java primitive long
- timestamp/timestamptz - Java primitive long
- string - Java CharSequence
- uuid - Java UUID
- binary - Java ByteBuffer
- fixed(L) - Java ByteBuffer
- decimal(P, S) - Java BigDecimal
- list - Java Collection
- struct - StructLike
- map - Java Map
- list - Java Collection
I think using Java List is more correct, as List preserves element order.
- struct - StructLike
I see this comment:
/**
* Interface for accessing data by position in a schema.
* <p>
* This interface supports accessing data in top-level fields, not in nested fields.
*/
in that interface. Also, the API it exposes accesses fields by position, which doesn't suit our case of accessing fields by field id.
I think using a native Java map would be better.
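For illustration, a struct default represented as a plain Java Map keyed by field id, with a nested list default as java.util.List, could look like this; the field ids and values are invented for the example:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a struct default value as Map<Integer, Object>
// keyed by field id, rather than a StructLike accessed by position.
public class StructDefaultExample {
  public static Map<Integer, Object> structDefault() {
    // field 1: an int default; field 2: a list default
    return Map.<Integer, Object>of(1, 100, 2, List.of("a", "b"));
  }
}
```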
return Objects.hash(NestedField.class, id, isOptional, name, type);
}

private static JsonNode validateDefault(String name, Type type, JsonNode defaultValue) {
I think this logic should be moved to a ValueParser with toJson and fromJson methods, like the others. We don't want to leak any Jackson classes in the API.
I see that your major concern in the reviews is the use of Jackson in the iceberg-api module. Let me put my thoughts here in one place to answer all the questions above. The core problem is whether we should represent the default value in memory as a Jackson JsonNode (the approach I chose) or as plain Java objects.

If we take the Java primitive approach, I think my current code could be adapted this way: keep the current JSON validation logic, then add a conversion from the validated JSON into the in-memory Java representation. Let me know your thoughts and I'm open to further discussions too.
@rzhang10, it's okay to add the library to the API module if that's what is needed, although I'd like to see if we actually need it there or if we can convert values to/from JSON in core. For the in-memory representation, we do need to use the common representation and not JsonNode. We also cannot leak Jackson classes in the API module.
return new Literals.DecimalLiteral(value);
}

static Literal<Integer> ofDateLiteral(int value) {
I don't think you should need any new literal constructors. You can always convert using the existing to(type) conversion with the target type.
See my comment below.
return (Literal<T>) this;
} else if (type.typeId() == Type.TypeID.STRING) {
  return (Literal<T>) new StringLiteral(LocalTime.ofNanoOfDay(value() * 1000)
      .format(DateTimeFormatter.ISO_LOCAL_TIME));
I don't think this conversion should be done using literals. Instead, use the methods in DateTimeUtil and add any new methods there. This shouldn't require changes to the public API in expressions.
Yeah, previously you pointed me to the Literals code, so I thought I might be able to reuse the existing literal conversion logic there.
My changes basically add conversions from the three date/time-related types to string literals. Since the existing string-literal code already converts to those date/time types, I thought adding the reverse direction would make the conversions symmetric and enable round-trip conversions.
WDYT?
I don't think this conversion should be done using literals. You can see how we're doing it here, but there is no need to change the Literals API. And there is certainly no need to add extra conversions to Literals. These are purposely limited.
private NestedField(boolean isOptional, int id, String name, Type type, String doc) {
private NestedField(boolean isOptional, int id, String name, Type type, String doc,
    Object initialDefault, Object writeDefault) {
Style: in Iceberg, all argument lines should start at the same column, either aligned after the method declaration (with boolean isOptional, ...) or all at the same continuation indent level. This mixes the two.
@rzhang10, please don't resolve threads that are not updated yet.
return doc;
}

public Object initialDefaultValue() {
Do we need Value here? It seems redundant. Maybe just initialDefault and writeDefault.
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

public class DefaultValueParser {
I think this class could be in a separate PR to start with. The representation and tests are fairly well-defined. Can you separate it out?
The reason I put them together is that the default value serialization and deserialization depend on the APIs for accessing the default value stored in the schema. So in order for the jsonRoundTrip test to work, it must depend on the methods introduced in the API.
Do you think we can first create a standalone PR with only the API changes (that PR should be minimal and fast to approve and merge), and then I can rebase and submit the subsequent PRs? (The Spark reader implementation PRs will also depend on the new API methods.)
The reason I'm asking is that even if I split all my later PRs, those PRs won't compile until the API changes are checked in, unless I include the same API changes in each split PR.
Anyway, I have created a minimal API-change-only PR: #4732, which takes into account all of your style comments above. I think that PR should be straightforward to merge, and I plan to submit the subsequent split PRs by rebasing after it is merged.
switch (type.typeId()) {
  case BOOLEAN:
    return jsonNode.booleanValue();
I think this should use the same pattern and checks that are used in other JSON parsers. Those use JsonUtil and validate the type. For example:

public static int getInt(String property, JsonNode node) {
  Preconditions.checkArgument(node.has(property), "Cannot parse missing int %s", property);
  JsonNode pNode = node.get(property);
  Preconditions.checkArgument(pNode != null && !pNode.isNull() && pNode.isNumber(),
      "Cannot parse %s to an integer value: %s", property, pNode);
  return pNode.asInt();
}

That checks that the node actually is an integer. We want to be similarly strict here.
There's another dedicated method, isValidDefault, that already does this validation, and it should always be called before parseDefaultFromJson, which does the actual parsing.
Let's match the approach that the other parsers take so we have consistency in the code.
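The strict-accessor pattern being recommended can be sketched with stdlib types as follows; the class name and the Map-based stand-in for JsonNode are illustrative assumptions, not the JsonUtil API itself:

```java
import java.util.Map;

// Sketch of the strict-accessor pattern: validate presence and type at the
// point of extraction, instead of relying on a separate validation method
// that callers might forget to invoke first.
public class StrictGet {
  public static int getInt(String property, Map<String, ?> node) {
    if (!node.containsKey(property)) {
      throw new IllegalArgumentException("Cannot parse missing int " + property);
    }
    Object value = node.get(property);
    if (!(value instanceof Integer)) {
      throw new IllegalArgumentException(
          "Cannot parse " + property + " to an integer value: " + value);
    }
    return (Integer) value;
  }
}
```

The key design point is that parsing and validation happen in one call, so a value can never be extracted without being checked.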
MAPPER = new ObjectMapper(FACTORY);
SimpleModule customModule = new SimpleModule();
customModule.addSerializer(ByteBuffer.class, new HexStringCustomByteBufferSerializer());
MAPPER.registerModule(customModule);
The majority of the Iceberg parsers don't use ObjectMapper. Why is it needed here?
I use this mainly to serialize the in-memory fixed and binary defaults, which are represented as Java ByteBuffer, into our custom JSON representation that looks like 0x111f; you can see the custom serialization logic in HexStringCustomByteBufferSerializer.
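For comparison, the 0x111f hex form can be produced and parsed with a couple of plain helpers, without registering a Jackson serializer module; the class and method names below are hypothetical:

```java
import java.nio.ByteBuffer;

// Sketch: round-trip a ByteBuffer to/from the "0x..." hex string form
// using only stdlib code, no ObjectMapper or custom Jackson serializer.
public class HexBytes {
  public static String toHex(ByteBuffer buffer) {
    ByteBuffer buf = buffer.duplicate(); // don't disturb the caller's position
    StringBuilder sb = new StringBuilder("0x");
    while (buf.hasRemaining()) {
      sb.append(String.format("%02x", buf.get() & 0xff));
    }
    return sb.toString();
  }

  public static ByteBuffer fromHex(String hex) {
    String digits = hex.substring(2); // strip the "0x" prefix
    byte[] bytes = new byte[digits.length() / 2];
    for (int i = 0; i < bytes.length; i++) {
      bytes[i] = (byte) Integer.parseInt(digits.substring(2 * i, 2 * i + 2), 16);
    }
    return ByteBuffer.wrap(bytes);
  }
}
```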
return defaultValue;
}

@SuppressWarnings("checkstyle:CyclomaticComplexity")
We generally don't suppress this warning unless we are trying to avoid major code changes. Since this is all new, it is best to follow the recommendation.
ObjectMapper mapper = new ObjectMapper();
return mapper.readTree(json);
} catch (JsonProcessingException e) {
  System.out.println("Failed to parse: " + json + "; reason: " + e.getMessage());
Please remove print statements.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
This is the first PR to implement the spec PR #4301.

It augments the NestedField API with default values, adds logic to SchemaParser for serializing default values with the schema, and implements the logic to deserialize them back into the in-memory representation in NestedField. The default value is represented in memory as a vanilla Java Object and is validated/parsed/converted from the user-facing JSON representation via utilities in DefaultValueParser. The JSON representation is also what gets serialized to disk along with the schema; it aligns with Appendix C in spec #4301.

This PR will be the basis of later PRs to support default value read/write semantics for each data type and compute engine combination.
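For illustration, a schema field serialized with defaults might look roughly like the following; the exact property names are defined by spec #4301, so treat the "initial-default" and "write-default" keys here as an assumption:

```json
{
  "id": 1,
  "name": "data",
  "required": false,
  "type": "string",
  "initial-default": "hello",
  "write-default": "hello"
}
```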
Please take a look @rdblue, thank you!