-
Notifications
You must be signed in to change notification settings - Fork 3k
Core: Implement default value parsing and unparsing #4871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } | ||
|
|
||
| public static JsonNode validateDefault(Type type, JsonNode defaultValue) { | ||
| if (defaultValue != null && !isValidDefault(type, defaultValue)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should match the style of the other JSON parsers, which don't do this work twice. Here, you're using a switch statement on the type to validate, and then using a switch statement on the type to extract the value. Instead, I think this should have one method that keeps the logic for each type in the same place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, let me do this.
| public static Object parseDefaultFromJson(Type type, JsonNode defaultValue) { | ||
| validateDefault(type, defaultValue); | ||
|
|
||
| if (defaultValue == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to check isNull as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
| return Literal.of(defaultValue.textValue()).to(type).value(); | ||
| case FIXED: | ||
| byte[] fixedBytes = BaseEncoding.base16().decode(defaultValue.textValue().toUpperCase(Locale.ROOT).replaceFirst( | ||
| "^0X", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the value of 0x? I think I'd rather just remove it than have all this extra handling for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, we can remove this, should we also update the spec according to this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please update the spec as well.
| byte[] fixedBytes = BaseEncoding.base16().decode(defaultValue.textValue().toUpperCase(Locale.ROOT).replaceFirst( | ||
| "^0X", | ||
| "")); | ||
| return ByteBuffer.allocate(((Types.FixedType) type).length()).put(fixedBytes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to validate the length of the byte array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use ByteBuffer.wrap here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| case MAP: | ||
| Map<Object, Object> defaultMap = Maps.newHashMap(); | ||
| List<JsonNode> keysAndValues = StreamSupport | ||
| .stream(defaultValue.spliterator(), false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iceberg code should not use spliterator. Can you find another way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No longer need this since we represent the Map using a JSON Object.
| .stream(defaultValue.spliterator(), false) | ||
| .collect(Collectors.toList()); | ||
| JsonNode keys = keysAndValues.get(0); | ||
| JsonNode values = keysAndValues.get(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the spec, the JSON node should be an object with two fields: keys and values. I think it would be much easier to validate that the node is an object and then read the fields, rather than trying to convert to a list. This needs to respect the names, not the order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, fixed.
|
Hi @rdblue I've updated the PR addressing your comments, could you please take a look again? Thanks! |
| switch (type.typeId()) { | ||
| case BOOLEAN: | ||
| Preconditions.checkArgument(defaultValue.isBoolean(), | ||
| "Cannot parse %s to a %s value", defaultValue, type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're trying to copy the error messages from JsonUtil, but removed the wrong %s. The user value goes after the error message and should not be embedded in it. The Cannot parse %s in JsonUtil tells the reader which field was being parsed, like Cannot parse snapshot-id to a long value: null.
This should be "Cannot parse default as a %s value: %s", type, defaultValue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, as long as this is in a type branch, you should just embed the type string: "Cannot parse default as a boolean value: %s", defaultValue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't reusing each type's type.toString() method better? As I see that is defined for each type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't feel strongly about this. Either way is fine, but the first comment about the error message should be fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done refactoring it to Cannot parse default as a %s value: %s", type, defaultValue
| "Cannot parse %s to a %s value", defaultValue, type); | ||
| return defaultValue.longValue(); | ||
| case FLOAT: | ||
| Preconditions.checkArgument(defaultValue.isNumber(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should check isFloatingPointNumber
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That means when the user can't specify the float value as 1, but instead need to specify 1.0 for the parser to parse. Do you think this restriction is preferred?
| case DECIMAL: | ||
| Preconditions.checkArgument(defaultValue.isNumber(), | ||
| "Cannot parse %s to a %s value", defaultValue, type); | ||
| return defaultValue.decimalValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this also needs to validate that the decimal's scale matches the expected scale. That must always match or else it should throw an exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
| case UUID: | ||
| Preconditions.checkArgument(defaultValue.isTextual(), | ||
| "Cannot parse %s to a %s value", defaultValue, type); | ||
| return UUID.fromString(defaultValue.textValue()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should validate that the string's length is the length of a UUID string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UUID.fromString is already doing such validation, let me use a try-catch block to print out the error message with the same format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it isn't:
public static UUID fromString(String var0) {
String[] var1 = var0.split("-");
if (var1.length != 5) {
throw new IllegalArgumentException("Invalid UUID string: " + var0);
} else {
for(int var2 = 0; var2 < 5; ++var2) {
var1[var2] = "0x" + var1[var2];
}
long var6 = Long.decode(var1[0]);
var6 <<= 16;
var6 |= Long.decode(var1[1]);
var6 <<= 16;
var6 |= Long.decode(var1[2]);
long var4 = Long.decode(var1[3]);
var4 <<= 48;
var4 |= Long.decode(var1[4]);
return new UUID(var6, var4);
}
}It's just validating that there are enough parts, and decoding those parts.
Instead of doing the try/catch thing (which has its own problem) I think you should check that the string is the right length to be a UUID.
| case FIXED: | ||
| Preconditions.checkArgument( | ||
| defaultValue.isTextual() && defaultValue.textValue().length() == ((Types.FixedType) type).length() * 2, | ||
| "Cannot parse %s to a %s value", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you produce a better error message for when the length is invalid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| "Cannot parse %s to a %s value", defaultValue, type); | ||
| List<Object> defaultList = Lists.newArrayList(); | ||
| for (JsonNode element : defaultValue) { | ||
| defaultList.add(parseDefaultFromJson(type.asListType().elementType(), element)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can move type.asListType().elementType() out of the loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may also be shorter to do it this way:
Type elementType = type.asListType().elementType();
return Lists.newArrayList(Iterables.transform(arrayNode, e -> DefaultValueParser.fromJson(elementType, e)));There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| type); | ||
| Map<Object, Object> defaultMap = Maps.newHashMap(); | ||
| JsonNode keys = defaultValue.get("keys"); | ||
| JsonNode values = defaultValue.get("values"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should check that the size of these array nodes matches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| JsonNode keys = defaultValue.get("keys"); | ||
| JsonNode values = defaultValue.get("values"); | ||
| List<JsonNode> keyList = Lists.newArrayList(keys.iterator()); | ||
| List<JsonNode> valueList = Lists.newArrayList(values.iterator()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It shouldn't be necessary to copy these into lists. Instead, you can iterate over them simultaneously after checking that the size is the same:
ImmutableMap.Builder<Object, Object> mapBuilder = ImmutableMap.builder();
Iterator<JsonNode> keyIter = keys.iterator();
Type keyType = type.asMapType().keyType();
Iterator<JsonNode> valueIter = values.iterator();
Type valueType = type.asMapType().valueType();
while (keyIter.hasNext()) {
mapBuilder.put(fromJson(keyType, keyIter.next()), fromJson(valueType, valueIter.next()));
}
return mapBuilder.build();There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| case STRUCT: | ||
| Preconditions.checkArgument(defaultValue.isObject(), | ||
| "Cannot parse %s to a %s value", defaultValue, type); | ||
| Map<Integer, Object> defaultStruct = Maps.newHashMap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should return a StructLike:
StructType struct = type.asStructType();
StructLike defaultRecord = GenericRecord.create(struct);
List<NestedField> fields = struct.fields();
for (int pos = 0; pos < fields.size(); pos += 1) {
NestedField field = fields.get(pos);
String idString = String.valueOf(field.fieldId());
if (defaultValue.has(idString)) {
defaultRecord.set(pos, fromJson(field.type(), defaultValue.get(idString));
}
}
return defaultRecord;There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, updated.
| } | ||
| return defaultStruct; | ||
| default: | ||
| return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this throw an exception if the type is not supported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| Object value = defaultValue.has(fieldIdAsString) ? parseDefaultFromJson( | ||
| subField.type(), | ||
| defaultValue.get(fieldIdAsString)) : null; | ||
| if (value != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, I think we need to handle the child default values. If we make this independent of the child's default value, then there is no way to distinguish between an explicit null default and a missing default after this returns.
When the default is missing and the child field has a default, this should fill in the child's default value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the field can't actually carry a default value right now, I think we can put this off until the next PR.
For the next step, I think this should add the API changes as package-private so we can add handling for child defaults in the same package. We can move the parser and make more things public as we make progress.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think handling child default will require a second pass to traverse the schema (with default) again. I plan to have another PR that implements a SchemaVisitor handle this.
| case TIMESTAMP: | ||
| Preconditions.checkArgument(defaultValue.isTextual(), | ||
| "Cannot parse %s to a %s value", defaultValue, type); | ||
| return Literal.of(defaultValue.textValue()).to(type).value(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than using Literal, could you just refactor to add the conversions to DateTimeUtil like the to string conversions? That way we have both in the util.
| (int) (micros % 1000000) * 1000, ZoneOffset.UTC).format(DateTimeFormatter.ISO_LOCAL_DATE_TIME); | ||
| if (withUTCZone) { | ||
| // We standardize the format by always using the UTC zone | ||
| return LocalDateTime.parse(localDateTime, DateTimeFormatter.ISO_LOCAL_DATE_TIME) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not produce a string and then parse it. Instead, it should update the conversion above to go directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, refactored.
| } | ||
|
|
||
| @SuppressWarnings("checkstyle:CyclomaticComplexity") | ||
| public static Object parseDefaultFromJson(Type type, JsonNode defaultValue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rename this fromJson? And also add the variations of the method that accept String.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
| } | ||
| } | ||
|
|
||
| public static Object convertJavaDefaultForSerialization(Type type, Object value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like the other parsers, this method should be passed a JsonGenerator that handles creating the JSON string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored it to be:
public static String toJson(Type type, Object javaDefaultValue) throws IOException {
return JsonUtil.mapper().writeValueAsString(DefaultValueParser.convertJavaDefaultForSerialization(
type,
javaDefaultValue));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rzhang10, please look at the other parsers and match what they do. You should be using a JsonGenerator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I refactored to use JsonGenerator.
| convertedDefault.put("values", valueList); | ||
| return convertedDefault; | ||
| case STRUCT: | ||
| Map<Integer, Object> defaultStruct = (Map<Integer, Object>) value; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should deconstruct a StructLike, not a map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| private static String defaultValueParseAndUnParseRoundTrip(Type type, JsonNode defaultValue) | ||
| throws JsonProcessingException { | ||
| Object javaDefaultValue = DefaultValueParser.parseDefaultFromJson(type, defaultValue); | ||
| String jsonDefaultValue = JsonUtil.mapper() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parser should produce and accept strings, rather than doing it here in tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, makes sense, refactored.
| {Types.StructType.of( | ||
| required(1, "f1", Types.IntegerType.get(), "doc"), | ||
| optional(2, "f2", Types.StringType.get(), "doc")), | ||
| stringToJsonNode("{\"1\": 1, \"2\": \"bar\"}")} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add test cases for nested types? One of each (list, map, struct) that contains a struct would be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
| import static org.apache.iceberg.types.Types.NestedField.required; | ||
|
|
||
| @RunWith(Parameterized.class) | ||
| public class TestDefaultValuesParsingAndUnParsing { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should also have a few tests for cases that are caught above, like maps with different length key and value lists, binary and fixed values that are not the right length, UUID values that are not actually UUIDs, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
368a321 to
16e9559
Compare
|
@rdblue I've addressed the comments and rebased on master and also did a |
|
Thanks, @rzhang10! The latest changes look good. I merged this. |
|
hi @rdblue / @rzhang10, are there more PRs to be developed before we can support default value in Iceberg? I read in PR 4732 and it seems that these are still pending to create but just want to double check and confirm. Also, is there a place I can have a holistic view of the full issue? It seems that issue 2039 was the one but it was not updated. |
|
Hi @shiyancao, yes more PRs are underway for support reading default values in engines with different formats (Avro/ORC/Parquet), we will start with implementing support for Spark first. |
(cherry picked from commit 3a9e0a6)
This PR adds the default value parsing and un-parsing (serialization) from/to its JSON representation, as per spec #4301 .
@rdblue Please review, cc @wmoustafa