Conversation

@rzhang10 (Contributor):

This PR adds default value parsing and un-parsing (serialization) from/to the JSON representation, per the spec change in #4301.

@rdblue Please review, cc @wmoustafa

@github-actions github-actions bot added the core label May 25, 2022
}

public static JsonNode validateDefault(Type type, JsonNode defaultValue) {
if (defaultValue != null && !isValidDefault(type, defaultValue)) {
Contributor:

I think this should match the style of the other JSON parsers, which don't do this work twice. Here, you're using a switch statement on the type to validate, and then using a switch statement on the type to extract the value. Instead, I think this should have one method that keeps the logic for each type in the same place.

Contributor Author:

Sure, let me do this.
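
For illustration, the single-pass shape being discussed could look roughly like this, with each type branch validating and extracting in one place (branch bodies and messages here are illustrative, not the final code):

  switch (type.typeId()) {
    case BOOLEAN:
      // validate and extract in the same branch
      Preconditions.checkArgument(defaultValue.isBoolean(),
          "Cannot parse default as a boolean value: %s", defaultValue);
      return defaultValue.booleanValue();
    case STRING:
      Preconditions.checkArgument(defaultValue.isTextual(),
          "Cannot parse default as a string value: %s", defaultValue);
      return defaultValue.textValue();
    // ... remaining types follow the same pattern
  }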

public static Object parseDefaultFromJson(Type type, JsonNode defaultValue) {
validateDefault(type, defaultValue);

if (defaultValue == null) {
Contributor:

This needs to check isNull as well.

Contributor Author:

Fixed.
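
For reference, a sketch of the combined check (assuming Jackson's JsonNode API), treating a missing node and an explicit JSON null the same way:

  if (defaultValue == null || defaultValue.isNull()) {
    return null;
  }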

return Literal.of(defaultValue.textValue()).to(type).value();
case FIXED:
byte[] fixedBytes = BaseEncoding.base16().decode(defaultValue.textValue().toUpperCase(Locale.ROOT).replaceFirst(
"^0X",
Contributor:

What is the value of 0x? I think I'd rather just remove it than have all this extra handling for it.

Contributor Author:

Sure, we can remove this. Should we also update the spec accordingly?

Contributor:

Yes, please update the spec as well.
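
With the 0X prefix dropped from the spec, the decode could reduce to something like this (a sketch, assuming the default is stored as a plain hex string):

  byte[] fixedBytes = BaseEncoding.base16().decode(defaultValue.textValue().toUpperCase(Locale.ROOT));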

byte[] fixedBytes = BaseEncoding.base16().decode(defaultValue.textValue().toUpperCase(Locale.ROOT).replaceFirst(
"^0X",
""));
return ByteBuffer.allocate(((Types.FixedType) type).length()).put(fixedBytes);
Contributor:

This needs to validate the length of the byte array.

Contributor Author:

Sure, fixed.
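
One possible shape for the length check (a sketch; the message wording is illustrative):

  int expectedLength = ((Types.FixedType) type).length();
  Preconditions.checkArgument(fixedBytes.length == expectedLength,
      "Cannot parse default %s: expected %s bytes for %s, but found %s",
      defaultValue, expectedLength, type, fixedBytes.length);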

Contributor:

Why not use ByteBuffer.wrap here?

Contributor Author:

Done.
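
The suggested change is essentially (sketch):

  return ByteBuffer.wrap(fixedBytes);

Besides avoiding the extra allocation and copy, wrap leaves the buffer position at 0, whereas allocate(...).put(...) returns a buffer positioned at its end.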

case MAP:
Map<Object, Object> defaultMap = Maps.newHashMap();
List<JsonNode> keysAndValues = StreamSupport
.stream(defaultValue.spliterator(), false)
Contributor:

Iceberg code should not use spliterator. Can you find another way?

Contributor Author:

This is no longer needed since we represent the map using a JSON object.

.stream(defaultValue.spliterator(), false)
.collect(Collectors.toList());
JsonNode keys = keysAndValues.get(0);
JsonNode values = keysAndValues.get(1);
Contributor:

According to the spec, the JSON node should be an object with two fields: keys and values. I think it would be much easier to validate that the node is an object and then read the fields, rather than trying to convert to a list. This needs to respect the names, not the order.

Contributor Author:

Sure, fixed.
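
Reading the map default by field name rather than by position could look like this (a sketch; the error message is illustrative):

  Preconditions.checkArgument(
      defaultValue.isObject() && defaultValue.has("keys") && defaultValue.has("values"),
      "Cannot parse default as a map value: %s", defaultValue);
  JsonNode keys = defaultValue.get("keys");
  JsonNode values = defaultValue.get("values");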

@rzhang10 (Contributor Author):

Hi @rdblue, I've updated the PR to address your comments. Could you please take a look again? Thanks!

switch (type.typeId()) {
case BOOLEAN:
Preconditions.checkArgument(defaultValue.isBoolean(),
"Cannot parse %s to a %s value", defaultValue, type);
Contributor:

I think you're trying to copy the error messages from JsonUtil, but removed the wrong %s. The user value goes after the error message and should not be embedded in it. The "Cannot parse %s" in JsonUtil tells the reader which field was being parsed, as in "Cannot parse snapshot-id to a long value: null".

This should be "Cannot parse default as a %s value: %s", type, defaultValue.

Contributor:

Actually, as long as this is in a type branch, you should just embed the type string: "Cannot parse default as a boolean value: %s", defaultValue

Contributor Author (@rzhang10, Jun 8, 2022):

Wouldn't reusing each type's type.toString() method be better? As I see it, that is defined for each type.

Contributor:

I don't feel strongly about this. Either way is fine, but the first comment about the error message should be fixed.

Contributor Author:

Done, refactored it to "Cannot parse default as a %s value: %s", type, defaultValue.

"Cannot parse %s to a %s value", defaultValue, type);
return defaultValue.longValue();
case FLOAT:
Preconditions.checkArgument(defaultValue.isNumber(),
Contributor:

I think this should check isFloatingPointNumber

Contributor Author:

That means the user can't specify a float value as 1 and instead needs to specify 1.0 for the parser to accept it. Do you think this restriction is preferred?
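
For context, the stricter check under discussion would look roughly like this (sketch):

  // isFloatingPointNumber accepts 1.0 but rejects the integer literal 1; isNumber accepts both.
  Preconditions.checkArgument(defaultValue.isFloatingPointNumber(),
      "Cannot parse default as a float value: %s", defaultValue);
  return defaultValue.floatValue();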

case DECIMAL:
Preconditions.checkArgument(defaultValue.isNumber(),
"Cannot parse %s to a %s value", defaultValue, type);
return defaultValue.decimalValue();
Contributor:

I think this also needs to validate that the decimal's scale matches the expected scale. That must always match or else it should throw an exception.

Contributor Author:

fixed.
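
A possible shape for the scale check (sketch):

  Preconditions.checkArgument(defaultValue.isNumber() &&
          defaultValue.decimalValue().scale() == ((Types.DecimalType) type).scale(),
      "Cannot parse default as a %s value: %s", type, defaultValue);
  return defaultValue.decimalValue();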

case UUID:
Preconditions.checkArgument(defaultValue.isTextual(),
"Cannot parse %s to a %s value", defaultValue, type);
return UUID.fromString(defaultValue.textValue());
Contributor:

I think this should validate that the string's length is the length of a UUID string.

Contributor Author:

UUID.fromString already does such validation. Should I use a try-catch block to produce the error message in the same format?

Contributor:

Actually, it isn't:

    public static UUID fromString(String var0) {
        String[] var1 = var0.split("-");
        if (var1.length != 5) {
            throw new IllegalArgumentException("Invalid UUID string: " + var0);
        } else {
            for(int var2 = 0; var2 < 5; ++var2) {
                var1[var2] = "0x" + var1[var2];
            }

            long var6 = Long.decode(var1[0]);
            var6 <<= 16;
            var6 |= Long.decode(var1[1]);
            var6 <<= 16;
            var6 |= Long.decode(var1[2]);
            long var4 = Long.decode(var1[3]);
            var4 <<= 48;
            var4 |= Long.decode(var1[4]);
            return new UUID(var6, var4);
        }
    }

It's just validating that there are enough parts, and decoding those parts.

Instead of doing the try/catch thing (which has its own problem) I think you should check that the string is the right length to be a UUID.
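
A minimal version of that length check (a sketch; 36 is the length of the canonical 8-4-4-4-12 UUID form):

  Preconditions.checkArgument(defaultValue.isTextual() && defaultValue.textValue().length() == 36,
      "Cannot parse default as a uuid value: %s", defaultValue);
  return UUID.fromString(defaultValue.textValue());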

case FIXED:
Preconditions.checkArgument(
defaultValue.isTextual() && defaultValue.textValue().length() == ((Types.FixedType) type).length() * 2,
"Cannot parse %s to a %s value",
Contributor:

Can you produce a better error message for when the length is invalid?

Contributor Author:

fixed

"Cannot parse %s to a %s value", defaultValue, type);
List<Object> defaultList = Lists.newArrayList();
for (JsonNode element : defaultValue) {
defaultList.add(parseDefaultFromJson(type.asListType().elementType(), element));
Contributor:

You can move type.asListType().elementType() out of the loop.

Contributor:

It may also be shorter to do it this way:

Type elementType = type.asListType().elementType();
return Lists.newArrayList(Iterables.transform(arrayNode, e -> DefaultValueParser.fromJson(elementType, e)));

Contributor Author:

done

type);
Map<Object, Object> defaultMap = Maps.newHashMap();
JsonNode keys = defaultValue.get("keys");
JsonNode values = defaultValue.get("values");
Contributor:

I think you should check that the size of these array nodes matches.

Contributor Author:

done
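
The size check can be a single precondition (a sketch; the message is illustrative):

  Preconditions.checkArgument(keys.size() == values.size(),
      "Cannot parse default as a map value, mismatched key and value counts: %s", defaultValue);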

JsonNode keys = defaultValue.get("keys");
JsonNode values = defaultValue.get("values");
List<JsonNode> keyList = Lists.newArrayList(keys.iterator());
List<JsonNode> valueList = Lists.newArrayList(values.iterator());
Contributor:

It shouldn't be necessary to copy these into lists. Instead, you can iterate over them simultaneously after checking that the size is the same:

  ImmutableMap.Builder<Object, Object> mapBuilder = ImmutableMap.builder();

  Iterator<JsonNode> keyIter = keys.iterator();
  Type keyType = type.asMapType().keyType();
  Iterator<JsonNode> valueIter = values.iterator();
  Type valueType = type.asMapType().valueType();

  while (keyIter.hasNext()) {
    mapBuilder.put(fromJson(keyType, keyIter.next()), fromJson(valueType, valueIter.next()));
  }

  return mapBuilder.build();

Contributor Author:

done

case STRUCT:
Preconditions.checkArgument(defaultValue.isObject(),
"Cannot parse %s to a %s value", defaultValue, type);
Map<Integer, Object> defaultStruct = Maps.newHashMap();
Contributor:

This should return a StructLike:

  StructType struct = type.asStructType();
  StructLike defaultRecord = GenericRecord.create(struct);

  List<NestedField> fields = struct.fields();
  for (int pos = 0; pos < fields.size(); pos += 1) {
    NestedField field = fields.get(pos);
    String idString = String.valueOf(field.fieldId());
    if (defaultValue.has(idString)) {
      defaultRecord.set(pos, fromJson(field.type(), defaultValue.get(idString)));
    }
  }

  return defaultRecord;

Contributor Author:

Makes sense, updated.

}
return defaultStruct;
default:
return null;
Contributor:

Shouldn't this throw an exception if the type is not supported?

Contributor Author:

fixed
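
The fallthrough being asked for is just (a sketch; exception type and message are illustrative):

  default:
    throw new UnsupportedOperationException(String.format("Type: %s is not supported", type));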

Object value = defaultValue.has(fieldIdAsString) ? parseDefaultFromJson(
subField.type(),
defaultValue.get(fieldIdAsString)) : null;
if (value != null) {
Contributor:

Here, I think we need to handle the child default values. If we make this independent of the child's default value, then there is no way to distinguish between an explicit null default and a missing default after this returns.

When the default is missing and the child field has a default, this should fill in the child's default value.

Contributor:

Since the field can't actually carry a default value right now, I think we can put this off until the next PR.

For the next step, I think this should add the API changes as package-private so we can add handling for child defaults in the same package. We can move the parser and make more things public as we make progress.

Contributor Author:

I think handling child defaults will require a second pass to traverse the schema (with defaults). I plan to have another PR that implements a SchemaVisitor to handle this.

case TIMESTAMP:
Preconditions.checkArgument(defaultValue.isTextual(),
"Cannot parse %s to a %s value", defaultValue, type);
return Literal.of(defaultValue.textValue()).to(type).value();
Contributor:

Rather than using Literal, could you just refactor to add these conversions to DateTimeUtil, like the to-string conversions? That way we have both in the util.

(int) (micros % 1000000) * 1000, ZoneOffset.UTC).format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
if (withUTCZone) {
// We standardize the format by always using the UTC zone
return LocalDateTime.parse(localDateTime, DateTimeFormatter.ISO_LOCAL_DATE_TIME)
Contributor:

This should not produce a string and then parse it. Instead, it should update the conversion above to go directly.

Contributor Author:

Got it, refactored.
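
One way to go directly from micros to an OffsetDateTime without the intermediate string (a sketch; the helper name is illustrative, not necessarily what DateTimeUtil ended up with):

  // Hypothetical helper: microseconds since epoch -> OffsetDateTime at UTC, no format-then-parse round trip.
  static OffsetDateTime timestamptzFromMicros(long micros) {
    return ChronoUnit.MICROS.addTo(Instant.EPOCH.atOffset(ZoneOffset.UTC), micros);
  }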

}

@SuppressWarnings("checkstyle:CyclomaticComplexity")
public static Object parseDefaultFromJson(Type type, JsonNode defaultValue) {
Contributor:

Can you rename this fromJson? And also add the variations of the method that accept String.

Contributor Author:

sure
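
The String-accepting variant commonly mirrors the other parsers, roughly (a sketch; the exact error handling is illustrative):

  public static Object fromJson(Type type, String defaultValue) {
    try {
      JsonNode defaultValueNode = JsonUtil.mapper().readTree(defaultValue);
      return fromJson(type, defaultValueNode);
    } catch (IOException e) {
      throw new UncheckedIOException("Failed to parse: " + defaultValue, e);
    }
  }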

}
}

public static Object convertJavaDefaultForSerialization(Type type, Object value) {
Contributor:

Like the other parsers, this method should be passed a JsonGenerator that handles creating the JSON string.

Contributor Author (@rzhang10, Jun 9, 2022):

I refactored it to be:

  public static String toJson(Type type, Object javaDefaultValue) throws IOException {
    return JsonUtil.mapper().writeValueAsString(DefaultValueParser.convertJavaDefaultForSerialization(
        type,
        javaDefaultValue));
  }

Contributor:

@rzhang10, please look at the other parsers and match what they do. You should be using a JsonGenerator.

Contributor Author:

Sure, I refactored to use JsonGenerator.
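
The JsonGenerator pattern used by the other Iceberg parsers looks roughly like this (a sketch; JsonUtil.factory() and the three-argument toJson are assumptions here):

  public static String toJson(Type type, Object defaultValue) {
    try (StringWriter writer = new StringWriter();
        JsonGenerator generator = JsonUtil.factory().createGenerator(writer)) {
      toJson(type, defaultValue, generator);  // writes the value using generator.write* calls
      generator.flush();
      return writer.toString();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }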

convertedDefault.put("values", valueList);
return convertedDefault;
case STRUCT:
Map<Integer, Object> defaultStruct = (Map<Integer, Object>) value;
Contributor:

This should deconstruct a StructLike, not a map.

Contributor Author:

Done
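
Deconstructing a StructLike positionally for serialization might look like this (a sketch, assuming a JsonGenerator-based toJson as above):

  // Walk the struct fields positionally and write each present value under its field id.
  Types.StructType structType = type.asStructType();
  StructLike structValue = (StructLike) value;
  generator.writeStartObject();
  for (int pos = 0; pos < structType.fields().size(); pos += 1) {
    Types.NestedField field = structType.fields().get(pos);
    Object fieldValue = structValue.get(pos, Object.class);
    if (fieldValue != null) {
      generator.writeFieldName(String.valueOf(field.fieldId()));
      toJson(field.type(), fieldValue, generator);
    }
  }
  generator.writeEndObject();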

private static String defaultValueParseAndUnParseRoundTrip(Type type, JsonNode defaultValue)
throws JsonProcessingException {
Object javaDefaultValue = DefaultValueParser.parseDefaultFromJson(type, defaultValue);
String jsonDefaultValue = JsonUtil.mapper()
Contributor:

The parser should produce and accept strings, rather than doing it here in tests.

Contributor Author:

Got it, makes sense, refactored.

{Types.StructType.of(
required(1, "f1", Types.IntegerType.get(), "doc"),
optional(2, "f2", Types.StringType.get(), "doc")),
stringToJsonNode("{\"1\": 1, \"2\": \"bar\"}")}
Contributor:

Can you add test cases for nested types? One of each (list, map, struct) that contains a struct would be good.

Contributor Author:

Added.
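
For example, a list-of-structs case can be expressed in the same parameter style (a sketch; field IDs and values are illustrative):

  {Types.ListType.ofOptional(1, Types.StructType.of(
      required(2, "f1", Types.IntegerType.get()),
      optional(3, "f2", Types.StringType.get()))),
   stringToJsonNode("[{\"2\": 1, \"3\": \"bar\"}, {\"2\": 2, \"3\": \"baz\"}]")}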

import static org.apache.iceberg.types.Types.NestedField.required;

@RunWith(Parameterized.class)
public class TestDefaultValuesParsingAndUnParsing {
Contributor:

I think this should also have a few tests for cases that are caught above, like maps with different length key and value lists, binary and fixed values that are not the right length, UUID values that are not actually UUIDs, etc.

Contributor Author:

added
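
One of the negative cases might look like this (a sketch in plain JUnit 4; helper and parser names follow the test class above and are otherwise assumptions):

  @Test
  public void testMapDefaultWithMismatchedKeysAndValues() throws Exception {
    Types.MapType mapType = Types.MapType.ofRequired(1, 2, Types.IntegerType.get(), Types.StringType.get());
    JsonNode defaultValue = stringToJsonNode("{\"keys\": [1, 2], \"values\": [\"a\"]}");
    Assert.assertThrows(IllegalArgumentException.class,
        () -> DefaultValueParser.fromJson(mapType, defaultValue));
  }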

@rzhang10 force-pushed the default_value_parse_unparse branch from 368a321 to 16e9559 on August 1, 2022 at 21:16
@rzhang10 (Contributor Author) commented Aug 1, 2022:

@rdblue I've addressed the comments, rebased on master, and ran spotlessApply. Could you review again?

@rdblue rdblue merged commit 3a9e0a6 into apache:master Aug 2, 2022
@rdblue (Contributor) commented Aug 2, 2022:

Thanks, @rzhang10! The latest changes look good. I merged this.

@shiyancao:

Hi @rdblue / @rzhang10, are there more PRs to be developed before we can support default values in Iceberg?

I read PR 4732 and it seems that these items are still pending, but I just want to double-check and confirm:

Add the JSON value parser
Add as much as possible to Parquet, Avro, and ORC readers, like being able to read with a fake map of default values.

Also, is there a place where I can get a holistic view of the full effort? It seems issue 2039 was the one, but it has not been updated.

@rzhang10 (Contributor Author) commented Sep 8, 2022:

Hi @shiyancao, yes, more PRs are underway to support reading default values in engines across the different formats (Avro/ORC/Parquet); we will start by implementing support for Spark first.

abmo-x pushed a commit to abmo-x/iceberg that referenced this pull request Oct 21, 2022
zhongyujiang pushed a commit to zhongyujiang/iceberg that referenced this pull request Apr 16, 2025