Conversation

@rzhang10 (Contributor):

This PR adds default value parsing and un-parsing (serialization) from/to the JSON representation, per the spec change in #4301.

@rdblue Please review, cc @wmoustafa

@github-actions github-actions bot added the core label May 25, 2022
}

public static JsonNode validateDefault(Type type, JsonNode defaultValue) {
if (defaultValue != null && !isValidDefault(type, defaultValue)) {
Contributor:

I think this should match the style of the other JSON parsers, which don't do this work twice. Here, you're using a switch statement on the type to validate, and then using a switch statement on the type to extract the value. Instead, I think this should have one method that keeps the logic for each type in the same place.

Contributor Author:

Sure, let me do this.
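
For illustration, the single-pass shape being discussed could look roughly like this, with each type branch validating and extracting in one place (branch bodies and messages here are illustrative, not the final code):

  switch (type.typeId()) {
    case BOOLEAN:
      // validate and extract in the same branch
      Preconditions.checkArgument(defaultValue.isBoolean(),
          "Cannot parse default as a boolean value: %s", defaultValue);
      return defaultValue.booleanValue();
    case STRING:
      Preconditions.checkArgument(defaultValue.isTextual(),
          "Cannot parse default as a string value: %s", defaultValue);
      return defaultValue.textValue();
    // ... remaining types follow the same pattern
  }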

public static Object parseDefaultFromJson(Type type, JsonNode defaultValue) {
validateDefault(type, defaultValue);

if (defaultValue == null) {
Contributor:

This needs to check isNull as well.

Contributor Author:

Fixed.
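
For reference, a sketch of the combined check (assuming Jackson's JsonNode API), treating a missing node and an explicit JSON null the same way:

  if (defaultValue == null || defaultValue.isNull()) {
    return null;
  }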

return Literal.of(defaultValue.textValue()).to(type).value();
case FIXED:
byte[] fixedBytes = BaseEncoding.base16().decode(defaultValue.textValue().toUpperCase(Locale.ROOT).replaceFirst(
"^0X",
Contributor:

What is the value of 0x? I think I'd rather just remove it than have all this extra handling for it.

Contributor Author:

Sure, we can remove this. Should we also update the spec accordingly?

Contributor:

Yes, please update the spec as well.
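
With the 0X prefix dropped from the spec, the decode could reduce to something like this (a sketch, assuming the default is stored as a plain hex string):

  byte[] fixedBytes = BaseEncoding.base16().decode(defaultValue.textValue().toUpperCase(Locale.ROOT));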

byte[] fixedBytes = BaseEncoding.base16().decode(defaultValue.textValue().toUpperCase(Locale.ROOT).replaceFirst(
"^0X",
""));
return ByteBuffer.allocate(((Types.FixedType) type).length()).put(fixedBytes);
Contributor:

This needs to validate the length of the byte array.

Contributor Author:

Sure, fixed.
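
One possible shape for the length check (a sketch; the message wording is illustrative):

  int expectedLength = ((Types.FixedType) type).length();
  Preconditions.checkArgument(fixedBytes.length == expectedLength,
      "Cannot parse default %s: expected %s bytes for %s, but found %s",
      defaultValue, expectedLength, type, fixedBytes.length);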

Contributor:

Why not use ByteBuffer.wrap here?

Contributor Author:

Done.
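
The suggested change is essentially (sketch):

  return ByteBuffer.wrap(fixedBytes);

Besides avoiding the extra allocation and copy, wrap leaves the buffer position at 0, whereas allocate(...).put(...) returns a buffer positioned at its end.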

case MAP:
Map<Object, Object> defaultMap = Maps.newHashMap();
List<JsonNode> keysAndValues = StreamSupport
.stream(defaultValue.spliterator(), false)
Contributor:

Iceberg code should not use spliterator. Can you find another way?

Contributor Author:

This is no longer needed since we represent the map using a JSON object.

.stream(defaultValue.spliterator(), false)
.collect(Collectors.toList());
JsonNode keys = keysAndValues.get(0);
JsonNode values = keysAndValues.get(1);
Contributor:

According to the spec, the JSON node should be an object with two fields: keys and values. I think it would be much easier to validate that the node is an object and then read the fields, rather than trying to convert to a list. This needs to respect the names, not the order.

Contributor Author:

Sure, fixed.
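
Reading the map default by field name rather than by position could look like this (a sketch; the error message is illustrative):

  Preconditions.checkArgument(
      defaultValue.isObject() && defaultValue.has("keys") && defaultValue.has("values"),
      "Cannot parse default as a map value: %s", defaultValue);
  JsonNode keys = defaultValue.get("keys");
  JsonNode values = defaultValue.get("values");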

@rzhang10 (Contributor Author):

Hi @rdblue, I've updated the PR to address your comments. Could you please take a look again? Thanks!

switch (type.typeId()) {
case BOOLEAN:
Preconditions.checkArgument(defaultValue.isBoolean(),
"Cannot parse %s to a %s value", defaultValue, type);
Contributor:

I think you're trying to copy the error messages from JsonUtil, but removed the wrong %s. The user value goes after the error message and should not be embedded in it. The "Cannot parse %s" in JsonUtil tells the reader which field was being parsed, as in "Cannot parse snapshot-id to a long value: null".

This should be "Cannot parse default as a %s value: %s", type, defaultValue.

Contributor:

Actually, as long as this is in a type branch, you should just embed the type string: "Cannot parse default as a boolean value: %s", defaultValue

Contributor Author (@rzhang10, Jun 8, 2022):

Wouldn't reusing each type's type.toString() method be better? As I see it, that is defined for each type.

Contributor:

I don't feel strongly about this. Either way is fine, but the first comment about the error message should be fixed.

Contributor Author:

Done, refactored it to "Cannot parse default as a %s value: %s", type, defaultValue.

"Cannot parse %s to a %s value", defaultValue, type);
return defaultValue.longValue();
case FLOAT:
Preconditions.checkArgument(defaultValue.isNumber(),
Contributor:

I think this should check isFloatingPointNumber

Contributor Author:

That means the user can't specify a float value as 1 and instead needs to specify 1.0 for the parser to accept it. Do you think this restriction is preferred?
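
For context, the stricter check under discussion would look roughly like this (sketch):

  // isFloatingPointNumber accepts 1.0 but rejects the integer literal 1; isNumber accepts both.
  Preconditions.checkArgument(defaultValue.isFloatingPointNumber(),
      "Cannot parse default as a float value: %s", defaultValue);
  return defaultValue.floatValue();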

case DECIMAL:
Preconditions.checkArgument(defaultValue.isNumber(),
"Cannot parse %s to a %s value", defaultValue, type);
return defaultValue.decimalValue();
Contributor:

I think this also needs to validate that the decimal's scale matches the expected scale. That must always match or else it should throw an exception.

Contributor Author:

fixed.
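
A possible shape for the scale check (sketch):

  Preconditions.checkArgument(defaultValue.isNumber() &&
          defaultValue.decimalValue().scale() == ((Types.DecimalType) type).scale(),
      "Cannot parse default as a %s value: %s", type, defaultValue);
  return defaultValue.decimalValue();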

case UUID:
Preconditions.checkArgument(defaultValue.isTextual(),
"Cannot parse %s to a %s value", defaultValue, type);
return UUID.fromString(defaultValue.textValue());
Contributor:

I think this should validate that the string's length is the length of a UUID string.

Contributor Author:

UUID.fromString already does such validation. Should I use a try-catch block to produce the error message in the same format?

Contributor:

Actually, it isn't:

    public static UUID fromString(String var0) {
        String[] var1 = var0.split("-");
        if (var1.length != 5) {
            throw new IllegalArgumentException("Invalid UUID string: " + var0);
        } else {
            for(int var2 = 0; var2 < 5; ++var2) {
                var1[var2] = "0x" + var1[var2];
            }

            long var6 = Long.decode(var1[0]);
            var6 <<= 16;
            var6 |= Long.decode(var1[1]);
            var6 <<= 16;
            var6 |= Long.decode(var1[2]);
            long var4 = Long.decode(var1[3]);
            var4 <<= 48;
            var4 |= Long.decode(var1[4]);
            return new UUID(var6, var4);
        }
    }

It's just validating that there are enough parts, and decoding those parts.

Instead of doing the try/catch thing (which has its own problem) I think you should check that the string is the right length to be a UUID.
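
A minimal version of that length check (a sketch; 36 is the length of the canonical 8-4-4-4-12 UUID form):

  Preconditions.checkArgument(defaultValue.isTextual() && defaultValue.textValue().length() == 36,
      "Cannot parse default as a uuid value: %s", defaultValue);
  return UUID.fromString(defaultValue.textValue());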

case FIXED:
Preconditions.checkArgument(
defaultValue.isTextual() && defaultValue.textValue().length() == ((Types.FixedType) type).length() * 2,
"Cannot parse %s to a %s value",
Contributor:

Can you produce a better error message for when the length is invalid?

Contributor Author:

fixed

"Cannot parse %s to a %s value", defaultValue, type);
List<Object> defaultList = Lists.newArrayList();
for (JsonNode element : defaultValue) {
defaultList.add(parseDefaultFromJson(type.asListType().elementType(), element));
Contributor:

You can move type.asListType().elementType() out of the loop.

Contributor:

It may also be shorter to do it this way:

Type elementType = type.asListType().elementType();
return Lists.newArrayList(Iterables.transform(arrayNode, e -> DefaultValueParser.fromJson(elementType, e)));

Contributor Author:

done

type);
Map<Object, Object> defaultMap = Maps.newHashMap();
JsonNode keys = defaultValue.get("keys");
JsonNode values = defaultValue.get("values");
Contributor:

I think you should check that the size of these array nodes matches.

Contributor Author:

done
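
The size check can be a single precondition (a sketch; the message is illustrative):

  Preconditions.checkArgument(keys.size() == values.size(),
      "Cannot parse default as a map value, mismatched key and value counts: %s", defaultValue);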

JsonNode keys = defaultValue.get("keys");
JsonNode values = defaultValue.get("values");
List<JsonNode> keyList = Lists.newArrayList(keys.iterator());
List<JsonNode> valueList = Lists.newArrayList(values.iterator());
Contributor:

It shouldn't be necessary to copy these into lists. Instead, you can iterate over them simultaneously after checking that the size is the same:

  ImmutableMap.Builder<Object, Object> mapBuilder = ImmutableMap.builder();

  Iterator<JsonNode> keyIter = keys.iterator();
  Type keyType = type.asMapType().keyType();
  Iterator<JsonNode> valueIter = values.iterator();
  Type valueType = type.asMapType().valueType();

  while (keyIter.hasNext()) {
    mapBuilder.put(fromJson(keyType, keyIter.next()), fromJson(valueType, valueIter.next()));
  }

  return mapBuilder.build();

Contributor Author:

done

case STRUCT:
Preconditions.checkArgument(defaultValue.isObject(),
"Cannot parse %s to a %s value", defaultValue, type);
Map<Integer, Object> defaultStruct = Maps.newHashMap();
Contributor:

This should return a StructLike:

  StructType struct = type.asStructType();
  StructLike defaultRecord = GenericRecord.create(struct);

  List<NestedField> fields = struct.fields();
  for (int pos = 0; pos < fields.size(); pos += 1) {
    NestedField field = fields.get(pos);
    String idString = String.valueOf(field.fieldId());
    if (defaultValue.has(idString)) {
      defaultRecord.set(pos, fromJson(field.type(), defaultValue.get(idString)));
    }
  }

  return defaultRecord;

Contributor Author:

Makes sense, updated.

}
return defaultStruct;
default:
return null;
Contributor:

Shouldn't this throw an exception if the type is not supported?

Contributor Author:

fixed
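
The fallthrough being asked for is just (a sketch; exception type and message are illustrative):

  default:
    throw new UnsupportedOperationException(String.format("Type: %s is not supported", type));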

Object value = defaultValue.has(fieldIdAsString) ? parseDefaultFromJson(
subField.type(),
defaultValue.get(fieldIdAsString)) : null;
if (value != null) {
Contributor:

Here, I think we need to handle the child default values. If we make this independent of the child's default value, then there is no way to distinguish between an explicit null default and a missing default after this returns.

When the default is missing and the child field has a default, this should fill in the child's default value.

Contributor:

Since the field can't actually carry a default value right now, I think we can put this off until the next PR.

For the next step, I think this should add the API changes as package-private so we can add handling for child defaults in the same package. We can move the parser and make more things public as we make progress.

Contributor Author:

I think handling child defaults will require a second pass to traverse the schema (with defaults). I plan to have another PR that implements a SchemaVisitor to handle this.

case TIMESTAMP:
Preconditions.checkArgument(defaultValue.isTextual(),
"Cannot parse %s to a %s value", defaultValue, type);
return Literal.of(defaultValue.textValue()).to(type).value();
Contributor:

Rather than using Literal, could you just refactor to add these conversions to DateTimeUtil, like the to-string conversions? That way we have both in the util.

(int) (micros % 1000000) * 1000, ZoneOffset.UTC).format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
if (withUTCZone) {
// We standardize the format by always using the UTC zone
return LocalDateTime.parse(localDateTime, DateTimeFormatter.ISO_LOCAL_DATE_TIME)
Contributor:

This should not produce a string and then parse it. Instead, it should update the conversion above to go directly.

Contributor Author:

Got it, refactored.
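
One way to go directly from micros to an OffsetDateTime without the intermediate string (a sketch; the helper name is illustrative, not necessarily what DateTimeUtil ended up with):

  // Hypothetical helper: microseconds since epoch -> OffsetDateTime at UTC, no format-then-parse round trip.
  static OffsetDateTime timestamptzFromMicros(long micros) {
    return ChronoUnit.MICROS.addTo(Instant.EPOCH.atOffset(ZoneOffset.UTC), micros);
  }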

}

@SuppressWarnings("checkstyle:CyclomaticComplexity")
public static Object parseDefaultFromJson(Type type, JsonNode defaultValue) {
Contributor:

Can you rename this fromJson? And also add the variations of the method that accept String.

Contributor Author:

sure
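
The String-accepting variant commonly mirrors the other parsers, roughly (a sketch; the exact error handling is illustrative):

  public static Object fromJson(Type type, String defaultValue) {
    try {
      JsonNode defaultValueNode = JsonUtil.mapper().readTree(defaultValue);
      return fromJson(type, defaultValueNode);
    } catch (IOException e) {
      throw new UncheckedIOException("Failed to parse: " + defaultValue, e);
    }
  }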

}
}

public static Object convertJavaDefaultForSerialization(Type type, Object value) {
Contributor:

Like the other parsers, this method should be passed a JsonGenerator that handles creating the JSON string.

Contributor Author (@rzhang10, Jun 9, 2022):

I refactored it to be:

  public static String toJson(Type type, Object javaDefaultValue) throws IOException {
    return JsonUtil.mapper().writeValueAsString(DefaultValueParser.convertJavaDefaultForSerialization(
        type,
        javaDefaultValue));
  }

Contributor:

@rzhang10, please look at the other parsers and match what they do. You should be using a JsonGenerator.

Contributor Author:

Sure, I refactored to use JsonGenerator.
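
The JsonGenerator pattern used by the other Iceberg parsers looks roughly like this (a sketch; JsonUtil.factory() and the three-argument toJson are assumptions here):

  public static String toJson(Type type, Object defaultValue) {
    try (StringWriter writer = new StringWriter();
        JsonGenerator generator = JsonUtil.factory().createGenerator(writer)) {
      toJson(type, defaultValue, generator);  // writes the value using generator.write* calls
      generator.flush();
      return writer.toString();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }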

convertedDefault.put("values", valueList);
return convertedDefault;
case STRUCT:
Map<Integer, Object> defaultStruct = (Map<Integer, Object>) value;
Contributor:

This should deconstruct a StructLike, not a map.

Contributor Author:

Done
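
Deconstructing a StructLike positionally for serialization might look like this (a sketch, assuming a JsonGenerator-based toJson as above):

  // Walk the struct fields positionally and write each present value under its field id.
  Types.StructType structType = type.asStructType();
  StructLike structValue = (StructLike) value;
  generator.writeStartObject();
  for (int pos = 0; pos < structType.fields().size(); pos += 1) {
    Types.NestedField field = structType.fields().get(pos);
    Object fieldValue = structValue.get(pos, Object.class);
    if (fieldValue != null) {
      generator.writeFieldName(String.valueOf(field.fieldId()));
      toJson(field.type(), fieldValue, generator);
    }
  }
  generator.writeEndObject();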

private static String defaultValueParseAndUnParseRoundTrip(Type type, JsonNode defaultValue)
throws JsonProcessingException {
Object javaDefaultValue = DefaultValueParser.parseDefaultFromJson(type, defaultValue);
String jsonDefaultValue = JsonUtil.mapper()
Contributor:

The parser should produce and accept strings, rather than doing it here in tests.

Contributor Author:

Got it, makes sense, refactored.

{Types.StructType.of(
required(1, "f1", Types.IntegerType.get(), "doc"),
optional(2, "f2", Types.StringType.get(), "doc")),
stringToJsonNode("{\"1\": 1, \"2\": \"bar\"}")}
Contributor:

Can you add test cases for nested types? One of each (list, map, struct) that contains a struct would be good.

Contributor Author:

Added.
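
For example, a list-of-structs case can be expressed in the same parameter style (a sketch; field IDs and values are illustrative):

  {Types.ListType.ofOptional(1, Types.StructType.of(
      required(2, "f1", Types.IntegerType.get()),
      optional(3, "f2", Types.StringType.get()))),
   stringToJsonNode("[{\"2\": 1, \"3\": \"bar\"}, {\"2\": 2, \"3\": \"baz\"}]")}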

import static org.apache.iceberg.types.Types.NestedField.required;

@RunWith(Parameterized.class)
public class TestDefaultValuesParsingAndUnParsing {
Contributor:

I think this should also have a few tests for cases that are caught above, like maps with different length key and value lists, binary and fixed values that are not the right length, UUID values that are not actually UUIDs, etc.

Contributor Author:

added
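
One of the negative cases might look like this (a sketch in plain JUnit 4; helper and parser names follow the test class above and are otherwise assumptions):

  @Test
  public void testMapDefaultWithMismatchedKeysAndValues() throws Exception {
    Types.MapType mapType = Types.MapType.ofRequired(1, 2, Types.IntegerType.get(), Types.StringType.get());
    JsonNode defaultValue = stringToJsonNode("{\"keys\": [1, 2], \"values\": [\"a\"]}");
    Assert.assertThrows(IllegalArgumentException.class,
        () -> DefaultValueParser.fromJson(mapType, defaultValue));
  }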

@rzhang10 force-pushed the default_value_parse_unparse branch from 368a321 to 16e9559 on August 1, 2022 at 21:16
@rzhang10 (Contributor Author) commented Aug 1, 2022:

@rdblue I've addressed the comments, rebased on master, and ran spotlessApply. Could you review again?

@rdblue rdblue merged commit 3a9e0a6 into apache:master Aug 2, 2022
@rdblue (Contributor) commented Aug 2, 2022:

Thanks, @rzhang10! The latest changes look good. I merged this.

@shiyancao:

Hi @rdblue / @rzhang10, are there more PRs to be developed before we can support default values in Iceberg?

I read PR 4732 and it seems that these items are still pending, but I just want to double-check and confirm:

Add the JSON value parser
Add as much as possible to Parquet, Avro, and ORC readers, like being able to read with a fake map of default values.

Also, is there a place where I can get a holistic view of the full effort? It seems issue 2039 was the one, but it has not been updated.

@rzhang10 (Contributor Author) commented Sep 8, 2022:

Hi @shiyancao, yes, more PRs are underway to support reading default values in engines across the different formats (Avro/ORC/Parquet); we will start by implementing support for Spark first.

abmo-x pushed a commit to abmo-x/iceberg that referenced this pull request Oct 21, 2022
zhongyujiang pushed a commit to zhongyujiang/iceberg that referenced this pull request Apr 16, 2025