API: Add default value core api and schema serialization #4525

Conversation
build.gradle
Outdated
project(':iceberg-api') {
  dependencies {
    implementation project(path: ':iceberg-bundled-guava', configuration: 'shadow')
    implementation "com.fasterxml.jackson.core:jackson-databind"
Why is an additional Jackson library needed? I think we should be able to add default value serialization just like the other parsers.
}
if (field.writeDefaultValue() != null) {
  generator.writeFieldName(WRITE_DEFAULT);
  generator.writeRawValue(field.writeDefaultValue().toString());
Instead of using toString that produces JSON, I think that this should expose a ValueParser that converts values from the internal Iceberg representation into JSON values.
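As a rough illustration of that suggestion, here is a minimal stdlib-only sketch of emitting JSON literals from in-memory values rather than relying on toString; the class name, method name, and the set of handled types are hypothetical, and string escaping is omitted:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the suggested direction: convert an internal
// in-memory default value into a JSON literal, instead of calling
// Object.toString(). Type coverage here is illustrative only.
public class DefaultValueToJson {
  public static String toJson(Object value) {
    if (value instanceof Boolean || value instanceof Integer
        || value instanceof Long || value instanceof Double) {
      // JSON boolean and number literals match the Java toString form
      return value.toString();
    } else if (value instanceof CharSequence) {
      // JSON strings must be quoted (escaping omitted in this sketch)
      return "\"" + value + "\"";
    } else if (value instanceof ByteBuffer) {
      // binary/fixed defaults rendered as a hex string, e.g. "0x111f"
      ByteBuffer buf = ((ByteBuffer) value).duplicate();
      StringBuilder sb = new StringBuilder("\"0x");
      while (buf.hasRemaining()) {
        sb.append(String.format("%02x", buf.get() & 0xff));
      }
      return sb.append('"').toString();
    }
    throw new UnsupportedOperationException("Unsupported default: " + value);
  }
}
```

A real ValueParser would dispatch on the Iceberg Type rather than on the Java class, but the shape of the conversion is the same.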
{Types.DoubleType.get(), parseJsonStringToJsonNode("123.456")},
{Types.DateType.get(), parseJsonStringToJsonNode("3650")},
{Types.TimeType.get(), parseJsonStringToJsonNode("36000000000")},
{Types.TimestampType.withoutZone(), parseJsonStringToJsonNode("1649374911000000")},
Looking at this again, I think that date/time fields should be converted to (strict) ISO-8601 representations. That way they are readable in JSON.
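For illustration, the internal representations (days from the Unix epoch for dates, microseconds for times and timestamps) can be rendered as strict ISO-8601 strings with java.time; the class and method names below are hypothetical:

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.ZoneOffset;

// Sketch: convert Iceberg's internal date/time values (int days and
// long microseconds from the Unix epoch) into ISO-8601 strings.
public class IsoDefaults {
  public static String isoDate(int daysFromEpoch) {
    // LocalDate.toString() is the strict ISO-8601 date form, yyyy-MM-dd
    return LocalDate.ofEpochDay(daysFromEpoch).toString();
  }

  public static String isoTime(long microsFromMidnight) {
    return LocalTime.ofNanoOfDay(microsFromMidnight * 1000).toString();
  }

  public static String isoTimestamp(long microsFromEpoch) {
    return LocalDateTime.ofEpochSecond(
        Math.floorDiv(microsFromEpoch, 1_000_000L),
        (int) Math.floorMod(microsFromEpoch, 1_000_000L) * 1000,
        ZoneOffset.UTC).toString();
  }
}
```

With this, a date default of 3650 would serialize as "1979-12-30" rather than a bare day count.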
}

@SuppressWarnings("checkstyle:CyclomaticComplexity")
private static boolean isValidDefault(Type type, JsonNode defaultValue) {
Iceberg uses visitors to separate the logic for schema traversal from the logic for individual decisions. Can you please use a visitor?
I feel using a SchemaVisitor here would be overkill, since this logic only validates a NestedField. It is technically only part of the schema, and even for a complex type we only recurse downward, not upward. Readability-wise, I also think this recursive implementation is clean enough: it clearly validates each kind of default value case by case.
To me, SchemaVisitor is mostly used for holistic schema manipulations/transformations.
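The recursive, case-by-case style under discussion can be sketched in plain Java as follows; the TypeId enum and the value shapes are simplified stand-ins for Iceberg's Type and default-value representation, not the actual API:

```java
import java.util.List;
import java.util.Map;

// Stdlib-only sketch of recursive default-value validation without a
// SchemaVisitor: switch on the type, and recurse downward for nested types.
public class DefaultValidation {
  public enum TypeId { BOOLEAN, INT, LONG, STRING, LIST, MAP }

  public static boolean isValidDefault(TypeId type, Object defaultValue) {
    switch (type) {
      case BOOLEAN:
        return defaultValue instanceof Boolean;
      case INT:
        return defaultValue instanceof Integer;
      case LONG:
        return defaultValue instanceof Long;
      case STRING:
        return defaultValue instanceof CharSequence;
      case LIST:
        // recurse into each element; the element type is fixed to INT
        // here only to keep the sketch short
        return defaultValue instanceof List
            && ((List<?>) defaultValue).stream()
                .allMatch(e -> isValidDefault(TypeId.INT, e));
      case MAP:
        return defaultValue instanceof Map;
      default:
        return false;
    }
  }
}
```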
return doc;
}

public JsonNode initialDefaultValue() {
This API should not use Jackson classes. Instead, it should use the standard Iceberg in-memory representation:
- boolean - Java primitive boolean
- int - Java primitive int
- long - Java primitive long
- float - Java primitive float
- double - Java primitive double
- date - Java primitive int
- time - Java primitive long
- timestamp/timestamptz - Java primitive long
- string - Java CharSequence
- uuid - Java UUID
- binary - Java ByteBuffer
- fixed(L) - Java ByteBuffer
- decimal(P, S) - Java BigDecimal
- list - Java Collection
- struct - StructLike
- map - Java Map
- list - Java Collection
I think using Java List is more correct, as List preserves element order.
- struct - StructLike
I see this comment:
/**
* Interface for accessing data by position in a schema.
* <p>
* This interface supports accessing data in top-level fields, not in nested fields.
*/
in that interface. Also, the API it exposes accesses fields by position, which doesn't suit our case of accessing fields by field id.
I think using a native Java map would be better.
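For illustration, a struct default represented as a plain Java Map keyed by field id, with a nested list default as java.util.List, could look like this; the field ids and values are invented for the example:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a struct default value as Map<Integer, Object>
// keyed by field id, rather than a StructLike accessed by position.
public class StructDefaultExample {
  public static Map<Integer, Object> structDefault() {
    // field 1: an int default; field 2: a list default
    return Map.<Integer, Object>of(1, 100, 2, List.of("a", "b"));
  }
}
```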
return Objects.hash(NestedField.class, id, isOptional, name, type);
}

private static JsonNode validateDefault(String name, Type type, JsonNode defaultValue) {
I think this logic should be moved to a ValueParser with toJson and fromJson methods, like the others. We don't want to leak any Jackson classes in the API.
I see that your major concern in the reviews is the use of Jackson in the iceberg-api module. Let me put my thoughts here in one place to answer all the questions above. The core problem is whether we should represent the default value in memory as a Jackson JsonNode (the approach I chose) or as plain Java objects.

If we take the Java primitive approach, I think my current code could be adapted this way: keep the current JSON validation logic, then add a conversion from the validated JSON into the in-memory Java representation. Let me know your thoughts and I'm open to further discussions too.
@rzhang10, it's okay to add the library to the API module if that's what is needed, although I'd like to see if we actually need it there or if we can convert values to/from JSON in core. For the in-memory representation, we do need to use the common representation and not JsonNode. We also cannot leak Jackson classes in the API module.
return new Literals.DecimalLiteral(value);
}

static Literal<Integer> ofDateLiteral(int value) {
I don't think you should need any new literal constructors. You can always convert using the existing to(type) conversion with the target type.
See my comment below.
return (Literal<T>) this;
} else if (type.typeId() == Type.TypeID.STRING) {
  return (Literal<T>) new StringLiteral(LocalTime.ofNanoOfDay(value() * 1000)
      .format(DateTimeFormatter.ISO_LOCAL_TIME));
I don't think this conversion should be done using literals. Instead, use the methods in DateTimeUtil and add any new methods there. This shouldn't require changes to the public API in expressions.
Yeah, previously you pointed me to the Literals code, so I thought I might be able to reuse the existing literal conversion logic there.
My changes basically add conversions from the three date/time-related types to string literals. Since the existing string-literal code already converts to those date/time types, I thought adding the reverse direction would make the conversions symmetric and enable round-trip conversions.
WDYT?
I don't think this conversion should be done using literals. You can see how we're doing it here, but there is no need to change the Literals API. And there is certainly no need to add extra conversions to Literals. These are purposely limited.
private NestedField(boolean isOptional, int id, String name, Type type, String doc) {
private NestedField(boolean isOptional, int id, String name, Type type, String doc,
    Object initialDefault, Object writeDefault) {
Style: in Iceberg, all argument lines should start at the same column, either aligned after the method declaration (with boolean isOptional, ...) or all at the same continuation indent level. This mixes the two.
@rzhang10, please don't resolve threads that are not updated yet.
return doc;
}

public Object initialDefaultValue() {
Do we need Value here? It seems redundant. Maybe just initialDefault and writeDefault.
import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

public class DefaultValueParser {
I think this class could be in a separate PR to start with. The representation and tests are fairly well-defined. Can you separate it out?
The reason I put them together is that the default value serialization and deserialization depend on the APIs for accessing the default value stored in the schema. So in order for the jsonRoundTrip test to work, it must depend on the methods introduced in the API.
Do you think we can first create a standalone PR with only the API changes (that PR should be minimal and fast to approve and merge), and then I can rebase and submit the subsequent PRs? (The Spark reader implementation PRs will also depend on the new API methods.)
The reason I'm asking is that even if I split all my later PRs, those PRs won't compile until the API changes are checked in, unless I include the same API changes in each split PR.
Anyway, I have created a minimal API-change-only PR: #4732, which takes into account all of your style comments above. I think that PR should be straightforward to merge, and I plan to submit the subsequent split PRs by rebasing after it is merged.
switch (type.typeId()) {
  case BOOLEAN:
    return jsonNode.booleanValue();
I think this should use the same pattern and checks that are used in other JSON parsers. Those use JsonUtil and validate the type. For example:

public static int getInt(String property, JsonNode node) {
  Preconditions.checkArgument(node.has(property), "Cannot parse missing int %s", property);
  JsonNode pNode = node.get(property);
  Preconditions.checkArgument(pNode != null && !pNode.isNull() && pNode.isNumber(),
      "Cannot parse %s to an integer value: %s", property, pNode);
  return pNode.asInt();
}

That checks that the node actually is an integer. We want to be similarly strict here.
There's another dedicated method, isValidDefault, that already does this validation, and it should always be called before parseDefaultFromJson, which does the actual parsing.
Let's match the approach that the other parsers take so we have consistency in the code.
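The strict-accessor pattern being recommended can be sketched with stdlib types as follows; the class name and the Map-based stand-in for JsonNode are illustrative assumptions, not the JsonUtil API itself:

```java
import java.util.Map;

// Sketch of the strict-accessor pattern: validate presence and type at the
// point of extraction, instead of relying on a separate validation method
// that callers might forget to invoke first.
public class StrictGet {
  public static int getInt(String property, Map<String, ?> node) {
    if (!node.containsKey(property)) {
      throw new IllegalArgumentException("Cannot parse missing int " + property);
    }
    Object value = node.get(property);
    if (!(value instanceof Integer)) {
      throw new IllegalArgumentException(
          "Cannot parse " + property + " to an integer value: " + value);
    }
    return (Integer) value;
  }
}
```

The key design point is that parsing and validation happen in one call, so a value can never be extracted without being checked.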
MAPPER = new ObjectMapper(FACTORY);
SimpleModule customModule = new SimpleModule();
customModule.addSerializer(ByteBuffer.class, new HexStringCustomByteBufferSerializer());
MAPPER.registerModule(customModule);
The majority of the Iceberg parsers don't use ObjectMapper. Why is it needed here?
I use this mainly to serialize the in-memory fixed and binary defaults, which are represented as Java ByteBuffer, into our custom JSON representation that looks like 0x111f; you can see the custom serialization logic in HexStringCustomByteBufferSerializer.
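For comparison, the 0x111f hex form can be produced and parsed with a couple of plain helpers, without registering a Jackson serializer module; the class and method names below are hypothetical:

```java
import java.nio.ByteBuffer;

// Sketch: round-trip a ByteBuffer to/from the "0x..." hex string form
// using only stdlib code, no ObjectMapper or custom Jackson serializer.
public class HexBytes {
  public static String toHex(ByteBuffer buffer) {
    ByteBuffer buf = buffer.duplicate(); // don't disturb the caller's position
    StringBuilder sb = new StringBuilder("0x");
    while (buf.hasRemaining()) {
      sb.append(String.format("%02x", buf.get() & 0xff));
    }
    return sb.toString();
  }

  public static ByteBuffer fromHex(String hex) {
    String digits = hex.substring(2); // strip the "0x" prefix
    byte[] bytes = new byte[digits.length() / 2];
    for (int i = 0; i < bytes.length; i++) {
      bytes[i] = (byte) Integer.parseInt(digits.substring(2 * i, 2 * i + 2), 16);
    }
    return ByteBuffer.wrap(bytes);
  }
}
```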
return defaultValue;
}

@SuppressWarnings("checkstyle:CyclomaticComplexity")
We generally don't suppress this warning unless we are trying to avoid major code changes. Since this is all new, it is best to follow the recommendation.
ObjectMapper mapper = new ObjectMapper();
return mapper.readTree(json);
} catch (JsonProcessingException e) {
  System.out.println("Failed to parse: " + json + "; reason: " + e.getMessage());
Please remove print statements.
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.
This is the first PR to implement the spec PR #4301.

It augments the NestedField API with default values, adds logic to SchemaParser for serializing default values with the schema, and implements the logic to deserialize them back into the in-memory representation in NestedField. The default value is represented in memory as a vanilla Java Object and is validated/parsed/converted from the user-facing JSON representation via utilities in DefaultValueParser. The JSON representation is also what gets serialized to disk along with the schema; it aligns with Appendix C in spec #4301.

This PR will be the basis of later PRs to support default value read/write semantics for each data type and compute engine combination.
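For illustration, a schema field serialized with defaults might look roughly like the following; the exact property names are defined by spec #4301, so treat the "initial-default" and "write-default" keys here as an assumption:

```json
{
  "id": 1,
  "name": "data",
  "required": false,
  "type": "string",
  "initial-default": "hello",
  "write-default": "hello"
}
```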
Please take a look @rdblue, thank you!