Skip to content

Conversation

@kunal0829
Copy link

@kunal0829 kunal0829 commented Mar 10, 2022

In this PR, the Iceberg events are now serializable into JSON by introducing an EventParser along with the addition of ExpressionParser.

kunal0829 and others added 15 commits March 10, 2022 13:51
* Event Parser and Logging

Added Expression parser in event parser and adding error logging for notify.

* Removed unused imports

* Update scan test logic and SNS error message

Combine calls for scan event logic in test.

Add topic name in error message for SNS topic.

* Remove unused import

Did not remove Tablescan import after changing scan to one inline operation

* Fix error throwing
Co-authored-by: Kunal Bhargava <kunbharg@amazon.com>
* Added Event Types in Json

* Changed Reflection Initialization

* Changed Event Type Input in Json

* Fixed Listener Initialization
* Added Event Types in Json

* Changed Reflection Initialization

* Changed Event Type Input in Json

* Fixed Listener Initialization

* Added Partial fromJson in ExpressionParser

Still missing Literal serialization.

* Added Partial fromJson in ExpressionParser

Still missing Literal serialization.

* Added Literal Serialization

* Fixed DecimalLiteral Serialization

* Changed to Iceberg Private Classes (Original)
@jackye1995
Copy link
Contributor

Can we limit this PR to only event parsers? The purpose is to breakdown the draft, not republish it

@kunal0829 kunal0829 changed the title AWS: Add Support for SNS and SQS Listeners Core: EventParser and ExpressionParser Mar 10, 2022
@kunal0829 kunal0829 changed the title Core: EventParser and ExpressionParser Core: add parsers for events and unbounded expressions Mar 10, 2022
}
}

@SuppressWarnings("CyclomaticComplexity")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not needed?

}
}

public static Predicate<?, ?> fromJsonToPredicate(JsonNode json, String predicateType) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please keep all methods private except for the main one for Expression

public static Predicate<?, ?> fromJsonToPredicate(JsonNode json, String predicateType) {
if (UNBOUNDED_PREDICATE.equals(predicateType)) {
return fromJsonUnboundPredicate(json);
} else if (BOUNDED_LITERAL_PREDICATE.equals(predicateType)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are only targeting unbounded, we can just use a generic else block saying we don't support bounded expression yet


public static void toJson(And expression, JsonGenerator generator) throws IOException {
generator.writeStartObject();
generator.writeStringField(TYPE, AND);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here can we reuse the word "operation"? Because phrases like "and" and "or" are operations

generator.writeStringField(OPERATION, predicate.op().name().toLowerCase());
generator.writeFieldName(TERM);
toJson(predicate.term(), generator);
if (!(ONE_INPUTS.contains(predicate.op()))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant brackets after !

generator.writeEndObject();
}

public static void toJson(List<Literal<?>> literals, JsonGenerator generator) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method seems unnecessary, can be combined to parent method, also can use enhanced for loop

private static final String LEFT_OPERAND = "left-operand";
private static final String RIGHT_OPERAND = "right-operand";
private static final String OPERAND = "operand";
private static final String AND = "and";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these can just use operation.name().toLowerCase(), no need to define extra variables

return True.INSTANCE;
} else if (FALSE.equals(expressionType)) {
return False.INSTANCE;
} else if (PREDICATE_TYPES.contains(expressionType)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need a set of predicate types, we just need to have a single else block, delegating all the rest of the conditions to fromJsonToPredicate(json, expressionType);


public class EventParser {
private static final String EVENT_TYPE = "event-type";

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra newline

}
}

public static ScanEvent fromJsonToScanEvent(JsonNode json) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep methods below private

}

@Test
public void testParserBothWays() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "roundTrip" instead of "bothWays"

@rdblue
Copy link
Contributor

rdblue commented Mar 11, 2022

Looks like this is proposing JSON serialization. Can you write up the proposal for those formats? It is much easier to review the proposed formats that way, rather than looking at code that does it.

private static final String NOT = "not";
private static final String TRUE = "true";
private static final String FALSE = "false";
private static final String UNBOUNDED_PREDICATE = "unbounded-predicate";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The term is "unbound", meaning that the expression has not been bound to a particular incoming struct/row type, not "unbounded".

private static final String FALSE = "false";
private static final String UNBOUNDED_PREDICATE = "unbounded-predicate";
private static final String BOUNDED_LITERAL_PREDICATE = "bounded-literal-predicate";
private static final String BOUNDED_SET_PREDICATE = "bounded-set-predicate";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to serialize bound predicates?

}
}

public static void toJson(Expression expression, JsonGenerator generator) throws IOException {
Copy link
Contributor

@rdblue rdblue Mar 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iceberg separates the recursive logic to process a tree structure from the logic to do something by using visitors. I think that this should use an in-order expression visitor to produce a JSON representation.

import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.ExpressionParser;

public class EventParser {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of event is this parser for? It appears to accept Object in toJson, which is suspicious.

if (pretty) {
generator.useDefaultPrettyPrinter();
}
if (event instanceof ScanEvent) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We add newlines after control flow blocks and the following statement.

toJson((CreateSnapshotEvent) event, generator);
} else if (event instanceof IncrementalScanEvent) {
toJson((IncrementalScanEvent) event, generator);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the event is not handled, this will just fail?

Made Parser function private that aren't called outside the classes and changed Operation type to "operation" instead of "type"
@rdblue
Copy link
Contributor

rdblue commented Apr 10, 2022

@kunal0829, can you write up a doc that summarizes the proposed serialization formats?

@kunal0829
Copy link
Author

@rdblue Here is an example write up that explains the proposed format: https://docs.google.com/document/d/1CvOfsVAMezqQyyAwIn0qfew9EvlX-5PIvKGHPzlHldA/edit?usp=sharing

Please let me know if you have any questions about the proposed format

@rdblue
Copy link
Contributor

rdblue commented Apr 24, 2022

I've not been able to comment on the doc since it isn't opened up for discussion. I'll add a few of the comments here instead.

My main concern about the proposed JSON format is that it is needlessly verbose and relies heavily on the JVM classes rather than representing the expressions more directly. For example, it is scoped to only unbound expressions, but still uses "unbound-predicate" and "named-reference". It also uses "left-operand" and "right-operand". I think it would be better to use fewer words and make better use of "type".

Each node in the expression tree should be represented as a JSON object. Those are distinguished by the type field: that can have the following values: and, or, not, eq, not-eq, in, not-in, lt, lt-eq, gt, gt-eq, starts-with, not-starts-with, is-null, not-null, is-nan, not-nan, reference, transform, and literal.

Then there is some specialization by type:

  • and and or must have child expressions named left and right
  • not must have a child expression named child
  • All predicates must have a term expression that is either a transform or a reference
  • transform must have a term expression that is either a transform or a reference and a transform field that is a transform string ("identity", "bucket[16]", etc.)
  • Set predicates (in and not-in) must have a JSON array called values of literals
  • Comparison predicates (eq, etc.) must have a value that is a literal
  • reference must have a name that is a JSON string
  • literal must have a value field
  • Literal values must use the JSON value serialization defined in Docs: Default value support feature specification #4301 (structs, lists, and maps are not allowed)

I would also allow the reference to be replaced with a string field name, and a literal to be replaced with the value when nested.

I think this format is more direct:

# example 1: event_ts = 2022-04-24T11:42:12.578391 OR bucket(16, id) IN (0, 1)
{
  "type": "or",
  "left": {
    "type": "eq",
    "term": "event_ts", # equivalent to { "type": "reference", "name": "event_ts" }
    "value": "2022-04-24T11:42:12.578391" # equivalent to { "type": "literal", "value": "2022-04-24T11:42:12.578391" }
  }
  "right": {
    "type": "in"
    "term": {
      "type": "transform",
      "transform": "bucket[16]",
      "term": "id"
    },
    "values": [
      { "type": "literal", "value": 0 },
      1
    ]
  }
}
# example 2: true
{ "type": "literal", "value": true }

rdblue added a commit to rdblue/iceberg that referenced this pull request Aug 14, 2022
rdblue added a commit to rdblue/iceberg that referenced this pull request Aug 14, 2022
rdblue added a commit to rdblue/iceberg that referenced this pull request Aug 14, 2022
rdblue added a commit to rdblue/iceberg that referenced this pull request Aug 14, 2022
rdblue added a commit to rdblue/iceberg that referenced this pull request Aug 14, 2022
@rdblue rdblue closed this in #5602 Aug 22, 2022
@rdblue
Copy link
Contributor

rdblue commented Aug 22, 2022

@kunal0829 and @jackye1995, an expression JSON parser was just merged in #5602.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants