-
Notifications
You must be signed in to change notification settings - Fork 3k
Hive: Implement Deserializer for Hive writes #1854
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@marton-bod, @lcspinter: Could you please review? |
marton-bod
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally looks good, just a few minor questions
...n/java/org/apache/iceberg/mr/hive/serde/objectinspector/IcebergTimestampObjectInspector.java
Outdated
Show resolved
Hide resolved
|
@rdblue, @shardulm94: Could you please review if you have some time? This is a part of #1407 (Hive: HiveIcebergOutputFormat first implementation for handling Hive inserts into unpartitioned Iceberg tables) Thanks! |
| Object value(Object object); | ||
| } | ||
|
|
||
| private static FieldDeserializer deserializer(Type type, ObjectInspector fieldInspector) throws SerDeException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of the other modules use the visitor pattern to traverse a type. That keeps the logic for traversing a schema in just one place so you don't need to mix it in with your domain-specific code.
It looks like this method is called from with the StructDeserializer constructor, so there is a recursive traversal of the schema that goes back and forth between object constructors and this method. That's a bit hard to follow, so I think it would be simpler if you took the visitor approach.
A GenericAvroReader is similar to what you're building here, so take a look at that as an example: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/avro/GenericAvroReader.java
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I moved to the Visitor pattern. The the code is quite nice, and compact 😄
Hive struct names are not matched to the Iceberg struct name fields in case of the main struct which contains the result columns of a query. In this case we have the following col names: 0:col1, 1:col2 and they should be matched by the positions of the fields and not by the names.
Handling this will add more complexity. I propose to handle this in the next PR. What do you think?
.../main/java/org/apache/iceberg/mr/hive/serde/objectinspector/IcebergWriteObjectInspector.java
Outdated
Show resolved
Hide resolved
mr/src/test/java/org/apache/iceberg/mr/hive/HiveIcebergTestUtils.java
Outdated
Show resolved
Hide resolved
| PrimitiveObjectInspectorFactory.writableLongObjectInspector, | ||
| PrimitiveObjectInspectorFactory.writableStringObjectInspector | ||
| )); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: double whitespace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we use 2 spaces for Continuation indent as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Continuation indents are 2 indents, which are 4 spaces. What you have here looks correct to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then I am not sure what was the comment about. 😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it was an extra newline.
mr/src/test/java/org/apache/iceberg/mr/hive/TestDeserializer.java
Outdated
Show resolved
Hide resolved
mr/src/test/java/org/apache/iceberg/mr/hive/TestDeserializer.java
Outdated
Show resolved
Hide resolved
mr/src/test/java/org/apache/iceberg/mr/hive/TestDeserializer.java
Outdated
Show resolved
Hide resolved
|
Thanks, @pvary! This looks great so far. I had a few comments, but it looks like a good improvement! |
| return null; | ||
| } | ||
|
|
||
| List<Object> result = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: prefer Lists.newArrayList().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember on some other PR got a review from someone else to avoid unnecessary guava uses.
I am find with both, but I think we should stick to one or the other. Shall it be Lists.newArrayList then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to bring in any additional Guava classes if possible, but using the ones that are already there is a good thing. For this case, we can easily replace map class implementations with an import change later.
| } | ||
|
|
||
| private static class PartnerObjectInspectorByNameAccessors | ||
| implements SchemaWithPartnerVisitor.PartnerAccessors<ObjectInspector> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: This could probably be a singleton.
| * wrapper around the ObjectInspector. This wrapper uses the Iceberg schema column names instead of the Hive column | ||
| * names for {@link #getStructFieldRef(String) getStructFieldRef} | ||
| */ | ||
| private static class FixNameMappingObjectInspector extends StructObjectInspector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't all of the structs need to be wrapped by this?
|
I don't have any major concerns about this, so I'm going to merge it to unblock the other work. Thanks @pvary! It would be nice to figure out a better way to traverse the object inspector tree that doesn't require the |
|
Thanks @rdblue for the review and the merge! In Hive you always need to provide the name-value pair for structs. AFAIK the only exception is the Schema level struct.
So this is why I have kept the current one which has its own drawbacks, so I am happy to go for a more general solution if we find one. |
Implements the Deserializer for Hive writes, and creates the corresponding test suite