Conversation
|
This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions. |
|
This pull request/issue is no longer marked as stale. |
9dd87ad to
d144d8e
Compare
|
This seems like a reasonable addition 👍, could you also add support to the avro streaming input format that was added by #11040? It does look like it might be hard to make the coverage bot happy without testing a lot of different union types, but maybe that is a good thing in this case if not too tedious. |
|
@clintropolis sure, I will get around to it sometime this week. |
1135107 to
a51214a
Compare
|
@clintropolis I rebased this and added support in the new InputFormat and extended the tests to cover all the union member types, ready for another review. |
|
Looks like only the spell check failed. Added to the spelling file and also tweaked the docs a little bit to correct a missing type and be more specific about how named types are handled. |
a5b5c04 to
d0031f4
Compare
|
@clintropolis would you be able to take another look at this? I think it's ready to merge. |
clintropolis
left a comment
There was a problem hiding this comment.
The changes look good to me, the only thing I'm concerned about is whether or not explodeUnions is the best name for this, mainly the explode part of the name, which in my mind i associate with something like #8698, which was an attempt to add a feature that could take an input row that had an array and produce multiple rows with the scalar values of the input array.
Perhaps convertUnions or extractUnions would be a better name? Idk, naming is hard 😓
|
I don't feel strongly about the name. I'm happy with |
d0031f4 to
ab1b27d
Compare
clintropolis
left a comment
There was a problem hiding this comment.
sorry I missed this before, but we probably should also update https://github.com/apache/druid/blob/master/docs/ingestion/data-formats.md#avro-stream and the other tables that describe the input formats/parsers to include this new parameter
There was a problem hiding this comment.
super nit: i think it's -> its
There was a problem hiding this comment.
nit: still using explode terminology
There was a problem hiding this comment.
Hmm, I looked a lot closer at this than I did on a previous pass, and I think the old docs were sort of wrong. Using the example someMultiMemberUnion type in this PR, I can still use a flatten spec to extract the values of any type with the existing code, apparently even including for record types, where the extraction path is $.someMultiMemberUnion.subString (instead of $.someMultiMemberUnion.UnionSubRecord.subString as in the mode added in this PR).
As such, with my better understanding I think it makes sense to instead call this new property extractUnionsByType or something similar, and clarify that this new mode requires using a flatten spec to extract the values, but with the benefit that you can selectively extract values of only a certain type so that they can be mapped to separate Druid columns or whatever. I also don't think it necessarily makes sense to refer to the other mode as legacy, since I guess it still has a use if the union is composed mainly of primitive types and all are able to be coerced into a common Druid type, or if it is a simple union type of the legacy form, which the new mode does not effect (since the isUnion code checks for more than 1 non-null type).
Sorry I didn't look closer into this previously and for the review churn, my bad.
Also because the new mode needs to be link to the flatten spec docs, maybe it makes sense to just move the unions description entirely into the complex types section, which also seems to mirror the Avro specification docs https://avro.apache.org/docs/current/spec.html#schema_complex
There was a problem hiding this comment.
No problem, I think you are right and I will make these changes next week.
ab1b27d to
3ebd657
Compare
clintropolis
left a comment
There was a problem hiding this comment.
there is a spelling error causing a CI failure, https://travis-ci.com/github/apache/druid/jobs/512959568#L759, but other than that, lgtm 👍
* Avro union support * Document new union support * Add support for AvroStreamInputFormat and fix checkstyle * Extend multi-member union test schema and format * Some additional docs and add Enums to spelling * Rename explodeUnions -> extractUnions * explode -> extract * ByType * Correct spelling error
Adds back test coverage for Avro flattener that was mistakenly removed in apache#10505. Recfactored the tests a bit too.
* Add back missing unit test coverage in AvroFlattenerMakerTest Adds back test coverage for Avro flattener that was mistakenly removed in #10505. Recfactored the tests a bit too. * resolve checkstyle warnings
Fixes #10493
Description
Implements better support for Avro unions in the Avro extensions.
Currently when ingesting data that contains unions unless the union is always the same value it will return mixed type results.
This PR addresses the problem by exploding union fields into maps keyed by the union member type or type name in the case of named types (enums, fixed, records).
The method was chosen as it's similar to what is done on other systems that support ingesting Avro data, such as Google BigQuery which details their Avro compatibility here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions
This PR has:
Key changed/added classes in this PR
AvroFlattenerMakerGenericAvroJsonProvider