Avro union support by josephglanville · Pull Request #10505 · apache/druid

josephglanville · 2020-10-10T23:45:27Z

Description

Implements better support for Avro unions in the Avro extensions.
Currently, when ingesting data that contains unions, the results are of mixed types unless the union always resolves to the same member type.
This PR addresses the problem by exploding union fields into maps keyed by the union member type, or by the type name in the case of named types (enums, fixed, records).
This method was chosen because it is similar to what other systems that ingest Avro data do, such as Google BigQuery, which details its Avro compatibility here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions
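
To make the conversion concrete, here is a minimal sketch in Python (not the Java code in this PR). It assumes union member schemas are given in Avro's JSON form. The helper names `member_key` and `wrap_union_value` are hypothetical, and keying unnamed complex types (arrays, maps) by their type name is an assumption rather than something stated above.

```python
# Illustrative sketch only; the PR itself implements this in Java.
NAMED_TYPES = {"record", "enum", "fixed"}


def member_key(member_schema):
    """Return the map key for one union member schema (Avro JSON form)."""
    if isinstance(member_schema, str):
        # Primitive such as "int" or "string": keyed by the type itself.
        return member_schema
    if member_schema["type"] in NAMED_TYPES:
        # Named types are keyed by their declared name.
        return member_schema["name"]
    # Assumption: unnamed complex types ("array", "map") use the type name.
    return member_schema["type"]


def wrap_union_value(member_schema, value):
    """Wrap a union value in a single-entry map keyed by its member type."""
    return {member_key(member_schema): value}


sub_record = {"type": "record", "name": "UnionSubRecord", "fields": []}
print(wrap_union_value("int", 1))
print(wrap_union_value(sub_record, {"subString": "abc"}))
```

With this shape, two rows whose union holds different member types no longer collide in one mixed-type field; each value sits under its own key.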

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
added integration tests.
been tested in a test Druid cluster.

Key changed/added classes in this PR

AvroFlattenerMaker
GenericAvroJsonProvider

stale · 2020-12-12T21:11:39Z

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale · 2021-03-22T10:53:09Z

This pull request/issue is no longer marked as stale.

clintropolis · 2021-04-13T08:05:02Z

This seems like a reasonable addition 👍, could you also add support to the avro streaming input format that was added by #11040?

It does look like it might be hard to make the coverage bot happy without testing a lot of different union types, but maybe that is a good thing in this case if not too tedious.

josephglanville · 2021-04-13T23:31:46Z

@clintropolis sure, I will get around to it sometime this week.

josephglanville · 2021-04-14T04:51:02Z

@clintropolis I rebased this and added support in the new InputFormat and extended the tests to cover all the union member types, ready for another review.

josephglanville · 2021-04-14T23:29:00Z

Looks like only the spell check failed. Added to the spelling file and also tweaked the docs a little bit to correct a missing type and be more specific about how named types are handled.

josephglanville · 2021-05-08T23:10:24Z

@clintropolis would you be able to take another look at this? I think it's ready to merge.

clintropolis

The changes look good to me. My only concern is whether explodeUnions is the best name for this, mainly the explode part, which in my mind I associate with something like #8698, an attempt to add a feature that could take an input row containing an array and produce multiple rows with the scalar values of that array.

Perhaps convertUnions or extractUnions would be a better name? Idk, naming is hard 😓

josephglanville · 2021-06-03T07:05:34Z

I don't feel strongly about the name. I'm happy with extractUnions. Will make the changes later today.

clintropolis

sorry I missed this before, but we probably should also update https://github.com/apache/druid/blob/master/docs/ingestion/data-formats.md#avro-stream and the other tables that describe the input formats/parsers to include this new parameter

clintropolis · 2021-06-10T03:27:22Z

super nit: i think it's -> its

clintropolis · 2021-06-10T03:27:49Z

nit: still using explode terminology

clintropolis · 2021-06-10T04:16:12Z

Hmm, I looked a lot closer at this than I did on a previous pass, and I think the old docs were sort of wrong. Using the example someMultiMemberUnion type in this PR, I can still use a flatten spec to extract the values of any type with the existing code, apparently even for record types, where the extraction path is $.someMultiMemberUnion.subString (instead of $.someMultiMemberUnion.UnionSubRecord.subString as in the mode added in this PR).
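
The difference between the two extraction paths can be sketched as follows. This is a rough Python illustration, not Druid code: `resolve` is a hypothetical toy stand-in for JSONPath evaluation that only handles dotted keys, and the two row shapes are assumptions about what each mode would present to the flattener.

```python
def resolve(path, row):
    """Resolve a simple dotted path such as "$.a.b" against nested dicts."""
    if path.startswith("$."):
        path = path[2:]
    current = row
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None  # Path does not exist in this row shape.
        current = current[key]
    return current


# Assumed shape in the existing mode: the record sits directly under the field.
default_row = {"someMultiMemberUnion": {"subString": "abc"}}

# Assumed shape in the new mode: the record is nested under its type name.
by_type_row = {"someMultiMemberUnion": {"UnionSubRecord": {"subString": "abc"}}}

print(resolve("$.someMultiMemberUnion.subString", default_row))
print(resolve("$.someMultiMemberUnion.UnionSubRecord.subString", by_type_row))
```

In this sketch the shorter path finds nothing in the by-type row, which is why the new mode needs a flatten spec that names the member type.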

As such, with my better understanding, I think it makes sense to instead call this new property extractUnionsByType or something similar, and clarify that this new mode requires using a flatten spec to extract the values, but with the benefit that you can selectively extract values of only a certain type so that they can be mapped to separate Druid columns or whatever. I also don't think it necessarily makes sense to refer to the other mode as legacy, since I guess it still has a use if the union is composed mainly of primitive types and all are able to be coerced into a common Druid type, or if it is a simple union type of the legacy form, which the new mode does not affect (since the isUnion code checks for more than 1 non-null type).
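
The "more than 1 non-null type" check mentioned above could look roughly like this. It is a Python sketch under the assumption that schemas are in Avro's JSON form, where a union is written as a list; `is_multi_member_union` is a hypothetical name and this is not the PR's actual isUnion code.

```python
def is_null_schema(member):
    """True for the null type, written either as "null" or {"type": "null"}."""
    if isinstance(member, dict):
        return member.get("type") == "null"
    return member == "null"


def is_multi_member_union(schema):
    """True if schema is a union with more than one non-null member."""
    if not isinstance(schema, list):
        return False  # In Avro JSON, only a list denotes a union.
    non_null = [m for m in schema if not is_null_schema(m)]
    return len(non_null) > 1


print(is_multi_member_union(["null", "string"]))
print(is_multi_member_union(["null", "int", "string"]))
```

Under this check a simple nullable field such as `["null", "string"]` is left alone, and only unions with several real member types would be converted.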

Sorry I didn't look closer into this previously and for the review churn, my bad.

Also, because the new mode needs to link to the flatten spec docs, maybe it makes sense to just move the unions description entirely into the complex types section, which also seems to mirror the Avro specification docs https://avro.apache.org/docs/current/spec.html#schema_complex
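
As a rough picture of how the new mode and a flatten spec might fit together, the sketch below builds an input format as a Python dict and prints it as JSON. The property name `extractUnionsByType` follows the suggestion in this comment. The `avro_ocf` type and the `flattenSpec` layout are assumptions that should be checked against the data formats docs, and the output column names and the `int` union member are hypothetical.

```python
import json

# Hypothetical input format; verify property names against the Druid docs.
input_format = {
    "type": "avro_ocf",
    "extractUnionsByType": True,
    "flattenSpec": {
        "useFieldDiscovery": True,
        "fields": [
            {
                # Pull the string out of the record member of the union.
                "type": "path",
                "name": "unionSubString",
                "expr": "$.someMultiMemberUnion.UnionSubRecord.subString",
            },
            {
                # Pull out only the int member (assumed to exist in the union).
                "type": "path",
                "name": "unionInt",
                "expr": "$.someMultiMemberUnion.int",
            },
        ],
    },
}

print(json.dumps(input_format, indent=2))
```

Each path targets one member type, so values of different types can land in separate Druid columns.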

josephglanville

No problem, I think you are right and I will make these changes next week.

clintropolis

there is a spelling error causing a CI failure, https://travis-ci.com/github/apache/druid/jobs/512959568#L759, but other than that, lgtm 👍

* Avro union support
* Document new union support
* Add support for AvroStreamInputFormat and fix checkstyle
* Extend multi-member union test schema and format
* Some additional docs and add Enums to spelling
* Rename explodeUnions -> extractUnions
* explode -> extract
* ByType
* Correct spelling error


josephglanville marked this pull request as draft October 10, 2020 23:45

stale Bot added the stale label Dec 12, 2020

josephglanville changed the title ~~WIP: Avro union support~~ Avro union support Mar 22, 2021

stale Bot removed the stale label Mar 22, 2021

josephglanville force-pushed the jpg/avro-union-support branch from 9dd87ad to d144d8e Compare March 22, 2021 10:54

josephglanville marked this pull request as ready for review March 22, 2021 10:54

clintropolis added the Area - Ingestion label Mar 31, 2021

josephglanville force-pushed the jpg/avro-union-support branch from 1135107 to a51214a Compare April 14, 2021 01:10

josephglanville force-pushed the jpg/avro-union-support branch from a5b5c04 to d0031f4 Compare May 8, 2021 23:09

clintropolis reviewed Jun 3, 2021

josephglanville force-pushed the jpg/avro-union-support branch from d0031f4 to ab1b27d Compare June 3, 2021 12:26

clintropolis reviewed Jun 10, 2021

josephglanville added 8 commits June 11, 2021 06:07

Avro union support

0878f25

Document new union support

b1bbd4a

Add support for AvroStreamInputFormat and fix checkstyle

fe3d3a1

Extend multi-member union test schema and format

5335b8a

Some additional docs and add Enums to spelling

cde6041

Rename explodeUnions -> extractUnions

79c1fbf

explode -> extract

438722e

ByType

3ebd657

josephglanville force-pushed the jpg/avro-union-support branch from ab1b27d to 3ebd657 Compare June 11, 2021 00:41

clintropolis approved these changes Jun 22, 2021

Correct spelling error

c17aa1f

clintropolis merged commit d5e8d4d into apache:master Jul 7, 2021

josephglanville deleted the jpg/avro-union-support branch July 15, 2021 07:19

zachjsh added a commit to zachjsh/druid that referenced this pull request Jul 15, 2021

Add back missing unit test coverage in AvroFlattenerMakerTest

63dc839

Adds back test coverage for Avro flattener that was mistakenly removed in apache#10505. Refactored the tests a bit too.

zachjsh mentioned this pull request Jul 15, 2021

Add back missing unit test coverage in AvroFlattenerMakerTest #11451

Merged

clintropolis added this to the 0.22.0 milestone Aug 12, 2021

clintropolis mentioned this pull request Sep 3, 2021

[Draft] 0.22.0 Release Notes #11657

Closed
