add avro stream input format #11040
Conversation
@clintropolis Hi, I implemented AvroStreamInputFormat as I mentioned last weekend. Can you review it and help me refine it?
clintropolis
left a comment
code LGTM 👍
This PR needs docs added to https://github.com/apache/druid/blob/master/docs/ingestion/data-formats.md I think before it is ready to go.
I think it should also be relatively easy to add an integration test for this, since we already have an integration test for the Parser implementation of Avro + Schema Registry. All that needs to be done is to create a new input_format directory in this location https://github.com/apache/druid/tree/master/integration-tests/src/test/resources/stream/data/avro_schema_registry with a new input_format.json template (using the InputFormat instead of the Parser). See the JSON directory for an example: https://github.com/apache/druid/tree/master/integration-tests/src/test/resources/stream/data/json. If this template is added, then I think it should be automatically picked up and run as part of the Kafka data format integration tests.
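For illustration only, such an `input_format.json` template might look roughly like the following sketch. The exact placeholder syntax and decoder fields are assumptions on my part; copy the structure from the existing JSON template in the linked directory rather than from this sketch:

```json
{
  "type": "avro_stream",
  "avroBytesDecoder": {
    "type": "schema_registry",
    "url": "http://schema-registry:8081"
  }
}
```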
```java
    final AvroStreamInputFormat that = (AvroStreamInputFormat) o;
    return Objects.equals(getFlattenSpec(), that.getFlattenSpec()) &&
           Objects.equals(avroBytesDecoder, that.avroBytesDecoder);
  }

  @Override
  public int hashCode()
  {
    return Objects.hash(getFlattenSpec(), avroBytesDecoder);
  }
```
equality/hashcode should probably consider binaryAsString for their computations
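As a sketch of the suggestion (a hypothetical stand-in class, not the actual Druid code): including `binaryAsString` in both methods means two formats that differ only in that flag compare as unequal, keeping `equals`/`hashCode` consistent with the format's full configuration.

```java
import java.util.Objects;

// Hypothetical sketch: equals/hashCode that also consider binaryAsString.
class FormatKeySketch
{
  final Object flattenSpec;
  final Object avroBytesDecoder;
  final boolean binaryAsString;

  FormatKeySketch(Object flattenSpec, Object avroBytesDecoder, boolean binaryAsString)
  {
    this.flattenSpec = flattenSpec;
    this.avroBytesDecoder = avroBytesDecoder;
    this.binaryAsString = binaryAsString;
  }

  @Override
  public boolean equals(Object o)
  {
    if (this == o) {
      return true;
    }
    if (!(o instanceof FormatKeySketch)) {
      return false;
    }
    final FormatKeySketch that = (FormatKeySketch) o;
    // binaryAsString participates in equality alongside the other two fields
    return binaryAsString == that.binaryAsString
           && Objects.equals(flattenSpec, that.flattenSpec)
           && Objects.equals(avroBytesDecoder, that.avroBytesDecoder);
  }

  @Override
  public int hashCode()
  {
    // binaryAsString is included so hashCode stays consistent with equals
    return Objects.hash(flattenSpec, avroBytesDecoder, binaryAsString);
  }
}
```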
```java
    return CloseableIterators.withEmptyBaggage(
        Iterators.singletonIterator(avroBytesDecoder.parse(ByteBuffer.wrap(IOUtils.toByteArray(source.open())
        ))));
```
nit: strange formatting (occasionally style bot doesn't pick stuff up)
Suggested change:

```java
    return CloseableIterators.withEmptyBaggage(
        Iterators.singletonIterator(avroBytesDecoder.parse(ByteBuffer.wrap(IOUtils.toByteArray(source.open()))))
    );
```
Sorry, I don't understand. Do you mean an integration test for the Avro stream is being developed, and when it is finished I can add a new JSON file for this test? Or do I need to create this integration test myself?
@clintropolis Documentation is done. Do I need to create an integration test? Are there any examples of it? I'm interested in it.
Ah sorry, let me try to explain a bit more. So in the case of the Avro inline schema and Avro schema registry, you should be able to just add the JSON files with the input_format.json templates.
| "type" : "schema_repo", | ||
| "subjectAndIdConverter" : { | ||
| "type" : "avro_1124", | ||
| "topic" : "${YOUR_TOPIC}" | ||
| }, | ||
| "schemaRepository" : { | ||
| "type" : "avro_1124_rest_client", | ||
| "url" : "${YOUR_SCHEMA_REPO_END_POINT}", | ||
| } |
I suggest we should switch to using 'inline' or 'schema-registry' as the example instead of 'schema_repo', which isn't used as frequently in practice as far as I know.
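For reference, a `schema_registry` decoder config is much shorter; as I read the Avro bytes decoder docs it looks roughly like this (the URL is a placeholder):

```json
{
  "type" : "schema_registry",
  "url" : "http://localhost:8081"
}
```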
| Field | Type | Description | Required |
|-------|------|-------------|----------|
| `flattenSpec` | JSON Object | Define a [`flattenSpec`](#flattenspec) to extract nested values from an Avro record. Note that only 'path' expressions are supported ('jq' is unavailable). | no (default will auto-discover 'root' level properties) |
| `avroBytesDecoder` | JSON Object | Specifies how to decode bytes to an Avro record. | yes |
| `binaryAsString` | Boolean | Specifies whether an Avro bytes column that is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) |
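Putting the table's fields together, a full `inputFormat` block might look like the following sketch (the record schema and flatten field are made-up examples; check the final docs for the exact shape):

```json
"inputFormat": {
  "type": "avro_stream",
  "avroBytesDecoder": {
    "type": "schema_inline",
    "schema": {
      "namespace": "org.apache.druid.example",
      "name": "SomeData",
      "type": "record",
      "fields": [
        { "name": "timestamp", "type": "long" },
        { "name": "someRecord", "type": { "type": "record", "name": "Sub",
          "fields": [ { "name": "subInt", "type": "int" } ] } }
      ]
    }
  },
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      { "type": "path", "name": "someRecordSubInt", "expr": "$.someRecord.subInt" }
    ]
  },
  "binaryAsString": false
}
```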
I think we should move https://github.com/apache/druid/blob/master/docs/ingestion/data-formats.md#avro-bytes-decoder (which currently lives with the 'parsers' documentation) up to this 'input formats' section, and have the parsers section link to the bytes decoder docs here.
I haven't quite determined what is going on yet, but it seems like there is some sort of serialization error that is causing the newly added schema-registry input format integration test to fail: https://travis-ci.com/github/apache/druid/jobs/496046398#L9197. The inline schema test is passing 👍
It seems like https://github.com/apache/druid/blob/master/extensions-core/avro-extensions/src/main/java/org/apache/druid/data/input/avro/SchemaRegistryBasedAvroBytesDecoder.java#L50 is missing getter methods annotated with `@JsonProperty`. Could you add serialization round trip tests for it?
I tested it last week and I know something is wrong with the schema registry decoder. I will fix it this weekend. Thanks for your advice; I will change the code for this exception.
@clintropolis Hi, I fixed this bug and the integration tests pass. I also added a unit test for this.
clintropolis
left a comment
thanks for fixing this up 👍
```java
  public AvroStreamInputFormat(
      @JsonProperty("flattenSpec") @Nullable JSONPathSpec flattenSpec,
      @JsonProperty("avroBytesDecoder") AvroBytesDecoder avroBytesDecoder,
      @JsonProperty("binaryAsString") @Nullable Boolean binaryAsString
```
Missing a `binaryAsString` getter annotated with `@JsonProperty`, I think.
clintropolis
left a comment
lgtm, thanks @bananaaggle 👍
Because parseSpec is deprecated, I developed AvroStreamInputFormat for the new interface, which supports stream ingestion of data encoded with Avro.
This PR has: