add avro stream input format #11040
Conversation
@clintropolis Hi, I implemented AvroStreamInputFormat as I mentioned last weekend. Can you review it and help me refine it?
clintropolis
left a comment
code LGTM 👍
This PR needs docs added to https://github.com/apache/druid/blob/master/docs/ingestion/data-formats.md I think before it is ready to go.
I think it should also be relatively easy to add an integration test for this, since we already have an integration test for the Parser implementation of Avro + Schema Registry. All that needs to be done is to create a new input_format directory in this location https://github.com/apache/druid/tree/master/integration-tests/src/test/resources/stream/data/avro_schema_registry with a new input_format.json template (using the InputFormat instead of the Parser). See the JSON directory for an example: https://github.com/apache/druid/tree/master/integration-tests/src/test/resources/stream/data/json. If this template is added, then I think it should be automatically picked up and run as part of the Kafka data format integration tests.
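For illustration only, such an `input_format.json` template might look roughly like the following sketch. The exact placeholder syntax and decoder fields are assumptions on my part; copy the structure from the existing JSON template in the linked directory rather than from this sketch:

```json
{
  "type": "avro_stream",
  "avroBytesDecoder": {
    "type": "schema_registry",
    "url": "http://schema-registry:8081"
  }
}
```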
```java
    final AvroStreamInputFormat that = (AvroStreamInputFormat) o;
    return Objects.equals(getFlattenSpec(), that.getFlattenSpec()) &&
           Objects.equals(avroBytesDecoder, that.avroBytesDecoder);
  }

  @Override
  public int hashCode()
  {
    return Objects.hash(getFlattenSpec(), avroBytesDecoder);
  }
```
equality/hashcode should probably consider binaryAsString for their computations
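As a sketch of the suggestion (a hypothetical stand-in class, not the actual Druid code): including `binaryAsString` in both methods means two formats that differ only in that flag compare as unequal, keeping `equals`/`hashCode` consistent with the format's full configuration.

```java
import java.util.Objects;

// Hypothetical sketch: equals/hashCode that also consider binaryAsString.
class FormatKeySketch
{
  final Object flattenSpec;
  final Object avroBytesDecoder;
  final boolean binaryAsString;

  FormatKeySketch(Object flattenSpec, Object avroBytesDecoder, boolean binaryAsString)
  {
    this.flattenSpec = flattenSpec;
    this.avroBytesDecoder = avroBytesDecoder;
    this.binaryAsString = binaryAsString;
  }

  @Override
  public boolean equals(Object o)
  {
    if (this == o) {
      return true;
    }
    if (!(o instanceof FormatKeySketch)) {
      return false;
    }
    final FormatKeySketch that = (FormatKeySketch) o;
    // binaryAsString participates in equality alongside the other two fields
    return binaryAsString == that.binaryAsString
           && Objects.equals(flattenSpec, that.flattenSpec)
           && Objects.equals(avroBytesDecoder, that.avroBytesDecoder);
  }

  @Override
  public int hashCode()
  {
    // binaryAsString is included so hashCode stays consistent with equals
    return Objects.hash(flattenSpec, avroBytesDecoder, binaryAsString);
  }
}
```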
```java
    return CloseableIterators.withEmptyBaggage(
        Iterators.singletonIterator(avroBytesDecoder.parse(ByteBuffer.wrap(IOUtils.toByteArray(source.open())
        ))));
```
nit: strange formatting (occasionally style bot doesn't pick stuff up)
Suggested change:

```java
    return CloseableIterators.withEmptyBaggage(
        Iterators.singletonIterator(avroBytesDecoder.parse(ByteBuffer.wrap(IOUtils.toByteArray(source.open()))))
    );
```
Sorry, I don't understand. Do you mean an integration test for the Avro stream is being developed, and when it is finished I can add a new JSON file for this test? Or do I need to create this integration test myself?
@clintropolis Documentation is done. Do I need to create an integration test? Are there any examples of it? I'm interested in it.
Ah sorry, let me try to explain a bit more. So in the case of the Avro inline schema and Avro schema registry, you should be able to just add the JSON files with the input_format.json templates.
| "type" : "schema_repo", | ||
| "subjectAndIdConverter" : { | ||
| "type" : "avro_1124", | ||
| "topic" : "${YOUR_TOPIC}" | ||
| }, | ||
| "schemaRepository" : { | ||
| "type" : "avro_1124_rest_client", | ||
| "url" : "${YOUR_SCHEMA_REPO_END_POINT}", | ||
| } |
I suggest we should switch to using 'inline' or 'schema-registry' as the example instead of 'schema_repo', which isn't used as frequently in practice as far as I know.
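For reference, a `schema_registry` decoder config is much shorter; as I read the Avro bytes decoder docs it looks roughly like this (the URL is a placeholder):

```json
{
  "type" : "schema_registry",
  "url" : "http://localhost:8081"
}
```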
| Field | Type | Description | Required |
|-------|------|-------------|----------|
| `flattenSpec` | JSON Object | Define a [`flattenSpec`](#flattenspec) to extract nested values from an Avro record. Note that only 'path' expressions are supported ('jq' is unavailable). | no (default will auto-discover 'root' level properties) |
| `avroBytesDecoder` | JSON Object | Specifies how to decode bytes to an Avro record. | yes |
| `binaryAsString` | Boolean | Specifies whether an Avro bytes column that is not logically marked as a string or enum type should be treated as a UTF-8 encoded string. | no (default = false) |
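Putting the table's fields together, a full `inputFormat` block might look like the following sketch (the record schema and flatten field are made-up examples; check the final docs for the exact shape):

```json
"inputFormat": {
  "type": "avro_stream",
  "avroBytesDecoder": {
    "type": "schema_inline",
    "schema": {
      "namespace": "org.apache.druid.example",
      "name": "SomeData",
      "type": "record",
      "fields": [
        { "name": "timestamp", "type": "long" },
        { "name": "someRecord", "type": { "type": "record", "name": "Sub",
          "fields": [ { "name": "subInt", "type": "int" } ] } }
      ]
    }
  },
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": [
      { "type": "path", "name": "someRecordSubInt", "expr": "$.someRecord.subInt" }
    ]
  },
  "binaryAsString": false
}
```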
I think we should move https://github.com/apache/druid/blob/master/docs/ingestion/data-formats.md#avro-bytes-decoder (which currently lives with the 'parsers' documentation) up to this 'input formats' section, and have the parsers section link to the bytes decoder docs here.
I haven't quite determined what is going on yet, but it seems like there is some sort of serialization error that is causing the newly added schema-registry input format integration test to fail: https://travis-ci.com/github/apache/druid/jobs/496046398#L9197. The inline schema test is passing 👍
It seems like https://github.com/apache/druid/blob/master/extensions-core/avro-extensions/src/main/java/org/apache/druid/data/input/avro/SchemaRegistryBasedAvroBytesDecoder.java#L50 is missing getter methods annotated with `@JsonProperty`. Could you add serialization round trip tests for it?
I tested it last week and I know something is wrong with the schema registry decoder. I will fix it this weekend. Thanks for your advice; I will change the code for this exception.
@clintropolis Hi, I fixed this bug and the integration tests pass. I also added a unit test for this.
clintropolis
left a comment
thanks for fixing this up 👍
```java
  public AvroStreamInputFormat(
      @JsonProperty("flattenSpec") @Nullable JSONPathSpec flattenSpec,
      @JsonProperty("avroBytesDecoder") AvroBytesDecoder avroBytesDecoder,
      @JsonProperty("binaryAsString") @Nullable Boolean binaryAsString
```
Missing a `binaryAsString` getter annotated with `@JsonProperty`, I think.
clintropolis
left a comment
lgtm, thanks @bananaaggle 👍
Because parseSpec is deprecated, I developed AvroStreamInputFormat for the new interface, which supports stream ingestion of data encoded with Avro.
This PR has: