Add avro_ocf to supported Kafka/Kinesis InputFormats #11865

Merged

a2l007 merged 6 commits into apache:master from jacobtolar:patch-2 on Dec 3, 2021

Conversation

@jacobtolar
Contributor

Description

Update docs to add avro_ocf to the list of supported input formats for Kafka/Kinesis. Also updated the Kinesis docs to more closely match the Kafka docs (importing some of the changes from this PR: https://github.com/apache/druid/pull/11624/files).

The avro_ocf input format was added here: #9671
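
For reference, selecting this format in a supervisor spec comes down to the `inputFormat` block inside `ioConfig`. A minimal sketch of the Kafka case, based on the Druid docs being updated here (the topic name and broker address are placeholders; `binaryAsString` and a reader `schema` are optional fields):

```json
{
  "type": "kafka",
  "ioConfig": {
    "topic": "my-ocf-topic",
    "inputFormat": {
      "type": "avro_ocf",
      "binaryAsString": false
    },
    "consumerProperties": {
      "bootstrap.servers": "localhost:9092"
    }
  }
}
```

The Kinesis variant is analogous: the same `inputFormat` object, with the Kinesis-specific `ioConfig` fields (stream name, endpoint) around it.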


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster - we have one datasource set up to ingest using avro_ocf, so I know that's working as documented here. I haven't tested with Kinesis but have no reason to believe it would not also work.

@a2l007
Contributor

a2l007 commented Dec 2, 2021

@jacobtolar LGTM. Could you please resolve the conflicts?

@jacobtolar
Contributor Author

Ah, looks like a later PR (#11912) entirely reworked the Kafka ingestion docs.

@a2l007 a2l007 merged commit f7f5505 into apache:master on Dec 3, 2021
@clintropolis
Member

I'm a bit curious: Avro OCF is a file format, so is it common to put these files in streaming ingest messages? There is no technical reason this wouldn't work if the files were small enough to fit in the messages, since it is all just binary blobs in the end. I was mostly wondering whether this is a common use case compared to the streaming-oriented Avro formats we support (inline schema, multi-inline-schema, schema repo, schema registry).
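
For contrast, the streaming-oriented path mentioned above carries the schema in the ingestion spec (or in a registry) rather than inside every message. A hedged sketch of the inline-schema variant of the `avro_stream` input format, with a made-up record schema purely for illustration:

```json
{
  "type": "avro_stream",
  "avroBytesDecoder": {
    "type": "schema_inline",
    "schema": {
      "namespace": "example",
      "name": "Event",
      "type": "record",
      "fields": [
        {"name": "timestamp", "type": "long"},
        {"name": "value", "type": "double"}
      ]
    }
  }
}
```

With this decoder each message is a bare Avro-encoded record; with `avro_ocf`, each message is a complete container file that embeds its own writer schema.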

@jacobtolar
Contributor Author

I don't know that it's a common use case... but we have some scenarios where we do this. There's obviously some overhead to providing the schema in every message (the cost is amortized somewhat by providing many records in a single Kafka message), but it's nice not needing an extra component (a schema registry).

The avro_ocf support works right now by writing every message to a file on localhost...which isn't ideal for streaming in one 'file' per message (but technically works, if your disks are fast enough or your data volume is low enough 🙃). When I get some time I plan to submit a PR so you can configure that to happen in memory which should make it more usable.

@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022
techdocsmith added a commit to techdocsmith/druid that referenced this pull request Jul 26, 2024
techdocsmith added a commit that referenced this pull request Jul 26, 2024
sreemanamala pushed a commit to sreemanamala/druid that referenced this pull request Aug 6, 2024