
Support csv input format in Kafka ingestion with header#16630

Merged
kfaraz merged 6 commits into apache:master from kfaraz:support_list_row
Jun 25, 2024
Conversation

@kfaraz (Contributor) commented Jun 20, 2024

Description

When Kafka ingestion is set up with ioConfig.type = kafka (i.e. "Parse Kafka metadata" enabled) and the csv input format, the following parse error occurs both while sampling the data and while running an actual ingestion task.


org.apache.druid.java.util.common.parsers.ParseException:
  Unsupported input format in valueFormat. KafkaInputFormat only supports input format
  that return MapBasedInputRow rows

This error eventually fails the sampling with

org.apache.druid.indexing.overlord.sampler.SamplerExceptionMapper
- Failed to sample data: 
   Size of rawColumnsList([[{kafka.timestamp=1718857393187, name=a, kafka.topic=abc, time=2024-06-14T01:00:00Z, value=1}]])
   does not correspond to size of inputRows([[]])

and ingestion simply rejects the events due to the parse exception.

The root cause is that KafkaInputReader expects the input rows to be MapBasedInputRows
so that it can use the underlying event map to blend the row values with the Kafka headers and keys.

Changes

  • Convert ListBasedInputRow to MapBasedInputRow using .asMap()
    while building blended rows that combine values from the Kafka headers, key, and value.
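The conversion in this change can be illustrated with a self-contained sketch (plain Java, no Druid classes; the class and method names here are illustrative, not the actual KafkaInputReader code): a list-based row holds a fixed dimension order plus positional values, and an asMap()-style conversion zips the two into an event map that the blending logic can consume.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ListRowSketch
{
  /**
   * Zips dimension names with their positional values, mimicking what a
   * ListBasedInputRow-to-map conversion conceptually does. Missing trailing
   * values map to null, as a short csv line would produce.
   */
  public static Map<String, Object> asMap(List<String> dimensions, List<Object> values)
  {
    final Map<String, Object> event = new LinkedHashMap<>();
    for (int i = 0; i < dimensions.size(); i++) {
      event.put(dimensions.get(i), i < values.size() ? values.get(i) : null);
    }
    return event;
  }

  public static void main(String[] args)
  {
    // A csv row like "2024-06-14T01:00:00Z,a,1" parsed against columns time,name,value
    Map<String, Object> event = asMap(
        Arrays.asList("time", "name", "value"),
        Arrays.asList("2024-06-14T01:00:00Z", "a", "1")
    );
    System.out.println(event); // prints {time=2024-06-14T01:00:00Z, name=a, value=1}
  }
}
```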

Screenshot after the fix


Testing

  • Add a unit test to KafkaInputFormatTest with csv record payload
  • Tested ingestion and sampling on a local cluster with csv values. (refer to the screenshot above)

Release note

Allow use of the csv input format for Kafka record values when "Parse Kafka metadata" is also enabled.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@kfaraz kfaraz requested a review from clintropolis June 20, 2024 06:41
@kfaraz kfaraz requested a review from AmatyaAvadhanula June 20, 2024 06:41
@kfaraz kfaraz changed the title Support ListBasedInputRow in Kafka ingestion with header Support csv input format in Kafka ingestion with header Jun 20, 2024
@AmatyaAvadhanula (Contributor) left a comment

Thanks for the fix, @kfaraz. LGTM!

Code context for the review thread:

    // Return type for the value parser should be of type MapBasedInputRow
    // Parsers returning other types are not compatible currently.
    valueRow = (MapBasedInputRow) r;
    if (r instanceof ListBasedInputRow) {
@cryptoe (Contributor) commented Jun 20, 2024

I think the performance implication of this if check should be okay.
Anyway, can we add a UT for this?

@cryptoe (Contributor) left a comment

Let's add a UT for the above change.

@clintropolis (Member) left a comment

What if we just made buildBlendedEventMap less picky about stuff? I attached a patch that instead makes buildBlendedEventMap look something like this:

private static Map<String, Object> buildBlendedEventMap(
      Function<String, Object> getRowValue,
      Set<String> rowDimensions,
      Map<String, Object> fallback
  )

so then usage is like:

    return valueParser.read().map(
        r -> {

          final HashSet<String> newDimensions = new HashSet<>(r.getDimensions());
          final Map<String, Object> event = buildBlendedEventMap(r::getRaw, newDimensions, headerKeyList);
...

(attached: kafka-reader.patch)

I didn't test much, but KafkaInputFormatTest passed except for the parse exception test, which fails only because of a different message from different key ordering; that's probably easy to fix.
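The proposed buildBlendedEventMap signature above can be sketched without any Druid classes (this is an eager-copy illustration; the actual patch may build a lazy map view, and the names here are illustrative): values resolved from the row take precedence, and anything not claimed by a row dimension falls back to the Kafka metadata map.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.function.Function;

public class BlendedMapSketch
{
  /**
   * Builds an event map in which row values (looked up via getRowValue)
   * override entries from the fallback map of Kafka metadata
   * (headers, key, timestamp).
   */
  public static Map<String, Object> buildBlendedEventMap(
      Function<String, Object> getRowValue,
      java.util.Set<String> rowDimensions,
      Map<String, Object> fallback
  )
  {
    final Map<String, Object> blended = new HashMap<>(fallback);
    for (String dim : rowDimensions) {
      blended.put(dim, getRowValue.apply(dim)); // row values win over metadata
    }
    return blended;
  }

  public static void main(String[] args)
  {
    Map<String, Object> row = new HashMap<>();
    row.put("name", "a");
    row.put("value", "1");

    Map<String, Object> kafkaMeta = new HashMap<>();
    kafkaMeta.put("kafka.topic", "abc");
    kafkaMeta.put("kafka.timestamp", 1718857393187L);

    Map<String, Object> event =
        buildBlendedEventMap(row::get, new LinkedHashSet<>(row.keySet()), kafkaMeta);
    System.out.println(event.get("kafka.topic") + " " + event.get("name")); // prints abc a
  }
}
```

Because the lookup goes through a Function rather than requiring a concrete event map, the same code path works for MapBasedInputRow and ListBasedInputRow alike, which is what removes the instanceof restriction.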

@kfaraz (Contributor, Author) commented Jun 20, 2024

Thanks for the patch, @clintropolis! I have tested the changes (with a minor tweak for sampling) and updated the PR accordingly. It works as expected.

@kfaraz kfaraz requested a review from clintropolis June 20, 2024 14:15
@clintropolis (Member) left a comment

Might be nice to add a test that uses csv in KafkaInputFormatTest.

Also I think that the keyFormat might have a similar problem, which can be changed to something like this:

          InputRow keyRow = keyIterator.next();
          // Add the key to the mergeList only if the key string is not already present
          mergedHeaderMap.putIfAbsent(
              keyColumnName,
              keyRow.getRaw(Iterables.getOnlyElement(keyRow.getDimensions()))
          );

if we also change the KafkaInputFormat.java key parser to not use the regular input schema, to something like this:

        (keyFormat == null) ?
            null :
            record ->
                (record.getRecord().key() == null) ?
                    null :
                    JsonInputFormat.withLineSplittable(keyFormat, false).createReader(
                        new InputRowSchema(
                            dummyTimestampSpec,
                            DimensionsSpec.EMPTY,
                            null
                        ),
                        new ByteEntity(record.getRecord().key()),
                        temporaryDirectory
                    ),

@kfaraz (Contributor, Author) commented Jun 21, 2024

@clintropolis, I have added a test for a csv value. Do you think it would be okay if we fix the handling of the key format in a follow-up PR?

@kfaraz kfaraz requested a review from clintropolis June 21, 2024 09:34
@clintropolis (Member) left a comment

Changes overall LGTM; it's fine to do the other fix as a follow-up.

@kfaraz kfaraz merged commit f1043d2 into apache:master Jun 25, 2024
@kfaraz (Contributor, Author) commented Jun 25, 2024

Thanks for the reviews, @AmatyaAvadhanula , @clintropolis !

@kfaraz kfaraz deleted the support_list_row branch June 25, 2024 06:20
@asdf2014 (Member) commented Jul 8, 2024

I believe this is worth mentioning in the release notes 👍

gianm added a commit to gianm/druid that referenced this pull request on Oct 3, 2024:

    Follow-up to apache#16630, which fixed a similar issue for the valueFormat.

gianm added a commit that referenced this pull request on Oct 3, 2024:

    * KafkaInputFormat: Fix handling of CSV/TSV keyFormat.

    Follow-up to #16630, which fixed a similar issue for the valueFormat.

    * Simplify.
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024