allow using nested column indexer for schema discovery by clintropolis · Pull Request #13653 · apache/druid

clintropolis · 2023-01-10T08:01:09Z

Description

This PR introduces a new experimental mode for schema discovery which is powered by the 'nested' column indexer. To accompany this, there are also some changes to the nested column selector behavior in the case that the column consists of a single typed 'root' literal column (so no nested data), to allow nested columns to mimic the column type of this root literal.

The result is a schema discovery mode that can produce columns of the correct type, rather than typing every column as STRING as the current schemaless behavior does. Like existing schemaless ingestion, the timestampSpec must still be defined; future enhancements could perhaps add automatic time column selection. Also, in this PR all discovered columns are written out as full nested columns; a future PR will add optimizations to the nested column serializer to store only what is necessary.

I think the most compelling use case for this is with streaming ingestion, since it allows for effortless support of schema evolution.

Example

For example, imagine I have a Kafka topic, schemafree. With the changes in this PR, we can define a very minimal ingestion spec:

{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": {
        "bootstrap.servers": "localhost:9092"
      },
      "topic": "schemafree",
      "inputFormat": {
        "type": "json"
      }
    },
    "tuningConfig": {
      "type": "kafka",
      "appendableIndexSpec": {
        "type": "onheap",
        "useNestedColumnIndexerForSchemaDiscovery": true
      }
    },
    "dataSchema": {
      "dataSource": "schemafree",
      "timestampSpec": {
        "column": "time",
        "format": "iso"
      },
      "dimensionsSpec":{}
    }
  }
}

If we send a first batch of events to our topic:

{"time":"2023-01-07T00:00:00Z", "some_long":1234, "some_double":1.23, "some_string":"a", "some_variant":"a"}
{"time":"2023-01-07T01:00:00Z", "some_long":5678, "some_double":4.56, "some_string":"b", "some_variant":1}
{"time":"2023-01-07T01:10:00Z", "some_long":1111, "some_string":"c", "some_variant":2.2}
{"time":"2023-01-07T01:20:00Z", "some_double":11.11, "some_variant":1}

useNestedColumnIndexerForSchemaDiscovery set on appendableIndexSpec of the tuningConfig tells the IncrementalIndex to use a NestedColumnIndexer for any discovered dimensions instead of a StringDimensionIndexer. The new mimic behavior of nested column selectors then allows queries to see these discovered columns as their correct type:
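As a rough illustration of what "correct type" means for the first batch of events, here is a minimal sketch of per-column type tracking. It is not Druid's nested column indexer: the class DiscoveredTypes, its ValueType enum, and addRow are hypothetical names made up for this example, and getSingleType only echoes the name of the call visible in the review snippets of this PR. The assumption is that a column can mimic a plain typed column only when exactly one value type has been observed for it.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for per-column type tracking; not Druid code.
final class DiscoveredTypes
{
  enum ValueType { STRING, LONG, DOUBLE }

  // Field name -> set of value types observed so far.
  private final Map<String, EnumSet<ValueType>> seen = new LinkedHashMap<>();

  void addRow(Map<String, Object> row)
  {
    for (Map.Entry<String, Object> e : row.entrySet()) {
      final Object v = e.getValue();
      if (v == null) {
        continue; // a missing value does not contribute a type
      }
      final ValueType type;
      if (v instanceof Long || v instanceof Integer) {
        type = ValueType.LONG;
      } else if (v instanceof Number) {
        type = ValueType.DOUBLE;
      } else {
        type = ValueType.STRING;
      }
      seen.computeIfAbsent(e.getKey(), k -> EnumSet.noneOf(ValueType.class)).add(type);
    }
  }

  // Returns the only type observed for the field, or null when the field is
  // unknown or has been seen with more than one type.
  ValueType getSingleType(String field)
  {
    final EnumSet<ValueType> types = seen.get(field);
    return types != null && types.size() == 1 ? types.iterator().next() : null;
  }

  public static void main(String[] args)
  {
    final DiscoveredTypes d = new DiscoveredTypes();
    d.addRow(Map.of("some_long", 1234, "some_double", 1.23, "some_string", "a", "some_variant", "a"));
    d.addRow(Map.of("some_long", 5678, "some_double", 4.56, "some_string", "b", "some_variant", 1));
    d.addRow(Map.of("some_long", 1111, "some_string", "c", "some_variant", 2.2));
    d.addRow(Map.of("some_double", 11.11, "some_variant", 1));
    for (String field : new String[]{"some_long", "some_double", "some_string", "some_variant"}) {
      System.out.println(field + " -> " + d.getSingleType(field));
    }
  }
}
```

In this sketch some_variant ends up with no single type because it was seen as a string, a long, and a double. How Druid itself presents such a mixed column is not covered by the sketch.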

Adding additional events:

{"time":"2023-01-07T00:00:00Z", "other_long": 1111, "other_double":2.22, "other_string": "zz"}
{"time":"2023-01-07T00:00:00Z", "other_long": 2222, "other_double":3.33, "other_string": "yy"}
{"time":"2023-01-07T00:00:00Z", "other_long": 3333, "other_double":4.44, "other_string": "xx"}
{"time":"2023-01-07T00:00:00Z", "other_long": 4444, "other_double":5.55, "other_string": "ww"}

these are picked up as well:

The nested column selectors are using the same nested literal column selectors that are used for the nested virtual columns that back SQL functions like JSON_VALUE, so the performance is approximately the same as if these were regular literal columns, and we can query them as if they were such, grouping and aggregating and so on:

Follow-up work

The most important pieces to follow are:

improving nested column serializer to optimize the root literal case
refactor flattener machinery which provides column discovery to include 'nested' columns... ironically right now these are filtered out so that columns with actual nested data will not be automatically ingested even though the nested column indexer was literally built for this
refactor column merging code to use something more purpose-built than ColumnCapabilities for segment merging, picking column handlers, etc., something like ColumnFormat or ColumnShape. For now I have added the concept of 'handler' capabilities as a crutch to allow merging to choose the nested column merger even though the column capabilities report as STRING or LONG or whatever with nested columns, but going forward I think something nicer can be built; more on this later
i'm sure i'm forgetting other things 🙃

This PR has:

been self-reviewed.
added documentation for new or modified features or behaviors.
a release note entry in the PR description.
added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
added integration tests.
been tested in a test Druid cluster.

fjy · 2023-01-10T16:54:58Z

Is there a way to do this without a new configuration setting? It would be nice if this could just be the default without having anyone think about configuration.

I wasn't super sure reading the description, but for SQL-based ingestion, is Select * the only command needed?

Finally, does this work on auto-type detection apply to evolving schemas?

clintropolis · 2023-01-10T22:58:00Z

Is there a way to do this without a new configuration setting? It would be nice if this could just be the default without having anyone think about configuration.

I think my intention is for this to be temporary while I make improvements to the feature to ensure it is production ready. After the storage optimizations are done, I think we can switch this to be the default, or could just remove the setting entirely once comfortable that it fully replaces the existing string only discovery behavior.

I wasn't super sure reading the description, but for SQL-based ingestion, is Select * the only command needed?

There is still a bit of work to be done to support this with SQL based ingestion which will be in a follow-up PR, but that is my goal - that we only have to do a select *.

Finally, does this work on auto-type detection apply to evolving schemas?

I do think this is where this feature will truly find its best use, particularly with streams that have some variety and/or change over time. This PR enables all columns from some input to be discovered and ingested as the correct type, which is certainly part of the schema evolution story. However, this PR doesn't make any changes for cases where the schema varies between segments, so it will use Druid's existing best-effort mechanisms to execute the query as asked, potentially casting values whenever necessary and the like. I think there is potentially some room for further improvement in this area as well, such as adding support for coercing values to a certain type to match existing schemas, but those will be future enhancements.

imply-cheddar

A few comments here and there. Are there old tests that we could adjust the indexing to happen through auto-detection just to increase our confidence? I thought that was done, but didn't see it, perhaps I overlooked something?

imply-cheddar · 2023-01-11T00:14:14Z

+          if (o == null) {
+            return null;
+          }
+          return String.valueOf(o);


Nit: Given that you've already done the null check, this is the same as o.toString().
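A small sketch of the equivalence the nit relies on, using only java.lang.String.valueOf(Object) and Object.toString(). The class and method names here are hypothetical and exist only for the comparison.

```java
// Hypothetical helper class; only String.valueOf and toString are real APIs.
final class NullSafeToString
{
  // Shape of the code under review: null check, then String.valueOf.
  static String viaValueOf(Object o)
  {
    if (o == null) {
      return null;
    }
    return String.valueOf(o);
  }

  // Shape suggested by the reviewer: null check, then toString.
  static String viaToString(Object o)
  {
    if (o == null) {
      return null;
    }
    return o.toString();
  }

  public static void main(String[] args)
  {
    System.out.println(viaValueOf(1234L).equals(viaToString(1234L)));
    System.out.println(viaValueOf(null) == viaToString(null));
  }
}
```

String.valueOf(Object) only differs from toString() by returning the string "null" for a null argument, and the earlier null check means that branch is never reached.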

imply-cheddar · 2023-01-11T00:20:29Z

+    if (
+        fieldIndexers.size() == 1 &&
+        fieldIndexers.containsKey(NestedPathFinder.JSON_PATH_ROOT) &&
+        fieldIndexers.get(NestedPathFinder.JSON_PATH_ROOT).getTypes().getSingleType() != null
+    ) {
+      final ColumnValueSelector delegate = makeColumnValueSelector(currEntry, desc);


The call to makeColumnValueSelector does the exact same validations as the if statement above here. Instead of reusing that function, how about creating a private function that can be invoked assuming that these things are true and use that instead?

imply-cheddar · 2023-01-11T00:22:07Z

+          public boolean isNull()
+          {
+            final Object o = getObject();
+            return !(o instanceof Number);


Numbers cannot be null? Or... what's the logic of this instanceof check? I had expected just a normal o == null?

isNull is technically, per the javadocs: "Returns true if the primitive long, double, or float value returned by this selector should be treated as null." So in this case it's treating anything that is not numeric as null, since it is going to precede a call to getLong/getDouble/getFloat.
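To make that contract concrete, here is a minimal sketch of a selector whose isNull guards the primitive getters. NumericMimicSelector and its Supplier-based constructor are hypothetical, not Druid's ColumnValueSelector; only the instanceof Number check is taken from the snippet under review. Returning zero from the primitive getters for non-numeric values is an assumption of the sketch.

```java
import java.util.function.Supplier;

// Hypothetical selector; models only the isNull / primitive getter contract.
final class NumericMimicSelector
{
  private final Supplier<Object> rowValue;

  NumericMimicSelector(Supplier<Object> rowValue)
  {
    this.rowValue = rowValue;
  }

  Object getObject()
  {
    return rowValue.get();
  }

  // True when the primitive getters have nothing numeric to return,
  // which covers both null and non-numeric values such as strings.
  boolean isNull()
  {
    return !(getObject() instanceof Number);
  }

  // Callers are expected to check isNull() first; zero is a placeholder here.
  long getLong()
  {
    final Object o = getObject();
    return o instanceof Number ? ((Number) o).longValue() : 0L;
  }

  double getDouble()
  {
    final Object o = getObject();
    return o instanceof Number ? ((Number) o).doubleValue() : 0.0;
  }

  public static void main(String[] args)
  {
    System.out.println(new NumericMimicSelector(() -> 1234L).isNull());
    System.out.println(new NumericMimicSelector(() -> "a").isNull());
    System.out.println(new NumericMimicSelector(() -> null).isNull());
  }
}
```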

imply-cheddar · 2023-01-11T00:23:55Z

+          @Override
+          public Object getObject()
+          {
+            final int dimIndex = desc.getIndex();


Does this dimIndex change across different rows? It looks like we are calling it once-per-row, but it seems like it should be static?

dimIndex is backed by a final, so it is indeed fixed for the life of the selector

I think it would be nice to just grab it once and reuse instead of calling getIndex() each time?
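A minimal sketch of that suggestion, assuming the index really is constant for the selector's lifetime. The DimensionDesc interface below is a hypothetical one-method stand-in for the descriptor in the diff, and passing the row's dimension array into getObject is a simplification made for the example.

```java
// Hypothetical selector that reads the dimension index once at construction.
final class CachedIndexSelector
{
  // Stand-in for the descriptor seen in the diff; only getIndex() is modeled.
  interface DimensionDesc
  {
    int getIndex();
  }

  private final int dimIndex;

  CachedIndexSelector(DimensionDesc desc)
  {
    // Grab the index once instead of calling getIndex() for every row.
    this.dimIndex = desc.getIndex();
  }

  // Returns the value at the cached index, or null when the row is too short.
  Object getObject(Object[] dims)
  {
    return dimIndex < dims.length ? dims[dimIndex] : null;
  }

  public static void main(String[] args)
  {
    final CachedIndexSelector selector = new CachedIndexSelector(() -> 1);
    System.out.println(selector.getObject(new Object[]{"a", 42L}));
    System.out.println(selector.getObject(new Object[]{"a"}));
  }
}
```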

imply-cheddar · 2023-01-11T00:27:47Z

+        return ColumnCapabilitiesImpl.createSimpleNumericColumnCapabilities(rootField.getTypes().getSingleType())
+                                     .setHasNulls(hasNulls);


This seems weird, the code seems like it's returning a numeric type, but it's actually using whatever type was found, which could be a String. Looking at the actual semantics of what is returned by that static method, it seems like it's really just building a capabilities that is "single type, just the values, no dictionaries, no frills". Can we rename that static method or add a new one that is more appropriately named?

oops, I left this like this since I was originally prototyping with long inputs and just threw this in there; will adjust

imply-cheddar · 2023-01-11T04:43:23Z

@@ -91,6 +95,15 @@ public NestedDataColumnSupplier(
        fields = GenericIndexed.read(bb, GenericIndexed.STRING_STRATEGY, mapper);


This is a comment on old code and I think we'd have to push the version forward yet again to adjust this, which might not be worth it, but it's a bit sad to me that we aren't using the front-coded stuff for the fields. They are pretty much guaranteed to benefit from it.

yeah, will keep that in mind the next time we need to bump format version

imply-cheddar · 2023-01-11T04:46:18Z

+        if (fields.size() == 1 &&
+            ((version == 0x03 && NestedPathFinder.JQ_PATH_ROOT.equals(fields.get(0))) ||
+             (version == 0x04 && NestedPathFinder.JSON_PATH_ROOT.equals(fields.get(0))))
+        ) {
+          simpleType = fieldInfo.getTypes(0).getSingleType();
+        } else {
+          simpleType = null;
+        }


Just generally speaking, all of this work is being done in a constructor. It would be better to have it in a static method and just pass the needed things into the constructor. I realize it's a nit comment on old code...

agree, will rework this in a follow-up PR
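One way the reviewer's suggestion could look, as a sketch only: the version and root-path check moves into a static method, and the constructor would simply receive its result. SimpleTypeDetector, detectSimpleType, and the use of plain strings for types are hypothetical. The constant values "." and "$" are assumed values for NestedPathFinder.JQ_PATH_ROOT and NestedPathFinder.JSON_PATH_ROOT and should be checked against the Druid source.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; mirrors the condition in the diff outside a constructor.
final class SimpleTypeDetector
{
  static final String JQ_PATH_ROOT = ".";   // assumed value of NestedPathFinder.JQ_PATH_ROOT
  static final String JSON_PATH_ROOT = "$"; // assumed value of NestedPathFinder.JSON_PATH_ROOT

  // fields: the field paths stored in the column.
  // singleTypes: for each field, its single type name, or null if mixed.
  static String detectSimpleType(byte version, List<String> fields, List<String> singleTypes)
  {
    final boolean rootOnly =
        fields.size() == 1 &&
        ((version == 0x03 && JQ_PATH_ROOT.equals(fields.get(0))) ||
         (version == 0x04 && JSON_PATH_ROOT.equals(fields.get(0))));
    return rootOnly ? singleTypes.get(0) : null;
  }

  public static void main(String[] args)
  {
    System.out.println(detectSimpleType((byte) 0x04, Collections.singletonList("$"), Collections.singletonList("LONG")));
    System.out.println(detectSimpleType((byte) 0x04, Arrays.asList("$.a", "$.b"), Arrays.asList("LONG", "STRING")));
  }
}
```

With the check in a static method, the result can be unit tested without building the whole supplier, and the constructor can take the precomputed type as a plain argument.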

imply-cheddar · 2023-01-11T04:55:13Z

+    return index;
+  }
+
+  private MapBasedInputRow makeInputRow(


Perhaps you could make this public static on MapBasedInputRow? Seems like a nice enough helper thing to have available.

will move in a future PR

imply-cheddar · 2023-01-12T01:08:07Z

-          delegate.inspectRuntimeShape(inspector);
+


I'll be honest that I'm unclear on what inspectRuntimeShape actually does, but you removed the call to the delegate instead of changing it to call on rootLiteralSelector. Is that intentional?

inspectRuntimeShape is used with the topN specialization stuff, which doesn't do much if anything for incremental index queries afaik. I removed it from here because the rootLiteralSelector method also does nothing, so it seemed pointless.

single typed "root" only nested columns now mimic "regular" columns of those types

e40e0c8

clintropolis added Area - Querying, Area - Segment Format and Ser/De, Area - Ingestion labels Jan 10, 2023

oops

fd698e9

github-advanced-security AI found potential problems Jan 10, 2023

Comment thread processing/src/main/java/org/apache/druid/segment/NestedDataColumnIndexer.java Fixed

clintropolis added 4 commits January 10, 2023 02:38

final

a5cb06d

sharing is caring

35ad4ee

fixes

1c4e991

unused

5ee8cb6

adjust

b553957

imply-cheddar reviewed Jan 11, 2023

adjust

c019edf

imply-cheddar approved these changes Jan 12, 2023

clintropolis added 4 commits January 12, 2023 04:59

fix it

5f98974

oops

8a552df

wrong

56f45be

more consistent

6de1625

clintropolis merged commit b5b740b into apache:master Jan 13, 2023

clintropolis deleted the nested-column-mimic branch January 13, 2023 02:31

clintropolis mentioned this pull request Jan 14, 2023

discover nested columns when using nested column indexer for schemaless ingestion #13672

Merged

6 tasks

This was referenced Jan 26, 2023

sampler + type detection = bff #13711

Merged

fix nested column handling of null and "null" #13714

Merged

various nested column (and other) fixes #13732

Merged

vogievetsky mentioned this pull request Apr 3, 2023

Web console: use new sampler features #14017

Merged

clintropolis added this to the 26.0 milestone Apr 10, 2023

vtlim mentioned this pull request Apr 18, 2023

[DRAFT] 26.0.0 release notes #14064

Closed
