Skip to content

allow using nested column indexer for schema discovery#13653

Merged
clintropolis merged 12 commits intoapache:masterfrom
clintropolis:nested-column-mimic
Jan 13, 2023
Merged

allow using nested column indexer for schema discovery#13653
clintropolis merged 12 commits intoapache:masterfrom
clintropolis:nested-column-mimic

Conversation

@clintropolis
Copy link
Copy Markdown
Member

@clintropolis clintropolis commented Jan 10, 2023

Description

This PR introduces a new experimental mode for schema discovery which is powered by the 'nested' column indexer. To accompany this, there are also some changes to the nested column selector behavior in the case that the column consists of a single typed 'root' literal column (so no nested data), to allow nested columns to mimic the column type of this root literal.

The result is a schema discovery mode which can produce columns of the correct type rather than being limited to all columns being STRING typed with the current schemaless behavior. Like existing schemaless ingestion, the timestampSpec must still be defined, perhaps future enhancements could add automatic time column selection. Also in this PR all discovered columns are writing out full nested columns, a future PR will add optimizations to the nested column serializer to only store what is necessary.

I think the most compelling use case for this is with streaming ingestion, since it allows for effortless support of schema evolution.

Example

For example, imagine I have a kafka topic, schemafree. With the changes in this PR, we can define a very minimal ingestion spec:

{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "consumerProperties": {
        "bootstrap.servers": "localhost:9092"
      },
      "topic": "schemafree",
      "inputFormat": {
        "type": "json"
      }
    },
    "tuningConfig": {
      "type": "kafka",
      "appendableIndexSpec": {
        "type": "onheap",
        "useNestedColumnIndexerForSchemaDiscovery": true
      }
    },
    "dataSchema": {
      "dataSource": "schemafree",
      "timestampSpec": {
        "column": "time",
        "format": "iso"
      },
      "dimensionsSpec":{}
    }
  }
}

If we send a first batch of events to our topic:

{"time":"2023-01-07T00:00:00Z", "some_long":1234, "some_double":1.23, "some_string":"a", "some_variant":"a"}
{"time":"2023-01-07T01:00:00Z", "some_long":5678, "some_double":4.56, "some_string":"b", "some_variant":1}
{"time":"2023-01-07T01:10:00Z", "some_long":1111, "some_string":"c", "some_variant":2.2}
{"time":"2023-01-07T01:20:00Z", "some_double":11.11, "some_variant":1}

useNestedColumnIndexerForSchemaDiscovery set on appendableIndexSpec of the tuningConfig tells the IncrementalIndex to use a NestedColumnIndexer for any discovered dimensions instead of a StringDimensionIndexer. The new mimic behavior of nested column selectors then allows queries to see these discovered columns as their correct type:

Screen Shot 2023-01-09 at 10 46 21 PM

Adding additional events:

{"time":"2023-01-07T00:00:00Z", "other_long": 1111, "other_double":2.22, "other_string": "zz"}
{"time":"2023-01-07T00:00:00Z", "other_long": 2222, "other_double":3.33, "other_string": "yy"}
{"time":"2023-01-07T00:00:00Z", "other_long": 3333, "other_double":4.44, "other_string": "xx"}
{"time":"2023-01-07T00:00:00Z", "other_long": 4444, "other_double":5.55, "other_string": "ww"}

these are picked up as well:

Screen Shot 2023-01-09 at 10 47 19 PM

The nested column selectors are using the same nested literal column selectors that are used for the nested virtual columns that back SQL functions like JSON_VALUE, so the performance is approximately the same as if these were regular literal columns, and we can query them as if they were such, grouping and aggregating and so on:

Screen Shot 2023-01-09 at 11 29 48 PM

Follow-up work

The most important pieces to follow are:

  • improving nested column serializer to optimize the root literal case
  • refactor flattener machinery which provides column discovery to include 'nested' columns... ironically right now these are filtered out so that columns with actual nested data will not be automatically ingested even though the nested column indexer was literally built for this
  • refactor column merging code to use something more purpose built than ColumnCapabilities for segment merging/picking column handlers/etc, something like ColumnFormat or ColumnShape. For now I have added the concept of 'handler' capabilities as a crutch to allow merging to choose the nested column merger even though the column capabilities reports as a STRING or LONG or whatever with nested columns, but going forward i think something nicer can be built, more on this later
  • i'm sure i'm forgetting other things 🙃

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@fjy
Copy link
Copy Markdown
Contributor

fjy commented Jan 10, 2023

Is there a way to do this without a new configuration setting? It would be nice if this could just be the default without having anyone think about configuration.

I wasn't super sure reading the description, but for SQL-based ingestion, is Select * the only command needed?

Finally, does this work on auto-type detection apply to evolving schemas?

@clintropolis
Copy link
Copy Markdown
Member Author

Is there a way to do this without a new configuration setting? It would be nice if this could just be the default without having anyone think about configuration.

I think my intention is for this to be temporary while I make improvements to the feature to ensure it is production ready. After the storage optimizations are done, I think we can switch this to be the default, or could just remove the setting entirely once comfortable that it fully replaces the existing string only discovery behavior.

I wasn't super sure reading the description, but for SQL-based ingestion, is Select * the only command needed?

There is still a bit of work to be done to support this with SQL based ingestion which will be in a follow-up PR, but that is my goal - that we only have to do a select *.

Finally, does this work on auto-type detection apply to evolving schemas?

I do think this is where this feature will truly find its best use, particularly with streams that have some variety and/or change over time. This PR enable all columns from some input to be discovered and ingested as the correct type, which is certainly part of the schema evolution story. However, this PR doesn't make any changes in cases when the schema varies between segments, so it will use druid's existing best effort mechanisms to execute the query as asked, potentially casting values whenever necessary and the like. I think there is potentially some room for further improvement in this area as well, such as adding support for coercing values to a certain type to match existing schemas, etc, but that will be future enhancements.

Copy link
Copy Markdown
Contributor

@imply-cheddar imply-cheddar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments here and there. Are there old tests that we could adjust the indexing to happen through auto-detection just to increase our confidence? I thought that was done, but didn't see it, perhaps I overlooked something?

if (o == null) {
return null;
}
return String.valueOf(o);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Given that you've already done the null check, this is the same as o.toString().

Comment on lines +145 to +150
if (
fieldIndexers.size() == 1 &&
fieldIndexers.containsKey(NestedPathFinder.JSON_PATH_ROOT) &&
fieldIndexers.get(NestedPathFinder.JSON_PATH_ROOT).getTypes().getSingleType() != null
) {
final ColumnValueSelector delegate = makeColumnValueSelector(currEntry, desc);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The call to makeColumnValueSelector does the exact same validations as the if statement above here. Instead of reusing that function, how about creating a private function that can be invoked assuming that these things are true and use that instead?

public boolean isNull()
{
final Object o = getObject();
return !(o instanceof Number);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Numbers cannot be null? Or... what's the logic of this instanceof check? I had expected just a normal o == null?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isNull is technically from javadocs: Returns true if the primitive long, double, or float value returned by this selector should be treated as null. so in this case its treating anything that is not numeric as null since it is going to precede a call to getLong/getDouble/getFloat

@Override
public Object getObject()
{
final int dimIndex = desc.getIndex();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this dimIndex change across different rows? It looks like we are calling it once-per-row, but it seems like it should be static?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dimIndex is backed by a final so it indeed fixed for the life of the selector

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to just grab it once and reuse instead of calling getIndex() each time?

Comment on lines +288 to +289
return ColumnCapabilitiesImpl.createSimpleNumericColumnCapabilities(rootField.getTypes().getSingleType())
.setHasNulls(hasNulls);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems weird, the code seems like it's returning a numeric type, but it's actually using whatever type was found, which could be a String. Looking at the actual semantics of what is returned by that static method, it seems like it's really just building a capabilities that is "single type, just the values, no dictionaries, no frills". Can we rename that static method or add a new one that is more appropriately named?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, i left this like this since I was originally prototyping with long inputs and just threw this in there, will adjust

@@ -91,6 +95,15 @@ public NestedDataColumnSupplier(
fields = GenericIndexed.read(bb, GenericIndexed.STRING_STRATEGY, mapper);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a comment on old code and I think we'd have to push the version forward yet again to adjust this, which might not be worth it, but it's a bit sad to me that we aren't using the front-coded stuff for the fields. They are pretty much guaranteed to benefit from it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, will keep that in mind the next time we need to bump format version

Comment on lines +98 to +105
if (fields.size() == 1 &&
((version == 0x03 && NestedPathFinder.JQ_PATH_ROOT.equals(fields.get(0))) ||
(version == 0x04 && NestedPathFinder.JSON_PATH_ROOT.equals(fields.get(0))))
) {
simpleType = fieldInfo.getTypes(0).getSingleType();
} else {
simpleType = null;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just generally speaking, all of this work is being done in a constructor. It would be better to have it in a static method and just pass the needed things into the constructor. I realize it's a nit comment on old code...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, will rework this in a follow-up PR

return index;
}

private MapBasedInputRow makeInputRow(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you could make this public static on MapBasedInputRow? Seems like a nice enough helper thing to have available.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will move in a future PR

Comment on lines +167 to +164
delegate.inspectRuntimeShape(inspector);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll be honest that I'm unclear on what inspectRuntimeShape actually does, but you removed the call to the delegate instead of changing it to call on rootLiteralSelector. Is that intentional?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inspectRuntimeShape is used with the topN specialization stuff which doesn't do much if anything to incremental index queries afaik. i removed it from here because rootLiteralSelector method also does nothing so it seemed pointless

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants