allow using nested column indexer for schema discovery#13653
allow using nested column indexer for schema discovery#13653clintropolis merged 12 commits intoapache:masterfrom
Conversation
|
Is there a way to do this without a new configuration setting? It would be nice if this could just be the default without having anyone think about configuration. I wasn't super sure reading the description, but for SQL-based ingestion, is Select * the only command needed? Finally, does this work on auto-type detection apply to evolving schemas? |
I think my intention is for this to be temporary while I make improvements to the feature to ensure it is production ready. After the storage optimizations are done, I think we can switch this to be the default, or could just remove the setting entirely once comfortable that it fully replaces the existing string only discovery behavior.
There is still a bit of work to be done to support this with SQL based ingestion which will be in a follow-up PR, but that is my goal - that we only have to do a select *.
I do think this is where this feature will truly find its best use, particularly with streams that have some variety and/or change over time. This PR enable all columns from some input to be discovered and ingested as the correct type, which is certainly part of the schema evolution story. However, this PR doesn't make any changes in cases when the schema varies between segments, so it will use druid's existing best effort mechanisms to execute the query as asked, potentially casting values whenever necessary and the like. I think there is potentially some room for further improvement in this area as well, such as adding support for coercing values to a certain type to match existing schemas, etc, but that will be future enhancements. |
imply-cheddar
left a comment
There was a problem hiding this comment.
A few comments here and there. Are there old tests that we could adjust the indexing to happen through auto-detection just to increase our confidence? I thought that was done, but didn't see it, perhaps I overlooked something?
| if (o == null) { | ||
| return null; | ||
| } | ||
| return String.valueOf(o); |
There was a problem hiding this comment.
Nit: Given that you've already done the null check, this is the same as o.toString().
| if ( | ||
| fieldIndexers.size() == 1 && | ||
| fieldIndexers.containsKey(NestedPathFinder.JSON_PATH_ROOT) && | ||
| fieldIndexers.get(NestedPathFinder.JSON_PATH_ROOT).getTypes().getSingleType() != null | ||
| ) { | ||
| final ColumnValueSelector delegate = makeColumnValueSelector(currEntry, desc); |
There was a problem hiding this comment.
The call to makeColumnValueSelector does the exact same validations as the if statement above here. Instead of reusing that function, how about creating a private function that can be invoked assuming that these things are true and use that instead?
| public boolean isNull() | ||
| { | ||
| final Object o = getObject(); | ||
| return !(o instanceof Number); |
There was a problem hiding this comment.
Numbers cannot be null? Or... what's the logic of this instanceof check? I had expected just a normal o == null?
There was a problem hiding this comment.
isNull is technically from javadocs: Returns true if the primitive long, double, or float value returned by this selector should be treated as null. so in this case its treating anything that is not numeric as null since it is going to precede a call to getLong/getDouble/getFloat
| @Override | ||
| public Object getObject() | ||
| { | ||
| final int dimIndex = desc.getIndex(); |
There was a problem hiding this comment.
Does this dimIndex change across different rows? It looks like we are calling it once-per-row, but it seems like it should be static?
There was a problem hiding this comment.
dimIndex is backed by a final so it indeed fixed for the life of the selector
There was a problem hiding this comment.
I think it would be nice to just grab it once and reuse instead of calling getIndex() each time?
| return ColumnCapabilitiesImpl.createSimpleNumericColumnCapabilities(rootField.getTypes().getSingleType()) | ||
| .setHasNulls(hasNulls); |
There was a problem hiding this comment.
This seems weird, the code seems like it's returning a numeric type, but it's actually using whatever type was found, which could be a String. Looking at the actual semantics of what is returned by that static method, it seems like it's really just building a capabilities that is "single type, just the values, no dictionaries, no frills". Can we rename that static method or add a new one that is more appropriately named?
There was a problem hiding this comment.
oops, i left this like this since I was originally prototyping with long inputs and just threw this in there, will adjust
| @@ -91,6 +95,15 @@ public NestedDataColumnSupplier( | |||
| fields = GenericIndexed.read(bb, GenericIndexed.STRING_STRATEGY, mapper); | |||
There was a problem hiding this comment.
This is a comment on old code and I think we'd have to push the version forward yet again to adjust this, which might not be worth it, but it's a bit sad to me that we aren't using the front-coded stuff for the fields. They are pretty much guaranteed to benefit from it.
There was a problem hiding this comment.
yeah, will keep that in mind the next time we need to bump format version
| if (fields.size() == 1 && | ||
| ((version == 0x03 && NestedPathFinder.JQ_PATH_ROOT.equals(fields.get(0))) || | ||
| (version == 0x04 && NestedPathFinder.JSON_PATH_ROOT.equals(fields.get(0)))) | ||
| ) { | ||
| simpleType = fieldInfo.getTypes(0).getSingleType(); | ||
| } else { | ||
| simpleType = null; | ||
| } |
There was a problem hiding this comment.
Just generally speaking, all of this work is being done in a constructor. It would be better to have it in a static method and just pass the needed things into the constructor. I realize it's a nit comment on old code...
There was a problem hiding this comment.
agree, will rework this in a follow-up PR
| return index; | ||
| } | ||
|
|
||
| private MapBasedInputRow makeInputRow( |
There was a problem hiding this comment.
Perhaps you could make this public static on MapBasedInputRow? Seems like a nice enough helper thing to have available.
There was a problem hiding this comment.
will move in a future PR
| delegate.inspectRuntimeShape(inspector); | ||
|
|
There was a problem hiding this comment.
I'll be honest that I'm unclear on what inspectRuntimeShape actually does, but you removed the call to the delegate instead of changing it to call on rootLiteralSelector. Is that intentional?
There was a problem hiding this comment.
inspectRuntimeShape is used with the topN specialization stuff which doesn't do much if anything to incremental index queries afaik. i removed it from here because rootLiteralSelector method also does nothing so it seemed pointless
Description
This PR introduces a new experimental mode for schema discovery which is powered by the 'nested' column indexer. To accompany this, there are also some changes to the nested column selector behavior in the case that the column consists of a single typed 'root' literal column (so no nested data), to allow nested columns to mimic the column type of this root literal.
The result is a schema discovery mode which can produce columns of the correct type rather than being limited to all columns being
STRINGtyped with the current schemaless behavior. Like existing schemaless ingestion, thetimestampSpecmust still be defined, perhaps future enhancements could add automatic time column selection. Also in this PR all discovered columns are writing out full nested columns, a future PR will add optimizations to the nested column serializer to only store what is necessary.I think the most compelling use case for this is with streaming ingestion, since it allows for effortless support of schema evolution.
Example
For example, imagine I have a kafka topic,
schemafree. With the changes in this PR, we can define a very minimal ingestion spec:{ "type": "kafka", "spec": { "ioConfig": { "type": "kafka", "consumerProperties": { "bootstrap.servers": "localhost:9092" }, "topic": "schemafree", "inputFormat": { "type": "json" } }, "tuningConfig": { "type": "kafka", "appendableIndexSpec": { "type": "onheap", "useNestedColumnIndexerForSchemaDiscovery": true } }, "dataSchema": { "dataSource": "schemafree", "timestampSpec": { "column": "time", "format": "iso" }, "dimensionsSpec":{} } } }If we send a first batch of events to our topic:
{"time":"2023-01-07T00:00:00Z", "some_long":1234, "some_double":1.23, "some_string":"a", "some_variant":"a"} {"time":"2023-01-07T01:00:00Z", "some_long":5678, "some_double":4.56, "some_string":"b", "some_variant":1} {"time":"2023-01-07T01:10:00Z", "some_long":1111, "some_string":"c", "some_variant":2.2} {"time":"2023-01-07T01:20:00Z", "some_double":11.11, "some_variant":1}useNestedColumnIndexerForSchemaDiscoveryset onappendableIndexSpecof thetuningConfigtells theIncrementalIndexto use aNestedColumnIndexerfor any discovered dimensions instead of aStringDimensionIndexer. The new mimic behavior of nested column selectors then allows queries to see these discovered columns as their correct type:Adding additional events:
{"time":"2023-01-07T00:00:00Z", "other_long": 1111, "other_double":2.22, "other_string": "zz"} {"time":"2023-01-07T00:00:00Z", "other_long": 2222, "other_double":3.33, "other_string": "yy"} {"time":"2023-01-07T00:00:00Z", "other_long": 3333, "other_double":4.44, "other_string": "xx"} {"time":"2023-01-07T00:00:00Z", "other_long": 4444, "other_double":5.55, "other_string": "ww"}these are picked up as well:
The nested column selectors are using the same nested literal column selectors that are used for the nested virtual columns that back SQL functions like
JSON_VALUE, so the performance is approximately the same as if these were regular literal columns, and we can query them as if they were such, grouping and aggregating and so on:Follow-up work
The most important pieces to follow are:
ColumnCapabilitiesfor segment merging/picking column handlers/etc, something likeColumnFormatorColumnShape. For now I have added the concept of 'handler' capabilities as a crutch to allow merging to choose the nested column merger even though the column capabilities reports as aSTRINGorLONGor whatever with nested columns, but going forward i think something nicer can be built, more on this laterThis PR has: