sampler + type detection = bff #13711
Merged
clintropolis merged 7 commits into apache:master on Feb 28, 2023
Conversation
Contributor
Nice. Are you also going to change the docs?
Contributor
I had expected the first set of output to have one sub-section that is the specific detected type (maybe the "segmentSchema" chunk?) and another output that was the actual physical "indexer type" (maybe the "dimension" chunk?). For something that is fully running with the auto-detection, I would expect the physical "indexer type" one to just be json typed all the way down, as that is what was used to ingest it.
08b844b to ce489bf
Member
Author
I was holding off on docs since I'm not quite sure this has solidified yet, but looking closer I'm not sure we have any docs on the sampler API at all, so I think I'd definitely rather not do it in this PR.
imply-cheddar approved these changes on Feb 28, 2023
Description
This PR improves the response from /druid/indexer/v1/sampler to now include dimensions and segmentSchema, which are a list of dimension schemas and a RowSignature for the set of rows sampled.
Building on top of #13653, when the sampler spec included with
it allows an application (such as the web-console data loader) to get typing information about the sampled data in schemaless mode.
For example, given a Kafka stream
the new sampler response contains something like:
{
  "numRowsRead": 4,
  "numRowsIndexed": 4,
  "logicalDimensions": [
    {"type":"string","name":"time","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
    {"type":"long","name":"some_long","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":false},
    {"type":"double","name":"some_double","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":false},
    {"type":"string","name":"some_string","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
    {"type":"json","name":"some_variant","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
    {"type":"json","name":"some_nested","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true}
  ],
  "physicalDimensions": [
    {"type":"json","name":"time","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
    {"type":"json","name":"some_long","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
    {"type":"json","name":"some_double","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
    {"type":"json","name":"some_string","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
    {"type":"json","name":"some_variant","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
    {"type":"json","name":"some_nested","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true}
  ],
  "logicalSegmentSchema": [
    {"name":"__time","type":"LONG"},
    {"name":"time","type":"STRING"},
    {"name":"some_long","type":"LONG"},
    {"name":"some_double","type":"DOUBLE"},
    {"name":"some_string","type":"STRING"},
    {"name":"some_variant","type":"COMPLEX<json>"},
    {"name":"some_nested","type":"COMPLEX<json>"}
  ],
  "data": [...]
}
which evolves as we proceed through data loader steps:
{
  "numRowsRead": 4,
  "numRowsIndexed": 4,
  "logicalDimensions": [
    {"type":"long","name":"some_long","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":false},
    {"type":"double","name":"some_double","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":false},
    {"type":"string","name":"some_string","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
    {"type":"json","name":"some_variant","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
    {"type":"json","name":"some_nested","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true}
  ],
  "physicalDimensions": [
    {"type":"json","name":"some_long","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
    {"type":"json","name":"some_double","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
    {"type":"json","name":"some_string","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
    {"type":"json","name":"some_variant","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
    {"type":"json","name":"some_nested","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true}
  ],
  "logicalSegmentSchema": [
    {"name":"__time","type":"LONG"},
    {"name":"some_long","type":"LONG"},
    {"name":"some_double","type":"DOUBLE"},
    {"name":"some_string","type":"STRING"},
    {"name":"some_variant","type":"COMPLEX<json>"},
    {"name":"some_nested","type":"COMPLEX<json>"}
  ],
  "data": [...]
}
After this PR we can update the web-console to take advantage of this new information, which would help simplify some things.
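As a rough illustration of how a client might consume these fields, the sketch below lines up the logical type, the physical type, and the segment column type for each column. It is written in Python for brevity and is not the web-console's code. The helper name summarize_types and the trimmed SAMPLE_RESPONSE are assumptions made for this example. Only the field names logicalDimensions, physicalDimensions and logicalSegmentSchema come from the responses above.

```python
import json

# Trimmed copy of the first example response above. Only "type" and "name"
# are kept for each dimension, and the "data" rows are omitted.
SAMPLE_RESPONSE = json.loads("""
{
  "numRowsRead": 4,
  "numRowsIndexed": 4,
  "logicalDimensions": [
    {"type": "string", "name": "time"},
    {"type": "long", "name": "some_long"},
    {"type": "json", "name": "some_nested"}
  ],
  "physicalDimensions": [
    {"type": "json", "name": "time"},
    {"type": "json", "name": "some_long"},
    {"type": "json", "name": "some_nested"}
  ],
  "logicalSegmentSchema": [
    {"name": "__time", "type": "LONG"},
    {"name": "time", "type": "STRING"},
    {"name": "some_long", "type": "LONG"},
    {"name": "some_nested", "type": "COMPLEX<json>"}
  ]
}
""")


def summarize_types(response):
    """Return {column: (logical_type, physical_type, segment_type)}.

    Hypothetical helper for illustration; it is not part of Druid.
    Columns that appear only in the segment schema (such as __time)
    get None for the two dimension types.
    """
    logical = {d["name"]: d["type"] for d in response.get("logicalDimensions", [])}
    physical = {d["name"]: d["type"] for d in response.get("physicalDimensions", [])}
    summary = {}
    for column in response.get("logicalSegmentSchema", []):
        name = column["name"]
        summary[name] = (logical.get(name), physical.get(name), column["type"])
    return summary


if __name__ == "__main__":
    for name, types in summarize_types(SAMPLE_RESPONSE).items():
        print(name, types)
```

Keying the summary on logicalSegmentSchema keeps __time in the result even though it is not listed as a dimension.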
Release note
/druid/indexer/v1/sampler has been improved to now include logicalDimensions, physicalDimensions and logicalSegmentSchema, which are, respectively, a list of the most restrictive typed dimension schemas, the list of dimension schemas actually used to sample the data, and the full resulting segment schema for the set of rows sampled.
This PR has: