sampler + type detection = bff #13711

Merged
clintropolis merged 7 commits into apache:master from clintropolis:add-sampler-type-info
Feb 28, 2023
@clintropolis
Member

@clintropolis clintropolis commented Jan 26, 2023

Description

This PR extends the response from /druid/indexer/v1/sampler to include dimensions and segmentSchema, which are, respectively, a list of dimension schemas and a RowSignature for the set of rows sampled.

Building on top of #13653, when the sampler spec includes

...
    "dimensionsSpec": {
        "dimensions": [],
        "useSchemaDiscovery": true
...
      },
...

it allows an application (such as the web-console data loader) to get typing information about the sampled data in schemaless mode.

For example, given a Kafka stream

[Image: Screenshot 2023-01-25 at 6 16 18 PM]

the new sampler response contains something like:

{
    "numRowsRead":4,
    "numRowsIndexed":4,
    "logicalDimensions":[
        {"type":"string","name":"time","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
        {"type":"long","name":"some_long","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":false},
        {"type":"double","name":"some_double","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":false},
        {"type":"string","name":"some_string","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
        {"type":"json","name":"some_variant","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
        {"type":"json","name":"some_nested","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true}
    ],
    "physicalDimensions":[
        {"type":"json","name":"time","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
        {"type":"json","name":"some_long","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
        {"type":"json","name":"some_double","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
        {"type":"json","name":"some_string","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
        {"type":"json","name":"some_variant","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
        {"type":"json","name":"some_nested","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true}
    ],
    "logicalSegmentSchema":[
        {"name":"__time","type":"LONG"},
        {"name":"time","type":"STRING"},
        {"name":"some_long","type":"LONG"},
        {"name":"some_double","type":"DOUBLE"},
        {"name":"some_string","type":"STRING"},
        {"name":"some_variant","type":"COMPLEX<json>"},
        {"name":"some_nested","type":"COMPLEX<json>"}
    ],
    "data":[...]
}
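
As a rough illustration of how a client might read this response, the sketch below compares logicalDimensions with physicalDimensions and reports the columns whose detected type is narrower than the type used while sampling. The helper names (dimension_types, narrowed_dimensions) are hypothetical and not part of Druid; the sample is a trimmed copy of the response above, with per-dimension options and the elided "data" field left out.

```python
import json

# Trimmed copy of the sample response above. "data" is omitted because the
# original elides it; only the fields used below are kept.
SAMPLE_RESPONSE = json.loads("""
{
  "numRowsRead": 4,
  "numRowsIndexed": 4,
  "logicalDimensions": [
    {"type": "string", "name": "time"},
    {"type": "long", "name": "some_long"},
    {"type": "json", "name": "some_nested"}
  ],
  "physicalDimensions": [
    {"type": "json", "name": "time"},
    {"type": "json", "name": "some_long"},
    {"type": "json", "name": "some_nested"}
  ]
}
""")


def dimension_types(dimensions):
    """Map dimension name -> schema type for a list of dimension schemas."""
    return {dim["name"]: dim["type"] for dim in dimensions}


def narrowed_dimensions(response):
    """Return {name: (physical_type, logical_type)} where the two types differ.

    Hypothetical helper: it assumes both lists describe the same columns,
    as they do in the sample above.
    """
    logical = dimension_types(response["logicalDimensions"])
    physical = dimension_types(response["physicalDimensions"])
    return {
        name: (physical[name], logical_type)
        for name, logical_type in logical.items()
        if name in physical and physical[name] != logical_type
    }


print(narrowed_dimensions(SAMPLE_RESPONSE))
# {'time': ('json', 'string'), 'some_long': ('json', 'long')}
```

Columns such as some_nested, where both lists report json, are left out of the result.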

which evolves as we proceed through data loader steps:

[Image: Screenshot 2023-01-25 at 6 17 57 PM]

{
    "numRowsRead":4,
    "numRowsIndexed":4,
    "logicalDimensions":[
        {"type":"long","name":"some_long","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":false},
        {"type":"double","name":"some_double","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":false},
        {"type":"string","name":"some_string","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
        {"type":"json","name":"some_variant","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
        {"type":"json","name":"some_nested","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true}
    ],
    "physicalDimensions":[
        {"type":"json","name":"some_long","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
        {"type":"json","name":"some_double","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
        {"type":"json","name":"some_string","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
        {"type":"json","name":"some_variant","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true},
        {"type":"json","name":"some_nested","multiValueHandling":"SORTED_ARRAY","createBitmapIndex":true}
    ],
    "logicalSegmentSchema":[
        {"name":"__time","type":"LONG"},
        {"name":"some_long","type":"LONG"},
        {"name":"some_double","type":"DOUBLE"},
        {"name":"some_string","type":"STRING"},
        {"name":"some_variant","type":"COMPLEX<json>"},
        {"name":"some_nested","type":"COMPLEX<json>"}
    ],
    "data":[...]
}
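
To make the change between the two responses concrete, here is a minimal sketch that diffs their logicalSegmentSchema lists. The function names (signature, schema_diff) are made up for this example and are not Druid APIs; the two lists are copied from the responses above.

```python
# logicalSegmentSchema from the first and second sample responses above.
BEFORE = [
    {"name": "__time", "type": "LONG"},
    {"name": "time", "type": "STRING"},
    {"name": "some_long", "type": "LONG"},
    {"name": "some_double", "type": "DOUBLE"},
    {"name": "some_string", "type": "STRING"},
    {"name": "some_variant", "type": "COMPLEX<json>"},
    {"name": "some_nested", "type": "COMPLEX<json>"},
]
AFTER = [
    {"name": "__time", "type": "LONG"},
    {"name": "some_long", "type": "LONG"},
    {"name": "some_double", "type": "DOUBLE"},
    {"name": "some_string", "type": "STRING"},
    {"name": "some_variant", "type": "COMPLEX<json>"},
    {"name": "some_nested", "type": "COMPLEX<json>"},
]


def signature(segment_schema):
    """Map column name -> type for a logicalSegmentSchema list."""
    return {col["name"]: col["type"] for col in segment_schema}


def schema_diff(before, after):
    """Summarize columns added, removed, or retyped between two schemas."""
    old, new = signature(before), signature(after)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "retyped": sorted(
            name for name in old.keys() & new.keys() if old[name] != new[name]
        ),
    }


print(schema_diff(BEFORE, AFTER))
# {'added': [], 'removed': ['time'], 'retyped': []}
```

For these two samples the only difference is that the time column no longer appears in the second response.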

After this PR we can update the web-console to take advantage of this new information, which would help simplify some things.

Release note

/druid/indexer/v1/sampler has been improved to include logicalDimensions, physicalDimensions, and logicalSegmentSchema, which are, respectively, a list of the most restrictive typed dimension schemas, the list of dimension schemas actually used to sample the data, and the full resulting segment schema for the set of rows sampled.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@abhishekagarwal87
Contributor

Nice. are you also going to change docs?

@imply-cheddar
Contributor

I had expected the first set of output to have one sub-section that is the specific detected type (maybe the "segmentSchema" chunk?) and another output that was the actual physical "indexer type" (maybe the "dimension" chunk?). For something that is fully running with the auto-detection, I would expect the physical "indexer type" one to just be json typed all the way down as that is what was used to ingest it.

@clintropolis
Member Author

Nice. are you also going to change docs?

I was holding off on docs since I'm not quite sure this has solidified yet, but looking closer I'm not sure we have any docs on the sampler API at all, so I think I'd definitely rather not do it in this PR.

@clintropolis clintropolis merged commit 1d8fff4 into apache:master Feb 28, 2023
@clintropolis clintropolis deleted the add-sampler-type-info branch February 28, 2023 12:14
@clintropolis clintropolis added this to the 26.0 milestone Apr 10, 2023