Add support for a custom DimensionSchema in DataSourceMSQDestination #16864

Merged
LakshSingla merged 9 commits into apache:master from gargvishesh:add-colformat-spec-to-msq on Aug 16, 2024

Conversation

@gargvishesh (Contributor) commented Aug 8, 2024

Description

In the native engine, a user can specify values for createBitmapIndex and multiValueHandling properties for string dimensions to override the defaults. In MSQ, however, there is currently no provision to pass these details for strings; the default values for the type are always used instead.

This PR adds support for passing a custom DimensionSchema map to an MSQ query destination of type DataSourceMSQDestination. The current consumer of this functionality is a compaction job, which needs to preserve the schema of the segments being compacted. With this change, compaction tasks can retain the createBitmapIndex property of the input segments, but not the multiValueHandling property, since that information is not persisted anywhere in the segment.

Main classes to review:

  • DataSourceMSQDestination
  • ControllerImpl
  • MSQCompactionRunner
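For illustration, a destination carrying such a map might be serialized as below. This is a hedged sketch: the dimensionToSchemaMap field name is taken from the constructor parameter discussed in this PR, while the "dataSource" type key, the example datasource name, and the interval are assumed values, not quoted from documentation.

```json
{
  "type": "dataSource",
  "dataSource": "wikipedia",
  "segmentGranularity": "DAY",
  "replaceTimeChunks": ["2024-01-01/2024-01-02"],
  "dimensionToSchemaMap": {
    "language": {
      "type": "string",
      "name": "language",
      "multiValueHandling": "SORTED_ARRAY",
      "createBitmapIndex": false
    }
  }
}
```

Each map entry keys a dimension name to the full schema the generated segments should use, overriding the type defaults.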

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions Bot added the Area - Batch Ingestion, Area - Querying, and Area - MSQ (for multi stage queries - https://github.com/apache/druid/issues/12262) labels on Aug 8, 2024
  public MSQSpec(
      @JsonProperty("query") Query<?> query,
-     @JsonProperty("columnMappings") @Nullable ColumnMappings columnMappings,
+     @JsonProperty("columnMappings") ColumnMappings columnMappings,
Contributor:

Is there a reason @Nullable is removed?

Contributor Author:

There is a not-null check below.

      null,
-     ImmutableList.of(replaceInterval)
+     ImmutableList.of(replaceInterval),
+     dataSchema.getDimensionsSpec()
Contributor:

What about cases where the dimension schema is not present in the compaction spec? Would those dimensions be present in this schema?

Contributor Author:

Yes, in those cases the segment schemas of compaction candidates are analyzed and a coerced schema is used.

      type,
-     query.context()
+     query.context(),
+     dimensionToSchemaMap
Contributor:

Where are they getting piped to the segment generator factory?

Contributor Author:

The schema returned from this function call is used as the input to the factory.

@gargvishesh gargvishesh requested a review from cryptoe August 9, 2024 10:53
@gargvishesh gargvishesh marked this pull request as ready for review August 12, 2024 14:03
if (dimensionToSchemaMap != null && dimensionToSchemaMap.containsKey(outputColumnName)) {
return dimensionToSchemaMap.get(outputColumnName);
}
// For aggregators moved to dimensions, we won't have an entry in the map. For those cases, use the default config.
Member:

This also happens for regular ingestions which are not compaction, right? The comment makes it seem like it only applies to aggs.

Contributor Author:

Yes, thanks for pointing out. Updated the comment.
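The fallback being discussed can be sketched with a dependency-free, illustrative snippet. This is not Druid code: the String stand-in for DimensionSchema and the defaultSchemaFor helper are hypothetical names used only to show the lookup-then-default pattern quoted above.

```java
import java.util.HashMap;
import java.util.Map;

public class SchemaLookup
{
  // Use the caller-supplied schema when present; otherwise fall back to
  // the type's default. A plain String stands in for DimensionSchema here.
  static String getDimensionSchema(Map<String, String> dimensionToSchemaMap, String outputColumnName)
  {
    if (dimensionToSchemaMap != null && dimensionToSchemaMap.containsKey(outputColumnName)) {
      return dimensionToSchemaMap.get(outputColumnName);
    }
    // No entry exists for regular ingestion, or for a metric converted to a
    // dimension during non-rollup compaction: use the default config.
    return defaultSchemaFor(outputColumnName);
  }

  // Hypothetical default: string dimensions get bitmap indexes by default.
  static String defaultSchemaFor(String columnName)
  {
    return "string(createBitmapIndex=true)";
  }

  public static void main(String[] args)
  {
    Map<String, String> overrides = new HashMap<>();
    overrides.put("language", "string(createBitmapIndex=false)");
    System.out.println(getDimensionSchema(overrides, "language"));
    System.out.println(getDimensionSchema(overrides, "sum_added"));
  }
}
```

Only the keys explicitly present in the map are overridden; every other dimension behaves exactly as before the change.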

}

@Test(dataProvider = "engine")
public void testAutoCompactionPreservesCreateBitmapIndexInDimensionSchema(CompactionEngine engine) throws Exception
Member:

Would be nice to add a test that uses AutoTypeColumnSchema for, say, a long column to ensure that it is recreated with AutoTypeColumnSchema instead of LongDimensionSchema (or similar with double), since 'auto' longs have indexes while classic longs do not.

Contributor Author:

Added a long column to dimensions with AutoTypeColumnSchema in the same test.

@clintropolis (Member) left a comment:

This seems like it would solve the main problem I was worried about, though I don't have super strong opinions on the API changes, so I would feel better if someone else also +1s this.

if (dimensionToSchemaMap != null && dimensionToSchemaMap.containsKey(outputColumnName)) {
return dimensionToSchemaMap.get(outputColumnName);
}
// For regular ingestion, or for metrics moved to dimensions in case of compaction, we won't have an entry in the
Contributor:

nit: can clarify "metrics moved to dimensions" a bit

Suggested change
- // For regular ingestion, or for metrics moved to dimensions in case of compaction, we won't have an entry in the
+ // For ingestion or when metrics are converted to dimensions when compaction is performed without rollup (finalize: false), we won't have an entry in the

Contributor Author:

Updated. Thanks!

context.put(QueryContexts.FINALIZE_KEY, false);
// Only scalar or array-type dimensions are allowed as grouping keys.
context.putIfAbsent(GroupByQueryConfig.CTX_KEY_ENABLE_MULTI_VALUE_UNNESTING, false);
context.putIfAbsent(MultiStageQueryContext.CTX_ARRAY_INGEST_MODE, "array");
Contributor:

Should we even allow ARRAY_INGEST_MODE as "mvd", i.e. in addition to this, do we need to have a check flagging the queries with CTX_ARRAY_INGEST_MODE = 'mvd'

Contributor Author:

If a user explicitly sets this flag for compaction, it would be a conscious choice. It's the same for all the flags above which may not ideally make sense for compaction but could be set for some reason (e.g. finalize), and would still end up in some valid output.
We do already have a warn log for array_ingest_mode = mvd in the controller, so not adding another one here.

@LakshSingla (Contributor) commented Aug 16, 2024:

Finalize still makes some sense; however, MVD mode with compaction is a sure-shot way to shoot oneself in the foot. It still exists for ingestion for historical reasons: originally we only ingested MVDs, and it is difficult to change queries once the data has been ingested as an MVD. For compaction, it makes little sense to modify a string array to an MVD.

Anyway, we don't need to block this patch for this discussion and can take it in a follow-up if need be.
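The distinction in the snippet above — put() for flags compaction must own versus putIfAbsent() for defaults a user may override — can be shown with a plain-JDK sketch. This is illustrative only: it uses raw string keys and a HashMap rather than Druid's context classes, and withCompactionDefaults is a hypothetical helper name.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextDefaults
{
  // Apply compaction's context defaults to a user-supplied query context.
  public static Map<String, Object> withCompactionDefaults(Map<String, Object> userContext)
  {
    Map<String, Object> context = new HashMap<>(userContext);
    // put(): unconditionally overrides whatever the user set.
    context.put("finalize", false);
    // putIfAbsent(): only fills in a default, so an explicit user value
    // such as arrayIngestMode = "mvd" survives.
    context.putIfAbsent("arrayIngestMode", "array");
    return context;
  }

  public static void main(String[] args)
  {
    Map<String, Object> user = new HashMap<>();
    user.put("arrayIngestMode", "mvd");
    System.out.println(withCompactionDefaults(user));
  }
}
```

Switching the arrayIngestMode line from putIfAbsent() to put() is exactly the "forceful override" behavior that the follow-up PR mentioned later in this thread adopted.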

Comment on lines +39 to +56
.withPrefabValues(
Map.class,
ImmutableMap.of(
"language",
new StringDimensionSchema(
"language",
DimensionSchema.MultiValueHandling.SORTED_ARRAY,
false
)
),
ImmutableMap.of(
"region",
new StringDimensionSchema(
"region",
DimensionSchema.MultiValueHandling.SORTED_ARRAY,
false
)
)
Contributor:

Based on my limited understanding, it's only required if a class is self-referential. DataSourceMSQDestination doesn't have a reference to DataSourceMSQDestination, so I don't think we'd need this.

Contributor Author:

This is required for Map, which otherwise throws a "Role's equals method delegates to an abstract method" error.

@LakshSingla (Contributor) left a comment:

LGTM. I wonder if dimensionToSchemaMap would look more succinct if renamed to dimensionSchemas.

@gargvishesh (Contributor Author) commented Aug 16, 2024:

@LakshSingla: Will leave this PR as-is to avoid further delay, and do both the dimensionSchemas rename and the forceful array_ingest_mode: array override in the next PR I'm already working on.

@LakshSingla (Contributor):

Makes sense. I'll merge this one. Can you please raise a separate patch for changing the name to dimensionSchemas? If this one makes it into a release without the other patch, there'd be compatibility issues between versions.

Having a standalone patch for the name change would make sure we remember to treat the two changes as one. Bundling the rename with other changes would make it difficult to track, and we might end up with a version that doesn't have the rename.

@LakshSingla LakshSingla merged commit e37fe93 into apache:master Aug 16, 2024
LakshSingla pushed a commit that referenced this pull request Aug 20, 2024
…GEST_MODE to array (#16909)

A follow-up PR for #16864. Just renames dimensionToSchemaMap to dimensionSchemas and always overrides ARRAY_INGEST_MODE context value to array for MSQ compaction.
hevansDev pushed a commit to hevansDev/druid that referenced this pull request Aug 29, 2024
…GEST_MODE to array (apache#16909)

A follow-up PR for apache#16864. Just renames dimensionToSchemaMap to dimensionSchemas and always overrides ARRAY_INGEST_MODE context value to array for MSQ compaction.
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024

5 participants