Add support for a custom DimensionSchema in DataSourceMSQDestination #16864

Merged
LakshSingla merged 9 commits into apache:master from gargvishesh:add-colformat-spec-to-msq on Aug 16, 2024

Conversation

@gargvishesh (Contributor) commented Aug 8, 2024

Description

In the native engine, a user can specify values for createBitmapIndex and multiValueHandling properties for string dimensions to override the defaults. In MSQ, however, there is currently no provision to pass these details for strings; the default values for the type are always used instead.

This PR adds support for passing a custom DimensionSchema map to an MSQ query destination of type DataSourceMSQDestination. The current consumer of this functionality is a compaction job, which needs to preserve the schema of the segments being compacted. With this change, compaction tasks can retain the createBitmapIndex property of the input segments, but not the multiValueHandling property, since that information is not persisted anywhere in the segment.

Main classes to review:

  • DataSourceMSQDestination
  • ControllerImpl
  • MSQCompactionRunner
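For illustration, a destination carrying such a map might be serialized as below. This is a hedged sketch: the dimensionToSchemaMap field name is taken from the constructor parameter discussed in this PR, while the "dataSource" type key, the example datasource name, and the interval are assumed values, not quoted from documentation.

```json
{
  "type": "dataSource",
  "dataSource": "wikipedia",
  "segmentGranularity": "DAY",
  "replaceTimeChunks": ["2024-01-01/2024-01-02"],
  "dimensionToSchemaMap": {
    "language": {
      "type": "string",
      "name": "language",
      "multiValueHandling": "SORTED_ARRAY",
      "createBitmapIndex": false
    }
  }
}
```

Each map entry keys a dimension name to the full schema the generated segments should use, overriding the type defaults.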

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions Bot added the Area - Batch Ingestion, Area - Querying, and Area - MSQ (for multi stage queries - https://github.com/apache/druid/issues/12262) labels on Aug 8, 2024
  public MSQSpec(
      @JsonProperty("query") Query<?> query,
-     @JsonProperty("columnMappings") @Nullable ColumnMappings columnMappings,
+     @JsonProperty("columnMappings") ColumnMappings columnMappings,
Contributor:

Is there a reason @Nullable is removed?

Contributor Author:

There is a not-null check below.

      null,
-     ImmutableList.of(replaceInterval)
+     ImmutableList.of(replaceInterval),
+     dataSchema.getDimensionsSpec()
Contributor:

What about cases where the dimension schema is not present in the compaction spec? Would those dimensions be present in this schema?

Contributor Author:

Yes, in those cases the segment schemas of compaction candidates are analyzed and a coerced schema is used.

      type,
-     query.context()
+     query.context(),
+     dimensionToSchemaMap
Contributor:

Where are they getting piped to the segment generator factory?

Contributor Author:

The schema returned from this function call is used as the input to the factory.

@gargvishesh gargvishesh requested a review from cryptoe August 9, 2024 10:53
@gargvishesh gargvishesh marked this pull request as ready for review August 12, 2024 14:03
if (dimensionToSchemaMap != null && dimensionToSchemaMap.containsKey(outputColumnName)) {
return dimensionToSchemaMap.get(outputColumnName);
}
// For aggregators moved to dimensions, we won't have an entry in the map. For those cases, use the default config.
Member:

This also happens for regular ingestions which are not compaction, right? The comment makes it seem like it only applies to aggs.

Contributor Author:

Yes, thanks for pointing out. Updated the comment.
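The fallback being discussed can be sketched with a dependency-free, illustrative snippet. This is not Druid code: the String stand-in for DimensionSchema and the defaultSchemaFor helper are hypothetical names used only to show the lookup-then-default pattern quoted above.

```java
import java.util.HashMap;
import java.util.Map;

public class SchemaLookup
{
  // Use the caller-supplied schema when present; otherwise fall back to
  // the type's default. A plain String stands in for DimensionSchema here.
  static String getDimensionSchema(Map<String, String> dimensionToSchemaMap, String outputColumnName)
  {
    if (dimensionToSchemaMap != null && dimensionToSchemaMap.containsKey(outputColumnName)) {
      return dimensionToSchemaMap.get(outputColumnName);
    }
    // No entry exists for regular ingestion, or for a metric converted to a
    // dimension during non-rollup compaction: use the default config.
    return defaultSchemaFor(outputColumnName);
  }

  // Hypothetical default: string dimensions get bitmap indexes by default.
  static String defaultSchemaFor(String columnName)
  {
    return "string(createBitmapIndex=true)";
  }

  public static void main(String[] args)
  {
    Map<String, String> overrides = new HashMap<>();
    overrides.put("language", "string(createBitmapIndex=false)");
    System.out.println(getDimensionSchema(overrides, "language"));
    System.out.println(getDimensionSchema(overrides, "sum_added"));
  }
}
```

Only the keys explicitly present in the map are overridden; every other dimension behaves exactly as before the change.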

}

@Test(dataProvider = "engine")
public void testAutoCompactionPreservesCreateBitmapIndexInDimensionSchema(CompactionEngine engine) throws Exception
Member:

Would be nice to add a test that uses AutoTypeColumnSchema for, say, a long column to ensure that it is recreated with AutoTypeColumnSchema instead of LongDimensionSchema (or similar with double), since 'auto' longs have indexes while classic longs do not.

Contributor Author:

Added a long column to dimensions with AutoTypeColumnSchema in the same test.

@clintropolis (Member) left a comment:

This seems like it would solve the main problem I was worried about, though I don't have super strong opinions on the API changes, so I would feel better if someone else also +1s this.

if (dimensionToSchemaMap != null && dimensionToSchemaMap.containsKey(outputColumnName)) {
return dimensionToSchemaMap.get(outputColumnName);
}
// For regular ingestion, or for metrics moved to dimensions in case of compaction, we won't have an entry in the
Contributor:

nit: can clarify "metrics moved to dimensions" a bit

Suggested change
- // For regular ingestion, or for metrics moved to dimensions in case of compaction, we won't have an entry in the
+ // For ingestion or when metrics are converted to dimensions when compaction is performed without rollup (finalize: false), we won't have an entry in the

Contributor Author:

Updated. Thanks!

context.put(QueryContexts.FINALIZE_KEY, false);
// Only scalar or array-type dimensions are allowed as grouping keys.
context.putIfAbsent(GroupByQueryConfig.CTX_KEY_ENABLE_MULTI_VALUE_UNNESTING, false);
context.putIfAbsent(MultiStageQueryContext.CTX_ARRAY_INGEST_MODE, "array");
Contributor:

Should we even allow ARRAY_INGEST_MODE as "mvd", i.e. in addition to this, do we need to have a check flagging the queries with CTX_ARRAY_INGEST_MODE = 'mvd'

Contributor Author:

If a user explicitly sets this flag for compaction, it would be a conscious choice. It's the same for all the flags above which may not ideally make sense for compaction but could be set for some reason (e.g. finalize), and would still end up in some valid output.
We do already have a warn log for array_ingest_mode = mvd in the controller, so not adding another one here.

@LakshSingla (Contributor) commented Aug 16, 2024:

Finalize still makes some sense; however, MVD mode with compaction is a sure-shot way to shoot oneself in the foot. It still exists for ingestion for historical reasons: originally we only ingested MVDs, and it is difficult to change queries once the data has been ingested as an MVD. For compaction, it makes little sense to modify a string array to an MVD.

Anyway, we don't need to block this patch for this discussion and can take it in a follow-up if need be.
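The distinction in the snippet above — put() for flags compaction must own versus putIfAbsent() for defaults a user may override — can be shown with a plain-JDK sketch. This is illustrative only: it uses raw string keys and a HashMap rather than Druid's context classes, and withCompactionDefaults is a hypothetical helper name.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextDefaults
{
  // Apply compaction's context defaults to a user-supplied query context.
  public static Map<String, Object> withCompactionDefaults(Map<String, Object> userContext)
  {
    Map<String, Object> context = new HashMap<>(userContext);
    // put(): unconditionally overrides whatever the user set.
    context.put("finalize", false);
    // putIfAbsent(): only fills in a default, so an explicit user value
    // such as arrayIngestMode = "mvd" survives.
    context.putIfAbsent("arrayIngestMode", "array");
    return context;
  }

  public static void main(String[] args)
  {
    Map<String, Object> user = new HashMap<>();
    user.put("arrayIngestMode", "mvd");
    System.out.println(withCompactionDefaults(user));
  }
}
```

Switching the arrayIngestMode line from putIfAbsent() to put() is exactly the "forceful override" behavior that the follow-up PR mentioned later in this thread adopted.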

Comment on lines +39 to +56
.withPrefabValues(
Map.class,
ImmutableMap.of(
"language",
new StringDimensionSchema(
"language",
DimensionSchema.MultiValueHandling.SORTED_ARRAY,
false
)
),
ImmutableMap.of(
"region",
new StringDimensionSchema(
"region",
DimensionSchema.MultiValueHandling.SORTED_ARRAY,
false
)
)
Contributor:

Based on my limited understanding, it's only required if a class is self-referential. DataSourceMSQDestination doesn't have a reference to DataSourceMSQDestination, so I don't think we'd need this.

Contributor Author:

This is required for Map, which otherwise throws a "Role's equals method delegates to an abstract method" error.

@LakshSingla (Contributor) left a comment:

LGTM. I wonder if dimensionToSchemaMap would look more succinct if renamed to dimensionSchemas.

@gargvishesh (Contributor Author) commented Aug 16, 2024:

@LakshSingla: Will leave this PR as-is to avoid further delay, and do both the dimensionSchemas rename and the forceful array_ingest_mode: array override in the next PR I'm already working on.

@LakshSingla (Contributor):

Makes sense. I'll merge this one. Can you please raise a separate patch for changing the name to dimensionSchemas? If this one makes it into a release without the other patch, there'd be compatibility issues between versions.

Having a standalone patch for the name change would make sure we remember to treat the two changes as one. Bundling the rename with other changes would make it difficult to track, and we might end up with a version that doesn't have the rename.

@LakshSingla LakshSingla merged commit e37fe93 into apache:master Aug 16, 2024
LakshSingla pushed a commit that referenced this pull request Aug 20, 2024
…GEST_MODE to array (#16909)

A follow-up PR for #16864. Just renames dimensionToSchemaMap to dimensionSchemas and always overrides ARRAY_INGEST_MODE context value to array for MSQ compaction.
hevansDev pushed a commit to hevansDev/druid that referenced this pull request Aug 29, 2024
…GEST_MODE to array (apache#16909)

A follow-up PR for apache#16864. Just renames dimensionToSchemaMap to dimensionSchemas and always overrides ARRAY_INGEST_MODE context value to array for MSQ compaction.
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024

5 participants