Enable rollup on multi-value dimensions for compaction with MSQ engine #16937

kfaraz merged 17 commits into apache:master
Conversation
# Conflicts: # extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/indexing/MSQCompactionRunner.java
```java
public void testCombinedDataSchemaSetsMultiValuedColumnsInfo()
{
  MultiValuedColumnsInfo multiValuedColumnsInfo = MultiValuedColumnsInfo.processed();
  multiValuedColumnsInfo.addMultiValuedColumn("dimA");

  CombinedDataSchema schema = new CombinedDataSchema(
      IdUtilsTest.VALID_ID_CHARS,
      new TimestampSpec("time", "auto", null),
      DimensionsSpec.builder()
                    .setDimensions(
                        DimensionsSpec.getDefaultSchemas(ImmutableList.of("dimA", "dimB", "metric1"))
                    )
      // ... (remaining constructor arguments elided in the diff)
  );
  Assert.assertTrue(schema.getMultiValuedColumnsInfo().isProcessed());
  Assert.assertEquals(ImmutableSet.of("dimA"), schema.getMultiValuedColumnsInfo().getMultiValuedColumns());
}
```

Code scanning / CodeQL notice: Deprecated method or constructor invocation.
| {"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900} | ||
| {"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "stringer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9} | ||
| {"timestamp": "2013-09-01T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143} No newline at end of file | ||
| {"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "tags": ["t5", "t6"], "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900} |
Reviewer: The value of tags always has 2 entries in all the data files. Maybe mix it up a bit, like 0 values, 1 value, or 3 values.

Reviewer: Rather than changing existing data files, I would advise we create new ones (e.g. wikipedia_index_data1_mvd.json). These files are probably used in multiple tests and it would be better to leave those as-is.

Author: Have updated the test dataset to vary the tags length. Modified the existing dataset itself, since only one other test required an update, mainly because most of the tests selectively use columns from the dataset.
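For illustration, a few hypothetical rows (not the actual rows from the updated file) showing what varying the `tags` cardinality across 0, 1, and 3 values could look like:

```json
{"timestamp": "2013-09-01T01:02:33Z", "page": "Gypsy Danger", "tags": [], "added": 57, "deleted": 200, "delta": -143}
{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "tags": ["t1"], "added": 1, "deleted": 10, "delta": -9}
{"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "tags": ["t5", "t6", "t7"], "added": 905, "deleted": 5, "delta": 900}
```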
```diff
 forceTriggerAutoCompaction(5);
 verifyQuery(INDEX_QUERIES_RESOURCE);
-verifySegmentsCompacted(hashedPartitionsSpec, 4);
+verifySegmentsCompacted(hashedPartitionsSpec, 5);
```
Reviewer: Why do we need to change this test?

Author: There is a comment a few lines above. The change is required since numShards is only a (max) hint and the actual number of segments can change based on the data; the toy sketch below illustrates this.
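A toy illustration (plain Java, not Druid's actual partitioning code; all names and values are made up) of why a shard-count hint need not match the final count: rows are hashed into at most `numShards` buckets, and the realized bucket count depends on the data:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NumShardsHintDemo
{
  public static void main(String[] args)
  {
    final int numShards = 4; // the hint: an upper bound, not a guarantee
    final List<String> partitionKeys = List.of("a", "b", "a", "b", "a");

    // Hash each row's partition key into one of numShards buckets.
    final Map<Integer, Integer> rowsPerBucket = new HashMap<>();
    for (String key : partitionKeys) {
      final int bucket = Math.floorMod(key.hashCode(), numShards);
      rowsPerBucket.merge(bucket, 1, Integer::sum);
    }

    // Empty buckets produce no segments, so the actual count depends on the data.
    System.out.println("hint=" + numShards + ", actual=" + rowsPerBucket.size()); // hint=4, actual=2
  }
}
```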
@LakshSingla: I haven't gone through the PR yet, but grouping on String columns can make use of the indices. Converting them to arrays before grouping on them will invalidate those indices and can be a lot slower. I think this might not be as much of a concern in MSQ only, since the frames don't contain indices in the first place, but if this changes in the future, will this modification be actively harmful to query processing in MSQ? Moreover, the conversion roundtrip will inherently be slower because of the mere fact that conversion is happening. Is the overhead calculated somewhere?

Author: @LakshSingla: The change only impacts compaction using the MSQ engine, and we don't have a recourse here since, for MSQ, grouping on MVDs ends up unnesting the dimensions. The conversion overhead should be acceptable since this is only for compaction. For regular query processing post-compaction, the original indexes for each column will be preserved, since with #16864, we pass the dimensionSchema to the compaction job.
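For concreteness, a minimal sketch (assuming Druid's expression-based virtual columns and post-aggregators with the `mv_to_array` / `array_to_mv` functions; the class name, virtual-column name, and constructor shapes are illustrative, not the PR's actual code) of the MVD → Array → MVD roundtrip described above:

```java
import org.apache.druid.math.expr.ExprMacroTable;
import org.apache.druid.query.aggregation.post.ExpressionPostAggregator;
import org.apache.druid.segment.column.ColumnType;
import org.apache.druid.segment.virtual.ExpressionVirtualColumn;

public class MvdRollupConversionSketch
{
  public static void main(String[] args)
  {
    // Before rollup: expose the MVD "tags" as a true ARRAY so that grouping treats
    // ["s1", "s2"] as a single key instead of unnesting it into one row per value.
    final ExpressionVirtualColumn toArray = new ExpressionVirtualColumn(
        "tags_as_array",                  // hypothetical virtual-column name
        "mv_to_array(\"tags\")",
        ColumnType.STRING_ARRAY,
        ExprMacroTable.nil()
    );

    // After rollup: convert the grouped ARRAY back to a multi-value string column so
    // the compacted segment keeps the original column type. (Constructor shapes differ
    // across Druid versions; the four-argument form here is an assumption.)
    final ExpressionPostAggregator toMvd = new ExpressionPostAggregator(
        "tags",
        "array_to_mv(\"tags_as_array\")",
        null,
        ExprMacroTable.nil()
    );

    System.out.println(toArray.getOutputName() + " -> " + toMvd.getName());
  }
}
```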
kfaraz left a comment: Minor suggestions, otherwise looks good.
Merging this PR as the failing test is an unrelated flaky test being fixed in a different PR.
Description
Currently, compaction with the MSQ engine doesn't work for rollup on multi-value dimensions (MVDs), the reason being the default behaviour of grouping on MVD dimensions, which unnests the dimension values; for instance, grouping on `[s1,s2]` with aggregate `a` will result in two rows: `<s1,a>` and `<s2,a>`.

This change enables rollup on MVDs (without unnest) by converting MVDs to Arrays before rollup using virtual columns, and then converting them back to MVDs using post aggregators; the sketch below illustrates the two grouping semantics. If the segment schema is available to the compaction task (when it ends up downloading segments to get existing dimensions/metrics/granularity), it selectively does the MVD-to-Array conversion only for known multi-valued columns; else it conservatively performs this conversion for all `string` columns.
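To make the unnest behaviour concrete, a small self-contained illustration in plain Java (not Druid code; names and values are made up) of the two grouping semantics:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MvdGroupingDemo
{
  public static void main(String[] args)
  {
    // One input row whose MVD value is [s1, s2], with aggregate value a = 10.
    final List<String> mvd = List.of("s1", "s2");
    final long a = 10L;

    // Unnest semantics (the default when grouping on an MVD):
    // each value becomes its own grouping key, yielding <s1,a> and <s2,a>.
    final Map<String, Long> unnested = new TreeMap<>();
    for (String value : mvd) {
      unnested.merge(value, a, Long::sum);
    }
    System.out.println(unnested); // {s1=10, s2=10}

    // Array semantics (what the MVD-to-Array conversion enables):
    // the whole value [s1, s2] is a single grouping key, yielding <[s1,s2],a>,
    // which is what rollup during compaction needs.
    final Map<List<String>, Long> grouped = new HashMap<>();
    grouped.merge(mvd, a, Long::sum);
    System.out.println(grouped); // {[s1, s2]=10}
  }
}
```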
Key changed/added classes in this PR

- `MSQCompactionRunner`
- `CompactionTask`

This PR has: