
Enable rollup on multi-value dimensions for compaction with MSQ engine#16937

Merged
kfaraz merged 17 commits into apache:master from
gargvishesh:msq-compaction-enable-group-on-mvd
Sep 4, 2024

Conversation

@gargvishesh
Contributor

@gargvishesh gargvishesh commented Aug 21, 2024

Description

Currently, compaction with the MSQ engine does not support rollup on multi-value dimensions (MVDs), because the default behaviour of grouping on an MVD is to unnest its values: for instance, grouping on [s1,s2] with aggregate a results in two rows, <s1,a> and <s2,a>.
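To make the unnesting behaviour concrete, here is a minimal, self-contained Java sketch (not Druid code; the method names and map-based grouping are invented for illustration) contrasting grouping that unnests each multi-value row against grouping on the whole multi-value as a single key:

```java
import java.util.*;

public class MvdRollup {
    // Simulates the default GROUP BY behavior on a multi-value dimension:
    // each value inside the MVD becomes its own row before grouping.
    static Map<String, Long> groupByUnnested(List<List<String>> rows) {
        Map<String, Long> counts = new TreeMap<>();
        for (List<String> mvd : rows) {
            for (String value : mvd) {
                counts.merge(value, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Simulates grouping on the whole multi-value as one key (the behavior
    // this PR enables for compaction): no unnesting occurs.
    static Map<List<String>, Long> groupByArray(List<List<String>> rows) {
        Map<List<String>, Long> counts = new LinkedHashMap<>();
        for (List<String> mvd : rows) {
            counts.merge(mvd, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> rows = List.of(
            List.of("s1", "s2"),
            List.of("s1", "s2")
        );
        System.out.println(groupByUnnested(rows)); // {s1=2, s2=2}: two separate key rows
        System.out.println(groupByArray(rows));    // {[s1, s2]=2}: one rolled-up row
    }
}
```

With unnesting, the two identical input rows explode into one output row per value, which breaks rollup semantics; grouping on the whole value preserves them.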

This change enables rollup on MVDs (without unnesting) by converting MVDs to arrays before rollup using virtual columns, and then converting them back to MVDs using post-aggregators. If the segment schema is available to the compaction task (i.e. when it downloads segments to determine the existing dimensions, metrics, and granularity), it performs the MVD-to-array conversion only for the known multi-valued columns; otherwise it conservatively performs the conversion for all string columns.
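The round trip described above can be sketched as follows. Note this is an illustrative simulation, not Druid code: mvToArray and arrayToMv are hypothetical stand-ins for the virtual-column and post-aggregator conversions the compaction query performs.

```java
import java.util.*;

public class MvdRollupRoundTrip {
    // Hypothetical stand-in for the virtual column that converts an MVD
    // to an array so it can serve as a single grouping key.
    static List<String> mvToArray(List<String> mvd) {
        return List.copyOf(mvd);
    }

    // Hypothetical stand-in for the post-aggregator that converts the
    // grouping key back to an MVD after aggregation.
    static List<String> arrayToMv(List<String> array) {
        return new ArrayList<>(array);
    }

    // Rolls up parallel lists of (mvd, metric): the MVD is converted to an
    // array before grouping, summed, then converted back in the output.
    static Map<List<String>, Long> rollup(List<List<String>> mvds, List<Long> metrics) {
        Map<List<String>, Long> grouped = new LinkedHashMap<>();
        for (int i = 0; i < mvds.size(); i++) {
            grouped.merge(mvToArray(mvds.get(i)), metrics.get(i), Long::sum);
        }
        Map<List<String>, Long> out = new LinkedHashMap<>();
        grouped.forEach((key, metric) -> out.put(arrayToMv(key), metric));
        return out;
    }

    public static void main(String[] args) {
        Map<List<String>, Long> result = rollup(
            List.of(List.of("s1", "s2"), List.of("s1", "s2"), List.of("s3")),
            List.of(5L, 7L, 1L)
        );
        // Rows sharing the same MVD roll up into one row; no unnesting occurs.
        System.out.println(result); // {[s1, s2]=12, [s3]=1}
    }
}
```

The same pattern degrades gracefully when the schema is unknown: applying the conversion to a column that is actually single-valued is a no-op for correctness, only a (bounded) performance cost, which is why converting all string columns is a safe fallback.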

Key changed/added classes in this PR
  • MSQCompactionRunner
  • CompactionTask

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions bot added the Area - Batch Ingestion, Area - Ingestion, and Area - MSQ (for multi-stage queries: https://github.com/apache/druid/issues/12262) labels Aug 21, 2024
Comment on lines +689 to +699

public void testCombinedDataSchemaSetsMultiValuedColumnsInfo()
{
  MultiValuedColumnsInfo multiValuedColumnsInfo = MultiValuedColumnsInfo.processed();
  multiValuedColumnsInfo.addMultiValuedColumn("dimA");

  CombinedDataSchema schema = new CombinedDataSchema(
      IdUtilsTest.VALID_ID_CHARS,
      new TimestampSpec("time", "auto", null),
      DimensionsSpec.builder()
                    .setDimensions(
                        DimensionsSpec.getDefaultSchemas(ImmutableList.of("dimA", "dimB", "metric1"))
  );
  Assert.assertTrue(schema.getMultiValuedColumnsInfo().isProcessed());
  Assert.assertEquals(ImmutableSet.of("dimA"), schema.getMultiValuedColumnsInfo().getMultiValuedColumns());
}

Check notice (Code scanning / CodeQL): Deprecated method or constructor invocation. Invoking DataSchema.DataSchema should be avoided because it has been deprecated.

Check notice (Code scanning / CodeQL): Deprecated method or constructor invocation. Invoking DataSchema.getParserMap should be avoided because it has been deprecated.
{"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}
{"timestamp": "2013-08-31T12:41:27Z", "page": "Coyote Tango", "language" : "ja", "user" : "stringer", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"Japan", "region":"Kanto", "city":"Tokyo", "added": 1, "deleted": 10, "delta": -9}
{"timestamp": "2013-09-01T01:02:33Z", "page": "Gypsy Danger", "language" : "en", "user" : "nuclear", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
{"timestamp": "2013-08-31T11:58:39Z", "page": "Crimson Typhoon", "language" : "zh", "tags": ["t5", "t6"], "user" : "triplets", "unpatrolled" : "true", "newPage" : "false", "robot": "true", "anonymous": "false", "namespace":"wikipedia", "continent":"Asia", "country":"China", "region":"Shanxi", "city":"Taiyuan", "added": 905, "deleted": 5, "delta": 900}
Contributor

The value of tags always has 2 entries in all the data files. Maybe mix it up a bit, like 0, 1, or 3 values.

Contributor

Rather than changing existing data files, I would advise we create new ones (e.g. wikipedia_index_data1_mvd.json). These files are probably used in multiple tests and it would be better to leave those as is.

Contributor Author

Have updated the test dataset to vary the tags length. I modified the existing dataset itself since only one other test required an update, mainly because most tests use columns from the dataset selectively.

Comment on lines +687 to +689
  forceTriggerAutoCompaction(5);
  verifyQuery(INDEX_QUERIES_RESOURCE);
- verifySegmentsCompacted(hashedPartitionsSpec, 4);
+ verifySegmentsCompacted(hashedPartitionsSpec, 5);
Contributor
Why do we need to change this test?

Contributor Author

There is a comment a few lines above explaining this. It is required since numShards is a (max) hint and the actual number of shards can change based on the data.

Comment thread server/src/main/java/org/apache/druid/segment/indexing/CombinedDataSchema.java Outdated
@LakshSingla
Contributor

else it conservatively performs this conversion for all string columns.

I haven't gone through the PR yet, but grouping on string columns can make use of the indices. Converting them to arrays before grouping on them will invalidate those indices and can be a lot slower. This might not be as much of a concern in MSQ itself, since frames don't contain indices in the first place, but if that changes in the future, will this modification be actively harmful to query processing in MSQ?

Moreover, the conversion round trip will inherently be slower simply because the conversion is happening at all. Is the overhead measured anywhere?

@gargvishesh
Contributor Author

gargvishesh commented Sep 2, 2024

@LakshSingla: The change only impacts compaction using MSQ engine, and we don't have a recourse here since for MSQ, grouping on MVDs ends up unnesting the dimensions. The conversion overhead should be acceptable since this is only for compaction. For regular query processing post compaction, the original indexes for each column will be preserved, since with #16864, we pass the dimensionSchema to compaction job.

Contributor

@kfaraz kfaraz left a comment

Minor suggestions, otherwise looks good.

Comment thread server/src/test/java/org/apache/druid/segment/indexing/DataSchemaTest.java Outdated
@kfaraz
Contributor

kfaraz commented Sep 4, 2024

Merging this PR as the failing test is an unrelated flaky test being fixed in a different PR.

@kfaraz kfaraz merged commit e28424e into apache:master Sep 4, 2024
edgar2020 pushed a commit to edgar2020/druid that referenced this pull request Sep 5, 2024
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024