Add support for a custom DimensionSchema in DataSourceMSQDestination #16864

LakshSingla merged 9 commits into apache:master

Conversation
```diff
 public MSQSpec(
     @JsonProperty("query") Query<?> query,
-    @JsonProperty("columnMappings") @Nullable ColumnMappings columnMappings,
+    @JsonProperty("columnMappings") ColumnMappings columnMappings,
```
Is there a reason @Nullable was removed?

There is a not-null check below.
```diff
     null,
-    ImmutableList.of(replaceInterval)
+    ImmutableList.of(replaceInterval),
+    dataSchema.getDimensionsSpec()
```
What about cases where the dimension schema is not present in the compaction spec? Would those dimensions still be present in this schema?

Yes, in those cases the segment schemas of the compaction candidates are analyzed and a coerced schema is used.
```diff
     type,
-    query.context()
+    query.context(),
+    dimensionToSchemaMap
```
Where are these getting piped to the segment generator factory?

The schema returned from this function call is used as the input to the factory.
```java
if (dimensionToSchemaMap != null && dimensionToSchemaMap.containsKey(outputColumnName)) {
  return dimensionToSchemaMap.get(outputColumnName);
}
// For aggregators moved to dimensions, we won't have an entry in the map. For those cases, use the default config.
```
This also happens for regular ingestions which are not compaction, right? The comment makes it seem like it only applies to aggregators.

Yes, thanks for pointing that out. Updated the comment.
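The exchange above describes a simple lookup-with-fallback. A self-contained sketch of that shape, using plain strings in place of Druid's actual `DimensionSchema` classes (all names here are illustrative, not the PR's real code):

```java
import java.util.Map;

// Sketch of the lookup behavior discussed above: use the caller-supplied
// schema when one exists for the output column, otherwise fall back to the
// default schema for the column type. Names are stand-ins, not Druid's API.
public class SchemaLookupSketch
{
  public static String resolveSchema(
      Map<String, String> dimensionToSchemaMap,
      String outputColumnName,
      String defaultSchema
  )
  {
    if (dimensionToSchemaMap != null && dimensionToSchemaMap.containsKey(outputColumnName)) {
      return dimensionToSchemaMap.get(outputColumnName);
    }
    // No entry: e.g. a regular (non-compaction) ingestion, or a metric that
    // was moved to dimensions during compaction. Use the default config.
    return defaultSchema;
  }
}
```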
```java
}

@Test(dataProvider = "engine")
public void testAutoCompactionPreservesCreateBitmapIndexInDimensionSchema(CompactionEngine engine) throws Exception
```
It would be nice to add a test that uses AutoTypeColumnSchema for, say, a long column, to ensure that it is recreated with AutoTypeColumnSchema instead of LongDimensionSchema (or similarly with a double): 'auto' longs have indexes while classic longs do not.

Added a long column with AutoTypeColumnSchema to the dimensions in the same test.
clintropolis left a comment:

This seems like it would solve the main problem I was worried about. I don't have super strong opinions on the API changes, though, so I would feel better if someone else also +1'd this.
```java
if (dimensionToSchemaMap != null && dimensionToSchemaMap.containsKey(outputColumnName)) {
  return dimensionToSchemaMap.get(outputColumnName);
}
// For regular ingestion, or for metrics moved to dimensions in case of compaction, we won't have an entry in the
```
nit: can clarify "metrics moved to dimensions" a bit

Suggested change:

```diff
-// For regular ingestion, or for metrics moved to dimensions in case of compaction, we won't have an entry in the
+// For ingestion or when metrics are converted to dimensions when compaction is performed without rollup (finalize: false), we won't have an entry in the
```

Updated. Thanks!
```java
context.put(QueryContexts.FINALIZE_KEY, false);
// Only scalar or array-type dimensions are allowed as grouping keys.
context.putIfAbsent(GroupByQueryConfig.CTX_KEY_ENABLE_MULTI_VALUE_UNNESTING, false);
context.putIfAbsent(MultiStageQueryContext.CTX_ARRAY_INGEST_MODE, "array");
```
Should we even allow ARRAY_INGEST_MODE to be "mvd"? That is, in addition to this, do we need a check flagging queries with CTX_ARRAY_INGEST_MODE = 'mvd'?

If a user explicitly sets this flag for compaction, it would be a conscious choice. The same goes for the flags above, which may not ideally make sense for compaction but could be set for some reason (e.g. finalize), and would still produce valid output. We already have a warning log for array_ingest_mode = mvd in the controller, so I'm not adding another one here.

finalize still makes some sense; MVD mode with compaction, however, is a sure-shot way to shoot oneself in the foot. It still exists for ingestion for historical reasons: we originally only ingested MVDs, and it is difficult to change queries once the data has been ingested as an MVD. For compaction, it makes little sense to convert a string array to an MVD.

Anyway, we don't need to block this patch on this discussion and can take it up in a follow-up if need be.
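The distinction driving this thread is between `put` (always overrides) and `putIfAbsent` (supplies a default only), which is why a user-specified `arrayIngestMode` such as "mvd" survives. A self-contained sketch, with plain string keys standing in for Druid's context-key constants:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the compaction context setup shown above. put() forces a value,
// while putIfAbsent() only fills in a default, so an explicit user setting
// such as arrayIngestMode = "mvd" is left in place. Keys are illustrative.
public class CompactionContextSketch
{
  public static Map<String, Object> withCompactionDefaults(Map<String, Object> userContext)
  {
    Map<String, Object> context = new HashMap<>(userContext);
    context.put("finalize", false);                                 // always overridden
    context.putIfAbsent("groupByEnableMultiValueUnnesting", false); // default only
    context.putIfAbsent("arrayIngestMode", "array");                // default only
    return context;
  }
}
```

(As noted in the follow-up discussion, a later patch changes `arrayIngestMode` to always be overridden for MSQ compaction.)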
```java
.withPrefabValues(
    Map.class,
    ImmutableMap.of(
        "language",
        new StringDimensionSchema(
            "language",
            DimensionSchema.MultiValueHandling.SORTED_ARRAY,
            false
        )
    ),
    ImmutableMap.of(
        "region",
        new StringDimensionSchema(
            "region",
            DimensionSchema.MultiValueHandling.SORTED_ARRAY,
            false
        )
    )
)
```
Based on my limited understanding, prefab values are only required if a class is self-referential. DataSourceMSQDestination doesn't have a reference to DataSourceMSQDestination, so I don't think we'd need this.

This is required for the Map field; EqualsVerifier otherwise throws a "Role's equals method delegates to an abstract method" error.
LakshSingla left a comment:

LGTM. I wonder if dimensionToSchemaMap would look more succinct renamed to dimensionSchemas.

@LakshSingla: Will leave this PR as-is to avoid further delay, and do both

Makes sense. I'll merge this one; can you please raise a separate patch for the rename? A standalone patch for the name change would make sure we remember to treat both changes as one; mixing the rename with other changes would make it difficult to track, and we might end up with a version that doesn't have the rename.
A follow-up PR for apache#16864 (apache#16909) renames dimensionToSchemaMap to dimensionSchemas and always overrides the ARRAY_INGEST_MODE context value to array for MSQ compaction.
Description

In the native engine, a user can specify values for the `createBitmapIndex` and `multiValueHandling` properties of `string` dimensions to override the defaults. In MSQ, however, there is currently no provision to pass these details for strings; the default values for the type are always used instead.

This PR adds support for passing a custom `DimensionSchema` map to the MSQ query destination of type `DataSourceMSQDestination`. The current consumer of this functionality is a compaction job, which needs to preserve the schema of the segments being compacted. With this change, compaction tasks can retain the `createBitmapIndex` property of the input segments, but not the `multiValueHandling` property, since that info is not persisted anywhere in the segment.

Main classes to review:
- `DataSourceMSQDestination`
- `ControllerImpl`
- `MSQCompactionRunner`

This PR has:
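To make the intent concrete, here is a self-contained sketch of the per-dimension schema map the destination now carries. The record below is a hypothetical stand-in for Druid's `StringDimensionSchema`, reduced to the one property this PR can preserve:

```java
import java.util.Map;

// Hypothetical sketch of the feature's intent: compaction derives a
// dimension-name -> schema map from the input segments so that each string
// dimension's createBitmapIndex setting is preserved, and hands that map to
// the MSQ destination. StringDimension is a stand-in, not the real class.
public class DimensionSchemasSketch
{
  public record StringDimension(String name, boolean createBitmapIndex) {}

  // In real compaction these settings would be read from the segments being
  // compacted rather than hard-coded.
  public static Map<String, StringDimension> schemasFromSegments()
  {
    return Map.of(
        "language", new StringDimension("language", false),
        "region", new StringDimension("region", true)
    );
  }
}
```

Note that `multiValueHandling` is deliberately absent from the sketch: as the description says, that setting is not persisted in the segment, so compaction cannot recover it.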