Adding new config for disabling group by on multiValue column#12253
abhishekagarwal87 merged 5 commits into apache:master
| query.getDimensions()); | ||
| if (!(query.getContextValue(GroupByQueryConfig.CTX_KEY_ENABLE_MULTI_VALUE_UNNESTING, true)) | ||
| && !allSingleValueDims) { | ||
| throw new ISE( |
We should name the dimension so this error message is more actionable. People are going to need to know which dimension is causing the problem, so they can either remove it, set it to array mode, or process it in some other way.
How about:
- Change hasNoExplodingDimensions to findExplodingDimensions (i.e., return the names of the exploding dimensions).
- Change the error message to:
Encountered multi-value dimensions [%s] that cannot be processed with %s set to false. Consider changing these dimensions to arrays or setting %s to true.
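A minimal sketch of how the proposed message could be filled in once the exploding dimension names are available. The class and method names here are hypothetical, and the context key string is assumed to be the value behind `GroupByQueryConfig.CTX_KEY_ENABLE_MULTI_VALUE_UNNESTING`; this is not the code from the PR.

```java
import java.util.List;

public class ExplodingDimensionMessage {
    // Message template taken from the review suggestion above.
    static final String TEMPLATE =
        "Encountered multi-value dimensions [%s] that cannot be processed with %s set to false. "
        + "Consider changing these dimensions to arrays or setting %s to true.";

    // Hypothetical helper: joins the dimension names and fills in the context key twice.
    static String build(List<String> explodingDimensions, String contextKey) {
        String names = String.join(", ", explodingDimensions);
        return String.format(TEMPLATE, names, contextKey, contextKey);
    }

    public static void main(String[] args) {
        // Assumed key name; the dimension names are made up for illustration.
        System.out.println(build(List.of("tags", "countries"), "groupByEnableMultiValueUnnesting"));
    }
}
```

Naming the dimensions in the message lets the user see immediately which columns to remove or rewrite.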
Acked. Agreed such messaging would be better for the users
| ## Disable GroupBy on multivalue columns | ||
|
|
||
| As grouping on multivalue columns causes implicit unnest, users can avoid this behaviour by setting | ||
| `groupByEnableMultiValueUnnesting` in the query context to `false`. This will result the query to error out. No newline at end of file |
I think we should leave this out of the docs until the array-based story is complete. Then we can document this, array types, array functions, etc.
Alternatively we can keep this documented but change the error message to not mention array-based dimensions. In that case, we can change the error message to this:
Encountered multi-value dimension [%s] that cannot be processed with %s set to false. Consider setting %s to true for unnesting behavior, or using an expression to create a scalar from the multi-value dimension.
For the docs, a couple style points:
- The rest of this page uses second person ("you can…") rather than third ("users can…") so we should stick to that.
- We usually use US spelling in documentation (e.g. behavior instead of behaviour).
So I'd go with:
You can disable the implicit unnesting behavior for groupBy by setting `groupByEnableMultiValueUnnesting: false` in your query context. In this mode, the groupBy engine will return an error instead of completing the query. This is a safety feature for situations where you believe that all dimensions are singly-valued and want the engine to reject any multi-valued dimensions that were inadvertently included.
Also, all documented groupBy parameters should be included in the groupbyquery.md document as well, under "GroupBy v2 configurations". So if you mention this here it should be mentioned in the main doc too.
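As a rough illustration of the context flag semantics discussed here, the sketch below models a query context as a plain map and reads the flag with a default of `true`. The class and the `isUnnestingEnabled` helper are hypothetical stand-ins, not Druid's actual context API.

```java
import java.util.HashMap;
import java.util.Map;

public class QueryContextSketch {
    // Key name as documented in this PR.
    static final String CTX_KEY = "groupByEnableMultiValueUnnesting";

    // Hypothetical helper: a missing key means unnesting stays enabled (the default).
    static boolean isUnnestingEnabled(Map<String, Object> context) {
        Object value = context.get(CTX_KEY);
        if (value == null) {
            return true;
        }
        // Accept both Boolean values and their string forms.
        return Boolean.parseBoolean(String.valueOf(value));
    }

    public static void main(String[] args) {
        Map<String, Object> context = new HashMap<>();
        System.out.println(isUnnestingEnabled(context)); // key absent, default applies
        context.put(CTX_KEY, false);
        System.out.println(isUnnestingEnabled(context)); // explicitly disabled
    }
}
```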
In the latest draft I have kept the documentation and changed the error messaging a bit.
Updated the "GroupBy v2 configurations" as well.
| * mark columns with null capabilites as candidates for explosion. | ||
| */ | ||
| public static boolean hasNoExplodingDimensions( | ||
| public static List<String> findAllProbableExplodingDimensions( |
Hmm, looking through this I just realized that this function can't tell if dimensions are definitely multi-valued or not. It can only tell if they might be multi-valued. (But they also might not be!)
So I think we should move the check; instead of checking here, we should check row by row as we actually run the query. If a multi-value row is ever encountered, then at that point we should throw the error.
Thanks for the catch. Updated the check while doing row by row iteration.
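The row-by-row approach could look roughly like the following sketch. It uses a plain `IllegalStateException` as a stand-in for Druid's `ISE`, models each row as a list of dimension values, and borrows the `multiValuesSize > 1` condition from the diff; the class and method names are hypothetical and the real engine code is structured differently.

```java
import java.util.List;

public class MultiValueRowCheck {
    static final String CTX_KEY = "groupByEnableMultiValueUnnesting";

    // Throws as soon as a row actually holds more than one value for the dimension.
    static void checkRow(String dimension, int multiValuesSize, boolean unnestingEnabled) {
        if (!unnestingEnabled && multiValuesSize > 1) {
            throw new IllegalStateException(String.format(
                "Encountered multi-value dimension [%s] that cannot be processed with %s set to false. "
                + "Consider setting %s to true for unnesting behavior, or using an expression "
                + "to create a scalar from the multi-value dimension.",
                dimension, CTX_KEY, CTX_KEY));
        }
    }

    // Walks the rows in order and returns how many passed the check.
    static int scan(String dimension, List<List<String>> rows, boolean unnestingEnabled) {
        int processed = 0;
        for (List<String> values : rows) {
            checkRow(dimension, values.size(), unnestingEnabled);
            processed++;
        }
        return processed;
    }
}
```

Checking per row means a column whose capabilities are unknown only fails the query if a multi-value row is actually encountered.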
|
|
||
| if (doAggregate) { | ||
| // this check is done during the row aggregation as a dimension can become multi-value col if | ||
| // {@link org.apache.druid.segment.column.ColumnCapabilities} is unkown. |
nit: javadoc links don't work here, maybe just say "column capabilities", also "unkown" -> "unknown"
| } else { | ||
| if (multiValuesSize > 1) { | ||
| // this check is done during the row aggregation as a dimension can become multi-value col if | ||
| // {@link org.apache.druid.segment.column.ColumnCapabilities} is unkown. |
Looks good after the changes. Thanks!
Description
As part of #12078, one of the follow-ups was to add a specific config that prevents accidental unnesting of multi-value columns when such columns become part of the grouping key.
Added a config `groupByEnableMultiValueUnnesting` which can be set in the query context. The default value of `groupByEnableMultiValueUnnesting` is `true`, therefore it does not change the current engine behavior. If `groupByEnableMultiValueUnnesting` is set to `false`, the query will fail if it encounters a multi-value column in the grouping key.
Key changed/added classes in this PR
- `GroupByQueryConfig`
- `GroupByQueryEngineV2`
- `GroupByQueryEngine`
This PR has: