Defer more expressions in vectorized groupBy.#16338
Conversation
This patch adds a way for columns to provide GroupByVectorColumnSelectors, which controls how the groupBy engine operates on them. This mechanism is used by ExpressionVirtualColumn to provide an ExpressionDeferredGroupByVectorColumnSelector that uses the inputs of an expression as the grouping key. The actual expression evaluation is deferred until the grouped ResultRow is created. A new context parameter "deferExpressionDimensions" allows users to control when this deferred selector is used. The default is "fixedWidthNonNumeric", which is a behavioral change from the prior behavior. Users can get the prior behavior by setting this to "singleString".
There was a problem hiding this comment.
There's also SqlGroupByBenchmark that benchmarks the code with various distributions and cardinalities. Maybe we should benchmark the code with the string columns and different parameters.
There was a problem hiding this comment.
Are you suggesting adding a new @Benchmark method to SqlGroupByBenchmark that uses a SQL query with expressions?
| { | ||
| this.expr = expr; | ||
| this.subSelectors = subSelectors; | ||
| this.exprKeyBytes = subSelectors.stream().mapToInt(GroupByVectorColumnSelector::getGroupingKeySize).sum(); |
There was a problem hiding this comment.
nit: i think this should be the same size as exprInputSignature.size()? if true i think we should move this into that loop initializing the input bindings suppliers
| dimensionSpec -> | ||
| ColumnProcessors.makeVectorProcessor( | ||
| dimensionSpec -> { | ||
| if (dimensionSpec instanceof DefaultDimensionSpec) { |
There was a problem hiding this comment.
is this just a sanity check since currently only DefaultDimensionSpec can vectorize, and ideally it will be the only one since DimensionSpec that do fancy stuff need to just be virtual columns.
There was a problem hiding this comment.
It's a defensive move in case any DimensionSpec other than DefaultDimensionSpec are vectorizable in the future. I don't expect this to happen either, but it didn't sit right to write code that would be incorrect for other DimensionSpec types.
| } | ||
| } | ||
|
|
||
| return true; |
There was a problem hiding this comment.
does this also need to check ColumnCapabilities.areDictionaryValuesUnique?
There was a problem hiding this comment.
I don't think so, because non-unique dictionary (like one that maps both 0 and 1 to the same value foo) can still be used for deferred evaluation. The non-uniqueness doesn't hurt, because we will be doing another pass after this that fully groups things.
Here I'm assuming that isDictionaryEncoded is not going to be set for string columns that lack any kind of dictionary. Like, I'm assuming that the ones where DimensionSelector#getValueCardinality is -1 would have isDictionaryEncoded set to false (or at least unknown).
There was a problem hiding this comment.
Here I'm assuming that
isDictionaryEncodedis not going to be set for string columns that lack any kind of dictionary. Like, I'm assuming that the ones whereDimensionSelector#getValueCardinalityis -1 would haveisDictionaryEncodedset to false (or at least unknown).
This is mainly what i was worried about when i wrote the comment, though looking closer at it i think its probably safe since https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/virtual/ExpressionPlan.java#L279 seems like the only risky place and I believe setting that basically ensures we use the deferred single value selector, so 👍
|
|
||
| allNumericInputs = allNumericInputs && capabilities.isNumeric(); | ||
|
|
||
| if (!capabilities.isNumeric() && !capabilities.isDictionaryEncoded().isTrue()) { |
There was a problem hiding this comment.
should this check areDictionaryValuesUnique?
There was a problem hiding this comment.
see #16338 (comment), i think it doesn't need to.
| return false; | ||
| } | ||
|
|
||
| if (!capabilities.isNumeric() && !capabilities.isDictionaryEncoded().isTrue()) { |
There was a problem hiding this comment.
should this check areDictionaryValuesUnique? Technically it probably doesn't have to since i think the only things that set this to false while retaining isDictionaryEncoded as true are non-default dimension specs like extractionFn which are not 1:1, but also not supported for use with vectorization.
Also someday i'd like to have a cooler way to detect fixed width types that works for complex too, but that can wait for a future change.
There was a problem hiding this comment.
see #16338 (comment), i think it doesn't need to.
This patch adds a way for columns to provide GroupByVectorColumnSelectors, which controls how the groupBy engine operates on them. This mechanism is used by ExpressionVirtualColumn to provide an ExpressionDeferredGroupByVectorColumnSelector that uses the inputs of an expression as the grouping key. The actual expression evaluation is deferred until the grouped ResultRow is created.
A new context parameter
deferExpressionDimensionsallows users to control when this deferred selector is used. The default isfixedWidthNonNumeric, which is a behavioral change from the prior behavior. Users can get the prior behavior by setting this tosingleString.Benchmarks of a few selected queries from
SqlExpressionBenchmarkfollow. Findings:GROUP BY CONCAT(string2, '-', long2), speeds up when the expression is deferred.GROUP BY TIME_FLOOR(TIMESTAMPADD(DAY, -1, __time),GROUP BY long1 * long2,GROUP BY CAST(long1 as BOOLEAN) AND CAST (long2 as BOOLEAN), andGROUP BY long5 IS NULL, long3 IS NOT NULL. All are simple expressions with numeric inputs and outputs.For these reasons, I think
fixedWidthNonNumericis a good default.