
Defer more expressions in vectorized groupBy. #16338

Merged: gianm merged 9 commits into apache:master from gianm:deferred-expr-selector on Jun 27, 2024

gianm commented Apr 25, 2024

This patch adds a way for columns to provide GroupByVectorColumnSelectors, which control how the groupBy engine operates on them. ExpressionVirtualColumn uses this mechanism to provide an ExpressionDeferredGroupByVectorColumnSelector that uses the inputs of an expression as the grouping key. The actual expression evaluation is deferred until the grouped ResultRow is created.
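To make the idea concrete, here is a minimal, self-contained sketch of eager versus deferred grouping for an expression like CONCAT(string2, '-', long2). It is an illustration only: the class and method names are hypothetical, and it uses plain Java maps rather than Druid's selectors or grouping buffers.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names come from the Druid code base.
final class DeferredGroupingSketch
{
  // Stand-in for the expression CONCAT(string2, '-', long2).
  static String concat(String string2, long long2)
  {
    return string2 + "-" + long2;
  }

  // Eager: evaluate the expression for every row, then group on its output.
  static Map<String, Long> groupEager(String[] string2, long[] long2)
  {
    Map<String, Long> counts = new LinkedHashMap<>();
    for (int i = 0; i < string2.length; i++) {
      counts.merge(concat(string2[i], long2[i]), 1L, Long::sum);
    }
    return counts;
  }

  // Deferred: group on the raw inputs, then evaluate the expression once per group.
  static Map<String, Long> groupDeferred(String[] string2, long[] long2)
  {
    Map<List<Object>, Long> countsByInputs = new LinkedHashMap<>();
    for (int i = 0; i < string2.length; i++) {
      countsByInputs.merge(Arrays.<Object>asList(string2[i], long2[i]), 1L, Long::sum);
    }

    Map<String, Long> counts = new LinkedHashMap<>();
    for (Map.Entry<List<Object>, Long> entry : countsByInputs.entrySet()) {
      String s = (String) entry.getKey().get(0);
      long l = (Long) entry.getKey().get(1);
      // Distinct inputs may produce the same output, so merge instead of put.
      counts.merge(concat(s, l), entry.getValue(), Long::sum);
    }
    return counts;
  }
}
```

Both methods return the same counts. The deferred variant calls the expression once per distinct input combination instead of once per row, which is where a saving can come from when the expression is expensive relative to grouping on its inputs.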

A new context parameter deferExpressionDimensions allows users to control when this deferred selector is used. The default is fixedWidthNonNumeric, which changes the prior behavior. Users can restore the prior behavior by setting this to singleString.
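The description names four values for this parameter (singleString, fixedWidth, fixedWidthNonNumeric, always) without spelling out each rule. The sketch below is one plausible reading, inferred from the mode names and from the capability checks quoted in the review threads of this PR. The enum, the Input class, and the exact conditions are assumptions for illustration, not the patch's implementation.

```java
import java.util.List;

// Hypothetical sketch; names and rules are assumptions, not taken from the Druid source.
enum DeferModeSketch
{
  SINGLE_STRING, FIXED_WIDTH, FIXED_WIDTH_NON_NUMERIC, ALWAYS;

  // Minimal stand-in for the column capabilities of one expression input.
  static final class Input
  {
    final boolean numeric;
    final boolean dictionaryEncoded;

    Input(boolean numeric, boolean dictionaryEncoded)
    {
      this.numeric = numeric;
      this.dictionaryEncoded = dictionaryEncoded;
    }
  }

  boolean shouldDefer(List<Input> inputs)
  {
    switch (this) {
      case ALWAYS:
        return true;
      case SINGLE_STRING:
        // Assumed: exactly one input, and it is a dictionary-encoded non-numeric column.
        return inputs.size() == 1 && !inputs.get(0).numeric && inputs.get(0).dictionaryEncoded;
      case FIXED_WIDTH:
      case FIXED_WIDTH_NON_NUMERIC: {
        boolean allNumericInputs = true;
        for (Input input : inputs) {
          allNumericInputs = allNumericInputs && input.numeric;
          // An input that is neither numeric nor dictionary-encoded has no fixed-width key.
          if (!input.numeric && !input.dictionaryEncoded) {
            return false;
          }
        }
        // The NON_NUMERIC variant additionally skips expressions whose inputs are all numeric.
        return this == FIXED_WIDTH || !allNumericInputs;
      }
      default:
        throw new IllegalStateException("Unknown mode: " + this);
    }
  }
}
```

Under this reading, an expression over a dictionary-encoded string and a long is deferred by fixedWidthNonNumeric, while an expression over two longs is not.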

Benchmarks of a few selected queries from SqlExpressionBenchmark follow. Findings:

  • Query 26, GROUP BY CONCAT(string2, '-', long2), speeds up when the expression is deferred.
  • Queries 22, 24, 30, and 31 slow down when the expression is deferred. These are GROUP BY TIME_FLOOR(TIMESTAMPADD(DAY, -1, __time), GROUP BY long1 * long2, GROUP BY CAST(long1 as BOOLEAN) AND CAST (long2 as BOOLEAN), and GROUP BY long5 IS NULL, long3 IS NOT NULL. All are simple expressions with numeric inputs and outputs.

For these reasons, I think fixedWidthNonNumeric is a good default.

Benchmark                        (deferExpressionDimensions)  (query)  (rowsPerSegment)  (schema)  (vectorize)  Mode  Cnt     Score     Error  Units

SqlExpressionBenchmark.querySql                 singleString       22           5000000      auto        force  avgt    5   260.078 ±  14.858  ms/op
SqlExpressionBenchmark.querySql                   fixedWidth       22           5000000      auto        force  avgt    5  1970.522 ±  58.400  ms/op
SqlExpressionBenchmark.querySql         fixedWidthNonNumeric       22           5000000      auto        force  avgt    5   263.535 ±   5.549  ms/op
SqlExpressionBenchmark.querySql                       always       22           5000000      auto        force  avgt    5  2021.229 ± 125.010  ms/op

SqlExpressionBenchmark.querySql                 singleString       24           5000000      auto        force  avgt    5   624.300 ±  36.616  ms/op
SqlExpressionBenchmark.querySql                   fixedWidth       24           5000000      auto        force  avgt    5   889.836 ±  31.123  ms/op
SqlExpressionBenchmark.querySql         fixedWidthNonNumeric       24           5000000      auto        force  avgt    5   646.920 ±  24.566  ms/op
SqlExpressionBenchmark.querySql                       always       24           5000000      auto        force  avgt    5   890.384 ±  53.748  ms/op

SqlExpressionBenchmark.querySql                 singleString       26           5000000      auto        force  avgt    5   824.417 ±  21.941  ms/op
SqlExpressionBenchmark.querySql                   fixedWidth       26           5000000      auto        force  avgt    5   244.232 ±  15.514  ms/op
SqlExpressionBenchmark.querySql         fixedWidthNonNumeric       26           5000000      auto        force  avgt    5   244.598 ±  14.268  ms/op
SqlExpressionBenchmark.querySql                       always       26           5000000      auto        force  avgt    5   248.505 ±   8.004  ms/op

SqlExpressionBenchmark.querySql                 singleString       30           5000000      auto        force  avgt    5   223.687 ±   9.362  ms/op
SqlExpressionBenchmark.querySql                   fixedWidth       30           5000000      auto        force  avgt    5   562.844 ±  42.288  ms/op
SqlExpressionBenchmark.querySql         fixedWidthNonNumeric       30           5000000      auto        force  avgt    5   227.850 ±   3.374  ms/op
SqlExpressionBenchmark.querySql                       always       30           5000000      auto        force  avgt    5   562.631 ±  69.408  ms/op

SqlExpressionBenchmark.querySql                 singleString       31           5000000      auto        force  avgt    5   324.208 ±   9.420  ms/op
SqlExpressionBenchmark.querySql                   fixedWidth       31           5000000      auto        force  avgt    5  1271.630 ±  87.264  ms/op
SqlExpressionBenchmark.querySql         fixedWidthNonNumeric       31           5000000      auto        force  avgt    5   323.169 ±   6.383  ms/op
SqlExpressionBenchmark.querySql                       always       31           5000000      auto        force  avgt    5  1185.118 ±  34.146  ms/op

*/
@Nullable
default GroupByVectorColumnSelector makeGroupByVectorColumnSelector(
String columnName,

Check notice: Code scanning / CodeQL
Useless parameter: The parameter 'columnName' is never used.
Contributor

There's also SqlGroupByBenchmark that benchmarks the code with various distributions and cardinalities. Maybe we should benchmark the code with the string columns and different parameters.

Contributor Author

Are you suggesting adding a new @Benchmark method to SqlGroupByBenchmark that uses a SQL query with expressions?

{
this.expr = expr;
this.subSelectors = subSelectors;
this.exprKeyBytes = subSelectors.stream().mapToInt(GroupByVectorColumnSelector::getGroupingKeySize).sum();
Member

nit: i think this should be the same size as exprInputSignature.size()? if true i think we should move this into that loop initializing the input bindings suppliers

Contributor Author

changed.
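As a rough illustration of the suggestion in this thread, the total key size and the per-input offsets can be produced by the same loop that walks the sub-selectors. This is a sketch only: KeySized is a hypothetical stand-in exposing just getGroupingKeySize, and the offsets array is an assumption about how a composite key might be laid out, not the code in the patch.

```java
import java.util.List;

// Hypothetical sketch; only the method name getGroupingKeySize comes from the PR.
final class CompositeKeySketch
{
  // Stand-in for the single selector method used here.
  interface KeySized
  {
    int getGroupingKeySize();
  }

  // Returns offsets where offsets[i] is the byte offset of input i within the
  // composite key and offsets[n] is the total key size in bytes.
  static int[] keyOffsets(List<? extends KeySized> subSelectors)
  {
    int[] offsets = new int[subSelectors.size() + 1];
    for (int i = 0; i < subSelectors.size(); i++) {
      // Any per-input setup (for example, binding suppliers) could happen in this same loop.
      offsets[i + 1] = offsets[i] + subSelectors.get(i).getGroupingKeySize();
    }
    return offsets;
  }
}
```

The last element plays the role of exprKeyBytes, so no separate stream pass over the sub-selectors is needed.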

dimensionSpec ->
ColumnProcessors.makeVectorProcessor(
dimensionSpec -> {
if (dimensionSpec instanceof DefaultDimensionSpec) {
Member

Is this just a sanity check, since currently only DefaultDimensionSpec can vectorize? Ideally it will remain the only one, since DimensionSpecs that do fancy stuff should just be virtual columns.

Contributor Author

It's a defensive move in case any DimensionSpec other than DefaultDimensionSpec becomes vectorizable in the future. I don't expect this to happen either, but it didn't sit right to write code that would be incorrect for other DimensionSpec types.

}
}

return true;
Member

does this also need to check ColumnCapabilities.areDictionaryValuesUnique?

Contributor Author

I don't think so, because a non-unique dictionary (like one that maps both 0 and 1 to the same value foo) can still be used for deferred evaluation. The non-uniqueness doesn't hurt, because we will be doing another pass after this that fully groups things.

Here I'm assuming that isDictionaryEncoded is not going to be set for string columns that lack any kind of dictionary. Like, I'm assuming that the ones where DimensionSelector#getValueCardinality is -1 would have isDictionaryEncoded set to false (or at least unknown).
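A small sketch of why the non-uniqueness is harmless, assuming a simplified two-pass model: the first pass groups on dictionary ids, and the second pass looks up the values and merges groups that resolve to the same value. The class and method names are hypothetical, and plain maps stand in for the engine's grouping structures.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not code from the patch.
final class NonUniqueDictionarySketch
{
  static Map<String, Long> countByValue(int[] ids, String[] dictionary)
  {
    // First pass: group on dictionary ids. Ids 0 and 1 form separate groups
    // even if both map to the same value.
    Map<Integer, Long> countsById = new LinkedHashMap<>();
    for (int id : ids) {
      countsById.merge(id, 1L, Long::sum);
    }

    // Second pass: resolve ids to values and merge, which fully groups the results.
    Map<String, Long> countsByValue = new LinkedHashMap<>();
    for (Map.Entry<Integer, Long> entry : countsById.entrySet()) {
      countsByValue.merge(dictionary[entry.getKey()], entry.getValue(), Long::sum);
    }
    return countsByValue;
  }
}
```

With a dictionary of {"foo", "foo", "bar"}, the first pass produces three groups, and the second pass collapses the two foo groups into one, so the final result is the same as with a unique dictionary.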

Member

> Here I'm assuming that isDictionaryEncoded is not going to be set for string columns that lack any kind of dictionary. Like, I'm assuming that the ones where DimensionSelector#getValueCardinality is -1 would have isDictionaryEncoded set to false (or at least unknown).

This is mainly what I was worried about when I wrote the comment, though looking closer at it I think it's probably safe, since https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/virtual/ExpressionPlan.java#L279 seems like the only risky place, and I believe setting that basically ensures we use the deferred single value selector, so 👍


allNumericInputs = allNumericInputs && capabilities.isNumeric();

if (!capabilities.isNumeric() && !capabilities.isDictionaryEncoded().isTrue()) {
Member

should this check areDictionaryValuesUnique?

Contributor Author

see #16338 (comment), i think it doesn't need to.

return false;
}

if (!capabilities.isNumeric() && !capabilities.isDictionaryEncoded().isTrue()) {
Member

should this check areDictionaryValuesUnique? Technically it probably doesn't have to since i think the only things that set this to false while retaining isDictionaryEncoded as true are non-default dimension specs like extractionFn which are not 1:1, but also not supported for use with vectorization.

Also someday i'd like to have a cooler way to detect fixed width types that works for complex too, but that can wait for a future change.

Contributor Author

see #16338 (comment), i think it doesn't need to.

5 participants