Enable multiple distinct aggregators in same query #11014
clintropolis merged 5 commits into apache:master from
|`druid.sql.planner.maxTopNLimit`|Maximum threshold for a [TopN query](../querying/topnquery.md). Higher limits will be planned as [GroupBy queries](../querying/groupbyquery.md) instead.|100000|
|`druid.sql.planner.metadataRefreshPeriod`|Throttle for metadata refreshes.|PT1M|
|`druid.sql.planner.useApproximateCountDistinct`|Whether to use an approximate cardinality algorithm for `COUNT(DISTINCT foo)`.|true|
|`druid.sql.planner.useGroupingSetForExactDistinct`|Only relevant when `useApproximateCountDistinct` is disabled. If set to true, exact distinct queries are re-written using grouping sets. Otherwise, exact distinct queries are re-written using joins. This should be set to true for group by query with multiple exact distinct aggregations. This flag can be overridden per query.|false|
naively this seems better maybe than using joins... is the reason to make it false by default in case there are any regressions I guess? I only ask because things that are cool, but off by default tend to take a long time to make it to being turned on, if ever.
I didn't want to accidentally break any queries that are running already. At least for the backports, we do want to keep it off by default but maybe turn it on for new releases?
I think it would be ok if we could leave off by default for next release, and maybe consider turning on in the release after
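For reference, the per-query override mentioned in the docs row above could look roughly like the sketch below. It only builds the context map that would accompany a SQL request. The context key names are assumed to match the planner property names (the test in this PR imports `CTX_KEY_USE_GROUPING_SET_FOR_EXACT_DISTINCT` from `PlannerConfig`, whose value is not shown here), and the table and column names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExactDistinctContextExample {
    // Assumption: context keys mirror the planner property names.
    // Verify against PlannerConfig before relying on them.
    static final String CTX_USE_APPROX = "useApproximateCountDistinct";
    static final String CTX_USE_GROUPING_SET = "useGroupingSetForExactDistinct";

    /** Builds a query context that asks for exact distincts planned with grouping sets. */
    static Map<String, Object> exactDistinctContext() {
        Map<String, Object> context = new LinkedHashMap<>();
        // The grouping-set flag is only relevant when approximation is disabled.
        context.put(CTX_USE_APPROX, false);
        context.put(CTX_USE_GROUPING_SET, true);
        return context;
    }

    public static void main(String[] args) {
        // Hypothetical query with two exact distinct aggregations.
        String sql = "SELECT COUNT(DISTINCT dim1), COUNT(DISTINCT dim2) FROM example_table";
        System.out.println(sql);
        System.out.println(exactDistinctContext());
    }
}
```

Leaving the property `false` cluster-wide and opting in per query like this keeps existing queries on the join-based rewrite, which is the concern raised in this thread.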
  // This should be 0 because the broker needs 2 buffers and the queryable node needs one.
- Assert.assertEquals(0, MERGE_BUFFER_POOL.getMinRemainBufferNum());
- Assert.assertEquals(3, MERGE_BUFFER_POOL.getPoolSize());
+ Assert.assertEquals(1, MERGE_BUFFER_POOL.getMinRemainBufferNum());
+ Assert.assertEquals(4, MERGE_BUFFER_POOL.getPoolSize());
nit: this comment isn't accurate anymore
import java.util.Map;
import java.util.stream.Collectors;

import static org.apache.druid.sql.calcite.planner.PlannerConfig.CTX_KEY_USE_GROUPING_SET_FOR_EXACT_DISTINCT;
iirc, I think we typically prefer to not use static imports, I know this is enforced in some places, but maybe not in test code because of some junit stuffs?
Description
Running queries with multiple exact distinct aggregations requires us to enable a Calcite rule, `AggregateExpandDistinctAggregatesRule.INSTANCE`, which is currently not enabled. Instead, the `AggregateExpandDistinctAggregatesRule.JOIN` rule is used, which plans queries with multiple distinct aggregations as a join query with a join condition of type `IS_NOT_DISTINCT_FROM`. However, Druid supports only equality conditions in joins. The `AggregateExpandDistinctAggregatesRule.INSTANCE` rule, on the other hand, uses the grouping aggregator to run queries with distinct aggregations. That aggregator was added recently.

With `AggregateExpandDistinctAggregatesRule.INSTANCE` enabled, query planning completes just fine; however, query execution then fails. This is due to a bug in how a group by query procures merge buffers: Druid undercounts the required merge buffers when there is a nested query and the subquery has subtotals.

This patch fixes the logic that computes the required merge buffers. Additionally, a flag has been added to switch between the old and new behavior.
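
To illustrate the kind of undercount described above, here is a toy model, not Druid's actual counting code. `Layer` is a hypothetical stand-in for one group by layer, and the cost model (one buffer per layer that reads from a subquery, one extra per layer with subtotals) is an assumption chosen only to make the difference visible. The real logic handles buffer reuse and limits that this sketch ignores.

```java
public class MergeBufferCountSketch {
    /** Hypothetical stand-in for one group by layer; not a Druid class. */
    static final class Layer {
        final boolean hasSubtotals;
        final Layer subquery; // null for the innermost layer

        Layer(boolean hasSubtotals, Layer subquery) {
            this.hasSubtotals = hasSubtotals;
            this.subquery = subquery;
        }
    }

    /** Undercounting variant: subtotals are inspected on the outermost layer only. */
    static int countOuterSubtotalsOnly(Layer outer) {
        int buffers = outer.hasSubtotals ? 1 : 0;
        for (Layer layer = outer; layer != null; layer = layer.subquery) {
            if (layer.subquery != null) {
                buffers++; // assumed cost of reading from a subquery
            }
        }
        return buffers;
    }

    /** Variant that also accounts for subtotals on every nested layer. */
    static int countAllLayers(Layer outer) {
        int buffers = 0;
        for (Layer layer = outer; layer != null; layer = layer.subquery) {
            if (layer.subquery != null) {
                buffers++; // assumed cost of reading from a subquery
            }
            if (layer.hasSubtotals) {
                buffers++; // assumed extra cost of computing subtotals
            }
        }
        return buffers;
    }

    public static void main(String[] args) {
        // Nested query whose subquery has subtotals, the case named in the description.
        Layer inner = new Layer(true, null);
        Layer outer = new Layer(false, inner);
        System.out.println("outer-only count: " + countOuterSubtotalsOnly(outer));
        System.out.println("all-layers count: " + countAllLayers(outer));
    }
}
```

In this model the first variant reserves fewer buffers than the second for the nested case, which is the shape of failure where planning succeeds but execution cannot obtain the buffers it needs.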
There is a group of queries that will still fail