Return empty result when a group by gets optimized to a timeseries query #12065
abhishekagarwal87
left a comment
Minor comments. LGTM otherwise.
// When optimization in Grouping#applyProject is applied, and it reduces a Group By query to a timeseries, we
// want it to return empty bucket if no row matches
@Test
public void testReturnEmptyRowWhenGroupByIsConvertedToTimeseries() throws Exception
Can you add a few more queries, such as one that groups on a dummy dimension?
Added a few more test cases that show how the optimization behaves in different scenarios.
{
  skipVectorize();
  testQuery(
      "SELECT 'A', dim1 from foo WHERE m1 = 50 AND dim1 = 'wat' GROUP BY dim1",
By the way, I noticed that it runs as a group-by query if we remove the m1 = 50 clause. Do you know why?
When the m1 = 50 clause is removed, dim1 is not reduced to the literal 'wat' during the Calcite planning phase, so the optimization in Grouping.java that eliminates literals is not applied.
I checked where this reduction happens: it is in Calcite's ProjectMergeRule. When there is only a single project, ProjectMergeRule is not invoked.
Merged since test failures are unrelated.
@abhishekagarwal87 I feel our tests have been too flaky recently. Can you file an issue for the flaky tests in this PR, so that we can promote and fix it? If we already have one, please link it here.
Sure thing @jihoonson
Thank you!
Description
Related to #11188
The above-mentioned PR allowed timeseries queries to return a default result when queries of the following type were executed:
`select count(*) from table where dim1="_not_present_dim_"`
Before that PR, such a query returned no row; after it, the query returns a row with `count(*)` equal to 0 (as expected by the SQL standards of different databases).

In `Grouping#applyProject`, we can sometimes optimize a groupBy query to a timeseries query (when the keys of the groupBy are constants, as generated by automated tools). For example, in `select count(*) from table where dim1="_present_dim_" group by "dummy_key"`, the groupBy clause can be removed. However, when the filter matches nothing, i.e. `select count(*) from table where dim1="_not_present_dim_" group by "dummy_key"`, general databases return nothing, while Druid (due to the above change) returns an empty row. This PR aims to fix this divergence in behavior.

Example cases:
`select count(*) from table where dim1="_not_present_dim_" group by "dummy_key"`
CURRENT: Returns a row with count(*) = 0
EXPECTED: Return no row
`select 'A', dim1 from foo where m1 = 123123 and dim1 = '_not_present_again_' group by dim1`
CURRENT: Returns a row with ('A', 'wat')
EXPECTED: Return no row
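
The divergence described above comes from a general SQL rule: an aggregate without GROUP BY always produces one row, while a grouped aggregate produces one row per group that actually occurs. The following is a minimal, self-contained sketch of that rule in plain Java. It does not use any Druid or Calcite API, and the class and method names (`EmptyAggregateSemantics`, `countAll`, `countGroupedByConstant`) are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EmptyAggregateSemantics
{
  // Global aggregate (no GROUP BY): exactly one result row, even when no input row matches.
  static List<Long> countAll(List<String> matchingRows)
  {
    List<Long> result = new ArrayList<>();
    result.add((long) matchingRows.size());
    return result;
  }

  // Grouped aggregate: one result row per group that was actually seen.
  // With a constant key there is at most one group, and none when the input is empty.
  static List<Long> countGroupedByConstant(List<String> matchingRows, String constantKey)
  {
    Map<String, Long> groups = new LinkedHashMap<>();
    for (String ignored : matchingRows) {
      groups.merge(constantKey, 1L, Long::sum);
    }
    return new ArrayList<>(groups.values());
  }

  public static void main(String[] args)
  {
    List<String> noMatches = new ArrayList<>();
    // One row holding a zero count.
    System.out.println(countAll(noMatches));
    // No rows at all.
    System.out.println(countGroupedByConstant(noMatches, "dummy_key"));
  }
}
```

Dropping the constant key turns the second shape into the first, which is why the rewritten query needs extra handling to keep returning no rows.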
To do this, a boolean `droppedDimensionsWhileApplyingProject` has been added to `Grouping`; it is true whenever the optimization changes the original shape of the grouping. Hence, if a timeseries query has a grouping with this flag set to true, we set `skipEmptyBuckets=true` in the query context (i.e. do not return any row).

Key changed/added classes in this PR
`Grouping.java`
`DruidQuery#toTimeseriesQuery`
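
The flag-based approach can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual Druid code: `GroupingSketch`, its `applyProject` signature, and `timeseriesContext` are hypothetical stand-ins. Only the flag name `droppedDimensionsWhileApplyingProject` and the context key `skipEmptyBuckets` come from the description above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SkipEmptyBucketsSketch
{
  // Hypothetical, simplified stand-in for Grouping.
  static final class GroupingSketch
  {
    final List<String> dimensions;
    final boolean droppedDimensionsWhileApplyingProject;

    GroupingSketch(List<String> dimensions, boolean droppedDimensionsWhileApplyingProject)
    {
      this.dimensions = dimensions;
      this.droppedDimensionsWhileApplyingProject = droppedDimensionsWhileApplyingProject;
    }

    // Hypothetical version of the optimization: drop grouping keys known to be constant
    // and record whether the shape of the grouping changed.
    static GroupingSketch applyProject(List<String> dimensions, Set<String> constantDimensions)
    {
      List<String> kept = new ArrayList<>();
      for (String dimension : dimensions) {
        if (!constantDimensions.contains(dimension)) {
          kept.add(dimension);
        }
      }
      return new GroupingSketch(kept, kept.size() < dimensions.size());
    }
  }

  // Hypothetical helper: build the timeseries query context from the grouping.
  static Map<String, Object> timeseriesContext(GroupingSketch grouping, Map<String, Object> baseContext)
  {
    Map<String, Object> context = new HashMap<>(baseContext);
    if (grouping.droppedDimensionsWhileApplyingProject) {
      // The query was originally grouped, so an empty input must yield no rows.
      context.put("skipEmptyBuckets", true);
    }
    return context;
  }

  public static void main(String[] args)
  {
    GroupingSketch optimized = GroupingSketch.applyProject(List.of("dummy_key"), Set.of("dummy_key"));
    System.out.println(timeseriesContext(optimized, new HashMap<>()));

    GroupingSketch plainTimeseries = GroupingSketch.applyProject(List.of(), Set.of());
    System.out.println(timeseriesContext(plainTimeseries, new HashMap<>()));
  }
}
```

In this sketch a query that was a timeseries from the start keeps the default-row behavior, and only a grouping that lost dimensions during the optimization sets `skipEmptyBuckets`.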