Skip to content

Return empty result when a group by gets optimized to a timeseries query#12065

Merged
abhishekagarwal87 merged 7 commits intoapache:masterfrom
LakshSingla:grouping-skip-empty-bucket
Jan 7, 2022
Merged

Return empty result when a group by gets optimized to a timeseries query#12065
abhishekagarwal87 merged 7 commits intoapache:masterfrom
LakshSingla:grouping-skip-empty-bucket

Conversation

@LakshSingla
Copy link
Copy Markdown
Contributor

Description

Related to #11188

The above mentioned PR allowed timeseries queries to return a default result, when queries of type: select count(*) from table where dim1="_not_present_dim_" were executed. Before the PR, it returned no row, after the PR, it would return a row with value of count(*) as 0 (as expected by SQL standards of different dbs).

In Grouping#applyProject, we can sometimes perform optimization of a groupBy query to a timeseries query if possible (when the keys of the groupBy are constants, as generated by automated tools). For example, in select count(*) from table where dim1="_present_dim_" group by "dummy_key", the groupBy clause can be removed. However, in the case when the filter doesn't return anything, i.e. select count(*) from table where dim1="_not_present_dim_" group by "dummy_key", the behavior of general databases would be to return nothing, while druid (due to above change) returns an empty row. This PR aims to fix this divergence of behavior.

Example cases:

  1. select count(*) from table where dim1="_not_present_dim_" group by "dummy_key".
    CURRENT: Returns a row with count(*) = 0
    EXPECTED: Return no row

  2. select 'A', dim1 from foo where m1 = 123123 and dim1 = '_not_present_again_' group by dim1
    CURRENT: Returns a row with ('A', 'wat')
    EXPECTED: Return no row

To do this, a boolean droppedDimensionsWhileApplyingProject has been added to Grouping which is true whenever we make changes to the original shape with optimization. Hence if a timeseries query has a grouping with this set to true, we set skipEmptyBuckets=true in the query context (i.e. donot return any row).


Key changed/added classes in this PR
  • Grouping.java
  • DruidQuery#toTimeseriesQuery

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@LakshSingla LakshSingla marked this pull request as draft December 15, 2021 07:25
@LakshSingla LakshSingla marked this pull request as ready for review December 15, 2021 07:25
Copy link
Copy Markdown
Contributor

@abhishekagarwal87 abhishekagarwal87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments. LGTM otherwise.

Comment thread sql/src/main/java/org/apache/druid/sql/calcite/rel/Grouping.java Outdated
// When optimization in Grouping#applyProject is applied, and it reduces a Group By query to a timeseries, we
// want it to return empty bucket if no row matches
@Test
public void testReturnEmptyRowWhenGroupByIsConvertedToTimeseries() throws Exception
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add few more queries such as one that groups on a dummy dimension

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few more test cases, which show the consequences of optimization in different cases.

{
skipVectorize();
testQuery(
"SELECT 'A', dim1 from foo WHERE m1 = 50 AND dim1 = 'wat' GROUP BY dim1",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw I noticed that it runs as a group-by query if we remove the m1 = 50 clause. do you know why?

Copy link
Copy Markdown
Contributor Author

@LakshSingla LakshSingla Jan 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On removing the m1=50 clause, the dim1 is not getting reduced to wat literal in the Calcite planner phase, so the optimization in Grouping.java to eliminate the literals is not getting applied.
I checked the place where this reduction is happening, and it's in the ProjectMergeRule of Calcite. When there's a single project, then ProjectMergeRule is not getting invoked.

@abhishekagarwal87 abhishekagarwal87 merged commit 7c17341 into apache:master Jan 7, 2022
@abhishekagarwal87
Copy link
Copy Markdown
Contributor

Merged since test failures are unrelated.

@jihoonson
Copy link
Copy Markdown
Contributor

@abhishekagarwal87 I feel our test is too flaky recently. Can you file an issue for flaky tests in this PR, so that we can promote and fix it? If we have one already, please link it to here.

@abhishekagarwal87
Copy link
Copy Markdown
Contributor

Sure thing @jihoonson

@jihoonson
Copy link
Copy Markdown
Contributor

Thank you!

@abhishekagarwal87 abhishekagarwal87 added this to the 0.23.0 milestone May 11, 2022
@LakshSingla LakshSingla deleted the grouping-skip-empty-bucket branch May 14, 2024 02:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants