SQL support for nested groupBys. by gianm · Pull Request #3806 · apache/druid

gianm · 2016-12-25T07:09:58Z

Allows, for example, doing exact count distinct by writing:

SELECT COUNT(*) FROM (SELECT DISTINCT col FROM druid.foo)

Contrast with approximate count distinct, which is:

SELECT COUNT(DISTINCT col) FROM druid.foo
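
To make the shape of the exact query concrete, here is a minimal in-memory sketch in Java. It is not Druid code: the class and method names are hypothetical, and it only models the two layers (an inner distinct/group step, then an outer row count) on a plain list.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative model of the nested query only; all names here are hypothetical.
class ExactCountDistinctSketch {
  // Inner layer: SELECT DISTINCT col, which yields one row per distinct value.
  static List<String> selectDistinct(List<String> col) {
    return col.stream().distinct().collect(Collectors.toList());
  }

  // Outer layer: SELECT COUNT(*) over the rows produced by the inner layer.
  static long countStar(List<String> rows) {
    return rows.size();
  }

  // SELECT COUNT(*) FROM (SELECT DISTINCT col FROM ...)
  static long exactCountDistinct(List<String> col) {
    return countStar(selectDistinct(col));
  }

  public static void main(String[] args) {
    List<String> col = Arrays.asList("a", "b", "a", "c", "b");
    System.out.println(exactCountDistinct(col));
  }
}
```

The inner layer has to materialize every distinct value before the outer layer can count them, which is why the exact form costs more than a fixed-size approximate sketch.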

nishantmonu51 · 2016-12-30T18:10:44Z

+
+- `COUNT(DISTINCT col)` aggregations use [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf), a
+fast approximate distinct counting algorithm. If you need exact distinct counts, you can instead use
+`SELECT COUNT(*) FROM (SELECT DISTINCT col FROM druid.foo)`, which will use a slower and more resource intensive exact


It would be nice to have a flag so that COUNT(DISTINCT col) can also be executed using the exact algorithm, instead of expecting the user to write a nested query.

👍 on that suggestion

That would be a nice feature, but imo it should be a different PR.

I also prefer this approach. Different behavior depending on query structure can make users confused.

Fair enough, I agree that would be cool, but I don't think it makes sense to change DISTINCT aggs in this PR. All this PR is doing is adding the nested query feature, it's not making any changes to how DISTINCT aggs are handled.

Ok. I'm reviewing the patch.

fjy · 2017-01-03T18:55:41Z

👍

jihoonson

@gianm, this patch looks good. I left some comments.
I also tested the following doubly nested groupBy query using CalciteQueryTest and found that it never finishes. Is this query not covered in this issue?

@Test
  public void testRecursivelyNestedGroupby() throws Exception
  {
    testQuery(
        "select sum(cnt), count(*) from (select dim2, sum(t1.cnt) cnt from (select dim1, dim2, count(*) cnt from druid.foo group by dim1, dim2) t1 group by dim2) t2",
        null,
        ImmutableList.of(
            new Object[]{6L, 3L}
        )
    );
  }

jihoonson · 2017-01-04T09:02:11Z

+
+    final TimeseriesQuery timeseriesQuery = queryBuilder.toTimeseriesQuery(dataSource, sourceRowSignature);
+    if (timeseriesQuery != null) {
+      executeTimeseries(queryBuilder, timeseriesQuery, sink);


I think it would be better if we could know ahead of time which operator creates the data source. But I know this accumulate() method was just moved from DruidQueryBuilder with few changes, and adding query types would involve a lot of additional changes. Do you have any plan for this?

I thought of having just one toQuery method, but the problem with that is then when we want to execute the query, we need to pair it with the correct execution strategy for that query (select needs to issue multiple queries for pagination, all query types have different result formats, etc). So that's why accumulate checks each possible query type individually.

I don't have a plan for changing this but I am open to change.

Ok. We can change it later if needed.

jihoonson · 2017-01-04T09:05:23Z

+      if (druidRel.getQueryBuilder().getSelectProjection() != null
+          || druidRel.getQueryBuilder().getGrouping() != null
+          || druidRel.getQueryBuilder().getLimitSpec() != null) {
+        return;


How about implementing public boolean matches(RelOptRuleCall call)? I think this would be a better way to reject non-matching rules early.
I know these rule classes were just moved here, but it would be good if we could improve them.

Sounds good, I'll do that. I didn't think of this before.
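
A rough sketch of the difference between the two styles, using hypothetical stand-in types rather than Calcite's own classes (in Calcite the rule would extend RelOptRule and both methods would take a RelOptRuleCall). The three flags mirror the null checks in the diff above; everything else is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; not Calcite or Druid classes.
class RuleMatchesSketch {
  // Summarizes the three query builder properties checked in the diff.
  static final class BuilderState {
    final boolean hasSelectProjection;
    final boolean hasGrouping;
    final boolean hasLimitSpec;

    BuilderState(boolean hasSelectProjection, boolean hasGrouping, boolean hasLimitSpec) {
      this.hasSelectProjection = hasSelectProjection;
      this.hasGrouping = hasGrouping;
      this.hasLimitSpec = hasLimitSpec;
    }

    boolean isUntouched() {
      return !hasSelectProjection && !hasGrouping && !hasLimitSpec;
    }
  }

  interface Rule {
    // Defaults to true, like the method being discussed.
    default boolean matches(BuilderState state) {
      return true;
    }

    void onMatch(BuilderState state, List<String> applied);
  }

  // Original style: the precondition is an early return inside onMatch.
  static final class EarlyReturnRule implements Rule {
    @Override
    public void onMatch(BuilderState state, List<String> applied) {
      if (!state.isUntouched()) {
        return;
      }
      applied.add("early-return rule applied");
    }
  }

  // Suggested style: the precondition lives in matches, so onMatch never runs.
  static final class MatchesRule implements Rule {
    @Override
    public boolean matches(BuilderState state) {
      return state.isUntouched();
    }

    @Override
    public void onMatch(BuilderState state, List<String> applied) {
      applied.add("matches rule applied");
    }
  }

  // Minimal driver: returns true if onMatch was invoked at all.
  static boolean fire(Rule rule, BuilderState state, List<String> applied) {
    if (!rule.matches(state)) {
      return false;
    }
    rule.onMatch(state, applied);
    return true;
  }

  public static void main(String[] args) {
    BuilderState grouped = new BuilderState(false, true, false);
    List<String> applied = new ArrayList<>();
    System.out.println(fire(new EarlyReturnRule(), grouped, applied));
    System.out.println(fire(new MatchesRule(), grouped, applied));
  }
}
```

Both styles leave the plan unchanged for a non-matching node. The difference is that with the precondition in matches, the driver can skip the rule before any work in onMatch starts.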

jihoonson · 2017-01-04T09:05:50Z

+    }
+
+    if (queryBuilder.getGrouping() != null) {
+      cost *= 0.5;


How about making these constants static variables?

Sure thing.
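
For reference, the kind of change being asked for might look like the sketch below. Only the 0.5 factor for a grouping comes from the diff; the class, constant, and method names are hypothetical.

```java
// Hypothetical names; only the 0.5 grouping factor appears in the diff.
class CostConstantsSketch {
  // Named constant replacing the inline literal in "cost *= 0.5".
  static final double COST_GROUPING_MULTIPLIER = 0.5;

  static double applyGrouping(double cost, boolean hasGrouping) {
    return hasGrouping ? cost * COST_GROUPING_MULTIPLIER : cost;
  }

  public static void main(String[] args) {
    System.out.println(applyGrouping(10.0, true));
  }
}
```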

gianm · 2017-01-04T18:58:21Z

@jihoonson, on your test testRecursivelyNestedGroupby, that didn't work because it needs more merge buffers than the test allows, and since the merge buffer pool is blocking, the test blocks forever. I pushed a commit that adds your test, bumps up the test merge buffer pool to 3, adds a maxQueryCount config, and adds docs warning users about this potential deadlock in prod.

The new doc blurb is:

Note that groupBys require a separate merge buffer on the broker for each layer beyond the first layer of the groupBy. With the v2 groupBy strategy, this can potentially lead to deadlocks for groupBys nested beyond two layers, since the merge buffers are limited in number and are acquired one-by-one and not as a complete set. At this time we recommend that you avoid deeply-nested groupBys with the v2 strategy. Doubly-nested groupBys (groupBy -> groupBy -> table) are safe and do not suffer from this issue. If you like, you can forbid deeper nesting by setting druid.sql.planner.maxQueryCount = 2.
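
The hold-and-wait problem described in the blurb can be sketched with a plain java.util.concurrent.Semaphore standing in for the merge buffer pool. This is an illustration under stated assumptions, not Druid's implementation: the class and method names are hypothetical, the "layers minus one" rule is taken from the blurb, and tryAcquire is used instead of a blocking acquire so that the stuck state can be observed without actually hanging.

```java
import java.util.concurrent.Semaphore;

// Illustrative model only; a Semaphore stands in for the blocking merge buffer pool.
class MergeBufferSketch {
  // Per the doc blurb: one merge buffer for each layer beyond the first.
  static int buffersNeeded(int layers) {
    return Math.max(0, layers - 1);
  }

  // Two identical queries take turns grabbing one buffer at a time and keep
  // what they hold. Returns true if both end up holding a partial set with
  // nothing left in the pool, which is where blocking acquires would deadlock.
  static boolean interleavedOneByOneGetsStuck(int poolSize, int layers) {
    Semaphore pool = new Semaphore(poolSize);
    int needed = buffersNeeded(layers);
    int heldA = 0;
    int heldB = 0;
    for (int round = 0; round < needed; round++) {
      if (heldA < needed && pool.tryAcquire()) {
        heldA++;
      }
      if (heldB < needed && pool.tryAcquire()) {
        heldB++;
      }
    }
    return heldA < needed && heldB < needed;
  }

  // Alternative: take the complete set atomically or take nothing at all.
  static boolean acquireAsSet(Semaphore pool, int layers) {
    return pool.tryAcquire(buffersNeeded(layers));
  }

  public static void main(String[] args) {
    // groupBy -> groupBy -> groupBy -> table, pool of 2 buffers.
    System.out.println(interleavedOneByOneGetsStuck(2, 3));
    // groupBy -> groupBy -> table, pool of 2 buffers.
    System.out.println(interleavedOneByOneGetsStuck(2, 2));
  }
}
```

With two layers each query needs a single buffer, so it never waits while holding one, which matches the statement that doubly-nested groupBys are safe. With three layers each query needs two, and two concurrent queries can each hold one while waiting for the other.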

gianm · 2017-01-04T22:42:08Z

@jihoonson, I just pushed commits for the rest of your comments, please let me know what you think. thanks for the review.

gianm · 2017-01-04T23:14:46Z

Raised #3819 for the deeply nested groupBy thing.

jihoonson · 2017-01-05T00:31:23Z

Thanks! The latest patch looks good to me.

gianm · 2017-01-06T22:58:42Z

Resolved conflicts.

jon-wei · 2017-01-12T02:21:18Z

👍

* SQL support for nested groupBys.

  Allows, for example, doing exact count distinct by writing:

  SELECT COUNT(*) FROM (SELECT DISTINCT col FROM druid.foo)

  Contrast with approximate count distinct, which is:

  SELECT COUNT(DISTINCT col) FROM druid.foo

* Add deeply-nested groupBy docs, tests, and maxQueryCount config.
* Extract magic constants into statics.
* Rework rules to put preconditions in the "matches" method.

fjy added this to the 0.10.0 milestone Dec 26, 2016

SQL support for nested groupBys.

6798c01

gianm force-pushed the sql-nested-groupBy branch from 2595049 to 6798c01 Compare December 27, 2016 18:45

nishantmonu51 reviewed Dec 31, 2016

jihoonson requested changes Jan 4, 2017

gianm added 2 commits January 4, 2017 09:01

Merge branch 'master' into sql-nested-groupBy

2f53cb0

Add deeply-nested groupBy docs, tests, and maxQueryCount config.

5463dd8

gianm added 2 commits January 4, 2017 14:39

Extract magic constants into statics.

96d8cd7

Rework rules to put preconditions in the "matches" method.

80a63f1

gianm closed this Jan 4, 2017

gianm reopened this Jan 4, 2017

gianm mentioned this pull request Jan 4, 2017

groupBy v2: Deadlock on deeply nested subqueries #3819

Closed

gianm assigned fjy and jon-wei and unassigned jon-wei Jan 5, 2017

Merge branch 'master' into sql-nested-groupBy

0a331d6

gianm force-pushed the sql-nested-groupBy branch from ac2dbea to 0a331d6 Compare January 7, 2017 20:48

gianm closed this Jan 7, 2017

gianm reopened this Jan 7, 2017

jon-wei merged commit e86859b into apache:master Jan 12, 2017

jon-wei mentioned this pull request Jan 28, 2017

Fine grained buffer management for groupby #3863

Merged

clambertus unassigned fjy Jul 6, 2018

gianm deleted the sql-nested-groupBy branch September 23, 2022 19:28
