Skip to content

SQL support for nested groupBys.#3806

Merged
jon-wei merged 6 commits intoapache:masterfrom
gianm:sql-nested-groupBy
Jan 12, 2017
Merged

SQL support for nested groupBys.#3806
jon-wei merged 6 commits intoapache:masterfrom
gianm:sql-nested-groupBy

Conversation

@gianm
Copy link
Copy Markdown
Contributor

@gianm gianm commented Dec 25, 2016

Allows, for example, doing exact count distinct by writing:

SELECT COUNT(*) FROM (SELECT DISTINCT col FROM druid.foo)

Contrast with approximate count distinct, which is:

SELECT COUNT(DISTINCT col) FROM druid.foo

@fjy fjy added this to the 0.10.0 milestone Dec 26, 2016
Allows, for example, doing exact count distinct by writing:

  SELECT COUNT(*) FROM (SELECT DISTINCT col FROM druid.foo)

Contrast with approximate count distinct, which is:

  SELECT COUNT(DISTINCT col) FROM druid.foo
@gianm gianm force-pushed the sql-nested-groupBy branch from 2595049 to 6798c01 Compare December 27, 2016 18:45

- `COUNT(DISTINCT col)` aggregations use [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf), a
fast approximate distinct counting algorithm. If you need exact distinct counts, you can instead use
`SELECT COUNT(*) FROM (SELECT DISTINCT col FROM druid.foo)`, which will use a slower and more resource intensive exact
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice if there can be a flag where, Count(Distinct col) can also be executed using exact algo, instead of expecting the user to write a nested query instead.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on that suggestion

Copy link
Copy Markdown
Contributor Author

@gianm gianm Jan 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be a nice feature, but imo it should be a different PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also prefer this approach. Different behavior depending on query structure can make users confused.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, I agree that would be cool, but I don't think it makes sense to change DISTINCT aggs in this PR. All this PR is doing is adding the nested query feature, it's not making any changes to how DISTINCT aggs are handled.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'm reviewing the patch.

@fjy
Copy link
Copy Markdown
Contributor

fjy commented Jan 3, 2017

👍

Copy link
Copy Markdown
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gianm, this patch looks good. I left some comments.
I additionally tested the following double nested group by query using CalciteQueryTest, and found it doesn't finish. Is this query not covered in this issue?

@Test
  public void testRecursivelyNestedGroupby() throws Exception
  {
    testQuery(
        "select sum(cnt), count(*) from (select dim2, sum(t1.cnt) cnt from (select dim1, dim2, count(*) cnt from druid.foo group by dim1, dim2) t1 group by dim2) t2",
        null,
        ImmutableList.of(
            new Object[]{6L, 3L}
        )
    );
  }


final TimeseriesQuery timeseriesQuery = queryBuilder.toTimeseriesQuery(dataSource, sourceRowSignature);
if (timeseriesQuery != null) {
executeTimeseries(queryBuilder, timeseriesQuery, sink);
Copy link
Copy Markdown
Contributor

@jihoonson jihoonson Jan 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better if we are able to know which operator creates the data source ahead. But, I know this accumulate() method is just moved from DruidQueryBuilder with little changes, and adding query types will involve a lot of additional changes. Do you have any plan for this?

Copy link
Copy Markdown
Contributor Author

@gianm gianm Jan 4, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought of having just one toQuery method, but the problem with that is then when we want to execute the query, we need to pair it with the correct execution strategy for that query (select needs to issue multiple queries for pagination, all query types have different result formats, etc). So that's why accumulate checks each possible query type individually.

I don't have a plan for changing this but I am open to change.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. We can change later if it needs.

if (druidRel.getQueryBuilder().getSelectProjection() != null
|| druidRel.getQueryBuilder().getGrouping() != null
|| druidRel.getQueryBuilder().getLimitSpec() != null) {
return;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about implementing public boolean matches(RelOptRuleCall call)? I think this will be a better approach for avoiding not-matched rules early.
I know these rule classes are just moved to here, but it will be good if we can improve them.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I'll do that. I didn't think of this before.

}

if (queryBuilder.getGrouping() != null) {
cost *= 0.5;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about making these constants as static variables?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing.

@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Jan 4, 2017

@jihoonson, on your test testRecursivelyNestedGroupby, that didn't work because it needs more merge buffers than the test allows, and since the merge buffer pool is blocking, the test blocks forever. I pushed a commit that adds your test, bumps up the test merge buffer pool to 3, adds a maxQueryCount config, and adds docs warning users about this potential deadlock in prod.

The new doc blurb is:

Note that groupBys require a separate merge buffer on the broker for each layer beyond the first layer of the groupBy. With the v2 groupBy strategy, this can potentially lead to deadlocks for groupBys nested beyond two layers, since the merge buffers are limited in number and are acquired one-by-one and not as a complete set. At this time we recommend that you avoid deeply-nested groupBys with the v2 strategy. Doubly-nested groupBys (groupBy -> groupBy -> table) are safe and do not suffer from this issue. If you like, you can forbid deeper nesting by setting druid.sql.planner.maxQueryCount = 2.

@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Jan 4, 2017

@jihoonson, I just pushed commits for the rest of your comments, please let me know what you think. thanks for the review.

@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Jan 4, 2017

Raised #3819 for the deeply nested groupBy thing.

@jihoonson
Copy link
Copy Markdown
Contributor

Thanks! The latest patch looks good to me.

@gianm gianm assigned fjy and jon-wei and unassigned jon-wei Jan 5, 2017
@gianm
Copy link
Copy Markdown
Contributor Author

gianm commented Jan 6, 2017

Resolved conflicts.

@gianm gianm force-pushed the sql-nested-groupBy branch from ac2dbea to 0a331d6 Compare January 7, 2017 20:48
@gianm gianm closed this Jan 7, 2017
@gianm gianm reopened this Jan 7, 2017
@jon-wei
Copy link
Copy Markdown
Contributor

jon-wei commented Jan 12, 2017

👍

@jon-wei jon-wei merged commit e86859b into apache:master Jan 12, 2017
dgolitsyn pushed a commit to metamx/druid that referenced this pull request Feb 14, 2017
* SQL support for nested groupBys.

Allows, for example, doing exact count distinct by writing:

  SELECT COUNT(*) FROM (SELECT DISTINCT col FROM druid.foo)

Contrast with approximate count distinct, which is:

  SELECT COUNT(DISTINCT col) FROM druid.foo

* Add deeply-nested groupBy docs, tests, and maxQueryCount config.

* Extract magic constants into statics.

* Rework rules to put preconditions in the "matches" method.
@gianm gianm deleted the sql-nested-groupBy branch September 23, 2022 19:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants