SQL support for nested groupBys. #3806
Allows, for example, doing exact count distinct by writing:
`SELECT COUNT(*) FROM (SELECT DISTINCT col FROM druid.foo)`
Contrast with approximate count distinct, which is:
`SELECT COUNT(DISTINCT col) FROM druid.foo`
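As a rough illustration of why the nested form is exact but more resource intensive, the sketch below models its two stages in plain Java. It does not use Druid or its SQL layer; the class and method names are hypothetical and exist only for this example. The inner stage has to materialize every distinct value before the outer stage can count them, which is the cost the approximate form avoids.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: models the shape of the nested query, not Druid's execution.
class ExactCountDistinctSketch {
  // Inner query: SELECT DISTINCT col FROM druid.foo
  // Every distinct value is held in memory at once.
  static Set<String> selectDistinct(List<String> col) {
    return new LinkedHashSet<>(col);
  }

  // Outer query: SELECT COUNT(*) FROM (inner query)
  static long countStar(Set<String> rows) {
    return rows.size();
  }

  public static void main(String[] args) {
    List<String> col = List.of("a", "b", "a", "c", "b");
    System.out.println(countStar(selectDistinct(col))); // prints 3
  }
}
```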
2595049 to 6798c01
- `COUNT(DISTINCT col)` aggregations use [HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf), a
fast approximate distinct counting algorithm. If you need exact distinct counts, you can instead use
`SELECT COUNT(*) FROM (SELECT DISTINCT col FROM druid.foo)`, which will use a slower and more resource intensive exact
It would be nice to have a flag so that `COUNT(DISTINCT col)` can also be executed with the exact algorithm, instead of expecting the user to write a nested query.
That would be a nice feature, but imo it should be a different PR.
I also prefer this approach. Behavior that differs depending on query structure can confuse users.
Fair enough, I agree that would be cool, but I don't think it makes sense to change DISTINCT aggs in this PR. All this PR does is add the nested query feature; it doesn't change how DISTINCT aggs are handled.
Ok. I'm reviewing the patch.
👍
jihoonson
left a comment
@gianm, this patch looks good. I left some comments.
I additionally tested the following doubly nested groupBy query using CalciteQueryTest and found that it doesn't finish. Is this query not covered by this PR?
@Test
public void testRecursivelyNestedGroupby() throws Exception
{
  testQuery(
      "select sum(cnt), count(*) from (select dim2, sum(t1.cnt) cnt from (select dim1, dim2, count(*) cnt from druid.foo group by dim1, dim2) t1 group by dim2) t2",
      null,
      ImmutableList.of(
          new Object[]{6L, 3L}
      )
  );
}
final TimeseriesQuery timeseriesQuery = queryBuilder.toTimeseriesQuery(dataSource, sourceRowSignature);
if (timeseriesQuery != null) {
  executeTimeseries(queryBuilder, timeseriesQuery, sink);
I think it would be better if we could know ahead of time which operator creates the data source. But I know this accumulate() method was just moved from DruidQueryBuilder with minor changes, and adding query types will involve a lot of additional changes. Do you have any plan for this?
I thought of having just one toQuery method, but the problem is that when we want to execute the query, we need to pair it with the correct execution strategy for that query type (select needs to issue multiple queries for pagination, all query types have different result formats, etc.). That's why accumulate checks each possible query type individually.
I don't have a plan for changing this but I am open to change.
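For readers following the thread, the dispatch pattern described here can be sketched roughly as below. This is not the PR's code: `QueryBuilder` is a hypothetical stand-in that returns plain strings instead of Druid query objects, and the set of query types tried (timeseries, groupBy, select) and their order are assumptions. Only the "convert, check for null, then run the matching executor" shape comes from the diff hunk and the comment above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical stand-ins, not Druid classes.
class AccumulateSketch {
  // Each toXxxQuery() returns null when the query cannot be expressed as that type.
  static final class QueryBuilder {
    private final String timeseries;
    private final String groupBy;
    private final String select;

    QueryBuilder(String timeseries, String groupBy, String select) {
      this.timeseries = timeseries;
      this.groupBy = groupBy;
      this.select = select;
    }

    String toTimeseriesQuery() { return timeseries; }
    String toGroupByQuery() { return groupBy; }
    String toSelectQuery() { return select; }
  }

  // Tries each query type in turn and pairs it with its own execution strategy.
  // Returns the name of the strategy that ran.
  static String accumulate(QueryBuilder qb, List<String> sink) {
    String timeseriesQuery = qb.toTimeseriesQuery();
    if (timeseriesQuery != null) {
      sink.add("timeseries result for " + timeseriesQuery);
      return "timeseries";
    }
    String groupByQuery = qb.toGroupByQuery();
    if (groupByQuery != null) {
      sink.add("groupBy result for " + groupByQuery);
      return "groupBy";
    }
    String selectQuery = qb.toSelectQuery();
    if (selectQuery != null) {
      // Select is paginated, so one SQL query may issue several native queries.
      // Two pages here is an arbitrary choice for the sketch.
      sink.add("select page 1 for " + selectQuery);
      sink.add("select page 2 for " + selectQuery);
      return "select";
    }
    throw new IllegalStateException("No query type matched");
  }

  public static void main(String[] args) {
    List<String> sink = new ArrayList<>();
    System.out.println(accumulate(new QueryBuilder(null, "q", null), sink)); // prints groupBy
  }
}
```

A single `toQuery` method would still need this kind of branching at execution time, which is the trade-off the comment above points out.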
Ok. We can change it later if needed.
if (druidRel.getQueryBuilder().getSelectProjection() != null
    || druidRel.getQueryBuilder().getGrouping() != null
    || druidRel.getQueryBuilder().getLimitSpec() != null) {
  return;
How about implementing `public boolean matches(RelOptRuleCall call)`? I think that would be a better way to reject non-matching rules early.
I know these rule classes were just moved here, but it would be good if we could improve them.
Sounds good, I'll do that. I didn't think of this before.
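A minimal sketch of the suggested refactoring, assuming simplified stand-in types rather than Calcite's `RelOptRule` and `RelOptRuleCall`: the early-return guard from the hunk above moves into `matches`, so the planner never invokes `onMatch` for a rel that does not qualify. The class names, the `Rel` flags, and the `fire` helper are hypothetical and only mirror the three conditions shown in the diff.

```java
// Illustrative only: stand-in types, not Calcite or Druid classes.
class MatchesSketch {
  // Hypothetical summary of the query builder state checked in the diff.
  static final class Rel {
    final boolean hasSelectProjection;
    final boolean hasGrouping;
    final boolean hasLimitSpec;

    Rel(boolean hasSelectProjection, boolean hasGrouping, boolean hasLimitSpec) {
      this.hasSelectProjection = hasSelectProjection;
      this.hasGrouping = hasGrouping;
      this.hasLimitSpec = hasLimitSpec;
    }
  }

  abstract static class Rule {
    // Default mirrors the usual planner convention: a rule matches unless it says otherwise.
    boolean matches(Rel rel) {
      return true;
    }

    abstract String onMatch(Rel rel);
  }

  static final class HypotheticalFilterRule extends Rule {
    // Preconditions live here instead of as an early return inside onMatch.
    @Override
    boolean matches(Rel rel) {
      return !rel.hasSelectProjection && !rel.hasGrouping && !rel.hasLimitSpec;
    }

    @Override
    String onMatch(Rel rel) {
      return "filter applied";
    }
  }

  // Stand-in for the planner: onMatch runs only when matches returns true.
  static String fire(Rule rule, Rel rel) {
    return rule.matches(rel) ? rule.onMatch(rel) : null;
  }

  public static void main(String[] args) {
    Rule rule = new HypotheticalFilterRule();
    System.out.println(fire(rule, new Rel(false, false, false))); // prints filter applied
    System.out.println(fire(rule, new Rel(false, true, false)));  // prints null
  }
}
```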
}

if (queryBuilder.getGrouping() != null) {
  cost *= 0.5;
How about extracting these constants into static variables?
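One way the suggestion might look, as a sketch only: the `0.5` grouping multiplier is taken from the hunk above, while the constant names, the base cost of `1.0`, and the `computeCost` signature are assumptions made for illustration.

```java
// Illustrative only: names and the base cost are hypothetical.
class CostSketch {
  static final double COST_BASE = 1.0;
  // The 0.5 value is the literal that appears in the reviewed diff.
  static final double COST_GROUPING_MULTIPLIER = 0.5;

  static double computeCost(boolean hasGrouping) {
    double cost = COST_BASE;
    if (hasGrouping) {
      cost *= COST_GROUPING_MULTIPLIER;
    }
    return cost;
  }

  public static void main(String[] args) {
    System.out.println(computeCost(true));  // prints 0.5
    System.out.println(computeCost(false)); // prints 1.0
  }
}
```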
@jihoonson, on your test The new doc blurb is:
@jihoonson, I just pushed commits for the rest of your comments. Please let me know what you think. Thanks for the review.
Raised #3819 for the deeply nested groupBy thing.
Thanks! The latest patch looks good to me.
Resolved conflicts.
ac2dbea to 0a331d6
👍
* SQL support for nested groupBys.
  Allows, for example, doing exact count distinct by writing: `SELECT COUNT(*) FROM (SELECT DISTINCT col FROM druid.foo)`
  Contrast with approximate count distinct, which is: `SELECT COUNT(DISTINCT col) FROM druid.foo`
* Add deeply-nested groupBy docs, tests, and maxQueryCount config.
* Extract magic constants into statics.
* Rework rules to put preconditions in the "matches" method.