add "subtotalsSpec" attribute to groupBy query by himanshug · Pull Request #5280 · apache/druid

himanshug · 2018-01-22T22:45:47Z

This patch introduces a "subtotalsSpec" attribute to the groupBy query. For example, you might have a groupBy query that looks something like this:

{
"type": "groupBy",
 ...
 ...
"dimensions": [
  {
  "type" : "default",
  "dimension" : "d1col",
  "outputName": "D1"
  },
  {
  "type" : "extraction",
  "dimension" : "d2col",
  "outputName" :  "D2",
  "extractionFn" : extraction_func
  },
  {
  "type":"lookup",
  "dimension":"d3col",
  "outputName":"D3",
  "name":"my_lookup"
  }
],
...
...
"subtotalsSpec":[ ["D1", "D2", "D3"], ["D1", "D3"], ["D3"]],
..

}

The response is equivalent to concatenating the results of three groupBy queries whose "dimensions" fields are ["D1", "D2", "D3"], ["D1", "D3"], and ["D3"], each using the corresponding DimensionSpec JSON blobs from the query above.
The response for the above query would look something like this:

[
  {
    "version" : "v1",
    "timestamp" : "t1",
    "event" : { "D1": "..", "D2": "..", "D3": ".." }
  },
  {
    "version" : "v1",
    "timestamp" : "t2",
    "event" : { "D1": "..", "D2": "..", "D3": ".." }
  },
  ...
  ...

  {
    "version" : "v1",
    "timestamp" : "t1",
    "event" : { "D1": "..", "D3": ".." }
  },
  {
    "version" : "v1",
    "timestamp" : "t2",
    "event" : { "D1": "..", "D3": ".." }
  },
  ...
  ...

  {
    "version" : "v1",
    "timestamp" : "t1",
    "event" : { "D3": ".." }
  },
  {
    "version" : "v1",
    "timestamp" : "t2",
    "event" : { "D3": ".." }
  },
  ...
]

Note that each entry in "subtotalsSpec" must be a subset of the "outputName" values from the DimensionSpec JSON blobs in the "dimensions" attribute, and the dimensions inside each subtotal spec must appear in the same order as in the top-level "dimensions" attribute. For example, a ["D2", "D1"] subtotal spec is not valid because it is not in the same order.
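
The subset-and-order rule above amounts to a subsequence check. Here is a minimal sketch in Java, assuming a hypothetical helper class named `SubtotalsSpecCheck`; it is not the validation code from this patch, only an illustration of the constraint:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the Druid code base.
final class SubtotalsSpecCheck
{
  /**
   * Returns true if every name in subtotal appears in dimensionOutputNames,
   * in the same relative order (i.e. subtotal is a subsequence).
   */
  static boolean isOrderedSubset(List<String> dimensionOutputNames, List<String> subtotal)
  {
    int next = 0; // position in dimensionOutputNames where the search resumes
    for (String name : subtotal) {
      boolean found = false;
      while (next < dimensionOutputNames.size()) {
        if (dimensionOutputNames.get(next++).equals(name)) {
          found = true;
          break;
        }
      }
      if (!found) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args)
  {
    List<String> dims = Arrays.asList("D1", "D2", "D3");
    System.out.println(isOrderedSubset(dims, Arrays.asList("D1", "D3"))); // true
    System.out.println(isOrderedSubset(dims, Arrays.asList("D2", "D1"))); // false
  }
}
```

Because the search position only moves forward, a name that appears out of order (or twice) is rejected without a second pass.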

The Druid SQL layer could add functions that auto-generate the "subtotalsSpec" to support features similar to the "ROLLUP" and "CUBE" functions described in https://docs.oracle.com/cd/B28359_01/server.111/b28314/tdpdw_sql.htm#TDPDW00712

nishantmonu51 · 2018-01-23T17:59:01Z

In the above query, "dimensions" seems to be redundant; dimensionsSpec could probably be made optional when subtotalsSpec is specified.
Also, can we rename subtotalsSpec to groupingSpec or columnGroupingSpec?

himanshug · 2018-01-23T19:37:25Z

@nishantmonu51 "dimensions" isn't just a list of strings but a list of DimensionSpec objects, which provide additional functionality such as extraction functions and lookups. So, in general, the user should be able to provide it explicitly, and it is not redundant.
But yes, it could be made optional, with "dimensions" generated from the given subtotals, especially if the query looked like the one in the example description. I don't see much value in that considering the type of queries I see. It would be a minor and independent change that could be done later as well.

I called it "subtotalsSpec" based on the term used in oracle https://docs.oracle.com/cd/B28359_01/server.111/b28314/tdpdw_sql.htm#TDPDW00711 and that most users were familiar with, but I wouldn't mind changing the name if other proposed options make more sense.

himanshug · 2018-01-24T16:59:49Z

@nishantmonu51 also modified the example in PR description to highlight difference between "dimensions" and "subtotalsSpec" fields.

gianm · 2018-02-02T04:14:16Z

@himanshug,

I think this satisfies the desire to be useful for SQL GROUPING SETS; the planner would need to compute the overarching union of all GROUPING SETS, then include that in the dimensions, and then create a subtotalsSpec with any other sub GROUPING SETS. So that is good.

In Druid SQL a query like yours would look like,

SELECT
  d1col AS D1,
  extraction_func(d2col) AS D2,
  LOOKUP(d3col, 'my_lookup') AS D3,
  COUNT(*)
FROM tbl
GROUP BY GROUPING SETS ( (1, 2, 3), (1, 3), (3) )
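
The planner step described above (take the union of all GROUPING SETS as "dimensions", then list each set in "subtotalsSpec") could be sketched roughly as follows. The class name, the method names, and the "D" + ordinal naming are assumptions made for illustration; they do not come from the Druid SQL planner:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical planner sketch; names are illustrative only.
final class GroupingSetsSketch
{
  /** Union of all grouping sets in ascending ordinal order; this would become "dimensions". */
  static List<Integer> unionOf(List<List<Integer>> groupingSets)
  {
    TreeSet<Integer> union = new TreeSet<>();
    for (List<Integer> set : groupingSets) {
      union.addAll(set);
    }
    return new ArrayList<>(union);
  }

  /** Maps each grouping set to output names, sorted so the order matches the union. */
  static List<List<String>> toSubtotalsSpec(List<List<Integer>> groupingSets)
  {
    List<List<String>> spec = new ArrayList<>();
    for (List<Integer> set : groupingSets) {
      List<String> names = new ArrayList<>();
      for (Integer ordinal : new TreeSet<>(set)) {
        names.add("D" + ordinal); // assumed naming: ordinal 1 -> "D1"
      }
      spec.add(names);
    }
    return spec;
  }

  public static void main(String[] args)
  {
    List<List<Integer>> sets = Arrays.asList(
        Arrays.asList(1, 2, 3),
        Arrays.asList(1, 3),
        Arrays.asList(3)
    );
    System.out.println(unionOf(sets));         // [1, 2, 3]
    System.out.println(toSubtotalsSpec(sets)); // [[D1, D2, D3], [D1, D3], [D3]]
  }
}
```

Sorting each set by ordinal keeps every subtotal in the same order as the union, which is the ordering constraint stated in the PR description.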

Some questions and thoughts,

1. If you don't specify the super-set in the subtotalsSpec, do you still get it returned to you? I think it's best if the answer is "no", since the user might not actually want the super-set. It would work better for the SQL planner if the answer is "no", consider a query like GROUP BY GROUPING SETS ( (1, 2), (2, 3) ).
2. In groupBy v2 the grouping results can be streamed back to the caller from the broker without materializing them completely, however with subtotalsSpec, would that mean the broker needs to materialize the results? If so where are they stored? (Forgive me- I did not read the patch yet!)
3. If results do need to be materialized, I think it'd be important to optimize the case where grouping sets are like GROUP BY GROUPING SETS ( (1, 2, 3), () ) (i.e. we want a table plus a grand total). The grand total can be computed while still avoiding materialization of the full result set.
4. In the result format is it easy to distinguish null grouping values from nulls that represent subtotals? Consider that in SQL GROUPING SETS, it is not easy without help, but luckily the GROUPING function can help. This doc describes the problem and also the GROUPING function: https://docs.microsoft.com/en-us/sql/t-sql/functions/grouping-transact-sql.

himanshug · 2018-02-02T16:41:09Z

@gianm thanks, hopefully the following answers provide further explanation.

If you don't specify the super-set in the subtotalsSpec, do you still get it returned to you? I think it's best if the answer is "no", since the user might not actually want the super-set. It would work better for the SQL planner if the answer is "no", consider a query like GROUP BY GROUPING SETS ( (1, 2), (2, 3) ).

You are right, super-set result is not returned unless it was part of subtotalsSpec for exactly the reasons you mentioned.

In groupBy v2 the grouping results can be streamed back to the caller from the broker without materializing them completely, however with subtotalsSpec, would that mean the broker needs to materialize the results? If so where are they stored? (Forgive me- I did not read the patch yet!)

This patch does not add support for subtotals in groupBy-v1; such queries would just fail.
For groupBy-v2, the implementation looks very similar to that of nested groupBy. Results from the super-set query are materialized inside one BufferGrouper instance (with one "merge buffer"). Then we run one query per subtotal on this BufferGrouper by iterating over the rows in it and stream-merging them. To ensure the stream-merge works, we enforce a constraint on the subtotals that they must have the same order of dims as the super-set (I updated the PR description with this constraint).
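
To illustrate the materialize-then-regroup idea only, here is a rough sketch under simplifying assumptions: a plain in-memory list stands in for the BufferGrouper, a summable count stands in for arbitrary aggregators, and a hash map replaces the stream-merge that the patch actually uses. All class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in; this is not how BufferGrouper is implemented.
final class SubtotalRegroupSketch
{
  /** One materialized super-set row: dimension values by output name, plus a count. */
  static final class Row
  {
    final Map<String, String> dims = new LinkedHashMap<>();
    final long count;

    Row(String d1, String d2, String d3, long count)
    {
      dims.put("D1", d1);
      dims.put("D2", d2);
      dims.put("D3", d3);
      this.count = count;
    }
  }

  /** Re-groups the materialized super-set rows by the given subtotal dimensions. */
  static Map<List<String>, Long> regroup(List<Row> supersetRows, List<String> subtotal)
  {
    Map<List<String>, Long> result = new LinkedHashMap<>();
    for (Row row : supersetRows) {
      List<String> key = new ArrayList<>();
      for (String dim : subtotal) {
        key.add(row.dims.get(dim));
      }
      result.merge(key, row.count, Long::sum); // combine counts for equal keys
    }
    return result;
  }

  public static void main(String[] args)
  {
    List<Row> rows = Arrays.asList(
        new Row("a", "x", "p", 2),
        new Row("a", "y", "p", 3),
        new Row("b", "x", "q", 5)
    );
    System.out.println(regroup(rows, Arrays.asList("D1", "D3"))); // {[a, p]=5, [b, q]=5}
    System.out.println(regroup(rows, Arrays.asList("D3")));       // {[p]=5, [q]=5}
  }
}
```

The super-set rows are computed once and every subtotal is derived from them, which is why a single extra buffer is enough regardless of how many subtotals the query lists.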

If results do need to be materialized, I think it'd be important to optimize the case where grouping sets are like GROUP BY GROUPING SETS ( (1, 2, 3), () ) (i.e. we want a table plus a grand total). The grand total can be computed while still avoiding materialization of the full result set.

Possibly yes; however, the current patch does not optimize for this. Maybe that is something that can be done as a follow-up improvement.
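
For reference, the grand-total case can in principle be handled in a single streaming pass. The sketch below is only an illustration of that idea, with rows reduced to a single long metric and a hypothetical class name; it is not part of this patch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Illustrative only: each "row" is reduced to one long metric value.
final class GrandTotalSketch
{
  /** Passes every value through unchanged, then emits their sum as one final element. */
  static Iterator<Long> withGrandTotal(final Iterator<Long> values)
  {
    return new Iterator<Long>()
    {
      private long total = 0;
      private boolean totalEmitted = false;

      @Override
      public boolean hasNext()
      {
        return values.hasNext() || !totalEmitted;
      }

      @Override
      public Long next()
      {
        if (values.hasNext()) {
          long value = values.next();
          total += value; // accumulate while streaming; nothing is buffered
          return value;
        }
        if (!totalEmitted) {
          totalEmitted = true;
          return total;
        }
        throw new NoSuchElementException();
      }
    };
  }

  public static void main(String[] args)
  {
    List<Long> out = new ArrayList<>();
    Iterator<Long> it = withGrandTotal(Arrays.asList(2L, 3L, 5L).iterator());
    while (it.hasNext()) {
      out.add(it.next());
    }
    System.out.println(out); // [2, 3, 5, 10]
  }
}
```

Only one running accumulator is kept, so memory use does not grow with the number of rows.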

In the result format is it easy to distinguish null grouping values from nulls that represent subtotals? Consider that in SQL GROUPING SETS, it is not easy without help, but luckily the GROUPING function can help. This doc describes the problem and also the GROUPING function: https://docs.microsoft.com/en-us/sql/t-sql/functions/grouping-transact-sql.

Druid groupBy result sets always include the names of all dimensions (even if their values were null/empty) in each row. So it is identifiable from the rows when the next subtotal begins. For example, the result for the query in the PR description would look something like this:

[
  {
    "version" : "v1",
    "timestamp" : "t1",
    "event" : { "D1": "..", "D2": "..", "D3": ".." }
  },
  {
    "version" : "v1",
    "timestamp" : "t2",
    "event" : { "D1": "..", "D2": "..", "D3": ".." }
  },
  ...
  ...

  {
    "version" : "v1",
    "timestamp" : "t1",
    "event" : { "D1": "..", "D3": ".." }
  },
  {
    "version" : "v1",
    "timestamp" : "t2",
    "event" : { "D1": "..", "D3": ".." }
  },
  ...
  ...

  {
    "version" : "v1",
    "timestamp" : "t1",
    "event" : { "D3": ".." }
  },
  {
    "version" : "v1",
    "timestamp" : "t2",
    "event" : { "D3": ".." }
  },
  ...
]

himanshug · 2018-02-05T18:19:27Z

@gianm let me know if the explanation in #5280 (comment) sounds sensible and then I will try and finish up this PR.

gianm · 2018-02-06T20:14:49Z

@himanshug this approach does sound sensible. If you don't do the optimization to avoid materialization when the user just asks for a grand total, I encourage you to at least build the feature in such a way that it could be put in later without too much refactoring. (I think it would be common, for example getting a timeseries with a grand total)

himanshug · 2018-02-06T21:22:04Z

@gianm I think the patch is structured to allow optimizing that use case by checking for it in GroupByStrategyV2.processSubtotalsSpec(..) and doing something else instead of materializing the result set inside the BufferGrouper.

himanshug · 2018-02-07T19:55:33Z

@gianm @nishantmonu51 alright, this PR is ready now.
Instead of keeping the attribute name "subtotalsSpec", the following other options are available:
"groupingSets"
"groupingsSpec"
"columnGroupingsSpec"

I'm good with "subtotalsSpec" but let me know if majority likes one of the other options.

I will add documentation as well once we settle on the attribute name.

himanshug · 2018-04-12T17:49:18Z

Getting back to this after a while; I'll fix the conflict. @gianm @nishantmonu51 please take a look again and help me finish this one.

jihoonson · 2018-04-12T22:21:55Z

Probably this should be labeled with Design Review.

gianm · 2018-04-12T22:37:17Z

@himanshug sure, please post when the conflict is fixed and I'll take another look.

himanshug · 2018-04-13T20:38:28Z

@gianm fixed the conflict.

jihoonson · 2018-04-16T21:52:26Z

I've added Design Review label because already two or more committers are reviewing this.

gianm

Thank you for your patience @himanshug. Let me know what you ~~honk~~ think of the review. Btw, after this patch is in, along with #5640 we’ll be able to start implementing subtotals in SQL too 🙂

gianm · 2018-04-16T20:46:45Z

+          }
+          if (!found) {
+            throw new IAE(
+                "Subtotal spec %s is either not a subset or items are in different order than in dimensiosn spec.",


Spelling: dimensions. Maybe call it dimensionsSpec so it's identical to what is in the query?

Fixed the spelling. In the query it is called "dimensions", so I am keeping that to be identical to the query.

gianm · 2018-04-16T20:48:28Z

    return limitSpec;
  }

+  @JsonInclude(JsonInclude.Include.NON_NULL)


What does this @JsonInclude do? Does it mean don't write it if it's null? That's kind of cool.

gianm · 2018-04-16T21:07:08Z

+        return groupByStrategy.processSubtotalsSpec(
+            query,
+            resource,
+            groupByStrategy.processSubqueryResult(subquery, query, resource, finalizingResults)


Should this be query.withSubtotalsSpec(null)?

No, it's needed in the impl of processSubtotalsSpec(..)

gianm · 2018-04-19T16:54:44Z

+              ).withDimensionSpecs(
+                  Lists.transform(
+                      queryWithoutSubtotalsSpec.getDimensions(),
+                      (dimSpec) -> new DefaultDimensionSpec(


This loses the type of the dimension (getOutputType) which is needed for numeric dimensions.

Fixed, thanks. Added a unit test too for a long-type dimension column.

gianm · 2018-04-19T16:55:10Z

+
+      for (List<String> subtotalSpec : subtotals) {
+        GroupByQuery subtotalQuery = queryWithoutSubtotalsSpec.withDimensionSpecs(
+            subtotalSpec.stream().map(s -> new DefaultDimensionSpec(s, s)).collect(Collectors.toList())


The dimension type is lost here too.

fixed here as well.

gianm · 2018-04-19T17:02:57Z

    if (!willMergeRunners) {
-      final int requiredMergeBufferNum = countRequiredMergeBufferNum(query, 1);
+      final int requiredMergeBufferNum = countRequiredMergeBufferNum(query, 1) +
+                                         (query.getSubtotalsSpec() != null ? 1 : 0);


Do we really need an extra merge buffer when we’re computing subtotals? There’s already a requirement that the subtotals dimensions be in the same order as the top level dimensions, meaning we should be able to compute them without a big extra buffer. Just one row of scratch space plus a streaming combine.

Oh wait, I'm dumb, this isn't true. If we did a group by on A, B, C and wanted an A, C subtotal, then we'll be seeing values of C non-contiguously. Nevermind!

It would be nice in the future to optimize for the case where all subtotals can be done streaming (if they are all prefixes) but that could be future work, not in this PR.
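
The "all prefixes" condition mentioned above is easy to state in code. This is a hedged sketch with hypothetical names, not something the patch implements: a subtotal that is a prefix of the sorted grouping key sees equal keys contiguously, whereas a subtotal such as A, C on an A, B, C grouping does not.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of this patch.
final class PrefixSubtotalsSketch
{
  /** True if subtotal equals the first subtotal.size() entries of dimensions. */
  static boolean isPrefix(List<String> dimensions, List<String> subtotal)
  {
    return subtotal.size() <= dimensions.size()
           && dimensions.subList(0, subtotal.size()).equals(subtotal);
  }

  /** True if every subtotal is a prefix, i.e. the streaming case discussed above. */
  static boolean allPrefixes(List<String> dimensions, List<List<String>> subtotals)
  {
    for (List<String> subtotal : subtotals) {
      if (!isPrefix(dimensions, subtotal)) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args)
  {
    List<String> dims = Arrays.asList("A", "B", "C");
    List<List<String>> rollup = Arrays.asList(
        Arrays.asList("A", "B", "C"),
        Arrays.asList("A", "B"),
        Arrays.asList("A")
    );
    System.out.println(allPrefixes(dims, rollup)); // true
    System.out.println(
        allPrefixes(dims, Collections.singletonList(Arrays.asList("A", "C")))
    ); // false
  }
}
```

A ROLLUP-style spec always satisfies this check, so it would be the natural first candidate for such an optimization.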

yeah, that optimization would be nice.

gianm · 2018-04-23T18:32:02Z

Hi @himanshug - have you had a chance to review my review? 😃

himanshug · 2018-04-24T14:53:02Z

Hi @himanshug - have you had a chance to review my review? 😃

@gianm thanks for the review. sorry, I haven't had a chance to take another look. I'll try and finish it this week or next.
PS: not sure if you know, but I'm going through some major transitions personally this week :)

gianm · 2018-04-24T15:13:46Z

@himanshug I have heard, congratulations :)

The comments I had were relatively minor, I think the main interesting one was the types being lost, so we probably want some additional tests for numeric dimensions.

himanshug · 2018-05-29T06:05:29Z

@gianm my apologies for not being able to get back to this for so long, but finally :)

And also thanks to @a2l007 for the reminders to get this done.

himanshug · 2018-08-21T21:21:48Z

@gianm re-merged with master and fixed build, it should be good to go now.

gianm

@himanshug It looks good, but can you add docs please?

himanshug · 2018-08-23T16:51:57Z

@gianm added docs.

gianm

@himanshug -- patch LGTM but I suggested some doc changes that I think will make things clearer.

gianm · 2018-08-26T00:45:46Z

 |aggregations|See [Aggregations](../querying/aggregations.html)|no|
 |postAggregations|See [Post Aggregations](../querying/post-aggregations.html)|no|
 |intervals|A JSON Object representing ISO-8601 Intervals. This defines the time ranges to run the query over.|yes|
+|subtotalsSpec| A JSON array of arrays to return additional result sets for groupings of subsets of top level `dimensions`. It is described later in more detail.|no|


Would be great to have a link to the section here, using the power of HTML.

gianm · 2018-08-26T00:45:57Z

+"type": "groupBy",
+ ...
+ ...
+"dimenstions": [


Spelling of "dimensions".

gianm · 2018-08-26T00:46:28Z

 See [Multi-value dimensions](multi-value-dimensions.html) for more details.

+### More on subtotalsSpec
+you can have a groupBy query that looks something like below...


I think it'd be nice to repeat the use case and behavior for subtotalsSpec here, since it's far away from the top-level docs and the reader might not have seen that. Also please use better grammar here. Pulling those two comments together, how about:

The subtotals feature allows computation of multiple sub-groupings in a single query. To use this feature, add a "subtotalsSpec" to your query, which should be a list of subgroup dimension sets. It should contain the "outputName" from dimensions in your "dimensions" attribute, in the same order as they appear in the "dimensions" attribute (although, of course, you may skip some). For example, consider a groupBy query like this one:

We should also mention that it adds 1 to the number of merge buffers you'll need. How about adding this to the "Memory tuning and resource limits" section later on. I believe it's accurate as of the current state of things:

Brokers do not need merge buffers for basic groupBy queries. Queries with subqueries (using a "query" dataSource (link to query datasource docs)) require one merge buffer if there is a single subquery, or two merge buffers if there is more than one layer of nested subqueries. Queries with subtotals (link to subtotals spec) need one merge buffer. These can stack on top of each other: a groupBy query with multiple layers of nested subqueries, and that also uses subtotals, will need three merge buffers.

Historicals and ingestion tasks need one merge buffer for each groupBy query, unless parallel combination (link to parallel combine section) is enabled, in which case they need two merge buffers per query.
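
The broker-side rule in the suggested doc text above can be written as a small function. This is only a restatement of that text as a sketch; the class and method names are hypothetical and this is not Druid's `countRequiredMergeBufferNum`:

```java
// Hypothetical restatement of the suggested doc text; not Druid's actual code.
final class MergeBufferSketch
{
  /**
   * Broker merge buffers for one groupBy query: 0 with no subquery, 1 with a
   * single subquery layer, 2 with more than one layer, plus 1 for subtotals.
   */
  static int brokerMergeBuffers(int subqueryLayers, boolean hasSubtotals)
  {
    if (subqueryLayers < 0) {
      throw new IllegalArgumentException("subqueryLayers must be >= 0");
    }
    int forSubqueries = Math.min(subqueryLayers, 2);
    return forSubqueries + (hasSubtotals ? 1 : 0);
  }

  public static void main(String[] args)
  {
    System.out.println(brokerMergeBuffers(0, false)); // 0
    System.out.println(brokerMergeBuffers(1, false)); // 1
    System.out.println(brokerMergeBuffers(3, true));  // 3
  }
}
```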

thanks for writing above, added/replaced.

gianm · 2018-08-26T00:48:28Z

+]
+```
+
+Note that "subtotalsSpec" must contain subsets of "outputName" from various `DimensionSpec` json blobs in `dimensions` attribute and also ordering of dimensions inside subtotal spec must be same as that inside top level "dimensions" attribute e.g. ["D2", "D1"] subtotal spec is not valid as it is not in same order.


Please add some commas into this run-on sentence, or delete it if you agree with my suggestion above to move this content into the start of the section.

himanshug · 2018-08-28T21:22:32Z

@gianm updated the docs.

gianm

Thanks!!

himanshug requested a review from gianm January 22, 2018 22:45

himanshug mentioned this pull request Jan 22, 2018

[Proposal] Add support for multiple grouping specs in groupBy query #5179

Closed

himanshug force-pushed the multi_rullup_groupby branch from 89421d3 to 42f6bad Compare January 24, 2018 16:50

himanshug added this to the 0.13.0 milestone Jan 24, 2018

himanshug force-pushed the multi_rullup_groupby branch 5 times, most recently from 2f84db7 to 4a5e63b Compare January 25, 2018 22:18

add subtotalsSpec attribute to groupBy query

3066938

himanshug force-pushed the multi_rullup_groupby branch from 4a5e63b to 3066938 Compare January 26, 2018 16:34

himanshug added 2 commits February 7, 2018 09:57

Merge remote-tracking branch 'druidio/master' into multi_rullup_groupby

7fb8ef4

dont sent subtotalsSpec to downstream nodes from broker and other updates

c7e783f

himanshug added Feature Release Notes labels Feb 7, 2018

himanshug changed the title ~~[WIP]add "subtotalsSpec" attribute to groupBy query~~ add "subtotalsSpec" attribute to groupBy query Feb 8, 2018

gianm mentioned this pull request Apr 13, 2018

Timeseries: Add "grandTotal" option. #5640

Merged

Merge remote-tracking branch 'druidio/master' into multi_rullup_groupby

7ef2310

jihoonson added the Design Review label Apr 16, 2018

gianm reviewed Apr 19, 2018

Merge remote-tracking branch 'druidio/master' into multi_rullup_groupby

76410b4

address review comment

e6746f0

himanshug force-pushed the multi_rullup_groupby branch from e694726 to e6746f0 Compare May 29, 2018 06:06

himanshug added 2 commits August 21, 2018 10:49

Merge remote-tracking branch 'apache/master' into multi_rullup_groupby

00acd87

fix checkstyle issues after merge to master

ce696f2

gianm reviewed Aug 23, 2018

add docs for subtotalsSpec feature

9e8de8b

himanshug added the Area - Querying label Aug 23, 2018

gianm reviewed Aug 26, 2018

himanshug added 2 commits August 28, 2018 10:31

Merge remote-tracking branch 'apache/master' into multi_rullup_groupby

d81e14e

address doc review comments

7135446

gianm approved these changes Aug 29, 2018

gianm merged commit 1fae651 into apache:master Aug 29, 2018

dclim mentioned this pull request Oct 10, 2018

Druid 0.13.0-incubating release notes #6442

Closed
