Skip to content

add "subtotalsSpec" attribute to groupBy query#5280

Merged
gianm merged 11 commits intoapache:masterfrom
himanshug:multi_rullup_groupby
Aug 29, 2018
Merged

add "subtotalsSpec" attribute to groupBy query#5280
gianm merged 11 commits intoapache:masterfrom
himanshug:multi_rullup_groupby

Conversation

@himanshug
Copy link
Copy Markdown
Contributor

@himanshug himanshug commented Jan 22, 2018

Fixes #5179

This patch introduces a "subtotalsSpec" attribute to groupBy query . So, you might have a groupBy query that looks something like below...

{
"type": "groupBy",
 ...
 ...
"dimenstions": [
  {
  "type" : "default",
  "dimension" : "d1col",
  "outputName": "D1"
  },
  {
  "type" : "extraction",
  "dimension" : "d2col",
  "outputName" :  "D2",
  "extractionFn" : extraction_func
  },
  {
  "type":"lookup",
  "dimension":"d3col",
  "outputName":"D3",
  "name":"my_lookup"
  }
],
...
...
"subtotalsSpec":[ ["D1", "D2", D3"], ["D1", "D3"], ["D3"]],
..

}

Response returned would be equivalent to concatenating result of 3 groupBy queries with "dimensions" field being ["D1", "D2", D3"], ["D1", "D3"] and ["D3"] with appropriate DimensionSpec json blob as used in above query.
Response for above query would look something like below...

[ 
  {
    "version" : "v1",
    "timestamp" : "t1",
    "event" : { "D1": "..", "D2": "..", "D3": ".." }
    }
  }, 
    {
    "version" : "v1",
    "timestamp" : "t2",
    "event" : { "D1": "..", "D2": "..", "D3": ".." }
    }
  },
  ...
  ...

   {
    "version" : "v1",
    "timestamp" : "t1",
    "event" : { "D1": "..", "D3": ".." }
    }
  }, 
    {
    "version" : "v1",
    "timestamp" : "t2",
    "event" : { "D1": "..", "D3": ".." }
    }
  },
  ...
  ...

  {
    "version" : "v1",
    "timestamp" : "t1",
    "event" : { "D3": ".." }
    }
  }, 
    {
    "version" : "v1",
    "timestamp" : "t2",
    "event" : { "D3": ".." }
    }
  },
...
]

Note that "subtotalsSpec" must contain subsets of "outputName" from various DimensionSpec json blobs in dimensions attribute and also ordering of dimensions inside subtotal spec must be same as that inside top level "dimensions" attribute e.g. ["D2", "D1"] subtotal spec is not valid as it is not in same order.

DruidSQL layer can support additional functions that could auto-generate the "subtotalsSpec" to support features similar to "ROLLUP" and "CUBE" functions in https://docs.oracle.com/cd/B28359_01/server.111/b28314/tdpdw_sql.htm#TDPDW00712

@nishantmonu51
Copy link
Copy Markdown
Member

In above query dimension seems to be redundant, probably dimensionsSpec can be made optional when specifying subtotalsSpec.
Also, can we rename subtotalsSpec to groupingSpec or columnGroupingSpec ?

@himanshug
Copy link
Copy Markdown
Contributor Author

@nishantmonu51 "dimensions" isn't just a list of strings but list of DimensionSpec which provide additional functionality like extraction functions, lookups etc. So, in general, user should be able to provide it explicitly and not redundant.
but, yes, it could be made optional where "dimensions" is generated from given subtotals specially if their query looked like the one in example description. I don't see much value in that considering the type of queries I see. It would be a minor and independent change that could be done later as well.

I called it "subtotalsSpec" based on the term used in oracle https://docs.oracle.com/cd/B28359_01/server.111/b28314/tdpdw_sql.htm#TDPDW00711 and that most users were familiar with, but I wouldn't mind changing the name if other proposed options make more sense.

@himanshug himanshug force-pushed the multi_rullup_groupby branch from 89421d3 to 42f6bad Compare January 24, 2018 16:50
@himanshug
Copy link
Copy Markdown
Contributor Author

@nishantmonu51 also modified the example in PR description to highlight difference between "dimensions" and "subtotalsSpec" fields.

@himanshug himanshug added this to the 0.13.0 milestone Jan 24, 2018
@himanshug himanshug force-pushed the multi_rullup_groupby branch 5 times, most recently from 2f84db7 to 4a5e63b Compare January 25, 2018 22:18
@himanshug himanshug force-pushed the multi_rullup_groupby branch from 4a5e63b to 3066938 Compare January 26, 2018 16:34
@gianm
Copy link
Copy Markdown
Contributor

gianm commented Feb 2, 2018

@himanshug,

I think this satisfies the desire to be useful for SQL GROUPING SETS; the planner would need to compute the overarching union of all GROUPING SETS, then include that in the dimensions, and then create a subtotalsSpec with any other sub GROUPING SETS. So that is good.

In Druid SQL a query like yours would look like,

SELECT
  d1col AS D1,
  extraction_func(d2col) AS D2,
  LOOKUP(d3col, 'my_lookup) AS D3,
  COUNT(*)
FROM tbl
GROUP BY GROUPING SETS ( (1, 2, 3), (1, 3), (3) )

Some questions and thoughts,

  1. If you don't specify the super-set in the subtotalsSpec, do you still get it returned to you? I think it's best if the answer is "no", since the user might not actually want the super-set. It would work better for the SQL planner if the answer is "no", consider a query like GROUP BY GROUPING SETS ( (1, 2), (2, 3) ).
  2. In groupBy v2 the grouping results can be streamed back to the caller from the broker without materializing them completely, however with subtotalsSpec, would that mean the broker needs to materialize the results? If so where are they stored? (Forgive me- I did not read the patch yet!)
  3. If results do need to be materialized, I think it'd be important to optimize the case where grouping sets are like GROUP BY GROUPING SETS ( (1, 2, 3), () ) (i.e. we want a table plus a grand total). The grand total can be computed while still avoiding materialization of the full result set.
  4. In the result format is it easy to distinguish null grouping values from nulls that represent subtotals? Consider that in SQL GROUPING SETS, it is not easy without help, but luckily the GROUPING function can help. This doc describes the problem and also the GROUPING function: https://docs.microsoft.com/en-us/sql/t-sql/functions/grouping-transact-sql.

@himanshug
Copy link
Copy Markdown
Contributor Author

himanshug commented Feb 2, 2018

@gianm thanks, hopefully following answers provide further explanation.

  1. If you don't specify the super-set in the subtotalsSpec, do you still get it returned to you? I think it's best if the answer is "no", since the user might not actually want the super-set. It would work better for the SQL planner if the answer is "no", consider a query like GROUP BY GROUPING SETS ( (1, 2), (2, 3) ).

You are right, super-set result is not returned unless it was part of subtotalsSpec for exactly the reasons you mentioned.

2.In groupBy v2 the grouping results can be streamed back to the caller from the broker without materializing them completely, however with subtotalsSpec, would that mean the broker needs to materialize the results? If so where are they stored? (Forgive me- I did not read the patch yet!)

This patch does not add support for subtotals in groupBy-v1 which would just fail.
For groupBy-v2 , implementation looks very similar to that of nested groupBy. Results from super-set query are materialized inside one BufferGrouper instance (with one "merge buffer"). Then we run one query per sub-total on this BufferGrouper by iterating over the rows in it and stream-merge them. to ensure "stream-merge" would work, we enforce a constraint on the subtotals that they must have same order of dims as that in super-set (updated this constraint in PR description)

  1. if results do need to be materialized, I think it'd be important to optimize the case where grouping sets are like GROUP BY GROUPING SETS ( (1, 2, 3), () ) (i.e. we want a table plus a grand total). The grand total can be computed while still avoiding materialization of the full result set.

Possibly yes, however current patch does not optimize for this. maybe something that can be done as a improvement followup.

In the result format is it easy to distinguish null grouping values from nulls that represent subtotals? Consider that in SQL GROUPING SETS, it is not easy without help, but luckily the GROUPING function can help. This doc describes the problem and also the GROUPING function: https://docs.microsoft.com/en-us/sql/t-sql/functions/grouping-transact-sql.

Druid groupBy result set always include name of all dimensions (even if they were null/empty) in each row. So, from the rows it would be identifiable when next subtotal begins. For example result for query in PR description would look something like below...

[ 
  {
    "version" : "v1",
    "timestamp" : "t1",
    "event" : { "D1": "..", "D2": "..", "D3": ".." }
    }
  }, 
    {
    "version" : "v1",
    "timestamp" : "t2",
    "event" : { "D1": "..", "D2": "..", "D3": ".." }
    }
  },
  ...
  ...

   {
    "version" : "v1",
    "timestamp" : "t1",
    "event" : { "D1": "..", "D3": ".." }
    }
  }, 
    {
    "version" : "v1",
    "timestamp" : "t2",
    "event" : { "D1": "..", "D3": ".." }
    }
  },
  ...
  ...

  {
    "version" : "v1",
    "timestamp" : "t1",
    "event" : { "D3": ".." }
    }
  }, 
    {
    "version" : "v1",
    "timestamp" : "t2",
    "event" : { "D3": ".." }
    }
  },
...
]

@himanshug
Copy link
Copy Markdown
Contributor Author

@gianm let me know if the explanation in #5280 (comment) sounds sensible and then I will try and finish up this PR.

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Feb 6, 2018

@himanshug this approach does sound sensible. If you don't do the optimization to avoid materialization when the user just asks for a grand total, I encourage you to at least build the feature in such a way that it could be put in later without too much refactoring. (I think it would be common, for example getting a timeseries with a grand total)

@himanshug
Copy link
Copy Markdown
Contributor Author

@gianm I think patch is structured to allow optimizing that use case by checking that case in GroupByStrategyV2.processSubtotalsSpec(..) and doing something else instead of materializing the result-set inside the BufferGrouper.

@himanshug
Copy link
Copy Markdown
Contributor Author

@gianm @nishantmonu51 alright, this PR is ready now.
Instead of keeping attribute name "subtotalsSpec" , following other options are available.
"groupingSets"
"groupingsSpec"
"columnGroupingsSpec"

I'm good with "subtotalsSpec" but let me know if majority likes one of the other options.

I will add documentation as well once we settle on the attribute name.

@himanshug himanshug changed the title [WIP]add "subtotalsSpec" attribute to groupBy query add "subtotalsSpec" attribute to groupBy query Feb 8, 2018
@himanshug
Copy link
Copy Markdown
Contributor Author

getting back to this after a while, I'll fix the conflict . @gianm @nishantmonu51 please take a look again and help me finish this one.

@jihoonson
Copy link
Copy Markdown
Contributor

Probably this should be labeled with Design Review.

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Apr 12, 2018

@himanshug sure, please post when the conflict is fixed and I'll take another look.

@himanshug
Copy link
Copy Markdown
Contributor Author

@gianm fixed the conflict.

@jihoonson
Copy link
Copy Markdown
Contributor

I've added Design Review label because already two or more committers are reviewing this.

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your patience @himanshug. Let me know what you honk think of the review. Btw, after this patch is in, along with #5640 we’ll be able to start implementing subtotals in SQL too 🙂

}
if (!found) {
throw new IAE(
"Subtotal spec %s is either not a subset or items are in different order than in dimensiosn spec.",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling: dimensions. Maybe call it dimensionsSpec so it's identical to what is in the query?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed the spelling, in the query it is called "dimensions" so keeping that to be identical to the query.

return limitSpec;
}

@JsonInclude(JsonInclude.Include.NON_NULL)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this @JsonInclude do? Does it mean don't write it if it's null? That's kind of cool.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly

return groupByStrategy.processSubtotalsSpec(
query,
resource,
groupByStrategy.processSubqueryResult(subquery, query, resource, finalizingResults)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be query.withSubtotalsSpec(null)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, its needed in the impl of processSubtotalsSpec(..)

).withDimensionSpecs(
Lists.transform(
queryWithoutSubtotalsSpec.getDimensions(),
(dimSpec) -> new DefaultDimensionSpec(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loses the type of the dimension (getOutputType) which is needed for numeric dimensions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks. added an unit test too for long type dimension column


for (List<String> subtotalSpec : subtotals) {
GroupByQuery subtotalQuery = queryWithoutSubtotalsSpec.withDimensionSpecs(
subtotalSpec.stream().map(s -> new DefaultDimensionSpec(s, s)).collect(Collectors.toList())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dimension type is lost here too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed here as well.

if (!willMergeRunners) {
final int requiredMergeBufferNum = countRequiredMergeBufferNum(query, 1);
final int requiredMergeBufferNum = countRequiredMergeBufferNum(query, 1) +
(query.getSubtotalsSpec() != null ? 1 : 0);
Copy link
Copy Markdown
Contributor

@gianm gianm Apr 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need an extra merge buffer when we’re computing subtotals? There’s already a requirement that the subtotals dimensions be in the same order as the top level dimensions, meaning we should be able to compute them without a big extra buffer. Just one row of scratch space plus a streaming combine.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, I'm dumb, this isn't true. If we did a group by on A, B, C and wanted an A, C subtotal, then we'll be seeing values of C non-contiguously. Nevermind!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice in the future to optimize for the case where all subtotals can be done streaming (if they are all prefixes) but that could be future work, not in this PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that optimization would be nice.

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Apr 23, 2018

Hi @himanshug - have you had a chance to review my review? 😃

@himanshug
Copy link
Copy Markdown
Contributor Author

Hi @himanshug - have you had a chance to review my review? 😃

@gianm thanks for the review. sorry, I haven't had a chance to take another look. I'll try and finish it this week or next.
PS: not sure if you know, but I'm going through some major transitions personally this week :)

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Apr 24, 2018

@himanshug I have heard, congratulations :)

The comments I had were relatively minor, I think the main interesting one was the types being lost, so we probably want some additional tests for numeric dimensions.

@himanshug
Copy link
Copy Markdown
Contributor Author

@gianm my apologies for not being able to get back to this for so long, but finally :)

and also thanks to @a2l007 for reminders to get this done .

@himanshug himanshug force-pushed the multi_rullup_groupby branch from e694726 to e6746f0 Compare May 29, 2018 06:06
@himanshug
Copy link
Copy Markdown
Contributor Author

@gianm re-merged with master and fixed build, it should be good to go now.

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@himanshug It looks good, but can you add docs please?

@himanshug
Copy link
Copy Markdown
Contributor Author

@gianm added docs.

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@himanshug -- patch LGTM but I suggested some doc changes that I think will make things clearer.

Comment thread docs/content/querying/groupbyquery.md Outdated
|aggregations|See [Aggregations](../querying/aggregations.html)|no|
|postAggregations|See [Post Aggregations](../querying/post-aggregations.html)|no|
|intervals|A JSON Object representing ISO-8601 Intervals. This defines the time ranges to run the query over.|yes|
|subtotalsSpec| A JSON array of arrays to return additional result sets for groupings of subsets of top level `dimensions`. It is described later in more detail.|no|
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great to have a link to the section here, using the power of HTML.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Comment thread docs/content/querying/groupbyquery.md Outdated
"type": "groupBy",
...
...
"dimenstions": [
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling of "dimensions".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:)

Comment thread docs/content/querying/groupbyquery.md Outdated
See [Multi-value dimensions](multi-value-dimensions.html) for more details.

### More on subtotalsSpec
you can have a groupBy query that looks something like below...
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be nice to repeat the use case and behavior for subtotalsSpec here, since it's far away from the top-level docs and the reader might not have seen that. Also please use better grammar here. Pulling those two comments together, how about:

The subtotals feature allows computation of multiple sub-groupings in a single query. To use this feature, add a "subtotalsSpec" to your query, which should be a list of subgroup dimension sets. It should contain the "outputName" from dimensions in your "dimensions" attribute, in the same order as they appear in the "dimensions" attribute (although, of course, you may skip some). For example, consider a groupBy query like this one:

We should also mention that it adds 1 to the number of merge buffers you'll need. How about adding this to the "Memory tuning and resource limits" section later on. I believe it's accurate as of the current state of things:

Brokers do not need merge buffers for basic groupBy queries. Queries with subqueries (using a "query" dataSource (link to query datasource docs)) require one merge buffer if there is a single subquery, or two merge buffers if there is more than one layer of nested subqueries. Queries with subtotals (link to subtotals spec) need one merge buffer. These can stack on top of each other: a groupBy query with multiple layers of nested subqueries, and that also uses subtotals, will need three merge buffers.

Historicals and ingestion tasks need one merge buffer for each groupBy query, unless parallel combination (link to parallel combine section) is enabled, in which case they need two merge buffers per query.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for writing above, added/replaced.

Comment thread docs/content/querying/groupbyquery.md Outdated
]
```

Note that "subtotalsSpec" must contain subsets of "outputName" from various `DimensionSpec` json blobs in `dimensions` attribute and also ordering of dimensions inside subtotal spec must be same as that inside top level "dimensions" attribute e.g. ["D2", "D1"] subtotal spec is not valid as it is not in same order.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some commas into this run-on sentence, or delete it if you agree with my suggestion above to move this content into the start of the section.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted

@himanshug
Copy link
Copy Markdown
Contributor Author

@gianm updated the docs.

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!!

@gianm gianm merged commit 1fae651 into apache:master Aug 29, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants