add "subtotalsSpec" attribute to groupBy query #5280
|
In above query dimension seems to be redundant, probably dimensionsSpec can be made optional when specifying subtotalsSpec. |
|
@nishantmonu51 "dimensions" isn't just a list of strings but list of I called it "subtotalsSpec" based on the term used in oracle https://docs.oracle.com/cd/B28359_01/server.111/b28314/tdpdw_sql.htm#TDPDW00711 and that most users were familiar with, but I wouldn't mind changing the name if other proposed options make more sense. |
|
@nishantmonu51 also modified the example in PR description to highlight difference between "dimensions" and "subtotalsSpec" fields. |
|
I think this satisfies the desire to be useful for SQL GROUPING SETS; the planner would need to compute the overarching union of all GROUPING SETS, then include that in the dimensions, and then create a subtotalsSpec with any other sub GROUPING SETS. So that is good. In Druid SQL a query like yours would look like, Some questions and thoughts,
|
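The planner step described above (take the union of all GROUPING SETS as "dimensions", then emit each set as a subtotal) could be sketched roughly as follows. This is only an illustration: `GroupingSetsPlanner` and its methods are hypothetical names, not part of Druid, and the sketch assumes every grouping set is listed in the subtotalsSpec, reordered to follow the top-level dimension order.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not a Druid class.
final class GroupingSetsPlanner {
  // Union of all grouping sets, in first-seen order. This becomes "dimensions".
  static List<String> unionDimensions(List<List<String>> groupingSets) {
    Set<String> union = new LinkedHashSet<>();
    for (List<String> set : groupingSets) {
      union.addAll(set);
    }
    return new ArrayList<>(union);
  }

  // Each grouping set, rewritten so its members follow the order of "dimensions".
  static List<List<String>> toSubtotalsSpec(List<String> dimensions, List<List<String>> groupingSets) {
    List<List<String>> spec = new ArrayList<>();
    for (List<String> set : groupingSets) {
      List<String> ordered = new ArrayList<>();
      for (String dim : dimensions) {
        if (set.contains(dim)) {
          ordered.add(dim);
        }
      }
      spec.add(ordered);
    }
    return spec;
  }
}
```

Reordering each set is safe for SQL semantics because a grouping set is unordered, while the subtotal spec has to respect the order of the top-level dimensions.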
|
@gianm thanks, hopefully following answers provide further explanation.
You are right, super-set result is not returned unless it was part of subtotalsSpec for exactly the reasons you mentioned.
This patch does not add support for subtotals in groupBy-v1 which would just fail.
Possibly yes, however the current patch does not optimize for this. Maybe something that can be done as an improvement follow-up.
Druid groupBy result set always include name of all dimensions (even if they were null/empty) in each row. So, from the rows it would be identifiable when next subtotal begins. For example result for query in PR description would look something like below... |
|
@gianm let me know if the explanation in #5280 (comment) sounds sensible and then I will try and finish up this PR. |
|
@himanshug this approach does sound sensible. If you don't do the optimization to avoid materialization when the user just asks for a grand total, I encourage you to at least build the feature in such a way that it could be put in later without too much refactoring. (I think it would be common, for example getting a timeseries with a grand total) |
|
@gianm I think patch is structured to allow optimizing that use case by checking that case in GroupByStrategyV2.processSubtotalsSpec(..) and doing something else instead of materializing the result-set inside the BufferGrouper. |
|
@gianm @nishantmonu51 alright, this PR is ready now. I'm good with "subtotalsSpec" but let me know if majority likes one of the other options. I will add documentation as well once we settle on the attribute name. |
|
getting back to this after a while, I'll fix the conflict. @gianm @nishantmonu51 please take a look again and help me finish this one. |
|
Probably this should be labeled with |
|
@himanshug sure, please post when the conflict is fixed and I'll take another look. |
|
@gianm fixed the conflict. |
|
I've added |
Thank you for your patience @himanshug. Let me know what you think of the review. Btw, after this patch is in, along with #5640 we’ll be able to start implementing subtotals in SQL too 🙂
| } | ||
| if (!found) { | ||
| throw new IAE( | ||
| "Subtotal spec %s is either not a subset or items are in different order than in dimensiosn spec.", |
Spelling: dimensions. Maybe call it dimensionsSpec so it's identical to what is in the query?
fixed the spelling, in the query it is called "dimensions" so keeping that to be identical to the query.
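For readers following along, the condition behind this error message (each subtotal spec must be a subset of "dimensions" and keep the same relative order) amounts to a subsequence check. A minimal sketch, assuming output names are plain strings; `SubtotalSpecValidator` is a hypothetical name and this is not the code from the patch:

```java
import java.util.List;

// Hypothetical helper for illustration only; not the validation code in the patch.
final class SubtotalSpecValidator {
  // True if subtotalSpec is a subsequence of dimensions (same names, same relative order).
  static boolean isOrderedSubset(List<String> dimensions, List<String> subtotalSpec) {
    int pos = 0;
    for (String name : subtotalSpec) {
      // Scan forward from the last match; never look back, so order is enforced.
      while (pos < dimensions.size() && !dimensions.get(pos).equals(name)) {
        pos++;
      }
      if (pos == dimensions.size()) {
        return false;
      }
      pos++;
    }
    return true;
  }
}
```

With dimensions ["D1", "D2", "D3"], the spec ["D1", "D3"] passes while ["D2", "D1"] is rejected, because "D1" cannot be found after "D2".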
| return limitSpec; | ||
| } | ||
|
|
||
| @JsonInclude(JsonInclude.Include.NON_NULL) |
What does this @JsonInclude do? Does it mean don't write it if it's null? That's kind of cool.
| return groupByStrategy.processSubtotalsSpec( | ||
| query, | ||
| resource, | ||
| groupByStrategy.processSubqueryResult(subquery, query, resource, finalizingResults) |
Should this be query.withSubtotalsSpec(null)?
no, it's needed in the impl of processSubtotalsSpec(..)
| ).withDimensionSpecs( | ||
| Lists.transform( | ||
| queryWithoutSubtotalsSpec.getDimensions(), | ||
| (dimSpec) -> new DefaultDimensionSpec( |
This loses the type of the dimension (getOutputType) which is needed for numeric dimensions.
fixed, thanks. added a unit test too for long type dimension column
|
|
||
| for (List<String> subtotalSpec : subtotals) { | ||
| GroupByQuery subtotalQuery = queryWithoutSubtotalsSpec.withDimensionSpecs( | ||
| subtotalSpec.stream().map(s -> new DefaultDimensionSpec(s, s)).collect(Collectors.toList()) |
The dimension type is lost here too.
fixed here as well.
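The type-loss issue raised in these two comments can be illustrated without any Druid classes. The sketch below uses hypothetical stand-ins (`Dim`, `ValueType`) rather than the real `DimensionSpec` API, and shows one way to carry the output type over when building the dimensions for a subtotal: look the type up by output name instead of constructing a default, untyped spec.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; not Druid's DimensionSpec API.
final class TypedDimensions {
  enum ValueType { STRING, LONG, FLOAT }

  static final class Dim {
    final String outputName;
    final ValueType outputType;

    Dim(String outputName, ValueType outputType) {
      this.outputName = outputName;
      this.outputType = outputType;
    }
  }

  // Build the dimension list for one subtotal, keeping each dimension's output type.
  static List<Dim> forSubtotal(List<Dim> dimensions, List<String> subtotalSpec) {
    Map<String, ValueType> types = new LinkedHashMap<>();
    for (Dim d : dimensions) {
      types.put(d.outputName, d.outputType);
    }
    List<Dim> result = new ArrayList<>();
    for (String name : subtotalSpec) {
      ValueType type = types.get(name);
      if (type == null) {
        throw new IllegalArgumentException("Unknown dimension: " + name);
      }
      result.add(new Dim(name, type));
    }
    return result;
  }
}
```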
| if (!willMergeRunners) { | ||
| final int requiredMergeBufferNum = countRequiredMergeBufferNum(query, 1); | ||
| final int requiredMergeBufferNum = countRequiredMergeBufferNum(query, 1) + | ||
| (query.getSubtotalsSpec() != null ? 1 : 0); |
Do we really need an extra merge buffer when we’re computing subtotals? There’s already a requirement that the subtotals dimensions be in the same order as the top level dimensions, meaning we should be able to compute them without a big extra buffer. Just one row of scratch space plus a streaming combine.
Oh wait, I'm dumb, this isn't true. If we did a group by on A, B, C and wanted an A, C subtotal, then we'll be seeing values of C non-contiguously. Nevermind!
It would be nice in the future to optimize for the case where all subtotals can be done streaming (if they are all prefixes) but that could be future work, not in this PR.
yeah, that optimization would be nice.
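The precondition for the streaming optimization discussed here is that every subtotal is a prefix of the top-level dimensions, since only then do its groups arrive contiguously in the sorted result. A minimal sketch of that check, with hypothetical names and no claim about how Druid would implement it:

```java
import java.util.List;

// Hypothetical helper for illustration only; not part of the patch.
final class StreamingSubtotals {
  // True if subtotalSpec equals the first subtotalSpec.size() entries of dimensions.
  static boolean isPrefix(List<String> dimensions, List<String> subtotalSpec) {
    return subtotalSpec.size() <= dimensions.size()
        && dimensions.subList(0, subtotalSpec.size()).equals(subtotalSpec);
  }

  // True if every subtotal could, in principle, be computed with a streaming combine.
  static boolean allPrefixes(List<String> dimensions, List<List<String>> subtotalsSpec) {
    for (List<String> spec : subtotalsSpec) {
      if (!isPrefix(dimensions, spec)) {
        return false;
      }
    }
    return true;
  }
}
```

For dimensions ["A", "B", "C"], the subtotals ["A", "B"] and ["A"] qualify, while ["A", "C"] does not, which matches the non-contiguity case noted above.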
|
Hi @himanshug - have you had a chance to review my review? 😃 |
@gianm thanks for the review. sorry, I haven't had a chance to take another look. I'll try and finish it this week or next. |
|
@himanshug I have heard, congratulations :) The comments I had were relatively minor, I think the main interesting one was the types being lost, so we probably want some additional tests for numeric dimensions. |
|
@gianm re-merged with master and fixed build, it should be good to go now. |
gianm
left a comment
@himanshug It looks good, but can you add docs please?
|
@gianm added docs. |
gianm
left a comment
@himanshug -- patch LGTM but I suggested some doc changes that I think will make things clearer.
| |aggregations|See [Aggregations](../querying/aggregations.html)|no| | ||
| |postAggregations|See [Post Aggregations](../querying/post-aggregations.html)|no| | ||
| |intervals|A JSON Object representing ISO-8601 Intervals. This defines the time ranges to run the query over.|yes| | ||
| |subtotalsSpec| A JSON array of arrays to return additional result sets for groupings of subsets of top level `dimensions`. It is described later in more detail.|no| |
Would be great to have a link to the section here, using the power of HTML.
| "type": "groupBy", | ||
| ... | ||
| ... | ||
| "dimenstions": [ |
| See [Multi-value dimensions](multi-value-dimensions.html) for more details. | ||
|
|
||
| ### More on subtotalsSpec | ||
| you can have a groupBy query that looks something like below... |
I think it'd be nice to repeat the use case and behavior for subtotalsSpec here, since it's far away from the top-level docs and the reader might not have seen that. Also please use better grammar here. Pulling those two comments together, how about:
The subtotals feature allows computation of multiple sub-groupings in a single query. To use this feature, add a "subtotalsSpec" to your query, which should be a list of subgroup dimension sets. It should contain the "outputName" from dimensions in your "dimensions" attribute, in the same order as they appear in the "dimensions" attribute (although, of course, you may skip some). For example, consider a groupBy query like this one:
We should also mention that it adds 1 to the number of merge buffers you'll need. How about adding this to the "Memory tuning and resource limits" section later on. I believe it's accurate as of the current state of things:
Brokers do not need merge buffers for basic groupBy queries. Queries with subqueries (using a "query" dataSource (link to query datasource docs)) require one merge buffer if there is a single subquery, or two merge buffers if there is more than one layer of nested subqueries. Queries with subtotals (link to subtotals spec) need one merge buffer. These can stack on top of each other: a groupBy query with multiple layers of nested subqueries, and that also uses subtotals, will need three merge buffers.
Historicals and ingestion tasks need one merge buffer for each groupBy query, unless parallel combination (link to parallel combine section) is enabled, in which case they need two merge buffers per query.
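The broker-side counting rule in the suggested text can be written out as a small function. This is a sketch of the arithmetic as described in the comment only; `BrokerMergeBuffers` and its parameters are hypothetical names, not Druid code, and "layers" here means the number of nested "query" dataSource levels.

```java
// Hypothetical helper for illustration only; encodes the rule as stated in the review comment.
final class BrokerMergeBuffers {
  // subqueryLayers: 0 for a basic groupBy, 1 for a single subquery, 2+ for deeper nesting.
  static int required(int subqueryLayers, boolean hasSubtotals) {
    if (subqueryLayers < 0) {
      throw new IllegalArgumentException("subqueryLayers must be >= 0");
    }
    // One buffer for a single subquery, two for more than one layer; it does not grow further.
    int forSubqueries = Math.min(subqueryLayers, 2);
    // Subtotals add one more buffer on top of whatever the subqueries need.
    return forSubqueries + (hasSubtotals ? 1 : 0);
  }
}
```

So a basic groupBy needs 0, a groupBy with subtotals needs 1, and a groupBy with multiple subquery layers plus subtotals needs 3.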
thanks for writing above, added/replaced.
| ] | ||
| ``` | ||
|
|
||
| Note that "subtotalsSpec" must contain subsets of "outputName" from various `DimensionSpec` json blobs in `dimensions` attribute and also ordering of dimensions inside subtotal spec must be same as that inside top level "dimensions" attribute e.g. ["D2", "D1"] subtotal spec is not valid as it is not in same order. |
Please add some commas into this run-on sentence, or delete it if you agree with my suggestion above to move this content into the start of the section.
|
@gianm updated the docs. |
Fixes #5179
This patch introduces a "subtotalsSpec" attribute to the groupBy query. So, you might have a groupBy query that looks something like below...
Response returned would be equivalent to concatenating result of 3 groupBy queries with "dimensions" field being ["D1", "D2", "D3"], ["D1", "D3"] and ["D3"] with appropriate `DimensionSpec` json blob as used in above query.
Response for above query would look something like below...
Note that "subtotalsSpec" must contain subsets of "outputName" from various `DimensionSpec` json blobs in `dimensions` attribute and also ordering of dimensions inside subtotal spec must be same as that inside top level "dimensions" attribute e.g. ["D2", "D1"] subtotal spec is not valid as it is not in same order.
Druid SQL layer can support additional functions that could auto-generate the "subtotalsSpec" to support features similar to "ROLLUP" and "CUBE" functions in https://docs.oracle.com/cd/B28359_01/server.111/b28314/tdpdw_sql.htm#TDPDW00712
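As a rough idea of how a SQL layer might auto-generate a "subtotalsSpec" for ROLLUP and CUBE, the sketch below expands a dimension list into the corresponding grouping sets. `SubtotalsGenerator` is a hypothetical name, and the sketch assumes that an empty list (a grand total) is acceptable as a subtotal spec, which this PR description does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Druid SQL.
final class SubtotalsGenerator {
  // ROLLUP(d1, ..., dn): every prefix of the list, from longest down to the empty grand total.
  static List<List<String>> rollup(List<String> dims) {
    List<List<String>> out = new ArrayList<>();
    for (int n = dims.size(); n >= 0; n--) {
      out.add(new ArrayList<>(dims.subList(0, n)));
    }
    return out;
  }

  // CUBE(d1, ..., dn): every subset, each kept in the original dimension order.
  static List<List<String>> cube(List<String> dims) {
    int n = dims.size();
    List<List<String>> out = new ArrayList<>();
    for (int mask = (1 << n) - 1; mask >= 0; mask--) {
      List<String> subset = new ArrayList<>();
      for (int i = 0; i < n; i++) {
        if ((mask & (1 << i)) != 0) {
          subset.add(dims.get(i));
        }
      }
      out.add(subset);
    }
    return out;
  }
}
```

Because both methods emit members in the order of the input list, every generated spec satisfies the ordering rule noted above. CUBE produces 2^n specs, so it is only practical for a small number of dimensions.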