SQL support for t-digest based sketch aggregators #8100
jon-wei merged 6 commits into apache:master
Hey @samarthjain, I see this error in CI. You probably need this in your pom:
Hmm, I do have that dependency added. Taking a closer look.
Oh, actually I think you need this: (The
@samarthjain Can you fix conflicts? |
Suggest changing "ingested_sketch" here to <input-column>, since the aggregator can accept raw values or pre-generated sketches
Suggest rewording this section. There's only one aggregator, since the PostAggregators don't really aggregate (they only operate on values within a single row); they're more like "post aggregation transforms".
Suggest rewriting to:
"The input field reference, which must be a PostAggregator that outputs T-Digest sketches. This can be a fieldAccess PostAggregator, which simply returns the value of the referenced input column."
quantilesFromTDigestSketch -> quantileFromTDigestSketch
Since build/merge implementations are combined, suggest calling the type "tDigestSketch"
Since this is only one fraction, suggest changing the property name to "fraction"
maxNumEntriesOperand should be renamed to compressionOperand
Since the compression operand is optional, there should be a check that aggregateCall.getArgList() has size > 1
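A minimal sketch of the kind of check suggested here. It is not the PR's code: `OptionalCompressionSketch`, `compressionOrDefault`, and the `DEFAULT_COMPRESSION` value are hypothetical, and `args` is a simplified stand-in for the resolved operands of one aggregate call (the real `aggregateCall.getArgList()` returns argument ordinals, not values).

```java
import java.util.Arrays;
import java.util.List;

final class OptionalCompressionSketch {
  // Placeholder default, for illustration only.
  static final int DEFAULT_COMPRESSION = 100;

  // args models the resolved operands of one SQL aggregate call:
  // index 0 is the input column, index 1 (if present) is the compression.
  static int compressionOrDefault(List<Object> args) {
    if (args.size() > 1) {
      // The optional operand was supplied, so read it.
      return ((Number) args.get(1)).intValue();
    }
    // The operand was omitted, so fall back to the default.
    return DEFAULT_COMPRESSION;
  }

  public static void main(String[] unused) {
    System.out.println(compressionOrDefault(Arrays.<Object>asList("column1")));
    System.out.println(compressionOrDefault(Arrays.<Object>asList("column1", 200)));
  }
}
```

Guarding on the size before reading index 1 avoids an out-of-bounds access when the caller leaves the compression operand out.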
The agg function here should be adjusted to support the optional compression param like in the quantile version
This should check that the compression and quantile parameters match as well
Hmm, not sure how to do that. Compression is part of the TDigestSketchAggregatorFactory and quantile isn't. The quantile is used in postAggregator.
Ah, it should just check compression
Can you update the docs to mention that the optional compression param is supported for the quantile SQL function as well?
So I thought about this a bit more and I am not sure what is the right thing to do. Consider a query like this:
SELECT TDIGEST_QUANTILE(column1, 0.1, 100), TDIGEST_QUANTILE(column1, 0.2, 200), TDIGEST_QUANTILE(column1, 0.2, 300) FROM FOO
This query aggregates column1 into a sketch and then applies the post aggregator three times to compute a quantile from it. The problem is that the sketch is generated only once, using the compression param 100 (from the first call, TDIGEST_QUANTILE(column1, 0.1, 100)), and the compression params from the following calls are ignored.
Is there a way to validate in Druid SQL that all the calls on TDIGEST_QUANTILE() for a column are using the same compression param? The other alternative is to not support compression param for this method and just use the default value of compression.
"The problem is that the sketch is generated only once using the compression param 100 (from the first TDIGEST_QUANTILE(column1, 0.1, 100)) and the compression params from the following calls are ignored."
If you add a check that the compression parameter matches before reusing the aggregator (#8100 (comment)), does this still occur?
Thanks, that worked.
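For illustration, the reuse check discussed in this thread can be sketched roughly as follows. This is a simplified model under stated assumptions, not the PR's implementation: `SketchAggReuseSketch`, `SketchAgg`, `findOrCreate`, and the generated names are all hypothetical. The idea is that an existing sketch aggregator is reused only when both the input column and the compression match.

```java
import java.util.ArrayList;
import java.util.List;

final class SketchAggReuseSketch {
  // Hypothetical stand-in for a sketch-building aggregator definition.
  static final class SketchAgg {
    final String name;
    final String fieldName;
    final int compression;

    SketchAgg(String name, String fieldName, int compression) {
      this.name = name;
      this.fieldName = fieldName;
      this.compression = compression;
    }
  }

  private final List<SketchAgg> existing = new ArrayList<>();

  // Reuse an aggregator only if the column AND the compression match;
  // otherwise register a new one so no compression value is silently ignored.
  SketchAgg findOrCreate(String fieldName, int compression) {
    for (SketchAgg agg : existing) {
      if (agg.fieldName.equals(fieldName) && agg.compression == compression) {
        return agg;
      }
    }
    SketchAgg created = new SketchAgg("a" + existing.size(), fieldName, compression);
    existing.add(created);
    return created;
  }

  int count() {
    return existing.size();
  }

  public static void main(String[] unused) {
    SketchAggReuseSketch registry = new SketchAggReuseSketch();
    // The three calls from the example query use compressions 100, 200, 300.
    registry.findOrCreate("column1", 100);
    registry.findOrCreate("column1", 200);
    registry.findOrCreate("column1", 300);
    System.out.println(registry.count());
  }
}
```

Under this model the example query builds three separate sketches on column1, one per compression value, and a later call with a compression that was already seen shares the existing sketch. The quantile fraction is deliberately not part of the match, since it belongs to the post aggregator.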
jon-wei left a comment:
lgtm after comments are addressed
Suggest adding tests with a quantile SQL agg that has the compression parameter specified, and where the sketch build SQL agg uses the default compression
Since SKETCH_TO_QUANTILES and SKETCH_TO_QUANTILE are postaggs, their IDs should go in PostAggregatorIds instead
Description
This PR adds SQL support for the t-digest based sketch aggregators that were added in #7331.
Additionally, this PR removes the previously added mergeTDigestSketch aggregator. The merging/combining functionality has been added to the buildTDigestSketch aggregator. The docs have also been updated with the relevant changes.
Note that a couple of tests added in this PR will fail till #8099 is merged.
For reviewers: the key changed/added classes in this PR are TDigestGenerateSketchSqlAggregator and TDigestSketchQuantileSqlAggregator.