Skip to content

SQL support for t-digest based sketch aggregators#8100

Merged
jon-wei merged 6 commits intoapache:masterfrom
samarthjain:tdigestsqlsupport
Aug 5, 2019
Merged

SQL support for t-digest based sketch aggregators#8100
jon-wei merged 6 commits intoapache:masterfrom
samarthjain:tdigestsqlsupport

Conversation

@samarthjain
Copy link
Copy Markdown
Contributor

Description

This PR adds support for t-digest based sketch aggregators that was added in #7331

Additionally this PR removes previously added mergeTDigestSketch aggregator. The merging/combining functionality has been added in buildTDigestSketch aggregator. The docs also have been updated with the relevant changes.

Note that a couple of tests added in this PR will fail till #8099 is merged.

This PR has:

  • [X ] been self-reviewed.
    • [X ] using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
  • [ X] added documentation for new or modified features or behaviors.
  • [ X] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • [X ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • [ X] added unit tests or modified existing tests to cover new code paths.
  • added integration tests.
  • [ X] been tested in a test Druid cluster.

For reviewers: the key changed/added classes in this PR are TDigestGenerateSketchSqlAggregator, and TDigestSketchQuantileSqlAggregator.

@gianm gianm requested a review from jon-wei July 19, 2019 20:53
@samarthjain samarthjain force-pushed the tdigestsqlsupport branch 2 times, most recently from bdbe581 to adeac28 Compare July 26, 2019 20:56
@gianm
Copy link
Copy Markdown
Contributor

gianm commented Jul 26, 2019

Hey @samarthjain, I see this error in CI,

Caused by: java.lang.ClassNotFoundException: org.apache.druid.query.expression.LookupEnabledTestExprMacroTable
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 28 more

You probably need this in your pom:

        <dependency>
            <groupId>org.apache.druid</groupId>
            <artifactId>druid-server</artifactId>
            <version>${project.parent.version}</version>
            <scope>test</scope>
        </dependency>

@samarthjain
Copy link
Copy Markdown
Contributor Author

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Jul 26, 2019

Oh, actually I think you need this:

        <dependency>
            <groupId>org.apache.druid</groupId>
            <artifactId>druid-server</artifactId>
            <version>${project.parent.version}</version>
            <type>test-jar</type>
            <scope>test</scope>
        </dependency>

(The <type>test-jar</type> part in particular)

@jon-wei
Copy link
Copy Markdown
Contributor

jon-wei commented Jul 29, 2019

@samarthjain Can you fix conflicts?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest changing "ingested_sketch" here to <input-column>, since the aggregator can accept raw values or pre-generated sketches

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest rewording this section, there's only one aggregator since the PostAggregators don't really aggregate (they only operate on values within a single row), they're more like "post aggregation transforms".

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest rewriting to:

"The input field reference, which must be a PostAggregator that outputs T-Digest sketches. This can be a fieldAccess PostAggregator, which simply returns the value of the referenced input column."

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quantilesFromTDigestSketch -> quantileFromTDigestSketch

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since build/merge implementations are combined, suggest calling the type "tDigestSketch"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is only one fraction, suggest changing the property name to "fraction"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maxNumEntriesOperand should be renamed to compressionOperand

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the compression operand is optional, there should be a check that aggregateCall.getArgList() has size > 1

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The agg function here should be adjusted to support the optional compression param like in the quantile version

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should check that the compression and quantile parameters match as well

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, not sure how to do that. Compression is part of the TDigestSketchAggregatorFactory and quantile isn't. The quantile is used in postAggregator.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it should just check compression

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the docs to mention that optional compression param is supported for the quantile SQL function as well

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I thought about this a bit more and I am not sure what is the right thing to do. Consider a query like this:

SELECT TDIGEST_QUANTILE(column1, 0.1, 100), TDIGEST_QUANTILE(column1, 0.2, 200), TDIGEST_QUANTILE(column1, 0.2, 300) FROM FOO

This query does the aggregation to generate a sketch on column1 and then applies the post aggregator to compute quantile out of it (3 times). The problem is that the sketch is generated only once using the compression param 100 (from the first TDIGEST_QUANTILE(column1, 0.1, 100) and the compression params from the following calls are ignored.

Is there a way to validate in Druid SQL that all the calls on TDIGEST_QUANTILE() for a column are using the same compression param? The other alternative is to not support compression param for this method and just use the default value of compression.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that the sketch is generated only once using the compression param 100 (from the first TDIGEST_QUANTILE(column1, 0.1, 100) and the compression params from the following calls are ignored.

If you add a check that the compression parameter matches before reusing the
aggregator (#8100 (comment)), does this still occur?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that worked.

Copy link
Copy Markdown
Contributor

@jon-wei jon-wei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm after comments are addressed

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest adding tests with a quantile SQL agg that has the compression parameter is specified, and where the sketch build SQL agg uses default compression

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since SKETCH_TO_QUANTILES and SKETCH_TO_QUANTILE are postaggs, their IDs should go in PostAggregatorIds instead

@jon-wei jon-wei merged commit 93cf9d4 into apache:master Aug 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants