Skip to content

Update the docs for EARLIEST_BY/LATEST_BY aggregators with the newly added numeric capabilities#15670

Merged
LakshSingla merged 8 commits intoapache:masterfrom
LakshSingla:first-last-changes
Feb 1, 2024
Merged

Update the docs for EARLIEST_BY/LATEST_BY aggregators with the newly added numeric capabilities#15670
LakshSingla merged 8 commits intoapache:masterfrom
LakshSingla:first-last-changes

Conversation

@LakshSingla
Copy link
Copy Markdown
Contributor

@LakshSingla LakshSingla commented Jan 11, 2024

Description

This PR:

  • Enables MSQ tests on queries using EARLIEST/LATEST/EARLIEST_BY/LATEST_BY aggregators
  • Removes the limitations in the docs, since numeric first/last can now be used with MSQ and at ingestion time
  • (THIS IS REMOVED FROM THIS PATCH) Disallows EARLIEST_BY and LATEST_BY to be used with rolled-up pairLongObjects. This is done to prevent the caller from supplying the timeExpr, which will get ignored by the native engine, and might be unexpected behavior. The correct way to further aggregate such columns would be using EARLIEST/LATEST, where the caller understands that the time column would be implicitly taken from the rolled-up metric. In the following example:
-- Insert data into a table using the following query. 'finalize' should be false in the query context to enable rollup
INSERT INTO dim1 foo EARLIEST_BY(m1, timestampCol1) FROM EXTERN(...) GROUP BY dim1

-- Rollup the pre-aggregated metric, with a different timestamp column
-- In such a case, the native aggregator will ignore the value from the timestampCol2 and use the value that was aggregated during the ingestion. To prevent such errors, the call is disallowed, and user friendly message is thrown
SELECT EARLIEST_BY(m1, timestampCol2) FROM foo
  
-- Rollup, with the column that was used during the ingestion time
SELECT EARLIEST(m1) FROM foo

First/Last aggregators call .toString() on complex metrics (that aren't type of pairLongLong, pairLongString...) and array types, which is also weird, however, that hasn't been changed, because that has been supported for a long time, and is also documented implicitly.

Disallowing EARLIEST_BY(aggregatedMetric, timestampCol2) will call the users to change their queries, however the equivalent call to this is EARLIEST(aggregatedMetric), which is a lot more clear, as the explicitly typed column by the user isn't ignored.

Release note

EARLIEST_BY and LATEST_BY cannot be used with complex objects created during ingestion (with rollup) with the first/last aggregators.


Key changed/added classes in this PR

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@LakshSingla LakshSingla changed the title Refine EARLIEST/LATEST aggregators Disallow the EARLIEST_BY/LATEST_BY aggregators from aggregating on complex long(Type) pair objects Jan 11, 2024
@cheddar
Copy link
Copy Markdown
Contributor

cheddar commented Jan 12, 2024

the earliest/latest/earliest_by/latest_by should all be deprecated anyway. Please redirect this work to finishing up this PR: #14195

|| SerializablePairLongDoubleComplexMetricSerde.TYPE_NAME.equals(complexColumnTypeName)
|| SerializablePairLongStringComplexMetricSerde.TYPE_NAME.equals(complexColumnTypeName))) {
plannerContext.setPlanningError(
"Cannot call %s with an explicit 'timeExpr' column for pre-aggregated metric of type [%s]. Use %s instead "
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't look right to me - EARLIEST/LATEST is rewritten to their *_BY variants by #15095

this message would suggest that something like:

  @Test
  public void testEarliestWorks1()
  {
    testBuilder()
        .sql("SELECT EARLIEST(long_last_added) FROM wikipedia_first_last")
        .run();
  }

should pass - however it fails with the same error message

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment. I tried to add an isReplaced flag to distinguish between the two calls, but it seems like during the SQL planning phase of calcite, it rewrites the call to lose that information and uses the same SQL function for both variants (earliest, earliest_by). I cannot do a type check on the custom __time operand that was passed, therefore the next best resort is to just confirm if the identifier is __time or not.
This would mean EARLIEST_BY(complexMetric, __time) would also pass, but I couldn't come up with an easy fix for the same.
Since these will be deprecated sometime, I think its an acceptable compromise.

@LakshSingla
Copy link
Copy Markdown
Contributor Author

Since it's a tiny PR, and with the addition of the pairLongNumeric types, I wanted to tighten up the SQL handling, I am cool with working on both the PRs. Testing stuff out with MSQ is a way to ensure that we have parity with the native engine till the time old aggregators exist.

@cryptoe cryptoe added this to the Druid 29.0.0 milestone Jan 17, 2024
Comment thread docs/querying/sql-aggregations.md Outdated
Comment thread docs/querying/sql-aggregations.md Outdated
Copy link
Copy Markdown
Member

@vtlim vtlim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just need to fix the missing > on the first <br />. Otherwise docs look good.

Comment thread docs/querying/sql-aggregations.md Outdated
Comment thread docs/querying/sql-aggregations.md Outdated
LakshSingla and others added 2 commits February 1, 2024 00:50
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
@LakshSingla LakshSingla changed the title Disallow the EARLIEST_BY/LATEST_BY aggregators from aggregating on complex long(Type) pair objects Update the docs for EARLIEST_BY/LATEST_BY aggregators with the newly added numeric capabilities Feb 1, 2024
@LakshSingla LakshSingla merged commit 7d65caf into apache:master Feb 1, 2024
LakshSingla added a commit to LakshSingla/druid that referenced this pull request Feb 1, 2024
@LakshSingla LakshSingla deleted the first-last-changes branch February 1, 2024 04:55
abhishekagarwal87 pushed a commit that referenced this pull request Feb 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants