Skip to content

remove group-by v1#14866

Merged
gianm merged 9 commits intoapache:masterfrom
clintropolis:remove-group-by-v1
Aug 23, 2023
Merged

remove group-by v1#14866
gianm merged 9 commits intoapache:masterfrom
clintropolis:remove-group-by-v1

Conversation

@clintropolis
Copy link
Copy Markdown
Member

@clintropolis clintropolis commented Aug 18, 2023

Description

This PR removes the 'v1' grouping engine as well as the strategy selection abstraction. I found GroupByStrategyV2 to still be a relatively handy thing to keep around since it is a nice container for grabbing the resources and stuff needed and plumbing to the underlying group by v2 engine, so I have transitioned it into a new class GroupingEngine which is now used in place of prior uses of GroupByStrategySelector. Alternatively its functionality could have been distributed among the query runner and toolchest, but it seemed a bit more disruptive.

Release note

The V1 group by query has been removed in favor of always using the V2 engine. Any query context parameters associated with it will now be ignored internally.

AggregatorFactory.getRequiredColumns has been deprecated and will be removed in a future release. If you have an extension that implements AggregatorFactory then this method should be removed from your implementation (there is now a default definition that throws an exception about being deprecated).


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Copy link
Copy Markdown
Contributor

@gianm gianm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally!! 🎉

I looked through and thought about whether there are other things that could be updated. What are your thoughts on these two?

  • longLast, longFirst, doubleFirst, doubleLast could stop lying about their intermediate types. IIRC they only had to lie about this to make groupBy v1 work.
  • OnheapIncrementalIndex no longer needs to support multithreaded writers, so addToFacts could be simplified.

@clintropolis
Copy link
Copy Markdown
Member Author

longLast, longFirst, doubleFirst, doubleLast could stop lying about their intermediate types. IIRC they only had to lie about this to make groupBy v1 work.

I actually did this initially, but then noticed that #14462 was also making this change, so reverted for now. That said, it would only be a minor conflict with the other PR, so Im willing to add it back.

OnheapIncrementalIndex no longer needs to support multithreaded writers, so addToFacts could be simplified

Yeah, totally, i actually imagine a lot of other parts of IncrementalIndex could also be improved. Should I do that in this PR or save it for a follow-up? In the very least i can remove the group by v1 comment in addToFacts and add another comment to replace it indicating that the condition should now be impossible and we should work to simplify the code area in the future

@gianm
Copy link
Copy Markdown
Contributor

gianm commented Aug 21, 2023

I actually did this initially, but then noticed that #14462 was also making this change, so reverted for now. That said, it would only be a minor conflict with the other PR, so Im willing to add it back.

OK, fair enough. We can leave it for #14462.

Yeah, totally, i actually imagine a lot of other parts of IncrementalIndex could also be improved. Should I do that in this PR or save it for a follow-up? In the very least i can remove the group by v1 comment in addToFacts and add another comment to replace it indicating that the condition should now be impossible and we should work to simplify the code area in the future

Up to you how much you want to do in this PR. I'd at least adjust the comment in there that mentions groupBy v1. Other stuff could be done as future work.

Comment thread docs/configuration/index.md Outdated
Note: Even if cache is enabled, for [groupBy v2](../querying/groupbyquery.md#strategies) queries, segment level cache do not work on Brokers.
See [Differences between v1 and v2](../querying/groupbyquery.md#differences-between-v1-and-v2) and [Query caching](../querying/caching.md) for more information.
Note: Even if cache is enabled, for [groupBy](../querying/groupbyquery.md) queries, segment level cache does not work on Brokers.
See [query caching](../querying/caching.md) for more information.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
See [query caching](../querying/caching.md) for more information.
See [Query caching](../querying/caching.md) for more information.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see lots of other links like this.... but why?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i mean, its in the middle of a sentence, why would we uppercase it

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, personally I find the suggested style to be better :)

Copy link
Copy Markdown
Member Author

@clintropolis clintropolis Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess like... I disagree, but we have a lot of other links doing that too.

Like take this example:
Screenshot 2023-08-22 at 9 08 33 PM

Why should the A in "Advance groupBy v2 configurations" and the M in "Memory tuning and resource limits" be uppercase?

Is this a matter of perspective? Like it bothers me because I view these inline links as just some text that is part of a sentence that happen to be clickable. But i guess another perspective might be that this is some proper section title and it should match as closely as possible? I still don't really personally like it, but i changed it because a handful of other links are like this on this and other pages (not all of them though...)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, it seems better because upper-casing highlights that part of the text. But yes, in the end, its a matter of perspective. I don't have an objective argument for either or against.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clintropolis In this specific example, Query caching is the title of the page and must be referenced using sentence case.

Copy link
Copy Markdown
Contributor

@gianm gianm Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO @ektravel's perspective makes sense. If the link text is a title of a page and is referring to the page itself then capitalizing the first letter seems good. It emphasizes that we're referring to a documentation page rather than a concept.

To me this is correct (the text refers to the page being linked to):

See [Query caching](../querying/caching.md) for more information.

And this is also correct (the text refers to a concept, not the page being linked to):

It is important to set up [query caching](../querying/caching.md) properly for best performance.

Btw, the following is also IMO correct, but awkward and I'd avoid it in favor of the first example. Here "query caching" is a common noun acting as an adjective for "documentation"; it's not referring to the page being linked to, so it shouldn't be capitalized:

See the [query caching](../querying/caching.md) documentation for more information.

Comment thread docs/querying/groupbyquery.md Outdated

|Key|Description|Default|
|---|-----------|-------|
|`groupByIsSingleThreaded`|Overrides the value of `druid.query.groupBy.singleThreaded` for this query.|
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
|`groupByIsSingleThreaded`|Overrides the value of `druid.query.groupBy.singleThreaded` for this query.|
|`groupByIsSingleThreaded`|Overrides the value of `druid.query.groupBy.singleThreaded` for this query.|None|

The entry on line 375 is missing the default.

Copy link
Copy Markdown
Contributor

@ektravel ektravel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the docs portion.

@clintropolis clintropolis removed the WIP label Aug 23, 2023
@gianm gianm merged commit 36e659a into apache:master Aug 23, 2023
@clintropolis clintropolis deleted the remove-group-by-v1 branch August 23, 2023 19:48
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
gianm added a commit to gianm/druid that referenced this pull request Jan 16, 2024
Following apache#14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by apache#15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.
gianm added a commit that referenced this pull request Jan 18, 2024
* IncrementalIndex#add is no longer thread-safe.

Following #14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by #15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.

* Spelling.

* Changes from static analysis.

* Fix javadoc.
LakshSingla pushed a commit to LakshSingla/druid that referenced this pull request Jan 30, 2024
* IncrementalIndex#add is no longer thread-safe.

Following apache#14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by apache#15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.

* Spelling.

* Changes from static analysis.

* Fix javadoc.
abhishekagarwal87 pushed a commit that referenced this pull request Jan 30, 2024
* IncrementalIndex#add is no longer thread-safe.

Following #14866, there is no longer a reason for IncrementalIndex#add
to be thread-safe.

It turns out it already was not using its selectors in a thread-safe way,
as exposed by #15615 making `testMultithreadAddFactsUsingExpressionAndJavaScript`
in `IncrementalIndexIngestionTest` flaky. Note that this problem isn't
new: Strings have been stored in the dimension selectors for some time,
but we didn't have a test that checked for that case; we only have
this test that checks for concurrent adds involving numeric selectors.

At any rate, this patch changes OnheapIncrementalIndex to no longer try
to offer a thread-safe "add" method. It also improves performance a bit
by adding a row ID supplier to the selectors it uses to read InputRows,
meaning that it can get the benefit of caching values inside the selectors.

This patch also:

1) Adds synchronization to HyperUniquesAggregator and CardinalityAggregator,
   which the similar datasketches versions already have. This is done to
   help them adhere to the contract of Aggregator: concurrent calls to
   "aggregate" and "get" must be thread-safe.

2) Updates OnHeapIncrementalIndexBenchmark to use JMH and moves it to the
   druid-benchmarks module.

* Spelling.

* Changes from static analysis.

* Fix javadoc.

Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants