Merged
Conversation
changes: * Added `CursorBuildSpec` which captures all of the 'interesting' stuff that goes into producing a cursor as a replacement for the method arguments of `CursorFactory.canVectorize`, `CursorFactory.makeCursors`, and `CursorFactory.makeVectorCursors` * added new interfaces `CursorMaker` and new method `asCursorMaker` to `CursorFactory`, which takes a `CursorBuildSpec` as an argument and replaces `CursorFactory.canVectorize`, `CursorFactory.makeCursors`, and `CursorFactory.makeVectorCursors` * Deprecated `CursorFactory.canVectorize`, `CursorFactory.makeCursors`, and `CursorFactory.makeVectorCursors` * updated all `CursorFactory` implementations to implement `asCursorMaker` * updated all query engines to use `asCursorMaker`
asdf2014
reviewed
Jun 4, 2024
Member
asdf2014
left a comment
There was a problem hiding this comment.
This method needs to be removed as it is never used to pass the intellij-inspections check in Github Action
* CursorMaker no longer handles query granularity directly * Sequence<Cursor> of makeCursors is now just a Cursor from makeCursor * added CursorGranularizer to bucket queries by granularity instead, updated all engines that support query granularity to use this instead (topN still needs some more work) * remove Cursor.getTime * remove implementations of CursorFactory methods that are not asCursorMaker, replacing with defaults that throw exceptions
| frameWriter.addSelection(); | ||
| in.advance(); | ||
| for (; !cursor.isDoneOrInterrupted() && remainingRowsToFetch > 0; remainingRowsToFetch--) { | ||
| writer.addSelection(); |
Check warning
Code scanning / CodeQL
Dereferenced variable may be null
| in.advance(); | ||
| try(final FrameWriter writer = frameWriterFactory.newFrameWriter(columnSelectorFactory)) { | ||
| while (!cursor.isDoneOrInterrupted()) { | ||
| writer.addSelection(); |
Check warning
Code scanning / CodeQL
Dereferenced variable may be null
6 tasks
gianm
added a commit
to gianm/druid
that referenced
this pull request
Sep 25, 2024
The change in apache#16533 added CursorHolders to the processor-level Closer. This is problematic when running on an input channel: it means we started keeping around all CursorHolders for all frames we process and closing them when the channel is complete, rather than closing them as we go along. This patch restores the prior behavior, where resources are closed as we go.
gianm
added a commit
that referenced
this pull request
Sep 25, 2024
* ScanQueryFrameProcessor: Close CursorHolders as we go along. The change in #16533 added CursorHolders to the processor-level Closer. This is problematic when running on an input channel: it means we started keeping around all CursorHolders for all frames we process and closing them when the channel is complete, rather than closing them as we go along. This patch restores the prior behavior, where resources are closed as we go. * Fix other call sites. * Fix reference. * Improvements.
This was referenced Sep 26, 2024
clintropolis
added a commit
to clintropolis/druid
that referenced
this pull request
Sep 27, 2024
…pache#17175) Fixes a mistake introduced in apache#16533 which can result in CursorGranularizer incorrectly trying to get values from a selector after calling cursor.advance because of a missing check for cursor.isDone
clintropolis
added a commit
that referenced
this pull request
Sep 30, 2024
kfaraz
pushed a commit
to kfaraz/druid
that referenced
this pull request
Oct 4, 2024
…17152) * ScanQueryFrameProcessor: Close CursorHolders as we go along. The change in apache#16533 added CursorHolders to the processor-level Closer. This is problematic when running on an input channel: it means we started keeping around all CursorHolders for all frames we process and closing them when the channel is complete, rather than closing them as we go along. This patch restores the prior behavior, where resources are closed as we go. * Fix other call sites. * Fix reference. * Improvements.
gianm
added a commit
to gianm/druid
that referenced
this pull request
Oct 22, 2024
This patch re-uses timeBoundaryInspector for each cursor holder, which enables caching of minDataTimestamp and maxDataTimestamp. Fixes a performance regression introduced in apache#16533, where these fields stopped being cached across cursors. Prior to that patch, they were cached in the QueryableIndexStorageAdapter.
gianm
added a commit
that referenced
this pull request
Nov 6, 2024
This patch re-uses timeBoundaryInspector for each cursor holder, which enables caching of minDataTimestamp and maxDataTimestamp. Fixes a performance regression introduced in #16533, where these fields stopped being cached across cursors. Prior to that patch, they were cached in the QueryableIndexStorageAdapter.
Open
jtuglu1
pushed a commit
to jtuglu1/druid
that referenced
this pull request
Nov 20, 2024
This patch re-uses timeBoundaryInspector for each cursor holder, which enables caching of minDataTimestamp and maxDataTimestamp. Fixes a performance regression introduced in apache#16533, where these fields stopped being cached across cursors. Prior to that patch, they were cached in the QueryableIndexStorageAdapter.
clintropolis
pushed a commit
to clintropolis/druid
that referenced
this pull request
Nov 22, 2024
This patch re-uses timeBoundaryInspector for each cursor holder, which enables caching of minDataTimestamp and maxDataTimestamp. Fixes a performance regression introduced in apache#16533, where these fields stopped being cached across cursors. Prior to that patch, they were cached in the QueryableIndexStorageAdapter.
clintropolis
pushed a commit
to clintropolis/druid
that referenced
this pull request
Nov 22, 2024
This patch re-uses timeBoundaryInspector for each cursor holder, which enables caching of minDataTimestamp and maxDataTimestamp. Fixes a performance regression introduced in apache#16533, where these fields stopped being cached across cursors. Prior to that patch, they were cached in the QueryableIndexStorageAdapter.
clintropolis
added a commit
that referenced
this pull request
Nov 24, 2024
This patch re-uses timeBoundaryInspector for each cursor holder, which enables caching of minDataTimestamp and maxDataTimestamp. Fixes a performance regression introduced in #16533, where these fields stopped being cached across cursors. Prior to that patch, they were cached in the QueryableIndexStorageAdapter. Co-authored-by: Gian Merlino <gianmerlino@gmail.com>
clintropolis
added a commit
to clintropolis/druid
that referenced
this pull request
Dec 13, 2024
changes: * fix issue where topN with query granularity other than ALL would use the heap algorithm when it was actual able to use the pooled algorithm, and incorrectly used the pool algorithm in cases where it must use the heap algorithm, a regression from apache#16533 * fix issue where topN with query granularity other than ALL could incorrectly process values in the wrong time bucket, another regression from apache#16533
3 tasks
cryptoe
pushed a commit
that referenced
this pull request
Dec 17, 2024
* topn with granularity regression fixes changes: * fix issue where topN with query granularity other than ALL would use the heap algorithm when it was actual able to use the pooled algorithm, and incorrectly used the pool algorithm in cases where it must use the heap algorithm, a regression from #16533 * fix issue where topN with query granularity other than ALL could incorrectly process values in the wrong time bucket, another regression from #16533 * move defensive check outside of loop * more test * extra layer of safety * move check outside of loop * fix spelling * add query context parameter to allow using pooled algorithm for topN when multi-passes is required even wihen query granularity is not all * add comment, revert IT context changes and add new context flag
clintropolis
added a commit
to clintropolis/druid
that referenced
this pull request
Dec 17, 2024
* topn with granularity regression fixes changes: * fix issue where topN with query granularity other than ALL would use the heap algorithm when it was actual able to use the pooled algorithm, and incorrectly used the pool algorithm in cases where it must use the heap algorithm, a regression from apache#16533 * fix issue where topN with query granularity other than ALL could incorrectly process values in the wrong time bucket, another regression from apache#16533 * move defensive check outside of loop * more test * extra layer of safety * move check outside of loop * fix spelling * add query context parameter to allow using pooled algorithm for topN when multi-passes is required even wihen query granularity is not all * add comment, revert IT context changes and add new context flag
cryptoe
pushed a commit
that referenced
this pull request
Dec 18, 2024
* topn with granularity regression fixes (#17565) * topn with granularity regression fixes changes: * fix issue where topN with query granularity other than ALL would use the heap algorithm when it was actual able to use the pooled algorithm, and incorrectly used the pool algorithm in cases where it must use the heap algorithm, a regression from #16533 * fix issue where topN with query granularity other than ALL could incorrectly process values in the wrong time bucket, another regression from #16533 * move defensive check outside of loop * more test * extra layer of safety * move check outside of loop * fix spelling * add query context parameter to allow using pooled algorithm for topN when multi-passes is required even wihen query granularity is not all * add comment, revert IT context changes and add new context flag * remove unused
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This PR reworks
CursorandVectorCursorcreation fromCursorFactory(StorageAdapter) to be more efficient and push additional details about the query into the process with a newly introducedCursorBuildSpec. A newCursorHolderinterface has been added that is constructed by another new interface,CursorHolderFactorywith a methodmakeCursorHolder, which accepts theCursorBuildSpecand replaces the arguments ofCursorFactory.makeCursorsandCursorFactory.makeVectorCursors.The primary goal here is to make it easy to push these details down to cursor creation for an upcoming feature under development, projections, which are effectively a type of materialized views that live within segments and will be created at ingestion time. More details on this coming later once I finish writing the proposal, but basically we need to know stuff like what grouping columns, filters, aggregations, etc, are involved in a query in order to be able to automatically select the appropriate projection available within the segment, or using the regular default segment rows if no matching projections are available.
In addition to the primary goal, this refactor also allows cursor creation to be more efficient. Since
CursorHolderis created from theCursorBuildSpec, thecanVectorize,asCursor, andasVectorCursormethods can all share and re-use the same resources.Finally, since this PR is introducing new cursor building interfaces, it is also taking the opportunity to downplay the prominence of time ordering, particularly query granularity and its role in cursors.
CursorFactory.makeCursorspreviously returned aSequence<Cursor>, corresponding to query granularity buckets. It has been transitioned to no longer do this inCursorHolder.asCursor, instead returning a singleCursor, equivalent to usingALLgranularity with the previous method but without being wrapped in aSequence. Query granularity in engines is now handled using a newCursorGranularizerwhich is equivalent to how the vectorized engine works, simplifying query engines which do not care about query granularity, and making things easier for engines in the future to have segments which are not sorted by time.StorageAdapteris a@PublicApi, and extendsCursorFactory, so a default implementation ofCursorHolderFactorythat is backed by the oldcanVectorize,makeCursors, andmakeVectorCursormethods is provided. That said, all of the built-in query engines and tests have been migrated to usemakeCursorHolder, allCursorFactoryimplementations have been migrated toCursorHolderFactory, and the old methods ofCursorFactoryhave been marked as@Deprecated, and implementations ofCursorFactoryhave been completely removed.summary of changes
CursorBuildSpecwhich captures all of the 'interesting' stuff that goes into producing a cursor as a replacement for the method arguments ofCursorFactory.canVectorize,CursorFactory.makeCursor, andCursorFactory.makeVectorCursorCursorHolderand new interfaceCursorHolderFactoryas a replacement forCursorFactory, with methodmakeCursorHolder, which takes aCursorBuildSpecas an argument and replacesCursorFactory.canVectorize,CursorFactory.makeCursor, andCursorFactory.makeVectorCursorCursorFactory.makeCursorspreviously returned aSequence<Cursor>corresponding to the query granularity buckets, with a separateCursorper bucket.CursorHolder.asCursorinstead returns a singleCursor(equivalent to 'ALL' granularity), and a newCursorGranularizerhas been added for query engines to iterate over the cursor and divide into granularity buckets. This makes the non-vectorized engine behave the same way as the vectorized query engine (with itsVectorCursorGranularizer), and simplifies a lot of stuff that has to read segments particularly if it does not care about bucketing the results into granularities.CursorFactory,CursorFactory.canVectorize,CursorFactory.makeCursors, andCursorFactory.makeVectorCursorStorageAdapterimplementations to implementmakeCursorHolder, transitioned directCursorFactoryimplementations to instead implementCursorMakerFactory.StorageAdapterbeing aCursorMakerFactoryis intended to be a transitional thing, ideally will not be released in favor of movingCursorMakerFactoryto be fetched directly fromSegment, however this PR was already large enough so this will be done in a follow-up.makeCursorHolder, granularity based engines to useCursorGranularizer.Release note
(for developers)
Use notes from #16985 instead
There is a change to theStorageAdapterinterface, which is a 'public' api for extension writers to add additional types of segments to participate in Druids query engines.StorageAdapterextends theCursorFactoryinterface, whose methodscanVectorize,makeCursors, andmakeVectorCursorhave been deprecated. This interface is replaced with a newCursorHolderFactory, whichStorageAdapterimplements, with methodmakeCursorHolder, which accepts aCursorBuildSpecand returns a new interfaceCursorHolderwhich defines no argument versions ofcanVectorize,asCursor, andasVectorCursor.A default implementation ofmakeCursorHolderis provided that uses the existing deprecated methods, so no immediate action is needed forStorageAdapterimplementors on upgrade, but implementors should plan in the future to migrate to implementingCursorHolderFactorydirectly. Custom query engines can no longer use these deprecated methods, as they have been removed from allStorageAdapterimplementations. UseCursorHolderinstead.CursorHolder.asCursoris equivalent to callingCursorFactory.getCursorswithALLgranularity. Any custom engines which need to bucket the cursor by query granularity should instead useCursorGranularizer, which can be used to produce a set of intervals and advance the cursor within those intervals.(for users)
Due to some internal refactoring, TopN queries no longer can use the "pooled" algorithms internally for query granularities other than 'ALL', unless the columns value cardinality is smaller than the internally computed "numValuesPerPass", which is based on processing buffer size and projected aggregator size. This is not completely obvious from a users perspective, so the only indicator might be a performance difference for TopN queries with other query granularities. SQL will only ever plan to a TopN query with 'ALL' granularity so it is not affected.
Key changed/added classes in this PR
CursorHolderFactoryCursorHolderCursorBuildSpecCursorGranularizerCursorOffsetCursorFactory(deprecated)GroupingEngine(to useCursorHolderandCursorGranularizer)TimeseriesQueryEngine(to useCursorHolderandCursorGranularizer)TopNQueryEngine(to useCursorHolderandCursorGranularizer)CursorHolderCursorFactorymethods now useCursorHolderThis PR has: