transition away from StorageAdapter #16985
changes:
* CursorHolderFactory has been renamed to CursorFactory and moved off of StorageAdapter, instead fetched directly from the segment via 'asCursorFactory'. The previous deprecated CursorFactory interface has been merged into StorageAdapter
* StorageAdapter is no longer used by any engines or tests and has been marked as deprecated with default implementations of all methods that throw exceptions indicating the new methods to call instead
* StorageAdapter methods not covered by CursorFactory (CursorHolderFactory prior to this change) have been moved into interfaces which are retrieved by Segment.as, the primary classes are the previously existing Metadata, as well as new interfaces PhysicalSegmentInspector and TopNOptimizationInspector
* added UnnestSegment and FilteredSegment that extend WrappedSegmentReference since their StorageAdapter implementations were previously provided by WrappedSegmentReference
* added PhysicalSegmentInspector which covers some of the previous StorageAdapter functionality which was primarily used for segment metadata queries and other metadata uses, and is implemented for QueryableIndexSegment and IncrementalIndexSegment
* added TopNOptimizationInspector to cover the oddly specific StorageAdapter.hasBuiltInFilters implementation, which is implemented for HashJoinSegment, UnnestSegment, and FilteredSegment
* Updated all engines and tests to no longer use StorageAdapter
| Assert.assertEquals(adapter.getNumRows(), (long) blasterFuture.get()); | ||
| Assert.assertEquals(adapter.getNumRows() * 2, (long) muxerFuture.get()); | ||
| Assert.assertEquals(index.size(), (long) blasterFuture.get()); | ||
| Assert.assertEquals(index.size() * 2, (long) muxerFuture.get()); |
CodeQL code scanning warning: Result of multiplication cast to wider type
| Assert.assertEquals( | ||
| adapter.getNumRows() * 2, | ||
| index.size() * 2, |
CodeQL code scanning warning: Result of multiplication cast to wider type
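The CodeQL warning refers to an `int` multiplication whose result is only widened to `long` after it has been computed. A minimal sketch of the difference follows; `WideningExample` and `rowCount` are hypothetical names used for illustration and are not taken from the test in this PR.

```java
// Illustration only: WideningExample and rowCount are hypothetical names.
final class WideningExample
{
  // int * int is evaluated in 32 bits, so any overflow happens before the cast to long.
  static long doubledNarrow(int rowCount)
  {
    return (long) (rowCount * 2);
  }

  // Making one operand a long forces the multiplication itself to use 64 bits.
  static long doubledWide(int rowCount)
  {
    return rowCount * 2L;
  }

  public static void main(String[] args)
  {
    System.out.println(doubledNarrow(Integer.MAX_VALUE));
    System.out.println(doubledWide(Integer.MAX_VALUE));
  }
}
```

For the small row counts typical of a unit test the two forms give the same answer, so the warning is mostly a matter of hygiene here.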
| * @deprecated use {@link Segment} directly as this does nothing | ||
| */ | ||
| @Deprecated | ||
| public abstract class AbstractSegment implements Segment |
heh, no idea why git thinks this new file was a rename of this random old file i deleted
| @@ -19,7 +19,38 @@ | |||
| package org.apache.druid.segment; | |||
| public interface CursorHolderFactory | |||
hmm, git seems to be confused here too, this is a new file not a rename 😅
| * @return instance of clazz, or null if the interface is not supported by this segment | ||
| * | ||
| * @see StorageAdapter storage adapter for queries. Never null. | ||
| * @see CursorFactory to make cursors to run queries |
Is this still never null? I would hope so. Useful to say that, I think, if so.
yea, though like, in the same way it was true for StorageAdapter. It could actually be null in practice because we got the storage adapter through ReferenceCountingSegment, which returns null for a lot of its methods if the backing segment has been closed due to being dropped, so the javadoc was kind of lies.
updated to specify never null. Did not clarify about ReferenceCountingSegment because it's kind of true for almost every method on Segment and wasn't quite sure how to best describe it.
| * @see TimeBoundaryInspector inspector for min/max timestamps, if supported by this segment. | ||
| * @see MaxIngestedEventTimeInspector inspector for {@link DataSourceMetadataResultValue#getMaxIngestedEventTime()} | ||
| * @see PhysicalSegmentInspector inspector for {@link org.apache.druid.query.metadata.SegmentAnalyzer} | ||
| * @see Metadata information about how a physical segment was created |
Why is this not on PhysicalSegmentInspector?
i guess it could be, originally i was calling PhysicalSegmentInspector SegmentAnalysisInspector because it was only used by segment metadata query, so i was avoiding putting stuff that might be used elsewhere here, but numRows is used by the SegmentManager to track load/drop row count so it got a more generic name, i suppose I should move it into there
moved Metadata to PhysicalSegmentInspector.getMetadata
| public DataSourceMetadataQueryRunner(Segment segment) | ||
| { | ||
| this.segmentInterval = segment.asStorageAdapter().getInterval(); | ||
| this.segmentInterval = segment.getDataInterval(); |
These sound like they should be different (interval vs dataInterval). But neither method has javadocs. Were they really equivalent?
they were equivalent as far as I could tell.
- `IncrementalIndexSegment` and `IncrementalIndexStorageAdapter` both returned `incrementalIndex.getDataInterval`
- `QueryableIndexSegment` and `QueryableIndexStorageAdapter` both returned `QueryableIndex.getDataInterval`
- `HashJoinSegment` returns base segment data interval, `HashJoinStorageAdapter` returns base adapter interval
- `FrameSegment` only uses segmentId with eternity interval, `FrameStorageAdapter` only used eternity interval
- `RowBasedSegment` uses `RowBasedStorageAdapter` interval
| final Set<String> dims = new LinkedHashSet<>(); | ||
| final Set<String> ignore = new HashSet<>(); | ||
| ignore.add(ColumnHolder.TIME_COLUMN_NAME); | ||
| final Metadata metadata = segment.as(Metadata.class); |
This isn't going to work for really old segments, where availableDimensions is defined but Metadata isn't. It's possible some clusters that have been around for a while still have segments that were written before Metadata was a thing.
I suppose the impact is that we'd search all columns, rather than just the dimension columns, in that case? I suppose that's probably fine. Although the situation could be improved by checking explicitly here for as(QueryableIndex.class) and using getAvailableDimensions if it's present.
switched this to prefer QueryableIndex and fall back to physical inspector/metadata if not available.
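The "prefer the index, fall back to metadata" approach discussed in this thread can be sketched roughly as below. This is an illustration under stated assumptions, not the code in the PR: `SegmentView`, `IndexView`, `MetadataView`, and `MapBackedSegment` are hypothetical stand-ins for `Segment`, `QueryableIndex`, and `Metadata`, and the fallback behaviour (drop the time column and any known metric columns) is an assumption about intent.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Stand-in types for illustration; these are NOT the real Druid interfaces.
final class SearchDimensionSketch
{
  interface SegmentView
  {
    // Mirrors the idea of Segment.as: returns null when the view is not supported.
    <T> T as(Class<T> clazz);
  }

  interface IndexView
  {
    List<String> availableDimensions();
  }

  interface MetadataView
  {
    List<String> metricNames();
  }

  // Hypothetical helper: a segment that serves whichever views were registered on it.
  static final class MapBackedSegment implements SegmentView
  {
    private final Map<Class<?>, Object> views = new HashMap<>();

    <T> MapBackedSegment with(Class<T> clazz, T view)
    {
      views.put(clazz, view);
      return this;
    }

    @Override
    public <T> T as(Class<T> clazz)
    {
      return clazz.cast(views.get(clazz));
    }
  }

  static Set<String> dimensionsToSearch(SegmentView segment, List<String> allColumns)
  {
    // Preferred path: an index view knows its dimensions even for very old segments.
    final IndexView index = segment.as(IndexView.class);
    if (index != null) {
      return new LinkedHashSet<>(index.availableDimensions());
    }
    // Fallback: search every column except the time column and any known metrics.
    final Set<String> ignore = new HashSet<>();
    ignore.add("__time");
    final MetadataView metadata = segment.as(MetadataView.class);
    if (metadata != null) {
      ignore.addAll(metadata.metricNames());
    }
    final Set<String> dims = new LinkedHashSet<>();
    for (String column : allColumns) {
      if (!ignore.contains(column)) {
        dims.add(column);
      }
    }
    return dims;
  }

  public static void main(String[] args)
  {
    final List<String> columns = Arrays.asList("__time", "dim1", "dim2", "count");
    final SegmentView segment = new MapBackedSegment()
        .with(MetadataView.class, () -> Arrays.asList("count"));
    System.out.println(dimensionsToSearch(segment, columns));
  }
}
```

When neither view is available the sketch searches all non-time columns, which matches the reviewer's description of the acceptable worst case.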
| // otherwise, a filtering inner join can also filter rows. | ||
| return (T) new SimpleTopNOptimizationInspector( | ||
| baseFilter != null || clauses.stream().anyMatch( | ||
| clause -> clause.getJoinType() == JoinType.INNER && !clause.getCondition().isAlwaysTrue() |
I think this is a pre-existing issue, but, this check should be:
!clause.getJoinType().isLefty() && !clause.getCondition().isAlwaysTrue()
Because JoinType.RIGHT can also filter rows from the base.
| query.getDimensionsFilter() == null && | ||
| !storageAdapter.hasBuiltInFilters() && | ||
| query.getIntervals().stream().anyMatch(interval -> interval.contains(storageAdapter.getInterval()))) { | ||
| (topNOptimizationInspector == null || !topNOptimizationInspector.isFiltered()) && |
This design seems brittle— it means if a Segment wraps another Segment and filters it, but forgets to implement TopNOptimizationInspector, then it'll be treated as non-filtered and the optimization will be allowed, potentially leading to incorrect results. It would be better to design things such that we avoid the optimization in such cases.
adjusted and renamed isFiltered to areAllDictionaryIdsPresent to better reflect usage
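The brittleness concern comes down to what a missing inspector means. The sketch below contrasts the two contracts; the method names `isFiltered` and `areAllDictionaryIdsPresent` come from this thread, while `TopNGateSketch` and the two nested interfaces are hypothetical stand-ins, and the null handling shown is one plausible way to consume the renamed method rather than the confirmed implementation.

```java
// Stand-in types for illustration; not the real TopNOptimizationInspector.
final class TopNGateSketch
{
  interface FilteredInspector
  {
    boolean isFiltered();
  }

  interface DictionaryIdsInspector
  {
    boolean areAllDictionaryIdsPresent();
  }

  // Original shape: a segment that forgets to provide an inspector is treated as unfiltered,
  // so the optimization is allowed by default.
  static boolean allowOptimizationPermissive(FilteredInspector inspector)
  {
    return inspector == null || !inspector.isFiltered();
  }

  // Flipped shape: the optimization is allowed only when a segment explicitly opts in,
  // so a missing inspector falls back to the safe (unoptimized) path.
  static boolean allowOptimizationConservative(DictionaryIdsInspector inspector)
  {
    return inspector != null && inspector.areAllDictionaryIdsPresent();
  }

  public static void main(String[] args)
  {
    System.out.println(allowOptimizationPermissive(null));
    System.out.println(allowOptimizationConservative(null));
  }
}
```

With the flipped contract, a wrapping segment that omits the inspector loses only an optimization instead of risking incorrect results.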
| public interface ColumnCardinalityInspector extends ColumnInspector | ||
| { | ||
| CursorHolder makeCursorHolder(CursorBuildSpec spec); | ||
| default int getColumnValueCardinality(String column) |
Should have javadocs at least describing what scenarios can result in DimensionDictionarySelector.CARDINALITY_UNKNOWN.
this is removed instead, though did javadoc the similar method in the GroupingSelector that somewhat replaced this interface
| return DimensionDictionarySelector.CARDINALITY_UNKNOWN; | ||
| } | ||
| if (capabilities.is(ValueType.STRING) && capabilities.isDictionaryEncoded().isTrue()) { | ||
| final DimensionSelector dimensionSelector = makeDimensionSelector(DefaultDimensionSpec.of(columnName)); |
Hmm I worry about performance here. The QueryableIndex implementation of makeDimensionSelector caches selectors, but most others don't.
I wonder if we can remove this method, so we don't have to worry about performance. There's only two call sites:
- One in `GroupingEngine`, where by the time this method is called, dimension selectors have already been created. The `GroupingEngine` could check the dimension selectors directly.
- One in `TopNQueryEngine`, where this method is called before the dimension selector is created, but it probably could be restructured...
ok did this, I made a GroupingSelector interface for GroupByColumnSelectorPlus and GroupByVectorColumnSelector to implement for grouping engine cardinality computation, and reworked topn to make its selectorplus outside of the run method (which is nicer too since it doesn't make a new one for every granularity bucket, which was unnecessary after the previous refactor)
changes:
* replaced ColumnCardinalityInspector with GroupingSelector which grouping engine uses to get dimension cardinality
* topN creates column selector up front to use for cardinality computation and passes it into the run method
* move Metadata to PhysicalSegmentInspector, callers now get PhysicalSegmentInspector to do metadata stuff
* search query prefer QueryableIndex before falling back to Metadata
* TopNOptimizationInspector.isFiltered renamed/flipped to areAllDictionaryIdsPresent to better reflect usage
| clause -> clause.getJoinType() == JoinType.INNER && !clause.getCondition().isAlwaysTrue() | ||
| ) | ||
| !(baseFilter != null || clauses.stream().anyMatch( | ||
| clause -> clause.getJoinType().isLefty() && !clause.getCondition().isAlwaysTrue() |
I think this isn't correct. The condition we want is that the baseFilter is null, and all clauses are either lefty or always-true. Like this:
baseFilter == null && clauses.stream().allMatch(
clause -> clause.getJoinType().isLefty() || clause.getCondition().isAlwaysTrue()
)
oops, yes, fixed again. I copied your suggestion wrong last time (lost the ! on lefty check) and then flipped it to match the new contract of the inspector thingy, but should have just rewritten it since this is clearer.
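The condition settled on in this thread (no base filter, and every clause is either lefty or always-true) can be checked against a few cases with a small sketch. `SketchJoinType` and `SketchClause` are hypothetical stand-ins for the real join classes, and the assumption that `LEFT` and `FULL` are the lefty types is mine.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Stand-in types for illustration; not the real Druid join classes.
final class JoinFilterSketch
{
  enum SketchJoinType
  {
    INNER(false), LEFT(true), RIGHT(false), FULL(true);

    private final boolean lefty;

    SketchJoinType(boolean lefty)
    {
      this.lefty = lefty;
    }

    // Assumed meaning: every row of the left-hand (base) side survives the join.
    boolean isLefty()
    {
      return lefty;
    }
  }

  static final class SketchClause
  {
    final SketchJoinType joinType;
    final boolean conditionAlwaysTrue;

    SketchClause(SketchJoinType joinType, boolean conditionAlwaysTrue)
    {
      this.joinType = joinType;
      this.conditionAlwaysTrue = conditionAlwaysTrue;
    }
  }

  // True only when nothing can remove rows from the base: no base filter, and
  // each clause either keeps all left rows or has a condition that always matches.
  static boolean noBaseRowsFiltered(boolean hasBaseFilter, List<SketchClause> clauses)
  {
    return !hasBaseFilter
           && clauses.stream().allMatch(c -> c.joinType.isLefty() || c.conditionAlwaysTrue);
  }

  public static void main(String[] args)
  {
    final List<SketchClause> clauses = Arrays.asList(
        new SketchClause(SketchJoinType.LEFT, false),
        new SketchClause(SketchJoinType.RIGHT, false)
    );
    System.out.println(noBaseRowsFiltered(false, clauses));
    System.out.println(noBaseRowsFiltered(false, Collections.<SketchClause>emptyList()));
  }
}
```

Writing the positive form with `allMatch` avoids the double negation that caused the earlier copy error.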
Description
Follow-up to finish what #16533 and #16849 started, moving everything completely away from `StorageAdapter`.
changes:
* `CursorHolderFactory` has been renamed to `CursorFactory` and moved off of `StorageAdapter`, instead fetched directly from the segment via `asCursorFactory`. The previous deprecated `CursorFactory` interface has been merged into `StorageAdapter`
* `StorageAdapter` is no longer used by any engines or tests and has been marked as deprecated with default implementations of all methods that throw exceptions indicating the new methods to call instead
* `StorageAdapter` methods not covered by `CursorFactory` (`CursorHolderFactory` prior to this change) have been moved into interfaces which are retrieved by `Segment.as`; the primary classes are the previously existing `Metadata`, as well as new interfaces `PhysicalSegmentInspector` and `TopNOptimizationInspector`
* added `UnnestSegment` and `FilteredSegment` that extend `WrappedSegmentReference` since their `StorageAdapter` implementations were previously provided by `WrappedSegmentReference`
* added `PhysicalSegmentInspector` which covers some of the previous `StorageAdapter` functionality which was primarily used for segment metadata queries and other metadata uses, and is implemented for `QueryableIndexSegment` and `IncrementalIndexSegment`
* added `TopNOptimizationInspector` to cover the oddly specific `StorageAdapter.hasBuiltInFilters` implementation, which is implemented for `IncrementalIndexSegment`, `QueryableIndexSegment`, `HashJoinSegment`, `UnnestSegment`, and `FilteredSegment`
* updated all engines and tests to no longer use `StorageAdapter`, deleted all `StorageAdapter` implementations
Release note
(for developers)
The `StorageAdapter` interface, which is a 'public' api, has been made obsolete and is no longer implemented or used internally by Druid. Prior to Druid 31, `StorageAdapter` extended the `CursorFactory` interface, with methods `canVectorize`, `makeCursors`, and `makeVectorCursor`, but these methods have been moved onto `StorageAdapter` itself, and all methods on `StorageAdapter` provide default implementations which throw a Druid exception indicating their replacement.
`CursorFactory` has been repurposed to be the main interface which query engines use for selecting values from Druid segments. The primary method of `CursorFactory` is `makeCursorHolder`, which accepts a `CursorBuildSpec`, a new container type which wraps the information previously specified in the arguments of the old `CursorFactory` methods, and returns a new interface `CursorHolder` which defines no-argument versions of `canVectorize`, `asCursor`, and `asVectorCursor`.
An important thing to notice with the new `CursorHolder` interface is that the method is called `asCursor` instead of `asCursors`. `CursorFactory.makeCursors` previously returned a `Sequence<Cursor>` corresponding to the query granularity buckets, with a separate `Cursor` per bucket. `CursorHolder.asCursor` instead returns a single `Cursor` (equivalent to using 'ALL' granularity with the previous method), and a new `CursorGranularizer` has been added for query engines to iterate over the cursor and divide it into granularity buckets. This makes the non-vectorized engine behave the same way as the vectorized query engine (with its `VectorCursorGranularizer`), and simplifies code that reads segments, particularly when it does not care about bucketing the results into granularities.
`CursorFactory` also contains generally useful information for query engines prior to creating a `Cursor`, such as the ability to get a basic `RowSignature` of the segment it was created from, as well as `ColumnCapabilities` of the base columns. This information is also available via `ColumnSelectorFactory`/`VectorColumnSelectorFactory` after the `Cursor`/`VectorCursor` have been created, and that source should generally be preferred over the information retrieved from `CursorFactory` itself.
The other functionality previously provided by `StorageAdapter` was often segment specific, so it has been reworked into a more flexible form of declaration and is retrieved using `Segment.as`. Several interfaces define the types of information which can be retrieved via `Segment.as`:
* `TimeBoundaryInspector`: inspector for min/max timestamps, if supported by the segment.
* `PhysicalSegmentInspector`: inspector for physical aspects of a segment useful for segment metadata queries, such as column value cardinality, min/max values, number of rows, and `Metadata` which contains information about how a physical segment was created during ingestion, such as aggregators used, timestamp spec, granularity, ordering
* `MaxIngestedEventTimeInspector`: inspector for realtime segments to get the highest available timestamp of rows processed so far
Instead of all segment types being forced to provide this information even when it is impossible or nonsensical, this new model allows us to selectively provide this information as appropriate.
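To illustrate the single-cursor model from the release note, the sketch below walks one time-ordered stream of rows and divides it into granularity buckets as it goes, instead of receiving one cursor per bucket. It deliberately does not use the real `Cursor` or `CursorGranularizer` APIs: `SingleCursorBucketing`, the `long[]` of timestamps standing in for a cursor, and the fixed-width bucket arithmetic are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: a long[] of row timestamps stands in for a single cursor.
final class SingleCursorBucketing
{
  // Walks the rows once and counts how many fall into each fixed-width time bucket.
  static Map<Long, Integer> countPerBucket(long[] timestamps, long bucketMillis)
  {
    final Map<Long, Integer> counts = new LinkedHashMap<>();
    for (long timestamp : timestamps) {
      // floorDiv keeps bucket starts correct for timestamps before the epoch too.
      final long bucketStart = Math.floorDiv(timestamp, bucketMillis) * bucketMillis;
      counts.merge(bucketStart, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args)
  {
    final long hour = 3_600_000L;
    final long[] rows = {0L, 10L, hour, hour + 1, 2 * hour};
    System.out.println(countPerBucket(rows, hour));
  }
}
```

A caller that does not care about granularity can simply iterate the rows and skip the bucketing step, which is the simplification the release note describes.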
This PR has: