Skip to content

transition away from StorageAdapter#16985

Merged
clintropolis merged 16 commits intoapache:masterfrom
clintropolis:pining-for-the-fjords
Sep 9, 2024
Merged

transition away from StorageAdapter#16985
clintropolis merged 16 commits intoapache:masterfrom
clintropolis:pining-for-the-fjords

Conversation

@clintropolis
Copy link
Copy Markdown
Member

@clintropolis clintropolis commented Sep 2, 2024

Description

Follow-up to finish what #16533 and #16849 started, moving everything completely away from StorageAdapter.

changes:

  • CursorHolderFactory has been renamed to CursorFactory and moved off of StorageAdapter, instead fetched directly from the segment via asCursorFactory. The previous deprecated CursorFactory interface has been merged into StorageAdapter
  • StorageAdapter is no longer used by any engines or tests and has been marked as deprecated with default implementations of all methods that throw exceptions indicating the new methods to call instead
  • StorageAdapter methods not covered by CursorFactory (CursorHolderFactory prior to this change) have been moved into interfaces which are retrieved by Segment.as, the primary classes are the previously existing Metadata, as well as new interfaces PhysicalSegmentInspector and TopNOptimizationInspector
  • added UnnestSegment and FilteredSegment that extend WrappedSegmentReference since their StorageAdapter implementations were previously provided by WrappedSegmentReference
  • added PhysicalSegmentInspector which covers some of the previous StorageAdapter functionality which was primarily used for segment metadata queries and other metadata uses, and is implemented for QueryableIndexSegment and IncrementalIndexSegment
  • added TopNOptimizationInspector to cover the oddly specific StorageAdapter.hasBuiltInFilters implementation, which is implemented for IncrementalIndexSegment, QueryableIndexSegment, HashJoinSegment, UnnestSegment, and FilteredSegment
  • Updated all engines and tests to no longer use StorageAdapter, deleted all StorageAdapter implementations

Release note

(for developers)

The StorageAdapter interface, which is a 'public' api, has been made obsolete and is no longer implemented or used internally by Druid. Prior to Druid 31, StorageAdapter extended the CursorFactory interface, with methods canVectorize, makeCursors, and makeVectorCursor, but these methods have been moved onto StorageAdapter itself, and all methods on StorageAdapter provide default implementations which throw a Druid exception indicating their replacement.

CursorFactory has been repurposed to be the main interface which query engines use for selecting values from Druid segments. The primary method of CursorFactory is makeCursorHolder, which accepts a new CursorBuildSpec, a new container type which wraps the information previously specified in the arguments of the old CursorFactory methods, and returns a new interface CursorHolder which defines no argument versions of canVectorize, asCursor, and asVectorCursor.

An important thing to notice with the new CursorHolder interface is that the method is called asCursor instead of asCursors. CursorFactory.makeCursors previously returned a Sequence<Cursor> corresponding to the query granularity buckets, with a separate Cursor per bucket. CursorHolder.asCursor instead returns a single Cursor (equivalent to using 'ALL' granularity with the previous methhod), and a new CursorGranularizer has been added for query engines to iterate over the cursor and divide into granularity buckets. This makes the non-vectorized engine behave the same way as the vectorized query engine (with its VectorCursorGranularizer), and simplifies a lot of stuff that has to read segments particularly if it does not care about bucketing the results into granularities.

CursorFactory also contains generally useful information for query engines prior to creating a Cursor, such as the ability to get a basic RowSignature of the segment it was created from, as well as ColumnCapabilities of the base columns. This information is also available via ColumnSelectorFactory/VectorColumnSelectorFactory after the Cursor/VectorCursor have been created, and should generally be preferred over the information retrieved from CursorFactory itself.

The other functionality previously provided by StorageAdapter was often-times segment specific, so this stuff has been repurposed into a much more flexible manner of declaration and is retrieved using Segment.as. Several interfaces define the types of information which can be retrieved via Segment.as:

  • TimeBoundaryInspector inspector for min/max timestamps, if supported by the segment.
  • PhysicalSegmentInspector inspector for physical aspects of a segment useful for segment metadata queries, such as column value cardinality, min/max values, number of rows, and Metadata which contains information about how a physical segment was created during ingestion, such as aggregators used, timestamp spec, granularity, ordering
  • MaxIngestedEventTimeInspector inspector for realtime segments to get the highest available timestamp of rows processed so far

Instead of all segment types being forced to provide this information even when it is impossible or nonsensical, this new model allows us to selectively provide this information as appropriate.


This PR has:

  • been self-reviewed.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@github-actions github-actions Bot added Area - Batch Ingestion Area - Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Sep 2, 2024
@clintropolis clintropolis force-pushed the pining-for-the-fjords branch 2 times, most recently from c46230a to 6524978 Compare September 2, 2024 07:13
changes:
* CursorHolderFactory has been renamed to CursorFactory and moved off of StorageAdapter, instead fetched directly from the segment via 'asCursorFactory'. The previous deprecated CursorFactory interface has been merged into StorageAdapter
* StorageAdapter is no longer used by any engines or tests and has been marked as deprecated with default implementations of all methods that throw exceptions indicating the new methods to call instead
* StorageAdapter methods not covered by CursorFactory (CursorHolderFactory prior to this change) have been moved into interfaces which are retrieved by Segment.as, the primary classes are the previously existing Metadata, as well as new interfaces PhysicalSegmentInspector and TopNOptimizationInspector
* added UnnestSegment and FilteredSegment that extend WrappedSegmentReference since their StorageAdapter implementations were previously provided by WrappedSegmentReference
* added PhysicalSegmentInspector which covers some of the previous StorageAdapter functionality which was primarily used for segment metadata queries and other metadata uses, and is implemented for QueryableIndexSegment and IncrementalIndexSegment
* added TopNOptimizationInspector to cover the oddly specific StorageAdapter.hasBuiltInFilters implementation, which is implemented for HashJoinSegment, UnnestSegment, and FilteredSegment
* Updated all engines and tests to no longer use StorageAdapter
Comment thread processing/src/main/java/org/apache/druid/segment/Segment.java Fixed
Comment thread processing/src/main/java/org/apache/druid/segment/Segment.java Fixed
Comment thread processing/src/main/java/org/apache/druid/segment/Segment.java Fixed
Comment thread processing/src/main/java/org/apache/druid/segment/Segment.java Fixed
Comment thread processing/src/main/java/org/apache/druid/segment/Segment.java Fixed
Comment thread processing/src/main/java/org/apache/druid/segment/Segment.java Fixed
Assert.assertEquals(adapter.getNumRows(), (long) blasterFuture.get());
Assert.assertEquals(adapter.getNumRows() * 2, (long) muxerFuture.get());
Assert.assertEquals(index.size(), (long) blasterFuture.get());
Assert.assertEquals(index.size() * 2, (long) muxerFuture.get());

Check warning

Code scanning / CodeQL

Result of multiplication cast to wider type

Potential overflow in [int multiplication](1) before it is converted to long by use in an invocation context.

Assert.assertEquals(
adapter.getNumRows() * 2,
index.size() * 2,

Check warning

Code scanning / CodeQL

Result of multiplication cast to wider type

Potential overflow in [int multiplication](1) before it is converted to long by use in an invocation context.
@clintropolis clintropolis added this to the 31.0.0 milestone Sep 2, 2024
* @deprecated use {@link Segment} directly as this does nothing
*/
@Deprecated
public abstract class AbstractSegment implements Segment
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heh, no idea why git thinks this new file was a rename of this random old file i deleted

@@ -19,7 +19,38 @@

package org.apache.druid.segment;

public interface CursorHolderFactory
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, git seems to be confused here too, this is a new file not a rename 😅

@clintropolis clintropolis mentioned this pull request Sep 4, 2024
6 tasks
* @return instance of clazz, or null if the interface is not supported by this segment
*
* @see StorageAdapter storage adapter for queries. Never null.
* @see CursorFactory to make cursors to run queries
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still never null? I would hope so. Useful to say that, I think, if so.

Copy link
Copy Markdown
Member Author

@clintropolis clintropolis Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, though like, in the same way it was true for StorageAdapter. It could actually be null in practice because we got the storage adapter through ReferenceCountingSegment, which returns null for a lot of its methods if the backing segment has been closed due to being dropped, so the javadoc was kind of lies.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to specify never null. Did not clarify about ReferenceCountingSegment because its kind of true for almost every method on Segment and wasn't quite sure how to best describe it.

* @see TimeBoundaryInspector inspector for min/max timestamps, if supported by this segment.
* @see MaxIngestedEventTimeInspector inspector for {@link DataSourceMetadataResultValue#getMaxIngestedEventTime()}
* @see PhysicalSegmentInspector inspector for {@link org.apache.druid.query.metadata.SegmentAnalyzer}
* @see Metadata information about how a physical segment was created
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this not on PhysicalSegmentInspector?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess it could be, originally i was calling PhysicalSegmentInspector SegmentAnalysisInspector because it was only used by segment metadata query, so i was avoiding putting stuff that might be used elsewhere here, but numRows is used by the SegmentManager to track load/drop row count so it got a more generic name, i suppose I should move it into there

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved Metadata to PhysicalSegmentInspector.getMetadata

public DataSourceMetadataQueryRunner(Segment segment)
{
this.segmentInterval = segment.asStorageAdapter().getInterval();
this.segmentInterval = segment.getDataInterval();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These sound like they should be different (interval vs dataInterval). But neither method has javadocs. Were they really equivalent?

Copy link
Copy Markdown
Member Author

@clintropolis clintropolis Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they were equivalent as far as I could tell.

  • IncrementalIndexSegment and IncrementalIndexStorageAdapter both returned incrementalIndex.getDataInterval
  • QueryableIndexSegment and QueryableIndexStorageAdapter both returned QueryableIndex.getDataInterval
  • HashJoinSegment returns base segment data interval, HashJoinStorageAdapter returns base adapter interval
  • FrameSegment only uses segmentId with eternity interval, FrameStorageAdapter only used eternity interval
  • RowBasedSegment uses RowBasedStorageAdapter interval

final Set<String> dims = new LinkedHashSet<>();
final Set<String> ignore = new HashSet<>();
ignore.add(ColumnHolder.TIME_COLUMN_NAME);
final Metadata metadata = segment.as(Metadata.class);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't going to work for really old segments, where availableDimensions is defined but Metadata isn't. It's possible some clusters that have been around for a while still have segments that were written before Metadata was a thing.

I suppose the impact is that we'd search all columns, rather than just the dimension columns, in that case? I suppose that's probably fine. Although the situation could be improved by checking explicitly here for as(QueryableIndex.class) and using getAvailableDimensions if it's present.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switched this to prefer QueryableIndex and fallback to physical inspector/metadata if not available.

// otherwise, a filtering inner join can also filter rows.
return (T) new SimpleTopNOptimizationInspector(
baseFilter != null || clauses.stream().anyMatch(
clause -> clause.getJoinType() == JoinType.INNER && !clause.getCondition().isAlwaysTrue()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a pre-existing issue, but, this check should be:

!clause.getJoinType().isLefty() && !clause.getCondition().isAlwaysTrue()

Because JoinType.RIGHT can also filter rows from the base.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

query.getDimensionsFilter() == null &&
!storageAdapter.hasBuiltInFilters() &&
query.getIntervals().stream().anyMatch(interval -> interval.contains(storageAdapter.getInterval()))) {
(topNOptimizationInspector == null || !topNOptimizationInspector.isFiltered()) &&
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This design seems brittle— it means if a Segment wraps another Segment and filters it, but forgets to implement TopNOptimizationInspector, then it'll be treated as non-filtered and the optimization will be allowed, potentially leading to incorrect results. It would be better to design things such that we avoid the optimization in such cases.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adjusted and renamed isFiltered to areAllDictionaryIdsPresent to better reflect usage

public interface ColumnCardinalityInspector extends ColumnInspector
{
CursorHolder makeCursorHolder(CursorBuildSpec spec);
default int getColumnValueCardinality(String column)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have javadocs at least describing what scenarios can result in DimensionDictionarySelector.CARDINALITY_UNKNOWN.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is removed instead, though did javadoc the similar method in the GroupingSelector that somewhat replaced this interface

return DimensionDictionarySelector.CARDINALITY_UNKNOWN;
}
if (capabilities.is(ValueType.STRING) && capabilities.isDictionaryEncoded().isTrue()) {
final DimensionSelector dimensionSelector = makeDimensionSelector(DefaultDimensionSpec.of(columnName));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I worry about performance here. The QueryableIndex implementation of makeDimensionSelector caches selectors, but most others don't.

I wonder if we can remove this method, so we don't have to worry about performance. There's only two call sites:

  • One in GroupingEngine, where by the time this method is called, dimension selectors have already been created. The GroupingEngine could check the dimension selectors directly.
  • One in TopNQueryEngine, where this method is called before the dimension selector is created, but it probably could be restructured...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok did this, I made a GroupingSelector interface for GroupByColumnSelectorPlus and GroupByVectorColumnSelector to implement for grouping engine cardinality computation, and reworked topn to make its selectorplus outside of the run method (which is nicer too since it doesnt make a new one for every granularity bucket, which was unnecessary after the previous refactor)

changes:
* replaced ColumnCardinalityInspector with GroupingSelector which grouping engine uses to get dimension cardinality
* topN creates column selector up front to use for cardinality computation and passes it into the run method
* move Metadata to PhysicalSegmentInspector, callers now get PhysicalSegmentInspector to do metadata stuff
* search query prefer QueryableIndex before falling back to Metadata
* TopNOptimizationInspector.isFiltered renamed/flipped to areAllDictionaryIdsPresent to better reflect usage
clause -> clause.getJoinType() == JoinType.INNER && !clause.getCondition().isAlwaysTrue()
)
!(baseFilter != null || clauses.stream().anyMatch(
clause -> clause.getJoinType().isLefty() && !clause.getCondition().isAlwaysTrue()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this isn't correct. The condition we want is that the baseFilter is null, and all clauses are either lefty or always-true. Like this:

baseFilter == null && clauses.stream.allMatch(
  clause -> clause.getJoinType().isLefty() || clause.getCondition().isAlwaysTrue()
)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, yes, fixed again. I copied your suggestion wrong last time (lost the ! on lefty check) and then flipped it to match the new contract of the inspector thingy, but should have just rewritten it since this is clearer.

@clintropolis clintropolis merged commit f57cd6f into apache:master Sep 9, 2024
@clintropolis clintropolis deleted the pining-for-the-fjords branch September 9, 2024 21:55
clintropolis added a commit to clintropolis/druid that referenced this pull request Sep 9, 2024
* transition away from StorageAdapter
changes:
* CursorHolderFactory has been renamed to CursorFactory and moved off of StorageAdapter, instead fetched directly from the segment via 'asCursorFactory'. The previous deprecated CursorFactory interface has been merged into StorageAdapter
* StorageAdapter is no longer used by any engines or tests and has been marked as deprecated with default implementations of all methods that throw exceptions indicating the new methods to call instead
* StorageAdapter methods not covered by CursorFactory (CursorHolderFactory prior to this change) have been moved into interfaces which are retrieved by Segment.as, the primary classes are the previously existing Metadata, as well as new interfaces PhysicalSegmentInspector and TopNOptimizationInspector
* added UnnestSegment and FilteredSegment that extend WrappedSegmentReference since their StorageAdapter implementations were previously provided by WrappedSegmentReference
* added PhysicalSegmentInspector which covers some of the previous StorageAdapter functionality which was primarily used for segment metadata queries and other metadata uses, and is implemented for QueryableIndexSegment and IncrementalIndexSegment
* added TopNOptimizationInspector to cover the oddly specific StorageAdapter.hasBuiltInFilters implementation, which is implemented for HashJoinSegment, UnnestSegment, and FilteredSegment
* Updated all engines and tests to no longer use StorageAdapter
clintropolis added a commit that referenced this pull request Sep 10, 2024
* transition away from StorageAdapter
changes:
* CursorHolderFactory has been renamed to CursorFactory and moved off of StorageAdapter, instead fetched directly from the segment via 'asCursorFactory'. The previous deprecated CursorFactory interface has been merged into StorageAdapter
* StorageAdapter is no longer used by any engines or tests and has been marked as deprecated with default implementations of all methods that throw exceptions indicating the new methods to call instead
* StorageAdapter methods not covered by CursorFactory (CursorHolderFactory prior to this change) have been moved into interfaces which are retrieved by Segment.as, the primary classes are the previously existing Metadata, as well as new interfaces PhysicalSegmentInspector and TopNOptimizationInspector
* added UnnestSegment and FilteredSegment that extend WrappedSegmentReference since their StorageAdapter implementations were previously provided by WrappedSegmentReference
* added PhysicalSegmentInspector which covers some of the previous StorageAdapter functionality which was primarily used for segment metadata queries and other metadata uses, and is implemented for QueryableIndexSegment and IncrementalIndexSegment
* added TopNOptimizationInspector to cover the oddly specific StorageAdapter.hasBuiltInFilters implementation, which is implemented for HashJoinSegment, UnnestSegment, and FilteredSegment
* Updated all engines and tests to no longer use StorageAdapter
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants