Conversation
This PR begins to introduce the concept of projections to Druid datasources, which are similar to materialized views but are built into a segment, and which can automatically be used during query execution if the projection fits the query. This PR only contains the logic to build and query them for realtime queries, and does not contain the ability to serialize and actually store them in persisted segments, so it is effectively a toy right now.

changes:
* Adds ProjectionSpec interface, AggregateProjectionSpec implementation for defining rollup projections on Druid datasources
* Adds projections to DataSchema
* Adds projection building and querying support to OnHeapIncrementalIndex
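To make "used automatically if the projection fits the query" concrete, here is a minimal sketch of one possible notion of fit. It assumes a projection fits when it keeps every column the query groups on and has pre-aggregated every aggregator the query asks for. `ToyProjection` and `ProjectionSelectionSketch` are hypothetical names for illustration only, not classes from this PR, and the real matching logic also has to deal with virtual columns, filters, and time granularity.

```java
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of a rollup projection; NOT Druid's actual classes.
final class ToyProjection
{
  final String name;
  final Set<String> groupingColumns;
  final Set<String> aggregators;

  ToyProjection(String name, Set<String> groupingColumns, Set<String> aggregators)
  {
    this.name = name;
    this.groupingColumns = groupingColumns;
    this.aggregators = aggregators;
  }

  // Assumed rule: the projection fits when it retains every grouping column
  // and every aggregator that the query needs.
  boolean fits(Set<String> queryGrouping, Set<String> queryAggregators)
  {
    return groupingColumns.containsAll(queryGrouping)
           && aggregators.containsAll(queryAggregators);
  }
}

final class ProjectionSelectionSketch
{
  // Returns the name of the first fitting projection, or "base table" when none fits.
  static String choose(List<ToyProjection> projections, Set<String> grouping, Set<String> aggs)
  {
    for (ToyProjection p : projections) {
      if (p.fits(grouping, aggs)) {
        return p.name;
      }
    }
    return "base table";
  }

  public static void main(String[] args)
  {
    List<ToyProjection> projections = List.of(
        new ToyProjection("by_country", Set.of("country"), Set.of("sum_added"))
    );
    // Grouping on a column the projection kept: the projection can answer the query.
    System.out.println(choose(projections, Set.of("country"), Set.of("sum_added")));
    // Grouping on a column the projection dropped: fall back to the base table.
    System.out.println(choose(projections, Set.of("user"), Set.of("sum_added")));
  }
}
```

The fallback to the base table is what keeps projections transparent to the caller in this sketch: a query never has to name a projection.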
| public final CloserRule closer = new CloserRule(false); | ||
|
|
||
| public CursorFactoryProjectionTest( | ||
| String name, |
Check notice
Code scanning / CodeQL
Useless parameter
| // wtb some sort of virtual column comparison function that can check if projection granularity time column | ||
| // satisifies query granularity virtual column | ||
| // can rebind? q.canRebind("__time", p) | ||
| // special handle time granularity |
yea sorry, still a bunch of todos and my rambling comments all over the place. this one is about wanting to stop using Granularity at all, in favor of giving a virtual column some way to decide whether it can replace __time, to check for things like finer granularity. i'm not going to do that in this PR, it's just notes for myself, i'm still working on cleaning this up.
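A rough sketch of the kind of "finer granularity" check discussed here, under loud assumptions: buckets are fixed-length and share the same origin, so time zones and calendar periods such as months are ignored. `GranularityFitSketch` and `canServe` are hypothetical names, not anything in this PR.

```java
import java.time.Duration;

// Hypothetical helper; assumes fixed-length, identically aligned time buckets.
final class GranularityFitSketch
{
  // A projection bucketed at projectionBucket can serve a query bucketed at
  // queryBucket when every query bucket is a whole number of projection buckets.
  static boolean canServe(Duration projectionBucket, Duration queryBucket)
  {
    if (projectionBucket.isZero() || projectionBucket.isNegative()) {
      throw new IllegalArgumentException("projectionBucket must be positive");
    }
    if (queryBucket.isZero() || queryBucket.isNegative()) {
      throw new IllegalArgumentException("queryBucket must be positive");
    }
    return queryBucket.toMillis() % projectionBucket.toMillis() == 0;
  }

  public static void main(String[] args)
  {
    // Hourly projection, daily query: the hourly rows can be rolled up further.
    System.out.println(canServe(Duration.ofHours(1), Duration.ofDays(1)));
    // Daily projection, hourly query: the detail is already gone.
    System.out.println(canServe(Duration.ofDays(1), Duration.ofHours(1)));
  }
}
```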
|
|
||
| ColumnFormat getColumnFormat(String columnName); | ||
|
|
||
| int size(); |
what size is it exactly?
number of rows in the facts table, i.e. after rollup if it is a rollup facts table. will add javadoc and maybe rename, i just picked this up since it was the previous name on IncrementalIndex
should the outputname be logged in msg "Completed dim[%s] inverted with cardinality[%,d] in %,d millis." instead of dimension name?
| rowNumConversions.add(IntBuffer.wrap(arr)); | ||
| } | ||
|
|
||
| final String section = "walk through and merge rows"; |
| final String section = "walk through and merge rows"; | |
| final String section = "walk through and merge rows for projections"; |
yea, still need to adjust a lot of these things, it's sort of adapted from the regular flow since it's pretty similar in a lot of ways
…ot chill, thanks tests
| Assert.assertEquals(index.size(), (long) blasterFuture.get()); | ||
| Assert.assertEquals(index.size() * 2, (long) muxerFuture.get()); | ||
| Assert.assertEquals(index.numRows(), (long) blasterFuture.get()); | ||
| Assert.assertEquals(index.numRows() * 2, (long) muxerFuture.get()); |
Check warning
Code scanning / CodeQL
Result of multiplication cast to wider type
|
|
||
| Assert.assertEquals( | ||
| index.size() * 2, | ||
| index.numRows() * 2, |
Check warning
Code scanning / CodeQL
Result of multiplication cast to wider type
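For context on the "Result of multiplication cast to wider type" warnings: in Java, multiplying two `int` values is done in 32-bit arithmetic, and only the already-computed result is widened to `long`. The sketch below uses a hypothetical `WideningSketch` class to show the difference. For the row counts in these tests the overflow is presumably unreachable, so this only illustrates what the scanner is flagging.

```java
// Hypothetical illustration of the CodeQL finding; not code from this PR.
final class WideningSketch
{
  // The multiplication happens in int and can overflow before widening to long.
  static long doubledNarrow(int rows)
  {
    return rows * 2;
  }

  // Making one operand a long forces 64-bit multiplication.
  static long doubledWide(int rows)
  {
    return 2L * rows;
  }

  public static void main(String[] args)
  {
    System.out.println(doubledNarrow(Integer.MAX_VALUE)); // -2
    System.out.println(doubledWide(Integer.MAX_VALUE));   // 4294967294
  }
}
```

Writing the expected value as `2L * index.numRows()` would be one way to quiet the warning without changing what the assertion checks.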
| * {@link AggregateProjectionMetadata.Schema#getTimeColumnName()}). Callers must verify this externally before | ||
| * calling this method by examining {@link VirtualColumn#requiredColumns()}. | ||
| * <p> | ||
| * This method also does not handle other time expressions, or if the virtual column is just an identifier for a |
| } | ||
|
|
||
| @JsonProperty | ||
| @JsonInclude(JsonInclude.Include.NON_NULL) |
would prefer NON_EMPTY here, so it only shows up if there are really projections. Unless we think we will ever have a semantic difference between projections: null and projections: [].
| @@ -87,6 +88,7 @@ public class AutoTypeColumnMerger implements DimensionMergerV9 | |||
|
|
|||
| public AutoTypeColumnMerger( | |||
Please take a look at adding this. I think what's going on is that for regular columns, name and outputName are the same; and for projection columns, name is the parent name and outputName is the projection column name.
It might be clearer to do String name and @Nullable String parentName, i.e., make name the output name.
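A sketch of what the suggested shape could look like, assuming the reading above is right: `name` is always the output name and `parentName` is only set for projection columns. `ColumnNamingSketch` and its methods are hypothetical, not the actual `AutoTypeColumnMerger` signature.

```java
// Hypothetical sketch of the suggested (name, nullable parentName) pairing.
final class ColumnNamingSketch
{
  private final String name;        // the output name, always present
  private final String parentName;  // nullable: only set for projection columns

  ColumnNamingSketch(String name, String parentName)
  {
    if (name == null) {
      throw new IllegalArgumentException("name must not be null");
    }
    this.name = name;
    this.parentName = parentName;
  }

  String getName()
  {
    return name;
  }

  // Assumed convention: a column with a parent belongs to a projection.
  boolean isProjectionColumn()
  {
    return parentName != null;
  }

  // For regular columns the parent and output names coincide.
  String getParentNameOrSelf()
  {
    return parentName != null ? parentName : name;
  }
}
```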
| } | ||
| } | ||
|
|
||
| private Metadata makeProjections( |
This function appears to have a bunch of stuff that is adapted and remixed from other functions in this class. It would be good to share common code, if possible.
yes, i absolutely would like to do this, i feel like the base table is kind of just like another projection. This is true for building the incremental index as well, however I'd like to save both of these refactors for future work in order to minimize risk for now
|
|
||
| @Nullable | ||
| @JsonProperty | ||
| @JsonInclude(JsonInclude.Include.NON_NULL) |
or NON_EMPTY, assuming there isn't a meaningful difference between null and [].
| numAdvanced++; | ||
| } | ||
|
|
||
| done = !foundMatched && (emptyRange || !baseIter.hasNext()); |
was the clause removed here always unnecessary?
yea, intellij suggested it could be removed: emptyRange = !cursorIterable.iterator().hasNext() was defined in the constructor, and baseIter = cursorIterable.iterator() is assigned at the start of this method. the search for a match advances all the way through the iterator if it cannot find one, so !foundMatched implies that hasNext is false, which made emptyRange/!baseIter.hasNext() effectively equivalent
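The general argument can be checked in isolation with a small sketch: a search loop that only stops early on a match must have drained the iterator whenever it reports no match. `ExhaustionSketch` and `advanceToMatch` are hypothetical stand-ins, not the cursor code from this PR.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical stand-in for a "find the next matching row" loop.
final class ExhaustionSketch
{
  // Advances the iterator until an element matches; returns whether one was found.
  // The only way to return false is for hasNext() to have returned false.
  static boolean advanceToMatch(Iterator<Integer> iter, IntPredicate matcher)
  {
    while (iter.hasNext()) {
      if (matcher.test(iter.next())) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args)
  {
    Iterator<Integer> iter = List.of(1, 3, 5).iterator();
    boolean foundMatched = advanceToMatch(iter, x -> x % 2 == 0);
    // No even element exists, so the iterator is fully consumed here.
    System.out.println(!foundMatched && !iter.hasNext()); // true
  }
}
```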
| @JsonCreator | ||
| public Schema( | ||
| @JsonProperty("name") String name, | ||
| @JsonProperty("timeColumnName") @Nullable String timeColumnName, |
I think in the ideal design there is no such thing as timeColumnName. Through some introspection abilities, we should be able to select the right projections, even with time flooring, using just virtualColumns and groupingColumns. It's ok for now but something to think about for the future.
yea, i totally agree, i just did this for now to save some work of finding the time column until larger refactors can happen, and it should be harmless to remove later once that happens
| matchBuilder.addReferenceedVirtualColumn(buildSpecVirtualColumn); | ||
| final List<String> requiredInputs = buildSpecVirtualColumn.requiredColumns(); | ||
| if (requiredInputs.size() == 1 && ColumnHolder.TIME_COLUMN_NAME.equals(requiredInputs.get(0))) { | ||
| // wtb some sort of virtual column comparison function that can check if projection granularity time column |
oops, forgot about this one since it wasn't marked with // todo (clint): ... like most of my ramblings
| this.timePosition = timePosition; | ||
| Preconditions.checkArgument( | ||
| timePosition >= 0 && timePosition <= dimensionSelectors.length, | ||
| timePosition <= dimensionSelectors.length, |
I suppose timePosition can be -1 for projections, so part of this check had to go. Please update the message too.
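One way the relaxed check and its message might read, as a plain-Java sketch. It assumes, as the comment above supposes, that a negative `timePosition` means there is no time column. The class name and the message wording are hypothetical, and the real code uses `Preconditions.checkArgument` rather than an explicit throw.

```java
// Hypothetical sketch of the relaxed bounds check with a matching message.
final class TimePositionCheckSketch
{
  static int checkTimePosition(int timePosition, int numDimensions)
  {
    // Only the upper bound is enforced; negative values are allowed on the
    // assumption that they mean "no time column" (e.g. for some projections).
    if (timePosition > numDimensions) {
      throw new IllegalArgumentException(
          String.format(
              "timePosition[%d] must be at most the number of dimensions[%d]",
              timePosition,
              numDimensions
          )
      );
    }
    return timePosition;
  }
}
```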
| } | ||
| } | ||
|
|
||
| public static class OnHeapAggregateProjection implements IncrementalIndexRowSelector |
Given this is a static class, and the file is already quite large, please consider making this into its own file.
* abstract `IncrementalIndex` cursor stuff to prepare for using different "views" of the data based on the cursor build spec (#17064)
* abstract `IncrementalIndex` cursor stuff to prepare to allow for possibility of using different "views" of the data based on the cursor build spec

changes:
* introduce `IncrementalIndexRowSelector` interface to capture how `IncrementalIndexCursor` and `IncrementalIndexColumnSelectorFactory` read data
* `IncrementalIndex` implements `IncrementalIndexRowSelector`
* move `FactsHolder` interface to separate file
* other minor refactorings
* add DataSchema.Builder to tidy stuff up a bit (#17065)
* add DataSchema.Builder to tidy stuff up a bit
* fixes
* fixes
* more style fixes
* review stuff
* Projections prototype (#17214)
…pache#17314) Follow up to apache#17214, adds implementations for substituteCombiningFactory so that more datasketches aggs can match projections, along with some projections tests for datasketches.
Description
#17117 + some refactors + projections persisted segments = possibly usable prototype
todo
realtime segments:
historical segments:
projection metadata in benchmark segment:
smoosh layout:
Release note
todo