Projections prototype by clintropolis · Pull Request #17214 · apache/druid

clintropolis · 2024-10-01T20:30:28Z

Description

#17117 + some refactors + projections persisted segments = possibly usable prototype

todo

SELECT string2, APPROX_COUNT_DISTINCT_DS_HLL("string5") FROM "druid"."projections" GROUP BY 1 ORDER BY 2
SELECT string2, SUM(long4) FROM "druid"."projections" GROUP BY 1 ORDER BY 2

realtime segments:

Benchmark                         (complexCompression)  (query)  (rowsPerSegment)  (schema)  (storageType)  (stringEncoding)  (useProjections)  (vectorize)  Mode  Cnt   Score   Error  Units
SqlProjectionsBenchmark.querySql                  none        0            150000  explicit    INCREMENTAL              UTF8              true        false  avgt    5   6.805 ± 0.928  ms/op
SqlProjectionsBenchmark.querySql                  none        0            150000  explicit    INCREMENTAL              UTF8             false        false  avgt    5  25.313 ± 2.134  ms/op
SqlProjectionsBenchmark.querySql                  none        1            150000  explicit    INCREMENTAL              UTF8              true        false  avgt    5   2.058 ± 0.429  ms/op
SqlProjectionsBenchmark.querySql                  none        1            150000  explicit    INCREMENTAL              UTF8             false        false  avgt    5  22.030 ± 2.828  ms/op

historical segments:

Benchmark                         (complexCompression)  (query)  (rowsPerSegment)  (schema)  (storageType)  (stringEncoding)  (useProjections)  (vectorize)  Mode  Cnt    Score    Error  Units
SqlProjectionsBenchmark.querySql                   lz4        0           1500000  explicit           MMAP              UTF8              true        false  avgt    5   18.742 ±  0.959  ms/op
SqlProjectionsBenchmark.querySql                   lz4        0           1500000  explicit           MMAP              UTF8              true        force  avgt    5   18.638 ±  0.775  ms/op
SqlProjectionsBenchmark.querySql                   lz4        0           1500000  explicit           MMAP              UTF8             false        false  avgt    5  180.287 ± 11.416  ms/op
SqlProjectionsBenchmark.querySql                   lz4        0           1500000  explicit           MMAP              UTF8             false        force  avgt    5  138.844 ± 10.884  ms/op
SqlProjectionsBenchmark.querySql                   lz4        1           1500000  explicit           MMAP              UTF8              true        false  avgt    5    1.764 ±  0.271  ms/op
SqlProjectionsBenchmark.querySql                   lz4        1           1500000  explicit           MMAP              UTF8              true        force  avgt    5    1.751 ±  0.408  ms/op
SqlProjectionsBenchmark.querySql                   lz4        1           1500000  explicit           MMAP              UTF8             false        false  avgt    5   48.939 ±  2.440  ms/op
SqlProjectionsBenchmark.querySql                   lz4        1           1500000  explicit           MMAP              UTF8             false        force  avgt    5   13.600 ±  0.522  ms/op

projection metadata in benchmark segment:

  "projections" : [ {
    "type" : "aggregate",
    "schema" : {
      "name" : "string2_hourly_sums_hll",
      "timeColumnName" : "__gran",
      "virtualColumns" : [ {
        "type" : "expression",
        "name" : "__gran",
        "expression" : "timestamp_floor(__time,'PT1H')",
        "outputType" : "LONG"
      } ],
      "groupingColumns" : [ "string2", "__gran" ],
      "aggregators" : [ {
        "type" : "longSum",
        "name" : "long4_sum",
        "fieldName" : "long4"
      }, {
        "type" : "doubleSum",
        "name" : "double2_sum",
        "fieldName" : "double2"
      }, {
        "type" : "HLLSketchBuild",
        "name" : "hll_string5",
        "fieldName" : "string5",
        "lgK" : 12,
        "tgtHllType" : "HLL_4",
        "shouldFinalize" : false,
        "round" : true
      } ],
      "ordering" : [ {
        "columnName" : "string2",
        "order" : "ascending"
      }, {
        "columnName" : "__gran",
        "order" : "ascending"
      } ]
    },
    "numRows" : 2376
  } ]

smoosh layout:

v1,2147483647,1
__time,0,0,6017118
double1,0,323400142,325132505
double2,0,325132505,326676524
double3,0,326676524,338722195
double4,0,338722195,350768214
double5,0,350768214,352309359
float1,0,352309359,353802435
float2,0,353802435,355166332
float3,0,355166332,361190284
float4,0,361190284,367214296
float5,0,367214296,368575446
index.drd,0,369608752,369610074
long1,0,303717758,309709418
long2,0,309709418,311371091
long3,0,311371091,312846945
long4,0,312846945,318187662
long5,0,318187662,323400142
metadata.drd,0,369610074,369612129
multi-string1,0,58809016,98386815
multi-string2,0,98386815,116609876
multi-string3,0,116609876,136310542
multi-string4,0,136310542,168498666
multi-string5,0,168498666,303717758
nested,0,368575446,368575703
nested.__encodedColumn,0,368575721,368582039
nested.__stringDictionary,0,368575703,368575721
nested.__valueIndexes,0,368582039,368582386
rows,0,6017118,6068394
string1,0,6068394,12120081
string2,0,12120081,14819759
string2_hourly_sums_hll/__gran,0,369608356,369608752
string2_hourly_sums_hll/double2_sum,0,368593213,368600687
string2_hourly_sums_hll/hll_string5,0,368600687,368600942
string2_hourly_sums_hll/hll_string5.__complexColumn,0,369605372,369605377
string2_hourly_sums_hll/hll_string5.__complexColumn_compressed,0,368611234,369605372
string2_hourly_sums_hll/hll_string5.__complexColumn_offsets,0,368600942,368611234
string2_hourly_sums_hll/long4_sum,0,368582782,368593213
string2_hourly_sums_hll/string2,0,369605377,369608356
string3,0,14819759,17091334
string4,0,17091334,25219656
string5,0,25219656,58809016
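
To make the layout above easier to read: each line after the `v1,...` header appears to have the shape `name,fileNumber,startOffset,endOffset`, so the bytes an entry occupies are the difference of the last two fields. A minimal sketch for adding up the projection's footprint; the class and method names are hypothetical and not part of Druid:

```java
import java.util.List;

// Hypothetical helper, not a Druid class: reads "name,fileNum,startOffset,endOffset" lines.
final class SmooshLayoutSketch
{
  /** Size in bytes of a single layout entry. */
  static long entrySize(String line)
  {
    final String[] parts = line.split(",");
    return Long.parseLong(parts[3]) - Long.parseLong(parts[2]);
  }

  /** Total bytes of all entries whose name starts with the given prefix. */
  static long totalSize(List<String> lines, String prefix)
  {
    long total = 0;
    for (String line : lines) {
      if (line.startsWith(prefix)) {
        total += entrySize(line);
      }
    }
    return total;
  }

  public static void main(String[] args)
  {
    // Three entries copied from the layout above; only the last two belong to the projection.
    final List<String> lines = List.of(
        "string2,0,12120081,14819759",
        "string2_hourly_sums_hll/__gran,0,369608356,369608752",
        "string2_hourly_sums_hll/string2,0,369605377,369608356"
    );
    System.out.println(totalSize(lines, "string2_hourly_sums_hll/"));
  }
}
```

Applied to all eight `string2_hourly_sums_hll/` entries listed above, the offsets run contiguously from 368582782 to 369608752, which is 1,025,970 bytes.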

Release note

todo

This PR begins to introduce the concept of projections to Druid datasources, which are similar to materialized views but are built into a segment, and which can automatically be used during query execution if the projection fits the query. This PR only contains the logic to build and query them for realtime queries, and does not contain the ability to serialize and actually store them in persisted segments, so it is effectively a toy right now.

changes:
* Adds ProjectionSpec interface, AggregateProjectionSpec implementation for defining rollup projections on druid datasources
* Adds projections to DataSchema
* Adds projection building and querying support to OnHeapIncrementalIndex
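
As a rough illustration of what "the projection fits the query" could mean, here is a deliberately simplified sketch: a projection is treated as usable when it already groups on every column the query groups on and already holds every aggregate the query asks for. The class, the string encoding of aggregators, and the rule itself are assumptions for illustration; the actual matching in this PR also deals with virtual columns, time granularity, and filters.

```java
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model; not Druid classes and not Druid's actual matching rules.
final class ProjectionFitSketch
{
  private final Set<String> groupingColumns;
  private final Set<String> aggregators; // encoded as "type(fieldName)" for illustration only

  ProjectionFitSketch(Set<String> groupingColumns, Set<String> aggregators)
  {
    this.groupingColumns = groupingColumns;
    this.aggregators = aggregators;
  }

  /** True if every grouping column and aggregator the query needs is precomputed here. */
  boolean fits(List<String> queryGrouping, List<String> queryAggregators)
  {
    return groupingColumns.containsAll(queryGrouping) && aggregators.containsAll(queryAggregators);
  }

  public static void main(String[] args)
  {
    // Column and aggregator names taken from the benchmark projection metadata above.
    final ProjectionFitSketch projection = new ProjectionFitSketch(
        Set.of("string2", "__gran"),
        Set.of("longSum(long4)", "doubleSum(double2)", "HLLSketchBuild(string5)")
    );
    // Grouping on string2 with SUM(long4): both are covered by the projection.
    System.out.println(projection.fits(List.of("string2"), List.of("longSum(long4)")));
    // Grouping on string3: not a grouping column of the projection, so the base table is needed.
    System.out.println(projection.fits(List.of("string3"), List.of("longSum(long4)")));
  }
}
```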

+  public final CloserRule closer = new CloserRule(false);
+
+  public CursorFactoryProjectionTest(
+      String name,


abhishekagarwal87 · 2024-10-02T08:39:42Z

+        // wtb some sort of virtual column comparison function that can check if projection granularity time column
+        // satisifies query granularity virtual column
+        // can rebind? q.canRebind("__time", p)
+        // special handle time granularity


need a revision.

yea sorry, still a bunch of todos and my rambling comments all over the place. this one is about wanting to dump using Granularity at all in favor of giving some way that a virtual column can decide if it can replace __time to check for things like finer granularity. i'm not going to do that in this PR, it's just notes for myself, i'm still working on cleaning this up.
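
One way to picture the "finer granularity" check mentioned here: a projection bucketed at one period can answer a query bucketed at a coarser period only when the coarser buckets are made of whole finer buckets. The sketch below is an assumption-laden illustration, not the design discussed for Druid: it only handles fixed-length buckets that share the same origin in UTC, and `GranularityFitSketch` and `canServe` are hypothetical names.

```java
import java.time.Duration;

// Hypothetical helper, not a Druid API. Assumes fixed-length, UTC-aligned buckets with a shared origin.
final class GranularityFitSketch
{
  /** True if query buckets can be assembled from whole projection buckets. */
  static boolean canServe(Duration projectionBucket, Duration queryBucket)
  {
    final long projectionMillis = projectionBucket.toMillis();
    final long queryMillis = queryBucket.toMillis();
    if (projectionMillis <= 0 || queryMillis <= 0) {
      throw new IllegalArgumentException("bucket sizes must be positive");
    }
    // A coarser bucket is rebuildable only if finer buckets nest evenly inside it.
    return queryMillis % projectionMillis == 0;
  }

  public static void main(String[] args)
  {
    // An hourly projection (like the PT1H __gran column above) against a daily query.
    System.out.println(canServe(Duration.ofHours(1), Duration.ofDays(1)));
    // The same projection against a 15 minute query, which is finer than the projection.
    System.out.println(canServe(Duration.ofHours(1), Duration.ofMinutes(15)));
  }
}
```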

abhishekagarwal87 · 2024-10-02T16:03:14Z

+
+  ColumnFormat getColumnFormat(String columnName);
+
+  int size();


what size is it exactly?

number of rows in the facts table, i.e. the row count after rollup if it is a rollup facts table. will add javadoc and maybe rename, i just picked this up since it was the previous name on IncrementalIndex

abhishekagarwal87 · 2024-10-02T16:22:09Z

should the outputname be logged in msg "Completed dim[%s] inverted with cardinality[%,d] in %,d millis." instead of dimension name?

abhishekagarwal87 · 2024-10-02T16:29:21Z

+        rowNumConversions.add(IntBuffer.wrap(arr));
+      }
+
+      final String section = "walk through and merge rows";


Suggested change

-      final String section = "walk through and merge rows";
+      final String section = "walk through and merge rows for projections";

yea, still need to adjust a lot of these things, it's sort of adapted from the regular flow since it's pretty similar in a lot of ways

-      Assert.assertEquals(index.size(), (long) blasterFuture.get());
-      Assert.assertEquals(index.size() * 2, (long) muxerFuture.get());
+      Assert.assertEquals(index.numRows(), (long) blasterFuture.get());
+      Assert.assertEquals(index.numRows() * 2, (long) muxerFuture.get());



      Assert.assertEquals(
-          index.size() * 2,
+          index.numRows() * 2,


gianm · 2024-10-04T18:43:45Z

+   * {@link AggregateProjectionMetadata.Schema#getTimeColumnName()}). Callers must verify this externally before
+   * calling this method by examining {@link VirtualColumn#requiredColumns()}.
+   * <p>
+   * This method also does not handle other time expressions, or if the virtual column is just an identifier for a


missing text

gianm · 2024-10-04T23:49:42Z

  }

+  @JsonProperty
+  @JsonInclude(JsonInclude.Include.NON_NULL)


would prefer NON_EMPTY here, so it only shows up if there are really projections. Unless we think we will ever have a semantic difference between projections: null and projections: [].

gianm · 2024-10-05T05:23:33Z

@@ -87,6 +88,7 @@ public class AutoTypeColumnMerger implements DimensionMergerV9

  public AutoTypeColumnMerger(


Please take a look at adding this. I think what's going on is that for regular columns, name and outputName are the same; and for projection columns, name is the parent name and outputName is the projection column name.

It might be clearer to do String name and @Nullable String parentName, i.e., make name the output name.
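
A small sketch of the naming suggested here, where `name` is always the output name and the parent is optional. This is only an illustration of the idea; the class, its accessors, and the example column names in `main` are hypothetical and not the actual `AutoTypeColumnMerger` fields.

```java
import java.util.Objects;

// Hypothetical sketch of the naming suggestion; not the actual AutoTypeColumnMerger code.
final class MergerNamesSketch
{
  private final String name;       // always the output column name
  private final String parentName; // base table column for projection columns, null for regular columns

  MergerNamesSketch(String name, String parentName)
  {
    this.name = Objects.requireNonNull(name, "name");
    this.parentName = parentName;
  }

  String getName()
  {
    return name;
  }

  /** The parent name if there is one, otherwise the column's own name. */
  String getParentOrSelf()
  {
    return parentName != null ? parentName : name;
  }

  public static void main(String[] args)
  {
    // Regular column: both names coincide, so no parent needs to be given.
    final MergerNamesSketch regular = new MergerNamesSketch("string2", null);
    // Projection column: illustrative output name with the base table column as parent.
    final MergerNamesSketch projected = new MergerNamesSketch("string2_hourly_sums_hll/string2", "string2");
    System.out.println(regular.getName() + " <- " + regular.getParentOrSelf());
    System.out.println(projected.getName() + " <- " + projected.getParentOrSelf());
  }
}
```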

gianm · 2024-10-05T06:02:42Z

    }
  }

+  private Metadata makeProjections(


This function appears to have a bunch of stuff that is adapted and remixed from other functions in this class. It would be good to share common code, if possible.

yes, i absolutely would like to do this, i feel like the base table is kind of just like another projection. This is true for building the incremental index as well, however I'd like to save both of these refactors for future work in order to minimize risk for now

gianm · 2024-10-05T06:03:16Z


+  @Nullable
+  @JsonProperty
+  @JsonInclude(JsonInclude.Include.NON_NULL)


or NON_EMPTY, assuming there isn't a meaningful difference between null and [].

gianm · 2024-10-05T06:45:59Z

        numAdvanced++;
      }

-      done = !foundMatched && (emptyRange || !baseIter.hasNext());


was the clause removed here always unnecessary?

yea, intellij suggested it could be removed because emptyRange = !cursorIterable.iterator().hasNext(); was defined in the constructor, and baseIter = cursorIterable.iterator(); at the start of this method, and finally foundMatched will advance all the way through the iterator if it cannot find a match, so !foundMatched implies that hasNext is false, and emptyRange/!baseIter.hasNext() were effectively equivalent
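
The argument can be checked on a stripped-down model. The sketch below is not the Druid cursor code; `advanceToMatch` is a hypothetical stand-in for the loop that sets `foundMatched`. Because a failed scan always drains the iterator, the old and new expressions for `done` agree.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Simplified model of the reasoning above; not the actual cursor implementation.
final class AdvanceToMatchSketch
{
  /** Advances until an element matches. When this returns false the iterator is exhausted. */
  static <T> boolean advanceToMatch(Iterator<T> iter, Predicate<? super T> matcher)
  {
    while (iter.hasNext()) {
      if (matcher.test(iter.next())) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args)
  {
    final List<Integer> rows = List.of(1, 2, 3);
    final boolean emptyRange = !rows.iterator().hasNext();
    final Iterator<Integer> baseIter = rows.iterator();

    // No element is greater than 10, so the scan runs to the end of the iterator.
    final boolean foundMatched = advanceToMatch(baseIter, x -> x > 10);

    final boolean oldDone = !foundMatched && (emptyRange || !baseIter.hasNext());
    final boolean newDone = !foundMatched;
    System.out.println(oldDone == newDone);
  }
}
```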

gianm · 2024-10-05T06:57:04Z

+    @JsonCreator
+    public Schema(
+        @JsonProperty("name") String name,
+        @JsonProperty("timeColumnName") @Nullable String timeColumnName,


I think in the ideal design there is no such thing as timeColumnName. Through some introspection abilities, we should be able to select the right projections, even with time flooring, using just virtualColumns and groupingColumns. It's ok for now but something to think about for the future.

yea, i totally agree, i just did this for now to save some work of finding the time column until larger refactors can happen and should be harmless to remove later once that happens

gianm · 2024-10-05T07:04:07Z

+        matchBuilder.addReferenceedVirtualColumn(buildSpecVirtualColumn);
+        final List<String> requiredInputs = buildSpecVirtualColumn.requiredColumns();
+        if (requiredInputs.size() == 1 && ColumnHolder.TIME_COLUMN_NAME.equals(requiredInputs.get(0))) {
+          // wtb some sort of virtual column comparison function that can check if projection granularity time column


please clean up this comment

oops, forgot about this one since it wasn't marked with // todo (clint): ... like most of my ramblings

gianm · 2024-10-05T07:10:24Z

    this.timePosition = timePosition;
    Preconditions.checkArgument(
-        timePosition >= 0 && timePosition <= dimensionSelectors.length,
+        timePosition <= dimensionSelectors.length,


I suppose timePosition can be -1 for projections, so part of this check had to go. Please update the message too.
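
For illustration, a message that matches the relaxed check might look like the sketch below. It mirrors only the upper-bound condition kept in the diff, uses a plain `IllegalArgumentException` in place of Guava's `Preconditions`, and the message wording and class name are hypothetical.

```java
// Hypothetical sketch of a message that matches the relaxed check; wording is illustrative.
final class TimePositionCheckSketch
{
  /** Mirrors the remaining condition in the diff: only the upper bound is enforced. */
  static void checkTimePosition(int timePosition, int numDimensionSelectors)
  {
    if (timePosition > numDimensionSelectors) {
      throw new IllegalArgumentException(
          String.format(
              "timePosition[%d] must be at most the number of dimension selectors[%d]",
              timePosition,
              numDimensionSelectors
          )
      );
    }
  }

  public static void main(String[] args)
  {
    checkTimePosition(-1, 3); // accepted now that the lower bound is gone
    checkTimePosition(3, 3);  // accepted, equal to the number of selectors
    try {
      checkTimePosition(4, 3);
    }
    catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```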

gianm · 2024-10-05T07:14:00Z

    }
  }
+
+  public static class OnHeapAggregateProjection implements IncrementalIndexRowSelector


Given this is a static class, and the file is already quite large, please consider making this into its own file.

* abstract `IncrementalIndex` cursor stuff to prepare for using different "views" of the data based on the cursor build spec (#17064)
* abstract `IncrementalIndex` cursor stuff to prepare to allow for possibility of using different "views" of the data based on the cursor build spec

changes:
* introduce `IncrementalIndexRowSelector` interface to capture how `IncrementalIndexCursor` and `IncrementalIndexColumnSelectorFactory` read data
* `IncrementalIndex` implements `IncrementalIndexRowSelector`
* move `FactsHolder` interface to separate file
* other minor refactorings
* add DataSchema.Builder to tidy stuff up a bit (#17065)
* add DataSchema.Builder to tidy stuff up a bit
* fixes
* fixes
* more style fixes
* review stuff
* Projections prototype (#17214)

…17314) Follow up to #17214, adds implementations for substituteCombiningFactory so that more datasketches aggs can match projections, along with some projections tests for datasketches.

…17314) (#17323) Follow up to #17214, adds implementations for substituteCombiningFactory so that more datasketches aggs can match projections, along with some projections tests for datasketches. Co-authored-by: Clint Wylie <cwylie@apache.org>

clintropolis added 10 commits September 19, 2024 13:12

Merge remote-tracking branch 'upstream/master' into projections-realtime

740240b

adjustments

b7292ab

remove interface

8d3d694

cleanup

af0dedf

remove unused

36ce2a9

simplify

b9c3e95

more tests

957b463

persist projections

13aa251

Merge remote-tracking branch 'upstream/master' into projections-proto…

b76a95d

…type

clintropolis added the WIP label Oct 1, 2024

github-actions Bot added Area - Segment Format and Ser/De Area - Ingestion labels Oct 1, 2024

meh

4d21860

github-advanced-security AI found potential problems Oct 1, 2024

clintropolis added 5 commits October 1, 2024 15:16

fixes

4c14c7c

Merge remote-tracking branch 'upstream/master' into projections-proto…

698263f

…type

fixes

ee01b5d

cleanup

042d5af

Merge remote-tracking branch 'upstream/master' into projections-proto…

ec41fe6

…type

clintropolis added the Performance label Oct 2, 2024

more clean

b4e9cf0

github-advanced-security AI found potential problems Oct 2, 2024

Comment thread processing/src/main/java/org/apache/druid/segment/IndexIO.java Fixed

abhishekagarwal87 reviewed Oct 2, 2024

clintropolis added 6 commits October 2, 2024 17:25

more better, still much to do

4825c5d

more better again

aa992e6

Merge remote-tracking branch 'upstream/master' into projections-proto…

dbf5541

…type

adjustments

53ad304

fix

453ea49

note to self

08ef35b

clintropolis added 6 commits October 4, 2024 13:28

add test for auto schemas

2567d13

fix bug with auto schemas that do not have castToType set

997c5e1

Merge remote-tracking branch 'upstream/master' into projections-proto…

177a9da

…type

add filter test to make sure indexes were chill, fixed bug that was n…

7e0e507

…ot chill, thanks tests

more test

d28d06d

adjustments

197e155

github-actions Bot added Area - Batch Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Oct 5, 2024

clintropolis added 2 commits October 4, 2024 18:17

oops

1b70549

oops

715a458

github-advanced-security AI found potential problems Oct 5, 2024

fix style

162063c

clintropolis removed the WIP label Oct 5, 2024

gianm approved these changes Oct 5, 2024

review stuff

8e1a1a0

clintropolis added this to the 31.0.0 milestone Oct 5, 2024

fix test

8b1504f

clintropolis merged commit 0bd13bc into apache:master Oct 5, 2024

clintropolis deleted the projections-prototype branch October 5, 2024 11:39

clintropolis added a commit to clintropolis/druid that referenced this pull request Oct 5, 2024

Projections prototype (apache#17214)

92cd424

clintropolis mentioned this pull request Oct 5, 2024

backport projections #17257

Merged

clintropolis mentioned this pull request Oct 10, 2024

add substituteCombiningFactory implementations for datasketches aggs #17314

Merged

kfaraz mentioned this pull request Oct 10, 2024

Empty commit for backport script (#16991) (#17064) (#17065) (#17132) (#17133) (#17135) (#17147) (#17180) (#17213) (#17214) #17315

Merged

kfaraz mentioned this pull request Oct 10, 2024

[Backport] add substituteCombiningFactory implementations for datasketches aggs … #17323

Merged

10 tasks

kfaraz mentioned this pull request Oct 11, 2024

[DRAFT] 31.0.0 Release Notes #17332

Closed

soenkeliebau mentioned this pull request Mar 26, 2025

Compile list of new product features in newly supported versions for the 25.3.0 release stackabletech/issues#705

Closed

4 participants
