Materialized view implementation #5556
Conversation
@zhangxinyu1 thanks for raising this PR! Would you add a link to the proposal here?

Oh, never mind. It's already here.

I restarted Travis. @zhangxinyu1 would you check the TeamCity inspection failure?

8d328ac to fdf16a9

@zhangxinyu1 thanks for the fix. I'll start my review. BTW, did you have a chance to test this feature in some real clusters?

@jihoonson Thanks!
jihoonson
left a comment
Reviewed up to MaterializedViewMetadataCoordinator.

I believe we will have more types of views in the future. Please use a more specific name like derivativeDataSource.

BTW, this annotation is not needed since you added a NamedType here.

nit: Can be simplified to this.baseDataSource = Preconditions.checkNotNull(baseDataSource, "baseDataSource cannot be null. This is not a valid DerivativeDataSourceMetadata.");

Looks like the logic is almost the same as equals(). It would be better to call equals() here.

Then, this should throw UnsupportedOperationException. If this causes a problem, you might need to add methods like isMergeable() and isSubtractable() to the DataSourceMetadata interface.

Same here. This should throw UnsupportedOperationException.

This method doesn't throw Exception.

nit: unnecessary type arguments.

Probably this method should be merged into IndexerSQLMetadataStorageCoordinator.resetDataSourceMetadata(): that method should check whether an entry already exists in the metastore and insert a new entry if it doesn't; otherwise, it can update the existing entry.

Yes, this method can be merged into IndexerSQLMetadataStorageCoordinator.resetDataSourceMetadata(). However, maybe we can do it in another PR, because we should consider the logic of the code that uses this method.

This method should return only used segments. Please add a method like getUsedSegmentsForInterval() which returns List<Pair<DataSegment, String>> to IndexerMetadataStorageCoordinator.

maxCreatedDate is a less intuitive name.
@zhangxinyu1 that is great! I think it would be enough. I'll test this PR in our cluster as well.

6627334 to ee6fec7

@jihoonson I have modified the code according to your comments. Could you please go on to review it?

@zhangxinyu1 sure. I'll review tomorrow.
jihoonson
left a comment
@zhangxinyu1 still reviewing. Reviewed up to DataSourceOptimizer.

Would you elaborate more on why this feature is split into two extensions? If we need to always load both extensions to use this feature, it would be better to make a single extension.

I can't agree with you more. However, DataSourceOptimizer needs BrokerServerView to get the timelines of different dataSources to do the optimization, and only the broker has this information. Then the materialized-view-selection module has to be loaded only in the broker, so I had to split the feature into two extensions. I thought about this for a long time, but cannot figure out how to solve this problem. Do you have any suggestions?

Do you mean that materialized-view-maintenance should be loaded only in overlords while materialized-view-selection should be loaded only in brokers?

materialized-view-selection should be loaded only in brokers, but materialized-view-maintenance can be loaded anywhere.

Ah, ok. We don't have a nice way to do this currently. I think it's fine to go with it as it is. Would you please add some comments about this, especially that materialized-view-selection should be loaded only in brokers?

Sure, I'm working on your comments these days. Thanks very much!

Please use Intervals.ETERNITY instead.

Intervals.ETERNITY doesn't work well when comparing to a varchar in the metastore.

Would you let me know which error you saw?

Intervals.ETERNITY = "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z".
When we use it to compare against the start and end of segments to get all segments from the metastore, such as:
select * from druid_segments where start > '-146136543-09-08T08:23:32.096Z' and end < '146140482-04-24T15:36:27.903Z';
an empty set is returned, because no end is less than '146140482-04-24T15:36:27.903Z' when compared as a string.
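To make the failure concrete, here is a minimal, self-contained sketch (the bound strings are the ETERNITY endpoints quoted above; the class name and the 2018 segment dates are illustrative). Varchar comparison is lexicographic, not chronological, so a normal segment end such as "2018-..." sorts after "146140482-..." because '2' > '1'.

```java
public class EternityVarcharComparison
{
  // start/end of Intervals.ETERNITY as they would be stored as varchar in the metastore
  static final String ETERNITY_START = "-146136543-09-08T08:23:32.096Z";
  static final String ETERNITY_END = "146140482-04-24T15:36:27.903Z";

  // mimics the WHERE clause: start > :start AND end < :end, evaluated as strings
  public static boolean matchedByVarcharRange(String start, String end)
  {
    return start.compareTo(ETERNITY_START) > 0 && end.compareTo(ETERNITY_END) < 0;
  }

  public static void main(String[] args)
  {
    // lexicographically '2' > '1', so end < ETERNITY_END is false for any modern date
    System.out.println(
        matchedByVarcharRange("2018-01-01T00:00:00.000Z", "2018-01-02T00:00:00.000Z")
    ); // prints false
  }
}
```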
Looks like DEFAULT_MAX_TASK_COUNT.

The line indentation is not correct. Please adjust it.

Please break the line like
Preconditions.checkNotNull(
    baseDataSource,
    "baseDataSource cannot be null. Please provide a baseDataSource."
);
Same for the following 3 lines.

Please add some javadoc.

This should be a final non-static variable.

serverView is used in the optimize method, and that method is static.

I mean, this should be a final non-static variable because it's quite dangerous. As you said, serverView is used in a static method (optimize()), but it is initialized in the constructor. As you know, static methods can be called without creating an instance, which means serverView might not be initialized when optimize() is called. This currently works because Guice initializes DataSourceOptimizer when DataSourceOptimizerMonitor is initialized, and this happens to be before optimize() is called. However, it might break in the future if something changes, e.g. if someone decides to make DataSourceOptimizerMonitor configurable and disables it.

Thanks, you're right. I'll modify it.

Please rename to DataSourceOptimizer.

These variables represent the metrics of dataSourceOptimizer, which means dataSourceOptimizer needs to keep some state. Why don't we simply make this a singleton instance?

DataSourceOptimizer is a singleton instance, and I use static because the optimize method is static.
As for "Why don't we simply make this a singleton instance?" - do you mean I should write another class (e.g. DataSourceOptimizerMetrics) to record these states?

Oh, you're right. It's a singleton. Then I wonder why you made the optimize() method static. Usually static methods are useful when a class doesn't have to keep any state (like util classes). But DataSourceOptimizer does keep state (that is, metrics).
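A minimal sketch of the hazard being discussed (the class and field names here are illustrative, not the PR's actual code): a static field that is only assigned in the constructor is observable as null by any static method called before the first instance exists.

```java
public class StaticViewHolder
{
  // anti-pattern under discussion: a static field assigned only in the constructor
  private static Object serverView;

  public StaticViewHolder(Object view)
  {
    serverView = view;
  }

  // a static method reading the field sees null until an instance is created
  public static boolean isViewReady()
  {
    return serverView != null;
  }
}
```

Calling StaticViewHolder.isViewReady() before any instance exists returns false, which is exactly the window in which a static optimize() would hit a NullPointerException. Making the field final and non-static, and optimize() an instance method, removes that window entirely.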
@zhangxinyu1 left more comments. It looks like a nice start for supporting this kind of cool feature!
Also please add some documentation. I would love to test this in my cluster!

This should be outside of the try clause (https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/Lock.html).

Also, probably this should be writeLock().

It looks like these maps are synchronized with the lock. If so, they don't have to be ConcurrentHashMaps.

Also please leave some comments about what these maps mean.

The lock is mainly used to synchronize all stats in the getAndResetStats() method. In getAndResetStats(), we take snapshots of the stats one by one and then clear all stats. I use the lock to ensure the stats cannot change between these steps.
I use ConcurrentHashMap because, in optimize(), each stat is incremented concurrently.

I'm not sure I understood correctly, but if my new comments are correct, readLock() and writeLock() should be used in getAndResetStats() and optimize(), respectively. If so, a concurrent map is not needed because only one thread can write at a time in optimize(), and all threads can read without contention in getAndResetStats().

In my design, many threads are allowed to call optimize() simultaneously, because MaterializedViewQuery needs to be optimized concurrently, so I use readLock() in optimize(). It means these stats can be read and changed concurrently by those threads.
However, when a thread calls getAndResetStats() to take a full snapshot of the stats, the stats are not allowed to change. Therefore, I use writeLock() there to block calls to optimize().

Ok. Please add some comments about this.
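A sketch of the scheme the thread converges on (field and method names are illustrative, not the PR's actual code): updaters take the read lock, so many optimize() calls can record stats concurrently via ConcurrentHashMap and AtomicLong, while getAndResetStats() takes the write lock so that no update can interleave between copying the snapshot and clearing the counters.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class OptimizerStats
{
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  // ConcurrentHashMap because many read-lock holders may update it at the same time
  private final ConcurrentHashMap<String, AtomicLong> hitCount = new ConcurrentHashMap<>();

  public void recordHit(String dataSource)
  {
    lock.readLock().lock(); // shared: many optimize() calls may run together
    try {
      hitCount.computeIfAbsent(dataSource, k -> new AtomicLong()).incrementAndGet();
    }
    finally {
      lock.readLock().unlock(); // unlock in finally, per the Lock javadoc comment above
    }
  }

  public Map<String, Long> getAndResetStats()
  {
    lock.writeLock().lock(); // exclusive: no updates while snapshotting and clearing
    try {
      Map<String, Long> snapshot = new HashMap<>();
      hitCount.forEach((k, v) -> snapshot.put(k, v.get()));
      hitCount.clear();
      return snapshot;
    }
    finally {
      lock.writeLock().unlock();
    }
  }
}
```

Note the deliberately inverted use of the read/write lock: "read" here means "update one counter concurrently" and "write" means "observe and reset everything atomically", which is the design zhangxinyu1 describes.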
nit: can be Collections.singletonList(query).

Please rename to a more intuitive name. Looks like DerivativeDataSource?

Should be Objects.hash(VIEW, query).

Is this class needed in the current implementation?

No, it's useless. Should I remove it?

Please throw an exception if the query type is unknown.

I think it's better to add a method to DimFilter which returns all required column names.

Yes, it's better if we add this method, because the current approach will miss cases when there is a new implementation of DimFilter. But do you think I should add this method in this PR?

I think it's up to you. If you don't want to make this PR bigger, please raise an issue for this.

Probably this should be readLock().

Also, probably this should be writeLock().
ee6fec7 to 6887cb0

The rough documentation about how to use this feature is at the front of this PR. Should I add some documentation to

@zhangxinyu1 yes, you can add docs to the directory under $DRUID/docs/content/development/extensions-contrib like other extensions.

6887cb0 to 4a6a372

@zhangxinyu1 thanks for the update! I didn't realize that. I'll take another look and do some tests in our cluster. BTW, a recent change (#5583) merged into master includes a change to the signature of HadoopTuningConfig which makes this PR fail to merge. Would you update this PR?

4a6a372 to 0ab2bcd

@jihoonson Thanks for the reminder. I have updated it.

b879328 to 1b452d6
jihoonson
left a comment
@zhangxinyu1 thanks for the update. I left my last comments. I also tested this PR on my local machine. It works nicely!

Please add that this feature currently requires a Hadoop cluster.

Would you check this comment?

Same here. The null check is unnecessary.

I suggest modifying List<DataSegment> getUsedSegmentsForInterval(String dataSource, Interval interval); to return List<Pair<DataSegment, String>> rather than adding a new method.

I don't know. I just think that when someone calls getUsedSegmentsForInterval, maybe they don't want the created-date information.
Maybe created_date should be a part of DataSegment. In this way, we only need the method List<DataSegment> getUsedSegmentsForInterval(String dataSource, Interval interval);. What do you think?

The only usage of getUsedSegmentsForInterval() is SegmentAllocateAction. It checks whether any segments are already allocated for the given interval before allocating a new segment id. I think it can just ignore the createdDate part.
As for "Maybe created_date should be a part of DataSegment" - hmm, that's a good point. It sounds good, but I'm not sure why created_date is not a part of DataSegment itself. @gianm any idea?

Alright, let me raise an issue for this and merge these two methods in another PR, because it affects about 16 classes.

Sounds good. Please go for it.
1b452d6 to 1938ce5
Could just return the boolean expression.

Can we do tuningConfigForTask.withVersion instead?

I'm afraid not, because though the withVersion function can set a new version, it cannot set useExplicitVersion = true.

It'd be kinda nice to make UnionDataSource support QueryDataSources and reuse it to run a list of queries.

I don't understand. Could you please describe it in more detail? Thanks!

Sure - not an important suggestion, so please ignore it if it seems irrelevant or too much work :)
In order to execute a materialized view query we have to issue multiple queries on different intervals and merge their results. That might be a more generally useful component where users can union multiple queries rather than just multiple datasources.

If it's easy to do I think it'd be worth supporting UnionDataSources as well. Would it just be a matter of iterating over a list of datasource names, running the rest of this method, and flattening the resulting list of queries?

Thanks for your suggestion. The current implementation supports UnionDataSource in this way: in UnionQueryRunner, a UnionDataSource is transformed into some TableDataSources, and then these TableDataSources are optimized in DataSourceOptimizer.java. Is this ok?

Ah got you, thanks for the explanation!
1938ce5 to 3623f9b

@Dylan1312 Could you please trigger the Travis CI build?

Afraid I don't have the appropriate permission; a committer should be able to help you out.

You can always close and reopen the PR to restart the build ...

@Dylan1312 Thanks!

@b-slim It works, thanks!

import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Consumer;

public class DatasourceOptimizer
Please forgive me for posting this here - I'm not a committer/reviewer, so my feedback does not count, but there is one thing that looks incorrect to me:
Class DatasourceOptimizer states that
"Derived dataSource with smallest average size of segments have highest priority to replace the datasource in user query"
and accordingly the following lines produce this prioritized collection of derivatives:
// get all derivatives for datasource in query. The derivatives set is sorted by average size per segment granularity.
ImmutableSortedSet<Derivative> derivatives = DerivativesManager.getDerivatives(datasourceName);
However, a few lines below, items from the above collection named "derivatives", which is sorted by priority, get selected and put into the following collection, which is simply a HashSet - not sorted, and, according to the javadoc, also not guaranteed to preserve insertion order:
Set<Derivative> derivativesWithRequiredFields = Sets.newHashSet();
To my understanding, derivativesWithRequiredFields should be a List or a LinkedHashSet so that it is guaranteed that the best derivative gets consulted first.
Thanks.

@sascha-coenen thanks for your attention and suggestion.
Please see the latest version of DataSourceOptimizer here: https://github.com/druid-io/druid/pull/5556/files#diff-250d80eb8afc10c49ee91e41d8f9d91c
derivativesWithRequiredFields is sorted when it is used, as follows:
for (DerivativeDataSource derivativeDataSource : ImmutableSortedSet.copyOf(derivativesWithRequiredFields))
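For reference, a small illustration of the point resolved above (class and method names are illustrative; java.util.TreeSet stands in for Guava's ImmutableSortedSet.copyOf to stay dependency-free, and the integers play the role of average segment sizes): a HashSet discards the priority order, but copying into a sorted set just before iteration restores it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class DerivativePriority
{
  // A HashSet gives no ordering guarantee, so the sort order established earlier
  // is lost; copying into a sorted set before iterating restores it (the PR does
  // this with ImmutableSortedSet.copyOf(derivativesWithRequiredFields)).
  public static List<Integer> inPriorityOrder(Set<Integer> unordered)
  {
    return new ArrayList<>(new TreeSet<>(unordered));
  }

  public static void main(String[] args)
  {
    Set<Integer> avgSegmentSizes = new HashSet<>();
    avgSegmentSizes.add(300);
    avgSegmentSizes.add(100);
    avgSegmentSizes.add(200);
    System.out.println(inPriorityOrder(avgSegmentSizes)); // prints [100, 200, 300]
  }
}
```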
I'm going to remove

All right. I'm going to merge this PR shortly.

Merged. @zhangxinyu1 thank you for the contribution!

@jihoonson Thanks! I will work on the related issues #5710 and #5775 these days.
Target
To optimize queries.
Implementation
There are two extensions, namely materialized-view-maintenance and materialized-view-selection.
In materialized-view-maintenance, MaterializedViewSupervisor is used to generate or drop derived-datasource segments and to keep the timelines of the base datasource and the derived datasource consistent.
In materialized-view-selection, MaterializedViewQuery is implemented to do materialized-view selection for topn/groupby/timeseries queries.
The detailed design and discussion are in issue #5304.
Usage