[Proposal] The 2nd version of implementation for materialised view #5304

@zhangxinyu1

The motivation for 'materialised view' is described in issue #5218.
The implementation of 'Materialised View' is divided into two parts: the management of DerivedDatasource, and query optimization with DerivedDatasource.

Management of DerivedDatasource

Each derived-datasource is managed by a derived-datasource supervisor (like KafkaSupervisor).

CREATE

A DerivedDatasource is created when a user submits a derived-datasource supervisor. The JSON spec of the derived-datasource supervisor must include at least the base datasource name, dimensions, and metrics. The DerivedDatasource name can be set by the user or generated from the base datasource name, dimensions, and metrics. Segment granularity and query granularity should be the same as those of the base datasource. This information is stored as supervisor metadata.
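The proposal does not specify how a name is generated from the base datasource name, dimensions, and metrics. The following is a minimal sketch of one possible scheme, not the proposed implementation: the class `DerivedNameSketch`, the method `resolveName`, and the "base name plus hash suffix" format are all hypothetical. It only illustrates that a generated name can be deterministic and independent of the order in which dimensions and metrics are listed.

```java
import java.util.Collection;
import java.util.TreeSet;

// Hypothetical helper; neither the class nor the naming format comes from the proposal.
final class DerivedNameSketch
{
  // Returns the user-supplied name if present, otherwise a name derived from
  // the base datasource name plus a hash of the sorted dimensions and metrics.
  static String resolveName(
      String userSuppliedName,
      String baseDataSource,
      Collection<String> dimensions,
      Collection<String> metrics
  )
  {
    if (userSuppliedName != null && !userSuppliedName.isEmpty()) {
      return userSuppliedName;
    }
    // Sorting makes the generated name independent of the order in the spec.
    String key = new TreeSet<String>(dimensions) + "|" + new TreeSet<String>(metrics);
    return baseDataSource + "-" + Integer.toHexString(key.hashCode());
  }
}
```

With this scheme, two supervisors that list the same dimensions and metrics in a different order resolve to the same generated name. A real implementation would probably want a stronger hash than `String.hashCode()` to reduce the chance of collisions.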

MAINTAIN

A derived-datasource supervisor should ensure that its timeline and segment versions match those of its base datasource (this solves the segment version management problem that @himanshug pointed out):

  1. When the timeline of the derived-datasource lacks intervals that exist in the base datasource, the supervisor submits a derived-datasource-index-task. The derived-datasource-index-task is a hadoop-index-task. The only difference is that the version of the segments it generates is not the time at which the task was submitted, but the version of the related base datasource segments.
  2. If the timeline of the derived-datasource contains intervals that the base datasource does not have, the supervisor sets used=false for those segments and submits kill tasks for them.
  3. Whenever the supervisor finds that the segment version of the derived-datasource differs from that of its base datasource in the same interval, it sets used=false for those segments and submits a derived-datasource-index-task for that interval (this idea comes from @jihoonson's suggestions).
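The three rules above can be sketched as a comparison of two interval-to-version maps. This is a simplified illustration under stated assumptions, not the supervisor's actual code: `DerivedTimelineReconciler`, `Plan`, and `reconcile` are hypothetical names, intervals and versions are represented as plain strings, and each interval is assumed to have a single version.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of the MAINTAIN rules; not taken from the proposal's code.
final class DerivedTimelineReconciler
{
  static final class Plan
  {
    // interval -> base segment version that the index task should stamp on its output
    final Map<String, String> toBuild = new TreeMap<>();
    // intervals whose derived segments get used=false (and a kill task for rule 2)
    final Set<String> toDrop = new TreeSet<>();
  }

  static Plan reconcile(Map<String, String> baseVersions, Map<String, String> derivedVersions)
  {
    Plan plan = new Plan();
    for (Map.Entry<String, String> e : baseVersions.entrySet()) {
      String derivedVersion = derivedVersions.get(e.getKey());
      if (derivedVersion == null) {
        // Rule 1: interval missing from the derived timeline, build it.
        plan.toBuild.put(e.getKey(), e.getValue());
      } else if (!derivedVersion.equals(e.getValue())) {
        // Rule 3: version mismatch, drop the stale segments and rebuild.
        plan.toDrop.add(e.getKey());
        plan.toBuild.put(e.getKey(), e.getValue());
      }
    }
    for (String interval : derivedVersions.keySet()) {
      if (!baseVersions.containsKey(interval)) {
        // Rule 2: derived interval with no base counterpart, drop it.
        plan.toDrop.add(interval);
      }
    }
    return plan;
  }
}
```

Because rebuilt segments carry the base segment version rather than the task submission time, running `reconcile` again after the tasks finish should yield an empty plan for those intervals.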

DELETE

When the supervisor is shut down or resubmitted, the previous data of the derived-datasource is deleted.
When the base datasource is disabled, all of its derived-datasource supervisors are shut down.

Optimize query with DerivedDatasource

A MaterialisedViewQueryRunner is added in the applyPreMergeDecoration() method of FluentQueryRunnerBuilder, for example:

    public FluentQueryRunner applyPreMergeDecoration()
    {
      return from(
          new UnionQueryRunner<T>(
              new MaterialisedViewQueryRunner(
                  toolChest.preMergeQueryDecoration(
                      baseRunner
                  )
              )
          )
      );
    }

In MaterialisedViewQueryRunner, the query is rewritten into a union of queries over the derived-datasources and the base datasource, and the results of all queries are merged before returning. The optimization process is as follows.

  1. Check whether the datasource is a table datasource. If not, do not optimize.
  2. Check whether the datasource has derived-datasources, and find the derived-datasources that contain the dimensions and metrics the query needs. If no derived-datasource meets this condition, do not optimize.
  3. Split the query interval by segment granularity into sub-intervals, and for each sub-interval, find the derived-datasource that has the least data. Then replace the query datasource with that derived-datasource for the sub-interval.

In this way, a query can be covered partially by the derived datasource and partially by the original one, which solves the problem @nishantmonu51 pointed out.
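Steps 2 and 3 can be sketched as follows. This is a simplified illustration, not the MaterialisedViewQueryRunner itself: `DerivedDataSourceSelector`, `Candidate`, and `assign` are hypothetical names, sub-intervals are plain strings, and "amount of data" is assumed to be a byte count known per interval. A sub-interval for which no suitable derived-datasource has data falls back to the base datasource, which is what allows partial coverage.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of per-sub-interval datasource selection.
final class DerivedDataSourceSelector
{
  static final class Candidate
  {
    final String name;
    final Set<String> columns;                 // dimensions and metrics it contains
    final Map<String, Long> bytesPerInterval;  // only intervals that have data

    Candidate(String name, Set<String> columns, Map<String, Long> bytesPerInterval)
    {
      this.name = name;
      this.columns = columns;
      this.bytesPerInterval = bytesPerInterval;
    }
  }

  // Maps each sub-interval to the datasource that should serve it.
  static Map<String, String> assign(
      String baseDataSource,
      Set<String> requiredColumns,
      List<String> subIntervals,
      List<Candidate> candidates
  )
  {
    Map<String, String> assignment = new LinkedHashMap<>();
    for (String interval : subIntervals) {
      String best = baseDataSource;   // fallback when nothing qualifies
      long bestBytes = Long.MAX_VALUE;
      for (Candidate c : candidates) {
        Long bytes = c.bytesPerInterval.get(interval);
        // Step 2: must contain every column the query needs.
        // Step 3: among those, prefer the one with the least data.
        if (bytes != null && c.columns.containsAll(requiredColumns) && bytes < bestBytes) {
          best = c.name;
          bestBytes = bytes;
        }
      }
      assignment.put(interval, best);
    }
    return assignment;
  }
}
```

The resulting map corresponds to the union query described above: each entry becomes one sub-query against the chosen datasource, restricted to that sub-interval.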
