This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Partition filter aware index application #338

@sezruby

Feature requested

Currently Hyperspace considers all source files in the given relation.

However, if we refer to the partition filter for partitioned source data, we can exclude unrelated file paths.
For example,
All file paths = ["/path/col=1/a.parquet", "/path/col=2/b.parquet", "/path/col=1/b.parquet"]
Paths with partition filter (col = 1) = ["/path/col=1/a.parquet", "/path/col=1/b.parquet"]
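
The path filtering above can be sketched as follows. This is a minimal illustration that assumes Hive-style `key=value` directory names and an equality-only filter; `partition_values` and `filter_paths` are hypothetical helper names, not Hyperspace APIs.

```python
from typing import Dict, List


def partition_values(path: str) -> Dict[str, str]:
    """Extract Hive-style key=value directory segments from a file path."""
    values = {}
    for segment in path.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values


def filter_paths(paths: List[str], partition_filter: Dict[str, str]) -> List[str]:
    """Keep only the paths whose partition values match every filter entry."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in partition_filter.items())
    ]


all_paths = ["/path/col=1/a.parquet", "/path/col=2/b.parquet", "/path/col=1/b.parquet"]
filtered = filter_paths(all_paths, {"col": "1"})
print(filtered)
```

A real implementation would more likely reuse the partition pruning that Spark already performs on the relation rather than parse paths by hand.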

With the filtered paths, we could apply indexes that were only "partially" refreshed. (#298)
For partially refreshed indexes, we could apply Hybrid Scan more efficiently using the filtered indexes.

For example, a user may want to run a query with partition filter col=1 and col=2,
Index source files = ["/path/col=1/a.parquet", "/path/col=1/b.parquet"]
and many files were appended under the source relation after the index was created (e.g. /path/col=2/*, /path/col=3/*, /path/col=4/* ..)

  • without partial refresh feature & without using partition filter
    • fully refreshed index
    • do Hybrid Scan with many deleted files
  • with partial refresh feature & without using partition filter
    • partially refreshed index
    • do Hybrid Scan with many appended files (unrelated file paths are not required, but will be handled as appended files)
  • with partial refresh feature & with using partition filter
    • partially refreshed index
    • do Hybrid Scan with fewer diff files, or skip Hybrid Scan entirely if the source files for the given query are the same as the source file list of the partially refreshed index.
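
To make the last two cases concrete, here is a rough sketch that compares the file diff with and without the partition filter. It assumes the query filters on col=1 only; the file names under col=2, col=3 and col=4 are made up for illustration, and `diff_files` and `needs_hybrid_scan` are hypothetical helpers rather than Hyperspace functions.

```python
from typing import Set, Tuple


def diff_files(index_source: Set[str], query_source: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (appended, deleted) files relative to the index source file list."""
    appended = query_source - index_source
    deleted = index_source - query_source
    return appended, deleted


def needs_hybrid_scan(index_source: Set[str], query_source: Set[str]) -> bool:
    """Hybrid Scan is only needed when the two file sets differ."""
    appended, deleted = diff_files(index_source, query_source)
    return bool(appended or deleted)


# Source files recorded by the partially refreshed index.
index_source = {"/path/col=1/a.parquet", "/path/col=1/b.parquet"}

# Hypothetical files appended after index creation.
relation_files = index_source | {
    "/path/col=2/c.parquet",
    "/path/col=3/d.parquet",
    "/path/col=4/e.parquet",
}

# Without the partition filter, every appended file counts as a diff.
appended_all, deleted_all = diff_files(index_source, relation_files)

# With the partition filter (col = 1), only the related paths are compared.
query_files = {p for p in relation_files if "/col=1/" in p}
hybrid_scan_required = needs_hybrid_scan(index_source, query_files)

print(len(appended_all), hybrid_scan_required)
```

Under these assumptions the filtered file set equals the index source file list, so the index could be applied without Hybrid Scan.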

Acceptance criteria
tbd

Success criteria
tbd

Additional context

  • If the index is bucketed by the partition column, we could also skip unnecessary buckets (= index data files) when reading, based on the partition filter.
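
The bucket pruning idea could look roughly like the sketch below. It assumes an equality filter on the bucketing column and uses CRC32 purely as a stand-in hash; Spark assigns buckets with its own Murmur3-based hash, so real bucket ids would differ. The `bucket_files` layout and all function names are hypothetical.

```python
import zlib
from typing import Dict, List


def bucket_id(value: str, num_buckets: int) -> int:
    """Map a column value to a bucket (stand-in hash, not Spark's Murmur3)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def prune_buckets(bucket_files: Dict[int, List[str]], filter_value: str,
                  num_buckets: int) -> List[str]:
    """Return only the index data files in the bucket that can hold filter_value."""
    return bucket_files.get(bucket_id(filter_value, num_buckets), [])


num_buckets = 4
# Hypothetical index layout: one data file per bucket.
bucket_files = {b: [f"index/bucket_{b}.parquet"] for b in range(num_buckets)}

# A filter such as col = 1 touches a single bucket instead of all four.
selected = prune_buckets(bucket_files, "1", num_buckets)
print(selected)
```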

    Labels

    enhancement (New feature or request), untriaged (This is the default tag for a newly created issue)