Partition filter aware index application #338
Description
Feature requested
Currently Hyperspace considers all source files in the given relation.
However, if we take the partition filter into account for partitioned source data, we could exclude unrelated file paths.
For example,
All file paths = ["/path/col=1/a.parquet", "/path/col=2/b.parquet", "/path/col=1/b.parquet"]
Paths with partition filter (col = 1) = ["/path/col=1/a.parquet", "/path/col=1/b.parquet"]
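A minimal sketch of the path filtering above, assuming Hive-style `key=value` directory names and equality-only predicates. The helper names `partition_values` and `filter_paths` are hypothetical and are not part of Hyperspace:

```python
from typing import Dict, List


def partition_values(path: str) -> Dict[str, str]:
    """Extract Hive-style `key=value` directory segments from a file path."""
    values: Dict[str, str] = {}
    for segment in path.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            key, _, value = segment.partition("=")
            values[key] = value
    return values


def filter_paths(paths: List[str], partition_filter: Dict[str, str]) -> List[str]:
    """Keep only paths whose partition values satisfy every equality predicate."""
    return [
        p
        for p in paths
        if all(partition_values(p).get(k) == v for k, v in partition_filter.items())
    ]


all_paths = ["/path/col=1/a.parquet", "/path/col=2/b.parquet", "/path/col=1/b.parquet"]
filtered = filter_paths(all_paths, {"col": "1"})
print(filtered)
```

In practice the query engine already performs partition pruning, so an implementation would more likely reuse the pruned file listing than re-parse paths like this.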
With the filtered paths, we could apply indexes that were only "partially" refreshed. (#298)
For partially refreshed indexes, we could also apply hybrid scan more efficiently by using the filtered indexes.
For example, suppose a user runs a query with a partition filter on col=1 and col=2, where
Index source files = ["/path/col=1/a.parquet", "/path/col=1/b.parquet"]
and many files have been appended under the source relation since the index was created (e.g. /path/col=2/*, /path/col=3/*, /path/col=4/* ..). The options are:
- without the partial refresh feature & without using the partition filter
  - fully refreshed index
  - do Hybrid Scan with many deleted files
- with the partial refresh feature & without using the partition filter
  - partially refreshed index
  - do Hybrid Scan with many appended files (unrelated file paths are not needed, but will be handled as appended files)
- with the partial refresh feature & with the partition filter
  - partially refreshed index
  - do Hybrid Scan with fewer diff files, or skip Hybrid Scan entirely if the source files in the given query are the same as the source file list of the partially refreshed index.
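The difference between the last two cases can be sketched with plain set arithmetic. This is only an illustration: the appended file names (`c.parquet`, `d.parquet`, `e.parquet`) and the helper `hybrid_scan_diff` are made up for the example, and both file lists are assumed to be already restricted to the same partition filter:

```python
from typing import Iterable, Set, Tuple


def hybrid_scan_diff(
    query_files: Iterable[str], index_source_files: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (appended, deleted) files relative to the index's source file list."""
    query, indexed = set(query_files), set(index_source_files)
    appended = query - indexed  # in the query, not covered by the index
    deleted = indexed - query   # covered by the index, no longer in the query
    return appended, deleted


index_source = ["/path/col=1/a.parquet", "/path/col=1/b.parquet"]
# Hypothetical files appended after index creation.
all_files = index_source + [
    "/path/col=2/c.parquet",
    "/path/col=3/d.parquet",
    "/path/col=4/e.parquet",
]

# Without the partition filter: every appended file is part of the diff.
appended_all, deleted_all = hybrid_scan_diff(all_files, index_source)

# With a filter covering col=1 and col=2: only the col=2 file is part of the diff.
wanted = ("/path/col=1/", "/path/col=2/")
query_files = [f for f in all_files if f.startswith(wanted)]
appended, deleted = hybrid_scan_diff(query_files, index_source)

# With a filter on col=1 only: the lists match, so Hybrid Scan is not needed.
col1_files = [f for f in all_files if f.startswith("/path/col=1/")]
needs_hybrid_scan = any(hybrid_scan_diff(col1_files, index_source))
```

Here `appended_all` holds three files while `appended` holds one, which is the "fewer diff files" case described above.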
Acceptance criteria
tbd
Success criteria
tbd
Additional context
- If the index is bucketed by the partition column, we could also skip reading unnecessary buckets (= index data files) based on the partition filter.
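A toy sketch of that bucket-pruning idea, under loud assumptions: the bucket file names are hypothetical, and the modulo function below is a stand-in for the real bucket hash (Spark assigns buckets with its own hash function, which this example does not replicate):

```python
from typing import Callable, Dict, Iterable, List


def prune_buckets(
    bucket_files: Dict[int, str],
    filter_values: Iterable[int],
    num_buckets: int,
    bucket_of: Callable[[int, int], int],
) -> List[str]:
    """Return only the index data files whose bucket can contain a filtered value."""
    needed = {bucket_of(v, num_buckets) for v in filter_values}
    return [bucket_files[b] for b in sorted(needed) if b in bucket_files]


def toy_bucket_of(value: int, num_buckets: int) -> int:
    """Stand-in bucket function; NOT the hash Spark actually uses."""
    return value % num_buckets


# Hypothetical index layout: one data file per bucket.
num_buckets = 4
bucket_files = {b: f"index/bucket_{b:05d}.parquet" for b in range(num_buckets)}

# Partition filter col = 1 only needs the bucket that value 1 maps to.
to_read = prune_buckets(bucket_files, [1], num_buckets, toy_bucket_of)
```

A real implementation would have to use exactly the same bucket function that wrote the index, otherwise pruning would silently drop matching rows.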