
[PROPOSAL]: Support for Scan Pattern for Scalable Refresh Index for Large Data Sources #276

@apoorvedave1

Description


Describe the problem

Goals

Original Issue #223, Dependencies: #269

  1. Support incremental refresh calls with a scan pattern
  2. Avoid a full file listing during df creation when an incremental refresh is called with a scan pattern

Independent Scenarios to support

  1. Globbing data sources
    E.g. original source data path = data/*/*/*/*

    • create index supports this <PR # handles this>
    • Incremental refresh on scanPattern = data/*/a/*
    • Incremental refresh on scanPattern = data/*/a
      hs.refreshIndex("indexname", scanPattern = "data/*/m=11/*")
  2. Hive-partitioned data sources
    E.g. original source data path = data

    • create index supports this today
    • Incremental refresh on scanPattern = data/y=2020/m=11/
    • Incremental refresh on scanPattern = data/y=2020/*
    • Incremental refresh on scanPattern = data/*/m=11
      hs.refreshIndex("indexname", scanPattern = "data/*/b/*")

Describe your proposed solution

General Ideas:

  1. Setting "basePath" option in a dataframe is sufficient to inform spark of hive-partition schema if it exists.

  2. For a hive-partitioned data source: setting root path + scanPattern should suffice to list the right set of files for refresh.

    • e.g. if actual files look like data/y=2020/m=11/d=1/f1.parquet and scanPattern is data/y=2020/*,
      dfUsedForRefresh = spark.read.option("basePath", "data").load("data/y=2020") can be used to refresh the index. This ensures the partition columns are preserved and only the files under "data/y=2020" are listed (see the sketch after this list).
  3. For globbing data sources: setting the "path" option to a globbing pattern (from the data root down to the leaf directory) tells Spark which files to list for refresh.

    • e.g. if actual files look like data/a/b/c/d/f1.parquet (so the globbing pattern = data/*/*/*/*) and scanPattern is data/*/b/*,
      dfUsedForRefresh = spark.read.load("data/*/b/*/*") can be used to refresh the index. This ensures only the files matching "data/*/b/*" are picked.
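A runnable sketch of the two read shapes described above, using only standard Spark options (paths follow the examples in this section; parquet is assumed as the format):

```scala
// Hive-partitioned source: "basePath" keeps the partition columns (y, m, d)
// in the schema, while the narrower load path limits listing to data/y=2020.
val hivePartitionedRefreshDf = spark.read
  .option("basePath", "data")
  .parquet("data/y=2020")

// Globbing source: the glob in the load path limits listing to directories
// matching data/*/b/*; no "basePath" option is needed here.
val globbedRefreshDf = spark.read
  .parquet("data/*/b/*/*")
```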

Algorithm:

  1. Globbing Data Sources
    CreateIndex:
    - Set rootPath (in metadata) = data/*/*/*/* (same as the globbing pattern)
    Refresh Incremental:
    - scanPattern = data/*/b/*
    - Note: "basePath" is not required for globbing patterns but it does no harm even if present
    - mergedPath = merge(globPath, scanPattern) (see the merge sketch after this list) =>
    e.g. merge(data/*/*/*/*, data/*/b/*) = data/*/b/*/*
    - dfUsedForRefresh = spark.read.options(...).format(...).load(mergedPath)
    - Metadata changes => new metadata = old metadata + new Index Files in index content + new data files (from scan pattern) in data content

  2. Hive-Partitioned Data Sources
    CreateIndex:
    - Set rootPath (in metadata) = "data" (this is done currently)
    - Set basePath (in metadata) = "data" (this is new)
    Refresh Incremental:
    - scanPattern = data/*/m=11/*
    - Note: "basePath" is required to correctly identify partition columns.
    - Set "basePath" = basePath from metadata
    - mergedPath = merge(rootPath, scanPattern) =>
    e.g. merge(data, data/*/m=11/*) = data/*/m=11/*
    - dfUsedForRefresh = spark.read.options(...).format(...).option("basePath", basePath).load(mergedPath) => note the additional "basePath" option
    - Metadata changes => new metadata = old metadata + new index files in index content + new data files (from scan pattern) in data content
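A minimal sketch of the merge step used by both algorithms above. The helper below is illustrative only and is not existing Hyperspace code; it overlays the scan pattern on the stored root path and keeps any deeper glob levels the scan pattern does not cover:

```scala
def merge(rootPath: String, scanPattern: String): String = {
  val rootParts = rootPath.split("/")
  val scanParts = scanPattern.split("/")
  // The scan pattern wins for the levels it specifies; any remaining
  // levels of the root/glob path are appended unchanged.
  (scanParts ++ rootParts.drop(scanParts.length)).mkString("/")
}

merge("data/*/*/*/*", "data/*/b/*")  // "data/*/b/*/*"  (globbing source)
merge("data", "data/*/m=11/*")       // "data/*/m=11/*" (hive-partitioned source)

// The merged path then drives the refresh-time file listing, with "basePath"
// added only for hive-partitioned sources:
val dfUsedForRefresh = spark.read
  .option("basePath", "data")
  .parquet(merge("data", "data/*/m=11/*"))
```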

Additional context

Constraints:

How to handle deleted data? Should we consider deleted data across ALL directories or just those which satisfy the scan pattern?

  • If we stick with ALL directories, we fail on goal 2 (not listing all files)

  • If we stick with only scan-pattern directories, we will have an inconsistent situation where the metadata doesn't reflect the true list of deleted files. (Note: this can be supported by assuming the other deleted files were deleted after the incremental refresh.)

  1. This design only considers append-only data. We will revisit the design to support deleted-file scenarios.
  2. Data will either be hive-partitioned or follow a globbing pattern, but not a combination of both. This should cover most user scenarios.
