Support for reading Delta Lake table snapshots#17004

Merged
abhishekagarwal87 merged 12 commits into apache:master from abhishekrb19:delta_snapshot
Sep 9, 2024

@abhishekrb19 abhishekrb19 commented Sep 4, 2024

Problem

Currently, the delta input source only supports reading from the latest snapshot of the given Delta Lake table. This is a known documented limitation.

Description

Add support for reading a specific snapshot version of a Delta table. By default, the Druid-Delta connector continues to read the latest snapshot of the Delta table, which preserves compatibility with existing ingestion specs. Users can specify a snapshotVersion to read the table as of that version, for example to ingest change data events from Delta tables into Druid.

In the future, we can also add support for time-based snapshot reads. The Delta API for reading time-based snapshots is currently unclear.

Examples:

A. To ingest snapshot version 3 of the Delta table:

REPLACE INTO "snapshot_delta_table" OVERWRITE ALL
SELECT 
"id",
"map_info"
FROM TABLE(
  EXTERN(
    '{
      "type": "delta",
      "tablePath": "/Users/abhishek/projects/snapshot-table",
      "snapshotVersion": 3
    }',
    '{"type":"json"}'
  )
) EXTEND ("id" BIGINT, "map_info" TYPE('COMPLEX<json>'))
PARTITIONED BY ALL

B. To ingest the latest snapshot of the Delta table, omit snapshotVersion:

REPLACE INTO "snapshot_delta_table" OVERWRITE ALL
SELECT 
"id",
"map_info"
FROM TABLE(
  EXTERN(
    '{
      "type": "delta",
      "tablePath": "/Users/abhishek/projects/snapshot-table"
    }',
    '{"type":"json"}'
  )
) EXTEND ("id" BIGINT, "map_info" TYPE('COMPLEX<json>'))
PARTITIONED BY ALL

Release note

Users can optionally specify a snapshotVersion in the delta input source payload to ingest versioned snapshots from a Delta Lake table. By default, Druid ingests the latest snapshot of the Delta Lake table when snapshotVersion is not specified.
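
As a rough illustration of the optional field described above, here is a minimal Python sketch that builds the delta input source payload used in the EXTERN examples. The helper name delta_input_source and its parameters are hypothetical and not part of Druid; only the "type", "tablePath", and "snapshotVersion" keys come from this PR. It assumes that leaving out snapshotVersion is how a caller requests the latest snapshot, as the release note states.

```python
import json
from typing import Optional


def delta_input_source(table_path: str, snapshot_version: Optional[int] = None) -> str:
    """Build the JSON payload for a `delta` input source (hypothetical helper)."""
    payload = {"type": "delta", "tablePath": table_path}
    # Only include snapshotVersion when the caller asks for a specific version;
    # omitting the key leaves Druid on its default of reading the latest snapshot.
    if snapshot_version is not None:
        payload["snapshotVersion"] = snapshot_version
    return json.dumps(payload)


# Payload pinned to snapshot version 3, as in example A.
print(delta_input_source("/Users/abhishek/projects/snapshot-table", 3))
# Payload with no version, as in example B.
print(delta_input_source("/Users/abhishek/projects/snapshot-table"))
```

Generating the payload with a JSON serializer rather than writing it by hand also avoids syntax slips such as a trailing comma when the optional field is left out.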

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

@abhishekrb19 abhishekrb19 changed the title Support for ingesting Delta table snapshots by version Support for reading Delta Lake table snapshots Sep 4, 2024
Comment thread docs/ingestion/input-sources.md Outdated
@abhishekagarwal87 abhishekagarwal87 merged commit aa833a7 into apache:master Sep 9, 2024
abhishekrb19 added a commit to abhishekrb19/incubator-druid that referenced this pull request Sep 9, 2024
abhishekrb19 added a commit to abhishekrb19/incubator-druid that referenced this pull request Sep 9, 2024
vogievetsky pushed a commit that referenced this pull request Sep 9, 2024
* Web-console change to add Delta snapshot version.

Web-console change for #17004.

* Update web-console/src/druid-models/input-source/input-source.tsx

* Update web-console/src/druid-models/ingestion-spec/ingestion-spec.tsx
@abhishekrb19 abhishekrb19 deleted the delta_snapshot branch September 10, 2024 13:12
abhishekrb19 added a commit to abhishekrb19/incubator-druid that referenced this pull request Sep 19, 2024
* Web-console change to add Delta snapshot version.

Web-console change for apache#17004.

* Update web-console/src/druid-models/input-source/input-source.tsx

* Update web-console/src/druid-models/ingestion-spec/ingestion-spec.tsx
@abhishekrb19 abhishekrb19 mentioned this pull request Oct 3, 2024
1 task
@kfaraz kfaraz added this to the 31.0.0 milestone Oct 4, 2024