[Feature] Decouple the lifecycle of snapshot and changelog  #2899

@Aitozi

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Currently, the changelog's lifecycle is bound to the snapshot's: when a snapshot expires, its changelog expires with it. If we want to retain the changelog longer so that a streaming job can reset its consumption position, we also have to retain the data files of the snapshot. This is unnecessary and wastes a lot of storage space.

In a traditional database, the binlog has its own independent lifecycle. In Paimon, we could likewise decouple the lifecycle of the snapshot and the changelog. Users could then choose how long to keep changelog data independently of snapshots, e.g. keep only the latest snapshot plus one day's changelog for consumption.

How to implement

The lifecycle is handled in the expiration process. To maintain two different lifecycles, we need two groups of hint files (file pointers) that mark the effective changelogs and snapshots:

  • EARLIEST and LATEST mark the effective data files
  • EARLIEST_CHANGELOG and LATEST mark the effective changelog

LATEST is shared by both groups, and EARLIEST_CHANGELOG is only present when the changelog is enabled.
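
To make the pointer layout concrete, here is a minimal Python sketch of how the two effective ranges could be derived from the hint values. The function and type names are hypothetical and are not part of Paimon; the sketch assumes each hint file simply holds an integer snapshot id.

```python
from typing import NamedTuple, Optional


class EffectiveRanges(NamedTuple):
    """Hypothetical container for the two id ranges described above."""
    snapshots: range
    changelogs: Optional[range]  # None when the changelog is not enabled


def effective_ranges(earliest: int, latest: int,
                     earliest_changelog: Optional[int] = None) -> EffectiveRanges:
    """Derive the effective snapshot and changelog id ranges from hint values.

    Assumption: EARLIEST, EARLIEST_CHANGELOG and LATEST each hold one integer
    id, and LATEST is shared by both ranges.
    """
    if earliest > latest:
        raise ValueError("EARLIEST must not be greater than LATEST")
    snapshots = range(earliest, latest + 1)
    if earliest_changelog is None:
        # EARLIEST_CHANGELOG is absent: changelog is not enabled.
        return EffectiveRanges(snapshots, None)
    if earliest_changelog > latest:
        raise ValueError("EARLIEST_CHANGELOG must not be greater than LATEST")
    return EffectiveRanges(snapshots, range(earliest_changelog, latest + 1))
```

For example, with EARLIEST = 8, EARLIEST_CHANGELOG = 3 and LATEST = 10, data files are effective for ids 8 to 10 while changelogs remain consumable for ids 3 to 10.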

However, the hint files are not always present, so we need a way to work when they are missing. We therefore introduce the marker files expire-snapshot-x and expire-changelog-x, which are written when the corresponding object is expired. The marker file is created first; once both marker files exist, the snapshot metadata can be deleted, and after that the marker files are deleted as well. EARLIEST and EARLIEST_CHANGELOG are always updated after the marker file is created. So if the EARLIEST or EARLIEST_CHANGELOG file is missing or inaccurate, we can still determine the current effective snapshot and changelog from the expire-snapshot-x and expire-changelog-x files.
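
The fallback can be sketched roughly as follows in Python. This is only one possible reading of the proposal, not the actual implementation: it assumes snapshot metadata files are named snapshot-x, that a snapshot or changelog with a matching marker file counts as expired, and that the earliest effective id is the smallest remaining id without a marker. The function name is hypothetical.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed naming: "snapshot-<id>" for metadata, "expire-<kind>-<id>" for markers.
_SNAPSHOT = re.compile(r"^snapshot-(\d+)$")
_MARKER = re.compile(r"^expire-(snapshot|changelog)-(\d+)$")


def recover_earliest(file_names: Iterable[str]) -> Tuple[Optional[int], Optional[int]]:
    """Return (earliest snapshot id, earliest changelog id) without hint files."""
    present = set()
    expired = {"snapshot": set(), "changelog": set()}
    for name in file_names:
        m = _SNAPSHOT.match(name)
        if m:
            present.add(int(m.group(1)))
            continue
        m = _MARKER.match(name)
        if m:
            expired[m.group(1)].add(int(m.group(2)))

    def earliest(kind: str) -> Optional[int]:
        # An id is still effective if its metadata exists and no marker covers it.
        live = present - expired[kind]
        return min(live) if live else None

    return earliest("snapshot"), earliest("changelog")
```

In this reading, a directory containing snapshot-3, snapshot-4, snapshot-5, expire-snapshot-3, expire-snapshot-4 and expire-changelog-3 has earliest effective snapshot 5 and earliest effective changelog 4.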

New config

  • changelog.time-retained
  • changelog.num-retained.min
  • changelog.num-retained.max

These are the counterparts of the existing snapshot retention configs (snapshot.time-retained, snapshot.num-retained.min, snapshot.num-retained.max).
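
As an illustration of how the three options might interact, here is a small Python sketch of an expiration decision. It assumes the changelog options behave like their snapshot counterparts: always keep at least num-retained.min entries, never keep more than num-retained.max, and otherwise expire entries older than time-retained. The function name and data layout are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def changelogs_to_expire(commit_times: Dict[int, datetime],
                         now: datetime,
                         time_retained: timedelta,
                         num_retained_min: int,
                         num_retained_max: int) -> List[int]:
    """Pick changelog ids to expire, oldest first (assumed semantics)."""
    ids = sorted(commit_times)
    total = len(ids)
    expired: List[int] = []
    for i, changelog_id in enumerate(ids):
        remaining = total - i  # entries still retained, including this one
        if remaining <= num_retained_min:
            break  # never drop below the minimum
        too_many = remaining > num_retained_max
        too_old = now - commit_times[changelog_id] > time_retained
        if too_many or too_old:
            expired.append(changelog_id)
        else:
            break  # expiration proceeds contiguously from the oldest entry
    return expired
```

With changelog.time-retained set to one day and snapshot retention kept short, this kind of rule would leave a day of consumable changelog while the older data files are released.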

Subtask

  • Introduce ExpireChangelogImpl to handle changelog expiration
  • Integrate InnerStreamScan with the changelog metadata
  • Make the orphan file cleaner aware of the changelog metadata
  • Support decoupling the lifecycle of delta files
  • Support merging small changelog files

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Metadata

Assignees

No one assigned

Labels

enhancement (New feature or request)
