Description
Search before asking
- I searched in the issues and found nothing similar.
Motivation
Currently, the changelog's lifecycle is bound to the snapshot's: when a snapshot expires, its changelog expires with it. If we want to keep the changelog longer so that a streaming job can reset its consumption position, we also have to keep the data files referenced by the snapshot. This is unnecessary and wastes a lot of storage space.
In a traditional database, the binlog has its own independent lifecycle. In Paimon, we could likewise decouple the lifecycles of the snapshot and the changelog, so that users can choose how long to keep changelog data. E.g., a user can keep only the latest snapshot plus one day of changelog for consumption.
How to implement
The lifecycle is handled in the expiration process. To maintain two different lifecycles, we need two groups of hint files that mark the effective changelog and the effective snapshot:
- EARLIEST and LATEST mark the effective data files
- EARLIEST_CHANGELOG and LATEST mark the effective changelog
LATEST should be the same for both, and EARLIEST_CHANGELOG is only present when the changelog is enabled.
However, the hint files are not always present, so we need a way to work when they are missing. We therefore introduce the marker files expire-snapshot-x and expire-changelog-x, which are written when the corresponding object is expired. We create the marker file first; once both marker files exist, we can delete the snapshot metadata, and after that the marker files are deleted as well. EARLIEST and EARLIEST_CHANGELOG are always updated after the marker file is created. So, if the EARLIEST or EARLIEST_CHANGELOG file is missing or inaccurate, we can determine the current effective snapshot and changelog from the expire-snapshot-x and expire-changelog-x files.
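The recovery path could look roughly like the sketch below. It assumes the snapshot directory listing contains metadata files named snapshot-x next to the proposed expire-snapshot-x and expire-changelog-x markers, and that the earliest effective object is the smallest snapshot id without a matching marker. The class and method names are hypothetical, not part of the Paimon code base.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.OptionalLong;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: recovers the earliest effective id from a directory
// listing when the EARLIEST / EARLIEST_CHANGELOG hint file is missing.
final class ExpireMarkerSketch {
    static final String SNAPSHOT_PREFIX = "snapshot-";
    static final String EXPIRE_SNAPSHOT_PREFIX = "expire-snapshot-";
    static final String EXPIRE_CHANGELOG_PREFIX = "expire-changelog-";

    // Returns the smallest snapshot id that has no marker with the given
    // prefix, or empty if every listed snapshot is already marked as expired.
    static OptionalLong earliestEffective(Collection<String> fileNames, String markerPrefix) {
        TreeSet<Long> snapshotIds = new TreeSet<>();
        Set<Long> expiredIds = new HashSet<>();
        for (String name : fileNames) {
            if (name.startsWith(markerPrefix)) {
                expiredIds.add(Long.parseLong(name.substring(markerPrefix.length())));
            } else if (name.startsWith(SNAPSHOT_PREFIX)) {
                snapshotIds.add(Long.parseLong(name.substring(SNAPSHOT_PREFIX.length())));
            }
            // Other files (hint files, markers of the other kind) are ignored.
        }
        for (long id : snapshotIds) {
            if (!expiredIds.contains(id)) {
                return OptionalLong.of(id);
            }
        }
        return OptionalLong.empty();
    }
}
```

With a listing of snapshot-3, snapshot-4, snapshot-5, expire-snapshot-3, expire-snapshot-4 and expire-changelog-3, this sketch reports snapshot 5 as the earliest effective snapshot and 4 as the earliest effective changelog, since the data files of snapshot 4 are expired while its changelog is still retained.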
New config
- changelog.time-retained
- changelog.num-retained.min
- changelog.num-retained.max
These are the counterparts of the snapshot retention configs (snapshot.time-retained, snapshot.num-retained.min, snapshot.num-retained.max).
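To illustrate how the three options might interact, here is a minimal sketch. It assumes they follow the same semantics as the snapshot options: never keep fewer than the minimum, never keep more than the maximum, and otherwise expire entries older than the retained time. The class and method names are hypothetical, and the real expiration logic may differ.

```java
import java.time.Duration;
import java.util.List;

// Hypothetical sketch of the retention decision for changelogs.
final class ChangelogRetentionSketch {
    // commitMillis holds the commit times of the existing changelogs,
    // sorted oldest first. Returns how many of the oldest may be expired.
    static int expirableCount(
            List<Long> commitMillis,
            long nowMillis,
            Duration timeRetained,
            int minRetained,
            int maxRetained) {
        int total = commitMillis.size();
        // changelog.num-retained.min: never expire below this many entries.
        int upperBound = Math.max(0, total - minRetained);
        // changelog.num-retained.max: expire enough to stay within the cap.
        int count = Math.min(upperBound, Math.max(0, total - maxRetained));
        // changelog.time-retained: keep expiring entries older than the cutoff.
        long cutoff = nowMillis - timeRetained.toMillis();
        while (count < upperBound && commitMillis.get(count) < cutoff) {
            count++;
        }
        return count;
    }
}
```

Under these assumptions, the motivating example of "one latest snapshot and one day's changelog" would correspond to snapshot.num-retained.max set to 1 and changelog.time-retained set to one day.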
Subtask
- Introduce ExpireChangelogImpl to handle changelog expiration
- Integrate InnerStreamScan with the changelog metadata
- Make the orphan file cleaner handle the changelog metadata
- Support decoupling the lifecycle of delta files
- Support merging small changelog files
Anything else?
No response
Are you willing to submit a PR?
- I'm willing to submit a PR!