[Feature] Support Efficient Compaction for Storage Optimization and Query Performance #93

@lxy-9602

Description

Search before asking

  • I searched in the issues and found nothing similar.

Motivation

Compaction is essential for maintaining high performance and storage efficiency in modern data systems. Key benefits include:

  • For Append Tables: Reduces small files by merging existing data files, improving scan performance and metadata scalability.
  • For Primary Key (PK) Tables: Reduces the number of segments that must be merged at read time (merge-on-read), which speeds up queries.
  • For PK+DV Tables: Produces deletion vector (DV) files that mark outdated rows, so readers can filter those rows out instead of merging at read time.

Currently, the lack of a dedicated compaction mechanism limits our ability to optimize storage layout and query latency.
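
To make the PK-table points above concrete, here is a minimal Python sketch. It is only an illustration, not project code and not Paimon's on-disk format: the `Row` layout and the helper names `merge_on_read`, `build_deletion_vectors`, and `read_with_dv` are all hypothetical. It shows why read cost grows with the number of segments, and how deletion vectors let a reader skip stale rows by position.

```python
import heapq
from typing import Dict, List, Set, Tuple

Row = Tuple[int, int, str]  # (primary key, sequence number, value); assumed layout


def merge_on_read(runs: List[List[Row]]) -> List[Row]:
    """Merge sorted runs, keeping the highest-sequence row per key.

    Every read pays for a k-way merge over all runs, so fewer runs
    (after compaction) means cheaper reads.
    """
    latest: Dict[int, Row] = {}
    for row in heapq.merge(*runs):
        key, seq, _ = row
        if key not in latest or seq > latest[key][1]:
            latest[key] = row
    return [latest[k] for k in sorted(latest)]


def build_deletion_vectors(files: Dict[str, List[Row]]) -> Dict[str, Set[int]]:
    """For each file, collect positions of rows superseded by a newer row."""
    newest: Dict[int, Tuple[int, str, int]] = {}  # key -> (seq, file, position)
    dvs: Dict[str, Set[int]] = {name: set() for name in files}
    for name, rows in files.items():
        for pos, (key, seq, _) in enumerate(rows):
            prev = newest.get(key)
            if prev is None:
                newest[key] = (seq, name, pos)
            elif seq > prev[0]:
                dvs[prev[1]].add(prev[2])  # older row becomes stale
                newest[key] = (seq, name, pos)
            else:
                dvs[name].add(pos)  # this row is already stale
    return dvs


def read_with_dv(files: Dict[str, List[Row]], dvs: Dict[str, Set[int]]) -> List[Row]:
    """Scan files independently, skipping positions marked in the DV."""
    return sorted(
        row
        for name, rows in files.items()
        for pos, row in enumerate(rows)
        if pos not in dvs[name]
    )


files = {
    "f1": [(1, 1, "a"), (2, 1, "b")],
    "f2": [(2, 2, "B"), (3, 1, "c")],
}
dvs = build_deletion_vectors(files)
merged = merge_on_read(list(files.values()))
filtered = read_with_dv(files, dvs)
```

In this toy example both read paths return the same rows, but the DV path needs no cross-file merge at read time; the merge work is moved into compaction.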

Solution

The compaction framework should support the following capabilities:

  1. Support for both append tables and primary key (PK) tables, with appropriate strategies for each;
  2. Execution via background tasks or manual triggers, allowing flexibility in operation;
  3. Built-in basic compaction policies aligned with Java Paimon;
  4. Generation of deletion vector (DV) files during or after compaction to track stale rows;
  5. Design support for data-evolution scenarios, including both vertical compaction (merging small files) and horizontal compaction (consolidating partial-column files);
  6. Output data format that is fully compatible with Java Paimon.
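
As a rough illustration of what a basic append-table policy from items 1 and 3 could look like, here is a small Python sketch. It is an assumption for discussion only and does not claim to match Java Paimon's actual policy: `DataFile`, `pick_compaction_tasks`, `target_size`, and `min_files` are hypothetical names. It groups files smaller than a target size into tasks, each of which would be rewritten as one larger file.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataFile:
    name: str
    size: int  # bytes


def pick_compaction_tasks(
    files: List[DataFile], target_size: int, min_files: int = 2
) -> List[List[DataFile]]:
    """Group small files into tasks whose combined size reaches target_size.

    Files already at or above target_size are left alone. A trailing
    group is only emitted if it contains at least min_files files.
    """
    small = sorted((f for f in files if f.size < target_size), key=lambda f: f.size)
    tasks: List[List[DataFile]] = []
    current: List[DataFile] = []
    current_size = 0
    for f in small:
        current.append(f)
        current_size += f.size
        if current_size >= target_size:
            tasks.append(current)
            current, current_size = [], 0
    if len(current) >= min_files:
        tasks.append(current)
    return tasks


files = [
    DataFile("a", 10),
    DataFile("b", 20),
    DataFile("c", 200),
    DataFile("d", 30),
    DataFile("e", 90),
]
tasks = pick_compaction_tasks(files, target_size=100)
task_names = [[f.name for f in task] for task in tasks]
```

The same picker could be called from a background task or from a manual trigger (item 2), since it only takes the current file list as input.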

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Metadata

Labels: enhancement
