Fragment-Level Merge/Update columns interface for Java

Distributed column merging is a crucial feature that enables flexible schema evolution for users. In big data engine scenarios, columns must be updated in a distributed manner, with each concurrent task operating only on its own set of fragments.

This is complex functionality. At the compute-engine level, the most fundamental requirement is the ability to merge and update columns at fragment granularity.

The purpose of this issue is to add column-level merge and update APIs to the Java module. These APIs will serve as the foundation for the subsequent implementation of distributed column merging/updating in Spark and Flink.

This issue mainly contains two interfaces:
1. MergeColumn interface: the Java-side `fragment#merge_columns` method. This method merges new columns into the target fragment and returns a new fragment with the same FragId and the merged Schema.
2. UpdateColumn interface:
In a distributed scenario, modifying the schema of each fragment individually is risky. For this reason, we introduce an update column interface, which is semantically similar to Paimon's PartialUpdate mechanism. It builds on the existing `merge_columns` logic with the following key characteristics:
  a. The new batch updates an existing column rather than adding a new one.
  b. A join is still performed based on the values of a specified key column. For each matched row, the new value is determined by comparing the incoming value (`value_new`) with the current value (`value_old`):
     1. If `value_new` is present and not null, it is used as the updated value.
     2. If `value_new` is not present or null, but `value_old` is not null, the original `value_old` is kept.
     3. Otherwise (if both are null), the new value is null.
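
The three rules above amount to "take the incoming value when it is non-null, otherwise keep the old one". A minimal sketch of that resolution follows. It uses plain Java collections in place of real fragment data, and the class and method names (`UpdateColumnSketch`, `resolve`, `updateColumn`) are hypothetical, chosen for illustration only; this is not the proposed Java API. It also assumes that a key missing from the incoming batch is treated the same as a null `value_new`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: plain collections stand in for fragment data. */
public class UpdateColumnSketch {

    /** Rules 1-3: a non-null incoming value wins; otherwise the old value (possibly null) is kept. */
    public static <V> V resolve(V valueOld, V valueNew) {
        return valueNew != null ? valueNew : valueOld;
    }

    /**
     * Joins the incoming batch to the existing rows on the key column and
     * returns the updated column values, in the original row order.
     */
    public static <K, V> List<V> updateColumn(List<K> keys, List<V> oldValues, Map<K, V> batch) {
        List<V> updated = new ArrayList<>(keys.size());
        for (int i = 0; i < keys.size(); i++) {
            // null when the key is missing from the batch or maps to null
            V valueNew = batch.get(keys.get(i));
            updated.add(resolve(oldValues.get(i), valueNew));
        }
        return updated;
    }

    public static void main(String[] args) {
        List<Integer> keys = Arrays.asList(1, 2, 3, 4);
        List<String> oldValues = Arrays.asList("a", "b", null, null);

        Map<Integer, String> batch = new HashMap<>();
        batch.put(1, "A");   // rule 1: replaces "a"
        batch.put(2, null);  // rule 2: null incoming value, "b" is kept
        batch.put(3, "C");   // rule 1: fills a previously null cell
        // key 4 is absent and its old value is null: rule 3, stays null

        System.out.println(updateColumn(keys, oldValues, batch)); // [A, b, C, null]
    }
}
```

Because a null incoming value never overwrites existing data, several jobs that each fill in a different subset of rows can be applied in any order without erasing each other's results for non-overlapping cells.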

The UpdateColumn interface is based on the observation that in big data scenarios, users rarely add a complete set of new columns to a table all at once. Instead, they tend to add columns incrementally over the course of multiple jobs, as illustrated in the diagram below:

<img width="922" height="614" alt="Image" src="https://github.com/user-attachments/assets/e2923b32-8c7b-40df-b0d4-9ca6fcae08c5" />

Without the UpdateColumn interface, the result might be:

<img width="706" height="882" alt="Image" src="https://github.com/user-attachments/assets/1542ce03-7270-41d1-b13a-441bf8665335" />

With UpdateColumn, we get the ideal result:

<img width="748" height="784" alt="Image" src="https://github.com/user-attachments/assets/f0eaa2e5-c751-4740-a1a7-47e5264fc30b" />


Fragment-Level Merge/Update columns interface for Java #4650
