Distributed column merging is a crucial feature that enables flexible schema evolution for users. In big data engine scenarios, we need to update columns in a distributed manner, where specific concurrent tasks operate only on specific fragments.
This is a complex functionality. At the computation engine level, the most fundamental requirement is the ability to perform merge and update operations on columns at the fragment granularity.
The purpose of this issue is to add column-level merge and update APIs to the Java module. This will serve as the cornerstone for the subsequent implementation of distributed column merging/updating in Spark and Flink.
This issue mainly contains two interfaces:
- MergeColumn interface: the Java-side fragment#merge_columns method. This method merges new columns into the target fragment and returns a new fragment with the same FragId and the merged Schema.
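To make the intended semantics concrete, the following is a minimal, self-contained sketch of what merging one new column into a fragment's rows looks like. The class and method names here are hypothetical illustrations, not the module's actual API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of merge-columns semantics, not the module's actual API:
// join one new column onto existing rows by key, widening each row by one slot.
public final class MergeColumnsSketch {
    // existing: key -> current row values; newCol: key -> the new column's value
    static Map<String, Object[]> mergeColumn(Map<String, Object[]> existing,
                                             Map<String, ?> newCol) {
        Map<String, Object[]> merged = new LinkedHashMap<>();
        for (Map.Entry<String, Object[]> e : existing.entrySet()) {
            Object[] row = e.getValue();
            Object[] widened = new Object[row.length + 1];
            System.arraycopy(row, 0, widened, 0, row.length);
            // Rows with no match in the new batch get null for the new column.
            widened[row.length] = newCol.get(e.getKey());
            merged.put(e.getKey(), widened);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object[]> existing = new LinkedHashMap<>();
        existing.put("k1", new Object[]{"a"});
        existing.put("k2", new Object[]{"b"});
        Map<String, Object[]> merged = mergeColumn(existing, Map.of("k1", 10));
        System.out.println(merged.get("k1")[1]); // 10
        System.out.println(merged.get("k2")[1]); // null (no match for k2)
    }
}
```

Note that rows with no matching key in the new batch receive null for the new column, which is exactly the gap the UpdateColumn interface below is designed to fill across multiple jobs.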
- UpdateColumn interface:
In a distributed scenario, modifying the schema for each fragment individually is a risky operation. For this reason, we introduce an update column interface, which is semantically similar to Paimon's PartialUpdate mechanism. It builds upon the existing merge_column logic with the following key characteristics:
a. The new batch updates an existing column rather than adding a new one.
b. A join is still performed based on the values of a specified key column. For each matched row, the new value is determined by comparing the incoming value (value_new) with the current value (value_old):
- If value_new is present and not null, it is used as the updated value.
- If value_new is not present or null, but value_old is not null, the original value_old is kept.
- Otherwise (if both are null), the new value is null.
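The per-row resolution rule above can be expressed compactly as follows. This is a hypothetical helper written to mirror the described semantics; the actual UpdateColumn implementation in the module may differ:

```java
// Hypothetical helper expressing the per-row resolution rule in Java form;
// the actual UpdateColumn implementation in the module may differ.
public final class PartialUpdateRule {
    public static <T> T resolve(T valueNew, T valueOld) {
        // Prefer the incoming value when it is present and non-null.
        if (valueNew != null) {
            return valueNew;
        }
        // Otherwise keep the existing value; result is null only when both are null.
        return valueOld;
    }

    public static void main(String[] args) {
        System.out.println(resolve("v2", "v1"));          // incoming value wins
        System.out.println(resolve(null, "v1"));          // existing value kept
        System.out.println(resolve((String) null, null)); // both null -> null
    }
}
```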
The UpdateColumn interface is based on the observation that in big data scenarios, users rarely add a complete set of new columns to a table all at once. Instead, they tend to add columns incrementally over the course of multiple jobs, as illustrated in the diagram below:
Without the UpdateColumn interface, the result might be:
While with UpdateColumn, we can get the ideal result:
