Skip to content

Fragment-Level Merge/Update columns interface for Java #4650

@steFaiz

Description

@steFaiz

Distributed column merging is a crucial feature that enables flexible schema evolution for users. In big data engine scenarios, we need to update columns in a distributed manner, where specific concurrent tasks operate only on specific fragments.

This is a complex functionality. At the computation engine level, the most fundamental requirement is the ability to perform merge and update operations on columns at the fragment granularity.

The purpose of this issue is to add column-level merge and update APIs to the Java module. This will serve as the cornerstone for the next implementation of distributed column merging/updating in Spark and Flink.

This issue mainly contains two interfaces:

  1. MergeColumn interface: The jave side fragment#merge_columns method. This method will merge new columns into target fragment. A new fragment with the same FragId as well as the merged Schema will be returned.
  2. UpdateColumn interface:
    In a distributed scenario, modifying the schema for each fragment individually is a risky operation. For this reason, we introduce an update column interface, which is semantically similar to Paimon's PartialUpdate mechanism. It builds upon the existing merge_column logic with the following key characteristics:
    a. The new batch updates an existing column rather than adding a new one.
    b. A join is still performed based on the values of a specified key column. For each matched row, the new value is determined by comparing the incoming value (value_new) with the current value (value_old):
    1. If value_new is present and not null, it is used as the updated value.
    2. If value_new is not present or null, but value_old is not null, the original value_old is kept.
    3. Otherwise (if both are null), the new value is null.

The UpdateColumn interface is based on the observation that in big data scenarios, users rarely add a complete set of new columns to a table all at once. Instead, they tend to add columns incrementally over the course of multiple jobs, as illustrated in the diagram below:

Image

Without UpdateColumn interface, the result might be:

Image

While with UpdateColumn, we can get the ideal result:

Image

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions