feat: support fragment-level update columns by wayneli-vt · Pull Request #4715 · lance-format/lance

wayneli-vt · 2025-09-12T18:19:38Z

As described in #4650, this PR implements a fragment-level column update operation in Rust and exposes it through the Java API. This allows for efficient, partial updates of a dataset without copying irrelevant fields.

The update mechanism follows these rules:

Update existing rows: For any row identified by a given key in the target table, if a corresponding update value is present in the source data, the target row's value is replaced. This applies even if the new value is null.
Preserve untouched rows: If a row in the target table has no corresponding update in the source data, it remains unchanged.
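
As a rough illustration of the two rules above, here is a minimal Rust sketch that works on plain in-memory rows rather than Lance fragments. The `update_column` function, the `(key, value)` row layout, and the use of `Option<i64>` for nullable values are assumptions made for this example; they are not the API added by this PR.

```rust
use std::collections::HashMap;

/// Hypothetical helper: apply updates to one column of one fragment.
/// `target` holds (key, current value) pairs; `None` models a null value.
/// `source` maps a key to its new value; `None` means "set to null".
fn update_column(
    target: &[(u64, Option<i64>)],
    source: &HashMap<u64, Option<i64>>,
) -> Vec<(u64, Option<i64>)> {
    target
        .iter()
        .map(|&(key, old)| match source.get(&key) {
            // A matching source row always wins, even when its value is null.
            Some(&new) => (key, new),
            // No matching source row: the existing value is preserved.
            None => (key, old),
        })
        .collect()
}
```

For example, with target rows `(1, 10), (2, 20), (3, null)` and source updates `1 -> 11, 2 -> null`, the sketch returns `(1, 11), (2, null), (3, null)`: key 2 is overwritten with null, and key 3 is left alone.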

The image below illustrates this process:

[image: https://github.com/user-attachments/assets/d3629f88-e3f2-4f9d-b0e9-379d3dd482d1]

Important: It is the responsibility of the calling engine (e.g., Spark, Flink) to ensure that the update data (the source) is correctly partitioned and routed to the corresponding fragments of the target table. The core operation assumes it receives the correct data for the fragment it is operating on.
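
One way an engine might do this routing is sketched below. It assumes each source row carries a 64-bit row address whose upper 32 bits are the fragment id and whose lower 32 bits are the row offset inside that fragment. That layout, and the `split_row_address` and `route_by_fragment` helpers, are assumptions of this sketch; the PR itself does not prescribe how the engine finds the target fragment.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: split a 64-bit row address into
/// (fragment id, row offset within the fragment).
fn split_row_address(addr: u64) -> (u32, u32) {
    ((addr >> 32) as u32, addr as u32)
}

/// Hypothetical helper: group (row address, new value) updates by fragment id,
/// so each group can be handed to the fragment it belongs to.
fn route_by_fragment(updates: &[(u64, i64)]) -> BTreeMap<u32, Vec<(u32, i64)>> {
    let mut routed: BTreeMap<u32, Vec<(u32, i64)>> = BTreeMap::new();
    for &(addr, value) in updates {
        let (fragment_id, offset) = split_row_address(addr);
        // Keep the original order of updates within each fragment.
        routed.entry(fragment_id).or_default().push((offset, value));
    }
    routed
}
```

Each entry of the returned map would then be passed to the update call for that fragment only, which is the precondition the core operation relies on.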

steFaiz · 2025-09-13T07:52:05Z

@westonpace @jackye1995
PTAL if you have some time! We introduce the fragment-level update interface just like the merge_insert of Dataset, except that deletion is not considered here.

codecov-commenter · 2025-09-15T16:40:35Z

Codecov Report

❌ Patch coverage is 69.29134% with 78 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.85%. Comparing base (492e773) to head (1a781db).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
rust/lance/src/dataset/fragment.rs	78.23%	29 Missing and 8 partials ⚠️
rust/lance/src/dataset/hash_joiner.rs	58.90%	26 Missing and 4 partials ⚠️
rust/lance/src/dataset.rs	0.00%	11 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4715      +/-   ##
==========================================
- Coverage   80.87%   80.85%   -0.03%     
==========================================
  Files         332      332              
  Lines      131687   131941     +254     
  Branches   131687   131941     +254     
==========================================
+ Hits       106507   106684     +177     
- Misses      21430    21493      +63     
- Partials     3750     3764      +14

Flag	Coverage Δ
unittests	`80.85% <69.29%> (-0.03%)`	⬇️

wayneli-vt · 2025-09-17T08:36:59Z

@jackye1995 @westonpace Hi, I've addressed all the comments. Could you please take another look? Thanks very much!

majin1102

Hi, left one comment. Mainly my own confusion.
PTAL

majin1102 · 2025-09-17T12:55:17Z

                updated_fragments,
                new_fragments,
-                fields_modified: vec![],
+                fields_modified: updated_field_ids_unsafe,


There's context about this:
#4408 (comment)

I checked the usage in UpdateTest. I don't know whether this should be used in this update scenario, but I think it's a little confusing here. If a normal update leaves fields_modified empty, why does updating a whole column (where the real change may be limited) make fields_modified non-empty? The field list itself is not changed.

That comment said we should use fields_modified when we add a new column when upserting. That makes sense to me because the fields list is actually modified.

Yeah, the updater actually adds a new column and marks a tombstone.

I wonder if this should belong to the Update operation, because the code path is more like merging existing columns. In my mind, a normal update is split into Delete and Insert as a kind of row-oriented operation, which is quite different from the column-oriented case mentioned in this PR.

Or, put another way: if we wanted to define SQL for this DML, should it be a merge or an update (or maybe merge and update)? From a syntax perspective, I think this case matches "merge into ... when matched ... update". Here is a complete example:

MERGE INTO orders AS o USING new_orders AS no ON o.order_id = no.order_id WHEN MATCHED THEN UPDATE SET o.quantity = no.quantity

There is also an issue raised to discuss the difference between Update and DataReplacement: #4744. I think both point to the same question of operation categories.

In my mind, there could be three clear categories of data operations:

row-oriented: append, update, delete

column-oriented: merge, merge_update (for this case, in my mind) / data_replacement

row-column-mixed: merge_insert

I think the categories are necessary because column operations are, for now, a speciality of Lance. There's also a concern that a misaligned Update may lead to higher educational and usage costs, like schema evolution.

WDYT? @westonpace @jackye1995

> In my mind, a normal update is split into Delete and Insert as row-oriented operations, which is quite different from the column-oriented case mentioned in this PR.

> I think we could ask this question another way: if we want to define SQL for this DML, should it be a merge or an update?

@majin1102 I think the key point is that row-oriented operations and column-oriented operations are fundamentally different in nature, so perhaps we don't need to unify the concepts between them.
The main motivation for proposing the UpdateColumn interface is to address the issue raised in #4650: MergeColumn can only merge one complete column at a time and brings some schema-level risks. This column-level update is more similar to Paimon's PartialUpdate mechanism: new data can fill in nulls in existing columns or overwrite specific values, rather than causing existing data to be lost.

> I think the key point is that row-oriented operations and column-oriented operations are fundamentally different in nature, so perhaps we don't need to unify the concepts between them

I think the concepts are meaningful for a format because it should provide clear syntax for a variety of engines. As you said, row-oriented and column-oriented operations are fundamentally different in nature. If we could clarify them, I think several things would get better:

Engines will have a clear choice to make. For example, this case has a merge syntax, and I could actually commit it as a merge operation and get the same result.

When we list transactions, we can tell what happened from the operation type.

The educational cost.

> MergeColumn can only merge one complete column at a time and brings some schema-level risks

Indeed, I agree this should be a different operation from MergeColumn. I just wonder whether it should be Update.

> This column-level update is more similar to Paimon's PartialUpdate mechanism

Glad to see this introduced, since Lance has fragments, which makes column-level update a more systematic capability. And I don't think Paimon has the kind of systematic column operation that Lance has, at least for now (I mean, for Paimon the concept of PartialUpdate seems reasonable to me).

I think I've convinced myself about using Update.

The column-oriented part seems physical, not logical, so we could consider the final commit as row changes, since the logical columns don't change.

Another issue is conflict resolution. It seems the Update operation has been designed to accept physical field changes and deal with them.

Good job on this!

wayneli-vt · 2025-09-19T09:23:54Z

Thanks for the reviews on this so far!

I'm moving this PR to draft status as it conflicts with the recently merged #4589.
For context, #4589 expanded Operation::Update with new fields:

    Update {
        // ... existing fields ...
        // 👇 New fields from #4589
        /// The fields that used to judge whether to preserve the new frag's id into
        /// the frag bitmap of the specified indices.
        fields_for_preserving_frag_bitmap: Vec<u32>,
        /// The mode of update
        update_mode: Option<UpdateMode>,
    },

I'm going to open a separate, smaller PR to expose these new fields to the Java layer first.

Once that compatibility PR is merged, I'll update this branch and ping here when it is ready for another look!

wayneli-vt · 2025-09-26T05:19:20Z

@jackye1995 @westonpace @majin1102 Hi, this pr is ready for review! PTAL when you have a moment. Thanks!

jackye1995

mostly looks good to me, just a nit

westonpace

Looks good from my perspective (just looking at rust changes)

jackye1995

Looks good to me!

As described in lance-format#4650, this PR implements a fragment-level column update operation in Rust and exposes it through the Java API. This allows for efficient, partial updates of a dataset without copying irrelevant fields.

The update mechanism follows these rules:

* **Update existing rows**: For any row identified by a given key in the target table, if a corresponding update value is present in the source data, the target row's value is replaced. This applies even if the new value is null.
* **Preserve untouched rows**: If a row in the target table has no corresponding update in the source data, it remains unchanged.

The image below illustrates this process:

<img width="536" height="289" alt="image" src="https://github.com/user-attachments/assets/d3629f88-e3f2-4f9d-b0e9-379d3dd482d1" />

**Important:** It is the responsibility of the calling engine (e.g., Spark, Flink) to ensure that the update data (the source) is correctly partitioned and routed to the corresponding fragments of the target table. The core operation assumes it receives the correct data for the fragment it is operating on.

---------

Co-authored-by: Weiren <litaiwei.lwt@antgroup.com>

fangbo · 2026-01-15T09:53:46Z

@wayneli-vt There is an error when updating a Struct field using FileFragment.update_columns. The details are in #5717. Could you please take a look?

wayneli-vt added 4 commits September 11, 2025 18:18

rust implement of updating columns

fab17ea

Java interface implementation for column updates

1f450c1

code format

b880d4d

cargo.lock roll back

a0298e6

github-actions Bot added enhancement New feature or request java labels Sep 12, 2025

wayneli-vt closed this Sep 12, 2025

comment fix

38626c5

wayneli-vt reopened this Sep 12, 2025

Merge branch 'main' into support-update-columns

a821417

code format

6230ef3

jackye1995 reviewed Sep 15, 2025

westonpace mentioned this pull request Sep 16, 2025

Do we still need DataReplacement operation? #4744

Open

wayneli-vt added 3 commits September 17, 2025 15:29

modify code according to reviews

db2b44a

modify comments

81095bd

remove unnecessary files

d0ff790

majin1102 requested changes Sep 17, 2025

majin1102 approved these changes Sep 17, 2025

wayneli-vt added 2 commits September 19, 2025 12:25

remove nullable check

4d73487

Merge remote-tracking branch 'community/main' into support-update-columns

4487bf3

wayneli-vt marked this pull request as draft September 19, 2025 09:52

wayneli-vt added 3 commits September 25, 2025 23:09

merge main

c961ebe

code format && rename

5bfb8d2

code format

e6ce02c

wayneli-vt marked this pull request as ready for review September 25, 2025 17:31

Merge branch 'main' into support-update-columns

8140da2

jackye1995 reviewed Sep 26, 2025

Comment thread rust/lance/src/dataset.rs Outdated

jackye1995 approved these changes Sep 26, 2025

modified according to review

6de3816

westonpace approved these changes Sep 29, 2025

Comment thread rust/lance/src/dataset/fragment.rs Outdated

Comment thread rust/lance/src/dataset/fragment.rs

wayneli-vt and others added 3 commits September 30, 2025 00:10

address review feedback on schema validation and comments

d3ea393

Merge branch 'main' into support-update-columns

b5ad4f2

Merge branch 'main' into support-update-columns

1a781db

jackye1995 approved these changes Oct 2, 2025

jackye1995 merged commit e55f266 into lance-format:main Oct 2, 2025
29 checks passed

wojiaodoubao mentioned this pull request Oct 2, 2025

fix: typo in ut test_fragment_update #4878

Merged

xloya mentioned this pull request Oct 20, 2025

Support fragment level update columns in Python SDK #5000

Closed

wayneli-vt deleted the support-update-columns branch October 27, 2025 10:26

This was referenced Jan 2, 2026

feat: support update_columns for fragment operator #3539

Closed

_rowaddr and _rowid not exposed for merge_insert #3439

Open

fangbo mentioned this pull request Jan 5, 2026

Support update using UpdateMode::RewriteColumns lance-format/lance-spark#166

Open

7 participants
