feat: support fragment-level update columns #4715

Merged
jackye1995 merged 20 commits into lance-format:main from wayneli-vt:support-update-columns
Oct 2, 2025

@wayneli-vt
Contributor

As described in #4650, this PR implements a fragment-level column update operation in Rust and exposes it through the Java API. This allows for efficient, partial updates of a dataset without copying irrelevant fields.

The update mechanism follows these rules:

  • Update existing rows: For any row identified by a given key in the target table, if a corresponding update value is present in the source data, the target row's value is replaced. This applies even if the new value is null.
  • Preserve untouched rows: If a row in the target table has no corresponding update in the source data, it remains unchanged.
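
A minimal sketch of the two rules above, using only the Rust standard library. The `Row` type, the `update_column` function, and the single nullable `i64` column are hypothetical names chosen for illustration. This is not the code added in this PR, which operates on Lance fragments and Arrow data; it only shows the intended per-row semantics under those assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical row of a target fragment: a join key plus one nullable column.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    key: u64,
    value: Option<i64>,
}

/// Apply `source` updates to the rows of one fragment (illustrative only).
/// - A key present in `source` replaces the target value, even with `None`.
/// - A key absent from `source` leaves the target row unchanged.
fn update_column(target: &[Row], source: &[(u64, Option<i64>)]) -> Vec<Row> {
    // Build a hash index over the source; if a key repeats, the last entry wins.
    let updates: HashMap<u64, Option<i64>> = source.iter().copied().collect();

    target
        .iter()
        .map(|row| match updates.get(&row.key) {
            // Matched: overwrite the value, including an explicit null.
            Some(new_value) => Row { key: row.key, value: *new_value },
            // Not matched: keep the original row as is.
            None => row.clone(),
        })
        .collect()
}
```

With a target of keys 1, 2, 3 and a source of `(1, Some(11))` and `(2, None)`, key 1 is overwritten, key 2 becomes null, and key 3 is preserved.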

The image below illustrates this process:
image

Important: It is the responsibility of the calling engine (e.g., Spark, Flink) to ensure that the update data (the source) is correctly partitioned and routed to the corresponding fragments of the target table. The core operation assumes it receives the correct data for the fragment it is operating on.
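
One way an engine might do this routing is sketched below. It assumes each source row carries a Lance row address, a `u64` whose upper 32 bits are the fragment ID and whose lower 32 bits are the row offset within that fragment. The PR does not say which key engines should route on, so `split_row_address`, `route_by_fragment`, and the input shape are hypothetical.

```rust
use std::collections::BTreeMap;

/// Split a 64-bit row address into (fragment id, row offset in the fragment).
fn split_row_address(addr: u64) -> (u32, u32) {
    ((addr >> 32) as u32, addr as u32)
}

/// Group (row address, new value) pairs by the fragment they belong to, so each
/// group can be handed to the update call for that fragment (illustrative only).
fn route_by_fragment(
    updates: &[(u64, Option<i64>)],
) -> BTreeMap<u32, Vec<(u32, Option<i64>)>> {
    let mut routed: BTreeMap<u32, Vec<(u32, Option<i64>)>> = BTreeMap::new();
    for &(addr, value) in updates {
        let (fragment_id, offset) = split_row_address(addr);
        // Source order is preserved within each fragment's batch.
        routed.entry(fragment_id).or_default().push((offset, value));
    }
    routed
}
```

Each entry in the returned map holds only the updates for one fragment, which is the precondition the core operation relies on.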

@github-actions github-actions bot added the enhancement and java labels Sep 12, 2025
@wayneli-vt wayneli-vt closed this Sep 12, 2025
@wayneli-vt wayneli-vt reopened this Sep 12, 2025
@steFaiz
Collaborator

steFaiz commented Sep 13, 2025

@westonpace @jackye1995
PTAL if you have some time! We introduce a fragment-level update interface, similar to merge_insert on Dataset, except that deletion is not considered here.

Comment thread java/src/main/java/com/lancedb/lance/Constants.java Outdated
Comment thread java/src/test/java/com/lancedb/lance/operation/UpdateTest.java Outdated
Comment thread java/src/test/java/com/lancedb/lance/operation/UpdateTest.java Outdated
Comment thread rust/lance/src/dataset/hash_joiner.rs Outdated
@codecov-commenter

codecov-commenter commented Sep 15, 2025

Codecov Report

❌ Patch coverage is 69.29134% with 78 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.85%. Comparing base (492e773) to head (1a781db).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset/fragment.rs | 78.23% | 29 Missing and 8 partials ⚠️ |
| rust/lance/src/dataset/hash_joiner.rs | 58.90% | 26 Missing and 4 partials ⚠️ |
| rust/lance/src/dataset.rs | 0.00% | 11 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4715      +/-   ##
==========================================
- Coverage   80.87%   80.85%   -0.03%     
==========================================
  Files         332      332              
  Lines      131687   131941     +254     
  Branches   131687   131941     +254     
==========================================
+ Hits       106507   106684     +177     
- Misses      21430    21493      +63     
- Partials     3750     3764      +14     

| Flag | Coverage Δ |
|---|---|
| unittests | 80.85% <69.29%> (-0.03%) ⬇️ |

@wayneli-vt
Contributor Author

@jackye1995 @westonpace Hi, I've addressed all the comments. Could you please take another look? Thanks very much!

Contributor

@majin1102 majin1102 left a comment

Hi, left one comment. Mainly my own confusion.
PTAL

Comment thread java/lance-jni/src/transaction.rs Outdated
  updated_fragments,
  new_fragments,
- fields_modified: vec![],
+ fields_modified: updated_field_ids_unsafe,
Contributor

@majin1102 majin1102 Sep 17, 2025

There's context about this:
#4408 (comment)

I checked the usage in UpdateTest. I'm not sure whether this should be used in this update scenario, and I find it a little confusing. For a normal update, fields_modified is empty, so why does updating a whole column (where the real change may be limited) make fields_modified non-empty? The field list itself is not changed.

That comment said we should use fields_modified when we add a new column when upserting. That makes sense to me because the fields list is actually modified.

Contributor

@majin1102 majin1102 Sep 17, 2025

Yeah, the updater actually adds a new column and marks a tombstone.

I wonder whether this should belong to the Update operation, because the code path is more like merging existing columns. In my mental model, a normal update is split into Delete and Insert as a row-oriented operation, which is quite different from the column-oriented case mentioned in this PR.

Or, put another way: if we wanted to define SQL for this DML, should it be a merge or an update (or maybe merge and update)? From a syntax perspective, I think this case matches "merge into ... when matched ... update". A complete example:

MERGE INTO orders AS o
USING new_orders AS no
ON o.order_id = no.order_id
WHEN MATCHED THEN 
    UPDATE SET o.quantity = no.quantity

There is also an issue, #4744, that discusses the difference between Update and DataReplacement. I think both point to the same question of how operations are categorized.

In my mental model, there could be three clear categories of data operations:

  • row-oriented: append, update, delete
  • column-oriented: merge, merge_update (what I have in mind for this case) / data_replacement
  • row-column-mixed: merge_insert

I think the categories are necessary because column operations are, for now, a specialty of Lance. There's also a concern that a misaligned Update may lead to higher educational and usage costs, as with schema evolution.

WDYT? @westonpace @jackye1995

Collaborator

@steFaiz steFaiz Sep 17, 2025

> In my mental model, a normal update is split into Delete and Insert as row-oriented operations, which is quite different from the column-oriented case mentioned in this PR.

> I think we could ask this question another way: if we want to define SQL to do this DML, should it be a merge or an update?

@majin1102 I think the key point is that row-oriented operations and column-oriented operations are fundamentally different in nature, so perhaps we don't need to unify the concepts between them.
The main motivation for proposing the UpdateColumn interface is to address the issue raised in #4650: MergeColumn can only merge one complete column at a time and brings some schema-level risks. This column-level update is more similar to Paimon's PartialUpdate mechanism: new data can fill in nulls in existing columns or overwrite specific values, rather than causing existing data to be lost.

Contributor

@majin1102 majin1102 Sep 17, 2025

> I think the key point is that row-oriented operations and column-oriented operations are fundamentally different in nature, so perhaps we don't need to unify the concepts between them

I think the concepts are meaningful for a format because it should provide clear syntax for a variety of engines. As you said, row-oriented operations and column-oriented operations are fundamentally different in nature. If we could make that distinction explicit, I think several things would improve:

  1. Engines will have a clear choice to make. For example, this case has merge syntax, and I could actually commit it as a merge operation and get the same result.
  2. When we list transactions, we know clearly what happened from the operation type.
  3. The educational cost is lower.

> MergeColumn can only merge one complete column at a time and brings some schema-level risks

Indeed, I agree this should be a different operation from MergeColumn. I just wonder whether it should be Update.

> This column-level update is more similar to Paimon's PartialUpdate mechanism

Glad to see this introduced, since Lance has fragments, which makes column-level update a more systematic capability. I don't think Paimon has systematic column operations the way Lance does, at least for now (I mean that, for Paimon, the concept of PartialUpdate is somewhat reasonable to me).

Contributor

I think I've convinced myself about using Update.

The column orientation seems physical, not logical, so we could consider the final commit as row changes, since the logical columns don't change.

Another issue is conflict resolution. It seems the Update operation has been designed to accept physical field changes and deal with them.

Good job on this!

@wayneli-vt
Contributor Author

Thanks for the reviews on this so far!

I'm moving this PR to draft status as it conflicts with the recently merged #4589.
For context, #4589 expanded Operation::Update with new fields:

    Update {
        // ... existing fields ...
        // 👇 New fields from #4589
        /// The fields that used to judge whether to preserve the new frag's id into
        /// the frag bitmap of the specified indices.
        fields_for_preserving_frag_bitmap: Vec<u32>,
        /// The mode of update
        update_mode: Option<UpdateMode>,
    },

I'm going to open a separate, smaller PR to expose these new fields to the Java layer first.

Once that compatibility PR is merged, I'll update this branch and ping here when it is ready for another look!

@wayneli-vt wayneli-vt marked this pull request as draft September 19, 2025 09:52
@wayneli-vt wayneli-vt marked this pull request as ready for review September 25, 2025 17:31
@wayneli-vt
Contributor Author

@jackye1995 @westonpace @majin1102 Hi, this PR is ready for review! PTAL when you have a moment. Thanks!

Comment thread rust/lance/src/dataset.rs Outdated
Contributor

@jackye1995 jackye1995 left a comment

mostly looks good to me, just a nit

Member

@westonpace westonpace left a comment

Looks good from my perspective (just looking at rust changes)

Comment thread rust/lance/src/dataset/fragment.rs Outdated
Comment thread rust/lance/src/dataset/fragment.rs
Contributor

@jackye1995 jackye1995 left a comment

Looks good to me!

@jackye1995 jackye1995 merged commit e55f266 into lance-format:main Oct 2, 2025
29 checks passed
wjones127 pushed a commit to wjones127/lance that referenced this pull request Oct 3, 2025
As described in lance-format#4650, this PR
implements a fragment-level column update operation in Rust and exposes
it through the Java API. This allows for efficient, partial updates of a
dataset without copying irrelevant fields.

The update mechanism follows these rules:
* **Update existing rows**: For any row identified by a given key in the
target table, if a corresponding update value is present in the source
data, the target row's value is replaced. This applies even if the new
value is null.
* **Preserve untouched rows**: If a row in the target table has no
corresponding update in the source data, it remains unchanged.

The image below illustrates this process:
<img width="536" height="289" alt="image"
src="https://github.com/user-attachments/assets/d3629f88-e3f2-4f9d-b0e9-379d3dd482d1"
/>


**Important:** It is the responsibility of the calling engine (e.g.,
Spark, Flink) to ensure that the update data (the source) is correctly
partitioned and routed to the corresponding fragments of the target
table. The core operation assumes it receives the correct data for the
fragment it is operating on.

---------

Co-authored-by: Weiren <litaiwei.lwt@antgroup.com>
@wayneli-vt wayneli-vt deleted the support-update-columns branch October 27, 2025 10:26
@fangbo
Contributor

fangbo commented Jan 15, 2026

@wayneli-vt There is an error when updating a Struct field using FileFragment.update_columns. The details are in #5717. Could you please take a look?

jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
7 participants