feat: support fragment-level update columns#4715
jackye1995 merged 20 commits into lance-format:main
@westonpace @jackye1995
Codecov Report ❌
@@ Coverage Diff @@
## main #4715 +/- ##
==========================================
- Coverage 80.87% 80.85% -0.03%
==========================================
Files 332 332
Lines 131687 131941 +254
Branches 131687 131941 +254
==========================================
+ Hits 106507 106684 +177
- Misses 21430 21493 +63
- Partials 3750 3764 +14
@jackye1995 @westonpace Hi, I've addressed all the comments. Could you please take another look? Thanks very much!
majin1102
left a comment
Hi, I left one comment, mainly about my own confusion.
PTAL
    updated_fragments,
    new_fragments,
    fields_modified: vec![],
    fields_modified: updated_field_ids_unsafe,
There was a problem hiding this comment.
There's context about this:
#4408 (comment)
I checked the usage in UpdateTest. I'm not sure that this should not be used in this update scenario, but I find it a little confusing here. If fields_modified is empty for a normal update, why does updating a whole column (where the real change may be limited) make fields_modified non-empty? The field list itself has not changed.
That comment said we should use fields_modified when we add a new column during an upsert. That makes sense to me because the field list is actually modified.
There was a problem hiding this comment.
Yeah, the updater actually adds a new column and marks a tombstone.
I wonder whether this should belong to the Update operation, because the code path is more like merging existing columns. In my mind, a normal update is split into a Delete and an Insert as a row-oriented operation, which is quite different from the column-oriented case mentioned in this PR.
Or put another way: if we wanted to define SQL for this DML, should it be a merge or an update (or maybe merge and update)? From a syntax perspective, I think this case matches "merge into ... when matched ... update". Here is a complete example:
MERGE INTO orders AS o
USING new_orders AS no
ON o.order_id = no.order_id
WHEN MATCHED THEN
UPDATE SET o.quantity = no.quantity
There is also an issue, #4744, discussing the difference between Update and DataReplacement. I think both point to the same question of how operations are categorized.
In my mind, there could be three clear categories of data operations:
- row-oriented: append, update, delete
- column-oriented: merge, merge_update (my name for this case) / data_replacement
- row-column-mixed: merge_insert
I think the categories are necessary because column operations are, for now, a specialty of Lance. There is also a concern that a misaligned Update may lead to higher educational and usage costs, as with schema evolution.
WDYT? @westonpace @jackye1995
There was a problem hiding this comment.
> In my mind, a normal update is split into a Delete and an Insert as row-oriented operations, which is quite different from the column-oriented case mentioned in this PR.
>
> I think we could ask this question another way: if we wanted to define SQL for this DML, should it be a merge or an update?

@majin1102 I think the key point is that row-oriented operations and column-oriented operations are fundamentally different in nature, so perhaps we don't need to unify the concepts between them.
The main motivation for proposing the UpdateColumn interface is to address the problem raised in issue #4650: MergeColumn can only merge one complete column at a time and brings some schema-level risks. This column-level update is closer to Paimon's PartialUpdate mechanism: new data can fill in nulls in existing columns or overwrite specific values, rather than causing existing data to be lost.
There was a problem hiding this comment.
> I think the key point is that row-oriented operations and column-oriented operations are fundamentally different in nature, so perhaps we don't need to unify the concepts between them

I think the concepts matter for a format because it should provide clear syntax to a variety of engines. As you said, row-oriented and column-oriented operations are fundamentally different in nature; if we clarify them, I think several things get better:
- Engines have a clear choice to make. For example, this case has merge syntax, and I could actually commit it as a merge operation and get the same result.
- When we list transactions, the operation type tells us clearly what happened.
- The educational cost is lower.

> MergeColumn can only merge one complete column at a time and brings some schema-level risks

I agree this should be a different operation from MergeColumn. I just wonder whether it should be Update.

> This column-level update is more similar to Paimon's PartialUpdate mechanism

Glad to see this introduced, since Lance has fragments, which make this column-level update a more systematic capability. I don't think Paimon has a systematic column operation the way Lance does, at least for now (I mean that for Paimon the concept of PartialUpdate seems reasonable to me).
There was a problem hiding this comment.
I think I've convinced myself that Update is the right choice.
The column-oriented aspect seems physical, not logical, so we can treat the final commit as row changes since the logical columns don't change.
The other point is conflict resolution: the Update operation appears to have been designed to accept changes to physical fields and handle them.
Good job on this!
Thanks for the reviews on this so far! I'm moving this PR to draft status as it conflicts with the recently merged #4589.
Update {
    // ... existing fields ...
    // 👇 New fields from #4589
    /// The fields that used to judge whether to preserve the new frag's id into
    /// the frag bitmap of the specified indices.
    fields_for_preserving_frag_bitmap: Vec<u32>,
    /// The mode of update
    update_mode: Option<UpdateMode>,
},
I'm going to open a separate, smaller PR to expose these new fields to the Java layer first. Once that compatibility PR is merged, I'll update this branch and ping here when it is ready for another look!
@jackye1995 @westonpace @majin1102 Hi, this PR is ready for review! PTAL when you have a moment. Thanks!
jackye1995
left a comment
mostly looks good to me, just a nit
westonpace
left a comment
Looks good from my perspective (just looking at rust changes)
As described in lance-format#4650, this PR implements a fragment-level column update operation in Rust and exposes it through the Java API. This allows for efficient, partial updates of a dataset without copying irrelevant fields.

The update mechanism follows these rules:

* **Update existing rows**: For any row identified by a given key in the target table, if a corresponding update value is present in the source data, the target row's value is replaced. This applies even if the new value is null.
* **Preserve untouched rows**: If a row in the target table has no corresponding update in the source data, it remains unchanged.

The image below illustrates this process:

<img width="536" height="289" alt="image" src="https://github.com/user-attachments/assets/d3629f88-e3f2-4f9d-b0e9-379d3dd482d1" />

**Important:** It is the responsibility of the calling engine (e.g., Spark, Flink) to ensure that the update data (the source) is correctly partitioned and routed to the corresponding fragments of the target table. The core operation assumes it receives the correct data for the fragment it is operating on.

---------

Co-authored-by: Weiren <litaiwei.lwt@antgroup.com>
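
The update rules and the routing requirement described above can be illustrated with a small, self-contained sketch. This is not the code from this PR and does not use the Lance API. It assumes, for illustration only, that each update is keyed by a 64-bit row address whose upper 32 bits are the fragment id and whose lower 32 bits are the row offset inside that fragment, and it models a column as a plain `Vec<Option<i64>>`. The names `split_row_address`, `route_to_fragments`, and `apply_column_update` are hypothetical.

```rust
use std::collections::BTreeMap;

/// One update for a single column: (row address, new value). `None` means null.
/// Keying updates by row address is an assumption made for this sketch.
type Update = (u64, Option<i64>);

/// Splits a row address into (fragment id, row offset).
/// Assumed layout: upper 32 bits = fragment id, lower 32 bits = row offset.
fn split_row_address(addr: u64) -> (u32, u32) {
    ((addr >> 32) as u32, addr as u32)
}

/// Engine side: group the source updates by the fragment they belong to,
/// so that each fragment only receives data for its own rows.
fn route_to_fragments(updates: &[Update]) -> BTreeMap<u32, Vec<(u32, Option<i64>)>> {
    let mut routed: BTreeMap<u32, Vec<(u32, Option<i64>)>> = BTreeMap::new();
    for &(addr, value) in updates {
        let (fragment_id, offset) = split_row_address(addr);
        routed.entry(fragment_id).or_default().push((offset, value));
    }
    routed
}

/// Fragment side: apply the routed updates to one column of one fragment.
/// Rows with an update are replaced, even when the new value is null.
/// Rows without an update are left unchanged.
fn apply_column_update(
    column: &mut [Option<i64>],
    updates: &[(u32, Option<i64>)],
) -> Result<(), String> {
    for &(offset, value) in updates {
        let slot = column
            .get_mut(offset as usize)
            .ok_or_else(|| format!("row offset {} is out of range", offset))?;
        *slot = value;
    }
    Ok(())
}
```

In this sketch, an engine would call `route_to_fragments` once on the source data and then call `apply_column_update` once per fragment. An update that carries `None` overwrites the existing value with null, which matches the first rule, while offsets that never appear in the routed list keep their original values, which matches the second.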
@wayneli-vt There is an error when updating a Struct field using FileFragment.update_columns. The details are in #5717. Could you please take a look?