Skip to content

feat(blob_v2): add external blob support#5385

Merged
Xuanwo merged 13 commits intomainfrom
Xuanwo/external-blob-review
Dec 3, 2025
Merged

feat(blob_v2): add external blob support#5385
Xuanwo merged 13 commits intomainfrom
Xuanwo/external-blob-review

Conversation

@Xuanwo
Copy link
Copy Markdown
Collaborator

@Xuanwo Xuanwo commented Dec 2, 2025

This PR will add external blob support: users can now write an external blob in lance and read it back.

In this PR, I changed the blob struct's fields to be nullable: false because our packed struct doesn't allow nullable fields.


Parts of this PR were drafted with assistance from Codex (with gpt-5.1-codex-max) and fully reviewed and edited by me. I take full responsibility for all changes.

Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@github-actions github-actions Bot added the enhancement New feature or request label Dec 2, 2025
…review

Signed-off-by: Xuanwo <github@xuanwo.io>
@codecov
Copy link
Copy Markdown

codecov Bot commented Dec 2, 2025

Codecov Report

❌ Patch coverage is 66.99029% with 68 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/blob.rs 25.00% 44 Missing and 1 partial ⚠️
rust/lance-encoding/src/encodings/logical/blob.rs 85.88% 8 Missing and 4 partials ⚠️
rust/lance-core/src/datatypes.rs 35.29% 11 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial questions

Comment thread rust/lance-encoding/src/encodings/logical/blob.rs Outdated
Comment thread rust/lance-core/src/datatypes/field.rs Outdated
Comment thread rust/lance-encoding/src/encodings/logical/blob.rs Outdated
Comment on lines -330 to -340
for i in 0..binary_array.len() {
let is_null_row = match array.data_type() {
DataType::Struct(_) => array.is_null(i),
_ => binary_array.is_null(i),
};
if is_null_row {
kind_builder.append_null();
position_builder.append_null();
size_builder.append_null();
blob_id_builder.append_null();
uri_builder.append_null();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these changes forwards / backwards compatible?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe so. No blob v2 files have been written so far. It's part of file format 2.2

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I had forgotten we added a dedicated encoded / decoder for v2.

let data_is_set = !data_col.is_null(i);
if data_is_set {
let value = binary_array.value(i);
kind_builder.append_value(0);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use constants for kinds? or an enum that converts to/from an integer?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me, will update.

uri_builder.append_value("");
} else {
// external uri
let uri = uri_col.value(i);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check if the URI is empty?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, we can check before writing.

kind_builder.append_value(0);
position_builder.append_value(0);
size_builder.append_value(0);
blob_id_builder.append_value(0);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like blob_id is always set to zero. Can you remind me what this is for again?

Copy link
Copy Markdown
Collaborator Author

@Xuanwo Xuanwo Dec 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blob_id == 0 means it's empty (the valid value starts from 1).

And blob_id reprents the id used by packed & dedicated blobs which are both not implemented yet.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the !data_is_set branch then? I thought an inline blob was "data column is not null and uri is null" and a dedicated / packed blob is "data column is null and uri is not null"?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see the enum now. So the !data_is_set approach is external.

Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
@Xuanwo Xuanwo merged commit 14087e7 into main Dec 3, 2025
26 checks passed
@Xuanwo Xuanwo deleted the Xuanwo/external-blob-review branch December 3, 2025 17:28
@Xuanwo Xuanwo mentioned this pull request Dec 5, 2025
9 tasks
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
This PR will add external blob support: users can now write an external
blob in lance and read it back.

In this PR, I changed the blob struct's fields to be `nullable: false`
because our packed struct doesn't allow nullable fields.

---

**Parts of this PR were drafted with assistance from Codex (with
`gpt-5.1-codex-max`) and fully reviewed and edited by me. I take full
responsibility for all changes.**

---------

Signed-off-by: Xuanwo <github@xuanwo.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants