
feat(blob_v2): add dedicated blob support#5406

Merged
Xuanwo merged 19 commits into main from xuanwo/blobv2-dedicated
Dec 5, 2025

Conversation

@Xuanwo
Collaborator

@Xuanwo Xuanwo commented Dec 4, 2025

This PR adds dedicated blob support to Lance.


Parts of this PR were drafted with assistance from Codex (with gpt-5.1-codex-max) and fully reviewed and edited by me. I take full responsibility for all changes.

Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
@github-actions github-actions Bot added the enhancement New feature or request label Dec 4, 2025
@github-actions
Contributor

github-actions Bot commented Dec 4, 2025

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
@Xuanwo Xuanwo changed the title feat(blob_v2): add dedicated blob support feat(blob_v2): add dedicated blob support Dec 4, 2025
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
@codecov
codecov Bot commented Dec 4, 2025

Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Comment thread rust/lance-core/src/utils/blob.rs Outdated
Comment thread rust/lance/src/dataset/blob.rs
@Xuanwo Xuanwo marked this pull request as ready for review December 4, 2025 16:11
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Comment thread rust/lance/src/dataset/write.rs Outdated
@Xuanwo Xuanwo marked this pull request as draft December 5, 2025 02:56
Signed-off-by: Xuanwo <github@xuanwo.io>
@Xuanwo Xuanwo marked this pull request as ready for review December 5, 2025 05:59
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Comment thread rust/lance/src/dataset/blob.rs
Signed-off-by: Xuanwo <github@xuanwo.io>
Member

@westonpace westonpace left a comment

I love the way this is shaping up. Great work!

Comment thread rust/lance/src/dataset/blob.rs

const DEDICATED_THRESHOLD: usize = 4 * 1024 * 1024;

pub struct BlobPreprocessor {
Member

Document what this is?
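
As a starting point for that documentation, here is a minimal sketch of what the constant appears to control. Only the constant's name and value come from the diff above. The `BlobPlacement` enum, the `classify_blob` function, and the choice of a strict `>` comparison are assumptions made for illustration, not the PR's actual logic.

```rust
/// Payloads larger than this many bytes (4 MiB) are assumed to be written to
/// their own dedicated file instead of being stored with other blobs.
pub const DEDICATED_THRESHOLD: usize = 4 * 1024 * 1024;

/// Hypothetical placement decision for a single blob value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BlobPlacement {
    /// The blob stays in shared storage (whatever the non-dedicated path is).
    Shared,
    /// The blob gets a file of its own.
    Dedicated,
}

/// Classifies a blob by its byte length.
/// Assumption: the boundary is exclusive, so a blob of exactly 4 MiB stays shared.
pub fn classify_blob(len: usize) -> BlobPlacement {
    if len > DEDICATED_THRESHOLD {
        BlobPlacement::Dedicated
    } else {
        BlobPlacement::Shared
    }
}
```

Whether the real boundary is inclusive or exclusive is exactly the kind of detail a doc comment on the constant could pin down.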

Comment thread rust/lance/src/dataset/blob.rs Outdated

fn next_blob_id(&mut self) -> u32 {
let id = self.local_counter;
self.local_counter = self.local_counter.wrapping_add(1);
Member

Instead of wrapping_add maybe checked_add and return an error? Also, why not use u64 for counter? Unlikely we will ever hit u32 limit but it is just barely conceivable and better to just not worry about it.

Collaborator Author

In our current design, the blob ID is unique per data file (fragment + field), so it shares the same upper limit as the number of rows in a data file: u32::MAX. It should be safe to use u32 here.

I will change this to just += 1.
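
For comparison, the `checked_add` variant suggested in the review could look roughly like the sketch below. `BlobIdAllocator`, `BlobIdOverflow`, and `starting_at` are hypothetical names introduced for illustration. The PR itself keeps a `u32` counter and, per the reply above, a plain increment.

```rust
/// Hypothetical error returned once every u32 blob ID has been handed out.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BlobIdOverflow;

/// Hypothetical per-data-file blob ID allocator (not Lance's actual type).
#[derive(Debug)]
pub struct BlobIdAllocator {
    /// The next ID to hand out, or `None` once the u32 space is exhausted.
    next: Option<u32>,
}

impl BlobIdAllocator {
    /// Starts allocating at 0.
    pub fn new() -> Self {
        Self::starting_at(0)
    }

    /// Starts allocating at an arbitrary ID (handy for exercising the overflow path).
    pub fn starting_at(first: u32) -> Self {
        Self { next: Some(first) }
    }

    /// Returns the next ID, or an error instead of silently wrapping back to 0.
    pub fn next_blob_id(&mut self) -> Result<u32, BlobIdOverflow> {
        let id = self.next.ok_or(BlobIdOverflow)?;
        // `checked_add` yields `None` at u32::MAX, which the next call reports as an error.
        self.next = id.checked_add(1);
        Ok(id)
    }
}
```

A plain `+= 1` gives similar protection in debug builds, where integer overflow panics, but wraps silently in release builds unless overflow checks are enabled.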

///
/// Layout: `<base>/<data_file_key>/<blob_id>.raw`
/// - `base` is typically the dataset's data directory.
/// - `data_file_key` is the stem of the data file (without extension).
Member

Interesting. I hadn't expected path to include data_file_key but I don't see any harm in it and it could maybe be useful for things like cleanup?

Collaborator Author

Yes! I picked this design to make GC much easier: once we decide to remove a data file, we can just remove all blob files under its data_file_key.
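
To make the layout and the cleanup argument concrete, here is a small sketch using `std::path`. The function names `blob_dir` and `blob_path` are hypothetical, the decimal formatting of the blob ID is an assumption, and the real code most likely works with object-store paths rather than `std::path`.

```rust
use std::path::{Path, PathBuf};

/// Directory holding every dedicated blob of one data file:
/// `<base>/<data_file_key>`, where the key is the file stem (no extension).
/// Returns `None` if the data file name has no stem.
pub fn blob_dir(base: &Path, data_file_name: &str) -> Option<PathBuf> {
    let key = Path::new(data_file_name).file_stem()?;
    Some(base.join(key))
}

/// Full path of one blob: `<base>/<data_file_key>/<blob_id>.raw`.
/// Assumption: the blob ID is rendered as a plain decimal number.
pub fn blob_path(base: &Path, data_file_name: &str, blob_id: u32) -> Option<PathBuf> {
    Some(blob_dir(base, data_file_name)?.join(format!("{blob_id}.raw")))
}
```

Because every blob of a data file lives under the directory returned by `blob_dir`, garbage collection for that data file reduces to deleting a single prefix.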

Comment thread rust/lance/src/dataset/blob.rs Outdated
Ok(path)
}

pub(crate) async fn preprocess_batch(&mut self, batch: &RecordBatch) -> Result<RecordBatch> {
Member

This requires blobs to be top-level fields. For example, a blob cannot be nested inside a Struct or List. I think this is fine for now but we should make sure we eventually document this somewhere.

Collaborator Author

Oh yeah, nice catch. I have to admit I didn't give this enough consideration and didn't expect this limitation. I'll create a follow-up.

Comment on lines +188 to +194
new_fields.push(Arc::new(
arrow_schema::Field::new(
field.name(),
ArrowDataType::Struct(child_fields.into()),
field.is_nullable(),
)
.with_metadata(field.metadata().clone()),
Member

Do we need to mark the field as packed?

Collaborator Author

No, this record batch only exists in memory and we won’t really encode it.

Comment thread rust/lance-encoding/src/encodings/logical/blob.rs
Comment thread rust/lance-encoding/src/encodings/logical/blob.rs
Signed-off-by: Xuanwo <github@xuanwo.io>
@Xuanwo Xuanwo mentioned this pull request Dec 5, 2025
9 tasks
@Xuanwo Xuanwo merged commit d452750 into main Dec 5, 2025
25 checks passed
@Xuanwo Xuanwo deleted the xuanwo/blobv2-dedicated branch December 5, 2025 16:46
Xuanwo added a commit that referenced this pull request Dec 9, 2025
This PR will add packed blob support for lance.

This PR is based on #5406,
please review that first.

---

**Parts of this PR were drafted with assistance from Codex (with
`gpt-5.1-codex-max`) and fully reviewed and edited by me. I take full
responsibility for all changes.**

---------

Signed-off-by: Xuanwo <github@xuanwo.io>
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
This PR will add dedicated blob support in lance.

---

**Parts of this PR were drafted with assistance from Codex (with
`gpt-5.1-codex-max`) and fully reviewed and edited by me. I take full
responsibility for all changes.**

---------

Signed-off-by: Xuanwo <github@xuanwo.io>
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
This PR will add packed blob support for lance.

This PR is based on lance-format#5406,
please review that first.

---

**Parts of this PR were drafted with assistance from Codex (with
`gpt-5.1-codex-max`) and fully reviewed and edited by me. I take full
responsibility for all changes.**

---------

Signed-off-by: Xuanwo <github@xuanwo.io>