feat(blob_v2): add dedicated blob support#5406
Conversation
|
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Signed-off-by: Xuanwo <github@xuanwo.io>
Signed-off-by: Xuanwo <github@xuanwo.io>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: Xuanwo <github@xuanwo.io>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: Xuanwo <github@xuanwo.io>
westonpace
left a comment
There was a problem hiding this comment.
I love the way this is shaping up. Great work!
|
|
||
| const DEDICATED_THRESHOLD: usize = 4 * 1024 * 1024; | ||
|
|
||
| pub struct BlobPreprocessor { |
|
|
||
| fn next_blob_id(&mut self) -> u32 { | ||
| let id = self.local_counter; | ||
| self.local_counter = self.local_counter.wrapping_add(1); |
There was a problem hiding this comment.
Instead of wrapping_add maybe checked_add and return an error? Also, why not use u64 for counter? Unlikely we will ever hit u32 limit but it is just barely conceivable and better to just not worry about it.
There was a problem hiding this comment.
In our current design, the blob ID is unique per data file (fragement + field), so it shares the same upper limit as the data file rows: u32::MAX. It should be safe to use u32 here.
I will change this place to just +=1.
| /// | ||
| /// Layout: `<base>/<data_file_key>/<blob_id>.raw` | ||
| /// - `base` is typically the dataset's data directory. | ||
| /// - `data_file_key` is the stem of the data file (without extension). |
There was a problem hiding this comment.
Interesting. I hadn't expected path to include data_file_key but I don't see any harm in it and it could maybe be useful for things like cleanup?
There was a problem hiding this comment.
Yes! I picked up this design to make the GC much easier: we can just remove all blob files under the same data_file_key once we decide to remove that data file.
| Ok(path) | ||
| } | ||
|
|
||
| pub(crate) async fn preprocess_batch(&mut self, batch: &RecordBatch) -> Result<RecordBatch> { |
There was a problem hiding this comment.
This requires blobs to be top-level fields. E.g. we cannot have Struct or List. I think this is fine for now but we should make sure we eventually document this somewhere.
There was a problem hiding this comment.
Oh yeah, nice catch. I have to admit I didn’t take enough consideration into this and I actually didn’t expect this. I’ll create a follow up.
| new_fields.push(Arc::new( | ||
| arrow_schema::Field::new( | ||
| field.name(), | ||
| ArrowDataType::Struct(child_fields.into()), | ||
| field.is_nullable(), | ||
| ) | ||
| .with_metadata(field.metadata().clone()), |
There was a problem hiding this comment.
Do we need to mark the field as packed?
There was a problem hiding this comment.
No, this record batch only exists in memory and we won’t really encode it.
Signed-off-by: Xuanwo <github@xuanwo.io>
This PR will add packed blob support for lance. This PR is based on #5406, please review that first. --- **Parts of this PR were drafted with assistance from Codex (with `gpt-5.1-codex-max`) and fully reviewed and edited by me. I take full responsibility for all changes.** --------- Signed-off-by: Xuanwo <github@xuanwo.io>
This PR will add dedicated blob support in lance. --- **Parts of this PR were drafted with assistance from Codex (with `gpt-5.1-codex-max`) and fully reviewed and edited by me. I take full responsibility for all changes.** --------- Signed-off-by: Xuanwo <github@xuanwo.io>
This PR will add packed blob support for lance. This PR is based on lance-format#5406, please review that first. --- **Parts of this PR were drafted with assistance from Codex (with `gpt-5.1-codex-max`) and fully reviewed and edited by me. I take full responsibility for all changes.** --------- Signed-off-by: Xuanwo <github@xuanwo.io>
This PR will add dedicated blob support in lance.
Parts of this PR were drafted with assistance from Codex (with
gpt-5.1-codex-max) and fully reviewed and edited by me. I take full responsibility for all changes.