feat(blob_v2): add ref_id deduplication across Inline/Packed/Dedicated#6587
Open
DanielMao1 wants to merge 1 commit into `lance-format:main` from
Rows sharing the same positive `ref_id` now reuse a single stored copy of the blob bytes, regardless of which physical storage path the size class routes to. This turns the common "N rows referencing one GOP / embedding / document" pattern into 1x storage instead of Nx.

- Inline: `BlobV2StructuralEncoder` caches `(position, size)` per ref_id. Later rows skip `external_buffers.add_buffer` and reuse the cached offset.
- Packed / Dedicated: `BlobPreprocessor` caches a `SidecarRef` per ref_id. Later rows skip `write_packed` / `write_dedicated` and reuse the cached `(blob_id, position, size)`.

The descriptor schema gains an optional trailing `ref_id: UInt32` column, and the v2 schema check accepts 5 or 6 fields so readers stay compatible. Python `Blob` gains a matching `ref_id` and `BlobArray.from_pylist` plumbs it through.

Verified at 32 KB (Inline), 1 MB (Packed), and 6 MB (Dedicated): 20 rows sharing one ref_id store 1x the bytes, not 20x. Dedicated went from 20 sidecar files down to 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
feat(blob_v2): shared blob storage via ref_id across all storage paths
Motivation
Lance Blob v2 provides four physical storage modes (Inline, Packed, Dedicated, External), all built on a "1 row = 1 blob" assumption. Every row owns an independent byte payload.
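As a rough illustration of how a size class routes a blob to one of the physical modes, here is a minimal sketch. The threshold constants are placeholders invented for this example (the real cutoffs live in the Blob v2 writer and are not stated in this PR); they are chosen only to be consistent with the 32 KB / 1 MB / 6 MB test points mentioned under Verification.

```rust
/// Hypothetical size-class router. `INLINE_MAX` and `PACKED_MAX` are
/// placeholder cutoffs, not Lance's real configuration.
#[derive(Debug, PartialEq)]
enum StorageClass {
    Inline,    // bytes stored in the column's own buffers
    Packed,    // bytes packed with other blobs into a shared sidecar
    Dedicated, // one sidecar file per blob
}

fn route_by_size(len: u64) -> StorageClass {
    const INLINE_MAX: u64 = 64 * 1024;       // placeholder cutoff
    const PACKED_MAX: u64 = 4 * 1024 * 1024; // placeholder cutoff
    if len <= INLINE_MAX {
        StorageClass::Inline
    } else if len <= PACKED_MAX {
        StorageClass::Packed
    } else {
        StorageClass::Dedicated
    }
}
```

With these placeholder cutoffs, the three verification sizes land in the three classes the PR exercises.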
In real multimodal and time-series workloads this assumption breaks down. Many rows reference the same underlying object because the logical data streams live at different frequencies and need to be aligned into a single row-per-tick table: think N frame rows referencing one video GOP, or N chunk rows referencing one document.
Today the only workaround is to duplicate the shared bytes into every row, paying Nx storage for one logical copy.
Shared blobs give us "physically low-frequency, logically high-frequency" — the application sees every row as a complete record while the storage layer keeps one copy per shared blob.
Why this must live in the format layer, not on top
We tried emulating sharing at the application layer (dedup at the writer, pass identical bytes, fix it up downstream). It fails:

- There is no reuse hook in the write path: `next_blob_id()` assigns a fresh id per row, `write_dedicated()` opens a new sidecar, and `writer.write_all(data)` pushes the same bytes again. No hook exists to say "reuse this blob_id."
- Descriptors are produced inside `BlobV2StructuralEncoder::maybe_encode`. The app cannot hand-craft a descriptor pointing at a position it hasn't chosen.

So the primitive belongs in the column encoding.
Proposal: `ref_id`

A new optional `ref_id: u32` column on the Blob v2 descriptor struct.

Contract: rows with the same positive `ref_id` share one physical blob. `ref_id = 0` or null means no sharing (existing behavior).

Why `ref_id` (absolute u32) instead of `back_ref` (relative index)

An earlier internal design doc proposed `back_ref: i32` ("this row reuses row k of this batch"). We rejected it for three reasons:

- Dedup happens at two layers with different allocation primitives (the encoder's buffer positions; the preprocessor's `blob_id` / `write_packed` / `write_dedicated`). A single absolute identifier bridges both layers; a relative batch-local index cannot (it cannot propagate into the encoder without breaking the "descriptor schema is frozen" constraint that made `back_ref` attractive in the first place).
- A persisted `ref_id` column lets downstream jobs run `SELECT ref_id, COUNT(*) GROUP BY ref_id` to measure sharing, drive compaction hints, and track blob lineage. `back_ref`'s sharing is invisible after write.
- `back_ref` is by design single-batch; `ref_id` can naturally extend to single-write scope today and (future work) cross-write compaction-aware sharing.

Not in scope

- Cross-write sharing (each `write` call still bounds the dedup cache).
- … `ref_id`: purely a performance optimization, can be layered on separately.

Design
Where dedup happens is decided by physics
| Path | Allocation primitive | Component |
| --- | --- | --- |
| Inline | `external_buffers.add_buffer()` returns `position` | `BlobV2StructuralEncoder` |
| Packed | `write_packed()` returns `(blob_id, position)` | `BlobPreprocessor` |
| Dedicated | `next_blob_id()` + `write_dedicated()` | `BlobPreprocessor` |

So the implementation places one cache per layer, both keyed by `ref_id`.

Write-path flow
Both caches live for a single write session. No persistent state beyond the descriptor column itself.
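The per-session cache behavior can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `DedupCache` type and the exact fields of `SidecarRef` are assumptions for the sketch.

```rust
use std::collections::HashMap;

/// Coordinates of an already-written blob (hypothetical shape; the
/// PR's real `SidecarRef` may carry different fields).
#[derive(Clone, Copy, Debug, PartialEq)]
struct SidecarRef {
    blob_id: u64,
    position: u64,
    size: u64,
}

/// One write-session dedup cache, keyed by `ref_id`.
/// `ref_id == 0` means "no sharing" and is never cached.
#[derive(Default)]
struct DedupCache {
    by_ref_id: HashMap<u32, SidecarRef>,
}

impl DedupCache {
    /// Return cached coordinates for `ref_id`, or run `write_fn` to
    /// physically write the blob and cache the result for later rows.
    fn get_or_write<F>(&mut self, ref_id: u32, write_fn: F) -> SidecarRef
    where
        F: FnOnce() -> SidecarRef,
    {
        if ref_id == 0 {
            return write_fn(); // unshared row: always write
        }
        if let Some(cached) = self.by_ref_id.get(&ref_id) {
            return *cached; // later row of a shared group: reuse
        }
        let written = write_fn(); // first row of the group pays the write
        self.by_ref_id.insert(ref_id, written);
        written
    }
}
```

The same shape works for both layers; only the payload differs (`position` for Inline, `(blob_id, position)` for Packed/Dedicated). Because the cache is owned by the write session, dropping it at the end of the write is what bounds sharing to a single write call.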
Read-path
Zero functional change. Rows with the same `ref_id` carry identical `(position, size)` or `(blob_id, size)` coordinates in their descriptors. Existing readers resolve each row the same way they always did: multiple rows simply happen to resolve to the same bytes. The schema version check is widened to accept 5 or 6 descriptor fields (the 6th being the optional `ref_id`), so files written by the new encoder remain readable under the existing take/scan code paths without further changes.

Schema compatibility

`BLOB_V2_DESC_FIELDS` gains a trailing `ref_id: UInt32`, nullable=true.

Implementation
Five files, ~200 lines of core logic.
- `rust/lance-core/src/datatypes.rs`: add the `ref_id` field to `BLOB_V2_DESC_FIELDS`
- `rust/lance-encoding/src/encodings/logical/blob.rs`: `ref_id` column output
- `rust/lance/src/dataset/blob.rs`
- `rust/lance-encoding/src/decoder.rs`: recognize the `lance.blob.v2` extension type by name (ExtensionType metadata does not always survive Lance's schema serialization)
- `python/python/lance/blob.py`: `Blob` dataclass + `BlobType` storage schema + `BlobArray.from_pylist` plumb `ref_id`

Total: 6 files, +288 / −19, including a demo test script.
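The widened "5 or 6 descriptor fields" check can be sketched as below. This is a simplified stand-in: the real validator presumably also checks the names and types of the first five fields, which are not reproduced here (placeholder names are used in the test).

```rust
/// Hypothetical simplified validity check for a Blob v2 descriptor
/// struct: the original 5-field layout is accepted as-is, and a 6th
/// field is accepted only if it is the optional trailing `ref_id`.
/// Older files (5 fields) and newer files (6 fields) both pass.
fn is_valid_blob_v2_descriptor(field_names: &[&str]) -> bool {
    match field_names.len() {
        5 => true,                       // pre-ref_id layout
        6 => field_names[5] == "ref_id", // new optional trailing column
        _ => false,
    }
}
```

Accepting both widths is what keeps old readers and old files working: the new column is strictly additive and trailing, so no existing field moves.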
Verification
`python/test_ref_id_dedup.py` writes 20 rows sharing one `ref_id` at three size classes and measures actual on-disk bytes.

Before this PR, `dedicated_6mb` produces 20 sidecar `.blob` files totaling 120 MB. After, it produces 1 sidecar file of 6 MB. Read-back under `take_blobs` returns byte-identical payloads to the source for all 20 rows on every path.

- Built on the existing `.lance` lifecycle (Packed/Dedicated sidecars already follow feat(blob_v2): add GC support #5473's data-file-bound GC, which remains correct because all shared references stay inside one data file).
- Independent of …
Forward compatibility with compaction
When compaction merges multiple data files, sharing may need to cross data-file boundaries. The contract proposed here supports that extension naturally: `ref_id` is already a persistent column, so compaction can inspect it and decide whether to preserve sharing (rebuild a merged cache) or fall back to independent rows (acceptable degradation). No schema or API change required.
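To make the "rebuild a merged cache" idea concrete, here is a purely illustrative sketch (no such code exists in this PR) of how a compaction pass could group descriptors by `ref_id` and compute the storage of the merged output, keeping one physical copy per shared group:

```rust
use std::collections::HashMap;

/// A row's descriptor as seen by a hypothetical compaction pass
/// (illustrative subset of the real descriptor fields).
#[derive(Clone, Copy)]
struct RowDesc {
    ref_id: u32, // 0 = no sharing
    size: u64,
}

/// Group rows by positive ref_id and charge one physical copy per
/// group; unshared rows (ref_id == 0) each keep their own copy.
/// Returns the total bytes the merged output would store.
fn merged_output_bytes(rows: &[RowDesc]) -> u64 {
    let mut seen: HashMap<u32, u64> = HashMap::new();
    let mut total = 0u64;
    for row in rows {
        if row.ref_id == 0 {
            total += row.size; // unshared: always stored
        } else if seen.insert(row.ref_id, row.size).is_none() {
            total += row.size; // first row of a shared group pays
        } // later rows of the group are free
    }
    total
}
```

Because `ref_id` is persisted, this grouping works across the input files being merged without any extra metadata; falling back to independent rows is simply treating every row as if `ref_id == 0`.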