
feat(blob_v2): add ref_id deduplication across Inline/Packed/Dedicated#6587

Open
DanielMao1 wants to merge 1 commit into lance-format:main from fecet:feat/ref-id-all-paths
Conversation

@DanielMao1

feat(blob_v2): shared blob storage via ref_id across all storage paths

Motivation

Lance Blob v2 provides four physical storage modes (Inline, Packed, Dedicated, External), all built on a "1 row = 1 blob" assumption. Every row owns an independent byte payload.

In real multimodal and time-series workloads this assumption breaks down. Many rows reference the same underlying object because the logical data streams live at different frequencies and need to be aligned into a single row-per-tick table:

| Domain | Low-frequency (shared across rows) | High-frequency (per-row) |
| --- | --- | --- |
| Video understanding | GOP / I-frame | per-frame label, bbox |
| Autonomous driving | LiDAR scan | object detection annotation |
| Retrieval training | query embedding | candidate document, relevance label |
| Multimodal training | image / video embedding | caption, QA pair |
| Reinforcement learning | observation | trajectory branch / action |
| Medical time series | imaging (MRI / CT volume) | vital signs, events |

Today the only options are:

| Approach | Storage | Query | Programming model |
| --- | --- | --- | --- |
| Upsample (repeat low-freq to match high-freq) | 10–1000× waste | fast | simple |
| Downsample | compact | fast | loses high-freq information |
| Two tables + JOIN | compact | slow (runtime reassembly) | breaks columnar scans |
| Nested list column | compact | medium | awkward indexing / filtering |
| Shared blob (this PR) | compact | fast | every row stays whole |

Shared blobs give us "physically low-frequency, logically high-frequency" — the application sees every row as a complete record while the storage layer keeps one copy per shared blob.

Why this must live in the format layer, not on top

We tried emulating sharing at the application layer (dedup at the writer, pass identical bytes, fix it up downstream). It fails for three reasons:

  1. Write cost is already paid by the time the preprocessor sees the row. For Dedicated, next_blob_id() assigns a fresh id per row, write_dedicated() opens a new sidecar, writer.write_all(data) pushes the same bytes again. No hook exists to say "reuse this blob_id."
  2. Coordinates are assigned inside Lance. For Inline, the out-of-line buffer position is only known inside BlobV2StructuralEncoder::maybe_encode. The app cannot hand-craft a descriptor pointing at a position it hasn't chosen.
  3. Readers cannot observe sharing after the fact. Without a persistent identifier in the descriptor, downstream tools (compaction, analytics, GC) have no signal that two rows were meant to be the same object.

So the primitive belongs in the column encoding.


Proposal: ref_id

A new optional ref_id: u32 column on the Blob v2 descriptor struct.

Contract: rows with the same positive ref_id share one physical blob. ref_id = 0 or null means no sharing (existing behavior).

from lance.blob import Blob, blob_array

# 8 frames all reference GOP #42
frames = blob_array([Blob(data=gop_bytes, ref_id=42) for _ in range(8)])

Why ref_id (absolute u32) instead of back_ref (relative index)

An earlier internal design doc proposed back_ref: i32 ("this row reuses row k of this batch"). We rejected it for three reasons:

  • Cross-size coverage. Inline dedup has to happen in the encoder (the only layer that knows the out-of-line buffer position), while Packed/Dedicated dedup has to happen in the preprocessor (the only layer that controls blob_id / write_packed / write_dedicated). A single absolute identifier bridges both layers; a relative batch-local index cannot (it cannot propagate into the encoder without breaking the "descriptor schema is frozen" constraint that made back_ref attractive in the first place).
  • Observability. A persisted ref_id column lets downstream jobs run SELECT ref_id, COUNT(*) GROUP BY ref_id to measure sharing, drive compaction hints, and track blob lineage. back_ref's sharing is invisible after write.
  • Scope. back_ref is by design single-batch; ref_id can naturally extend to single-write scope today and (future work) cross-write compaction-aware sharing.
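The observability point above can be made concrete with a plain-Python sketch of the `GROUP BY` over a materialized `ref_id` column (the values here are illustrative, and `Counter` stands in for the SQL aggregation):

```python
from collections import Counter

# Hypothetical ref_id values read back from the descriptor column;
# 0 means "no sharing" under the proposed contract.
ref_ids = [42, 42, 42, 0, 7, 7, 0]

# Equivalent of SELECT ref_id, COUNT(*) GROUP BY ref_id,
# ignoring the non-shared rows (ref_id == 0).
sharing = Counter(r for r in ref_ids if r > 0)

print(dict(sharing))  # {42: 3, 7: 2}
```

With `back_ref`, no such query is possible after write: the relative index is consumed at encode time and leaves no persistent trace.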

Not in scope

  • Content-hash auto-dedup. The user chooses when to share; we only provide the mechanism.
  • Cross-write sharing (single write call still bounds the dedup cache).
  • Read-side fetch coalescing by ref_id. Purely a performance optimization, can be layered on separately.

Design

Where dedup happens is decided by physics

| BlobKind | Who decides the blob's coordinates | Where dedup must live |
| --- | --- | --- |
| Inline (≤64 KB) | Encoder (`external_buffers.add_buffer()` returns position) | `BlobV2StructuralEncoder` |
| Packed (64 KB – 4 MB) | Preprocessor (`write_packed()` returns `(blob_id, position)`) | `BlobPreprocessor` |
| Dedicated (>4 MB) | Preprocessor (`next_blob_id()` + `write_dedicated()`) | `BlobPreprocessor` |

So the implementation places one cache per layer, each keyed by ref_id:

BlobPreprocessor {
    ref_id_sidecar_cache: HashMap<u32, SidecarRef>,   // Packed / Dedicated
}

BlobV2StructuralEncoder {
    ref_dedup_tmp_map: HashMap<u32, (u64, u64)>,       // Inline
}

enum SidecarRef {
    Dedicated { blob_id: u32, size: u64 },
    Packed    { blob_id: u32, position: u64, size: u64 },
}

Write-path flow

lance.write_dataset(batch)
    │
    ▼
BlobPreprocessor::preprocess_batch()              ◀── step 1
    for each row:
        if ref_id > 0 and ref_id_sidecar_cache.contains(ref_id):
            emit descriptor with cached (blob_id, position, size)   ← skip sidecar write
            continue
        normal Dedicated / Packed / Inline decision
        if ref_id > 0 and wrote a Dedicated or Packed blob:
            ref_id_sidecar_cache.insert(ref_id, SidecarRef { ... })
    │
    ▼  (descriptor struct flows down)
    │
BlobV2StructuralEncoder::maybe_encode()           ◀── step 2
    for each Inline row with ref_id > 0:
        if ref_dedup_tmp_map.contains(ref_id):
            reuse cached (position, size)                           ← skip add_buffer
        else:
            position = external_buffers.add_buffer(bytes)
            ref_dedup_tmp_map.insert(ref_id, (position, size))

Both caches live for a single write session. No persistent state beyond the descriptor column itself.
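The two-cache scheme above can be modeled in a few lines of Python. This is a sketch of the routing and dedup logic only (size thresholds follow the table above; `write_batch`, its row tuples, and the returned byte count are illustrative, not the Rust implementation):

```python
INLINE_MAX = 64 * 1024          # <= 64 KB  -> Inline
PACKED_MAX = 4 * 1024 * 1024    # <= 4 MB   -> Packed, else Dedicated

def write_batch(rows):
    """rows: list of (ref_id, payload_bytes). Returns bytes physically stored."""
    sidecar_cache = {}   # ref_id -> coordinates (Packed / Dedicated layer)
    inline_cache = {}    # ref_id -> (position, size)  (Inline / encoder layer)
    stored = 0
    inline_pos = 0
    for ref_id, data in rows:
        size = len(data)
        cache = inline_cache if size <= INLINE_MAX else sidecar_cache
        if ref_id > 0 and ref_id in cache:
            continue  # reuse cached coordinates; skip the physical write
        stored += size
        if size <= INLINE_MAX:
            coords = (inline_pos, size)
            inline_pos += size
        else:
            coords = ("packed" if size <= PACKED_MAX else "dedicated", size)
        if ref_id > 0:
            cache[ref_id] = coords  # first writer of this ref_id seeds the cache
    return stored

# 8 rows sharing ref_id 42 plus 2 unshared rows: only 3 payloads hit disk.
rows = [(42, b"x" * 1000)] * 8 + [(0, b"y" * 1000)] * 2
print(write_batch(rows))  # 3000
```

Because both caches are plain per-session maps, dropping them at the end of the write is what bounds sharing to a single write call.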

Read-path

Zero functional change. Rows with the same ref_id carry identical (position, size) or (blob_id, size) coordinates in their descriptors. Existing readers resolve each row the same way they always did — multiple rows simply happen to resolve to the same bytes. The schema version check is widened to accept 5 or 6 descriptor fields (the 6th being the optional ref_id), so files written by the new encoder remain readable under the existing take/scan code paths without further changes.

Schema compatibility

  • Descriptor struct: BLOB_V2_DESC_FIELDS gains a trailing ref_id: UInt32, nullable=true.
  • Old files: 5-field descriptors continue to work (the v2 schema validator accepts both shapes).
  • New files read by old binaries: the trailing nullable field is ignored by Arrow struct consumers that look up fields by name, which is what Lance's take/scan code does.
  • File format version: ≥ 2.2 (matches existing Blob v2 requirement).
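The widened validator can be sketched as a shape check over the descriptor's field list. The field names below are hypothetical stand-ins; the real ones are defined by `BLOB_V2_DESC_FIELDS` in rust/lance-core/src/datatypes.rs:

```python
# Hypothetical names for the existing 5 descriptor fields.
OLD_FIELDS = ["position", "size", "blob_id", "kind", "flags"]
# New files append a trailing nullable ref_id.
NEW_FIELDS = OLD_FIELDS + ["ref_id"]

def is_valid_v2_descriptor(fields):
    # Widened check: 5 fields (old files) or 6 fields ending in ref_id.
    if len(fields) == 5:
        return True
    return len(fields) == 6 and fields[-1] == "ref_id"

print(is_valid_v2_descriptor(OLD_FIELDS), is_valid_v2_descriptor(NEW_FIELDS))  # True True
```

Name-based field lookup on the read side is what makes the trailing addition invisible to old binaries.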

Implementation

Five files, ~200 lines of core logic.

| File | Lines | Role |
| --- | --- | --- |
| rust/lance-core/src/datatypes.rs | +1 | Add `ref_id` field to `BLOB_V2_DESC_FIELDS` |
| rust/lance-encoding/src/encodings/logical/blob.rs | +64 / −9 | Encoder cache + Inline dedup + `ref_id` column output |
| rust/lance/src/dataset/blob.rs | +95 / −4 | Preprocessor cache + Packed/Dedicated dedup + schema check widened |
| rust/lance-encoding/src/decoder.rs | +17 / −1 | Recognize `lance.blob.v2` extension type by name (ExtensionType metadata does not always survive Lance's schema serialization) |
| python/python/lance/blob.py | +14 / −5 | `Blob` dataclass + `BlobType` storage schema + `BlobArray.from_pylist` plumb `ref_id` |

Total: 6 files, +288 / −19 including a demo test script.


Verification

python/test_ref_id_dedup.py writes 20 rows sharing one ref_id at three size classes and measures actual on-disk bytes:

=== inline_32kb (payload=32,768 B, ref_id=101, rows=20) ===
  main .lance:          34,558 B
  sidecar total:             0 B (0 files)
  naive (20 copies):   655,360 B
  amplification:           1.05×   ✓ DEDUP

=== packed_1mb (payload=1,048,576 B, ref_id=102, rows=20) ===
  main .lance:           1,784 B
  sidecar total:     1,048,576 B (1 file)
  naive (20 copies): 20,971,520 B
  amplification:           1.00×   ✓ DEDUP

=== dedicated_6mb (payload=6,291,456 B, ref_id=103, rows=20) ===
  main .lance:           1,784 B
  sidecar total:     6,291,456 B (1 file)
  naive (20 copies): 125,829,120 B
  amplification:           1.00×   ✓ DEDUP

Before this PR, dedicated_6mb produces 20 sidecar .blob files totaling 120 MB. After, it produces 1 sidecar file of 6 MB. Read-back under take_blobs returns byte-identical payloads to the source for all 20 rows on every path.
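The amplification column above is total on-disk bytes relative to one logical copy of the shared payload. A small sketch reproducing two of the reported figures (the helper name is illustrative):

```python
def amplification(main_bytes, sidecar_bytes, payload_bytes):
    """On-disk bytes relative to a single logical copy of the shared blob."""
    return (main_bytes + sidecar_bytes) / payload_bytes

# inline_32kb: all 20 rows dedup into the main .lance file.
print(round(amplification(34_558, 0, 32_768), 2))            # 1.05
# packed_1mb: the sidecar holds the single shared copy.
print(round(amplification(1_784, 1_048_576, 1_048_576), 2))  # 1.0
```

An amplification near 1.0× is exactly the "1× storage instead of N×" claim: the only overhead left is descriptor metadata.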


Built on

Independent of

  • Tracking issues for Blob V2 #4947's follow-up items (public API, compaction, ingest) — this PR is a format-level primitive that can land before those designs are finalized.

Forward compatibility with compaction

When compaction merges multiple data files, sharing may need to cross data-file boundaries. The contract proposed here supports that extension naturally — ref_id is already a persistent column, so compaction can inspect it and decide whether to preserve sharing (rebuild a merged cache) or fall back to independent rows (acceptable degradation). No schema or API change required.


Rows sharing the same positive `ref_id` now reuse a single stored copy of
the blob bytes, regardless of which physical storage path the size class
routes to. This turns the common "N rows referencing one GOP / embedding /
document" pattern into 1x storage instead of Nx.

- Inline: `BlobV2StructuralEncoder` caches `(position, size)` per ref_id.
  Later rows skip `external_buffers.add_buffer` and reuse the cached offset.
- Packed / Dedicated: `BlobPreprocessor` caches a `SidecarRef` per ref_id.
  Later rows skip `write_packed` / `write_dedicated` and reuse the cached
  `(blob_id, position, size)`.

The descriptor schema gains an optional trailing `ref_id: UInt32` column,
and the v2 schema check accepts 5 or 6 fields so readers stay compatible.
Python `Blob` gains a matching `ref_id` and `BlobArray.from_pylist` plumbs
it through.

Verified at 32 KB (Inline), 1 MB (Packed), and 6 MB (Dedicated): 20 rows
sharing one ref_id store 1x the bytes, not 20x. Dedicated went from 20
sidecar files down to 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


github-actions bot added the enhancement and python labels on Apr 21, 2026
