
feat(blob_v2): add ref_id deduplication across Inline/Packed/Dedicated#6587

Open
DanielMao1 wants to merge 1 commit into lance-format:main from fecet:feat/ref-id-all-paths
Conversation

@DanielMao1

feat(blob_v2): shared blob storage via ref_id across all storage paths

Motivation

Lance Blob v2 provides four physical storage modes (Inline, Packed, Dedicated, External), all built on a "1 row = 1 blob" assumption. Every row owns an independent byte payload.

In real multimodal and time-series workloads this assumption breaks down. Many rows reference the same underlying object because the logical data streams live at different frequencies and need to be aligned into a single row-per-tick table:

| Domain | Low-frequency (shared across rows) | High-frequency (per-row) |
| --- | --- | --- |
| Video understanding | GOP / I-frame | per-frame label, bbox |
| Autonomous driving | LiDAR scan | object detection annotation |
| Retrieval training | query embedding | candidate document, relevance label |
| Multimodal training | image / video embedding | caption, QA pair |
| Reinforcement learning | observation | trajectory branch / action |
| Medical time series | imaging (MRI / CT volume) | vital signs, events |

Today the only options are:

| Approach | Storage | Query | Programming model |
| --- | --- | --- | --- |
| Upsample (repeat low-freq to match high-freq) | 10–1000× waste | fast | simple |
| Downsample | compact | fast | loses high-freq information |
| Two tables + JOIN | compact | slow (runtime reassembly) | breaks columnar scans |
| Nested list column | compact | medium | awkward indexing / filtering |
| Shared blob (this PR) | compact | fast | every row stays whole |

Shared blobs give us "physically low-frequency, logically high-frequency" — the application sees every row as a complete record while the storage layer keeps one copy per shared blob.

Why this must live in the format layer, not on top

We tried emulating sharing at the application layer (dedup at the writer, pass identical bytes, fix it up downstream). It fails for three reasons:

  1. Write cost is already paid by the time the preprocessor sees the row. For Dedicated, next_blob_id() assigns a fresh id per row, write_dedicated() opens a new sidecar, writer.write_all(data) pushes the same bytes again. No hook exists to say "reuse this blob_id."
  2. Coordinates are assigned inside Lance. For Inline, the out-of-line buffer position is only known inside BlobV2StructuralEncoder::maybe_encode. The app cannot hand-craft a descriptor pointing at a position it hasn't chosen.
  3. Readers cannot observe sharing after the fact. Without a persistent identifier in the descriptor, downstream tools (compaction, analytics, GC) have no signal that two rows were meant to be the same object.

So the primitive belongs in the column encoding.


Proposal: ref_id

A new optional ref_id: u32 column on the Blob v2 descriptor struct.

Contract: rows with the same positive ref_id share one physical blob. ref_id = 0 or null means no sharing (existing behavior).

from lance.blob import Blob, blob_array

# 8 frames all reference GOP #42
frames = blob_array([Blob(data=gop_bytes, ref_id=42) for _ in range(8)])

Why ref_id (absolute u32) instead of back_ref (relative index)

An earlier internal design doc proposed back_ref: i32 ("this row reuses row k of this batch"). We rejected it for three reasons:

  • Cross-size coverage. Inline dedup has to happen in the encoder (the only layer that knows the out-of-line buffer position), while Packed/Dedicated dedup has to happen in the preprocessor (the only layer that controls blob_id / write_packed / write_dedicated). A single absolute identifier bridges both layers; a relative batch-local index cannot (it cannot propagate into the encoder without breaking the "descriptor schema is frozen" constraint that made back_ref attractive in the first place).
  • Observability. A persisted ref_id column lets downstream jobs run SELECT ref_id, COUNT(*) GROUP BY ref_id to measure sharing, drive compaction hints, and track blob lineage. back_ref's sharing is invisible after write.
  • Scope. back_ref is by design single-batch; ref_id can naturally extend to single-write scope today and (future work) cross-write compaction-aware sharing.
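The observability point above can be made concrete with a plain-Python sketch of the `GROUP BY` over a materialized `ref_id` column (the values here are illustrative, and `Counter` stands in for the SQL aggregation):

```python
from collections import Counter

# Hypothetical ref_id values read back from the descriptor column;
# 0 means "no sharing" under the proposed contract.
ref_ids = [42, 42, 42, 0, 7, 7, 0]

# Equivalent of SELECT ref_id, COUNT(*) GROUP BY ref_id,
# ignoring the non-shared rows (ref_id == 0).
sharing = Counter(r for r in ref_ids if r > 0)

print(dict(sharing))  # {42: 3, 7: 2}
```

With `back_ref`, no such query is possible after write: the relative index is consumed at encode time and leaves no persistent trace.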

Not in scope

  • Content-hash auto-dedup. The user chooses when to share; we only provide the mechanism.
  • Cross-write sharing (single write call still bounds the dedup cache).
  • Read-side fetch coalescing by ref_id. Purely a performance optimization, can be layered on separately.

Design

Where dedup happens is decided by physics

| BlobKind | Who decides the blob's coordinates | Where dedup must live |
| --- | --- | --- |
| Inline (≤64 KB) | Encoder (`external_buffers.add_buffer()` returns position) | `BlobV2StructuralEncoder` |
| Packed (64 KB – 4 MB) | Preprocessor (`write_packed()` returns `(blob_id, position)`) | `BlobPreprocessor` |
| Dedicated (>4 MB) | Preprocessor (`next_blob_id()` + `write_dedicated()`) | `BlobPreprocessor` |

So the implementation places one cache per layer, each keyed by ref_id:

BlobPreprocessor {
    ref_id_sidecar_cache: HashMap<u32, SidecarRef>,   // Packed / Dedicated
}

BlobV2StructuralEncoder {
    ref_dedup_tmp_map: HashMap<u32, (u64, u64)>,       // Inline
}

enum SidecarRef {
    Dedicated { blob_id: u32, size: u64 },
    Packed    { blob_id: u32, position: u64, size: u64 },
}

Write-path flow

lance.write_dataset(batch)
    │
    ▼
BlobPreprocessor::preprocess_batch()              ◀── step 1
    for each row:
        if ref_id > 0 and ref_id_sidecar_cache.contains(ref_id):
            emit descriptor with cached (blob_id, position, size)   ← skip sidecar write
            continue
        normal Dedicated / Packed / Inline decision
        if ref_id > 0 and wrote a Dedicated or Packed blob:
            ref_id_sidecar_cache.insert(ref_id, SidecarRef { ... })
    │
    ▼  (descriptor struct flows down)
    │
BlobV2StructuralEncoder::maybe_encode()           ◀── step 2
    for each Inline row with ref_id > 0:
        if ref_dedup_tmp_map.contains(ref_id):
            reuse cached (position, size)                           ← skip add_buffer
        else:
            position = external_buffers.add_buffer(bytes)
            ref_dedup_tmp_map.insert(ref_id, (position, size))

Both caches live for a single write session. No persistent state beyond the descriptor column itself.
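The two-cache scheme above can be modeled in a few lines of Python. This is a sketch of the routing and dedup logic only (size thresholds follow the table above; `write_batch`, its row tuples, and the returned byte count are illustrative, not the Rust implementation):

```python
INLINE_MAX = 64 * 1024          # <= 64 KB  -> Inline
PACKED_MAX = 4 * 1024 * 1024    # <= 4 MB   -> Packed, else Dedicated

def write_batch(rows):
    """rows: list of (ref_id, payload_bytes). Returns bytes physically stored."""
    sidecar_cache = {}   # ref_id -> coordinates (Packed / Dedicated layer)
    inline_cache = {}    # ref_id -> (position, size)  (Inline / encoder layer)
    stored = 0
    inline_pos = 0
    for ref_id, data in rows:
        size = len(data)
        cache = inline_cache if size <= INLINE_MAX else sidecar_cache
        if ref_id > 0 and ref_id in cache:
            continue  # reuse cached coordinates; skip the physical write
        stored += size
        if size <= INLINE_MAX:
            coords = (inline_pos, size)
            inline_pos += size
        else:
            coords = ("packed" if size <= PACKED_MAX else "dedicated", size)
        if ref_id > 0:
            cache[ref_id] = coords  # first writer of this ref_id seeds the cache
    return stored

# 8 rows sharing ref_id 42 plus 2 unshared rows: only 3 payloads hit disk.
rows = [(42, b"x" * 1000)] * 8 + [(0, b"y" * 1000)] * 2
print(write_batch(rows))  # 3000
```

Because both caches are plain per-session maps, dropping them at the end of the write is what bounds sharing to a single write call.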

Read-path

Zero functional change. Rows with the same ref_id carry identical (position, size) or (blob_id, size) coordinates in their descriptors. Existing readers resolve each row the same way they always did — multiple rows simply happen to resolve to the same bytes. The schema version check is widened to accept 5 or 6 descriptor fields (the 6th being the optional ref_id), so files written by the new encoder remain readable under the existing take/scan code paths without further changes.

Schema compatibility

  • Descriptor struct: BLOB_V2_DESC_FIELDS gains a trailing ref_id: UInt32, nullable=true.
  • Old files: 5-field descriptors continue to work (the v2 schema validator accepts both shapes).
  • New files read by old binaries: the trailing nullable field is ignored by Arrow struct consumers that look up fields by name, which is what Lance's take/scan code does.
  • File format version: ≥ 2.2 (matches existing Blob v2 requirement).
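The widened validator can be sketched as a shape check over the descriptor's field list. The field names below are hypothetical stand-ins; the real ones are defined by `BLOB_V2_DESC_FIELDS` in rust/lance-core/src/datatypes.rs:

```python
# Hypothetical names for the existing 5 descriptor fields.
OLD_FIELDS = ["position", "size", "blob_id", "kind", "flags"]
# New files append a trailing nullable ref_id.
NEW_FIELDS = OLD_FIELDS + ["ref_id"]

def is_valid_v2_descriptor(fields):
    # Widened check: 5 fields (old files) or 6 fields ending in ref_id.
    if len(fields) == 5:
        return True
    return len(fields) == 6 and fields[-1] == "ref_id"

print(is_valid_v2_descriptor(OLD_FIELDS), is_valid_v2_descriptor(NEW_FIELDS))  # True True
```

Name-based field lookup on the read side is what makes the trailing addition invisible to old binaries.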

Implementation

Five files, ~200 lines of core logic.

| File | Lines | Role |
| --- | --- | --- |
| rust/lance-core/src/datatypes.rs | +1 | Add `ref_id` field to `BLOB_V2_DESC_FIELDS` |
| rust/lance-encoding/src/encodings/logical/blob.rs | +64 / −9 | Encoder cache + Inline dedup + `ref_id` column output |
| rust/lance/src/dataset/blob.rs | +95 / −4 | Preprocessor cache + Packed/Dedicated dedup + schema check widened |
| rust/lance-encoding/src/decoder.rs | +17 / −1 | Recognize `lance.blob.v2` extension type by name (ExtensionType metadata does not always survive Lance's schema serialization) |
| python/python/lance/blob.py | +14 / −5 | `Blob` dataclass + `BlobType` storage schema + `BlobArray.from_pylist` plumb `ref_id` |

Total: 6 files, +288 / −19 including a demo test script.


Verification

python/test_ref_id_dedup.py writes 20 rows sharing one ref_id at three size classes and measures actual on-disk bytes:

=== inline_32kb (payload=32,768 B, ref_id=101, rows=20) ===
  main .lance:          34,558 B
  sidecar total:             0 B (0 files)
  naive (20 copies):   655,360 B
  amplification:           1.05×   ✓ DEDUP

=== packed_1mb (payload=1,048,576 B, ref_id=102, rows=20) ===
  main .lance:           1,784 B
  sidecar total:     1,048,576 B (1 file)
  naive (20 copies): 20,971,520 B
  amplification:           1.00×   ✓ DEDUP

=== dedicated_6mb (payload=6,291,456 B, ref_id=103, rows=20) ===
  main .lance:           1,784 B
  sidecar total:     6,291,456 B (1 file)
  naive (20 copies): 125,829,120 B
  amplification:           1.00×   ✓ DEDUP

Before this PR, dedicated_6mb produces 20 sidecar .blob files totaling 120 MB. After, it produces 1 sidecar file of 6 MB. Read-back under take_blobs returns byte-identical payloads to the source for all 20 rows on every path.
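The amplification column above is total on-disk bytes relative to one logical copy of the shared payload. A small sketch reproducing two of the reported figures (the helper name is illustrative):

```python
def amplification(main_bytes, sidecar_bytes, payload_bytes):
    """On-disk bytes relative to a single logical copy of the shared blob."""
    return (main_bytes + sidecar_bytes) / payload_bytes

# inline_32kb: all 20 rows dedup into the main .lance file.
print(round(amplification(34_558, 0, 32_768), 2))            # 1.05
# packed_1mb: the sidecar holds the single shared copy.
print(round(amplification(1_784, 1_048_576, 1_048_576), 2))  # 1.0
```

An amplification near 1.0× is exactly the "1× storage instead of N×" claim: the only overhead left is descriptor metadata.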


Built on

Independent of

  • Tracking issues for Blob V2 #4947's follow-up items (public API, compaction, ingest) — this PR is a format-level primitive that can land before those designs are finalized.

Forward compatibility with compaction

When compaction merges multiple data files, sharing may need to cross data-file boundaries. The contract proposed here supports that extension naturally — ref_id is already a persistent column, so compaction can inspect it and decide whether to preserve sharing (rebuild a merged cache) or fall back to independent rows (acceptable degradation). No schema or API change required.


Rows sharing the same positive `ref_id` now reuse a single stored copy of
the blob bytes, regardless of which physical storage path the size class
routes to. This turns the common "N rows referencing one GOP / embedding /
document" pattern into 1x storage instead of Nx.

- Inline: `BlobV2StructuralEncoder` caches `(position, size)` per ref_id.
  Later rows skip `external_buffers.add_buffer` and reuse the cached offset.
- Packed / Dedicated: `BlobPreprocessor` caches a `SidecarRef` per ref_id.
  Later rows skip `write_packed` / `write_dedicated` and reuse the cached
  `(blob_id, position, size)`.

The descriptor schema gains an optional trailing `ref_id: UInt32` column,
and the v2 schema check accepts 5 or 6 fields so readers stay compatible.
Python `Blob` gains a matching `ref_id` and `BlobArray.from_pylist` plumbs
it through.

Verified at 32 KB (Inline), 1 MB (Packed), and 6 MB (Dedicated): 20 rows
sharing one ref_id store 1x the bytes, not 20x. Dedicated went from 20
sidecar files down to 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


github-actions bot added the enhancement and python labels on Apr 21, 2026
