lance-format · Xuanwo · Feb 27, 2026 · Feb 27, 2026
diff --git a/docs/src/guide/blob.md b/docs/src/guide/blob.md
@@ -1,86 +1,314 @@
-# Blob As Files
+# Blob Columns
 
-Unlike other data formats, large multimodal data is a first-class citizen in the Lance columnar format.
-Lance provides a high-level API to store and retrieve large binary objects (blobs) in Lance datasets.
+Lance supports large binary objects (images, videos, audio, model artifacts) through blob columns.
+Blob access is lazy: reads return `BlobFile` handles so callers can stream bytes on demand.
 
 ![Blob](../images/blob.png)
 
-Lance serves large binary data using `lance.BlobFile`, which
-is a file-like object that lazily reads large binary objects.
+## What This Page Covers
 
-To create a Lance dataset with large blob data, you can mark a large binary column as a blob column by
-adding the metadata `lance-encoding:blob` to `true`.
+This page focuses on Python blob workflows and uses Lance file format terminology.
+
+- `data_storage_version` means the Lance **file format version** of a dataset.
+- A dataset's `data_storage_version` is fixed once the dataset is created.
+- If you need a different file format version, write a **new dataset**.
+
+## Quick Start (Blob v2)
 
 ```python
+import lance
 import pyarrow as pa
+from lance import blob_array, blob_field
+
+schema = pa.schema([
+    pa.field("id", pa.int64()),
+    blob_field("blob"),
+])
 
-schema = pa.schema(
-    [
-        pa.field("id", pa.int64()),
-        pa.field("video",
-            pa.large_binary(),
-            metadata={"lance-encoding:blob": "true"}
-        ),
-    ]
+table = pa.table(
+    {
+        "id": [1],
+        "blob": blob_array([b"hello blob v2"]),
+    },
+    schema=schema,
 )
+
+ds = lance.write_dataset(table, "./blobs_v22.lance", data_storage_version="2.2")
+
+blob = ds.take_blobs("blob", indices=[0])[0]
+with blob as f:
+    assert f.read() == b"hello blob v2"
 ```
 
-To write blob data to a Lance dataset, create a PyArrow table with the blob schema and use `lance.write_dataset`:
+## Version Compatibility (Single Source of Truth)
+
+| Dataset `data_storage_version` | Legacy blob metadata (`lance-encoding:blob`) | Blob v2 (`lance.blob.v2`) |
+|---|---|---|
+| `0.1`, `2.0`, `2.1` | Supported for write/read | Not supported |
+| `2.2+` | Not supported for write | Supported for write/read (recommended) |
+
+Important:
+
+- For file format `>= 2.2`, legacy blob metadata (`lance-encoding:blob`) is rejected on write.
+
+## Blob v2 Write Patterns
+
+Use `blob_field` and `blob_array` to build blob v2 columns.
 
 ```python
 import lance
+import pyarrow as pa
+from lance import Blob, blob_array, blob_field
+
+schema = pa.schema([
+    pa.field("id", pa.int64()),
+    blob_field("blob", nullable=True),
+])
 
-# First, download a sample video file for testing
-# wget https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4
-import urllib.request
-urllib.request.urlretrieve(
-    "https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4",
-    "sample_video.mp4"
+# A single column can mix:
+# - inline bytes
+# - external URI
+# - external URI slice (position + size)
+# - null
+rows = pa.table(
+    {
+        "id": [1, 2, 3, 4],
+        "blob": blob_array([
+            b"inline-bytes",
+            "s3://bucket/path/video.mp4",
+            Blob.from_uri("s3://bucket/archive.tar", position=4096, size=8192),
+            None,
+        ]),
+    },
+    schema=schema,
 )
 
-# Then read the video file content
-with open("sample_video.mp4", 'rb') as f:
-    video_data = f.read()
+ds = lance.write_dataset(
+    rows,
+    "./blobs_v22.lance",
+    data_storage_version="2.2",
+)
+```
 
-# Create table with blob data
-table = pa.table({
-    "id": [1],
-    "video": [video_data],
-}, schema=schema)
+### Example: packed external blobs (single container file)
 
-# Write to Lance dataset
-ds = lance.write_dataset(
-    table,
-    "./youtube.lance",
-    schema=schema
+```python
+import io
+import tarfile
+from pathlib import Path
+import lance
+import pyarrow as pa
+from lance import Blob, blob_array, blob_field
+
+# Build a tar file with three payloads
+payloads = {
+    "a.bin": b"alpha",
+    "b.bin": b"bravo",
+    "c.bin": b"charlie",
+}
+
+with tarfile.open("container.tar", "w") as tf:
+    for name, data in payloads.items():
+        info = tarfile.TarInfo(name)
+        info.size = len(data)
+        tf.addfile(info, io.BytesIO(data))
+
+# Capture offset/size for each member
+blob_values = []
+with tarfile.open("container.tar", "r") as tf:
+    container_uri = Path("container.tar").resolve().as_uri()
+    for name in payloads:
+        m = tf.getmember(name)
+        blob_values.append(Blob.from_uri(container_uri, position=m.offset_data, size=m.size))
+
+schema = pa.schema([
+    pa.field("name", pa.utf8()),
+    blob_field("blob"),
+])
+
+rows = pa.table(
+    {
+        "name": list(payloads.keys()),
+        "blob": blob_array(blob_values),
+    },
+    schema=schema,
 )
+
+ds = lance.write_dataset(rows, "./packed_blobs_v22.lance", data_storage_version="2.2")
 ```
 
-To fetch blobs from a Lance dataset, you can use `lance.dataset.LanceDataset.take_blobs`.
+## Blob v2 Read Patterns
+
+Use `take_blobs` to fetch file-like handles.
+Exactly one selector must be provided: `ids`, `indices`, or `addresses`.
+
+| Selector | Typical Use | Stability |
+|---|---|---|
+| `indices` | Positional reads within one dataset snapshot | Stable within that snapshot |
+| `ids` | Logical row-id based reads | Stable logical identity (when row ids are available) |
+| `addresses` | Low-level physical reads and debugging | Unstable physical location |
+
+### Read by row indices
+
+```python
+import lance
+
+ds = lance.dataset("./blobs_v22.lance")
+blobs = ds.take_blobs("blob", indices=[0, 1])
+
+with blobs[0] as f:
+    data = f.read()
+```
+
+### Read by row ids
+
+```python
+import lance
+
+ds = lance.dataset("./blobs_v22.lance")
+row_ids = ds.to_table(columns=[], with_row_id=True).column("_rowid").to_pylist()
+
+blobs = ds.take_blobs("blob", ids=row_ids[:2])
+```
+
+### Read by row addresses
+
+```python
+import lance
+
+ds = lance.dataset("./blobs_v22.lance")
+row_addrs = ds.to_table(columns=[], with_row_address=True).column("_rowaddr").to_pylist()
+
+blobs = ds.take_blobs("blob", addresses=row_addrs[:2])
+```
 
-For example, it's easy to use `BlobFile` to extract frames from a video file without
-loading the entire video into memory.
+### Example: decode video frames lazily
 
 ```python
-import av # pip install av
+import av
 import lance
 
-ds = lance.dataset("./youtube.lance")
-start_time, end_time = 500, 1000
-# Get blob data from the first row (id=0)
-blobs = ds.take_blobs("video", ids=[0])
-with av.open(blobs[0]) as container:
+ds = lance.dataset("./videos_v22.lance")
+blob = ds.take_blobs("video", indices=[0])[0]
+
+start_ms, end_ms = 500, 1000
+
+with av.open(blob) as container:
     stream = container.streams.video[0]
     stream.codec_context.skip_frame = "NONKEY"
 
-    start_time = start_time / stream.time_base
-    start_time = start_time.as_integer_ratio()[0]
-    end_time = end_time / stream.time_base
-    container.seek(start_time, stream=stream)
+    start = (start_ms / 1000) / stream.time_base
+    end = (end_ms / 1000) / stream.time_base
+    container.seek(int(start), stream=stream)
 
     for frame in container.decode(stream):
-        if frame.time > end_time:
+        if frame.time is not None and frame.time > end_ms / 1000:
             break
-        display(frame.to_image())
-        clear_output(wait=True) 
-```
+        # process frame
+        pass
+```
+
+## Legacy Compatibility Appendix (`data_storage_version` <= `2.1`)
+
+If you need to keep writing legacy blob columns, use file format `0.1`, `2.0`, or `2.1`
+and mark `LargeBinary` fields with `lance-encoding:blob = true`.
+
+```python
+import lance
+import pyarrow as pa
+
+schema = pa.schema([
+    pa.field("id", pa.int64()),
+    pa.field(
+        "video",
+        pa.large_binary(),
+        metadata={"lance-encoding:blob": "true"},
+    ),
+])
+
+table = pa.table(
+    {
+        "id": [1, 2],
+        "video": [b"foo", b"bar"],
+    },
+    schema=schema,
+)
+
+ds = lance.write_dataset(
+    table,
+    "./legacy_blob_dataset",
+    data_storage_version="2.1",
+)
+```
+
+This write pattern is invalid for `data_storage_version >= 2.2`.
+For new datasets, prefer blob v2.
+
+## Rewrite to a New Blob v2 Dataset
+
+If your current dataset is legacy blob and you want blob v2, rewrite into a new dataset with `data_storage_version="2.2"`.
+
+```python
+import lance
+import pyarrow as pa
+from lance import blob_array, blob_field
+
+legacy = lance.dataset("./legacy_blob_dataset")
+raw = legacy.scanner(columns=["id", "video"], blob_handling="all_binary").to_table()
+
+new_schema = pa.schema([
+    pa.field("id", pa.int64()),
+    blob_field("video"),
+])
+
+rewritten = pa.table(
+    {
+        "id": raw.column("id"),
+        "video": blob_array(raw.column("video").to_pylist()),
+    },
+    schema=new_schema,
+)
+
+lance.write_dataset(
+    rewritten,
+    "./blob_v22_dataset",
+    data_storage_version="2.2",
+)
+```
+
+Warning:
+
+- The example above materializes binary payloads in memory (`blob_handling="all_binary"` and `to_pylist()`).
+- For large datasets, prefer chunked/batched rewrite pipelines.
+
+## Troubleshooting
+
+### "Blob v2 requires file version >= 2.2"
+
+Cause:
+
+- You are writing blob v2 values into a dataset/file format below `2.2`.
+
+Fix:
+
+- Write to a dataset created with `data_storage_version="2.2"` (or newer).
+
+### "Legacy blob columns ... are not supported for file version >= 2.2"
+
+Cause:
+
+- You are using legacy blob metadata (`lance-encoding:blob`) while writing `2.2+` data.
+
+Fix:
+
+- Replace legacy metadata-based columns with blob v2 columns (`blob_field` / `blob_array`).
+
+### "Exactly one of ids, indices, or addresses must be specified"
+
+Cause:
+
+- `take_blobs` received none or multiple selectors.
+
+Fix:
+
+- Provide exactly one of `ids`, `indices`, or `addresses`.