diff --git a/docs/src/guide/blob.md b/docs/src/guide/blob.md
index 6450230e09b..1a4737e7346 100644
--- a/docs/src/guide/blob.md
+++ b/docs/src/guide/blob.md
@@ -1,86 +1,314 @@
-# Blob As Files
+# Blob Columns
 
-Unlike other data formats, large multimodal data is a first-class citizen in the Lance columnar format.
-Lance provides a high-level API to store and retrieve large binary objects (blobs) in Lance datasets.
+Lance supports large binary objects (images, videos, audio, model artifacts) through blob columns.
+Blob access is lazy: reads return `BlobFile` handles so callers can stream bytes on demand.
 
 ![Blob](../images/blob.png)
 
-Lance serves large binary data using `lance.BlobFile`, which
-is a file-like object that lazily reads large binary objects.
+## What This Page Covers
 
-To create a Lance dataset with large blob data, you can mark a large binary column as a blob column by
-adding the metadata `lance-encoding:blob` to `true`.
+This page focuses on Python blob workflows and uses Lance file format terminology.
+
+- `data_storage_version` means the Lance **file format version** of a dataset.
+- A dataset's `data_storage_version` is fixed once the dataset is created.
+- If you need a different file format version, write a **new dataset**.
+
+## Quick Start (Blob v2)
 
 ```python
+import lance
 import pyarrow as pa
+from lance import blob_array, blob_field
+
+schema = pa.schema([
+    pa.field("id", pa.int64()),
+    blob_field("blob"),
+])
 
-schema = pa.schema(
-    [
-        pa.field("id", pa.int64()),
-        pa.field("video",
-                 pa.large_binary(),
-                 metadata={"lance-encoding:blob": "true"}
-        ),
-    ]
+table = pa.table(
+    {
+        "id": [1],
+        "blob": blob_array([b"hello blob v2"]),
+    },
+    schema=schema,
 )
+
+ds = lance.write_dataset(table, "./blobs_v22.lance", data_storage_version="2.2")
+
+blob = ds.take_blobs("blob", indices=[0])[0]
+with blob as f:
+    assert f.read() == b"hello blob v2"
 ```
 
-To write blob data to a Lance dataset, create a PyArrow table with the blob schema and use `lance.write_dataset`:
+## Version Compatibility (Single Source of Truth)
+
+| Dataset `data_storage_version` | Legacy blob metadata (`lance-encoding:blob`) | Blob v2 (`lance.blob.v2`) |
+|---|---|---|
+| `0.1`, `2.0`, `2.1` | Supported for write/read | Not supported |
+| `2.2+` | Not supported for write | Supported for write/read (recommended) |
+
+Important:
+
+- For file format `>= 2.2`, legacy blob metadata (`lance-encoding:blob`) is rejected on write.
+
+## Blob v2 Write Patterns
+
+Use `blob_field` and `blob_array` to build blob v2 columns.
 ```python
 import lance
+import pyarrow as pa
+from lance import Blob, blob_array, blob_field
+
+schema = pa.schema([
+    pa.field("id", pa.int64()),
+    blob_field("blob", nullable=True),
+])
 
-# First, download a sample video file for testing
-# wget https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4
-import urllib.request
-urllib.request.urlretrieve(
-    "https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4",
-    "sample_video.mp4"
+# A single column can mix:
+# - inline bytes
+# - external URI
+# - external URI slice (position + size)
+# - null
+rows = pa.table(
+    {
+        "id": [1, 2, 3, 4],
+        "blob": blob_array([
+            b"inline-bytes",
+            "s3://bucket/path/video.mp4",
+            Blob.from_uri("s3://bucket/archive.tar", position=4096, size=8192),
+            None,
+        ]),
+    },
+    schema=schema,
 )
 
-# Then read the video file content
-with open("sample_video.mp4", 'rb') as f:
-    video_data = f.read()
+ds = lance.write_dataset(
+    rows,
+    "./blobs_v22.lance",
+    data_storage_version="2.2",
+    mode="overwrite",  # the Quick Start above already created this path
+)
+```
 
-# Create table with blob data
-table = pa.table({
-    "id": [1],
-    "video": [video_data],
-}, schema=schema)
+### Example: packed external blobs (single container file)
 
-# Write to Lance dataset
-ds = lance.write_dataset(
-    table,
-    "./youtube.lance",
-    schema=schema
+```python
+import io
+import tarfile
+from pathlib import Path
+import lance
+import pyarrow as pa
+from lance import Blob, blob_array, blob_field
+
+# Build a tar file with three payloads
+payloads = {
+    "a.bin": b"alpha",
+    "b.bin": b"bravo",
+    "c.bin": b"charlie",
+}
+
+with tarfile.open("container.tar", "w") as tf:
+    for name, data in payloads.items():
+        info = tarfile.TarInfo(name)
+        info.size = len(data)
+        tf.addfile(info, io.BytesIO(data))
+
+# Capture offset/size for each member
+blob_values = []
+with tarfile.open("container.tar", "r") as tf:
+    container_uri = Path("container.tar").resolve().as_uri()
+    for name in payloads:
+        m = tf.getmember(name)
+        blob_values.append(Blob.from_uri(container_uri, position=m.offset_data, size=m.size))
+
+schema = pa.schema([
+    pa.field("name", pa.utf8()),
+    blob_field("blob"),
+])
+
+rows = pa.table(
+    {
+        "name": list(payloads.keys()),
+        "blob": blob_array(blob_values),
+    },
+    schema=schema,
+)
+
+ds = lance.write_dataset(rows, "./packed_blobs_v22.lance", data_storage_version="2.2")
 ```
 
-To fetch blobs from a Lance dataset, you can use `lance.dataset.LanceDataset.take_blobs`.
+## Blob v2 Read Patterns
+
+Use `take_blobs` to fetch file-like handles.
+Exactly one selector must be provided: `ids`, `indices`, or `addresses`.
+
+| Selector | Typical Use | Stability |
+|---|---|---|
+| `indices` | Positional reads within one dataset snapshot | Stable within that snapshot |
+| `ids` | Logical row-id based reads | Stable logical identity (when row ids are available) |
+| `addresses` | Low-level physical reads and debugging | Unstable physical location |
+
+### Read by row indices
+
+```python
+import lance
+
+ds = lance.dataset("./blobs_v22.lance")
+blobs = ds.take_blobs("blob", indices=[0, 1])
+
+with blobs[0] as f:
+    data = f.read()
+```
+
+### Read by row ids
+
+```python
+import lance
+
+ds = lance.dataset("./blobs_v22.lance")
+row_ids = ds.to_table(columns=[], with_row_id=True).column("_rowid").to_pylist()
+
+blobs = ds.take_blobs("blob", ids=row_ids[:2])
+```
+
+### Read by row addresses
+
+```python
+import lance
+
+ds = lance.dataset("./blobs_v22.lance")
+row_addrs = ds.to_table(columns=[], with_row_address=True).column("_rowaddr").to_pylist()
+
+blobs = ds.take_blobs("blob", addresses=row_addrs[:2])
+```
 
-For example, it's easy to use `BlobFile` to extract frames from a video file without
-loading the entire video into memory.
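The packed-container workflow above records `offset_data`/`size` pairs; at read time, an external-URI slice resolves to nothing more than a bounded byte-range read on the container file. A stdlib-only sketch of that resolution (no Lance involved; the tar is built in memory for illustration):

```python
import io
import tarfile

payloads = {
    "a.bin": b"alpha",
    "b.bin": b"bravo",
    "c.bin": b"charlie",
}

# Write the container tar into a memory buffer
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    for name, data in payloads.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

# Record each member's (position, size), mirroring Blob.from_uri arguments
buf.seek(0)
ranges = {}
with tarfile.open(fileobj=buf, mode="r") as tf:
    for name in payloads:
        m = tf.getmember(name)
        ranges[name] = (m.offset_data, m.size)

# A position/size slice is just a seek + bounded read on the container bytes
raw = buf.getvalue()
for name, (position, size) in ranges.items():
    assert raw[position:position + size] == payloads[name]
```

This is why one container file can back many blob rows: each row only needs the container URI plus its byte range.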
+### Example: decode video frames lazily
 
 ```python
-import av # pip install av
+import av
 import lance
 
-ds = lance.dataset("./youtube.lance")
-start_time, end_time = 500, 1000
-# Get blob data from the first row (id=0)
-blobs = ds.take_blobs("video", ids=[0])
-with av.open(blobs[0]) as container:
+# Assumes a dataset with a blob v2 "video" column (created in the data types guide)
+ds = lance.dataset("./videos_v22.lance")
+blob = ds.take_blobs("video", indices=[0])[0]
+
+start_ms, end_ms = 500, 1000
+
+with av.open(blob) as container:
     stream = container.streams.video[0]
     stream.codec_context.skip_frame = "NONKEY"
-    start_time = start_time / stream.time_base
-    start_time = start_time.as_integer_ratio()[0]
-    end_time = end_time / stream.time_base
-    container.seek(start_time, stream=stream)
+    # Seek offsets are expressed in stream.time_base units
+    start = (start_ms / 1000) / stream.time_base
+    container.seek(int(start), stream=stream)
     for frame in container.decode(stream):
-        if frame.time > end_time:
+        if frame.time is not None and frame.time > end_ms / 1000:
             break
-        display(frame.to_image())
-        clear_output(wait=True)
-```
\ No newline at end of file
+        # process frame
+        pass
+```
+
+## Legacy Compatibility Appendix (`data_storage_version` <= `2.1`)
+
+If you need to keep writing legacy blob columns, use file format `0.1`, `2.0`, or `2.1`
+and mark `LargeBinary` fields with `lance-encoding:blob = true`.
+
+```python
+import lance
+import pyarrow as pa
+
+schema = pa.schema([
+    pa.field("id", pa.int64()),
+    pa.field(
+        "video",
+        pa.large_binary(),
+        metadata={"lance-encoding:blob": "true"},
+    ),
+])
+
+table = pa.table(
+    {
+        "id": [1, 2],
+        "video": [b"foo", b"bar"],
+    },
+    schema=schema,
+)
+
+ds = lance.write_dataset(
+    table,
+    "./legacy_blob_dataset",
+    data_storage_version="2.1",
+)
+```
+
+This write pattern is invalid for `data_storage_version >= 2.2`.
+For new datasets, prefer blob v2.
+
+## Rewrite to a New Blob v2 Dataset
+
+If your current dataset is legacy blob and you want blob v2, rewrite into a new dataset with `data_storage_version="2.2"`.
+
+```python
+import lance
+import pyarrow as pa
+from lance import blob_array, blob_field
+
+legacy = lance.dataset("./legacy_blob_dataset")
+raw = legacy.scanner(columns=["id", "video"], blob_handling="all_binary").to_table()
+
+new_schema = pa.schema([
+    pa.field("id", pa.int64()),
+    blob_field("video"),
+])
+
+rewritten = pa.table(
+    {
+        "id": raw.column("id"),
+        "video": blob_array(raw.column("video").to_pylist()),
+    },
+    schema=new_schema,
+)
+
+lance.write_dataset(
+    rewritten,
+    "./blob_v22_dataset",
+    data_storage_version="2.2",
+)
+```
+
+Warning:
+
+- The example above materializes binary payloads in memory (`blob_handling="all_binary"` and `to_pylist()`).
+- For large datasets, prefer chunked/batched rewrite pipelines.
+
+## Troubleshooting
+
+### "Blob v2 requires file version >= 2.2"
+
+Cause:
+
+- You are writing blob v2 values into a dataset/file format below `2.2`.
+
+Fix:
+
+- Write to a dataset created with `data_storage_version="2.2"` (or newer).
+
+### "Legacy blob columns ... are not supported for file version >= 2.2"
+
+Cause:
+
+- You are using legacy blob metadata (`lance-encoding:blob`) while writing `2.2+` data.
+
+Fix:
+
+- Replace legacy metadata-based columns with blob v2 columns (`blob_field` / `blob_array`).
+
+### "Exactly one of ids, indices, or addresses must be specified"
+
+Cause:
+
+- `take_blobs` received none or multiple selectors.
+
+Fix:
+
+- Provide exactly one of `ids`, `indices`, or `addresses`.
diff --git a/docs/src/guide/data_types.md b/docs/src/guide/data_types.md
index c7853695f36..2217954ac20 100644
--- a/docs/src/guide/data_types.md
+++ b/docs/src/guide/data_types.md
@@ -32,9 +32,44 @@ Lance supports the full Apache Arrow type system. When writing data through Pyth
 
 ### Blob Type for Large Binary Objects
 
-Lance provides a specialized **Blob** type for efficiently storing and retrieving very large binary objects such as videos, images, audio files, or other multimedia content. Unlike regular binary columns, blobs are stored out-of-line and support lazy loading, which means you can read portions of the data without loading everything into memory.
+Lance provides a specialized **Blob** type for efficiently storing and retrieving very large binary objects such as videos, images, audio files, or other multimedia content. Unlike regular binary columns, blobs support lazy loading, which means you can read portions of the data without loading everything into memory.
 
-To create a blob column, add the `lance-encoding:blob` metadata to a `LargeBinary` field:
+For new datasets, use blob v2 (`lance.blob.v2`) via `blob_field` and `blob_array`.
+
+Blob versioning follows dataset file format rules:
+
+- `data_storage_version` is the Lance file format version of a dataset.
+- A dataset's `data_storage_version` is fixed once created.
+- For `data_storage_version >= 2.2`, legacy blob metadata (`lance-encoding:blob`) is rejected on write.
+- Legacy metadata-based blob write remains available for `0.1`, `2.0`, and `2.1`.
+
+```python
+import lance
+import pyarrow as pa
+from lance import blob_array, blob_field
+
+schema = pa.schema([
+    pa.field("id", pa.int64()),
+    blob_field("video"),
+])
+
+table = pa.table(
+    {
+        "id": [1],
+        "video": blob_array([b"sample-video-bytes"]),
+    },
+    schema=schema,
+)
+
+ds = lance.write_dataset(table, "./videos_v22.lance", data_storage_version="2.2")
+blob = ds.take_blobs("video", indices=[0])[0]
+with blob as f:
+    payload = f.read()
+```
+
+For legacy compatibility (`data_storage_version <= 2.1`), you can still write blob columns using `LargeBinary` with `lance-encoding:blob=true`.
+
+To create a blob column with the legacy path, add the `lance-encoding:blob` metadata to a `LargeBinary` field:
 
 ```python
 import pyarrow as pa
@@ -58,7 +93,12 @@ table = pa.table({
     "video": [video_data],
 }, schema=schema)
 
-ds = lance.write_dataset(table, "./videos.lance", schema=schema)
+ds = lance.write_dataset(
+    table,
+    "./videos_legacy.lance",
+    schema=schema,
+    data_storage_version="2.1",
+)
 ```
 
 To read blob data, use `take_blobs()` which returns file-like objects for lazy reading:
@@ -326,7 +366,7 @@ When integrating Lance with other systems (like Apache Flink, Spark, or Presto),
 | `TIMESTAMP WITH LOCAL TIMEZONE` | `Timestamp` | With timezone info |
 | `BINARY` / `VARBINARY` | `Binary` | |
 | `BYTES` | `Binary` | |
-| `BLOB` | `LargeBinary` with `lance-encoding:blob` | Large binary objects with lazy loading |
+| `BLOB` | Blob v2 extension type (`lance.blob.v2`) | Use `blob_field` / `blob_array` for new datasets; legacy metadata path applies to `data_storage_version <= 2.1` |
 | `ARRAY` | `List(T)` | Variable-length array |
 | `ARRAY(n)` | `FixedSizeList(T, n)` | Fixed-length array (vectors) |
 | `ROW` / `STRUCT` | `Struct` | Nested structure |