Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
97 changes: 32 additions & 65 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,17 +3,17 @@

<img width="257" alt="Lance Logo" src="https://user-images.githubusercontent.com/917119/199353423-d3e202f7-0269-411d-8ff2-e747e419e492.png">

**Modern columnar data format for ML. Convert from Parquet in 2-lines of code for 100x faster random access, zero-cost schema evolution, rich secondary indices, versioning, and more.<br/>**
**Compatible with Pandas, DuckDB, Polars, Pyarrow, and Ray with more integrations on the way.**
**The Open Lakehouse Format for Multimodal AI**<br/>
**High-performance vector search, full-text search, random access, and feature engineering capabilities for the lakehouse.**<br/>
**Compatible with Pandas, DuckDB, Polars, PyArrow, Ray, Spark, and more integrations on the way.**

<a href="https://lancedb.github.io/lance/">Documentation</a> •
<a href="https://blog.lancedb.com/">Blog</a> •
<a href="https://discord.gg/zMM32dvNtd">Discord</a> •
<a href="https://x.com/lancedb">X</a>
<a href="https://lance.org">Documentation</a> •
<a href="https://lance.org/community">Community</a> •
<a href="https://discord.gg/zMM32dvNtd">Discord</a>

[CI]: https://github.com/lancedb/lance/actions/workflows/rust.yml
[CI Badge]: https://github.com/lancedb/lance/actions/workflows/rust.yml/badge.svg
[Docs]: https://lancedb.github.io/lance/
[Docs]: https://lance.org
[Docs Badge]: https://img.shields.io/badge/docs-passing-brightgreen
[crates.io]: https://crates.io/crates/lance
[crates.io badge]: https://img.shields.io/crates/v/lance.svg
Expand All @@ -30,24 +30,30 @@

<hr />

Lance is a modern columnar data format that is optimized for ML workflows and datasets. Lance is perfect for:
Lance is an open lakehouse format for multimodal AI. It contains a file format, table format, and catalog spec that allows you to build a complete lakehouse on top of object storage to power your AI workflows. Lance is perfect for:

1. Building search engines and feature stores.
2. Large-scale ML training requiring high performance IO and shuffles.
3. Storing, querying, and inspecting deeply nested data for robotics or large blobs like images, point clouds, and more.
1. Building search engines and feature stores with hybrid search capabilities.
2. Large-scale ML training requiring high performance IO and random access.
3. Storing, querying, and managing multimodal data including images, videos, audio, text, and embeddings.

The key features of Lance include:

* **High-performance random access:** 100x faster than Parquet without sacrificing scan performance.
* **Expressive hybrid search:** Combine vector similarity search, full-text search (BM25), and SQL analytics on the same dataset with accelerated secondary indices.

* **Vector search:** find nearest neighbors in milliseconds and combine OLAP-queries with vector search.
* **Lightning-fast random access:** 100x faster than Parquet or Iceberg for random access without sacrificing scan performance.

* **Zero-copy, automatic versioning:** manage versions of your data without needing extra infrastructure.
* **Native multimodal data support:** Store images, videos, audio, text, and embeddings in a single unified format with efficient blob encoding and lazy loading.

* **Ecosystem integrations:** Apache Arrow, Pandas, Polars, DuckDB, Ray, Spark and more on the way.
* **Data evolution:** Efficiently add columns with backfilled values without full table rewrites, perfect for ML feature engineering.

* **Zero-copy versioning:** ACID transactions, time travel, and automatic versioning without needing extra infrastructure.

* **Rich ecosystem integrations:** Apache Arrow, Pandas, Polars, DuckDB, Apache Spark, Ray, Trino, Apache Flink, and open catalogs (Apache Polaris, Unity Catalog, Apache Gravitino).

For more details, see the full [Lance format specification](https://lance.org/format).

> [!TIP]
> Lance is in active development and we welcome contributions. Please see our [contributing guide](https://lancedb.github.io/lance/community/contributing) for more information.
> Lance is in active development and we welcome contributions. Please see our [contributing guide](https://lance.org/docs/community/contributing) for more information.

## Quick Start

Expand Down Expand Up @@ -171,24 +177,6 @@ rs = [dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})
| [java](./java) | Java bindings (JNI) |
| [docs](./docs) | Documentation source |

## What makes Lance different
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a duplicate of the intro section so removed


Here we will highlight a few aspects of Lance’s design. For more details, see the full [Lance design document](https://lancedb.github.io/lance/format).

**Vector index**: Vector index for similarity search over embedding space.
Support both CPUs (``x86_64`` and ``arm``) and GPU (``Nvidia (cuda)`` and ``Apple Silicon (mps)``).

**Encodings**: To achieve both fast columnar scan and sub-linear point queries, Lance uses custom encodings and layouts.

**Nested fields**: Lance stores each subfield as a separate column to support efficient filters like “find images where detected objects include cats”.

**Versioning**: A Manifest can be used to record snapshots. Currently we support creating new versions automatically via appends, overwrites, and index creation.

**Fast updates** (ROADMAP): Updates will be supported via write-ahead logs.

**Rich secondary indices**: Support `BTree`, `Bitmap`, `Full text search`, `Label list`,
`NGrams`, and more.

## Benchmarks

### Vector search
Expand All @@ -209,9 +197,9 @@ We create a Lance dataset using the Oxford Pet dataset to do some preliminary pe

![](docs/src/images/lance_perf.png)

## Why are you building yet another data format?!
## Why Lance for AI/ML workflows?

The machine learning development cycle involves the steps:
The machine learning development cycle involves multiple stages:

```mermaid
graph LR
Expand All @@ -226,20 +214,16 @@ graph LR
H --> A;
```

People use different data representations to varying stages for the performance or limited by the tooling available.
Academia mainly uses XML / JSON for annotations and zipped images/sensors data for deep learning, which
is difficult to integrate into data infrastructure and slow to train over cloud storage.
While industry uses data lakes (Parquet-based techniques, i.e., Delta Lake, Iceberg) or data warehouses (AWS Redshift
or Google BigQuery) to collect and analyze data, they have to convert the data into training-friendly formats, such
as [Rikai](https://github.com/eto-ai/rikai)/[Petastorm](https://github.com/uber/petastorm)
or [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord).
Multiple single-purpose data transforms, as well as syncing copies between cloud storage to local training
instances have become a common practice.
Traditional lakehouse formats were designed for SQL analytics and struggle with AI/ML workloads that require:
- **Vector search** for similarity and semantic retrieval
- **Fast random access** for sampling and interactive exploration
- **Multimodal data** storage (images, videos, audio alongside embeddings)
- **Data evolution** for feature engineering without full table rewrites
- **Hybrid search** combining vectors, full-text, and SQL predicates

While each of the existing data formats excels at the workload it was originally designed for, we need a new data format
tailored for multistage ML development cycles to reduce and data silos.
While existing formats (Parquet, Iceberg, Delta Lake) excel at SQL analytics, they require additional specialized systems for AI capabilities. Lance brings these AI-first features directly into the lakehouse format.

A comparison of different data formats in each stage of ML development cycle.
A comparison of different formats across ML development stages:

| | Lance | Parquet & ORC | JSON & XML | TFRecord | Database | Warehouse |
|---------------------|-------|---------------|------------|----------|----------|-----------|
Expand All @@ -249,20 +233,3 @@ A comparison of different data formats in each stage of ML development cycle.
| Exploration | Fast | Slow | Fast | Slow | Fast | Decent |
| Infra Support | Rich | Rich | Decent | Limited | Rich | Rich |

## Community Highlights
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this and next section we should probably make them pages in the doc site, so remove for now.


Lance is currently used in production by:
* [LanceDB](https://github.com/lancedb/lancedb), a serverless, low-latency vector database for ML applications
* [LanceDB Enterprise](https://docs.lancedb.com/enterprise/introduction), hyperscale LanceDB with enterprise SLA.
* Leading multimodal Gen AI companies for training over petabyte-scale multimodal data.
* Self-driving car company for large-scale storage, retrieval and processing of multi-modal data.
* E-commerce company for billion-scale+ vector personalized search.
* and more.

## Presentations, Blogs and Talks

* [Designing a Table Format for ML Workloads](https://blog.lancedb.com/designing-a-table-format-for-ml-workloads/), Feb 2025.
* [Transforming Multimodal Data Management with LanceDB, Ray Summit](https://www.youtube.com/watch?v=xmTFEzAh8ho), Oct 2024.
* [Lance v2: A columnar container format for modern data](https://blog.lancedb.com/lance-v2/), Apr 2024.
* [Lance Deep Dive](https://drive.google.com/file/d/1Orh9rK0Mpj9zN_gnQF1eJJFpAc6lStGm/view?usp=drive_link). July 2023.
* [Lance: A New Columnar Data Format](https://docs.google.com/presentation/d/1a4nAiQAkPDBtOfXFpPg7lbeDAxcNDVKgoUkw3cUs2rE/edit#slide=id.p), [Scipy 2022, Austin, TX](https://www.scipy2022.scipy.org/posters). July, 2022.
28 changes: 19 additions & 9 deletions java/README.md
Original file line number Diff line number Diff line change
@@ -1,20 +1,30 @@
# Java bindings and SDK for Lance Data Format

> :warning: **Under heavy development**
# Java bindings and SDK for Lance

<div align="center">
<p align="center">

<img width="257" alt="Lance Logo" src="https://user-images.githubusercontent.com/917119/199353423-d3e202f7-0269-411d-8ff2-e747e419e492.png">

Lance is a new columnar data format for data science and machine learning
**The Open Lakehouse Format for Multimodal AI**
</p></div>

Why you should use Lance
1. It is an order of magnitude faster than Parquet for point queries and nested data structures common to DS/ML
2. It comes with a fast vector index that delivers sub-millisecond nearest neighbor search performance
3. It is automatically versioned and supports lineage and time-travel for full reproducibility
4. It is integrated with duckdb/pandas/polars already. Easily convert from/to Parquet in 2 lines of code
Lance is an open lakehouse format for multimodal AI. It contains a file format, table format, and catalog spec that allows you to build a complete lakehouse on top of object storage to power your AI workflows.

The key features of Lance include:

* **Expressive hybrid search:** Combine vector similarity search, full-text search (BM25), and SQL analytics on the same dataset with accelerated secondary indices.

* **Lightning-fast random access:** 100x faster than Parquet or Iceberg for random access without sacrificing scan performance.

* **Native multimodal data support:** Store images, videos, audio, text, and embeddings in a single unified format with efficient blob encoding and lazy loading.

* **Data evolution:** Efficiently add columns with backfilled values without full table rewrites, perfect for ML feature engineering.

* **Zero-copy versioning:** ACID transactions, time travel, and automatic versioning without needing extra infrastructure.

* **Rich ecosystem integrations:** Apache Arrow, Pandas, Polars, DuckDB, Apache Spark, Ray, Trino, Apache Flink, and open catalogs (Apache Polaris, Unity Catalog, Apache Gravitino).

For more details, see the full [Lance format specification](https://lance.org/format).

## Quick start

Expand Down
33 changes: 12 additions & 21 deletions rust/lance/README.md
Original file line number Diff line number Diff line change
@@ -1,11 +1,11 @@
# Rust Implementation of Lance Data Format
# Rust Implementation of Lance

<div align="center">
<p align="center">

<img width="257" alt="Lance Logo" src="https://user-images.githubusercontent.com/917119/199353423-d3e202f7-0269-411d-8ff2-e747e419e492.png">

**A new columnar data format for data science and machine learning**
**The Open Lakehouse Format for Multimodal AI**
</p></div>

## Installation
Expand Down Expand Up @@ -67,31 +67,22 @@ params.num_sub_vectors = 16;
dataset.create_index(&["embeddings"], IndexType::Vector, None, &params, true).await;
```

## Motivation
## What is Lance?

Why do we *need* a new format for data science and machine learning?
Lance is an open lakehouse format for multimodal AI. It contains a file format, table format, and catalog spec that allows you to build a complete lakehouse on top of object storage to power your AI workflows.

### 1. Reproducibility is a must-have
The key features of Lance include:

Versioning and experimentation support should be built into the dataset instead of requiring multiple tools.<br/>
It should also be efficient and not require expensive copying everytime you want to create a new version.<br/>
We call this "Zero copy versioning" in Lance. It makes versioning data easy without increasing storage costs.
* **Expressive hybrid search:** Combine vector similarity search, full-text search (BM25), and SQL analytics on the same dataset with accelerated secondary indices.

### 2. Cloud storage is now the default
* **Lightning-fast random access:** 100x faster than Parquet or Iceberg for random access without sacrificing scan performance.

Remote object storage is the default now for data science and machine learning and the performance characteristics of cloud are fundamentally different.<br/>
Lance format is optimized to be cloud native. Common operations like filter-then-take can be order of magnitude faster
using Lance than Parquet, especially for ML data.
* **Native multimodal data support:** Store images, videos, audio, text, and embeddings in a single unified format with efficient blob encoding and lazy loading.

### 3. Vectors must be a first class citizen, not a separate thing
* **Data evolution:** Efficiently add columns with backfilled values without full table rewrites, perfect for ML feature engineering.

The majority of reasonable scale workflows should not require the added complexity and cost of a
specialized database just to compute vector similarity. Lance integrates optimized vector indices
into a columnar format so no additional infrastructure is required to get low latency top-K similarity search.
* **Zero-copy versioning:** ACID transactions, time travel, and automatic versioning without needing extra infrastructure.

### 4. Open standards is a requirement
* **Rich ecosystem integrations:** Apache Arrow, Pandas, Polars, DuckDB, Apache Spark, Ray, Trino, Apache Flink, and open catalogs (Apache Polaris, Unity Catalog, Apache Gravitino).

The DS/ML ecosystem is incredibly rich and data *must be* easily accessible across different languages, tools, and environments.
Lance makes Apache Arrow integration its primary interface, which means conversions to/from is 2 lines of code, your
code does not need to change after conversion, and nothing is locked-up to force you to pay for vendor compute.
We need open-source not fauxpen-source.
For more details, see the full [Lance format specification](https://lance.org/format).
Loading