diff --git a/README.md b/README.md index 3100cbd67b8..137172786d1 100644 --- a/README.md +++ b/README.md @@ -3,17 +3,17 @@ Lance Logo -**Modern columnar data format for ML. Convert from Parquet in 2-lines of code for 100x faster random access, zero-cost schema evolution, rich secondary indices, versioning, and more.
** -**Compatible with Pandas, DuckDB, Polars, Pyarrow, and Ray with more integrations on the way.** +**The Open Lakehouse Format for Multimodal AI**
+**High-performance vector search, full-text search, random access, and feature engineering capabilities for the lakehouse.**
+**Compatible with Pandas, DuckDB, Polars, PyArrow, Ray, and Spark, with more integrations on the way.** -Documentation • -Blog • -Discord • -X +Documentation • +Community • +Discord [CI]: https://github.com/lancedb/lance/actions/workflows/rust.yml [CI Badge]: https://github.com/lancedb/lance/actions/workflows/rust.yml/badge.svg -[Docs]: https://lancedb.github.io/lance/ +[Docs]: https://lance.org [Docs Badge]: https://img.shields.io/badge/docs-passing-brightgreen [crates.io]: https://crates.io/crates/lance [crates.io badge]: https://img.shields.io/crates/v/lance.svg @@ -30,24 +30,30 @@
-Lance is a modern columnar data format that is optimized for ML workflows and datasets. Lance is perfect for: +Lance is an open lakehouse format for multimodal AI. It contains a file format, table format, and catalog spec that allows you to build a complete lakehouse on top of object storage to power your AI workflows. Lance is perfect for: -1. Building search engines and feature stores. -2. Large-scale ML training requiring high performance IO and shuffles. -3. Storing, querying, and inspecting deeply nested data for robotics or large blobs like images, point clouds, and more. +1. Building search engines and feature stores with hybrid search capabilities. +2. Large-scale ML training requiring high performance IO and random access. +3. Storing, querying, and managing multimodal data including images, videos, audio, text, and embeddings. The key features of Lance include: -* **High-performance random access:** 100x faster than Parquet without sacrificing scan performance. +* **Expressive hybrid search:** Combine vector similarity search, full-text search (BM25), and SQL analytics on the same dataset with accelerated secondary indices. -* **Vector search:** find nearest neighbors in milliseconds and combine OLAP-queries with vector search. +* **Lightning-fast random access:** 100x faster than Parquet or Iceberg for random access without sacrificing scan performance. -* **Zero-copy, automatic versioning:** manage versions of your data without needing extra infrastructure. +* **Native multimodal data support:** Store images, videos, audio, text, and embeddings in a single unified format with efficient blob encoding and lazy loading. -* **Ecosystem integrations:** Apache Arrow, Pandas, Polars, DuckDB, Ray, Spark and more on the way. +* **Data evolution:** Efficiently add columns with backfilled values without full table rewrites, perfect for ML feature engineering. 
+ +* **Zero-copy versioning:** ACID transactions, time travel, and automatic versioning without needing extra infrastructure. + +* **Rich ecosystem integrations:** Apache Arrow, Pandas, Polars, DuckDB, Apache Spark, Ray, Trino, Apache Flink, and open catalogs (Apache Polaris, Unity Catalog, Apache Gravitino). + +For more details, see the full [Lance format specification](https://lance.org/format). > [!TIP] -> Lance is in active development and we welcome contributions. Please see our [contributing guide](https://lancedb.github.io/lance/community/contributing) for more information. +> Lance is in active development and we welcome contributions. Please see our [contributing guide](https://lance.org/docs/community/contributing) for more information. ## Quick Start @@ -171,24 +177,6 @@ rs = [dataset.to_table(nearest={"column": "vector", "k": 10, "q": q}) | [java](./java) | Java bindings (JNI) | | [docs](./docs) | Documentation source | -## What makes Lance different - -Here we will highlight a few aspects of Lance’s design. For more details, see the full [Lance design document](https://lancedb.github.io/lance/format). - -**Vector index**: Vector index for similarity search over embedding space. -Support both CPUs (``x86_64`` and ``arm``) and GPU (``Nvidia (cuda)`` and ``Apple Silicon (mps)``). - -**Encodings**: To achieve both fast columnar scan and sub-linear point queries, Lance uses custom encodings and layouts. - -**Nested fields**: Lance stores each subfield as a separate column to support efficient filters like “find images where detected objects include cats”. - -**Versioning**: A Manifest can be used to record snapshots. Currently we support creating new versions automatically via appends, overwrites, and index creation. - -**Fast updates** (ROADMAP): Updates will be supported via write-ahead logs. - -**Rich secondary indices**: Support `BTree`, `Bitmap`, `Full text search`, `Label list`, -`NGrams`, and more. 
- ## Benchmarks ### Vector search @@ -209,9 +197,9 @@ We create a Lance dataset using the Oxford Pet dataset to do some preliminary pe ![](docs/src/images/lance_perf.png) -## Why are you building yet another data format?! +## Why Lance for AI/ML workflows? -The machine learning development cycle involves the steps: +The machine learning development cycle involves multiple stages: ```mermaid graph LR @@ -226,20 +214,16 @@ graph LR H --> A; ``` -People use different data representations to varying stages for the performance or limited by the tooling available. -Academia mainly uses XML / JSON for annotations and zipped images/sensors data for deep learning, which -is difficult to integrate into data infrastructure and slow to train over cloud storage. -While industry uses data lakes (Parquet-based techniques, i.e., Delta Lake, Iceberg) or data warehouses (AWS Redshift -or Google BigQuery) to collect and analyze data, they have to convert the data into training-friendly formats, such -as [Rikai](https://github.com/eto-ai/rikai)/[Petastorm](https://github.com/uber/petastorm) -or [TFRecord](https://www.tensorflow.org/tutorials/load_data/tfrecord). -Multiple single-purpose data transforms, as well as syncing copies between cloud storage to local training -instances have become a common practice. +Traditional lakehouse formats were designed for SQL analytics and struggle with AI/ML workloads that require: +- **Vector search** for similarity and semantic retrieval +- **Fast random access** for sampling and interactive exploration +- **Multimodal data** storage (images, videos, audio alongside embeddings) +- **Data evolution** for feature engineering without full table rewrites +- **Hybrid search** combining vectors, full-text, and SQL predicates -While each of the existing data formats excels at the workload it was originally designed for, we need a new data format -tailored for multistage ML development cycles to reduce and data silos. 
+While existing formats (Parquet, Iceberg, Delta Lake) excel at SQL analytics, they require additional specialized systems for AI capabilities. Lance brings these AI-first features directly into the lakehouse format. -A comparison of different data formats in each stage of ML development cycle. +A comparison of different formats across ML development stages: | | Lance | Parquet & ORC | JSON & XML | TFRecord | Database | Warehouse | |---------------------|-------|---------------|------------|----------|----------|-----------| @@ -249,20 +233,3 @@ A comparison of different data formats in each stage of ML development cycle. | Exploration | Fast | Slow | Fast | Slow | Fast | Decent | | Infra Support | Rich | Rich | Decent | Limited | Rich | Rich | -## Community Highlights - -Lance is currently used in production by: -* [LanceDB](https://github.com/lancedb/lancedb), a serverless, low-latency vector database for ML applications -* [LanceDB Enterprise](https://docs.lancedb.com/enterprise/introduction), hyperscale LanceDB with enterprise SLA. -* Leading multimodal Gen AI companies for training over petabyte-scale multimodal data. -* Self-driving car company for large-scale storage, retrieval and processing of multi-modal data. -* E-commerce company for billion-scale+ vector personalized search. -* and more. - -## Presentations, Blogs and Talks - -* [Designing a Table Format for ML Workloads](https://blog.lancedb.com/designing-a-table-format-for-ml-workloads/), Feb 2025. -* [Transforming Multimodal Data Management with LanceDB, Ray Summit](https://www.youtube.com/watch?v=xmTFEzAh8ho), Oct 2024. -* [Lance v2: A columnar container format for modern data](https://blog.lancedb.com/lance-v2/), Apr 2024. -* [Lance Deep Dive](https://drive.google.com/file/d/1Orh9rK0Mpj9zN_gnQF1eJJFpAc6lStGm/view?usp=drive_link). July 2023. 
-* [Lance: A New Columnar Data Format](https://docs.google.com/presentation/d/1a4nAiQAkPDBtOfXFpPg7lbeDAxcNDVKgoUkw3cUs2rE/edit#slide=id.p), [Scipy 2022, Austin, TX](https://www.scipy2022.scipy.org/posters). July, 2022. diff --git a/java/README.md b/java/README.md index 5aae11e9a8f..930bd16a020 100644 --- a/java/README.md +++ b/java/README.md @@ -1,20 +1,30 @@ -# Java bindings and SDK for Lance Data Format - -> :warning: **Under heavy development** +# Java bindings and SDK for Lance

Lance Logo -Lance is a new columnar data format for data science and machine learning +**The Open Lakehouse Format for Multimodal AI**

-Why you should use Lance -1. It is an order of magnitude faster than Parquet for point queries and nested data structures common to DS/ML -2. It comes with a fast vector index that delivers sub-millisecond nearest neighbor search performance -3. It is automatically versioned and supports lineage and time-travel for full reproducibility -4. It is integrated with duckdb/pandas/polars already. Easily convert from/to Parquet in 2 lines of code +Lance is an open lakehouse format for multimodal AI. It contains a file format, table format, and catalog spec that allows you to build a complete lakehouse on top of object storage to power your AI workflows. + +The key features of Lance include: + +* **Expressive hybrid search:** Combine vector similarity search, full-text search (BM25), and SQL analytics on the same dataset with accelerated secondary indices. + +* **Lightning-fast random access:** 100x faster than Parquet or Iceberg for random access without sacrificing scan performance. + +* **Native multimodal data support:** Store images, videos, audio, text, and embeddings in a single unified format with efficient blob encoding and lazy loading. + +* **Data evolution:** Efficiently add columns with backfilled values without full table rewrites, perfect for ML feature engineering. + +* **Zero-copy versioning:** ACID transactions, time travel, and automatic versioning without needing extra infrastructure. + +* **Rich ecosystem integrations:** Apache Arrow, Pandas, Polars, DuckDB, Apache Spark, Ray, Trino, Apache Flink, and open catalogs (Apache Polaris, Unity Catalog, Apache Gravitino). + +For more details, see the full [Lance format specification](https://lance.org/format). ## Quick start diff --git a/rust/lance/README.md b/rust/lance/README.md index 3767c3b7d1f..c36c5186a13 100644 --- a/rust/lance/README.md +++ b/rust/lance/README.md @@ -1,11 +1,11 @@ -# Rust Implementation of Lance Data Format +# Rust Implementation of Lance

Lance Logo -**A new columnar data format for data science and machine learning** +**The Open Lakehouse Format for Multimodal AI**

## Installation @@ -67,31 +67,22 @@ params.num_sub_vectors = 16; dataset.create_index(&["embeddings"], IndexType::Vector, None, &params, true).await; ``` -## Motivation +## What is Lance? -Why do we *need* a new format for data science and machine learning? +Lance is an open lakehouse format for multimodal AI. It contains a file format, table format, and catalog spec that allows you to build a complete lakehouse on top of object storage to power your AI workflows. -### 1. Reproducibility is a must-have +The key features of Lance include: -Versioning and experimentation support should be built into the dataset instead of requiring multiple tools.
-It should also be efficient and not require expensive copying everytime you want to create a new version.
-We call this "Zero copy versioning" in Lance. It makes versioning data easy without increasing storage costs. +* **Expressive hybrid search:** Combine vector similarity search, full-text search (BM25), and SQL analytics on the same dataset with accelerated secondary indices. -### 2. Cloud storage is now the default +* **Lightning-fast random access:** 100x faster than Parquet or Iceberg for random access without sacrificing scan performance. -Remote object storage is the default now for data science and machine learning and the performance characteristics of cloud are fundamentally different.
-Lance format is optimized to be cloud native. Common operations like filter-then-take can be order of magnitude faster -using Lance than Parquet, especially for ML data. +* **Native multimodal data support:** Store images, videos, audio, text, and embeddings in a single unified format with efficient blob encoding and lazy loading. -### 3. Vectors must be a first class citizen, not a separate thing +* **Data evolution:** Efficiently add columns with backfilled values without full table rewrites, perfect for ML feature engineering. -The majority of reasonable scale workflows should not require the added complexity and cost of a -specialized database just to compute vector similarity. Lance integrates optimized vector indices -into a columnar format so no additional infrastructure is required to get low latency top-K similarity search. +* **Zero-copy versioning:** ACID transactions, time travel, and automatic versioning without needing extra infrastructure. -### 4. Open standards is a requirement +* **Rich ecosystem integrations:** Apache Arrow, Pandas, Polars, DuckDB, Apache Spark, Ray, Trino, Apache Flink, and open catalogs (Apache Polaris, Unity Catalog, Apache Gravitino). -The DS/ML ecosystem is incredibly rich and data *must be* easily accessible across different languages, tools, and environments. -Lance makes Apache Arrow integration its primary interface, which means conversions to/from is 2 lines of code, your -code does not need to change after conversion, and nothing is locked-up to force you to pay for vendor compute. -We need open-source not fauxpen-source. +For more details, see the full [Lance format specification](https://lance.org/format).
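
The "zero-copy versioning" bullet in the feature list can be pictured with a toy model. The sketch below is not the Lance API or its on-disk layout; `ToyDataset`, `Manifest`, and the fragment ids are hypothetical names made up for illustration. It assumes only what the README text says: each version is recorded as a snapshot, and creating a new version should not require copying existing data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    """One snapshot: a version number plus the fragment ids it references."""
    version: int
    fragment_ids: tuple


class ToyDataset:
    """Hypothetical in-memory stand-in for a versioned table (not Lance itself)."""

    def __init__(self):
        self._fragments = {}                 # fragment id -> immutable tuple of rows
        self._manifests = [Manifest(0, ())]  # version 0 is the empty table

    def append(self, rows):
        """Store one new fragment and a new manifest that reuses older fragments."""
        frag_id = len(self._fragments)
        self._fragments[frag_id] = tuple(rows)
        latest = self._manifests[-1]
        self._manifests.append(
            Manifest(latest.version + 1, latest.fragment_ids + (frag_id,))
        )
        return self._manifests[-1].version

    def read(self, version=None):
        """Return the rows visible at `version` (the latest version when omitted)."""
        manifest = self._manifests[-1 if version is None else version]
        return [row for fid in manifest.fragment_ids for row in self._fragments[fid]]

    def fragment_count(self):
        """Number of fragments physically stored across all versions."""
        return len(self._fragments)
```

In this model, two appends produce two versions but only two stored fragments: the second manifest points at the first fragment rather than duplicating it, and reading an older version (time travel) just means following an older manifest.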