From 95cf8b6dbbead052dd70a556c9f44fb47b9fc6ee Mon Sep 17 00:00:00 2001
From: Xuanwo
Date: Tue, 3 Feb 2026 16:51:06 +0800
Subject: [PATCH 1/6] docs: add lance skills as user guide

---
 skills/README.md | 13 +
 skills/lance-user-guide/SKILL.md | 227 ++++++++++++++++++
 .../references/index-selection.md | 69 ++++++
 .../references/io-cheatsheet.md | 69 ++++++
 .../scripts/python_end_to_end.py | 79 ++++++
 5 files changed, 457 insertions(+)
 create mode 100644 skills/README.md
 create mode 100644 skills/lance-user-guide/SKILL.md
 create mode 100644 skills/lance-user-guide/references/index-selection.md
 create mode 100644 skills/lance-user-guide/references/io-cheatsheet.md
 create mode 100644 skills/lance-user-guide/scripts/python_end_to_end.py

diff --git a/skills/README.md b/skills/README.md
new file mode 100644
index 00000000000..3bc81d019f8
--- /dev/null
+++ b/skills/README.md
@@ -0,0 +1,13 @@
+# Skills
+
+This directory contains code agent skills for the Lance project.
+
+Each skill is a folder that contains a required `SKILL.md` (with YAML frontmatter) and optional `scripts/`, `references/`, and `assets/`.
+
+## Install
+
+```bash
+npx skills add lance-format/lance
+```
+
+Restart code agents after installing.
diff --git a/skills/lance-user-guide/SKILL.md b/skills/lance-user-guide/SKILL.md
new file mode 100644
index 00000000000..3855ae86467
--- /dev/null
+++ b/skills/lance-user-guide/SKILL.md
@@ -0,0 +1,227 @@
+---
+name: lance-user-guide
+description: Guide Code Agents to help Lance users write/read datasets and build/choose indices. Use when a user asks how to use Lance (Python/Rust/CLI), how to write_dataset/open/scan, how to build vector indexes (IVF_PQ, IVF_HNSW_*), how to build scalar indexes (BTREE, BITMAP, INVERTED, FTS, etc.), how to combine filters with vector search, or how to debug indexing and scan performance.
+---
+
+# Lance User Guide
+
+## Scope
+
+Use this skill to answer questions about:
+
+- Writing datasets (create/append/overwrite) and reading/scanning datasets
+- Vector search (nearest-neighbor queries) and vector index creation/tuning
+- Scalar index creation and choosing a scalar index type for a filter workload
+- Combining filters (metadata predicates) with vector search
+
+Do not use this skill for:
+
+- Contributing to Lance itself (repo development, internal architecture)
+- File format internals beyond what is required to use the API correctly
+
+## Installation (quick)
+
+Python:
+
+```bash
+pip install pylance
+```
+
+Verify:
+
+```bash
+python -c "import lance; print(lance.__version__)"
+```
+
+Rust:
+
+```bash
+cargo add lance
+```
+
+Or add it to `Cargo.toml` (choose an appropriate version for your project):
+
+```toml
+[dependencies]
+lance = "x.y"
+```
+
+From source (this repository):
+
+```bash
+maturin develop -m python/Cargo.toml
+```
+
+## Minimal intake (ask only what you need)
+
+Collect the minimum information required to avoid wrong guidance:
+
+- Language/API surface: Python / Rust / CLI
+- Storage: local filesystem / S3 / other object store
+- Workload: scan-only / filter-heavy / vector search / hybrid (vector + filter)
+- Vector details (if applicable): dimension, metric (L2/cosine/dot), latency target, recall target
+- Update pattern: mostly append / frequent overwrite / frequent deletes/updates
+- Data scale: approximate row count and whether there are many small files
+
+If the user does not specify a language, default to Python examples and provide a short mapping to Rust concepts.
+
+## Workflow decision tree
+
+1. If the question is "How do I write or update data?": use the **Write** playbook.
+2. If the question is "How do I read / scan / filter?": use the **Read** playbook.
+3. If the question is "How do I do kNN / vector search?": use the **Vector search** playbook.
+4. If the question is "Which index should I use?": consult `references/index-selection.md` and confirm constraints.
+5. If the question is "Why is this slow / why are results missing?": use **Troubleshooting** and ask for a minimal reproduction.
+
+## Primary playbooks (Python)
+
+### Write
+
+Prefer `lance.write_dataset` for most user workflows.
+
+```python
+import lance
+import pyarrow as pa
+
+vectors = pa.array(
+    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
+    type=pa.list_(pa.float32(), 3),
+)
+table = pa.table({"id": [1, 2], "vector": vectors, "category": ["a", "b"]})
+
+ds = lance.write_dataset(table, "my-data.lance", mode="create")
+ds = lance.write_dataset(table, "my-data.lance", mode="append")
+ds = lance.write_dataset(table, "my-data.lance", mode="overwrite")
+```
+
+Validation checklist:
+
+- Re-open and count rows: `lance.dataset(uri).count_rows()`
+- Confirm schema: `lance.dataset(uri).schema`
+
+Notes:
+
+- Use `storage_options={...}` when writing to an object store URI.
+- If the user mentions non-atomic object stores, mention `commit_lock` and point them to the user guide.
+
+### Read
+
+Use `lance.dataset` + `scanner(...)` for pushdowns (projection, filter, limit, nearest).
+
+```python
+import lance
+
+ds = lance.dataset("my-data.lance")
+tbl = ds.scanner(
+    columns=["id", "category"],
+    filter="category = 'a' and id >= 10",
+    limit=100,
+).to_table()
+```
+
+Validation checklist:
+
+- If performance is the concern, ask for a minimal `scanner(...)` call that reproduces it.
+- If correctness is the concern, ask for the exact `filter` string and whether `prefilter` is enabled (when using `nearest`).
+
+### Vector search (nearest)
+
+Run vector search with `scanner(nearest=...)` or `to_table(nearest=...)`.
+
+```python
+import lance
+import numpy as np
+
+ds = lance.dataset("my-data.lance")
+q = np.array([1.0, 2.0, 3.0], dtype=np.float32)
+tbl = ds.to_table(nearest={"column": "vector", "q": q, "k": 10})
+```
+
+If combining a filter with vector search, decide whether the filter must run before the vector query:
+
+- Use `prefilter=True` when the filter is highly selective and correctness (top-k among filtered rows) matters.
+- Use `prefilter=False` when the filter is not very selective and speed matters, and accept that results can be fewer than `k`.
+
+```python
+tbl = ds.scanner(
+    nearest={"column": "vector", "q": q, "k": 10},
+    filter="category = 'a'",
+    prefilter=True,
+).to_table()
+```
+
+### Build a vector index
+
+Create a vector index with `LanceDataset.create_index(...)`.
+
+Start with a minimal working configuration:
+
+```python
+ds = lance.dataset("my-data.lance")
+ds = ds.create_index(
+    "vector",
+    index_type="IVF_PQ",
+    num_partitions=256,
+    num_sub_vectors=16,
+)
+```
+
+Then verify:
+
+- `ds.describe_indices()` (preferred) or `ds.list_indices()` (can be expensive)
+- A small `nearest` query that uses the index
+
+For parameter selection and tuning, consult `references/index-selection.md`.
+
+### Build a scalar index
+
+Scalar indices speed up scans with filters. Use `create_scalar_index` for a stable entry point.
+
+```python
+ds = lance.dataset("my-data.lance")
+ds.create_scalar_index("category", "BTREE", replace=True)
+```
+
+Then verify:
+
+- `ds.describe_indices()`
+- A representative `scanner(filter=...)` query
+
+To choose a scalar index type (BTREE vs BITMAP vs INVERTED/FTS/NGRAM, etc.), consult `references/index-selection.md`.
+
+## Troubleshooting patterns
+
+### "Vector search + filter returns fewer than k rows"
+
+- Explain the difference between post-filtering and pre-filtering.
+- Suggest `prefilter=True` if the user expects top-k among filtered rows.
+
+### "Index creation is slow"
+
+- Confirm vector dimension and `num_sub_vectors`.
+- For IVF_PQ, call out the common pitfall: avoid misaligned `dimension / num_sub_vectors` (see `references/index-selection.md`).
+
+### "Scan is slow even with a scalar index"
+
+- Ask whether the filter is compatible with the index (equality vs range vs text search).
+- Suggest checking whether scalar index usage is disabled (`use_scalar_index=False`).
+
+## Local verification (when a repo checkout is available)
+
+When answering API questions, confirm the exact signature and docstrings locally:
+
+- Python I/O entry points: `python/python/lance/dataset.py` (`write_dataset`, `LanceDataset.scanner`)
+- Vector indexing: `python/python/lance/dataset.py` (`create_index`)
+- Scalar indexing: `python/python/lance/dataset.py` (`create_scalar_index`)
+
+Use targeted search:
+
+```bash
+rg -n "def write_dataset\\b|def create_index\\b|def create_scalar_index\\b|def scanner\\b" python/python/lance/dataset.py
+```
+
+## Bundled resources
+
+- Index selection and tuning: `references/index-selection.md`
+- I/O and versioning cheat sheet: `references/io-cheatsheet.md`
+- Runnable minimal example: `scripts/python_end_to_end.py`
diff --git a/skills/lance-user-guide/references/index-selection.md b/skills/lance-user-guide/references/index-selection.md
new file mode 100644
index 00000000000..aee43816641
--- /dev/null
+++ b/skills/lance-user-guide/references/index-selection.md
@@ -0,0 +1,69 @@
+## Index selection (quick)
+
+Use this file when the user asks "which index should I use" or "how do I tune it".
+
+Always confirm:
+
+- The query pattern (filter-only, vector-only, hybrid)
+- Data scale (rows, vector dimension)
+- Update pattern (append vs frequent updates/deletes)
+- Correctness needs (must return top-k within a filtered subset vs best-effort)
+
+## Decision table
+
+| Workload | Recommended starting point | Notes |
+| --- | --- | --- |
+| Filter-only scans (`scanner(filter=...)`) | Create a scalar index on the filtered column | Choose scalar index type based on predicate shape and cardinality |
+| Vector search only (`nearest=...`) on large data | Build a vector index | Start with `IVF_PQ` if you need compression; tune `nprobes` / `refine_factor` |
+| Vector search + selective filter | Scalar index for filter + vector index for search | Use `prefilter=True` when you need true top-k among filtered rows |
+| Vector search + non-selective filter | Vector index only (or scalar index optional) | Consider `prefilter=False` for speed; accept fewer than k results |
+| Text search | Create a text-oriented scalar index | Use `full_text_query=...` when available; verify the supported index type in the current Lance version |
+
+## Vector index types (user-facing summary)
+
+Vector index names typically follow a pattern like `{clustering}_{sub_index}_{quantization}`.
+
+Common combinations:
+
+- `IVF_PQ`: IVF clustering + PQ compression
+- `IVF_HNSW_SQ`: IVF clustering + HNSW + SQ
+- `IVF_SQ`: IVF clustering + SQ
+- `IVF_RQ`: IVF clustering + RQ
+- `IVF_FLAT`: IVF clustering + no quantization (exact vectors within clusters)
+
+If you are unsure which types are supported in the user's environment, recommend starting with `IVF_PQ` and fall back to "try and see" (the API will error on unsupported types).
+
+## Vector index creation defaults
+
+Start with:
+
+- `index_type="IVF_PQ"`
+- `num_partitions`: 64 to 1024 (higher for larger datasets)
+- `num_sub_vectors`: choose a value that divides the vector dimension
+
+Practical warning (performance):
+
+- Avoid misalignment: `(dimension / num_sub_vectors) % 8 == 0` is a common sweet spot for faster index creation.
+
+## Vector search tuning defaults
+
+Tune recall vs latency with:
+
+- `nprobes`: how many IVF partitions to search
+- `refine_factor`: how many candidates to re-rank to improve accuracy
+
+When a user reports "too slow" or "bad recall", ask for:
+
+- Current `nprobes`, `refine_factor`, and index type
+- Whether the query is using `prefilter`
+
+## Scalar index selection (starting guidance)
+
+Choose scalar index type based on the filter expression:
+
+- Equality filters on high-cardinality columns: start with `BTREE`
+- Equality / IN-list filters on low-cardinality columns: start with `BITMAP`
+- Text search: start with `FTS` (or other text index types supported by the version)
+- Range filters: start with range-friendly options (for example `ZONEMAP` when appropriate)
+
+If you cannot confidently map the filter to an index type, recommend `BTREE` as a safe baseline and confirm via a small benchmark on representative queries.
diff --git a/skills/lance-user-guide/references/io-cheatsheet.md b/skills/lance-user-guide/references/io-cheatsheet.md
new file mode 100644
index 00000000000..acb34ac233a
--- /dev/null
+++ b/skills/lance-user-guide/references/io-cheatsheet.md
@@ -0,0 +1,69 @@
+## I/O cheat sheet (Python)
+
+Use this file when the user asks how to write/read Lance datasets, manage versions, or work with object stores.
+
+## Write a dataset
+
+Use `lance.write_dataset(data, uri, mode=...)`.
+
+Modes:
+
+- `mode="create"`: create new dataset (error if exists)
+- `mode="overwrite"`: create a new version that replaces the latest snapshot
+- `mode="append"`: append data as a new version (or create if missing)
+
+Inputs:
+
+- `pyarrow.Table`
+- `pyarrow.RecordBatchReader`
+- pandas DataFrame
+- other reader-like sources supported by the installed Lance version
+
+## Open a dataset
+
+Use `lance.dataset(uri, version=..., asof=..., storage_options=...)`.
+
+Notes:
+
+- `version` can be a number or a tag (depending on the environment/version).
+- Use `storage_options` for object stores (credentials, endpoint, etc.).
+
+## Read / scan
+
+Use `ds.scanner(...)` for pushdowns:
+
+- `columns=[...]` for projection
+- `filter="..."` for predicate pushdown
+- `limit=...` for limit pushdown
+- `nearest={...}` for vector search
+- `prefilter=True/False` to control filter ordering when combined with `nearest`
+- `use_scalar_index=True/False` to control scalar index usage
+
+Then materialize:
+
+- `scanner(...).to_table()`
+- `scanner(...).to_batches()`
+
+## Hybrid query: vector + filter
+
+Use a scalar index for the filter column when the filter is selective and you set `prefilter=True`.
+
+Example:
+
+```python
+tbl = ds.scanner(
+    nearest={"column": "vector", "q": q, "k": 10},
+    filter="category = 'a'",
+    prefilter=True,
+).to_table()
+```
+
+## Inspect indices
+
+Prefer:
+
+- `ds.describe_indices()`
+
+Use with care:
+
+- `ds.list_indices()` can be expensive because it may load index statistics.
diff --git a/skills/lance-user-guide/scripts/python_end_to_end.py b/skills/lance-user-guide/scripts/python_end_to_end.py
new file mode 100644
index 00000000000..0d7e70aa6ed
--- /dev/null
+++ b/skills/lance-user-guide/scripts/python_end_to_end.py
@@ -0,0 +1,79 @@
+#!/usr/bin/env python3
+
+from __future__ import annotations
+
+import argparse
+from pathlib import Path
+
+import numpy as np
+import pyarrow as pa
+
+import lance
+
+
+def _build_fixed_size_vectors(num_rows: int, dim: int) -> tuple[pa.FixedSizeListArray, np.ndarray]:
+    vectors = np.random.rand(num_rows, dim).astype("float32")
+    flat = pa.array(vectors.reshape(-1), type=pa.float32())
+    return pa.FixedSizeListArray.from_arrays(flat, dim), vectors
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Minimal Lance write/index/query example")
+    parser.add_argument("--uri", default="example.lance", help="Dataset URI (directory)")
+    parser.add_argument("--mode", default="overwrite", choices=["create", "append", "overwrite"])
+    parser.add_argument("--rows", type=int, default=1000)
+    parser.add_argument("--dim", type=int, default=32)
+
+    parser.add_argument("--build-scalar-index", action="store_true")
+    parser.add_argument("--build-vector-index", action="store_true")
+
+    parser.add_argument("--vector-index-type", default="IVF_PQ")
+    parser.add_argument("--num-partitions", type=int, default=64)
+    parser.add_argument("--num-sub-vectors", type=int, default=8)
+
+    parser.add_argument("--k", type=int, default=10)
+    parser.add_argument("--filter", default="category = 'a'")
+    parser.add_argument("--prefilter", action="store_true")
+
+    args = parser.parse_args()
+
+    uri = str(Path(args.uri))
+    vec_arr, vec_np = _build_fixed_size_vectors(args.rows, args.dim)
+    categories = pa.array(["a" if i % 2 == 0 else "b" for i in range(args.rows)])
+    table = pa.table({"id": pa.array(range(args.rows), pa.int64()), "category": categories, "vector": vec_arr})
+
+    ds = lance.write_dataset(table, uri, mode=args.mode)
+    ds = lance.dataset(uri)
+
+    if args.build_scalar_index:
+        ds.create_scalar_index("category", "BTREE", replace=True)
+
+    if args.build_vector_index:
+        ds = ds.create_index(
+            "vector",
+            index_type=args.vector_index_type,
+            num_partitions=args.num_partitions,
+            num_sub_vectors=args.num_sub_vectors,
+        )
+
+    print(f"uri={ds.uri}")
+    print(f"rows={ds.count_rows()}")
+    print("indices=")
+    for idx in ds.describe_indices():
+        print(f" - {idx}")
+
+    q = vec_np[0]
+    scan = ds.scanner(
+        nearest={"column": "vector", "q": q, "k": args.k},
+        filter=args.filter if args.filter else None,
+        prefilter=args.prefilter,
+    )
+    result = scan.to_table()
+    print("result_schema=")
+    print(result.schema)
+    print("result_preview=")
+    print(result.slice(0, 5).to_pydict())
+
+
+if __name__ == "__main__":
+    main()

From cd00ef67d85deb6abcfbf55fe41653f5f69db37e Mon Sep 17 00:00:00 2001
From: Xuanwo
Date: Tue, 3 Feb 2026 16:57:43 +0800
Subject: [PATCH 2/6] docs(skills): use target_partition_size for vector index

---
 skills/lance-user-guide/SKILL.md | 2 +-
 skills/lance-user-guide/references/index-selection.md | 2 +-
 skills/lance-user-guide/scripts/python_end_to_end.py | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/skills/lance-user-guide/SKILL.md b/skills/lance-user-guide/SKILL.md
index 3855ae86467..d1bf7d38128 100644
--- a/skills/lance-user-guide/SKILL.md
+++ b/skills/lance-user-guide/SKILL.md
@@ -161,7 +161,7 @@ ds = lance.dataset("my-data.lance")
 ds = ds.create_index(
     "vector",
     index_type="IVF_PQ",
-    num_partitions=256,
+    target_partition_size=8192,
     num_sub_vectors=16,
 )
 ```
diff --git a/skills/lance-user-guide/references/index-selection.md b/skills/lance-user-guide/references/index-selection.md
index aee43816641..b225d333b1d 100644
--- a/skills/lance-user-guide/references/index-selection.md
+++ b/skills/lance-user-guide/references/index-selection.md
@@ -38,7 +38,7 @@ If you are unsure which types are supported in the user's environment, recommend
 Start with:
 
 - `index_type="IVF_PQ"`
-- `num_partitions`: 64 to 1024 (higher for larger datasets)
+- `target_partition_size`: start with 8192 and adjust based on the dataset size and latency/recall needs
 - `num_sub_vectors`: choose a value that divides the vector dimension
 
 Practical warning (performance):
diff --git a/skills/lance-user-guide/scripts/python_end_to_end.py b/skills/lance-user-guide/scripts/python_end_to_end.py
index 0d7e70aa6ed..ec2d02713c9 100644
--- a/skills/lance-user-guide/scripts/python_end_to_end.py
+++ b/skills/lance-user-guide/scripts/python_end_to_end.py
@@ -28,7 +28,7 @@ def main() -> None:
     parser.add_argument("--build-vector-index", action="store_true")
 
     parser.add_argument("--vector-index-type", default="IVF_PQ")
-    parser.add_argument("--num-partitions", type=int, default=64)
+    parser.add_argument("--target-partition-size", type=int, default=8192)
     parser.add_argument("--num-sub-vectors", type=int, default=8)
 
     parser.add_argument("--k", type=int, default=10)
@@ -52,7 +52,7 @@ def main() -> None:
         ds = ds.create_index(
             "vector",
             index_type=args.vector_index_type,
-            num_partitions=args.num_partitions,
+            target_partition_size=args.target_partition_size,
             num_sub_vectors=args.num_sub_vectors,
         )
 

From fb7da11233f660da6a209928c7f733a748b82640 Mon Sep 17 00:00:00 2001
From: Xuanwo
Date: Thu, 5 Feb 2026 17:30:10 +0800
Subject: [PATCH 3/6] docs(skills): clarify installation and compatibility

---
 skills/README.md | 43 +++++++++++++++++--
 skills/lance-user-guide/SKILL.md | 4 ++
 .../scripts/python_end_to_end.py | 5 +++
 3 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/skills/README.md b/skills/README.md
index 3bc81d019f8..16ca9e508bd 100644
--- a/skills/README.md
+++ b/skills/README.md
@@ -1,13 +1,48 @@
 # Skills
 
-This directory contains code agent skills for the Lance project.
+This directory contains Codex-compatible skills for the Lance project.
 
 Each skill is a folder that contains a required `SKILL.md` (with YAML frontmatter) and optional `scripts/`, `references/`, and `assets/`.
 
-## Install
+## Install (npx skills)
+
+If you use `skills.sh`, install from GitHub:
+
+```bash
+npx skills add lance-format/lance --skill lance-user-guide
+```
+
+Install globally (user-level):
+
+```bash
+npx skills add lance-format/lance --skill lance-user-guide -g
+```
+
+List available skills in this repository:
+
+```bash
+npx skills add lance-format/lance --list
+```
+
+## Install (manual copy)
+
+Codex typically loads skills from:
+
+- Project: `.codex/skills//`
+- Global: `~/.codex/skills//`
+
+Install into the current repository:
+
+```bash
+mkdir -p .codex/skills
+cp -R skills/lance-user-guide .codex/skills/
+```
+
+Install globally:
 
 ```bash
-npx skills add lance-format/lance
+mkdir -p ~/.codex/skills
+cp -R skills/lance-user-guide ~/.codex/skills/
 ```
 
-Restart code agents after installing.
+Restart Codex after installing or updating skills.
diff --git a/skills/lance-user-guide/SKILL.md b/skills/lance-user-guide/SKILL.md
index d1bf7d38128..e227a2d83f2 100644
--- a/skills/lance-user-guide/SKILL.md
+++ b/skills/lance-user-guide/SKILL.md
@@ -173,6 +173,10 @@ Then verify:
 
 For parameter selection and tuning, consult `references/index-selection.md`.
 
+Compatibility note:
+
+- `target_partition_size` is preferred for new code. If your installed Lance Python SDK does not support it, fall back to `num_partitions` (deprecated).
+
 ### Build a scalar index
 
 Scalar indices speed up scans with filters. Use `create_scalar_index` for a stable entry point.
diff --git a/skills/lance-user-guide/scripts/python_end_to_end.py b/skills/lance-user-guide/scripts/python_end_to_end.py
index ec2d02713c9..e2bc07654ee 100644
--- a/skills/lance-user-guide/scripts/python_end_to_end.py
+++ b/skills/lance-user-guide/scripts/python_end_to_end.py
@@ -37,6 +37,11 @@ def main() -> None:
 
     args = parser.parse_args()
 
+    if args.num_sub_vectors <= 0:
+        raise ValueError("--num-sub-vectors must be positive")
+    if args.dim % args.num_sub_vectors != 0:
+        raise ValueError("--dim must be divisible by --num-sub-vectors")
+
     uri = str(Path(args.uri))
     vec_arr, vec_np = _build_fixed_size_vectors(args.rows, args.dim)
     categories = pa.array(["a" if i % 2 == 0 else "b" for i in range(args.rows)])

From 7edcc66be986945a58f20c4707155de112abfae2 Mon Sep 17 00:00:00 2001
From: Xuanwo
Date: Thu, 5 Feb 2026 17:32:40 +0800
Subject: [PATCH 4/6] Revert "docs(skills): clarify installation and compatibility"

This reverts commit fb7da11233f660da6a209928c7f733a748b82640.
---
 skills/README.md | 43 ++-----------------
 skills/lance-user-guide/SKILL.md | 4 --
 .../scripts/python_end_to_end.py | 5 ---
 3 files changed, 4 insertions(+), 48 deletions(-)

diff --git a/skills/README.md b/skills/README.md
index 16ca9e508bd..3bc81d019f8 100644
--- a/skills/README.md
+++ b/skills/README.md
@@ -1,48 +1,13 @@
 # Skills
 
-This directory contains Codex-compatible skills for the Lance project.
+This directory contains code agent skills for the Lance project.
 
 Each skill is a folder that contains a required `SKILL.md` (with YAML frontmatter) and optional `scripts/`, `references/`, and `assets/`.
 
-## Install (npx skills)
-
-If you use `skills.sh`, install from GitHub:
-
-```bash
-npx skills add lance-format/lance --skill lance-user-guide
-```
-
-Install globally (user-level):
-
-```bash
-npx skills add lance-format/lance --skill lance-user-guide -g
-```
-
-List available skills in this repository:
-
-```bash
-npx skills add lance-format/lance --list
-```
-
-## Install (manual copy)
-
-Codex typically loads skills from:
-
-- Project: `.codex/skills//`
-- Global: `~/.codex/skills//`
-
-Install into the current repository:
-
-```bash
-mkdir -p .codex/skills
-cp -R skills/lance-user-guide .codex/skills/
-```
-
-Install globally:
+## Install
 
 ```bash
-mkdir -p ~/.codex/skills
-cp -R skills/lance-user-guide ~/.codex/skills/
+npx skills add lance-format/lance
 ```
 
-Restart Codex after installing or updating skills.
+Restart code agents after installing.
diff --git a/skills/lance-user-guide/SKILL.md b/skills/lance-user-guide/SKILL.md
index e227a2d83f2..d1bf7d38128 100644
--- a/skills/lance-user-guide/SKILL.md
+++ b/skills/lance-user-guide/SKILL.md
@@ -173,10 +173,6 @@ Then verify:
 
 For parameter selection and tuning, consult `references/index-selection.md`.
 
-Compatibility note:
-
-- `target_partition_size` is preferred for new code. If your installed Lance Python SDK does not support it, fall back to `num_partitions` (deprecated).
-
 ### Build a scalar index
 
 Scalar indices speed up scans with filters. Use `create_scalar_index` for a stable entry point.
diff --git a/skills/lance-user-guide/scripts/python_end_to_end.py b/skills/lance-user-guide/scripts/python_end_to_end.py
index e2bc07654ee..ec2d02713c9 100644
--- a/skills/lance-user-guide/scripts/python_end_to_end.py
+++ b/skills/lance-user-guide/scripts/python_end_to_end.py
@@ -37,11 +37,6 @@ def main() -> None:
 
     args = parser.parse_args()
 
-    if args.num_sub_vectors <= 0:
-        raise ValueError("--num-sub-vectors must be positive")
-    if args.dim % args.num_sub_vectors != 0:
-        raise ValueError("--dim must be divisible by --num-sub-vectors")
-
     uri = str(Path(args.uri))
     vec_arr, vec_np = _build_fixed_size_vectors(args.rows, args.dim)
     categories = pa.array(["a" if i % 2 == 0 else "b" for i in range(args.rows)])

From b3ee727c4fc66c6009c5c2e95051c7e9876ffd56 Mon Sep 17 00:00:00 2001
From: Xuanwo
Date: Thu, 5 Feb 2026 17:45:13 +0800
Subject: [PATCH 5/6] Address comments

---
 skills/lance-user-guide/SKILL.md | 4 ++--
 .../references/index-selection.md | 23 +++++++++++++++++--
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/skills/lance-user-guide/SKILL.md b/skills/lance-user-guide/SKILL.md
index d1bf7d38128..4bf7eb515c5 100644
--- a/skills/lance-user-guide/SKILL.md
+++ b/skills/lance-user-guide/SKILL.md
@@ -1,6 +1,6 @@
 ---
 name: lance-user-guide
-description: Guide Code Agents to help Lance users write/read datasets and build/choose indices. Use when a user asks how to use Lance (Python/Rust/CLI), how to write_dataset/open/scan, how to build vector indexes (IVF_PQ, IVF_HNSW_*), how to build scalar indexes (BTREE, BITMAP, INVERTED, FTS, etc.), how to combine filters with vector search, or how to debug indexing and scan performance.
+description: Guide Code Agents to help Lance users write/read datasets and build/choose indices. Use when a user asks how to use Lance (Python/Rust/CLI), how to write_dataset/open/scan, how to build vector indexes (IVF_PQ, IVF_HNSW_*), how to build scalar indexes (BTREE, BITMAP, LABEL_LIST, NGRAM, INVERTED, BLOOMFILTER, RTREE, etc.), how to combine filters with vector search, or how to debug indexing and scan performance.
 ---
 
 # Lance User Guide
@@ -187,7 +187,7 @@ Then verify:
 - `ds.describe_indices()`
 - A representative `scanner(filter=...)` query
 
-To choose a scalar index type (BTREE vs BITMAP vs INVERTED/FTS/NGRAM, etc.), consult `references/index-selection.md`.
+To choose a scalar index type (BTREE vs BITMAP vs LABEL_LIST vs NGRAM vs INVERTED, etc.), consult `references/index-selection.md`.
 
 ## Troubleshooting patterns
 
diff --git a/skills/lance-user-guide/references/index-selection.md b/skills/lance-user-guide/references/index-selection.md
index b225d333b1d..7f6d926c138 100644
--- a/skills/lance-user-guide/references/index-selection.md
+++ b/skills/lance-user-guide/references/index-selection.md
@@ -17,7 +17,7 @@ Always confirm:
 | Vector search only (`nearest=...`) on large data | Build a vector index | Start with `IVF_PQ` if you need compression; tune `nprobes` / `refine_factor` |
 | Vector search + selective filter | Scalar index for filter + vector index for search | Use `prefilter=True` when you need true top-k among filtered rows |
 | Vector search + non-selective filter | Vector index only (or scalar index optional) | Consider `prefilter=False` for speed; accept fewer than k results |
-| Text search | Create a text-oriented scalar index | Use `full_text_query=...` when available; verify the supported index type in the current Lance version |
+| Text search | Create an `INVERTED` scalar index | Use `full_text_query=...` when available; note that `FTS` is not a universal alias in all SDK versions |
 
 ## Vector index types (user-facing summary)
 
@@ -63,7 +63,26 @@ Choose scalar index type based on the filter expression:
 
 - Equality filters on high-cardinality columns: start with `BTREE`
 - Equality / IN-list filters on low-cardinality columns: start with `BITMAP`
-- Text search: start with `FTS` (or other text index types supported by the version)
+- List membership filters on list-like columns: start with `LABEL_LIST`
+- Substring / `contains(...)` filters on strings: start with `NGRAM`
+- Text search: start with `INVERTED`
 - Range filters: start with range-friendly options (for example `ZONEMAP` when appropriate)
+- Highly selective negative membership / presence checks: consider `BLOOMFILTER` (inexact)
+- Geospatial queries (if present in your build): use `RTREE`
+
+## JSON fields
+
+Lance scalar indices are created on physical columns. If you want to index a JSON sub-field:
+
+1. Materialize the extracted value into a new column (for example with `add_columns`)
+2. Create a scalar index on that new column
+
+Example (Python, using SQL expressions):
+
+```python
+ds = lance.dataset(uri)
+ds.add_columns({"country": "json_extract(payload, '$.country')"})
+ds.create_scalar_index("country", "BTREE", replace=True)
+```
 
 If you cannot confidently map the filter to an index type, recommend `BTREE` as a safe baseline and confirm via a small benchmark on representative queries.
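
Review note on the scalar index guidance added in this patch: the bullet list in `references/index-selection.md` amounts to a lookup from predicate shape to a starting index type. A minimal sketch of that lookup follows. The helper `suggest_scalar_index` and the predicate-shape labels are hypothetical and not part of the Lance API; only the index type names come from the patch.

```python
# Hypothetical helper: not part of Lance. Index type names are taken from
# references/index-selection.md; the predicate-shape keys are invented labels.
SCALAR_INDEX_BY_PREDICATE = {
    "equality_high_cardinality": "BTREE",
    "equality_low_cardinality": "BITMAP",
    "list_membership": "LABEL_LIST",
    "substring": "NGRAM",
    "full_text": "INVERTED",
    "range": "ZONEMAP",
    "negative_membership": "BLOOMFILTER",
    "geospatial": "RTREE",
}


def suggest_scalar_index(predicate_shape: str) -> str:
    """Return a starting scalar index type, falling back to BTREE."""
    return SCALAR_INDEX_BY_PREDICATE.get(predicate_shape, "BTREE")


if __name__ == "__main__":
    for shape in ("substring", "full_text", "something_else"):
        print(shape, "->", suggest_scalar_index(shape))
```

The `BTREE` fallback mirrors the "safe baseline" sentence at the end of the file; any suggestion from a table like this would still need the small benchmark the guide recommends.
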
From 494065fdfa44ae581b46475d2100875cffacc0b0 Mon Sep 17 00:00:00 2001
From: Xuanwo
Date: Mon, 23 Feb 2026 17:05:49 +0800
Subject: [PATCH 6/6] Update skills/lance-user-guide/references/index-selection.md

Co-authored-by: Prashanth Rao <35005448+prrao87@users.noreply.github.com>
---
 skills/lance-user-guide/references/index-selection.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/skills/lance-user-guide/references/index-selection.md b/skills/lance-user-guide/references/index-selection.md
index 7f6d926c138..f83764f1a67 100644
--- a/skills/lance-user-guide/references/index-selection.md
+++ b/skills/lance-user-guide/references/index-selection.md
@@ -65,7 +65,7 @@ Choose scalar index type based on the filter expression:
 - Equality / IN-list filters on low-cardinality columns: start with `BITMAP`
 - List membership filters on list-like columns: start with `LABEL_LIST`
 - Substring / `contains(...)` filters on strings: start with `NGRAM`
-- Full-text search (FTS): start with `INVERTED`
+- Text search: start with `INVERTED`
+- Full-text search (FTS): start with `INVERTED`
 - Range filters: start with range-friendly options (for example `ZONEMAP` when appropriate)
 - Highly selective negative membership / presence checks: consider `BLOOMFILTER` (inexact)
 - Geospatial queries (if present in your build): use `RTREE`
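
Review note on the series as a whole: `references/index-selection.md` asks for a `num_sub_vectors` that divides the vector dimension and names `(dimension / num_sub_vectors) % 8 == 0` as a sweet spot. A minimal sketch of that arithmetic follows. The helper `candidate_num_sub_vectors` is hypothetical and not part of Lance; it only encodes those two rules of thumb.

```python
# Hypothetical helper: not part of Lance. It only encodes the two rules of
# thumb stated in references/index-selection.md.
def candidate_num_sub_vectors(dimension: int) -> list[int]:
    """List values that divide `dimension` and leave sub-vector lengths divisible by 8."""
    if dimension <= 0:
        raise ValueError("dimension must be positive")
    return [
        m
        for m in range(1, dimension + 1)
        if dimension % m == 0 and (dimension // m) % 8 == 0
    ]


if __name__ == "__main__":
    print(candidate_num_sub_vectors(128))
    print(candidate_num_sub_vectors(32))
```

By this rule the defaults in `scripts/python_end_to_end.py` (`--dim 32`, `--num-sub-vectors 8`) give sub-vectors of length 4, which falls outside the stated sweet spot. Whether that matters for a 1000-row demo is a judgment call.
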