
Fix crash in FilteredReadStream on task cancellation#17

Closed
emilk wants to merge 217 commits into release-3.0.0 from emilk/fix-unwrap

Conversation


@emilk emilk commented Apr 16, 2026

Summary

  • Replace .unwrap() with ? in FilteredReadStream to propagate thread join errors instead of panicking
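The change follows the standard pattern of mapping a thread-join failure into the stream's error type instead of panicking. A minimal self-contained sketch — the `Error` type and `join_worker` helper are illustrative, not the actual `FilteredReadStream` code:

```rust
use std::thread;

// Illustrative error type standing in for the crate's own Error.
#[derive(Debug)]
struct Error(String);

// Before: handle.join().unwrap() panics if the worker thread itself
// panicked (e.g. when the task is cancelled mid-read).
// After: map the join error into the stream's error type and propagate.
fn join_worker(handle: thread::JoinHandle<Result<u64, Error>>) -> Result<u64, Error> {
    handle
        .join()
        .map_err(|_| Error("worker thread panicked".to_string()))?
}

fn main() {
    let ok = thread::spawn(|| Ok::<u64, Error>(42));
    assert_eq!(join_worker(ok).unwrap(), 42);

    let panicking = thread::spawn(|| -> Result<u64, Error> { panic!("cancelled") });
    // The join error is now propagated as a Result instead of a crash.
    assert!(join_worker(panicking).is_err());
}
```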

Test plan

  • Existing tests pass — this is a pure error-handling improvement with no behavioral change

🤖 Generated with Claude Code

Xuanwo and others added 30 commits March 6, 2026 00:47
All rules are derived from analysis of ~1000 PR reviews to capture
recurring review patterns as actionable guidelines.

The changes are as follows:

- Restructure root `AGENTS.md`: consolidate 3 overlapping overview
sections into one, merge "Key Technical Details" + "Development Notes" +
"Development tips" into organized Coding/Testing/Documentation Standards
sections, deduplicate Python/Java commands (now link to subdirectory
files)
- Create `rust/AGENTS.md` with ~66 Rust-specific rules covering code
style, API design, error handling, naming, testing, documentation, and
lance-encoding hot path patterns
- Enhance `java/AGENTS.md` with API design (Options pattern, JNI enum
serialization), code style (JavaBean conventions), and documentation
rules
- Enhance `python/AGENTS.md` with Pythonic API design, PyO3 dataclass
rules, type hints, and testing patterns
- Enhance `protos/AGENTS.md` with proto3 `optional` semantics,
structured message design, and documentation rules
- Create `docs/src/format/AGENTS.md` with format spec documentation
standards (pyarrow schemas, language-agnostic definitions, algorithm
detail requirements)
Add a new metric `removed_data_file_num` to the `RemovalStats` returned by the cleanup operation, so that users can easily see how many Lance data files were deleted by the current cleanup operation and better evaluate its impact.

---------

Co-authored-by: YueZhang <zhangyue.1010@bytedance.com>
…fragments (lance-format#6040)

In lance-spark, distributed vector queries rely on fragmentScanner. When
a specific fragment is targeted, prefilter must be set to true to ensure
correct execution. This change exposes the variable through JNI to
enable this functionality.

---------

Co-authored-by: niuyulin <niuyulin@chinamobile.com>
…at#5691)

Add Python support for `defer_index_remap`

---------

Co-authored-by: YueZhang <zhangyue.1010@bytedance.com>
…ension type (lance-format#6107)

Closes lance-format#6106

AI Disclaimer: I have used Claude Code to help draft this PR and have
manually reviewed its contents.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…nce-format#6042)

When stable row IDs are enabled, FTS and vector indexes may return row
IDs for rows that have since been deleted. The row ID index excludes
deleted rows, so get_row_addrs() would silently drop these entries via
filter_map, producing an addresses array shorter than the input batch.
The downstream merge_with_schema then failed with "Attempt to merge two
RecordBatch with different sizes".

Fix: track which row IDs are valid in get_row_addrs() and return a
validity mask. In map_batch(), filter the input batch to remove rows
whose IDs no longer exist before merging.
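The fix can be sketched in miniature — the `HashMap` index and helper names here are illustrative stand-ins for the real row-id index and `map_batch`:

```rust
use std::collections::HashMap;

// Stand-in for the row-id index: stable row IDs -> row addresses;
// deleted rows are simply absent.
fn get_row_addrs(index: &HashMap<u64, u64>, row_ids: &[u64]) -> (Vec<u64>, Vec<bool>) {
    let mut addrs = Vec::new();
    let mut validity = Vec::with_capacity(row_ids.len());
    for id in row_ids {
        match index.get(id) {
            Some(addr) => {
                addrs.push(*addr);
                validity.push(true);
            }
            // Deleted row: record it in the mask instead of silently
            // dropping it (the old filter_map behavior).
            None => validity.push(false),
        }
    }
    (addrs, validity)
}

// map_batch-style filtering: drop the rows whose IDs no longer exist,
// so the batch length matches the addresses array before merging.
fn filter_rows<T: Clone>(rows: &[T], validity: &[bool]) -> Vec<T> {
    rows.iter()
        .zip(validity)
        .filter(|(_, v)| **v)
        .map(|(r, _)| r.clone())
        .collect()
}

fn main() {
    let index = HashMap::from([(10u64, 0u64), (30, 2)]); // row 20 was deleted
    let (addrs, validity) = get_row_addrs(&index, &[10, 20, 30]);
    assert_eq!(addrs.len(), 2);
    let batch = filter_rows(&["a", "b", "c"], &validity);
    assert_eq!(batch, vec!["a", "c"]); // same length as addrs: merge succeeds
}
```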

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Xuanwo <github@xuanwo.io>
…-format#6120)

Currently, if a user manually sets a compression type (e.g. zstd), we can still inject FSST compression underneath, resulting in `zstd(fsst(data))` rather than the `zstd(data)` the user intended. This PR updates our auto-injection of FSST to only occur when the user has not manually specified a compression type.

Related to lance-format#5248

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
Some quantizers (just RQ at the moment) don't need to sample any training data. In some paths (e.g. the fallback sampling with lots of nulls) this could lead to errors because sampling would still be attempted. This adds a shortcut that immediately returns an empty array when no sampling is required.
…6114)

Set `skip_transpose=True` to let the vector index builder skip transposing the quantized vectors. This is useful for distributed indexing, since we don't need to inverse-transpose and transpose again when merging indices.
…rmat#6122)

The vector index still needs to be opened to get the right type; otherwise it is shown as unknown.
…t#6046)

In mask_to_offset_ranges, the RangeWithBitmap case advanced the bitmap
iterator using a global offset (addr - range.start + offset_start)
instead of a range-local position (addr - range.start). When a
RangeWithBitmap segment appeared after other segments (offset_start >
0), the iterator was advanced past its end, causing a panic.

The fix separates range-local iteration from the final offset
calculation: iterate the bitmap using position_in_range, then add
offset_start at the end.

Includes an integration test that reproduces the panic through the
user-facing API: write 2 fragments with stable row IDs, delete some
rows, compact, create a BTree index, then run a filtered scan.
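The corrected iteration can be sketched with plain boolean bitmaps — `position_in_range` follows the description above, while the real code operates on compressed bitmaps:

```rust
// Sketch of the fixed logic: iterate the segment's bitmap with a
// range-local position, and add the global offset only at the end.
fn bitmap_to_offsets(bitmap: &[bool], offset_start: u64) -> Vec<u64> {
    bitmap
        .iter()
        .enumerate()
        // position_in_range is local to this segment's bitmap...
        .filter(|(_, set)| **set)
        // ...and offset_start is applied only in the final calculation.
        .map(|(position_in_range, _)| position_in_range as u64 + offset_start)
        .collect()
}

fn main() {
    // Second segment (offset_start = 3): the buggy version advanced the
    // bitmap iterator by the *global* offset and walked past the 4-bit
    // bitmap, panicking. The range-local version stays in bounds.
    let offsets = bitmap_to_offsets(&[true, false, true, true], 3);
    assert_eq!(offsets, vec![3, 5, 6]);
}
```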

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Xuanwo <github@xuanwo.io>
Co-authored-by: Amp <amp@ampcode.com>
…tistics (lance-format#5805)

These are just some handy scripts for looking at community health.

---

I used Claude Opus 4.5 (claude-opus-4-5-20251101) in the creation of
these scripts. I have reviewed the contents and take full responsibility
for them.
…ze (lance-format#6117)

# Preserve stable row-id entries during scalar index optimize
Fixes lance-format#6116

## Summary

This PR fixes a bug where `optimize_indices()` could drop valid BTree
index entries when the dataset used stable row IDs.

I hit this while building the music module of
[StaticFlow](https://ackingliu.top/), my personal project built on top
of Lance/LanceDB. The `songs` dataset uses:

- `enable_stable_row_ids = true`
- a BTree scalar index on `id`

After running:

1. `compact_files()`
2. `optimize_indices()`

full scans still returned the expected rows, but indexed equality
lookups such as `id = 'song-42'` returned no rows.

## Root Cause

The old optimize path filtered old BTree rows with logic equivalent to:

```rust
valid_fragments.contains((row_id >> 32) as u32)
```

That is correct for address-style row IDs:

```text
row_id = (fragment_id << 32) | row_offset
```

But it is incorrect for stable row IDs, because stable row IDs are
opaque logical IDs and do not encode fragment ownership in their upper
bits.

As a result, valid old index rows could be removed during optimize even
though the underlying rows were still present after compaction.

## What This PR Changes

- adds an explicit old-data filter mode to scalar index update
- keeps fragment-based filtering for address-style row IDs
- builds an exact retained row-ID set for stable-row-ID datasets from
persisted row-id sequences
- filters old BTree rows by exact row-ID membership for the
stable-row-ID case
- adds regression coverage for both the BTree update path and the
end-to-end compaction plus optimize flow

## Implementation Notes

The key change is to stop assuming that every `row_id` can be
interpreted as a row address.

For stable-row-ID datasets, the optimize path now:

1. computes the retained old fragments
2. loads their row-ID sequences
3. builds one exact retained row-ID set
4. keeps only old index rows whose row IDs are still valid

This preserves the existing fast path for address-style row IDs and only
uses exact row-ID filtering when the dataset actually uses stable row
IDs.
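The four steps above can be sketched as follows — the `Fragment` type and helper names are illustrative, while the real code works with persisted row-id sequences and BTree index data:

```rust
use std::collections::HashSet;

// Illustrative shape: a fragment exposes its persisted row-id sequence.
struct Fragment {
    row_ids: Vec<u64>,
}

// Steps 1-3: build one exact retained row-ID set from the old fragments
// that survive compaction.
fn retained_row_ids(retained_fragments: &[Fragment]) -> HashSet<u64> {
    retained_fragments
        .iter()
        .flat_map(|f| f.row_ids.iter().copied())
        .collect()
}

// Step 4: keep only old index rows whose stable IDs are still valid,
// instead of the address-style `(row_id >> 32)` fragment check.
fn filter_old_index_rows(index_rows: &[(u64, &str)], retained: &HashSet<u64>) -> Vec<(u64, String)> {
    index_rows
        .iter()
        .filter(|(row_id, _)| retained.contains(row_id))
        .map(|(id, v)| (*id, v.to_string()))
        .collect()
}

fn main() {
    let retained = retained_row_ids(&[Fragment { row_ids: vec![7, 9] }]);
    // Row 8 was deleted; its index entry is dropped, the others survive.
    let kept = filter_old_index_rows(&[(7, "song-42"), (8, "gone"), (9, "song-7")], &retained);
    assert_eq!(kept.len(), 2);
}
```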

## Additional Context

I also wrote a longer deep dive covering the bug, the stable-row-ID
model, and the full repair process:

- https://ackingliu.top/posts/lance-stable-row-id-deep-dive

## Final Note

This PR's documentation and parts of the implementation were prepared with assistance from Codex GPT-5.4. The final patch has been reviewed by me personally, and it has been running normally in my production StaticFlow environment for one week.
…at#6084)

Closes lance-format#3291

```
    stats = dataset.cleanup_old_versions(
        older_than=(datetime.now() - moment), delete_rate_limit=100.0
    )
```

---------

Co-authored-by: YueZhang <zhangyue.1010@bytedance.com>
…6146)

fix CI error: `FAILED
python/tests/test_integration.py::test_duckdb_pushdown_extension_types -
_duckdb.Error: DeprecationWarning: fetch_arrow_table() is deprecated,
use to_arrow_table() instead.`
20%+ faster for a 2 GB index; could be more for larger indexes
)

This PR fixes the regression benchmarks workflow failing to resolve the
pinned `google-github-actions/auth` action. The workflow had quoted the
entire `uses` value, which caused the trailing `# v2` comment to be
parsed as part of the action ref.
There was a conflict table in transaction.rs but this was incomplete
(some rows/columns missing) and seemed to be imprecise or incorrect in a
few spots. I've attempted to more thoroughly document this in
transaction.md instead.
…ance-format#6160)

Previously, `adjust_child_validity` would call `ArrayData::try_new` with
a null bitmap on a `DataType::Null` array, causing an `.unwrap()` panic
with `InvalidArgumentError("Arrays of type Null cannot contain a null
bitmask")`.

The trigger: when a user inserts rows where a struct sub-field has only
null values, Arrow infers `DataType::Null` for that column. If a
subsequent fragment omits that nullable sub-field, Lance inserts a
`NullReader` to fill it in. `MergeStream` then merges the real batch
(with null struct rows) and the `NullReader` batch (all-null struct),
recursing into the struct where `adjust_child_validity` is called with
the `Null`-typed child and a non-empty parent validity — triggering the
panic.

Fix: skip the bitmask operation when `child.data_type() ==
DataType::Null`. A `Null` array is always entirely null by definition
and needs no validity adjustment.
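The guard itself is small. An abstract sketch — the `DataType` enum here is a stand-in for Arrow's, and the real function manipulates `ArrayData` buffers rather than `Vec<bool>`:

```rust
// Stand-in for arrow_schema::DataType.
#[derive(PartialEq)]
enum DataType {
    Null,
    Int32,
}

// Sketch of the fixed adjust_child_validity: skip the bitmask work
// entirely for Null-typed children.
fn adjust_child_validity(child_type: &DataType, parent_validity: &[bool]) -> Option<Vec<bool>> {
    if *child_type == DataType::Null {
        // A Null array is entirely null by definition; attaching a
        // validity bitmap would make ArrayData::try_new return
        // InvalidArgumentError.
        return None;
    }
    // The real code intersects the child and parent bitmaps here.
    Some(parent_validity.to_vec())
}

fn main() {
    assert!(adjust_child_validity(&DataType::Null, &[true, false]).is_none());
    assert!(adjust_child_validity(&DataType::Int32, &[true, false]).is_some());
}
```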

Closes lance-format#6159

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…e-format#6163)

Previously, when `FragReuseIndexDetails` exceeded 204800 bytes
(triggered by large compactions with many fragments), the code wrote the
details to an external file (`details.binpb`). On local filesystems,
`ObjectStore::create` returns a `LocalWriter` that atomically renames a
temp file to the final path in `Writer::shutdown`. However,
`frag_reuse.rs` imported `tokio::io::AsyncWriteExt` but not
`lance_io::traits::Writer`, so `writer.shutdown()` resolved to
`AsyncWriteExt::shutdown` (flush/close only) — the temp file was deleted
on drop without being persisted. Any subsequent `load_indices` call
would fail with `Not found: .../details.binpb`.

Fixed by using UFCS `Writer::shutdown(writer.as_mut()).await?` to
explicitly call the lance trait method, matching the existing pattern in
`ivf.rs` and `blob.rs`.
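Fully-qualified syntax pins the trait explicitly, so an unlucky import can no longer change which `shutdown` runs. An illustrative sketch with stand-in traits (the real ones are `tokio::io::AsyncWriteExt` and `lance_io::traits::Writer`):

```rust
// Two traits with a method of the same name: plain method-call syntax
// resolves to whichever trait happens to be in scope.
trait AsyncWriteLike {
    fn shutdown(&mut self) -> &'static str {
        "flush/close only"
    }
}
trait LanceWriter {
    fn shutdown(&mut self) -> &'static str {
        "rename temp file into place"
    }
}

struct LocalWriter;
impl AsyncWriteLike for LocalWriter {}
impl LanceWriter for LocalWriter {}

fn main() {
    let mut w = LocalWriter;
    // UFCS names the trait, so the intended method is guaranteed:
    assert_eq!(LanceWriter::shutdown(&mut w), "rename temp file into place");
    // The bug was the moral equivalent of this call being picked instead:
    assert_eq!(AsyncWriteLike::shutdown(&mut w), "flush/close only");
}
```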

Fixes lance-format#6161

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This breaks the "build_partitions" stage into "build_partitions" and
"merge_partitions", and also updates the progress reporting on the
shuffle phase to be in terms of rows instead of batches.
This PR moves a few unrelated clippy cleanups out of lance-format#6168 so the blob
empty-range fix can stay focused on the regression it addresses. The
changes here are all mechanical simplifications with no intended
behavior change.
…t#6175)

This PR moves the Linux and Windows workflows that currently run on Warp
onto GitHub-hosted runners. The goal is to reduce reliance on custom
runners and take advantage of the sponsored larger GitHub-hosted
machines for the slowest CI paths.

This is focused on the current CI bottlenecks we observed in recent
successful PR runs, especially Rust ARM and Python Windows jobs, while
keeping the existing macOS and benchmark-specific runners unchanged
until we verify equivalent GitHub-hosted options for them.

Context:
- Recent PR history shows Rust `linux-arm` and Python `windows` as the
dominant critical-path jobs.
- This change upgrades those jobs to larger GitHub-hosted runners where
available (`ubuntu-24.04-8x`, `ubuntu-24.04-arm64-8x`,
`windows-latest-4x`) and aligns the remaining Linux/Windows workflows
with the same runner family.
- I validated the workflow YAML locally after the runner migration; no
product code or test logic changed.

---

Updates:

- Rust linux-arm: 40.7 -> 19.4, about -52%
- Rust windows-build: 27.7 -> 21.0, about -24%
- Python windows: 36.5 -> 23.1, about -37%
- Python Linux 3.13 ARM: 26.9 -> 20.7, about -23%
- Python Linux 3.13 x86_64: 26.8 -> 19.1, about -29%
- Python Linux 3.9 x86_64: 25.9 -> 19.2, about -26%
Improves the alicloud storage config doc (lance-format#4247).

Signed-off-by: FarmerChillax <farmerchillax@outlook.com>
jackye1995 and others added 25 commits April 11, 2026 15:30
…-format#6488)

## Summary

- Add missing `(SubIndexType::Flat, QuantizationType::FlatBin)` match
arm in `optimize_vector_indices_v2`

The v2 function handles all other sub-index/quantization combinations
but misses the FlatBin case for binary vector IVF_FLAT indices, hitting
the catch-all `unimplemented!` panic during incremental indexing
(`optimize_indices`). The v1 function already handles this correctly.
…t#6435)

This teaches `merge_insert` to keep the delete-by-source fast path even
when a scalar index exists on the join key. The actual indexed join path
is still only used when unmatched target rows are kept, so the presence
of index metadata should not force these operations back to the legacy
full-join path.

This also adds regression coverage for full-schema `FixedSizeList`
merges with `when_not_matched_by_source(Delete)` both with and without a
scalar index. That closes the gap behind lance-format#6195 and preserves the earlier
fix for lancedb/lancedb#3094.
…lance-format#6477)

## Summary

- Change `DataFile.fields` and `DataFile.column_indices` from `Vec<i32>`
to `Arc<[i32]>` so that fragments with identical field lists share a
single heap allocation
- Add `DataFileFieldInterner` that deduplicates these slices during
manifest deserialization
- In homogeneous tables (the common case), every fragment carries the
same field list, so at 20M fragments this saves **~2.4 GB** of redundant
heap allocations

## Motivation

When dataset manifests grow large (>1 GB with millions of fragments),
opening the dataset becomes very expensive in terms of memory. Each
`DataFile` previously owned its own `Vec<i32>` for `fields` and
`column_indices`, even though in most tables every fragment has the
exact same field list. This PR deduplicates those allocations at
deserialization time.

### Per-fragment memory breakdown (before)

| Field | Size per fragment |
|-------|------------------|
| `fields: Vec<i32>` (10 fields) | ~64 bytes |
| `column_indices: Vec<i32>` (10 cols) | ~64 bytes |
| **Total redundant** | **~128 bytes x 20M = ~2.4 GB** |

### After this change

With interning, all 20M fragments share a single `Arc<[i32]>` allocation
(~80 bytes total instead of 2.4 GB).
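The interning idea can be sketched in a few lines — an illustrative struct, while the real `DataFileFieldInterner` runs during manifest deserialization:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Minimal sketch of field-list interning.
#[derive(Default)]
struct FieldInterner {
    cache: HashMap<Vec<i32>, Arc<[i32]>>,
}

impl FieldInterner {
    // Fragments with identical field lists receive clones of one Arc,
    // sharing a single heap allocation instead of one Vec each.
    fn intern(&mut self, fields: Vec<i32>) -> Arc<[i32]> {
        self.cache
            .entry(fields.clone())
            .or_insert_with(|| Arc::from(fields.as_slice()))
            .clone()
    }
}

fn main() {
    let mut interner = FieldInterner::default();
    let a = interner.intern(vec![0, 1, 2]);
    let b = interner.intern(vec![0, 1, 2]);
    // Same allocation: 20M identical fragments would share it too.
    assert!(Arc::ptr_eq(&a, &b));
}
```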

## Changes

- **`lance-table/src/format/fragment.rs`** — Core struct change
(`Vec<i32>` → `Arc<[i32]>`), custom `Serialize`/`Deserialize` impls, and
`DataFileFieldInterner`
- **`lance-table/src/format/manifest.rs`** — Use interner during
manifest deserialization
- **`lance/src/dataset/fragment.rs`**, **`merge_insert.rs`**,
**`io/commit.rs`** — Tombstoning and field-remapping rebuilt as new
`Arc<[i32]>` instead of in-place mutation
- **`python/src/fragment.rs`**, **`java/lance-jni/src/fragment.rs`** —
FFI boundary conversions
- Various test files — Updated struct literals and assertions

## Compatibility

- No format change — protobuf schema is unchanged
- Serde JSON output is identical (custom impl serializes `Arc<[i32]>` as
`[i32]`)
- All public API signatures that take `Vec<i32>` (e.g.,
`DataFile::new()`, `Fragment::add_file()`) still accept `Vec<i32>` and
convert internally

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mory (lance-format#6499)

## Summary

- Change `RowDatasetVersionMeta::Inline` from `Vec<u8>` to `Arc<[u8]>`
so that fragments with identical version metadata share a single heap
allocation
- Extend `DataFileFieldInterner` to deduplicate these inline byte
payloads during manifest deserialization
- Introduce `InternCache<T>`: a hybrid cache that uses Vec linear scan
for ≤16 entries and upgrades to HashMap for larger caches
- Add custom `Serialize`/`Deserialize` impls for `RowDatasetVersionMeta`
to handle `Arc<[u8]>` transparently

## Motivation

Follow-up to lance-format#6477 (interning `DataFile.fields`/`column_indices`). After
a compaction, all fragments are stamped with the same version metadata
(both `last_updated_at_version_meta` and `created_at_version_meta`), but
each fragment previously owned its own `Vec<u8>` copy.

### Per-fragment memory breakdown (before)

| Field | Size per fragment |
|-------|------------------|
| `last_updated_at_version_meta: Inline(Vec<u8>)` | ~24 bytes + payload |
| `created_at_version_meta: Inline(Vec<u8>)` | ~24 bytes + payload |
| **Total redundant at 20M fragments** | **~480 MB+** |

### After this change

With interning, all 20M fragments share a single `Arc<[u8]>` allocation
per unique payload.

## Benchmark results

Microbenchmark at 100K fragments (10 fields per fragment):

| Scenario | No interning | With interning | Delta |
|----------|-------------|----------------|-------|
| **Uniform (1 unique version)** | 24.5 ms | 17.9 ms | **27% faster** |
| **Diverse (10 unique)** | 25.7 ms | 19.7 ms | **23% faster** |
| **Diverse (100 unique)** | 26.0 ms | 23.4 ms | **10% faster** |
| **Diverse (500 unique)** | 26.0 ms | 22.8 ms | **12% faster** |

| Memory (100K fragments) | No interning | With interning | Savings |
|------------------------|-------------|----------------|---------|
| **10 fields** | 39.47 MB | 29.74 MB | **24.6%** |
| **50 fields** | 69.99 MB | 29.74 MB | **57.5%** |

Both memory and speed improve across all scenarios. The hybrid
`InternCache` uses fast Vec scan for the common case (1-3 unique values)
and upgrades to HashMap when diversity exceeds 16 entries.
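A minimal sketch of the hybrid cache, assuming the described Vec-scan/HashMap upgrade behavior (names and threshold handling are illustrative):

```rust
use std::collections::HashMap;
use std::sync::Arc;

const UPGRADE_THRESHOLD: usize = 16;

enum InternCache {
    Small(Vec<(Vec<u8>, Arc<[u8]>)>),
    Large(HashMap<Vec<u8>, Arc<[u8]>>),
}

impl InternCache {
    fn new() -> Self {
        InternCache::Small(Vec::new())
    }

    fn intern(&mut self, payload: &[u8]) -> Arc<[u8]> {
        // Upgrade to a HashMap once diversity exceeds the threshold.
        if matches!(self, InternCache::Small(v) if v.len() >= UPGRADE_THRESHOLD) {
            if let InternCache::Small(entries) =
                std::mem::replace(self, InternCache::Large(HashMap::new()))
            {
                *self = InternCache::Large(entries.into_iter().collect());
            }
        }
        match self {
            // Common case (1-3 unique payloads): a linear scan is
            // cheaper than hashing every payload.
            InternCache::Small(entries) => {
                if let Some((_, interned)) =
                    entries.iter().find(|(k, _)| k.as_slice() == payload)
                {
                    return interned.clone();
                }
                let interned: Arc<[u8]> = Arc::from(payload);
                entries.push((payload.to_vec(), interned.clone()));
                interned
            }
            InternCache::Large(map) => map
                .entry(payload.to_vec())
                .or_insert_with(|| Arc::from(payload))
                .clone(),
        }
    }
}

fn main() {
    let mut cache = InternCache::new();
    let a = cache.intern(b"v1-meta");
    let b = cache.intern(b"v1-meta");
    // Identical payloads share one allocation.
    assert!(Arc::ptr_eq(&a, &b));
}
```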

Run with: `cargo bench -p lance-table --bench manifest_intern`

## Changes

- **`rust/lance-table/src/rowids/version.rs`** — `Inline(Vec<u8>)` →
`Inline(Arc<[u8]>)`, custom serde impls, updated protobuf conversions
- **`rust/lance-table/src/format/fragment.rs`** — `InternCache<T>`
(Vec/HashMap hybrid), extended `DataFileFieldInterner` with version meta
interning
- **`rust/lance-table/benches/manifest_intern.rs`** — Microbenchmark
covering uniform and diverse scenarios

## Compatibility

- No format change — protobuf schema is unchanged
- Serde JSON output is identical (custom impl serializes `Arc<[u8]>` as
`[u8]`)
- `from_sequence()` still works as before (converts internally)

## Test plan

- [x] `cargo check --workspace --tests` passes
- [x] `cargo clippy -p lance-table -p lance -- -D warnings` passes
- [x] All 88 `lance-table` tests pass
- [x] `cargo fmt --all -- --check` passes
- [x] Microbenchmark validates performance across uniform and diverse
scenarios
- [ ] CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rmat#6308)

- `list_all_tables`
- `restore_table`
- `update_table_schema_metadata`
- `get_table_stats`
- `explain_table_query_plan`
- `analyze_table_query_plan`

---------

Co-authored-by: zhangyue19921010 <zhangyue.1010@bytedance.com>
## Summary

- Adds `#[instrument]` attributes from the `tracing` crate to key
functions across the `mem_wal` module
- Covers write path (`RegionWriter::open`, `put`, `close`), flush path
(`MemTableFlusher::flush`, `flush_with_indexes`), WAL operations,
manifest store, memtable inserts, scanner/planner, point lookups, and
vector search
- Uses appropriate trace levels (`info` for high-level operations,
`debug` for internals) with relevant fields (region_id, epoch, row
counts, batch counts)

## Test plan

- [x] `cargo check` passes — no functional changes, only attribute
additions
- [x] Existing `mem_wal` tests continue to pass
- [ ] Tracing output verified with `RUST_LOG=debug` showing instrumented
spans

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
)

## Summary

Refactor `FullZipScheduler::create_page_load_task` to accept a
pre-submitted I/O future instead of deferring I/O submission until the
async task executes. This allows the I/O requests to be submitted
immediately during scheduling, enabling the object store layer to batch
and parallelize them. Closes lance-format#6504.

## I/O Model Change

### Before: Lazy I/O submission (serialized)

Previously, `create_page_load_task` received a
`FullZipReadSource::Remote(io)` along with byte ranges and priority. The
actual `io.submit_request()` call happened **inside** the async block,
meaning the I/O request was not submitted until the future was first
polled.

When decoding multiple pages (e.g. across many fragments), this created
a sequential I/O pattern:

```
Page 1: [schedule] -> [poll] -> [submit I/O] -> [wait response] -> [decode]
Page 2:                                          [schedule] -> [poll] -> [submit I/O] -> [wait response] -> [decode]
Page 3:                                                                                   [schedule] -> [poll] -> ...
```

Each page's I/O request could only be submitted after the previous task
started executing. The I/O scheduler had no visibility into upcoming
requests, preventing it from batching or parallelizing them effectively.

### After: Eager I/O submission (pipelined)

Now, `io.submit_request()` is called **before** constructing the
`PageLoadTask`, and the resulting future is passed into
`create_page_load_task`. All I/O requests for all pages are submitted
upfront during the scheduling phase:

```
[schedule all pages] --> submit I/O page 1 -+
                     --> submit I/O page 2 -+
                     --> submit I/O page 3 -+  (all in-flight concurrently)
                     --> submit I/O page N -+
                                            |
                     [poll] -> [await page 1 response] -> [decode]
                     [poll] -> [await page 2 response] -> [decode]
                     [poll] -> [await page 3 response] -> [decode]
```

The object store layer can now see all pending requests at once and
optimize I/O through batching, connection multiplexing, and parallel
fetches. The async tasks only await the already-in-flight I/O futures.
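The before/after difference can be illustrated with OS threads standing in for I/O futures — purely an analogy, not the lance-encoding code. The point is that all requests are started during the scheduling phase, before anything waits on results:

```rust
use std::thread;
use std::time::{Duration, Instant};

// Simulated object-store fetch with fixed latency.
fn fetch_page(page: usize) -> Vec<u8> {
    thread::sleep(Duration::from_millis(50));
    vec![page as u8; 4]
}

fn main() {
    let start = Instant::now();
    // Eager submission: all "requests" go in flight concurrently...
    let handles: Vec<_> = (0..4)
        .map(|p| thread::spawn(move || fetch_page(p)))
        .collect();
    // ...and the decode loop only awaits already-submitted work.
    let pages: Vec<Vec<u8>> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    assert_eq!(pages.len(), 4);
    // Wall time is ~one latency, not four, because submission was not
    // serialized behind each previous page's completion.
    assert!(start.elapsed() < Duration::from_millis(150));
}
```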

## Changes

- `rust/lance-encoding/src/encodings/logical/primitive.rs`:
- Changed `create_page_load_task` signature to accept
`BoxFuture<'static, Result<Vec<Bytes>>>` instead of `FullZipReadSource`
+ byte ranges + priority
- Moved `io.submit_request()` calls to happen eagerly at both call sites
(`schedule_ranges_with_rep_index` and the non-rep-index path), before
constructing the page load task

## Performance

Tested with a multi-fragment dataset containing fixed-width columns
(768-dim float32 vectors, 40 fragments, 50 rows/fragment):

| Benchmark | Before (p50) | After (p50) | Speedup |
|---|---|---|---|
| Fixed-width column scan | 3453 ms | 523 ms | **6.6x** |

The improvement comes entirely from I/O pipelining — the decoding logic
itself is unchanged. The effect is most pronounced with many fragments
or pages, where the serialized I/O submission was the dominant
bottleneck.
## Summary
- Add `blob_max_pack_file_bytes` to `WriteParams`, allowing users to
override the default 1 GiB maximum pack (`.blob`) sidecar file size
- Thread the configuration through the full write path: `WriteParams` ->
`WriterGenerator` -> `WriterOptions` -> `BlobPreprocessor` ->
`PackWriter`
- Expose the option in Python (`write_dataset`) and Java
(`WriteParams.Builder`) bindings

## Test plan
- [x] All 37 existing blob tests pass (`cargo test -p lance blob`)
- [x] Clippy clean on `lance` and `lance-jni` crates
- [x] Verify Python binding works end-to-end with
`blob_max_pack_file_bytes` kwarg
- [x] Verify Java binding compiles with `./mvnw compile`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary

- Bump `jieba-rs` from 0.8.1 to 0.9.0 to fix the `build-no-lock` CI job
- The `core2` crate v0.4.0 was yanked from crates.io, breaking fresh
dependency resolution (`jieba-rs` → `include-flate` → `libflate` →
`core2`)
- `jieba-rs` 0.9.0 drops the `include-flate`/`libflate`/`core2` chain
entirely, removing 9 transitive dependencies with no API changes

## Test plan

- [x] `cargo check -p lance-index --features tokenizer-jieba` passes
- [x] Verified build succeeds without `Cargo.lock` (simulating the CI
job)
- [ ] CI `build-no-lock` job passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This tightens the repository's environment guidance so language-specific
tasks must follow the documented workflow before reporting missing tools
or dependencies.

For Python work, the docs now make `uv sync --extra tests --extra dev`
and `uv run ...` mandatory, and explicitly call out the common failure
mode where slow `uv sync` is interrupted or global Python is used
instead.
This changes per-base runtime configuration to use exact
`ObjectStoreParams` bindings keyed by `BasePath.path` instead of
per-base storage option overrides. Dataset-level and write-level store
params now act only as fallbacks, while reads, target-base writes, and
external blob resolution all consult the same base-scoped binding model.

This keeps provider-specific runtime state out of the manifest and
follows the direction in discussion lance-format#6307 to keep `BasePath` focused on
identity.
This PR vendors the tokenizer stack Lance actually uses into a new
`rust/lance-tokenizer` crate and rewires FTS and inverted-index code to
depend on it instead of `tantivy` and `lindera-tantivy`. It keeps the
existing document and query tokenization semantics in-tree, renames the
old FTS document adapter module to `document_tokenizer`, and preserves
upstream license headers on vendored code.
…ormat#6517)

## Summary

- Add hand-written AVX2 and AVX-512 VNNI backends for u8 squared L2
distance (`Σ(a-b)²`) in new `l2_u8.rs`
- Add fused single-pass u8 cosine distance kernel in new `cosine_u8.rs`
— computes `dot(a,b)`, `‖a‖²`, `‖b‖²` simultaneously, halving memory
traffic vs the previous 2-3 pass approach
- Wire both into the `L2 for u8` and `Cosine for u8` trait impls
- Add benchmarks comparing scalar vs SIMD for both kernels

### Algorithmic approach (adapted from
[NumKong](https://github.com/ashvardanian/NumKong))

**L2 (AVX2):** Saturating subtraction for `|a-b|`, zero-extend u8→i16,
`VPMADDWD(diff, diff)` to square and accumulate into i32. 32
elements/iter.

**L2 (AVX-512 VNNI):** Same abs-diff approach with `VPDPWSSD` for fused
square-accumulate. 64 elements/iter.

**Cosine (AVX2):** Zero-extend both vectors to i16, triple `VPMADDWD`
per half (a·b, a·a, b·b). 32 elements/iter, single pass.

**Cosine (AVX-512 VNNI):** Same three-accumulator approach with
`VPDPWSSD`. 64 elements/iter.

Both kernels use `OnceLock`-based runtime CPU dispatch, falling back to
portable scalar on non-x86 platforms.

### Benchmarks

*1M × 1024-dim u8 vectors.*

**x86_64 — AMD Ryzen 5 4500 6-Core (AVX2, no AVX-512)**

| Kernel | Scalar | SIMD | Speedup |
|--------|--------|------|---------|
| L2(u8) | 73.5 ms | 58.2 ms | **1.26x** |
| Cosine(u8) | 122.2 ms | 82.1 ms | **1.49x** |

L2 auto-vectorization baseline was 91.5 ms, so SIMD is 1.57x faster than
that path.

**aarch64 — Apple Silicon M3 Max (no AVX2, scalar fallback)**

| Kernel | Scalar | SIMD (dispatch) |
|--------|--------|-----------------|
| L2(u8) | 26.8 ms | 27.3 ms |
| Cosine(u8) | 90.1 ms | 90.4 ms |

On aarch64 the SIMD path falls through to scalar (no AVX2), so times are
identical — confirms no regression on non-x86 platforms. AVX-512 VNNI
systems (Ice Lake+, Zen 4+) should see larger gains.
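The portable scalar reference paths are the easiest way to see what the two kernels compute. A sketch of both, with illustrative function names:

```rust
// Scalar reference for u8 squared L2: |a-b| via two saturating subs
// (mirroring the SIMD abs-diff trick), then square and accumulate.
fn l2_squared_u8(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = x.saturating_sub(y).max(y.saturating_sub(x)) as u32;
            d * d
        })
        .sum()
}

// Single pass: dot(a,b), ‖a‖², ‖b‖² accumulated together, as in the
// fused cosine kernel.
fn cosine_distance_u8(a: &[u8], b: &[u8]) -> f32 {
    let (mut dot, mut na, mut nb) = (0u64, 0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        dot += x as u64 * y as u64;
        na += x as u64 * x as u64;
        nb += y as u64 * y as u64;
    }
    1.0 - dot as f32 / ((na as f32).sqrt() * (nb as f32).sqrt())
}

fn main() {
    // Boundary values: |0-255|² twice, plus |10-7|².
    assert_eq!(l2_squared_u8(&[0, 255, 10], &[255, 0, 7]), 255 * 255 * 2 + 9);
    // Parallel vectors: cosine distance ~0.
    let c = cosine_distance_u8(&[1, 2, 3], &[2, 4, 6]);
    assert!(c.abs() < 1e-6);
}
```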

## Test plan

- [x] All 11 new tests pass: SIMD backends verified against scalar
reference across 18 vector sizes (0–4097), boundary values (0/255),
alternating patterns, random seeds
- [x] All 63 existing lance-linalg tests pass (no regressions)
- [x] Clippy clean, fmt clean
- [x] Benchmarked on x86_64 AVX2 (AMD Ryzen 5 4500) — L2 1.26x, Cosine
1.49x faster
- [ ] Verify on AVX-512 VNNI system for additional speedup data

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This fixes the release bump configuration after `lance-tokenizer` was
added to the workspace dependencies. `.bumpversion.toml` was missing the
corresponding replacement rule, so version bumps could leave that
internal dependency on the previous version. This is a targeted
config-only fix to keep the release automation updating all workspace
crates consistently.
This fixes the directory namespace CI failure where single-instance
concurrent create/drop operations on `__manifest` could time out with
`TooMuchWriteContention`, especially in the Windows build.

Manifest mutations are now serialized within a single
`ManifestNamespace` instance so concurrent operations stop racing on
stale in-memory snapshots, and inline manifest maintenance now defers
compaction/index merges until the table has accumulated enough
fragments.

Context:
https://github.com/lance-format/lance/actions/runs/24439767878/job/71401857043
Blob columns can be represented either as loaded values or as unloaded
descriptor schemas, but our schema projection logic still treated those
views as incompatible types. This change teaches field projection and
intersection to recognize blob loaded/unloaded pairs as the same logical
column, and adds regression coverage for both the core schema path and
the projection-plan path that previously failed.
## Summary

- Adds `ChopBatchesStream`, a stream wrapper that splits oversized
batches (>1.5x target `batch_size_bytes`) into smaller sub-batches using
zero-copy `RecordBatch::slice`
- Wraps the filtered read output stream with `ChopBatchesStream` when
`batch_size_bytes` is configured via `FileReaderOptions`
- Serves as a safety net when the underlying file reader doesn't
estimate batch sizes accurately enough

**Stacked on feat/byte-sized-batches-file-reader** — wait for that to
merge first, then rebase this PR.
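The chopping rule can be sketched with byte counts and `(start, length)` row slices standing in for zero-copy `RecordBatch::slice` — an illustrative helper, not the actual `ChopBatchesStream`:

```rust
// Split a batch only when it exceeds 1.5x the target size; otherwise
// pass it through untouched.
fn chop(batch_bytes: usize, num_rows: usize, target_bytes: usize) -> Vec<(usize, usize)> {
    if batch_bytes <= target_bytes * 3 / 2 {
        return vec![(0, num_rows)];
    }
    // Aim for sub-batches of roughly target_bytes each.
    let bytes_per_row = batch_bytes / num_rows.max(1);
    let rows_per_chunk = (target_bytes / bytes_per_row.max(1)).max(1);
    let mut slices = Vec::new();
    let mut start = 0;
    while start < num_rows {
        let len = rows_per_chunk.min(num_rows - start);
        slices.push((start, len));
        start += len;
    }
    slices
}

fn main() {
    // 1.4x target: passes through as a single slice.
    assert_eq!(chop(1400, 100, 1000), vec![(0, 100)]);
    // 4x target: chopped into ~target-sized sub-batches.
    let slices = chop(4000, 100, 1000);
    assert_eq!(slices.len(), 4);
    assert_eq!(slices[0], (0, 25));
}
```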

## Test plan

- [x] Unit tests for `ChopBatchesStream`: splits large batches, passes
small batches through, `wrap_if_needed(None)` is a no-op
- [x] `cargo clippy` clean
- [x] `cargo fmt` clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-format#6503)

Add protobuf encode/decode for `ANNIvfSubIndexExec`

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This isolates the `test_memory_leaks` index statistics probe into a
fresh subprocess instead of running it inside the long-lived pytest
worker process. That keeps the test focused on repeated
`index_statistics` calls and avoids false positives from RSS growth left
behind by earlier tests such as the recent batch-chopping coverage added
in `test_dataset.py`.
…at#6352)

This PR improves blob I/O in two complementary ways: `BlobFile`
instances that resolve to the same physical object now share a lazy
`BlobSource` and can opportunistically coalesce concurrent reads before
handing them to Lance's existing scheduler, and datasets now expose a
planned `read_blobs` API for materializing blob payloads directly. It
also adds explicit cursor-preserving range reads for `BlobFile` across
Rust, Python, and Java, with end-to-end Python coverage for the new API
and the edge cases it uncovered.

This keeps the optimization aligned with Lance's existing scheduler
model while giving callers a higher-level path for sequential and
batched blob access.

## Python example

```python
import lance

dataset = lance.dataset("/path/to/dataset")
blobs = dataset.read_blobs(
    "images",
    indices=[0, 4, 8],
    target_request_bytes=8 * 1024 * 1024,
    max_gap_bytes=64 * 1024,
    max_concurrency=4,
    preserve_order=True,
)

for row_address, payload in blobs:
    print(row_address, len(payload))
```
…mat#6540)

## Summary

- Adds `f64x4` and `f64x8` SIMD types to `lance-linalg` with support for
x86_64 (AVX2/AVX-512), aarch64 (NEON), and loongarch64 (LASX)
- Replaces auto-vectorization-dependent f64 distance functions with
explicit SIMD using two-level unrolling (f64x8 + f64x4 + scalar tail)
- Updates norm_l2, dot, L2, and cosine distance for f64

## Benchmark Results (Apple M-series, aarch64 NEON)

1M vectors × 1024 dimensions:

| Benchmark | Before | After | Change |
|-----------|--------|-------|--------|
| NormL2(f64, auto-vec) | 117.76 ms | 116.04 ms | ~same |
| NormL2(f64, SIMD) | N/A (TODO) | 119.16 ms | new |
| Dot(f64, auto-vec) | 129.36 ms | 130.23 ms | ~same |
| L2(f64, auto-vec) | 132.53 ms | 135.15 ms | ~same |
| **Cosine(f64, auto-vec)** | **202.52 ms** | **139.23 ms** | **-31.4%** |

The biggest win is **cosine distance**, which previously had an empty
`impl Cosine for f64 {}` falling back to the scalar path. The explicit
SIMD implementation is **31% faster**.

For norm_l2, dot, and L2, LLVM's auto-vectorization with the LANES=8
hint was already producing good code on this platform. The explicit SIMD
ensures consistent performance across compilers and platforms rather
than relying on fragile auto-vectorization hints.

## Test plan
- [x] All 59 lance-linalg tests pass
- [x] Clippy clean (`-D warnings`)
- [x] `cargo fmt` clean
- [ ] CI passes on all platforms

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@emilk emilk force-pushed the emilk/fix-unwrap branch 2 times, most recently from 3ec544e to a38ffae Compare April 16, 2026 13:05
@emilk emilk closed this Apr 16, 2026
@emilk emilk changed the title Replace .unwrap() with ? in filtered_read.rs Fix crash in FilteredReadStream on task cancellation Apr 16, 2026
