Fix crash in FilteredReadStream on task cancellation #17
Closed
emilk wants to merge 217 commits into release-3.0.0 from
Conversation
All rules are derived from analysis of ~1000 PR reviews to capture recurring review patterns as actionable guidelines. Changes are as follows:

- Restructure root `AGENTS.md`: consolidate 3 overlapping overview sections into one; merge "Key Technical Details", "Development Notes", and "Development tips" into organized Coding/Testing/Documentation Standards sections; deduplicate Python/Java commands (now link to subdirectory files)
- Create `rust/AGENTS.md` with ~66 Rust-specific rules covering code style, API design, error handling, naming, testing, documentation, and lance-encoding hot-path patterns
- Enhance `java/AGENTS.md` with API design (Options pattern, JNI enum serialization), code style (JavaBean conventions), and documentation rules
- Enhance `python/AGENTS.md` with Pythonic API design, PyO3 dataclass rules, type hints, and testing patterns
- Enhance `protos/AGENTS.md` with proto3 `optional` semantics, structured message design, and documentation rules
- Create `docs/src/format/AGENTS.md` with format spec documentation standards (pyarrow schemas, language-agnostic definitions, algorithm detail requirements)
Add a new metric `removed_data_file_num` to the `RemovalStats` of the cleanup operation results, so that users can easily see how many Lance data files were deleted in the current cleanup operation and better evaluate its impact scope. --------- Co-authored-by: YueZhang <zhangyue.1010@bytedance.com>
…fragments (lance-format#6040) In lance-spark, distributed vector queries rely on fragmentScanner. When a specific fragment is targeted, prefilter must be set to true to ensure correct execution. This change exposes the variable through JNI to enable this functionality. --------- Co-authored-by: niuyulin <niuyulin@chinamobile.com>
…at#5691) add Python support for defer_index_remap --------- Co-authored-by: YueZhang <zhangyue.1010@bytedance.com>
…ension type (lance-format#6107) Closes lance-format#6106 AI Disclaimer: I have used Claude Code to help draft this PR and have manually reviewed its contents. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…nce-format#6042) When stable row IDs are enabled, FTS and vector indexes may return row IDs for rows that have since been deleted. The row ID index excludes deleted rows, so get_row_addrs() would silently drop these entries via filter_map, producing an addresses array shorter than the input batch. The downstream merge_with_schema then failed with "Attempt to merge two RecordBatch with different sizes". Fix: track which row IDs are valid in get_row_addrs() and return a validity mask. In map_batch(), filter the input batch to remove rows whose IDs no longer exist before merging. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Xuanwo <github@xuanwo.io>
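A minimal Python sketch of the validity-mask fix described above. Names like `get_row_addrs` and `map_batch` mirror the description; the data structures are illustrative stand-ins, not Lance's actual API.

```python
def get_row_addrs(row_ids, row_id_index):
    """Translate row IDs to addresses, returning a validity mask instead of
    silently dropping IDs that no longer exist (e.g. deleted rows)."""
    addrs, validity = [], []
    for rid in row_ids:
        addr = row_id_index.get(rid)  # None if the row was deleted
        validity.append(addr is not None)
        if addr is not None:
            addrs.append(addr)
    return addrs, validity


def map_batch(batch_rows, row_ids, row_id_index):
    """Filter the input batch down to rows whose IDs are still live, so the
    addresses array and the batch stay the same length before merging."""
    addrs, validity = get_row_addrs(row_ids, row_id_index)
    kept = [row for row, ok in zip(batch_rows, validity) if ok]
    assert len(kept) == len(addrs)  # sizes now match for the merge
    return kept, addrs
```

Without the validity mask, `kept` would stay full-length while `addrs` shrank, reproducing the "different sizes" merge failure.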
…-format#6120) Currently, if a user manually sets a compression type (e.g. zstd), we can inject FSST compression underneath, resulting in `zstd(fsst(data))` rather than the `zstd(data)` the user intended. This PR updates our auto-injection of FSST to only occur if the user has not specified a compression type manually. Related to lance-format#5248 Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
Some quantizers (just RQ at the moment) don't actually need to sample any training data. In some paths (e.g. the fallback sampling with lots of nulls) this could lead to errors because it would actually still do some sampling. This adds a shortcut to immediately return an empty array if no sampling is required.
…6114) Set `skip_transpose=True` so the vector index builder does not transpose the quantized vectors. This is useful for distributed indexing, as we don't need to inverse-transpose and transpose again when merging indices.
…rmat#6122) Vector index still needs to be opened to get the right type, otherwise it is shown as unknown.
…ance-format#6113) Just fix `indexes` to `indices` for uniformity.
…t#6046) In mask_to_offset_ranges, the RangeWithBitmap case advanced the bitmap iterator using a global offset (addr - range.start + offset_start) instead of a range-local position (addr - range.start). When a RangeWithBitmap segment appeared after other segments (offset_start > 0), the iterator was advanced past its end, causing a panic. The fix separates range-local iteration from the final offset calculation: iterate the bitmap using position_in_range, then add offset_start at the end. Includes an integration test that reproduces the panic through the user-facing API: write 2 fragments with stable row IDs, delete some rows, compact, create a BTree index, then run a filtered scan. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Xuanwo <github@xuanwo.io> Co-authored-by: Amp <amp@ampcode.com>
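The off-by-offset bug above can be illustrated with a small Python sketch (hypothetical structures, not Lance's actual `RangeWithBitmap` type): iterate the bitmap by a range-local position, then add the segment's `offset_start` only at the end.

```python
def bitmap_to_offsets(range_start, bitmap, offset_start):
    """Corrected logic: index the bitmap with a range-local position.
    The bug used the global `addr - range_start + offset_start` as the
    bitmap index, which runs past the bitmap's end whenever the segment
    follows earlier segments (offset_start > 0)."""
    out = []
    for addr in range(range_start, range_start + len(bitmap)):
        position_in_range = addr - range_start      # range-local, always in bounds
        if bitmap[position_in_range]:
            out.append(position_in_range + offset_start)  # globalize at the end
    return out
```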
…tistics (lance-format#5805) These are just some handy scripts for looking at community health. --- I used Claude Opus 4.5 (claude-opus-4-5-20251101) in the creation of these scripts. I have reviewed the contents and take full responsibility for them.
…ze (lance-format#6117)

# Preserve stable row-id entries during scalar index optimize

Fixes lance-format#6116

## Summary

This PR fixes a bug where `optimize_indices()` could drop valid BTree index entries when the dataset used stable row IDs. I hit this while building the music module of [StaticFlow](https://ackingliu.top/), my personal project built on top of Lance/LanceDB. The `songs` dataset uses:

- `enable_stable_row_ids = true`
- a BTree scalar index on `id`

After running:

1. `compact_files()`
2. `optimize_indices()`

full scans still returned the expected rows, but indexed equality lookups such as `id = 'song-42'` returned no rows.

## Root Cause

The old optimize path filtered old BTree rows with logic equivalent to:

```rust
valid_fragments.contains((row_id >> 32) as u32)
```

That is correct for address-style row IDs:

```text
row_id = (fragment_id << 32) | row_offset
```

But it is incorrect for stable row IDs, because stable row IDs are opaque logical IDs and do not encode fragment ownership in their upper bits. As a result, valid old index rows could be removed during optimize even though the underlying rows were still present after compaction.

## What This PR Changes

- adds an explicit old-data filter mode to scalar index update
- keeps fragment-based filtering for address-style row IDs
- builds an exact retained row-ID set for stable-row-ID datasets from persisted row-id sequences
- filters old BTree rows by exact row-ID membership for the stable-row-ID case
- adds regression coverage for both the BTree update path and the end-to-end compaction plus optimize flow

## Implementation Notes

The key change is to stop assuming that every `row_id` can be interpreted as a row address. For stable-row-ID datasets, the optimize path now:

1. computes the retained old fragments
2. loads their row-ID sequences
3. builds one exact retained row-ID set
4. keeps only old index rows whose row IDs are still valid

This preserves the existing fast path for address-style row IDs and only uses exact row-ID filtering when the dataset actually uses stable row IDs.

## Additional Context

I also wrote a longer deep dive covering the bug, the stable-row-ID model, and the full repair process:

- https://ackingliu.top/posts/lance-stable-row-id-deep-dive

## Final Note

This PR document and parts of the implementation work were prepared with assistance from Codex GPT-5.4. The final patch has been reviewed by me personally, and it has been running normally in my production StaticFlow environment for one week.
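The two filtering modes contrasted above (fragment-based for address-style row IDs, exact-set membership for stable row IDs) can be sketched in a few lines of Python. The function names and set representations are illustrative only.

```python
def keep_address_style(row_id, valid_fragments):
    """Address-style row IDs encode the owning fragment in the upper 32 bits,
    so fragment membership can be checked with a shift."""
    return (row_id >> 32) in valid_fragments


def keep_stable_style(row_id, retained_row_ids):
    """Stable row IDs are opaque logical IDs; the only correct check is exact
    membership in a set built from the retained fragments' row-id sequences."""
    return row_id in retained_row_ids
```

Applying the first check to a stable row ID effectively tests `row_id >> 32` against fragment IDs, which is meaningless for opaque IDs and is exactly how valid index entries got dropped.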
…at#6084) Closes lance-format#3291

```python
stats = dataset.cleanup_old_versions(
    older_than=(datetime.now() - moment),
    delete_rate_limit=100.0,
)
```

--------- Co-authored-by: YueZhang <zhangyue.1010@bytedance.com>
…6146) fix CI error: `FAILED python/tests/test_integration.py::test_duckdb_pushdown_extension_types - _duckdb.Error: DeprecationWarning: fetch_arrow_table() is deprecated, use to_arrow_table() instead.`
20%+ faster for a 2 GB index; could be more for larger indexes.
There was a conflict table in transaction.rs, but it was incomplete (some rows/columns missing) and seemed imprecise or incorrect in a few spots. I've attempted to document this more thoroughly in transaction.md instead.
…ance-format#6160) Previously, `adjust_child_validity` would call `ArrayData::try_new` with a null bitmap on a `DataType::Null` array, causing an `.unwrap()` panic with `InvalidArgumentError("Arrays of type Null cannot contain a null bitmask")`. The trigger: when a user inserts rows where a struct sub-field has only null values, Arrow infers `DataType::Null` for that column. If a subsequent fragment omits that nullable sub-field, Lance inserts a `NullReader` to fill it in. `MergeStream` then merges the real batch (with null struct rows) and the `NullReader` batch (all-null struct), recursing into the struct where `adjust_child_validity` is called with the `Null`-typed child and a non-empty parent validity — triggering the panic. Fix: skip the bitmask operation when `child.data_type() == DataType::Null`. A `Null` array is always entirely null by definition and needs no validity adjustment. Closes lance-format#6159 --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
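The fix above boils down to a type check before applying the parent validity. A hedged Python sketch of that rule (type tags and names are illustrative, not Arrow's actual internals):

```python
def adjust_child_validity(child_type, child_values, parent_validity):
    """A Null-typed child is entirely null by definition, so applying a
    validity bitmask is both unnecessary and forbidden by Arrow
    ("Arrays of type Null cannot contain a null bitmask"). Skip it."""
    if child_type == "null":
        return child_values  # every value is already None; nothing to adjust
    # For any other type, null-out positions where the parent struct is null.
    return [v if ok else None for v, ok in zip(child_values, parent_validity)]
```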
…e-format#6163) Previously, when `FragReuseIndexDetails` exceeded 204800 bytes (triggered by large compactions with many fragments), the code wrote the details to an external file (`details.binpb`). On local filesystems, `ObjectStore::create` returns a `LocalWriter` that atomically renames a temp file to the final path in `Writer::shutdown`. However, `frag_reuse.rs` imported `tokio::io::AsyncWriteExt` but not `lance_io::traits::Writer`, so `writer.shutdown()` resolved to `AsyncWriteExt::shutdown` (flush/close only) — the temp file was deleted on drop without being persisted. Any subsequent `load_indices` call would fail with `Not found: .../details.binpb`. Fixed by using UFCS `Writer::shutdown(writer.as_mut()).await?` to explicitly call the lance trait method, matching the existing pattern in `ivf.rs` and `blob.rs`. Fixes lance-format#6161 --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This breaks the "build_partitions" stage into "build_partitions" and "merge_partitions", and also updates the progress reporting on the shuffle phase to be in terms of rows instead of batches.
This PR moves a few unrelated clippy cleanups out of lance-format#6168 so the blob empty-range fix can stay focused on the regression it addresses. The changes here are all mechanical simplifications with no intended behavior change.
…t#6175) This PR moves the Linux and Windows workflows that currently run on Warp onto GitHub-hosted runners. The goal is to reduce reliance on custom runners and take advantage of the sponsored larger GitHub-hosted machines for the slowest CI paths. This is focused on the current CI bottlenecks we observed in recent successful PR runs, especially Rust ARM and Python Windows jobs, while keeping the existing macOS and benchmark-specific runners unchanged until we verify equivalent GitHub-hosted options for them.

Context:
- Recent PR history shows Rust `linux-arm` and Python `windows` as the dominant critical-path jobs.
- This change upgrades those jobs to larger GitHub-hosted runners where available (`ubuntu-24.04-8x`, `ubuntu-24.04-arm64-8x`, `windows-latest-4x`) and aligns the remaining Linux/Windows workflows with the same runner family.
- I validated the workflow YAML locally after the runner migration; no product code or test logic changed.

Updates:
- Rust linux-arm: 40.7 -> 19.4, about -52%
- Rust windows-build: 27.7 -> 21.0, about -24%
- Python windows: 36.5 -> 23.1, about -37%
- Python Linux 3.13 ARM: 26.9 -> 20.7, about -23%
- Python Linux 3.13 x86_64: 26.8 -> 19.1, about -29%
- Python Linux 3.9 x86_64: 25.9 -> 19.2, about -26%
Improves the alicloud storage config doc (lance-format#4247). Signed-off-by: FarmerChillax <farmerchillax@outlook.com>
…-format#6488) ## Summary - Add missing `(SubIndexType::Flat, QuantizationType::FlatBin)` match arm in `optimize_vector_indices_v2` The v2 function handles all other sub-index/quantization combinations but misses the FlatBin case for binary vector IVF_FLAT indices, hitting the catch-all `unimplemented!` panic during incremental indexing (`optimize_indices`). The v1 function already handles this correctly.
…t#6435) This teaches `merge_insert` to keep the delete-by-source fast path even when a scalar index exists on the join key. The actual indexed join path is still only used when unmatched target rows are kept, so the presence of index metadata should not force these operations back to the legacy full-join path. This also adds regression coverage for full-schema `FixedSizeList` merges with `when_not_matched_by_source(Delete)` both with and without a scalar index. That closes the gap behind lance-format#6195 and preserves the earlier fix for lancedb/lancedb#3094.
…lance-format#6477)

## Summary

- Change `DataFile.fields` and `DataFile.column_indices` from `Vec<i32>` to `Arc<[i32]>` so that fragments with identical field lists share a single heap allocation
- Add `DataFileFieldInterner` that deduplicates these slices during manifest deserialization
- In homogeneous tables (the common case), every fragment carries the same field list, so at 20M fragments this saves **~2.4 GB** of redundant heap allocations

## Motivation

When dataset manifests grow large (>1 GB with millions of fragments), opening the dataset becomes very expensive in terms of memory. Each `DataFile` previously owned its own `Vec<i32>` for `fields` and `column_indices`, even though in most tables every fragment has the exact same field list. This PR deduplicates those allocations at deserialization time.

### Per-fragment memory breakdown (before)

| Field | Size per fragment |
|-------|------------------|
| `fields: Vec<i32>` (10 fields) | ~64 bytes |
| `column_indices: Vec<i32>` (10 cols) | ~64 bytes |
| **Total redundant** | **~128 bytes x 20M = ~2.4 GB** |

### After this change

With interning, all 20M fragments share a single `Arc<[i32]>` allocation (~80 bytes total instead of 2.4 GB).

## Changes

- **`lance-table/src/format/fragment.rs`** — Core struct change (`Vec<i32>` → `Arc<[i32]>`), custom `Serialize`/`Deserialize` impls, and `DataFileFieldInterner`
- **`lance-table/src/format/manifest.rs`** — Use the interner during manifest deserialization
- **`lance/src/dataset/fragment.rs`**, **`merge_insert.rs`**, **`io/commit.rs`** — Tombstoning and field remapping rebuild a new `Arc<[i32]>` instead of mutating in place
- **`python/src/fragment.rs`**, **`java/lance-jni/src/fragment.rs`** — FFI boundary conversions
- Various test files — Updated struct literals and assertions

## Compatibility

- No format change — protobuf schema is unchanged
- Serde JSON output is identical (custom impl serializes `Arc<[i32]>` as `[i32]`)
- All public API signatures that take `Vec<i32>` (e.g., `DataFile::new()`, `Fragment::add_file()`) still accept `Vec<i32>` and convert internally

🤖 Generated with [Claude Code](https://claude.com/claude-code)

--------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mory (lance-format#6499)

## Summary

- Change `RowDatasetVersionMeta::Inline` from `Vec<u8>` to `Arc<[u8]>` so that fragments with identical version metadata share a single heap allocation
- Extend `DataFileFieldInterner` to deduplicate these inline byte payloads during manifest deserialization
- Introduce `InternCache<T>`: a hybrid cache that uses Vec linear scan for ≤16 entries and upgrades to HashMap for larger caches
- Add custom `Serialize`/`Deserialize` impls for `RowDatasetVersionMeta` to handle `Arc<[u8]>` transparently

## Motivation

Follow-up to lance-format#6477 (interning `DataFile.fields`/`column_indices`). After a compaction, all fragments are stamped with the same version metadata (both `last_updated_at_version_meta` and `created_at_version_meta`), but each fragment previously owned its own `Vec<u8>` copy.

### Per-fragment memory breakdown (before)

| Field | Size per fragment |
|-------|------------------|
| `last_updated_at_version_meta: Inline(Vec<u8>)` | ~24 bytes + payload |
| `created_at_version_meta: Inline(Vec<u8>)` | ~24 bytes + payload |
| **Total redundant at 20M fragments** | **~480 MB+** |

### After this change

With interning, all 20M fragments share a single `Arc<[u8]>` allocation per unique payload.

## Benchmark results

Microbenchmark at 100K fragments (10 fields per fragment):

| Scenario | No interning | With interning | Delta |
|----------|-------------|----------------|-------|
| **Uniform (1 unique version)** | 24.5 ms | 17.9 ms | **27% faster** |
| **Diverse (10 unique)** | 25.7 ms | 19.7 ms | **23% faster** |
| **Diverse (100 unique)** | 26.0 ms | 23.4 ms | **10% faster** |
| **Diverse (500 unique)** | 26.0 ms | 22.8 ms | **12% faster** |

| Memory (100K fragments) | No interning | With interning | Savings |
|------------------------|-------------|----------------|---------|
| **10 fields** | 39.47 MB | 29.74 MB | **24.6%** |
| **50 fields** | 69.99 MB | 29.74 MB | **57.5%** |

Both memory and speed improve across all scenarios. The hybrid `InternCache` uses a fast Vec scan for the common case (1-3 unique values) and upgrades to a HashMap when diversity exceeds 16 entries.

Run with: `cargo bench -p lance-table --bench manifest_intern`

## Changes

- **`rust/lance-table/src/rowids/version.rs`** — `Inline(Vec<u8>)` → `Inline(Arc<[u8]>)`, custom serde impls, updated protobuf conversions
- **`rust/lance-table/src/format/fragment.rs`** — `InternCache<T>` (Vec/HashMap hybrid), extended `DataFileFieldInterner` with version meta interning
- **`rust/lance-table/benches/manifest_intern.rs`** — Microbenchmark covering uniform and diverse scenarios

## Compatibility

- No format change — protobuf schema is unchanged
- Serde JSON output is identical (custom impl serializes `Arc<[u8]>` as `[u8]`)
- `from_sequence()` still works as before (converts internally)

## Test plan

- [x] `cargo check --workspace --tests` passes
- [x] `cargo clippy -p lance-table -p lance -- -D warnings` passes
- [x] All 88 `lance-table` tests pass
- [x] `cargo fmt --all -- --check` passes
- [x] Microbenchmark validates performance across uniform and diverse scenarios
- [ ] CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

--------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
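The hybrid `InternCache` described above can be modeled in a few lines of Python. This is a toy sketch of the idea (linear scan while small, dict once diversity grows); the threshold and names follow the PR description, and tuples stand in for `Arc<[u8]>`.

```python
class InternCache:
    """Deduplicate values: linear scan over a small list, upgrading to a
    hash map once the number of unique entries exceeds a threshold."""

    UPGRADE_AT = 16  # threshold from the PR description

    def __init__(self):
        self._small = []    # few unique values: O(n) scan is cache-friendly
        self._large = None  # dict keyed by value, once upgraded

    def intern(self, value):
        value = tuple(value)  # hashable, shared stand-in for Arc<[u8]>
        if self._large is not None:
            return self._large.setdefault(value, value)
        for existing in self._small:
            if existing == value:
                return existing  # reuse the already-stored allocation
        self._small.append(value)
        if len(self._small) > self.UPGRADE_AT:
            self._large = {v: v for v in self._small}  # one-time upgrade
        return value
```

Interning equal payloads returns the *same* object, which is the whole memory win: 20M fragments pointing at one shared allocation instead of 20M copies.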
…rmat#6308)

- `list_all_tables`
- `restore_table`
- `update_table_schema_metadata`
- `get_table_stats`
- `explain_table_query_plan`
- `analyze_table_query_plan`

--------- Co-authored-by: zhangyue19921010 <zhangyue.1010@bytedance.com>
## Summary

- Adds `#[instrument]` attributes from the `tracing` crate to key functions across the `mem_wal` module
- Covers the write path (`RegionWriter::open`, `put`, `close`), flush path (`MemTableFlusher::flush`, `flush_with_indexes`), WAL operations, manifest store, memtable inserts, scanner/planner, point lookups, and vector search
- Uses appropriate trace levels (`info` for high-level operations, `debug` for internals) with relevant fields (region_id, epoch, row counts, batch counts)

## Test plan

- [x] `cargo check` passes — no functional changes, only attribute additions
- [x] Existing `mem_wal` tests continue to pass
- [ ] Tracing output verified with `RUST_LOG=debug` showing instrumented spans

🤖 Generated with [Claude Code](https://claude.com/claude-code)

--------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
)

## Summary

Refactor `FullZipScheduler::create_page_load_task` to accept a pre-submitted I/O future instead of deferring I/O submission until the async task executes. This allows the I/O requests to be submitted immediately during scheduling, enabling the object store layer to batch and parallelize them.

close lance-format#6504

## I/O Model Change

### Before: Lazy I/O submission (serialized)

Previously, `create_page_load_task` received a `FullZipReadSource::Remote(io)` along with byte ranges and priority. The actual `io.submit_request()` call happened **inside** the async block, meaning the I/O request was not submitted until the future was first polled. When decoding multiple pages (e.g. across many fragments), this created a sequential I/O pattern:

```
Page 1: [schedule] -> [poll] -> [submit I/O] -> [wait response] -> [decode]
Page 2: [schedule] -> [poll] -> [submit I/O] -> [wait response] -> [decode]
Page 3: [schedule] -> [poll] -> ...
```

Each page's I/O request could only be submitted after the previous task started executing. The I/O scheduler had no visibility into upcoming requests, preventing it from batching or parallelizing them effectively.

### After: Eager I/O submission (pipelined)

Now, `io.submit_request()` is called **before** constructing the `PageLoadTask`, and the resulting future is passed into `create_page_load_task`. All I/O requests for all pages are submitted upfront during the scheduling phase:

```
[schedule all pages] --> submit I/O page 1 -+
                     --> submit I/O page 2 -+
                     --> submit I/O page 3 -+  (all in-flight concurrently)
                     --> submit I/O page N -+
                                            |
[poll] -> [await page 1 response] -> [decode]
[poll] -> [await page 2 response] -> [decode]
[poll] -> [await page 3 response] -> [decode]
```

The object store layer can now see all pending requests at once and optimize I/O through batching, connection multiplexing, and parallel fetches. The async tasks only await the already-in-flight I/O futures.

## Changes

- `rust/lance-encoding/src/encodings/logical/primitive.rs`:
  - Changed `create_page_load_task` signature to accept `BoxFuture<'static, Result<Vec<Bytes>>>` instead of `FullZipReadSource` + byte ranges + priority
  - Moved `io.submit_request()` calls to happen eagerly at both call sites (`schedule_ranges_with_rep_index` and the non-rep-index path), before constructing the page load task

## Performance

Tested with a multi-fragment dataset containing fixed-width columns (768-dim float32 vectors, 40 fragments, 50 rows/fragment):

| Benchmark | Before (p50) | After (p50) | Speedup |
|---|---|---|---|
| Fixed-width column scan | 3453 ms | 523 ms | **6.6x** |

The improvement comes entirely from I/O pipelining — the decoding logic itself is unchanged. The effect is most pronounced with many fragments or pages, where the serialized I/O submission was the dominant bottleneck.
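The lazy-vs-eager scheduling difference described above can be demonstrated with a small asyncio toy model (all names and timings are illustrative, not Lance's actual scheduler): submitting every I/O future up front puts all requests in flight before the first result is awaited.

```python
import asyncio


async def fake_io(page, log):
    """Stand-in for io.submit_request(): record submission, then yield once
    to the event loop as a fake network round trip."""
    log.append(f"submit {page}")
    await asyncio.sleep(0)
    log.append(f"done {page}")
    return page


async def lazy(pages, log):
    """Lazy: each request is submitted only when its turn comes (serialized)."""
    return [await fake_io(p, log) for p in pages]


async def eager(pages, log):
    """Eager: create all tasks first; every submission happens before the
    first result is consumed, so the I/O layer sees all requests at once."""
    futures = [asyncio.ensure_future(fake_io(p, log)) for p in pages]
    return [await f for f in futures]
```

In the lazy log, submits and completions strictly alternate; in the eager log, all submits come first, which is exactly the visibility the object store layer needs for batching.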
## Summary

- Add `blob_max_pack_file_bytes` to `WriteParams`, allowing users to override the default 1 GiB maximum pack (`.blob`) sidecar file size
- Thread the configuration through the full write path: `WriteParams` -> `WriterGenerator` -> `WriterOptions` -> `BlobPreprocessor` -> `PackWriter`
- Expose the option in Python (`write_dataset`) and Java (`WriteParams.Builder`) bindings

## Test plan

- [x] All 37 existing blob tests pass (`cargo test -p lance blob`)
- [x] Clippy clean on `lance` and `lance-jni` crates
- [x] Verify Python binding works end-to-end with `blob_max_pack_file_bytes` kwarg
- [x] Verify Java binding compiles with `./mvnw compile`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

--------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary

- Bump `jieba-rs` from 0.8.1 to 0.9.0 to fix the `build-no-lock` CI job
- The `core2` crate v0.4.0 was yanked from crates.io, breaking fresh dependency resolution (`jieba-rs` → `include-flate` → `libflate` → `core2`)
- `jieba-rs` 0.9.0 drops the `include-flate`/`libflate`/`core2` chain entirely, removing 9 transitive dependencies with no API changes

## Test plan

- [x] `cargo check -p lance-index --features tokenizer-jieba` passes
- [x] Verified build succeeds without `Cargo.lock` (simulating the CI job)
- [ ] CI `build-no-lock` job passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

--------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This tightens the repository's environment guidance so language-specific tasks must follow the documented workflow before reporting missing tools or dependencies. For Python work, the docs now make `uv sync --extra tests --extra dev` and `uv run ...` mandatory, and explicitly call out the common failure mode where slow `uv sync` is interrupted or global Python is used instead.
This changes per-base runtime configuration to use exact `ObjectStoreParams` bindings keyed by `BasePath.path` instead of per-base storage option overrides. Dataset-level and write-level store params now act only as fallbacks, while reads, target-base writes, and external blob resolution all consult the same base-scoped binding model. This keeps provider-specific runtime state out of the manifest and follows the direction in discussion lance-format#6307 to keep `BasePath` focused on identity.
This PR vendors the tokenizer stack Lance actually uses into a new `rust/lance-tokenizer` crate and rewires FTS and inverted-index code to depend on it instead of `tantivy` and `lindera-tantivy`. It keeps the existing document and query tokenization semantics in-tree, renames the old FTS document adapter module to `document_tokenizer`, and preserves upstream license headers on vendored code.
…ormat#6517)

## Summary

- Add hand-written AVX2 and AVX-512 VNNI backends for u8 squared L2 distance (`Σ(a-b)²`) in new `l2_u8.rs`
- Add fused single-pass u8 cosine distance kernel in new `cosine_u8.rs` — computes `dot(a,b)`, `‖a‖²`, `‖b‖²` simultaneously, halving memory traffic vs the previous 2-3 pass approach
- Wire both into the `L2 for u8` and `Cosine for u8` trait impls
- Add benchmarks comparing scalar vs SIMD for both kernels

### Algorithmic approach (adapted from [NumKong](https://github.com/ashvardanian/NumKong))

**L2 (AVX2):** Saturating subtraction for `|a-b|`, zero-extend u8→i16, `VPMADDWD(diff, diff)` to square and accumulate into i32. 32 elements/iter.

**L2 (AVX-512 VNNI):** Same abs-diff approach with `VPDPWSSD` for fused square-accumulate. 64 elements/iter.

**Cosine (AVX2):** Zero-extend both vectors to i16, triple `VPMADDWD` per half (a·b, a·a, b·b). 32 elements/iter, single pass.

**Cosine (AVX-512 VNNI):** Same three-accumulator approach with `VPDPWSSD`. 64 elements/iter.

Both kernels use `OnceLock`-based runtime CPU dispatch, falling back to portable scalar on non-x86 platforms.

### Benchmarks

*1M × 1024-dim u8 vectors.*

**x86_64 — AMD Ryzen 5 4500 6-Core (AVX2, no AVX-512)**

| Kernel | Scalar | SIMD | Speedup |
|--------|--------|------|---------|
| L2(u8) | 73.5 ms | 58.2 ms | **1.26x** |
| Cosine(u8) | 122.2 ms | 82.1 ms | **1.49x** |

The L2 auto-vectorization baseline was 91.5 ms, so SIMD is 1.57x faster than that path.

**aarch64 — Apple Silicon M3 Max (no AVX2, scalar fallback)**

| Kernel | Scalar | SIMD (dispatch) |
|--------|--------|-----------------|
| L2(u8) | 26.8 ms | 27.3 ms |
| Cosine(u8) | 90.1 ms | 90.4 ms |

On aarch64 the SIMD path falls through to scalar (no AVX2), so times are identical — this confirms no regression on non-x86 platforms. AVX-512 VNNI systems (Ice Lake+, Zen 4+) should see larger gains.

## Test plan

- [x] All 11 new tests pass: SIMD backends verified against scalar reference across 18 vector sizes (0–4097), boundary values (0/255), alternating patterns, random seeds
- [x] All 63 existing lance-linalg tests pass (no regressions)
- [x] Clippy clean, fmt clean
- [x] Benchmarked on x86_64 AVX2 (AMD Ryzen 5 4500) — L2 1.26x, Cosine 1.49x faster
- [ ] Verify on AVX-512 VNNI system for additional speedup data

🤖 Generated with [Claude Code](https://claude.com/claude-code)

--------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
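For reference, the semantics of the two u8 kernels above can be expressed as scalar Python, mirroring the portable-fallback behavior the SIMD backends are verified against (a sketch, not the actual Lance code). Note how the cosine version accumulates `dot(a,b)`, `‖a‖²`, and `‖b‖²` in one pass, which is the source of the fused kernel's memory-traffic saving.

```python
import math


def l2_squared_u8(a, b):
    """Squared L2 distance Σ(a_i - b_i)² over u8 inputs, accumulated in
    plain integers (no overflow concerns in Python)."""
    return sum((x - y) * (x - y) for x, y in zip(a, b))


def cosine_distance_u8(a, b):
    """Single-pass cosine distance: one loop accumulates the dot product
    and both squared norms simultaneously."""
    dot = norm_a = norm_b = 0
    for x, y in zip(a, b):
        dot += x * y
        norm_a += x * x
        norm_b += y * y
    return 1.0 - dot / math.sqrt(norm_a * norm_b)
```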
This fixes the release bump configuration after `lance-tokenizer` was added to the workspace dependencies. `.bumpversion.toml` was missing the corresponding replacement rule, so version bumps could leave that internal dependency on the previous version. This is a targeted config-only fix to keep the release automation updating all workspace crates consistently.
This fixes the directory namespace CI failure where single-instance concurrent create/drop operations on `__manifest` could time out with `TooMuchWriteContention`, especially in the Windows build. Manifest mutations are now serialized within a single `ManifestNamespace` instance so concurrent operations stop racing on stale in-memory snapshots, and inline manifest maintenance now defers compaction/index merges until the table has accumulated enough fragments. Context: https://github.com/lance-format/lance/actions/runs/24439767878/job/71401857043
Blob columns can be represented either as loaded values or as unloaded descriptor schemas, but our schema projection logic still treated those views as incompatible types. This change teaches field projection and intersection to recognize blob loaded/unloaded pairs as the same logical column, and adds regression coverage for both the core schema path and the projection-plan path that previously failed.
## Summary

- Adds `ChopBatchesStream`, a stream wrapper that splits oversized batches (>1.5x target `batch_size_bytes`) into smaller sub-batches using zero-copy `RecordBatch::slice`
- Wraps the filtered read output stream with `ChopBatchesStream` when `batch_size_bytes` is configured via `FileReaderOptions`
- Serves as a safety net when the underlying file reader doesn't estimate batch sizes accurately enough

**Stacked on feat/byte-sized-batches-file-reader** — wait for that to merge first, then rebase this PR.

## Test plan

- [x] Unit tests for `ChopBatchesStream`: splits large batches, passes small batches through, `wrap_if_needed(None)` is a no-op
- [x] `cargo clippy` clean
- [x] `cargo fmt` clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

--------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
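The chopping rule above (pass batches under 1.5x the target through untouched, slice anything larger into target-sized chunks) can be sketched as a Python generator. Lists and a fixed per-row byte size stand in for Arrow `RecordBatch` and its zero-copy `slice`; the 1.5x threshold comes from the description.

```python
def chop_batches(batches, target_bytes, row_bytes):
    """Yield batches, splitting any batch whose estimated size exceeds
    1.5x the target into roughly target-sized slices."""
    for batch in batches:
        size = len(batch) * row_bytes
        if size <= 1.5 * target_bytes:
            yield batch  # small enough: pass through unchanged
        else:
            rows_per_chunk = max(1, target_bytes // row_bytes)
            for i in range(0, len(batch), rows_per_chunk):
                yield batch[i:i + rows_per_chunk]  # zero-copy slice in Arrow
```

The 1.5x slack avoids chopping batches that only slightly overshoot the target, which would otherwise produce a tiny trailing slice.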
…-format#6503) Add protobuf encode/decode for `ANNIvfSubIndexExec` --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This isolates the `test_memory_leaks` index statistics probe into a fresh subprocess instead of running it inside the long-lived pytest worker process. That keeps the test focused on repeated `index_statistics` calls and avoids false positives from RSS growth left behind by earlier tests such as the recent batch-chopping coverage added in `test_dataset.py`.
…at#6352) This PR improves blob I/O in two complementary ways: `BlobFile` instances that resolve to the same physical object now share a lazy `BlobSource` and can opportunistically coalesce concurrent reads before handing them to Lance's existing scheduler, and datasets now expose a planned `read_blobs` API for materializing blob payloads directly. It also adds explicit cursor-preserving range reads for `BlobFile` across Rust, Python, and Java, with end-to-end Python coverage for the new API and the edge cases it uncovered. This keeps the optimization aligned with Lance's existing scheduler model while giving callers a higher-level path for sequential and batched blob access.

## Python example

```python
import lance

dataset = lance.dataset("/path/to/dataset")
blobs = dataset.read_blobs(
    "images",
    indices=[0, 4, 8],
    target_request_bytes=8 * 1024 * 1024,
    max_gap_bytes=64 * 1024,
    max_concurrency=4,
    preserve_order=True,
)
for row_address, payload in blobs:
    print(row_address, len(payload))
```
…mat#6540)

## Summary

- Adds `f64x4` and `f64x8` SIMD types to `lance-linalg` with support for x86_64 (AVX2/AVX-512), aarch64 (NEON), and loongarch64 (LASX)
- Replaces auto-vectorization-dependent f64 distance functions with explicit SIMD using two-level unrolling (f64x8 + f64x4 + scalar tail)
- Updates norm_l2, dot, L2, and cosine distance for f64

## Benchmark Results (Apple M-series, aarch64 NEON)

1M vectors × 1024 dimensions:

| Benchmark | Before | After | Change |
|-----------|--------|-------|--------|
| NormL2(f64, auto-vec) | 117.76 ms | 116.04 ms | ~same |
| NormL2(f64, SIMD) | N/A (TODO) | 119.16 ms | new |
| Dot(f64, auto-vec) | 129.36 ms | 130.23 ms | ~same |
| L2(f64, auto-vec) | 132.53 ms | 135.15 ms | ~same |
| **Cosine(f64, auto-vec)** | **202.52 ms** | **139.23 ms** | **-31.4%** |

The biggest win is **cosine distance**, which previously had an empty `impl Cosine for f64 {}` falling back to the scalar path. The explicit SIMD implementation is **31% faster**. For norm_l2, dot, and L2, LLVM's auto-vectorization with the LANES=8 hint was already producing good code on this platform. The explicit SIMD ensures consistent performance across compilers and platforms rather than relying on fragile auto-vectorization hints.

## Test plan

- [x] All 59 lance-linalg tests pass
- [x] Clippy clean (`-D warnings`)
- [x] `cargo fmt` clean
- [ ] CI passes on all platforms

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
Summary
- Replace `.unwrap()` with `?` in `FilteredReadStream` to propagate thread join errors instead of panicking
- Fixes: crash in `FilteredReadStream` on task cancellation (lance-format/lance#6545)

Test plan
🤖 Generated with Claude Code