chore: backport from main to release/3.0 branch #6019
wjones127 wants to merge 16 commits into lance-format:release/v3.0 from
Conversation
1. fix Python binding short-circuit for DirectoryNamespace and RestNamespace binding
2. fix local file system access to LanceFileSession for namespace-based access
3. fix propagating storage options to the `__manifest` table in DirectoryNamespace
…nce-format#5995) In full-zip variable packed decoding, rep/def may produce visible rows with empty payloads (for null/invalid items). The decoder previously assumed every visible row had bytes for each child and failed with `Packed struct fixed child exceeds row bounds`. This happened when writing new tables with blob v2. ## Summary - fix `PackedStructVariablePerValueDecompressor` to handle empty packed rows (`row_start == row_end`) - append one per-child placeholder value for empty rows so child builders remain row-aligned --- **Parts of this PR were drafted with assistance from Codex (with `gpt-5.3-codex`) and fully reviewed and edited by me. I take full responsibility for all changes.**
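The empty-row handling described above can be sketched as follows. All names here (`unpack_rows`, the tuple-based row bounds, the `emit_value` callback) are invented for illustration and are not the actual Lance decoder API; the point is only the `row_start == row_end` branch that keeps child builders row-aligned.

```rust
// Hypothetical sketch of the fix: for an empty packed row, emit one
// placeholder value per child instead of assuming every row has bytes.
fn unpack_rows(
    row_bounds: &[(usize, usize)], // (row_start, row_end) byte range per visible row
    num_children: usize,
    mut emit_value: impl FnMut(usize, Option<(usize, usize)>),
) {
    for &(row_start, row_end) in row_bounds {
        if row_start == row_end {
            // Empty packed row (null/invalid item): append one placeholder
            // per child so every child builder still sees this row.
            for child in 0..num_children {
                emit_value(child, None);
            }
        } else {
            // Non-empty row: each child decodes from the row's byte range.
            for child in 0..num_children {
                emit_value(child, Some((row_start, row_end)));
            }
        }
    }
}
```

With this shape, every child receives exactly one value per visible row, which is the invariant the original decoder violated for empty rows.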
- add a warning that `drop_columns` is metadata-only but data can become unrecoverable after `compact_files` + `cleanup_old_versions` - add operational guidance for rollback windows (tag/snapshot, delayed cleanup, validation before aggressive cleanup) --- **Parts of this PR were drafted with assistance from Codex (with `gpt-5.3-codex`) and fully reviewed and edited by me. I take full responsibility for all changes.**
This introduces a DeleteResult with num_rows_deleted, similar to UpdateResult. --------- Co-authored-by: Will Jones <willjones127@gmail.com>
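A minimal sketch of the new result type, assuming only what the description states (a `DeleteResult` carrying `num_rows_deleted`); the derives and field type are assumptions, not the exact Lance definition.

```rust
// Hypothetical shape of the result returned by a delete operation,
// analogous to UpdateResult.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DeleteResult {
    /// Number of rows removed by the delete operation.
    pub num_rows_deleted: u64,
}
```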
From lance-format#5983 (comment), we currently use `CommitConflict` for two situations:

1. Incompatible transactions: there is a conflict that is not retryable. For example, you are trying to create an index, but a concurrent transaction overwrote the table and changed the schema.
2. Commit step ran out of retries: we hit the max number of rebase attempts, and even though we could retry again, we aren't. This is indeed just throttling.

This makes them separate errors.
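The split can be illustrated with a toy error enum; the variant and method names below are invented for illustration, not Lance's actual error types.

```rust
// Illustrative separation of the two cases previously lumped under
// `CommitConflict`.
#[derive(Debug)]
pub enum CommitError {
    /// Non-retryable: the concurrent transaction is incompatible
    /// (e.g. the table was overwritten and the schema changed).
    IncompatibleTransaction { reason: String },
    /// Retryable: we hit the max number of rebase attempts; the caller
    /// may back off and retry the whole commit.
    RetryLimitExceeded { attempts: u32 },
}

impl CommitError {
    pub fn is_retryable(&self) -> bool {
        matches!(self, CommitError::RetryLimitExceeded { .. })
    }
}
```

Callers can then branch on `is_retryable()` instead of guessing which of the two situations a single `CommitConflict` meant.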
…at#6002) Also added helper function `extract_namespace_arc` for shared logic --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…format#6006) Summary:
- short-circuit FTS scans when `fast_search` is enabled and no indexed fragments exist, so we return an empty plan instead of scanning unindexed data
- skip the unindexed-match planning path entirely under `fast_search`, forcing only index-backed queries even when unindexed fragments exist
- add plan verification and a regression test proving `fast_search` excludes rows appended after building the FTS index
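The planning decision described above can be sketched as a small pure function. The enum and function names are invented for illustration; the real planner works on fragments and execution plans, not booleans and counts.

```rust
// Toy model of the fast_search planning decision.
#[derive(Debug, PartialEq)]
enum FtsPlan {
    Empty,             // no data to scan at all
    IndexOnly,         // only index-backed matching
    IndexPlusFlatMatch // index plus a flat match over unindexed fragments
}

fn plan_fts(fast_search: bool, indexed_fragments: usize, unindexed_fragments: usize) -> FtsPlan {
    if fast_search {
        if indexed_fragments == 0 {
            // No indexed fragments: short-circuit to an empty plan instead
            // of falling back to scanning unindexed data.
            return FtsPlan::Empty;
        }
        // Under fast_search, skip the unindexed-match path entirely.
        return FtsPlan::IndexOnly;
    }
    if unindexed_fragments > 0 {
        FtsPlan::IndexPlusFlatMatch
    } else {
        FtsPlan::IndexOnly
    }
}
```

This is why rows appended after the index was built (which live in unindexed fragments) are excluded under `fast_search`.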
…nce-format#6008) This was a bug that would be encountered whenever a list had nullable items (struct or otherwise) but the list itself was never null and never empty. The unraveler was incorrectly skipping offsets and returning fewer lists than it should. This was a reader-only bug. No corrupt data would have been written. Closes lance-format#5930
…ance-format#6007) Before this change, Lance would still perform a flat search when there was no vector index.
In order to compress complex all-null data, we need to add additional parameters to the proto so we know which compression is used for the definition and repetition levels, and the number of values for each. Resolves lance-format#4885 --------- Co-authored-by: stevie9868 <yingjianwu2@email.com> Co-authored-by: Xuanwo <github@xuanwo.io>
…schema (lance-format#5976) ### What Closes lance-format#5642 (incrementally) Enhances "column not found" and "field not found" error messages in `Schema` to suggest the closest matching field name using Levenshtein distance. **Before:** `LanceError(Schema): Column vectr does not exist` **After:** `LanceError(Schema): Column vectr does not exist. Did you mean 'vector'?` ### Changes Single file modified: `lance-core/src/datatypes/schema.rs` - Added `levenshtein_distance()` — standard edit distance with two-row DP optimization - Added `suggest_field()` — finds closest field name (threshold: edit distance ≤ 1/3 of the longer name's length) - Enhanced 3 error sites: - `FieldRef::into_id` — "Field 'X' not found in schema" - `Schema::do_project` — "Column X does not exist" - `Schema::project_by_schema` — "Field X not found" ### Design Decisions - **No new dependencies** — implemented Levenshtein inline rather than adding `strsim` crate - **No new error variants** — enhanced existing `Error::InvalidInput` and `Error::Schema` message strings - **1/3 threshold** — per issue guidance: suggestions only appear when fewer than 1/3 of characters need to change, preventing unhelpful suggestions for completely unrelated names - **Incremental scope** — this PR covers `schema.rs` only; additional error sites (scanner, projection, etc.) can follow ### Testing Added 4 tests: - `test_levenshtein_distance` — 11 assertions covering identical, empty, single-edit, multi-edit, and completely different strings - `test_suggest_field` — 6 assertions: close match, no match, exact match rejection, empty list, short names - `test_suggest_field_edge_cases` — 2 assertions: all-different short names, picks-closest-among-multiple - `test_project_with_suggestion` — integration test: verifies `Schema::project` includes suggestion for typo, and omits it for completely wrong names --------- Co-authored-by: Will Jones <willjones127@gmail.com>
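The two helpers described above can be reconstructed as a standalone sketch: a two-row DP Levenshtein distance and a suggester using the stated threshold (edit distance at most 1/3 of the longer name's length). Signatures are assumptions and the exact `schema.rs` code may differ.

```rust
// Edit distance with the two-row dynamic-programming optimization:
// only the previous and current rows of the full DP matrix are kept.
fn levenshtein_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];
    for (i, &ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j + 1] + 1) // deletion
                .min(curr[j] + 1)           // insertion
                .min(prev[j] + cost);       // substitution
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

// Suggest the closest field name, rejecting exact matches and anything
// needing more than 1/3 of the longer name's length in edits.
fn suggest_field<'a>(name: &str, fields: &[&'a str]) -> Option<&'a str> {
    fields
        .iter()
        .filter(|f| **f != name)
        .map(|f| (*f, levenshtein_distance(name, f)))
        .filter(|(f, d)| *d * 3 <= name.len().max(f.len()))
        .min_by_key(|(_, d)| *d)
        .map(|(f, _)| f)
}
```

For the example from the PR, `suggest_field("vectr", &["vector", "id"])` finds `vector` at distance 1, which passes the threshold, while a completely unrelated name yields no suggestion.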
…r dict decision (lance-format#5891) This PR changes how we decide whether to use dictionary encoding. Instead of cardinality, we now use the number of dict entries and the encoded size. --- **Parts of this PR were drafted with assistance from Codex (with `gpt-5.2`) and fully reviewed and edited by me. I take full responsibility for all changes.**
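A minimal sketch of such a decision rule, assuming an entry cap and a size comparison; the threshold value and names are invented and are not the ones used in the PR.

```rust
// Illustrative dict-vs-plain decision based on dictionary entries and
// encoded sizes rather than raw cardinality.
fn use_dictionary(dict_entries: usize, dict_encoded_size: u64, plain_encoded_size: u64) -> bool {
    const MAX_DICT_ENTRIES: usize = 4096; // assumed cap, not Lance's value
    // Prefer dictionary encoding only when the dictionary stays small
    // and actually shrinks the encoded data.
    dict_entries <= MAX_DICT_ENTRIES && dict_encoded_size < plain_encoded_size
}
```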
…sks (lance-format#5982) `NextDecodeTask::into_batch` is synchronous and can be CPU-heavy. Running it inline in the future poll path blocks Tokio workers and reduces effective decode concurrency. This change becomes more significant when zstd is used.

Benchmarks were run on AWS EC2 using both local and S3 copies of the same dataset (`fineweb.lance.v2_2.lz4`) with repeated scans. Main run (3 rounds, 20 repeats each):
- Local median latency:
  - p50: `894675us -> 289781us` (`3.087x`, `-67.61%`)
  - p95: `929515us -> 307874us` (`3.019x`, `-66.88%`)
  - p99: `1034383us -> 375041us` (`2.758x`, `-63.74%`)
- S3 median latency:
  - p50: `3998660us -> 3510771us` (`1.139x`, `-12.20%`)
  - p95: `4068799us -> 3572090us` (`1.139x`, `-12.21%`)
  - p99: `4153371us -> 3592478us` (`1.156x`, `-13.50%`)

## Changes
- move structural decode batch conversion in `StructuralBatchDecodeStream::into_stream` to `tokio::spawn(...).await`
- **Bump datafusion requirement to 52** - **ruff format** - **fix: use fields_with_udf for aggregate type coercion (DF52)** - **fix: use OutputBatches metric variant for DF52 compatibility** --------- Co-authored-by: Tim Saucer <timsaucer@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
From a snapshot isolation perspective, two overwrites should not automatically be compatible, because they modify overlapping data. We now mark this as retryable, and the client can decide whether to retry depending on the specific case.
## Code Review

I reviewed this backport PR from main to release/3.0. Here are the key points to consider:

**P1: Breaking API Change - `DeleteBuilder::execute()`**

The return type of `execute` changed:

```rust
// Before
pub async fn execute(self) -> Result<Arc<Dataset>>

// After
pub async fn execute(self) -> Result<DeleteResult>
```

Consider whether this is acceptable for a backport to a release branch, or if it should be released in a major/minor version.

**P1: Behavioral Change - Concurrent Overwrites**

The conflict resolution behavior for concurrent overwrites changed. Users who previously relied on concurrent overwrites both succeeding will now need to handle retry logic. The test update confirms this is an intentional change:

```rust
// Before: [Compatible; 9] for all operations
// After: Retryable for overwrite vs overwrite
```

This is consistent with the related PR #6014 ("make overwrites retryable instead of compatible"), so this appears intentional, but it is worth highlighting in the release notes. The implementation quality looks good, with proper test coverage for the new behaviors.