chore: backport from main to release/3.0 branch#6019

Closed
wjones127 wants to merge 16 commits into lance-format:release/v3.0 from wjones127:backport-3.0.0

Conversation

@wjones127
Contributor

No description provided.

jackye1995 and others added 16 commits February 25, 2026 15:10
1. fix the Python binding short-circuit for DirectoryNamespace and
RestNamespace
2. fix local file system access via LanceFileSession for namespace-based
access
3. fix propagation of storage options to the __manifest table in
DirectoryNamespace
…nce-format#5995)

In full-zip variable packed decoding, rep/def may produce visible rows
with empty payloads (for null/invalid items). The decoder previously
assumed every visible row had bytes for each child and failed with
`Packed struct fixed child exceeds row bounds`.

This happened when writing new tables with blob v2.

## Summary

- fix `PackedStructVariablePerValueDecompressor` to handle empty packed
rows (`row_start == row_end`)
- append one per-child placeholder value for empty rows so child
builders remain row-aligned
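
The fix above can be sketched as follows. This is a simplified, hypothetical model: `ChildBuilder` and `decode_row` are illustrative stand-ins, not the actual lance-encoding types, and the real decompressor works on byte payloads rather than plain integers.

```rust
// Hypothetical sketch of the row-alignment fix; names are illustrative.
struct ChildBuilder {
    values: Vec<Option<u32>>,
}

impl ChildBuilder {
    fn append_value(&mut self, v: u32) { self.values.push(Some(v)); }
    fn append_placeholder(&mut self) { self.values.push(None); }
}

/// Decode one visible row spanning `row_start..row_end`.
/// An empty span (`row_start == row_end`) means the row is null/invalid:
/// each child still gets exactly one entry so builders stay row-aligned.
fn decode_row(children: &mut [ChildBuilder], row: &[u32], row_start: usize, row_end: usize) {
    if row_start == row_end {
        // Previously this case was assumed impossible and the decoder
        // errored with "Packed struct fixed child exceeds row bounds".
        for child in children.iter_mut() {
            child.append_placeholder();
        }
        return;
    }
    for (i, child) in children.iter_mut().enumerate() {
        child.append_value(row[row_start + i]);
    }
}
```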

---

**Parts of this PR were drafted with assistance from Codex (with
`gpt-5.3-codex`) and fully reviewed and edited by me. I take full
responsibility for all changes.**
- add a warning that `drop_columns` is metadata-only but data can become
unrecoverable after `compact_files` + `cleanup_old_versions`
- add operational guidance for rollback windows (tag/snapshot, delayed
cleanup, validation before aggressive cleanup)

---

**Parts of this PR were drafted with assistance from Codex (with
`gpt-5.3-codex`) and fully reviewed and edited by me. I take full
responsibility for all changes.**
This introduces a DeleteResult with num_rows_deleted, similar to
UpdateResult.
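
A minimal sketch of the shape of this change, assuming only what the PR states (a `DeleteResult` carrying `num_rows_deleted`); the `delete_rows` helper below is purely illustrative, not lance's API:

```rust
/// Sketch of the new result type; only `num_rows_deleted` is confirmed
/// by this PR, anything else is an assumption.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct DeleteResult {
    /// Number of rows removed by the delete operation.
    pub num_rows_deleted: u64,
}

/// Illustrative delete that reports how many rows it removed.
fn delete_rows(rows: &mut Vec<u64>, predicate: impl Fn(u64) -> bool) -> DeleteResult {
    let before = rows.len();
    rows.retain(|r| !predicate(*r));
    DeleteResult { num_rows_deleted: (before - rows.len()) as u64 }
}
```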

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
From
lance-format#5983 (comment),
we currently use `CommitConflict` for two situations:

1. Incompatible transactions: there is a conflict that is not
retry-able. For example, you are trying to create an index, but a
concurrent transaction overwrote the table and changed the schema.
1. Commit step ran out of retries: we hit the max number of rebase
attempts; even though another retry could succeed, we stop. This is
effectively just throttling.

This makes them separate errors.
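
The split can be sketched like this. The variant and field names are hypothetical; lance's actual error types differ, but the retryable/non-retryable distinction is the point:

```rust
// Illustrative only: sketches the two situations that previously both
// surfaced as `CommitConflict`.
#[derive(Debug, PartialEq)]
enum CommitError {
    /// Conflict that cannot be resolved by retrying, e.g. a concurrent
    /// transaction overwrote the table and changed the schema.
    IncompatibleTransaction { reason: String },
    /// Rebase retries exhausted; the caller may retry later (throttling).
    RetryLimitExceeded { attempts: u32 },
}

/// Only the throttling case is worth retrying.
fn should_retry(err: &CommitError) -> bool {
    matches!(err, CommitError::RetryLimitExceeded { .. })
}
```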
…at#6002)

Also added helper function `extract_namespace_arc` for shared logic

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…format#6006)

Summary
- short-circuit FTS scans when `fast_search` is enabled and no indexed
fragments exist so we return an empty plan instead of scanning unindexed
data
- skip the unindexed-match planning path entirely under `fast_search`,
forcing only index-backed queries even when fragments exist
- add plan verification and a regression test proving `fast_search`
excludes rows appended after building the FTS index
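
The planning rule can be summarized in a small decision function. Everything here (the `FtsPlan` enum, fragment counts as plain integers) is a hypothetical simplification of the real planner:

```rust
// Sketch of the fast_search planning rule; names are illustrative.
#[derive(Debug, PartialEq)]
enum FtsPlan {
    /// fast_search with no indexed fragments: empty plan, no scan.
    Empty,
    /// fast_search: only index-backed queries, unindexed fragments skipped.
    IndexOnly { indexed_fragments: usize },
    /// Default: indexed query plus a flat scan of unindexed fragments.
    IndexPlusFlatScan { indexed: usize, unindexed: usize },
}

fn plan_fts(fast_search: bool, indexed: usize, unindexed: usize) -> FtsPlan {
    if fast_search {
        if indexed == 0 {
            FtsPlan::Empty
        } else {
            FtsPlan::IndexOnly { indexed_fragments: indexed }
        }
    } else {
        FtsPlan::IndexPlusFlatScan { indexed, unindexed }
    }
}
```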
…nce-format#6008)

This was a bug that would be encountered whenever a list had nullable
items (struct or otherwise) but the list itself was never null and never
empty. The unraveler was incorrectly skipping offsets and returning
fewer lists than it should.

This was a reader-only bug.  No corrupt data would have been written.

Closes lance-format#5930
…ance-format#6007)

Before this, lance still performs flat search if there's no vector
index.
In order to compress complex all-null data, we need to add additional
parameters in the proto so we know which compression is used for the
definition and repetition levels, and the number of values in each,
accordingly.

resolve lance-format#4885

---------

Co-authored-by: stevie9868 <yingjianwu2@email.com>
Co-authored-by: Xuanwo <github@xuanwo.io>
…schema (lance-format#5976)

### What

Closes lance-format#5642 (incrementally)

Enhances "column not found" and "field not found" error messages in
`Schema` to suggest the closest matching field name using Levenshtein
distance.

**Before:**
`LanceError(Schema): Column vectr does not exist`

**After:**
`LanceError(Schema): Column vectr does not exist. Did you mean
'vector'?`

### Changes

Single file modified: `lance-core/src/datatypes/schema.rs`

- Added `levenshtein_distance()` — standard edit distance with two-row
DP optimization
- Added `suggest_field()` — finds closest field name (threshold: edit
distance ≤ 1/3 of the longer name's length)
- Enhanced 3 error sites:
  - `FieldRef::into_id` — "Field 'X' not found in schema"
  - `Schema::do_project` — "Column X does not exist"  
  - `Schema::project_by_schema` — "Field X not found"

### Design Decisions

- **No new dependencies** — implemented Levenshtein inline rather than
adding `strsim` crate
- **No new error variants** — enhanced existing `Error::InvalidInput`
and `Error::Schema` message strings
- **1/3 threshold** — per issue guidance: suggestions only appear when
fewer than 1/3 of characters need to change, preventing unhelpful
suggestions for completely unrelated names
- **Incremental scope** — this PR covers `schema.rs` only; additional
error sites (scanner, projection, etc.) can follow
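
A standalone re-implementation of the two-row Levenshtein DP and the 1/3 threshold described above; the real code lives in `lance-core/src/datatypes/schema.rs` and may differ in detail.

```rust
/// Edit distance with the two-row DP optimization: only the previous
/// and current rows of the full DP matrix are kept.
fn levenshtein_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j + 1] + 1) // deletion
                .min(curr[j] + 1)           // insertion
                .min(prev[j] + cost);       // substitution
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

/// Suggest the closest candidate whose edit distance is at most 1/3 of
/// the longer name's length; exact matches are excluded.
fn suggest_field<'a>(name: &str, candidates: &[&'a str]) -> Option<&'a str> {
    let name_len = name.chars().count();
    candidates
        .iter()
        .filter(|c| **c != name)
        .map(|c| (levenshtein_distance(name, c), *c))
        .filter(|(d, c)| *d * 3 <= name_len.max(c.chars().count()))
        .min_by_key(|(d, _)| *d)
        .map(|(_, c)| c)
}
```

With this sketch, `suggest_field("vectr", &["vector", "id"])` yields `Some("vector")`, matching the "Did you mean 'vector'?" example above, while an unrelated name like `"zzz"` produces no suggestion.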

### Testing

Added 4 tests:
- `test_levenshtein_distance` — 11 assertions covering identical, empty,
single-edit, multi-edit, and completely different strings
- `test_suggest_field` — 6 assertions: close match, no match, exact
match rejection, empty list, short names
- `test_suggest_field_edge_cases` — 2 assertions: all-different short
names, picks-closest-among-multiple
- `test_project_with_suggestion` — integration test: verifies
`Schema::project` includes suggestion for typo, and omits it for
completely wrong names

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
…r dict decision (lance-format#5891)

This PR changes how we decide whether to use dictionary encoding.
Instead of cardinality, we now use the number of dictionary entries and
the encoded size.

---

**Parts of this PR were drafted with assistance from Codex (with
`gpt-5.2`) and fully reviewed and edited by me. I take full
responsibility for all changes.**
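
The new decision rule might look roughly like this. Both the function shape and the entry cap are assumptions for illustration; they are not lance's actual values:

```rust
/// Hypothetical sketch of the new heuristic: choose dictionary encoding
/// based on entry count and estimated encoded sizes rather than raw
/// cardinality. Thresholds and names are illustrative.
fn use_dictionary(num_entries: usize, dict_encoded_size: usize, plain_encoded_size: usize) -> bool {
    // Assumed cap on dictionary entries, not lance's actual constant.
    const MAX_DICT_ENTRIES: usize = 4096;
    num_entries <= MAX_DICT_ENTRIES && dict_encoded_size < plain_encoded_size
}
```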
…sks (lance-format#5982)

`NextDecodeTask::into_batch` is synchronous and can be CPU-heavy.
Running it inline in the future poll path blocks Tokio workers and
reduces effective decode concurrency.

This change becomes more meaningful when using zstd.

Benchmarks were run on AWS EC2 using both local and S3 copies of the
same dataset (`fineweb.lance.v2_2.lz4`) with repeated scans.

Main run (3 rounds, 20 repeats each):
- Local median latency:
  - p50: `894675us -> 289781us` (`3.087x`, `-67.61%`)
  - p95: `929515us -> 307874us` (`3.019x`, `-66.88%`)
  - p99: `1034383us -> 375041us` (`2.758x`, `-63.74%`)
- S3 median latency:
  - p50: `3998660us -> 3510771us` (`1.139x`, `-12.20%`)
  - p95: `4068799us -> 3572090us` (`1.139x`, `-12.21%`)
  - p99: `4153371us -> 3592478us` (`1.156x`, `-13.50%`)


## Changes

- move structural decode batch conversion in
`StructuralBatchDecodeStream::into_stream` to `tokio::spawn(...).await`
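
The shape of the change: the synchronous, CPU-heavy `into_batch` call moves out of the poll path onto a separate worker. This dependency-free sketch uses std threads to show the same offloading pattern; the actual PR uses `tokio::spawn(...).await` inside the decode stream, and `into_batch` here is a trivial stand-in for the real conversion work.

```rust
use std::thread;

// Stand-in for the CPU-heavy decode/batch-conversion work that was
// previously run inline in the future's poll path.
fn into_batch(task_id: u32) -> String {
    format!("batch-{task_id}")
}

/// Offload each conversion to its own worker instead of running it
/// inline, so the caller's (event-loop) thread is never blocked.
fn decode_offloaded(task_ids: Vec<u32>) -> Vec<String> {
    let handles: Vec<_> = task_ids
        .into_iter()
        .map(|id| thread::spawn(move || into_batch(id)))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```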
- **Bump datafusion requirement to 52**
- **ruff format**
- **fix: use fields_with_udf for aggregate type coercion (DF52)**
- **fix: use OutputBatches metric variant for DF52 compatibility**

---------

Co-authored-by: Tim Saucer <timsaucer@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
From a snapshot-isolation perspective, two overwrites should not simply
be compatible when they modify overlapping data between the two writes.
We now mark this as a retryable conflict, and the client can decide
whether to retry depending on the specific case.
@github-actions
Contributor

Code Review

I reviewed this backport PR from main to release/3.0. Here are the key points to consider:

P1: Breaking API Change - DeleteBuilder::execute()

The DeleteBuilder::execute() return type changed from Result<Arc<Dataset>> to Result<DeleteResult> in rust/lance/src/dataset/write/delete.rs. This is a breaking change for downstream users.

// Before
pub async fn execute(self) -> Result<Arc<Dataset>>

// After  
pub async fn execute(self) -> Result<DeleteResult>

Consider whether this is acceptable for a backport to a release branch, or if it should be released in a major/minor version.

P1: Behavioral Change - Concurrent Overwrites

The conflict resolution behavior for concurrent overwrites changed (rust/lance/src/io/commit/conflict_resolver.rs):

  • Before: Two concurrent Overwrite operations were both Compatible
  • After: Two concurrent Overwrite operations return Retryable conflict

Users who previously relied on concurrent overwrites both succeeding will now need to handle retry logic. The test update confirms this intentional change:

// Before: [Compatible; 9] for all operations
// After: Retryable for overwrite vs overwrite

This is consistent with the related PR #6014 ("make overwrites retryable instead of compatible"), so this appears intentional, but worth highlighting for the release notes.
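
The behavioral change can be captured in a tiny conflict table. The types below are hypothetical simplifications of lance's conflict resolver, showing only the overwrite-vs-overwrite cell that changed:

```rust
// Illustrative conflict check: overwrite vs overwrite is now Retryable
// rather than Compatible, so callers decide whether to retry.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Op { Append, Overwrite }

#[derive(Debug, PartialEq)]
enum Conflict { Compatible, Retryable }

fn check_conflict(ours: Op, theirs: Op) -> Conflict {
    match (ours, theirs) {
        // Changed by this backport: previously Compatible.
        (Op::Overwrite, Op::Overwrite) => Conflict::Retryable,
        _ => Conflict::Compatible,
    }
}
```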


The implementation quality looks good - proper test coverage for the new behaviors, new IncompatibleTransaction error type cleanly separates retryable vs non-retryable conflicts, and the Levenshtein suggestion feature for field errors is a nice UX improvement.

@wjones127 wjones127 deleted the backport-3.0.0 branch February 25, 2026 23:19