chore: backport from main to release/3.0 branch #6019
wjones127 wants to merge 16 commits into lance-format:release/v3.0 from
Conversation
1. fix Python binding short-circuit for DirectoryNamespace and RestNamespace binding
2. fix local file system access to LanceFileSession for namespace-based access
3. fix propagating storage options to the `__manifest` table in DirectoryNamespace
…nce-format#5995) In full-zip variable packed decoding, rep/def may produce visible rows with empty payloads (for null/invalid items). The decoder previously assumed every visible row had bytes for each child and failed with `Packed struct fixed child exceeds row bounds`. This happened when writing new tables with blob v2. ## Summary - fix `PackedStructVariablePerValueDecompressor` to handle empty packed rows (`row_start == row_end`) - append one per-child placeholder value for empty rows so child builders remain row-aligned --- **Parts of this PR were drafted with assistance from Codex (with `gpt-5.3-codex`) and fully reviewed and edited by me. I take full responsibility for all changes.**
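The empty-row handling described above can be sketched as follows. All names here (`unpack_rows`, the tuple-based row bounds, the `emit_value` callback) are invented for illustration and are not the actual Lance decoder API; the point is only the `row_start == row_end` branch that keeps child builders row-aligned.

```rust
// Hypothetical sketch of the fix: for an empty packed row, emit one
// placeholder value per child instead of assuming every row has bytes.
fn unpack_rows(
    row_bounds: &[(usize, usize)], // (row_start, row_end) byte range per visible row
    num_children: usize,
    mut emit_value: impl FnMut(usize, Option<(usize, usize)>),
) {
    for &(row_start, row_end) in row_bounds {
        if row_start == row_end {
            // Empty packed row (null/invalid item): append one placeholder
            // per child so every child builder still sees this row.
            for child in 0..num_children {
                emit_value(child, None);
            }
        } else {
            // Non-empty row: each child decodes from the row's byte range.
            for child in 0..num_children {
                emit_value(child, Some((row_start, row_end)));
            }
        }
    }
}
```

With this shape, every child receives exactly one value per visible row, which is the invariant the original decoder violated for empty rows.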
- add a warning that `drop_columns` is metadata-only but data can become unrecoverable after `compact_files` + `cleanup_old_versions` - add operational guidance for rollback windows (tag/snapshot, delayed cleanup, validation before aggressive cleanup) --- **Parts of this PR were drafted with assistance from Codex (with `gpt-5.3-codex`) and fully reviewed and edited by me. I take full responsibility for all changes.**
This introduces a DeleteResult with num_rows_deleted, similar to UpdateResult. --------- Co-authored-by: Will Jones <willjones127@gmail.com>
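A minimal sketch of the new result type, assuming only what the description states (a `DeleteResult` carrying `num_rows_deleted`); the derives and field type are assumptions, not the exact Lance definition.

```rust
// Hypothetical shape of the result returned by a delete operation,
// analogous to UpdateResult.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DeleteResult {
    /// Number of rows removed by the delete operation.
    pub num_rows_deleted: u64,
}
```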
From lance-format#5983 (comment), we currently use `CommitConflict` for two situations:

1. Incompatible transactions: there is a conflict that is not retryable. For example, you are trying to create an index, but a concurrent transaction overwrote the table and changed the schema.
2. Commit step ran out of retries: we hit the max number of rebase attempts, and even though we could retry again, we aren't. This is indeed just throttling.

This makes them separate errors.
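The split can be illustrated with a toy error enum; the variant and method names below are invented for illustration, not Lance's actual error types.

```rust
// Illustrative separation of the two cases previously lumped under
// `CommitConflict`.
#[derive(Debug)]
pub enum CommitError {
    /// Non-retryable: the concurrent transaction is incompatible
    /// (e.g. the table was overwritten and the schema changed).
    IncompatibleTransaction { reason: String },
    /// Retryable: we hit the max number of rebase attempts; the caller
    /// may back off and retry the whole commit.
    RetryLimitExceeded { attempts: u32 },
}

impl CommitError {
    pub fn is_retryable(&self) -> bool {
        matches!(self, CommitError::RetryLimitExceeded { .. })
    }
}
```

Callers can then branch on `is_retryable()` instead of guessing which of the two situations a single `CommitConflict` meant.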
…at#6002) Also added helper function `extract_namespace_arc` for shared logic --------- Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
…format#6006) Summary:
- short-circuit FTS scans when `fast_search` is enabled and no indexed fragments exist, so we return an empty plan instead of scanning unindexed data
- skip the unindexed-match planning path entirely under `fast_search`, forcing only index-backed queries even when unindexed fragments exist
- add plan verification and a regression test proving `fast_search` excludes rows appended after building the FTS index
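The planning decision described above can be sketched as a small pure function. The enum and function names are invented for illustration; the real planner works on fragments and execution plans, not booleans and counts.

```rust
// Toy model of the fast_search planning decision.
#[derive(Debug, PartialEq)]
enum FtsPlan {
    Empty,             // no data to scan at all
    IndexOnly,         // only index-backed matching
    IndexPlusFlatMatch // index plus a flat match over unindexed fragments
}

fn plan_fts(fast_search: bool, indexed_fragments: usize, unindexed_fragments: usize) -> FtsPlan {
    if fast_search {
        if indexed_fragments == 0 {
            // No indexed fragments: short-circuit to an empty plan instead
            // of falling back to scanning unindexed data.
            return FtsPlan::Empty;
        }
        // Under fast_search, skip the unindexed-match path entirely.
        return FtsPlan::IndexOnly;
    }
    if unindexed_fragments > 0 {
        FtsPlan::IndexPlusFlatMatch
    } else {
        FtsPlan::IndexOnly
    }
}
```

This is why rows appended after the index was built (which live in unindexed fragments) are excluded under `fast_search`.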
…nce-format#6008) This was a bug that would be encountered whenever a list had nullable items (struct or otherwise) but the list itself was never null and never empty. The unraveler was incorrectly skipping offsets and returning fewer lists than it should. This was a reader-only bug. No corrupt data would have been written. Closes lance-format#5930
…ance-format#6007) Before this change, Lance would still perform a flat search when there was no vector index.
In order to compress complex all-null data, we need to add additional parameters to the proto so we know which compression is used for the definition and repetition levels, and the number of values for each. Resolves lance-format#4885 --------- Co-authored-by: stevie9868 <yingjianwu2@email.com> Co-authored-by: Xuanwo <github@xuanwo.io>
…schema (lance-format#5976) ### What Closes lance-format#5642 (incrementally) Enhances "column not found" and "field not found" error messages in `Schema` to suggest the closest matching field name using Levenshtein distance. **Before:** `LanceError(Schema): Column vectr does not exist` **After:** `LanceError(Schema): Column vectr does not exist. Did you mean 'vector'?` ### Changes Single file modified: `lance-core/src/datatypes/schema.rs` - Added `levenshtein_distance()` — standard edit distance with two-row DP optimization - Added `suggest_field()` — finds closest field name (threshold: edit distance ≤ 1/3 of the longer name's length) - Enhanced 3 error sites: - `FieldRef::into_id` — "Field 'X' not found in schema" - `Schema::do_project` — "Column X does not exist" - `Schema::project_by_schema` — "Field X not found" ### Design Decisions - **No new dependencies** — implemented Levenshtein inline rather than adding `strsim` crate - **No new error variants** — enhanced existing `Error::InvalidInput` and `Error::Schema` message strings - **1/3 threshold** — per issue guidance: suggestions only appear when fewer than 1/3 of characters need to change, preventing unhelpful suggestions for completely unrelated names - **Incremental scope** — this PR covers `schema.rs` only; additional error sites (scanner, projection, etc.) can follow ### Testing Added 4 tests: - `test_levenshtein_distance` — 11 assertions covering identical, empty, single-edit, multi-edit, and completely different strings - `test_suggest_field` — 6 assertions: close match, no match, exact match rejection, empty list, short names - `test_suggest_field_edge_cases` — 2 assertions: all-different short names, picks-closest-among-multiple - `test_project_with_suggestion` — integration test: verifies `Schema::project` includes suggestion for typo, and omits it for completely wrong names --------- Co-authored-by: Will Jones <willjones127@gmail.com>
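The two helpers described above can be reconstructed as a standalone sketch: a two-row DP Levenshtein distance and a suggester using the stated threshold (edit distance at most 1/3 of the longer name's length). Signatures are assumptions and the exact `schema.rs` code may differ.

```rust
// Edit distance with the two-row dynamic-programming optimization:
// only the previous and current rows of the full DP matrix are kept.
fn levenshtein_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0usize; b.len() + 1];
    for (i, &ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            curr[j + 1] = (prev[j + 1] + 1) // deletion
                .min(curr[j] + 1)           // insertion
                .min(prev[j] + cost);       // substitution
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}

// Suggest the closest field name, rejecting exact matches and anything
// needing more than 1/3 of the longer name's length in edits.
fn suggest_field<'a>(name: &str, fields: &[&'a str]) -> Option<&'a str> {
    fields
        .iter()
        .filter(|f| **f != name)
        .map(|f| (*f, levenshtein_distance(name, f)))
        .filter(|(f, d)| *d * 3 <= name.len().max(f.len()))
        .min_by_key(|(_, d)| *d)
        .map(|(f, _)| f)
}
```

For the example from the PR, `suggest_field("vectr", &["vector", "id"])` finds `vector` at distance 1, which passes the threshold, while a completely unrelated name yields no suggestion.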
…r dict decision (lance-format#5891) This PR changes how we decide whether to use dictionary encoding. Instead of cardinality, we now use the number of dict entries and the encoded size. --- **Parts of this PR were drafted with assistance from Codex (with `gpt-5.2`) and fully reviewed and edited by me. I take full responsibility for all changes.**
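A minimal sketch of such a decision rule, assuming an entry cap and a size comparison; the threshold value and names are invented and are not the ones used in the PR.

```rust
// Illustrative dict-vs-plain decision based on dictionary entries and
// encoded sizes rather than raw cardinality.
fn use_dictionary(dict_entries: usize, dict_encoded_size: u64, plain_encoded_size: u64) -> bool {
    const MAX_DICT_ENTRIES: usize = 4096; // assumed cap, not Lance's value
    // Prefer dictionary encoding only when the dictionary stays small
    // and actually shrinks the encoded data.
    dict_entries <= MAX_DICT_ENTRIES && dict_encoded_size < plain_encoded_size
}
```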
…sks (lance-format#5982) `NextDecodeTask::into_batch` is synchronous and can be CPU-heavy. Running it inline in the future poll path blocks Tokio workers and reduces effective decode concurrency. This change becomes more significant when zstd is used.

Benchmarks were run on AWS EC2 using both local and S3 copies of the same dataset (`fineweb.lance.v2_2.lz4`) with repeated scans. Main run (3 rounds, 20 repeats each):
- Local median latency:
  - p50: `894675us -> 289781us` (`3.087x`, `-67.61%`)
  - p95: `929515us -> 307874us` (`3.019x`, `-66.88%`)
  - p99: `1034383us -> 375041us` (`2.758x`, `-63.74%`)
- S3 median latency:
  - p50: `3998660us -> 3510771us` (`1.139x`, `-12.20%`)
  - p95: `4068799us -> 3572090us` (`1.139x`, `-12.21%`)
  - p99: `4153371us -> 3592478us` (`1.156x`, `-13.50%`)

## Changes
- move structural decode batch conversion in `StructuralBatchDecodeStream::into_stream` to `tokio::spawn(...).await`
- **Bump datafusion requirement to 52** - **ruff format** - **fix: use fields_with_udf for aggregate type coercion (DF52)** - **fix: use OutputBatches metric variant for DF52 compatibility** --------- Co-authored-by: Tim Saucer <timsaucer@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
From a snapshot isolation perspective, two overwrites should not automatically be compatible, because they modify overlapping data. We now mark this as retryable, and the client can decide whether to retry depending on the specific case.
## Code Review

I reviewed this backport PR from main to release/3.0. Here are the key points to consider:

**P1: Breaking API Change - `DeleteBuilder::execute()`**

The return type of `execute` changed:

```rust
// Before
pub async fn execute(self) -> Result<Arc<Dataset>>

// After
pub async fn execute(self) -> Result<DeleteResult>
```

Consider whether this is acceptable for a backport to a release branch, or if it should be released in a major/minor version.

**P1: Behavioral Change - Concurrent Overwrites**

The conflict resolution behavior for concurrent overwrites changed. Users who previously relied on concurrent overwrites both succeeding will now need to handle retry logic. The test update confirms this is an intentional change:

```rust
// Before: [Compatible; 9] for all operations
// After: Retryable for overwrite vs overwrite
```

This is consistent with the related PR #6014 ("make overwrites retryable instead of compatible"), so this appears intentional, but it is worth highlighting in the release notes. The implementation quality looks good, with proper test coverage for the new behaviors.