Skip to content

fix: filter stale row IDs in TakeExec for FTS/vector after delete#6042

Merged
Xuanwo merged 4 commits intolance-format:mainfrom
wkalt:fix/take-exec-deleted-rows
Mar 6, 2026
Merged

fix: filter stale row IDs in TakeExec for FTS/vector after delete#6042
Xuanwo merged 4 commits intolance-format:mainfrom
wkalt:fix/take-exec-deleted-rows

Conversation

@wkalt
Copy link
Copy Markdown
Contributor

@wkalt wkalt commented Feb 27, 2026

When stable row IDs are enabled, FTS and vector indexes may return row IDs for rows that have since been deleted. The row ID index excludes deleted rows, so get_row_addrs() would silently drop these entries via filter_map, producing an addresses array shorter than the input batch. The downstream merge_with_schema then failed with "Attempt to merge two RecordBatch with different sizes".

Fix: track which row IDs are valid in get_row_addrs() and return a validity mask. In map_batch(), filter the input batch to remove rows whose IDs no longer exist before merging.

When stable row IDs are enabled, FTS and vector indexes may return row
IDs for rows that have since been deleted. The row ID index excludes
deleted rows, so get_row_addrs() would silently drop these entries via
filter_map, producing an addresses array shorter than the input batch.
The downstream merge_with_schema then failed with "Attempt to merge two
RecordBatch with different sizes".

Fix: track which row IDs are valid in get_row_addrs() and return a
validity mask. In map_batch(), filter the input batch to remove rows
whose IDs no longer exist before merging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added the bug Something isn't working label Feb 27, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR Review: fix: filter stale row IDs in TakeExec for FTS/vector after delete

No P0/P1 issues found. The fix is correct and well-targeted.

Summary

The root cause is clear: filter_map in the old get_row_addrs() silently dropped entries for deleted row IDs, producing an addresses array shorter than the input batch. The new approach of returning a validity mask and applying filter_record_batch keeps the two arrays in sync before the downstream merge.

Minor observations (not blocking)

  1. Edge case with all-deleted batches: If every row ID in a batch maps to a deleted row, addresses is empty, batches is empty, and the early return at line ~289 produces RecordBatch::new_empty(output_schema). This is correct behavior but worth noting — the caller gets a valid empty batch rather than an error.

  2. Vector index test coverage: The fix covers both FTS and vector indexes (same code path), but the regression test only exercises FTS. A vector search + delete test would add confidence, though this is not blocking since the code path is identical.

LGTM

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 27, 2026

Codecov Report

❌ Patch coverage is 69.56522% with 7 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/io/exec/take.rs 69.56% 4 Missing and 3 partials ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Collaborator

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this!

@Xuanwo Xuanwo merged commit 5bc01c8 into lance-format:main Mar 6, 2026
28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants