fix: deduplicate row addresses in take to prevent panic (#5881)

wjones127 merged 3 commits into lance-format:main
Conversation
`TakeStream::map_batch` passed duplicate row addresses straight through to the encoding layer, which requires strictly increasing indices. Duplicates (e.g. from FTS on `List<Utf8>`, where multiple list elements in the same row match) caused `indices_to_ranges` to produce overlapping ranges, panicking in `BinaryPageScheduler` with "attempt to subtract with overflow".

Deduplicate the sorted addresses before passing them to fragment readers, then expand the results back to include the duplicates. Also tighten the `schedule_take` debug_assert from `<=` to `<` to catch this earlier.

Fixes lance-format#5260

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
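The dedup-then-expand pattern described above can be sketched in plain Rust. Names here are illustrative only (`dedup_and_expand` is not the actual lance API), and plain `Vec`s stand in for the arrow index arrays the real code uses:

```rust
// Given already-sorted row addresses that may contain duplicates, return
// the unique addresses (to hand to the fragment reader) plus an expand
// map: expand[i] is the index into `unique` for original position i, so
// a single `take` over the deduplicated results restores the duplicates.
fn dedup_and_expand(sorted_addrs: &[u64]) -> (Vec<u64>, Vec<usize>) {
    let mut unique = Vec::with_capacity(sorted_addrs.len());
    let mut expand = Vec::with_capacity(sorted_addrs.len());
    for &addr in sorted_addrs {
        if unique.last() != Some(&addr) {
            unique.push(addr);
        }
        expand.push(unique.len() - 1);
    }
    (unique, expand)
}

fn main() {
    // Row 7 appears twice, e.g. two matching list elements in one row.
    let (unique, expand) = dedup_and_expand(&[3, 7, 7, 9]);
    assert_eq!(unique, vec![3, 7, 9]);
    assert_eq!(expand, vec![0, 1, 1, 2]);
}
```

In the PR itself, the expand step is an `arrow_select::take::take_record_batch` call over the deduplicated batch rather than a `Vec` lookup, but the index bookkeeping is the same single pass.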
**Code Review**

This PR correctly fixes issue #5260 by deduplicating row addresses before passing them to the encoding layer, which requires strictly increasing indices.

**P1 Issue: Use proper error propagation instead of `.unwrap()`**

At line 290 in the updated code, consider changing to:

```rust
new_data = arrow_select::take::take_record_batch(&new_data, &expand_indices)
    .map_err(|e| DataFusionError::Execution(e.to_string()))?;
```

The same applies to the existing `.unwrap()` call.

Overall the fix is correct and well-tested. The dedup approach is sound: using the sorted order to efficiently track unique addresses and then expanding results back with duplicates is the right pattern.
The common path (sorted, no duplicates) now does a single-pass is_sorted_and_unique check and skips all sorting, dedup, and permutation logic. The unsorted/duplicate paths are unified and the expand + inverse permutation are composed into a single take. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
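A single-pass check like the one this commit describes might look as follows (a sketch; `is_sorted_and_unique` is an assumed name, not necessarily the lance helper). Strictly increasing addresses mean the sort, dedup, and permutation steps can all be skipped:

```rust
// True iff the addresses are strictly increasing, i.e. already sorted
// with no duplicates. One pass, no allocation.
fn is_sorted_and_unique(addrs: &[u64]) -> bool {
    addrs.windows(2).all(|w| w[0] < w[1])
}

fn main() {
    assert!(is_sorted_and_unique(&[1, 2, 5]));  // fast path
    assert!(!is_sorted_and_unique(&[1, 2, 2])); // duplicate
    assert!(!is_sorted_and_unique(&[2, 1]));    // unsorted
}
```

Note the strict `<`: the std `is_sorted` check alone would accept duplicates, which is exactly the case that panicked.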
westonpace left a comment:
This PR is fine to avoid panic but does this mean the search will still output duplicate rows? Is that something we want long term? Or should we make a follow-up?
```rust
if let Some(expand_map) = expand_map {
    batch = arrow_select::take::take_record_batch(&batch, &expand_map)?;
}
```
Someday it would be nice to coalesce the dataset paths and the TakeExec paths so that we don't have to duplicate this logic everywhere. Not a comment on this PR, just griping.
Yeah, I'll make a follow-up issue. Not sure what the intended semantics of FTS in a list are.
Previously, duplicate row addresses (e.g. from FTS on `List<Utf8>`, where multiple list elements in the same row match) were passed to the v2 encoding layer, which requires strictly increasing indices. This caused `indices_to_ranges` to produce overlapping ranges, panicking with "attempt to subtract with overflow".
Dedup is handled at two sites:

1. `FragmentReader::take_as_batch`: This is the common entry point for all callers that need a single batch (`FileFragment::take_rows`, `FragmentSession::take_rows`, `TakeStream::map_batch`). Since it already collects into one batch, the expand-after-dedup step is trivial. Handling it here rather than in the v2 adapter's `take_all_tasks` avoids complications with the streaming `ReadBatchTaskStream` return type. Handling it lower in `schedule_take` wouldn't work because the encoding layer has no notion of duplicates by design.

2. `TakeStream::map_batch`: Dedup here serves a different purpose: it reduces I/O by avoiding redundant reads of the same row across fragments. This also optimizes the common path (sorted, unique addresses) to skip all sorting/dedup/permutation overhead entirely.
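For the unsorted/duplicate path, the idea of composing the expand map and the inverse sort permutation into one index vector (so a single take restores the caller's original order) could be sketched like this; `plan_take` is a hypothetical name and plain `Vec`s stand in for arrow arrays:

```rust
// Given arbitrary row addresses (possibly unsorted, possibly duplicated),
// return the sorted unique addresses to read, plus one remap vector:
// remap[i] is the index into `unique` holding the row for original
// position i. Taking `unique` with `remap` restores the input order
// and re-inflates duplicates in a single pass.
fn plan_take(addrs: &[u64]) -> (Vec<u64>, Vec<usize>) {
    // Argsort: positions in the original array, ordered by address.
    let mut perm: Vec<usize> = (0..addrs.len()).collect();
    perm.sort_by_key(|&i| addrs[i]);

    let mut unique = Vec::new();
    let mut remap = vec![0usize; addrs.len()];
    for &orig_idx in &perm {
        let addr = addrs[orig_idx];
        if unique.last() != Some(&addr) {
            unique.push(addr);
        }
        remap[orig_idx] = unique.len() - 1;
    }
    (unique, remap)
}

fn main() {
    let (unique, remap) = plan_take(&[9, 3, 7, 7]);
    assert_eq!(unique, vec![3, 7, 9]); // strictly increasing, dedup'd
    // One take over `unique` with `remap` restores [9, 3, 7, 7].
    let restored: Vec<u64> = remap.iter().map(|&i| unique[i]).collect();
    assert_eq!(restored, vec![9, 3, 7, 7]);
}
```

This is why the two previously separate steps (expand duplicates, then invert the sort permutation) can collapse into one take: both are just index lookups, and index lookups compose.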
Fixes #5260
Test plan
- `test_take_with_duplicate_row_addrs`: sorted duplicate row addresses
- `test_take_with_unsorted_duplicate_row_addrs`: unsorted duplicates
- `test_fragment_take_indices` / `test_fragment_take_rows`: fragment-level duplicate indices (previously failing on `LanceFileVersion::Stable`)

🤖 Generated with Claude Code