feat: support stop-word gaps in phrase queries by BubbleCal · Pull Request #6277 · lance-format/lance

BubbleCal · 2026-03-24T15:16:08Z

This change enables phrase queries to match across stop-word gaps.

Example:
For `doc="love the format"` indexed with `remove_stop_words=True`, the index does not store the stop word `the`.

With this change, users can still match the document with the phrase query `q="love the format"`. In this mode, all stop words are treated as equivalent placeholders for phrase matching, so `q="love a format"` will also match the same document.

This makes queries that contain stop words 3x to 10x faster, at the cost of a little accuracy.
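The placeholder behavior described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code from this PR: `Posting`, `index_tokens`, and `phrase_matches` are hypothetical names, tokenization is plain whitespace splitting, and slop is fixed at 0. The point it shows is that dropping stop words while keeping the original positions leaves a gap that any stop word in the query can fill.

```rust
use std::collections::HashSet;

/// A token that survived stop-word removal, with its original position.
/// (Hypothetical type for illustration only.)
#[derive(Debug, Clone, PartialEq)]
struct Posting {
    term: String,
    position: usize,
}

/// Tokenize on whitespace and drop stop words, but keep each surviving
/// token's original position so that gaps remain visible.
fn index_tokens(text: &str, stop_words: &HashSet<&str>) -> Vec<Posting> {
    text.split_whitespace()
        .map(|t| t.to_lowercase())
        .enumerate()
        .filter(|(_, t)| !stop_words.contains(t.as_str()))
        .map(|(position, term)| Posting { term, position })
        .collect()
}

/// Exact (slop = 0) phrase match: the query's surviving tokens must appear
/// in the document with the same relative offsets. Stop words only
/// contribute a gap, so "the" and "a" are interchangeable.
fn phrase_matches(doc: &str, query: &str, stop_words: &HashSet<&str>) -> bool {
    let doc_tokens = index_tokens(doc, stop_words);
    let query_tokens = index_tokens(query, stop_words);
    let Some(first) = query_tokens.first() else {
        // A query made only of stop words matches nothing in this sketch.
        return false;
    };
    doc_tokens
        .iter()
        .filter(|d| d.term == first.term)
        .any(|start| {
            query_tokens.iter().all(|q| {
                let wanted = start.position + (q.position - first.position);
                doc_tokens
                    .iter()
                    .any(|d| d.term == q.term && d.position == wanted)
            })
        })
}
```

In this sketch, `"love a format"` matches `"love the format"` because both reduce to `love` at offset 0 and `format` at offset 2, while `"love format"` does not match because its offsets differ.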

github-actions · 2026-03-24T15:17:40Z

PR Review

P0: Remove `.codex/task-memory/` files

The two files under .codex/task-memory/phrase-stop-words/ are AI agent working memory and should not be committed to the repository. Please remove them from the PR.

P1: Performance — double tokenization per candidate document in post-validation

`filter_stop_word_phrase_matches` calls `collect_phrase_tokens_with_stop_words` for every candidate row, and that method constructs two tokenizers (one with stop words, one without) and tokenizes the full document text twice. For queries with stop words that produce a large candidate set, this could be costly.

Consider constructing the tokenizer pair once and passing it into the per-document loop, rather than rebuilding it on every call via `params.clone().build()`.

Short-circuiting is already handled correctly: the check for stop words in the query happens before document text is fetched with `take_rows`, and the function returns early if the query contains no placeholder tokens. The remaining concern is reconstructing the tokenizers per document row; those should be hoisted out of the loop.
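The hoisting suggested here might look roughly like the following. This is a sketch under assumptions, not Lance's API: `TokenizerParams`, `Tokenizer`, `tokenize_candidates`, and the `builds` counter are all hypothetical stand-ins, and the tokenizer is a trivial whitespace splitter. It only illustrates building the tokenizer pair once and reusing it for every candidate row.

```rust
use std::cell::Cell;

const STOP_WORDS: [&str; 3] = ["the", "a", "an"];

/// Hypothetical tokenizer configuration (illustrative, not Lance's type).
#[derive(Clone)]
struct TokenizerParams {
    remove_stop_words: bool,
}

/// Hypothetical tokenizer built from `TokenizerParams`.
struct Tokenizer {
    remove_stop_words: bool,
}

impl TokenizerParams {
    /// Stands in for an expensive build step; `builds` counts invocations.
    fn build(self, builds: &Cell<usize>) -> Tokenizer {
        builds.set(builds.get() + 1);
        Tokenizer {
            remove_stop_words: self.remove_stop_words,
        }
    }
}

impl Tokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace()
            .map(str::to_lowercase)
            .filter(|t| {
                let is_stop = STOP_WORDS.iter().any(|s| *s == t.as_str());
                !(self.remove_stop_words && is_stop)
            })
            .collect()
    }
}

/// Build the tokenizer pair once, then reuse it for every candidate row.
/// Returns (tokens with stop words, tokens without) for each document.
fn tokenize_candidates(
    params: &TokenizerParams,
    docs: &[&str],
    builds: &Cell<usize>,
) -> Vec<(Vec<String>, Vec<String>)> {
    // Hoisted out of the loop: two builds in total, regardless of docs.len().
    let with_stop_words = TokenizerParams {
        remove_stop_words: false,
    }
    .build(builds);
    let without_stop_words = params.clone().build(builds);

    docs.iter()
        .map(|doc| {
            (
                with_stop_words.tokenize(doc),
                without_stop_words.tokenize(doc),
            )
        })
        .collect()
}
```

With this shape the number of build calls stays at two however many candidate rows are validated, whereas building inside the loop would cost two builds per row.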

P1: Recursive phrase matching with slop could be expensive

matches_phrase_tokens_from uses unbounded recursion with backtracking. Worst case is O(slop^n) where n is query token count. For typical small slop values this is fine, but consider adding a comment noting this assumption, or converting to an iterative approach if slop values can be large.
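One way to avoid exponential backtracking is to advance a deduplicated frontier of states instead of recursing. The sketch below is an assumption-laden illustration, not the PR's `matches_phrase_tokens_from`: the name `matches_with_slop` is hypothetical, slop is assumed to bound the total number of extra positions across the whole phrase, and both inputs are assumed to be `(term, position)` pairs sorted by position.

```rust
use std::collections::HashMap;

/// Returns true if `query` matches `doc` as a phrase with at most `slop`
/// extra positions in total. Both slices hold (term, position) pairs sorted
/// by position; positions may skip slots where stop words were removed.
fn matches_with_slop(doc: &[(&str, usize)], query: &[(&str, usize)], slop: usize) -> bool {
    let Some((first_term, _)) = query.first() else {
        return false;
    };

    // Frontier: doc position of the last matched token -> most slop left.
    // For a given position, more remaining slop always dominates, so one
    // entry per position is enough and the frontier never exceeds doc.len().
    let mut frontier: HashMap<usize, usize> = doc
        .iter()
        .filter(|(term, _)| term == first_term)
        .map(|&(_, pos)| (pos, slop))
        .collect();

    for pair in query.windows(2) {
        let next_term = pair[1].0;
        // Gap required by the query, including any stop-word slots.
        let expected_gap = pair[1].1 - pair[0].1;
        let mut next: HashMap<usize, usize> = HashMap::new();

        for (&prev_pos, &remaining) in &frontier {
            for &(term, pos) in doc {
                if term != next_term || pos < prev_pos + expected_gap {
                    continue;
                }
                let used = pos - (prev_pos + expected_gap);
                if used <= remaining {
                    let left = remaining - used;
                    let best = next.entry(pos).or_insert(left);
                    *best = (*best).max(left);
                }
            }
        }

        if next.is_empty() {
            return false;
        }
        frontier = next;
    }
    !frontier.is_empty()
}
```

Because each step keeps at most one state per document position, the work per query token is bounded by the square of the number of postings in the document rather than growing with `slop^n`.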

Minor: Test could also assert negative case for `"want green apple"`

The test correctly asserts 2 results (ids 0 and 1), which implicitly excludes `"want green apple"` (id 2). But an explicit `assert!(!ids.contains(&2))` would make the intent clearer and directly document the false-positive prevention that motivated the post-validation step.

Overall approach is sound: preserve tokenizer positions for gap-aware phrase matching in the index, then post-validate against original text only when stop words are present in the query. The core index changes in query.rs and index.rs are clean.

🤖 Generated with Claude Code

codecov · 2026-03-24T16:10:18Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

westonpace

Nice fix. I didn't realize this could be a query-path-only change, very cool.

fix: support stop-word gaps in phrase queries

ac6f5b3

github-actions Bot added the bug Something isn't working label Mar 24, 2026

BubbleCal added 2 commits March 24, 2026 23:31

delete temp files

05a1420

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

fix: optimize stop-word phrase validation

73a4fd1

BubbleCal requested review from LuQQiu, Xuanwo and westonpace March 24, 2026 15:43

BubbleCal marked this pull request as draft March 24, 2026 17:04

BubbleCal added 2 commits March 25, 2026 10:27

fix missing doc

f639b40

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

more test

1d6fb7a

Signed-off-by: BubbleCal <bubble-cal@outlook.com>

BubbleCal marked this pull request as ready for review March 25, 2026 11:54

BubbleCal changed the title ~~fix: support stop-word gaps in phrase queries~~ feat: support stop-word gaps in phrase queries Mar 25, 2026

github-actions Bot added the enhancement New feature or request label Mar 25, 2026

BubbleCal removed the bug Something isn't working label Mar 25, 2026

westonpace approved these changes Mar 25, 2026

BubbleCal merged commit d016a82 into main Mar 25, 2026
33 checks passed

BubbleCal deleted the yang/phrase-query-allow-stop-words branch March 25, 2026 15:14

BubbleCal merged 5 commits into main from yang/phrase-query-allow-stop-words
