
feat: support stop-word gaps in phrase queries #6277

Merged
BubbleCal merged 5 commits into main from yang/phrase-query-allow-stop-words
Mar 25, 2026

@BubbleCal
Contributor

@BubbleCal BubbleCal commented Mar 24, 2026

This change enables phrase queries to match across stop-word gaps.

Example:
For doc="love the format" indexed with remove_stop_words=True, the index does not store the stop word "the".

With this change, users can still match the document with the phrase query q="love the format". In this mode, all stop words are treated as equivalent placeholders for phrase matching, so q="love a format" will also match the same document.

This makes queries that contain stop words 3x to 10x faster, at the cost of a little accuracy.
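
To make the placeholder idea concrete, here is a minimal sketch, not the code in this PR. It assumes a whitespace tokenizer and a caller-supplied stop-word set; `QueryToken`, `index_tokens`, `query_tokens`, and `phrase_matches` are hypothetical names. Because the sketch holds every indexed token of the document, it can check directly that a placeholder lands on a position with no indexed token.

```rust
use std::collections::HashSet;

/// A query token: a concrete term, or a placeholder standing in for any stop word.
#[derive(Debug, Clone, PartialEq)]
enum QueryToken {
    Term(String),
    Placeholder,
}

/// Tokenize a document, dropping stop words but keeping the original positions.
fn index_tokens(text: &str, stop_words: &HashSet<&str>) -> Vec<(String, usize)> {
    text.split_whitespace()
        .map(|w| w.to_lowercase())
        .enumerate()
        .filter(|(_, w)| !stop_words.contains(w.as_str()))
        .map(|(pos, w)| (w, pos))
        .collect()
}

/// Tokenize a query, turning every stop word into the same placeholder.
fn query_tokens(query: &str, stop_words: &HashSet<&str>) -> Vec<QueryToken> {
    query
        .split_whitespace()
        .map(|w| w.to_lowercase())
        .map(|w| {
            if stop_words.contains(w.as_str()) {
                QueryToken::Placeholder
            } else {
                QueryToken::Term(w)
            }
        })
        .collect()
}

/// Exact phrase match (no slop). A placeholder matches a position that holds
/// no indexed token (assumption of this sketch); a term must match exactly.
fn phrase_matches(doc: &[(String, usize)], query: &[QueryToken]) -> bool {
    // Anchor on the first concrete term; a query of only stop words matches nothing here.
    let Some(anchor) = query.iter().position(|t| matches!(t, QueryToken::Term(_))) else {
        return false;
    };
    let QueryToken::Term(anchor_term) = &query[anchor] else {
        return false;
    };
    doc.iter()
        .filter(|(word, pos)| word == anchor_term && *pos >= anchor)
        .any(|(_, pos)| {
            // Document position that lines up with query offset 0.
            let base = pos - anchor;
            query.iter().enumerate().all(|(i, token)| {
                let occupant = doc.iter().find(|(_, p)| *p == base + i);
                match (token, occupant) {
                    (QueryToken::Term(t), Some((word, _))) => word == t,
                    (QueryToken::Placeholder, None) => true,
                    _ => false,
                }
            })
        })
}
```

With a stop-word set containing "the" and "a", both `"love the format"` and `"love a format"` match the indexed tokens of `"love the format"` in this sketch, since the two stop words collapse into the same placeholder.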

@github-actions github-actions Bot added the bug Something isn't working label Mar 24, 2026
@github-actions
Contributor

PR Review

P0: Remove .codex/task-memory/ files

The two files under .codex/task-memory/phrase-stop-words/ are AI agent working memory and should not be committed to the repository. Please remove them from the PR.

P1: Performance — double tokenization per candidate document in post-validation

filter_stop_word_phrase_matches calls collect_phrase_tokens_with_stop_words for every candidate row, and that method constructs two tokenizers (one with stop words, one without) and tokenizes the full document text twice. For queries with stop words that produce a large candidate set, this could be costly.

Consider:

  1. Constructing the tokenizer pair once and passing them into the per-document loop rather than rebuilding them on every call via params.clone().build().
  2. Short-circuiting earlier: check whether the query contains stop words before fetching document text with take_rows. On inspection, the check already runs before the take_rows call, so this part is fine.

On re-reading: the early-exit check is correct (returns early if no placeholder tokens in query). The main concern is reconstructing tokenizers per document row — those should be hoisted out of the loop.
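
The hoisting suggested above could look roughly like the following sketch. `TokenizerParams`, `Tokenizer`, `with_remove_stop_words`, and `tokenize_candidates` are hypothetical stand-ins written for illustration, not the real Lance types; only the shape of the loop matters here.

```rust
/// Hypothetical tokenizer configuration (stand-in, not the real Lance type).
#[derive(Clone)]
struct TokenizerParams {
    remove_stop_words: bool,
}

/// Hypothetical tokenizer built from the params.
struct Tokenizer {
    remove_stop_words: bool,
}

/// Tokens paired with their positions in the original text.
type Tokens = Vec<(String, usize)>;

impl TokenizerParams {
    fn with_remove_stop_words(mut self, remove: bool) -> Self {
        self.remove_stop_words = remove;
        self
    }

    fn build(self) -> Tokenizer {
        Tokenizer { remove_stop_words: self.remove_stop_words }
    }
}

impl Tokenizer {
    /// Whitespace tokenization with a tiny illustrative stop-word list.
    fn tokenize(&self, text: &str) -> Tokens {
        const STOP_WORDS: [&str; 3] = ["a", "an", "the"];
        text.split_whitespace()
            .enumerate()
            .map(|(pos, w)| (w.to_lowercase(), pos))
            .filter(|(w, _)| !(self.remove_stop_words && STOP_WORDS.contains(&w.as_str())))
            .collect()
    }
}

/// Build the tokenizer pair once, then reuse it for every candidate document.
fn tokenize_candidates(params: &TokenizerParams, docs: &[&str]) -> Vec<(Tokens, Tokens)> {
    // Hoisted out of the loop: two builds in total, not two per document.
    let keep_stop_words = params.clone().with_remove_stop_words(false).build();
    let drop_stop_words = params.clone().with_remove_stop_words(true).build();
    docs.iter()
        .map(|doc| (keep_stop_words.tokenize(doc), drop_stop_words.tokenize(doc)))
        .collect()
}
```

Each candidate is still tokenized twice in this sketch; the saving is only the repeated construction cost, which is the part the review flags.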

P1: Recursive phrase matching with slop could be expensive

matches_phrase_tokens_from uses unbounded recursion with backtracking. Worst case is O(slop^n) where n is query token count. For typical small slop values this is fine, but consider adding a comment noting this assumption, or converting to an iterative approach if slop values can be large.
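
One possible iterative alternative is sketched below. It is not the PR's `matches_phrase_tokens_from`, and it assumes a particular slop definition that may differ from the one the index uses: query terms must appear in order, and the total number of extra positions between them must not exceed `slop`. The name `matches_with_slop` and the input layout are hypothetical. Under that definition the extra gap depends only on the first and last matched positions, so it is enough to track, for each position of the current term, the latest phrase start that can reach it.

```rust
/// `positions[i]` holds the sorted document positions of the i-th query term.
/// Returns true if the terms occur in order with at most `slop` extra
/// positions in total between them (assumed slop definition).
fn matches_with_slop(positions: &[Vec<usize>], slop: usize) -> bool {
    let Some((first, rest)) = positions.split_first() else {
        return false;
    };
    // Each entry is (position of the current term, latest phrase start reaching it).
    let mut frontier: Vec<(usize, usize)> = first.iter().map(|&p| (p, p)).collect();
    for term_positions in rest {
        let mut next = Vec::with_capacity(term_positions.len());
        let mut best_start: Option<usize> = None;
        let mut j = 0;
        for &p in term_positions {
            // Absorb every predecessor that sits strictly before `p`.
            while j < frontier.len() && frontier[j].0 < p {
                let candidate = frontier[j].1;
                best_start = Some(best_start.map_or(candidate, |b| b.max(candidate)));
                j += 1;
            }
            if let Some(start) = best_start {
                next.push((p, start));
            }
        }
        if next.is_empty() {
            return false;
        }
        frontier = next;
    }
    // A chain of n strictly increasing positions spans at least n - 1.
    let span = positions.len() - 1;
    frontier.iter().any(|&(end, start)| end - start - span <= slop)
}
```

Each position list is scanned once per step, so the work is linear in the total number of positions and does not grow with `slop` or with recursion depth.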

Minor: Test could also assert negative case for "want green apple"

The test correctly asserts 2 results (ids 0 and 1), which implicitly excludes "want green apple" (id 2). But an explicit assert!(!ids.contains(&2)) would make the intent clearer and directly document the false-positive prevention that motivated the post-validation step.


Overall approach is sound: preserve tokenizer positions for gap-aware phrase matching in the index, then post-validate against original text only when stop words are present in the query. The core index changes in query.rs and index.rs are clean.

🤖 Generated with Claude Code

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@codecov

codecov Bot commented Mar 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@BubbleCal BubbleCal marked this pull request as draft March 24, 2026 17:04
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal marked this pull request as ready for review March 25, 2026 11:54
@BubbleCal BubbleCal changed the title fix: support stop-word gaps in phrase queries feat: support stop-word gaps in phrase queries Mar 25, 2026
@github-actions github-actions Bot added the enhancement New feature or request label Mar 25, 2026
@BubbleCal BubbleCal removed the bug Something isn't working label Mar 25, 2026
Member

@westonpace westonpace left a comment


Nice fix. I didn't realize this could be a query-path-only change, very cool.

@BubbleCal BubbleCal merged commit d016a82 into main Mar 25, 2026
33 checks passed
@BubbleCal BubbleCal deleted the yang/phrase-query-allow-stop-words branch March 25, 2026 15:14
wjones127 pushed a commit to wjones127/lance that referenced this pull request Mar 29, 2026