
Commit 3e0dd25
fix: normalize queryWords with same punctuation/split pipeline as tokenize
`queryWords` was built by splitting the raw lowercased query on whitespace only, skipping the punctuation stripping and the hyphen/underscore/slash splitting that `tokenize()` applies. This caused dirty tokens like `pattern?` or `whats` to form bigrams that never matched against clean document text, silently disabling the partial phrase boost for punctuated queries. Apply the same replace/split pipeline as `tokenize()` (minus stemming) so bigram matching works correctly.
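A minimal sketch of the before/after behavior. The query string and the small `STOP_WORDS` set here are illustrative only (the real set lives in the module), but the two pipelines match the old and new `queryWords` expressions:

```javascript
// Illustrative stop-word set; the actual STOP_WORDS in bm25.js is larger.
const STOP_WORDS = new Set(["the", "a", "is"]);

// Old behavior: whitespace split only, punctuation left attached.
function queryWordsOld(query) {
  return query.toLowerCase()
    .split(/\s+/)
    .filter((w) => w.length > 1 && !STOP_WORDS.has(w));
}

// New behavior: same replace/split pipeline as tokenize(), minus stemming.
function queryWordsNew(query) {
  return query.toLowerCase()
    .replace(/[^\w\s-]/g, " ")  // strip punctuation (keep word chars, spaces, hyphens)
    .split(/[\s\-_/]+/)         // split on whitespace, hyphens, underscores, slashes
    .filter((w) => w.length > 1 && !STOP_WORDS.has(w));
}

console.log(queryWordsOld("What's the observer-pattern?"));
// dirty tokens that never match clean doc text: [ "what's", "observer-pattern?" ]
console.log(queryWordsNew("What's the observer-pattern?"));
// clean tokens that can form matching bigrams: [ "what", "observer", "pattern" ]
```

With the old pipeline, bigrams built from `what's`/`observer-pattern?` can never equal bigrams built from tokenized document text, so the phrase boost silently contributes nothing for punctuated queries.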
1 parent 44aa004 commit 3e0dd25

2 files changed: 2 additions & 2 deletions


src/search/bm25.js (1 addition, 1 deletion):

```diff
@@ -81,7 +81,7 @@ export function searchBM25(index, query, limit = 5) {
 
   // Pre-compute phrase matching inputs once, outside the per-doc loop.
   const queryLower = query.toLowerCase();
-  const queryWords = queryLower.split(/\s+/).filter((w) => w.length > 1 && !STOP_WORDS.has(w));
+  const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1 && !STOP_WORDS.has(w));
 
   const scores = [];
 
```

workers/src/bm25.ts (1 addition, 1 deletion):

```diff
@@ -102,7 +102,7 @@ export function searchBM25(
 
   // Pre-compute phrase matching inputs once, outside the per-doc loop.
   const queryLower = query.toLowerCase();
-  const queryWords = queryLower.split(/\s+/).filter((w) => w.length > 1 && !STOP_WORDS.has(w));
+  const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1 && !STOP_WORDS.has(w));
 
   const scores: Array<{ id: string; score: number }> = [];
 
```
