fix: BM25 phrase boost + index freshness verification #72
Merged
Bug 1 (workers/src/bm25.ts, src/search/bm25.js):
BM25 scored every query token independently, letting high-frequency
terms like 'pattern' dilute rare-but-precise ones like 'vodka',
pushing exact-title matches down the rankings.
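To make the failure mode concrete, here is a minimal sketch of per-token BM25 scoring. It assumes the standard Okapi BM25 formula with the conventional defaults k1 = 1.2 and b = 0.75 and a Lucene-style IDF; the repository's actual constants and IDF variant are not shown in this PR, and the function and field names below are hypothetical.

```javascript
// Conventional BM25 defaults (assumed, not taken from the repo).
const K1 = 1.2;
const B = 0.75;

// Score contribution of a single query term for a single document.
function bm25TermScore(tf, docLength, avgDocLength, docFreq, totalDocs) {
  // Lucene-style IDF: non-negative, smaller for terms found in many docs.
  const idf = Math.log(1 + (totalDocs - docFreq + 0.5) / (docFreq + 0.5));
  const tfNorm =
    (tf * (K1 + 1)) / (tf + K1 * (1 - B + B * (docLength / avgDocLength)));
  return idf * tfNorm;
}

// The document score is a plain sum over query terms, so word order and
// adjacency in the document play no part in the result.
function bm25Score(queryTerms, doc, stats) {
  let score = 0;
  for (const term of queryTerms) {
    const tf = doc.terms.filter((t) => t === term).length;
    if (tf === 0) continue;
    score += bm25TermScore(
      tf, doc.terms.length, stats.avgDocLength, stats.docFreq[term], stats.totalDocs
    );
  }
  return score;
}

// Two toy documents with the same terms in a different order.
const stats = {
  totalDocs: 2,
  avgDocLength: 3,
  docFreq: { vodka: 2, architecture: 2, pattern: 2 },
};
const inOrder = { terms: ["vodka", "architecture", "pattern"] };
const shuffled = { terms: ["architecture", "pattern", "vodka"] };
const scoreInOrder = bm25Score(["vodka", "architecture"], inOrder, stats);
const scoreShuffled = bm25Score(["vodka", "architecture"], shuffled, stats);
```

Both toy documents receive the same score even though only the first contains the query as a contiguous phrase, which is the gap a phrase-level signal is meant to close.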
Fix: store originalText on BM25Doc during buildBM25Index, then after
BM25 scoring apply a phrase boost in searchBM25:
- +5.0 (PHRASE_BOOST_EXACT) if the full lowercased query appears
  as a substring of the doc's original text
- +2.0 (PHRASE_BOOST_PARTIAL) if any consecutive word bigram from
  the query appears in the doc text (first hit wins)
These boosts supplement BM25; they never replace it. Applied to both
the Worker TypeScript version and the Node/stdio JS version for
consistency.
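The boost rule described above can be sketched as a standalone function. The constants mirror the values in this commit message; the helper name phraseBoost and its argument order are hypothetical and chosen only for illustration, since the actual change inlines this logic in searchBM25.

```javascript
const PHRASE_BOOST_EXACT = 5.0;
const PHRASE_BOOST_PARTIAL = 2.0;

// Returns the additive boost for one document.
//   bm25Score    - score already accumulated from the per-term BM25 loop
//   queryLower   - the raw query, lowercased
//   queryWords   - cleaned query words used to form bigrams
//   originalText - the document's pre-tokenization text
function phraseBoost(bm25Score, queryLower, queryWords, originalText) {
  // Supplement only: documents with no BM25 relevance get no boost.
  if (bm25Score <= 0) return 0;
  const docLower = originalText.toLowerCase();
  // Exact: the whole query appears verbatim in the document.
  if (docLower.includes(queryLower)) return PHRASE_BOOST_EXACT;
  // Partial: first consecutive two-word bigram found wins.
  for (let i = 0; i < queryWords.length - 1; i++) {
    const bigram = queryWords[i] + " " + queryWords[i + 1];
    if (docLower.includes(bigram)) return PHRASE_BOOST_PARTIAL;
  }
  return 0;
}

const exact = phraseBoost(
  1.3, "vodka architecture", ["vodka", "architecture"], "Vodka Architecture Pattern"
);
const partial = phraseBoost(
  1.3, "vodka architecture overview", ["vodka", "architecture", "overview"],
  "The Vodka Architecture pattern"
);
const none = phraseBoost(1.3, "vodka pattern", ["vodka", "pattern"], "Pattern for vodka");
```

Because the exact and partial branches are mutually exclusive, a document receives at most one boost per query, so the two constants never stack.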
Bug 2 (workers/src/zip-baseline-fetcher.ts):
Cloudflare KV is eventually consistent — two requests seconds apart
can hit different edge nodes and return stale cached indexes even
when the SHA-keyed cache key looks valid.
Fix: after a KV cache hit in getIndex(), cross-check the cached
index's embedded commit_sha / canon_commit_sha against the SHAs just
resolved from the GitHub API. If they diverge the entry is stale;
log a warning, discard it, and rebuild from source.
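A minimal sketch of the cross-check, assuming a cached index object that carries commit_sha and canon_commit_sha fields as described above. The helper names isCacheFresh and getIndexWithVerification, and the rebuild callback, are hypothetical; kv stands for any object exposing an async get(key, "json") method such as a Workers KV binding.

```javascript
// Returns true when the cached index may be served.
// A resolved SHA that is null/undefined skips that comparison.
function isCacheFresh(cached, baselineSha, canonSha) {
  const baselineShaMatch = !baselineSha || cached.commit_sha === baselineSha;
  const canonShaMatch = !canonSha || cached.canon_commit_sha === canonSha;
  return baselineShaMatch && canonShaMatch;
}

// Serve the cached index only if its embedded SHAs agree with the ones
// just resolved; otherwise warn and fall through to a rebuild.
async function getIndexWithVerification(kv, cacheKey, baselineSha, canonSha, rebuild) {
  const cached = await kv.get(cacheKey, "json");
  if (cached) {
    if (isCacheFresh(cached, baselineSha, canonSha)) return cached;
    console.warn(
      "KV cache SHA mismatch: discarding stale index. " +
        `cached=${cached.commit_sha}/${cached.canon_commit_sha} ` +
        `resolved=${baselineSha}/${canonSha ?? "none"}`
    );
  }
  return rebuild();
}

const cachedIndex = { commit_sha: "aaa111", canon_commit_sha: "bbb222" };
const fresh = isCacheFresh(cachedIndex, "aaa111", "bbb222");
const staleBaseline = isCacheFresh(cachedIndex, "ccc333", "bbb222");
const canonUnresolved = isCacheFresh(cachedIndex, "aaa111", null);
const missingMetadata = isCacheFresh({ canon_commit_sha: "bbb222" }, "aaa111", "bbb222");
```

Note the asymmetry: an unresolved SHA on the caller's side skips the check, while a cached entry that lacks the field fails it and triggers a rebuild on every request.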
Deploying with
| Status | Name | Latest Commit | Preview URL | Updated (UTC) |
|---|---|---|---|---|
| ✅ Deployment successful! | oddkit | 3e0dd25 | Commit Preview URL, Branch Preview URL | Apr 09 2026, 01:35 PM |
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Bigram phrase boost skips punctuation normalization unlike tokenizer
- Applied the same punctuation-stripping (.replace(/[^\w\s-]/g, " ")) and compound-splitting (.split(/[\s\-_/]+/)) to queryWords in both bm25.ts and bm25.js so bigram tokens are clean and match document text correctly.
Preview (3e0dd25e14)
diff --git a/src/search/bm25.js b/src/search/bm25.js
--- a/src/search/bm25.js
+++ b/src/search/bm25.js
@@ -48,7 +48,7 @@
for (const doc of documents) {
const terms = tokenize(doc.text);
- docs.push({ id: doc.id, terms, length: terms.length });
+ docs.push({ id: doc.id, terms, length: terms.length, originalText: doc.text });
totalLength += terms.length;
const seen = new Set();
@@ -68,11 +68,21 @@
};
}
+// Phrase boost constants — supplement BM25, never replace it.
+// Exact: full query string found as substring in doc text.
+// Partial: any consecutive two-word query bigram found in doc text.
+const PHRASE_BOOST_EXACT = 5.0;
+const PHRASE_BOOST_PARTIAL = 2.0;
+
/** Search BM25 index, return sorted {id, score} pairs */
export function searchBM25(index, query, limit = 5) {
const queryTerms = tokenize(query);
if (queryTerms.length === 0) return [];
+ // Pre-compute phrase matching inputs once, outside the per-doc loop.
+ const queryLower = query.toLowerCase();
+ const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1 && !STOP_WORDS.has(w));
+
const scores = [];
for (const doc of index.docs) {
@@ -96,6 +106,23 @@
score += idf * tfNorm;
}
+ // Phrase boost: supplement BM25 — never replace it.
+ // Only apply when the document already has genuine BM25 relevance.
+ if (score > 0) {
+ const docLower = doc.originalText.toLowerCase();
+ if (docLower.includes(queryLower)) {
+ score += PHRASE_BOOST_EXACT;
+ } else if (queryWords.length >= 2) {
+ for (let i = 0; i < queryWords.length - 1; i++) {
+ const bigram = queryWords[i] + " " + queryWords[i + 1];
+ if (docLower.includes(bigram)) {
+ score += PHRASE_BOOST_PARTIAL;
+ break;
+ }
+ }
+ }
+ }
+
if (score > 0) scores.push({ id: doc.id, score });
}
diff --git a/workers/src/bm25.ts b/workers/src/bm25.ts
--- a/workers/src/bm25.ts
+++ b/workers/src/bm25.ts
@@ -44,6 +44,8 @@
id: string;
terms: string[];
length: number;
+ /** Original (pre-tokenization) text, used for phrase-level scoring. */
+ originalText: string;
}
export interface BM25Index {
@@ -63,7 +65,7 @@
for (const doc of documents) {
const terms = tokenize(doc.text);
- docs.push({ id: doc.id, terms, length: terms.length });
+ docs.push({ id: doc.id, terms, length: terms.length, originalText: doc.text });
totalLength += terms.length;
const seen = new Set<string>();
@@ -83,6 +85,12 @@
};
}
+// Phrase boost constants — supplement BM25, never replace it.
+// Exact: full query string found as substring in doc text.
+// Partial: any consecutive two-word query bigram found in doc text.
+const PHRASE_BOOST_EXACT = 5.0;
+const PHRASE_BOOST_PARTIAL = 2.0;
+
/** Search BM25 index, return sorted {id, score} pairs */
export function searchBM25(
index: BM25Index,
@@ -92,6 +100,10 @@
const queryTerms = tokenize(query);
if (queryTerms.length === 0) return [];
+ // Pre-compute phrase matching inputs once, outside the per-doc loop.
+ const queryLower = query.toLowerCase();
+ const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1 && !STOP_WORDS.has(w));
+
const scores: Array<{ id: string; score: number }> = [];
for (const doc of index.docs) {
@@ -119,6 +131,23 @@
score += idf * tfNorm;
}
+ // Phrase boost: supplement BM25 — never replace it.
+ // Only apply when the document already has genuine BM25 relevance.
+ if (score > 0) {
+ const docLower = doc.originalText.toLowerCase();
+ if (docLower.includes(queryLower)) {
+ score += PHRASE_BOOST_EXACT;
+ } else if (queryWords.length >= 2) {
+ for (let i = 0; i < queryWords.length - 1; i++) {
+ const bigram = queryWords[i] + " " + queryWords[i + 1];
+ if (docLower.includes(bigram)) {
+ score += PHRASE_BOOST_PARTIAL;
+ break;
+ }
+ }
+ }
+ }
+
if (score > 0) scores.push({ id: doc.id, score });
}
diff --git a/workers/src/zip-baseline-fetcher.ts b/workers/src/zip-baseline-fetcher.ts
--- a/workers/src/zip-baseline-fetcher.ts
+++ b/workers/src/zip-baseline-fetcher.ts
@@ -760,8 +760,22 @@
if (this.env.BASELINE_CACHE) {
const cached = await this.env.BASELINE_CACHE.get(cacheKey, "json") as BaselineIndex | null;
if (cached) {
- // Content-addressed cache hit: SHA matches, content is truthful
- return cached;
+ // Cloudflare KV is eventually consistent — two requests seconds apart
+ // can hit different edge nodes and return stale data even when the
+ // cache key looks correct. Cross-check the cached index's embedded
+ // commit SHAs against the SHAs we just resolved from the GitHub API.
+ // If they diverge, the cached entry is stale; discard and rebuild.
+ const baselineShaMatch = !baselineSha || cached.commit_sha === baselineSha;
+ const canonShaMatch = !canonSha || cached.canon_commit_sha === canonSha;
+ if (baselineShaMatch && canonShaMatch) {
+ // Content-addressed cache hit: SHA verified, content is truthful.
+ return cached;
+ }
+ console.warn(
+ `KV cache SHA mismatch — discarding stale index. ` +
+ `cached=${cached.commit_sha}/${cached.canon_commit_sha} ` +
+ `resolved=${baselineSha}/${canonSha ?? "none"}`
+ );
}
}
Reviewed by Cursor Bugbot for commit 44aa004.
…enize

queryWords was built by splitting the raw lowercased query on whitespace only, skipping the punctuation stripping and hyphen/underscore/slash splitting that tokenize() applies. This caused dirty tokens like pattern? or whats to form bigrams that never matched against clean document text, silently disabling partial phrase boost for punctuated queries.

Apply the same replace/split pipeline as tokenize (minus stemming) so bigram matching works correctly.
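The effect of the normalization can be sketched by comparing the two ways of building queryWords. The normalized version copies the replace/split/filter chain from the diff; the whitespace-only version is an assumed reconstruction of the earlier behaviour, and the STOP_WORDS set here is a small stand-in because the real list is not shown in this PR.

```javascript
// Stand-in stop-word list (assumption; the repo's STOP_WORDS is not shown).
const STOP_WORDS = new Set(["the", "is", "a", "of"]);

// Assumed pre-fix behaviour: split the lowercased query on whitespace only.
function queryWordsWhitespaceOnly(query) {
  return query.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
}

// Post-fix behaviour, as in the diff: strip punctuation, split compounds,
// drop one-character tokens and stop words.
function queryWordsNormalized(query) {
  return query
    .toLowerCase()
    .replace(/[^\w\s-]/g, " ")
    .split(/[\s\-_/]+/)
    .filter((w) => w.length > 1 && !STOP_WORDS.has(w));
}

// Consecutive two-word bigrams from a word list.
function bigrams(words) {
  const out = [];
  for (let i = 0; i < words.length - 1; i++) {
    out.push(words[i] + " " + words[i + 1]);
  }
  return out;
}

const docLower = "the vodka pattern explained";
const dirtyBigrams = bigrams(queryWordsWhitespaceOnly("Vodka pattern?"));
const cleanBigrams = bigrams(queryWordsNormalized("Vodka pattern?"));
const dirtyHit = dirtyBigrams.some((b) => docLower.includes(b));
const cleanHit = cleanBigrams.some((b) => docLower.includes(b));
```

The trailing question mark keeps the whitespace-only bigram from ever matching the document, while the normalized bigram matches as intended.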

Summary
Two surgical fixes to the search layer.
Bug 1: BM25 phrase boost (workers/src/bm25.ts, src/search/bm25.js)

Root cause: searchBM25 tokenises the query and scores each term independently via BM25. High-frequency terms (e.g. pattern) match many documents and dilute the signal from rare, precise terms (e.g. vodka), pushing exact-title matches down the ranking.

Fix: Store originalText on every BM25Doc during buildBM25Index. After BM25 scoring, apply a phrase-level boost:
- PHRASE_BOOST_EXACT (+5.0): the full lowercased query appears as a substring of the doc text (verbatim title/tag match).
- PHRASE_BOOST_PARTIAL (+2.0): any consecutive word bigram from the query appears in the doc text (e.g. "vodka architecture" inside "vodka architecture pattern"). First matching bigram wins.

Boosts supplement BM25; they never replace it. Applied identically to the Cloudflare Worker TypeScript version and the Node/stdio JavaScript version.

Bug 2: KV index staleness (workers/src/zip-baseline-fetcher.ts)

Root cause: Cloudflare KV is eventually consistent. Two requests seconds apart can hit different edge nodes and receive stale cached indexes, even when the SHA-keyed cache key looks valid.

Fix: After a KV cache hit in getIndex(), cross-check the cached index's embedded commit_sha / canon_commit_sha against the SHAs just resolved from the GitHub API. If they diverge, the entry is stale: log a warning, discard it, and rebuild from source.

Files changed

| File | Change |
|---|---|
| workers/src/bm25.ts | Adds originalText to BM25Doc; phrase boost in searchBM25 |
| src/search/bm25.js | Same changes for the Node/stdio version |
| workers/src/zip-baseline-fetcher.ts | SHA cross-check on the getIndex() cache-hit path |

No other files touched.
Note
Medium Risk
Changes search ranking and cache-hit behavior in production paths; regressions could affect result ordering and increase index rebuilds if SHA checks are wrong or metadata is missing.
Overview
Improves BM25 search relevance by storing each document’s originalText in the index and applying an additional phrase-level boost in searchBM25 for exact query substrings or matching query bigrams (only when the BM25 score is already positive), implemented in both src/search/bm25.js and workers/src/bm25.ts.

Hardens index cache correctness in workers/src/zip-baseline-fetcher.ts by validating KV cache hits against freshly resolved commit_sha / canon_commit_sha; mismatches are logged and force an index rebuild to avoid serving eventually-consistent stale data.

Reviewed by Cursor Bugbot for commit 3e0dd25. Bugbot is set up for automated code reviews on this repo.