fix: BM25 phrase boost + index freshness verification #72

Merged: klappy merged 3 commits into main from fix/search-phrase-boost-and-index-freshness on Apr 9, 2026
Conversation

klappy (Owner) commented Apr 9, 2026

Summary

Two surgical fixes to the search layer.


Bug 1: BM25 phrase boost (workers/src/bm25.ts, src/search/bm25.js)

Root cause: searchBM25 tokenises query terms and scores each independently via BM25. High-frequency terms (e.g. pattern) inflate their document-frequency contribution and dilute the signal from rare, precise terms (e.g. vodka), pushing exact-title matches down the ranking.

Fix: Store originalText on every BM25Doc during buildBM25Index. After BM25 scoring, apply a phrase-level boost:

  • +5.0 PHRASE_BOOST_EXACT — full lowercased query appears as a substring of the doc text (verbatim title/tag match).
  • +2.0 PHRASE_BOOST_PARTIAL — any consecutive word bigram from the query appears in the doc text (e.g. "vodka architecture" inside "vodka architecture pattern"). First matching bigram wins.

Boosts supplement BM25 — they never replace it. Applied identically to the Cloudflare Worker TypeScript version and the Node/stdio JavaScript version.
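The boost step can be sketched in isolation (a minimal JavaScript sketch; `applyPhraseBoost` is a hypothetical helper name, and the stop-word filtering applied to `queryWords` in the real code is omitted for brevity):

```javascript
// Sketch of the phrase-boost step described above. Constants mirror the PR;
// the surrounding BM25 term scoring is elided, and `score` is assumed to be
// the document's raw BM25 score.
const PHRASE_BOOST_EXACT = 5.0;
const PHRASE_BOOST_PARTIAL = 2.0;

function applyPhraseBoost(score, docText, query) {
  if (score <= 0) return score; // only boost docs with genuine BM25 relevance
  const docLower = docText.toLowerCase();
  const queryLower = query.toLowerCase();
  // Exact: full lowercased query appears verbatim in the doc text.
  if (docLower.includes(queryLower)) return score + PHRASE_BOOST_EXACT;
  // Partial: first matching consecutive word bigram wins.
  const queryWords = queryLower
    .replace(/[^\w\s-]/g, " ")
    .split(/[\s\-_/]+/)
    .filter((w) => w.length > 1);
  for (let i = 0; i < queryWords.length - 1; i++) {
    if (docLower.includes(queryWords[i] + " " + queryWords[i + 1])) {
      return score + PHRASE_BOOST_PARTIAL;
    }
  }
  return score;
}
```

An exact-title doc gains +5.0 on top of its BM25 score, a doc containing only "vodka architecture" gains +2.0, and a zero-score doc is never boosted.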


Bug 2: KV index staleness (workers/src/zip-baseline-fetcher.ts)

Root cause: Cloudflare KV is eventually consistent — two requests seconds apart can hit different edge nodes and receive stale cached indexes, even when the SHA-keyed cache key looks valid.

Fix: After a KV cache hit in getIndex(), cross-check the cached index's embedded commit_sha / canon_commit_sha against the SHAs just resolved from the GitHub API. If they diverge, the entry is stale: log a warning, discard it, and rebuild from source.
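The cross-check reduces to a small predicate (sketch only; `isCacheFresh` is a hypothetical name, and `cached` stands for the parsed BaselineIndex read from KV):

```javascript
// Sketch of the cache-hit freshness check. A SHA that failed to resolve
// cannot contradict the cache, so a missing resolved SHA passes.
function isCacheFresh(cached, baselineSha, canonSha) {
  const baselineOk = !baselineSha || cached.commit_sha === baselineSha;
  const canonOk = !canonSha || cached.canon_commit_sha === canonSha;
  return baselineOk && canonOk;
}
```

When this returns false, the PR's getIndex() path logs a warning, discards the entry, and rebuilds from source.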


Files changed

  • workers/src/bm25.ts: Add originalText to BM25Doc; phrase boost in searchBM25
  • src/search/bm25.js: Same fix for the Node/stdio server
  • workers/src/zip-baseline-fetcher.ts: SHA cross-check in the getIndex() cache-hit path

No other files touched.


Note

Medium Risk
Changes search ranking and cache-hit behavior in production paths; regressions could affect result ordering and increase index rebuilds if SHA checks are wrong or metadata is missing.

Overview
Improves BM25 search relevance by storing each document’s originalText in the index and applying an additional phrase-level boost in searchBM25 for exact query substrings or matching query bigrams (only when BM25 score is already positive), implemented in both src/search/bm25.js and workers/src/bm25.ts.

Hardens index cache correctness in workers/src/zip-baseline-fetcher.ts by validating KV cache hits against freshly resolved commit_sha/canon_commit_sha; mismatches are logged and force an index rebuild to avoid serving eventually-consistent stale data.

Reviewed by Cursor Bugbot for commit 3e0dd25.

Bug 1 (workers/src/bm25.ts, src/search/bm25.js):
BM25 scored every query token independently, letting high-frequency
terms like 'pattern' dilute rare-but-precise ones like 'vodka',
pushing exact-title matches down the rankings.

Fix: store originalText on BM25Doc during buildBM25Index, then after
BM25 scoring apply a phrase boost in searchBM25:
  - +5.0 (PHRASE_BOOST_EXACT)   if the full lowercased query appears
    as a substring of the doc's original text
  - +2.0 (PHRASE_BOOST_PARTIAL) if any consecutive word bigram from
    the query appears in the doc text (first hit wins)

These boosts supplement BM25; they never replace it. Applied to both
the Worker TypeScript version and the Node/stdio JS version for
consistency.

Bug 2 (workers/src/zip-baseline-fetcher.ts):
Cloudflare KV is eventually consistent — two requests seconds apart
can hit different edge nodes and return stale cached indexes even
when the SHA-keyed cache key looks valid.

Fix: after a KV cache hit in getIndex(), cross-check the cached
index's embedded commit_sha / canon_commit_sha against the SHAs just
resolved from the GitHub API. If they diverge the entry is stale;
log a warning, discard it, and rebuild from source.

cloudflare-workers-and-pages Bot commented Apr 9, 2026

Deploying with Cloudflare Workers

✅ Deployment successful! (oddkit, commit 3e0dd25, Apr 09 2026, 01:35 PM UTC)

Comment thread workers/src/bm25.ts

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Bigram phrase boost skips punctuation normalization unlike tokenizer
    • Applied the same punctuation-stripping (.replace(/[^\w\s-]/g, " ")) and compound-splitting (.split(/[\s\-_/]+/)) to queryWords in both bm25.ts and bm25.js so bigram tokens are clean and match document text correctly.
Preview (3e0dd25e14)
diff --git a/src/search/bm25.js b/src/search/bm25.js
--- a/src/search/bm25.js
+++ b/src/search/bm25.js
@@ -48,7 +48,7 @@
 
   for (const doc of documents) {
     const terms = tokenize(doc.text);
-    docs.push({ id: doc.id, terms, length: terms.length });
+    docs.push({ id: doc.id, terms, length: terms.length, originalText: doc.text });
     totalLength += terms.length;
 
     const seen = new Set();
@@ -68,11 +68,21 @@
   };
 }
 
+// Phrase boost constants — supplement BM25, never replace it.
+// Exact: full query string found as substring in doc text.
+// Partial: any consecutive two-word query bigram found in doc text.
+const PHRASE_BOOST_EXACT = 5.0;
+const PHRASE_BOOST_PARTIAL = 2.0;
+
 /** Search BM25 index, return sorted {id, score} pairs */
 export function searchBM25(index, query, limit = 5) {
   const queryTerms = tokenize(query);
   if (queryTerms.length === 0) return [];
 
+  // Pre-compute phrase matching inputs once, outside the per-doc loop.
+  const queryLower = query.toLowerCase();
+  const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1 && !STOP_WORDS.has(w));
+
   const scores = [];
 
   for (const doc of index.docs) {
@@ -96,6 +106,23 @@
       score += idf * tfNorm;
     }
 
+    // Phrase boost: supplement BM25 — never replace it.
+    // Only apply when the document already has genuine BM25 relevance.
+    if (score > 0) {
+      const docLower = doc.originalText.toLowerCase();
+      if (docLower.includes(queryLower)) {
+        score += PHRASE_BOOST_EXACT;
+      } else if (queryWords.length >= 2) {
+        for (let i = 0; i < queryWords.length - 1; i++) {
+          const bigram = queryWords[i] + " " + queryWords[i + 1];
+          if (docLower.includes(bigram)) {
+            score += PHRASE_BOOST_PARTIAL;
+            break;
+          }
+        }
+      }
+    }
+
     if (score > 0) scores.push({ id: doc.id, score });
   }
 

diff --git a/workers/src/bm25.ts b/workers/src/bm25.ts
--- a/workers/src/bm25.ts
+++ b/workers/src/bm25.ts
@@ -44,6 +44,8 @@
   id: string;
   terms: string[];
   length: number;
+  /** Original (pre-tokenization) text, used for phrase-level scoring. */
+  originalText: string;
 }
 
 export interface BM25Index {
@@ -63,7 +65,7 @@
 
   for (const doc of documents) {
     const terms = tokenize(doc.text);
-    docs.push({ id: doc.id, terms, length: terms.length });
+    docs.push({ id: doc.id, terms, length: terms.length, originalText: doc.text });
     totalLength += terms.length;
 
     const seen = new Set<string>();
@@ -83,6 +85,12 @@
   };
 }
 
+// Phrase boost constants — supplement BM25, never replace it.
+// Exact: full query string found as substring in doc text.
+// Partial: any consecutive two-word query bigram found in doc text.
+const PHRASE_BOOST_EXACT = 5.0;
+const PHRASE_BOOST_PARTIAL = 2.0;
+
 /** Search BM25 index, return sorted {id, score} pairs */
 export function searchBM25(
   index: BM25Index,
@@ -92,6 +100,10 @@
   const queryTerms = tokenize(query);
   if (queryTerms.length === 0) return [];
 
+  // Pre-compute phrase matching inputs once, outside the per-doc loop.
+  const queryLower = query.toLowerCase();
+  const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1 && !STOP_WORDS.has(w));
+
   const scores: Array<{ id: string; score: number }> = [];
 
   for (const doc of index.docs) {
@@ -119,6 +131,23 @@
       score += idf * tfNorm;
     }
 
+    // Phrase boost: supplement BM25 — never replace it.
+    // Only apply when the document already has genuine BM25 relevance.
+    if (score > 0) {
+      const docLower = doc.originalText.toLowerCase();
+      if (docLower.includes(queryLower)) {
+        score += PHRASE_BOOST_EXACT;
+      } else if (queryWords.length >= 2) {
+        for (let i = 0; i < queryWords.length - 1; i++) {
+          const bigram = queryWords[i] + " " + queryWords[i + 1];
+          if (docLower.includes(bigram)) {
+            score += PHRASE_BOOST_PARTIAL;
+            break;
+          }
+        }
+      }
+    }
+
     if (score > 0) scores.push({ id: doc.id, score });
   }
 

diff --git a/workers/src/zip-baseline-fetcher.ts b/workers/src/zip-baseline-fetcher.ts
--- a/workers/src/zip-baseline-fetcher.ts
+++ b/workers/src/zip-baseline-fetcher.ts
@@ -760,8 +760,22 @@
     if (this.env.BASELINE_CACHE) {
       const cached = await this.env.BASELINE_CACHE.get(cacheKey, "json") as BaselineIndex | null;
       if (cached) {
-        // Content-addressed cache hit: SHA matches, content is truthful
-        return cached;
+        // Cloudflare KV is eventually consistent — two requests seconds apart
+        // can hit different edge nodes and return stale data even when the
+        // cache key looks correct. Cross-check the cached index's embedded
+        // commit SHAs against the SHAs we just resolved from the GitHub API.
+        // If they diverge, the cached entry is stale; discard and rebuild.
+        const baselineShaMatch = !baselineSha || cached.commit_sha === baselineSha;
+        const canonShaMatch = !canonSha || cached.canon_commit_sha === canonSha;
+        if (baselineShaMatch && canonShaMatch) {
+          // Content-addressed cache hit: SHA verified, content is truthful.
+          return cached;
+        }
+        console.warn(
+          `KV cache SHA mismatch — discarding stale index. ` +
+          `cached=${cached.commit_sha}/${cached.canon_commit_sha} ` +
+          `resolved=${baselineSha}/${canonSha ?? "none"}`
+        );
       }
     }


Reviewed by Cursor Bugbot for commit 44aa004.

Comment thread workers/src/bm25.ts Outdated
…enize

queryWords was built by splitting the raw lowercased query on whitespace
only, skipping the punctuation stripping and hyphen/underscore/slash
splitting that tokenize() applies. This caused dirty tokens like
pattern? or whats to form bigrams that never matched against clean
document text, silently disabling partial phrase boost for punctuated
queries. Apply the same replace/split pipeline as tokenize (minus
stemming) so bigram matching works correctly.
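That pipeline, minus stemming, can be sketched as follows (`normalizeQueryWords` is a hypothetical name; the regexes follow the diff):

```javascript
// Sketch of the normalization now applied to queryWords, mirroring
// tokenize() without the stemming step.
function normalizeQueryWords(query) {
  return query
    .toLowerCase()
    .replace(/[^\w\s-]/g, " ") // strip punctuation such as "?" or ","
    .split(/[\s\-_/]+/)        // split on whitespace, hyphen, underscore, slash
    .filter((w) => w.length > 1);
}
```

A punctuated query like "vodka-architecture pattern?" now yields clean tokens ("vodka", "architecture", "pattern"), so its bigrams can match document text.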
@klappy klappy merged commit b7826ba into main Apr 9, 2026
5 checks passed
@klappy klappy deleted the fix/search-phrase-boost-and-index-freshness branch April 9, 2026 16:28