fix: BM25 phrase boost + index freshness verification #72

Merged: klappy merged 3 commits into main from fix/search-phrase-boost-and-index-freshness on Apr 9, 2026
Conversation

klappy (Owner) commented Apr 9, 2026

Summary

Two surgical fixes to the search layer.


Bug 1: BM25 phrase boost (workers/src/bm25.ts, src/search/bm25.js)

Root cause: searchBM25 tokenises query terms and scores each independently via BM25. High-frequency terms (e.g. pattern) inflate their document-frequency contribution and dilute the signal from rare, precise terms (e.g. vodka), pushing exact-title matches down the ranking.

Fix: Store originalText on every BM25Doc during buildBM25Index. After BM25 scoring, apply a phrase-level boost:

  • +5.0 PHRASE_BOOST_EXACT — full lowercased query appears as a substring of the doc text (verbatim title/tag match).
  • +2.0 PHRASE_BOOST_PARTIAL — any consecutive word bigram from the query appears in the doc text (e.g. "vodka architecture" inside "vodka architecture pattern"). First matching bigram wins.

Boosts supplement BM25 — they never replace it. Applied identically to the Cloudflare Worker TypeScript version and the Node/stdio JavaScript version.
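The boost step can be sketched in isolation (a minimal JavaScript sketch; `applyPhraseBoost` is a hypothetical helper name, and the stop-word filtering applied to `queryWords` in the real code is omitted for brevity):

```javascript
// Sketch of the phrase-boost step described above. Constants mirror the PR;
// the surrounding BM25 term scoring is elided, and `score` is assumed to be
// the document's raw BM25 score.
const PHRASE_BOOST_EXACT = 5.0;
const PHRASE_BOOST_PARTIAL = 2.0;

function applyPhraseBoost(score, docText, query) {
  if (score <= 0) return score; // only boost docs with genuine BM25 relevance
  const docLower = docText.toLowerCase();
  const queryLower = query.toLowerCase();
  // Exact: full lowercased query appears verbatim in the doc text.
  if (docLower.includes(queryLower)) return score + PHRASE_BOOST_EXACT;
  // Partial: first matching consecutive word bigram wins.
  const queryWords = queryLower
    .replace(/[^\w\s-]/g, " ")
    .split(/[\s\-_/]+/)
    .filter((w) => w.length > 1);
  for (let i = 0; i < queryWords.length - 1; i++) {
    if (docLower.includes(queryWords[i] + " " + queryWords[i + 1])) {
      return score + PHRASE_BOOST_PARTIAL;
    }
  }
  return score;
}
```

An exact-title doc gains +5.0 on top of its BM25 score, a doc containing only "vodka architecture" gains +2.0, and a zero-score doc is never boosted.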


Bug 2: KV index staleness (workers/src/zip-baseline-fetcher.ts)

Root cause: Cloudflare KV is eventually consistent — two requests seconds apart can hit different edge nodes and receive stale cached indexes, even when the SHA-keyed cache key looks valid.

Fix: After a KV cache hit in getIndex(), cross-check the cached index's embedded commit_sha / canon_commit_sha against the SHAs just resolved from the GitHub API. If they diverge, the entry is stale: log a warning, discard it, and rebuild from source.
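The cross-check reduces to a small predicate (sketch only; `isCacheFresh` is a hypothetical name, and `cached` stands for the parsed BaselineIndex read from KV):

```javascript
// Sketch of the cache-hit freshness check. A SHA that failed to resolve
// cannot contradict the cache, so a missing resolved SHA passes.
function isCacheFresh(cached, baselineSha, canonSha) {
  const baselineOk = !baselineSha || cached.commit_sha === baselineSha;
  const canonOk = !canonSha || cached.canon_commit_sha === canonSha;
  return baselineOk && canonOk;
}
```

When this returns false, the PR's getIndex() path logs a warning, discards the entry, and rebuilds from source.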


Files changed

  • workers/src/bm25.ts: Add originalText to BM25Doc; phrase boost in searchBM25
  • src/search/bm25.js: Same fix for the Node/stdio server
  • workers/src/zip-baseline-fetcher.ts: SHA cross-check in the getIndex() cache-hit path

No other files touched.


Note

Medium Risk
Changes search ranking and cache-hit behavior in production paths; regressions could affect result ordering and increase index rebuilds if SHA checks are wrong or metadata is missing.

Overview
Improves BM25 search relevance by storing each document’s originalText in the index and applying an additional phrase-level boost in searchBM25 for exact query substrings or matching query bigrams (only when BM25 score is already positive), implemented in both src/search/bm25.js and workers/src/bm25.ts.

Hardens index cache correctness in workers/src/zip-baseline-fetcher.ts by validating KV cache hits against freshly resolved commit_sha/canon_commit_sha; mismatches are logged and force an index rebuild to avoid serving eventually-consistent stale data.

Reviewed by Cursor Bugbot for commit 3e0dd25.

Bug 1 (workers/src/bm25.ts, src/search/bm25.js):
BM25 scored every query token independently, letting high-frequency
terms like 'pattern' dilute rare-but-precise ones like 'vodka',
pushing exact-title matches down the rankings.

Fix: store originalText on BM25Doc during buildBM25Index, then after
BM25 scoring apply a phrase boost in searchBM25:
  - +5.0 (PHRASE_BOOST_EXACT)   if the full lowercased query appears
    as a substring of the doc's original text
  - +2.0 (PHRASE_BOOST_PARTIAL) if any consecutive word bigram from
    the query appears in the doc text (first hit wins)

These boosts supplement BM25; they never replace it. Applied to both
the Worker TypeScript version and the Node/stdio JS version for
consistency.

Bug 2 (workers/src/zip-baseline-fetcher.ts):
Cloudflare KV is eventually consistent — two requests seconds apart
can hit different edge nodes and return stale cached indexes even
when the SHA-keyed cache key looks valid.

Fix: after a KV cache hit in getIndex(), cross-check the cached
index's embedded commit_sha / canon_commit_sha against the SHAs just
resolved from the GitHub API. If they diverge the entry is stale;
log a warning, discard it, and rebuild from source.

cloudflare-workers-and-pages Bot commented Apr 9, 2026

Deploying with Cloudflare Workers

✅ Deployment successful! (oddkit, commit 3e0dd25, Apr 09 2026, 01:35 PM UTC)

Comment thread workers/src/bm25.ts

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Bigram phrase boost skips punctuation normalization unlike tokenizer
    • Applied the same punctuation-stripping (.replace(/[^\w\s-]/g, " ")) and compound-splitting (.split(/[\s\-_/]+/)) to queryWords in both bm25.ts and bm25.js so bigram tokens are clean and match document text correctly.
Preview (3e0dd25e14)
diff --git a/src/search/bm25.js b/src/search/bm25.js
--- a/src/search/bm25.js
+++ b/src/search/bm25.js
@@ -48,7 +48,7 @@
 
   for (const doc of documents) {
     const terms = tokenize(doc.text);
-    docs.push({ id: doc.id, terms, length: terms.length });
+    docs.push({ id: doc.id, terms, length: terms.length, originalText: doc.text });
     totalLength += terms.length;
 
     const seen = new Set();
@@ -68,11 +68,21 @@
   };
 }
 
+// Phrase boost constants — supplement BM25, never replace it.
+// Exact: full query string found as substring in doc text.
+// Partial: any consecutive two-word query bigram found in doc text.
+const PHRASE_BOOST_EXACT = 5.0;
+const PHRASE_BOOST_PARTIAL = 2.0;
+
 /** Search BM25 index, return sorted {id, score} pairs */
 export function searchBM25(index, query, limit = 5) {
   const queryTerms = tokenize(query);
   if (queryTerms.length === 0) return [];
 
+  // Pre-compute phrase matching inputs once, outside the per-doc loop.
+  const queryLower = query.toLowerCase();
+  const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1 && !STOP_WORDS.has(w));
+
   const scores = [];
 
   for (const doc of index.docs) {
@@ -96,6 +106,23 @@
       score += idf * tfNorm;
     }
 
+    // Phrase boost: supplement BM25 — never replace it.
+    // Only apply when the document already has genuine BM25 relevance.
+    if (score > 0) {
+      const docLower = doc.originalText.toLowerCase();
+      if (docLower.includes(queryLower)) {
+        score += PHRASE_BOOST_EXACT;
+      } else if (queryWords.length >= 2) {
+        for (let i = 0; i < queryWords.length - 1; i++) {
+          const bigram = queryWords[i] + " " + queryWords[i + 1];
+          if (docLower.includes(bigram)) {
+            score += PHRASE_BOOST_PARTIAL;
+            break;
+          }
+        }
+      }
+    }
+
     if (score > 0) scores.push({ id: doc.id, score });
   }
 

diff --git a/workers/src/bm25.ts b/workers/src/bm25.ts
--- a/workers/src/bm25.ts
+++ b/workers/src/bm25.ts
@@ -44,6 +44,8 @@
   id: string;
   terms: string[];
   length: number;
+  /** Original (pre-tokenization) text, used for phrase-level scoring. */
+  originalText: string;
 }
 
 export interface BM25Index {
@@ -63,7 +65,7 @@
 
   for (const doc of documents) {
     const terms = tokenize(doc.text);
-    docs.push({ id: doc.id, terms, length: terms.length });
+    docs.push({ id: doc.id, terms, length: terms.length, originalText: doc.text });
     totalLength += terms.length;
 
     const seen = new Set<string>();
@@ -83,6 +85,12 @@
   };
 }
 
+// Phrase boost constants — supplement BM25, never replace it.
+// Exact: full query string found as substring in doc text.
+// Partial: any consecutive two-word query bigram found in doc text.
+const PHRASE_BOOST_EXACT = 5.0;
+const PHRASE_BOOST_PARTIAL = 2.0;
+
 /** Search BM25 index, return sorted {id, score} pairs */
 export function searchBM25(
   index: BM25Index,
@@ -92,6 +100,10 @@
   const queryTerms = tokenize(query);
   if (queryTerms.length === 0) return [];
 
+  // Pre-compute phrase matching inputs once, outside the per-doc loop.
+  const queryLower = query.toLowerCase();
+  const queryWords = queryLower.replace(/[^\w\s-]/g, " ").split(/[\s\-_/]+/).filter((w) => w.length > 1 && !STOP_WORDS.has(w));
+
   const scores: Array<{ id: string; score: number }> = [];
 
   for (const doc of index.docs) {
@@ -119,6 +131,23 @@
       score += idf * tfNorm;
     }
 
+    // Phrase boost: supplement BM25 — never replace it.
+    // Only apply when the document already has genuine BM25 relevance.
+    if (score > 0) {
+      const docLower = doc.originalText.toLowerCase();
+      if (docLower.includes(queryLower)) {
+        score += PHRASE_BOOST_EXACT;
+      } else if (queryWords.length >= 2) {
+        for (let i = 0; i < queryWords.length - 1; i++) {
+          const bigram = queryWords[i] + " " + queryWords[i + 1];
+          if (docLower.includes(bigram)) {
+            score += PHRASE_BOOST_PARTIAL;
+            break;
+          }
+        }
+      }
+    }
+
     if (score > 0) scores.push({ id: doc.id, score });
   }
 

diff --git a/workers/src/zip-baseline-fetcher.ts b/workers/src/zip-baseline-fetcher.ts
--- a/workers/src/zip-baseline-fetcher.ts
+++ b/workers/src/zip-baseline-fetcher.ts
@@ -760,8 +760,22 @@
     if (this.env.BASELINE_CACHE) {
       const cached = await this.env.BASELINE_CACHE.get(cacheKey, "json") as BaselineIndex | null;
       if (cached) {
-        // Content-addressed cache hit: SHA matches, content is truthful
-        return cached;
+        // Cloudflare KV is eventually consistent — two requests seconds apart
+        // can hit different edge nodes and return stale data even when the
+        // cache key looks correct. Cross-check the cached index's embedded
+        // commit SHAs against the SHAs we just resolved from the GitHub API.
+        // If they diverge, the cached entry is stale; discard and rebuild.
+        const baselineShaMatch = !baselineSha || cached.commit_sha === baselineSha;
+        const canonShaMatch = !canonSha || cached.canon_commit_sha === canonSha;
+        if (baselineShaMatch && canonShaMatch) {
+          // Content-addressed cache hit: SHA verified, content is truthful.
+          return cached;
+        }
+        console.warn(
+          `KV cache SHA mismatch — discarding stale index. ` +
+          `cached=${cached.commit_sha}/${cached.canon_commit_sha} ` +
+          `resolved=${baselineSha}/${canonSha ?? "none"}`
+        );
       }
     }


Reviewed by Cursor Bugbot for commit 44aa004.

Comment thread workers/src/bm25.ts Outdated
…enize

queryWords was built by splitting the raw lowercased query on whitespace
only, skipping the punctuation stripping and hyphen/underscore/slash
splitting that tokenize() applies. This caused dirty tokens like
pattern? or whats to form bigrams that never matched against clean
document text, silently disabling partial phrase boost for punctuated
queries. Apply the same replace/split pipeline as tokenize (minus
stemming) so bigram matching works correctly.
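That pipeline, minus stemming, can be sketched as follows (`normalizeQueryWords` is a hypothetical name; the regexes follow the diff):

```javascript
// Sketch of the normalization now applied to queryWords, mirroring
// tokenize() without the stemming step.
function normalizeQueryWords(query) {
  return query
    .toLowerCase()
    .replace(/[^\w\s-]/g, " ") // strip punctuation such as "?" or ","
    .split(/[\s\-_/]+/)        // split on whitespace, hyphen, underscore, slash
    .filter((w) => w.length > 1);
}
```

A punctuated query like "vodka-architecture pattern?" now yields clean tokens ("vodka", "architecture", "pattern"), so its bigrams can match document text.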
@klappy klappy merged commit b7826ba into main Apr 9, 2026
5 checks passed
@klappy klappy deleted the fix/search-phrase-boost-and-index-freshness branch April 9, 2026 16:28