Skip to content

feat(search): incremental index updates #41

Merged
harlan-zw merged 1 commit into main from feat/incremental-search-index on Mar 21, 2026

Conversation

Collaborator

@harlan-zw harlan-zw commented Mar 21, 2026

🔗 Linked issue

Resolves #28

❓ Type of change

  • 📖 Documentation
  • 🐞 Bug fix
  • 👌 Enhancement
  • ✨ New feature
  • 🧹 Chore
  • ⚠️ Breaking change

📚 Description

indexResources previously had all-or-nothing behavior: if the search DB existed, it skipped entirely; if not, it rebuilt everything from scratch. Adding a single new issue to a package with 2000+ indexed docs meant either keeping a stale index or nuking and rebuilding the entire corpus.

Now when the DB exists, the function diffs incoming docs against stored IDs and only processes the delta. New docs get chunked, embedded, and stored; stale docs (and their chunks) get removed; unchanged docs are skipped. Uses node:sqlite directly to query raw chunk-level IDs from the DB, bypassing retriv's parent-ID deduplication so exact chunk IDs can be passed to remove(). Bumps retriv to 0.12.0 for listIds() support.
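The diff step described above can be sketched roughly as follows. This is an illustration, not the code from `src/commands/sync-shared.ts`: the `Doc` shape, the `diffIndex` name, and the `#chunk-N` suffix used to map a chunk ID back to its parent are all assumptions made for the example. Only `parentDocId` is a name taken from the PR, and its real implementation may differ. The sketch compares IDs only, so a document whose content changed under the same ID is treated as unchanged.

```typescript
// Hypothetical document shape; the project's real type may carry more fields.
interface Doc {
  id: string
  content: string
}

interface IndexDiff {
  newDocs: Doc[]
  removeIds: string[]
}

// Assumption: a chunk ID is the parent ID followed by "#chunk-N".
const CHUNK_SUFFIX = /#chunk-\d+$/

// Map a stored ID (parent or chunk) back to its parent document ID.
function parentDocId(id: string): string {
  return id.replace(CHUNK_SUFFIX, '')
}

// Compare incoming docs with the raw IDs stored in the index.
// - newDocs: incoming docs whose ID has no stored parent
// - removeIds: every stored ID (parent and chunk) whose parent is no longer incoming
function diffIndex(incoming: Doc[], storedIds: string[]): IndexDiff {
  const incomingIds = new Set(incoming.map(doc => doc.id))
  const storedParents = new Set(storedIds.map(parentDocId))

  const newDocs = incoming.filter(doc => !storedParents.has(doc.id))
  const removeIds = storedIds.filter(id => !incomingIds.has(parentDocId(id)))

  return { newDocs, removeIds }
}
```

When both `newDocs` and `removeIds` come back empty, the index is already up to date and no work is needed. Keeping the chunk-level IDs in `removeIds` is what lets the exact rows be deleted rather than only the parent entry.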

Summary by CodeRabbit

Release Notes

  • New Features

    • Search index updates are now incremental, processing only new and changed documents instead of rebuilding the entire index—resulting in faster indexing operations
    • Automatic detection and removal of stale documents from the search index
  • Dependencies

    • Updated search indexing library to ^0.12.0

…uild

When the search DB already exists, indexResources now diffs incoming
docs against the stored index and only processes the delta: new docs
get chunked/embedded/stored, stale docs and their chunks get removed,
unchanged docs are skipped entirely.

Uses node:sqlite directly to query raw chunk-level IDs from the DB
(bypassing retriv's parent-ID deduplication) so exact chunk IDs can
be passed to remove(). Bumps retriv to 0.12.0 for listIds() support.

Resolves #28

coderabbitai bot commented Mar 21, 2026

📝 Walkthrough


The changes implement incremental search index updates, allowing the system to track existing indexed documents, compute diffs between incoming and stored documents, and perform targeted index modifications instead of full rebuilds.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration & Dependencies: package.json, pnpm-workspace.yaml | Updated retriv dependency resolution from catalog:deps to catalog: and bumped version from ^0.11.0 to ^0.12.0. |
| Retriv Index APIs: src/retriv/index.ts | Extended createIndex and createIndexDirect to accept optional removeIds for targeted deletions. Added listIndexIds() to query existing indexed document IDs and removeFromIndex() to remove documents by ID. |
| Worker & Pool Layer: src/retriv/pool.ts, src/retriv/worker.ts | Updated worker communication to forward optional removeIds through WorkerMessage, and worker handler now conditionally removes stale documents before indexing. |
| Sync Command Logic: src/commands/sync-shared.ts | Refactored indexResources to support incremental updates: lists existing index IDs, diffs incoming documents, and performs targeted createIndex calls with new/removed documents. Added helper functions parentDocId() and capDocs() for diff computation and document truncation. |
| Unit Tests: test/unit/sync-shared.test.ts | Updated mocks and tests to cover three incremental scenarios: index up-to-date (no-op), new documents added, and stale documents removed. |

Sequence Diagram

sequenceDiagram
    participant SyncCmd as sync-shared command
    participant ReIndex as retriv index.ts
    participant Pool as retriv pool.ts
    participant Worker as retriv worker
    participant DB as SQLite DB

    SyncCmd->>ReIndex: listIndexIds(config)
    ReIndex->>DB: SELECT id FROM documents_meta
    DB-->>ReIndex: existing IDs
    ReIndex-->>SyncCmd: existing IDs

    SyncCmd->>SyncCmd: compute diff:<br/>newDocs, removeIds
    
    alt Up to date
        SyncCmd-->>SyncCmd: emit "Search index up to date"
    else Has changes
        SyncCmd->>Pool: createIndexInWorker(newDocs, {removeIds})
        Pool->>Worker: post WorkerMessage {removeIds, docs}
        Worker->>DB: getDb(config)
        Worker->>DB: db.remove(removeIds)
        Worker->>DB: db.index(newDocs)
        Worker-->>Pool: indexing complete
        Pool-->>SyncCmd: done
    end
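
The "Has changes" branch of the diagram, where the worker removes stale IDs before indexing new docs, might look something like the sketch below. The `IndexStore` interface and `MemoryStore` class are hypothetical stand-ins for the real retriv DB handle, and the `WorkerMessage` fields are reduced to the two that matter here; the actual message type in `src/retriv/worker.ts` likely carries more.

```typescript
// Hypothetical document shape for the example.
interface Doc {
  id: string
  content: string
}

// Reduced message shape: only the fields relevant to incremental updates.
interface WorkerMessage {
  docs: Doc[]
  removeIds?: string[]
}

// Hypothetical store interface standing in for the DB handle.
interface IndexStore {
  remove(ids: string[]): Promise<void>
  index(docs: Doc[]): Promise<void>
}

// Remove stale entries first (only when some were passed), then index the new docs.
async function handleIndexMessage(db: IndexStore, msg: WorkerMessage): Promise<void> {
  if (msg.removeIds?.length)
    await db.remove(msg.removeIds)
  await db.index(msg.docs)
}

// In-memory store used to demonstrate call order; not part of the project.
class MemoryStore implements IndexStore {
  readonly ids = new Set<string>()
  readonly calls: string[] = []

  async remove(ids: string[]): Promise<void> {
    this.calls.push('remove')
    for (const id of ids)
      this.ids.delete(id)
  }

  async index(docs: Doc[]): Promise<void> {
    this.calls.push('index')
    for (const doc of docs)
      this.ids.add(doc.id)
  }
}
```

Because `removeIds` is optional, a caller doing a full build can send the same message shape without it and the removal step is skipped.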

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Hops with joy through indexed fields,
No more full rebuilds on the wheel,
Diff the docs, keep what's the same,
Incremental updates—what a game!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(search): incremental index updates' directly and clearly describes the main change: implementing incremental search index updates instead of all-or-nothing behavior. |
| Linked Issues check | ✅ Passed | All coding requirements from issue #28 are met: tracking existing document IDs via listIndexIds [#28], diffing incoming docs against stored IDs to identify new/stale docs [#28], chunking/embedding only the delta by passing removeIds to createIndex [#28], and supporting incremental updates in retriv layer [#28]. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to implementing incremental search indexing: dependency upgrade to retriv 0.12.0, additions to retriv APIs for listing and removing documents, sync-shared diff logic, worker/pool integration, and corresponding tests. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%. |


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
src/retriv/index.ts (1)

86-100: Consider handling the case when the DB file doesn't exist.

DatabaseSync will throw SQLITE_CANTOPEN if the database file doesn't exist. While the caller (indexResources) guards with existsSync(dbPath), this function's contract doesn't make that precondition explicit. If called independently without the file existing, it will throw rather than returning [].

Consider adding a file existence check or documenting the precondition:

🛡️ Optional: Add defensive file check
 export async function listIndexIds(
   config: Pick<IndexConfig, 'dbPath'>,
 ): Promise<string[]> {
   const nodeSqlite = globalThis.process?.getBuiltinModule?.('node:sqlite') as typeof import('node:sqlite') | undefined
   if (!nodeSqlite)
     return []
+  const { existsSync } = await import('node:fs')
+  if (!existsSync(config.dbPath))
+    return []
   const db = new nodeSqlite.DatabaseSync(config.dbPath, { open: true, readOnly: true })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/retriv/index.ts` around lines 86 - 100, The listIndexIds function
currently calls new nodeSqlite.DatabaseSync(config.dbPath) which throws if the
DB file is missing; update listIndexIds to defensively handle a missing DB by
checking fs.existsSync(config.dbPath) (or catching the SQLITE_CANTOPEN error)
before creating DatabaseSync and return [] when the file doesn't exist or cannot
be opened; reference the existing symbols listIndexIds and DatabaseSync when
locating the change and ensure the finally block still closes the DB only when
opened.
test/unit/sync-shared.test.ts (1)

696-739: Good test coverage for incremental indexing scenarios.

The tests cover the three key cases:

  1. Up-to-date index (no changes needed)
  2. New docs only (add without removals)
  3. Stale docs (removals including chunk IDs)

Consider adding a test for the combined case where both new docs are added and stale docs are removed simultaneously, as this is the common real-world scenario.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/unit/sync-shared.test.ts` around lines 696 - 739, Add a new unit test
that covers the combined scenario where the DB exists and there are both new
docs to add and stale IDs to remove: mock existsSync to true, mock listIndexIds
to include existing IDs plus stale IDs (and their chunk IDs), mock resolvePkgDir
as in other tests, call indexResources with docs that include an existing doc
and at least one new doc, then assert createIndex was called, that the first
argument contains only the new doc(s) (by id) and the second argument's
removeIds lists the stale IDs (including chunk ids). Use the same helpers/mocks
referenced in nearby tests: indexResources, createIndex, listIndexIds,
existsSync, resolvePkgDir.
src/commands/sync-shared.ts (1)

824-835: Verify listIndexIds error handling for corrupt/unreadable DB.

listIndexIds can throw errors other than SearchDepsUnavailableError (e.g., SQLITE_CORRUPT, permission errors). These would propagate as uncaught exceptions.

Consider whether these should be caught and trigger a full rebuild instead:

🛡️ Optional: Fallback to full rebuild on read errors
   let existingIds: string[]
   try {
     existingIds = await listIndexIds({ dbPath })
   }
   catch (err) {
     if (err instanceof SearchDepsUnavailableError) {
       onProgress('Search indexing skipped (native deps unavailable)')
       return
     }
+    // DB unreadable (corrupt, permissions, etc.) - fall back to full rebuild
+    onProgress('Index unreadable, rebuilding')
+    rmSync(dbPath, { recursive: true, force: true })
+    existingIds = []
-    throw err
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/commands/sync-shared.ts` around lines 824 - 835, The call to listIndexIds
can throw DB read errors (e.g., SQLITE_CORRUPT, EACCES) that currently bubble
up; instead of rethrowing those, detect non-SearchDepsUnavailableError
exceptions inside the try/catch around listIndexIds and treat them as a
recoverable read-failure by logging via onProgress (include err.message) and
falling back to a full rebuild path — for example set existingIds to an empty
array or flip the incremental/full-rebuild flag so the subsequent incremental
update logic (which uses existingIds and dbPath) performs a full rebuild; keep
the existing behavior for SearchDepsUnavailableError (still skip indexing) and
do not swallow unexpected exceptions silently.
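
One way to structure the suggested fallback is to separate the decision from the side effects. The sketch below is only an illustration under assumptions: `planIndexing` and `IndexPlan` are hypothetical names, the `SearchDepsUnavailableError` class here is a local stand-in for the project's own, and the ID lister is injected so the example runs without a database. Deleting the unreadable DB file, as in the diff above, would be left to the caller when the plan is `rebuild`.

```typescript
// Local stand-in for the project's error class of the same name.
class SearchDepsUnavailableError extends Error {}

// Hypothetical result type describing what the caller should do next.
type IndexPlan =
  | { mode: 'skip' }
  | { mode: 'incremental', existingIds: string[] }
  | { mode: 'rebuild', reason: string }

// Decide between skipping, an incremental update, and a full rebuild.
// listIds is injected so the decision logic can be exercised without a real DB.
async function planIndexing(
  listIds: () => Promise<string[]>,
  onProgress: (message: string) => void,
): Promise<IndexPlan> {
  try {
    return { mode: 'incremental', existingIds: await listIds() }
  }
  catch (err) {
    if (err instanceof SearchDepsUnavailableError) {
      onProgress('Search indexing skipped (native deps unavailable)')
      return { mode: 'skip' }
    }
    // Any other read failure (corrupt file, permissions) is reported, not swallowed.
    const reason = err instanceof Error ? err.message : String(err)
    onProgress(`Index unreadable, rebuilding (${reason})`)
    return { mode: 'rebuild', reason }
  }
}
```

Returning a plan instead of mutating state inside the `catch` keeps the three outcomes explicit and makes each one straightforward to cover in a unit test.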

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 85e3bd59-c515-43ca-bf20-eeaf4833a0f9

📥 Commits

Reviewing files that changed from the base of the PR and between 9f2b9b7 and 5cd8fb0.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (7)
  • package.json
  • pnpm-workspace.yaml
  • src/commands/sync-shared.ts
  • src/retriv/index.ts
  • src/retriv/pool.ts
  • src/retriv/worker.ts
  • test/unit/sync-shared.test.ts

@harlan-zw harlan-zw merged commit fe66fe2 into main Mar 21, 2026
3 checks passed


Development

Successfully merging this pull request may close these issues.

Search index rebuild is all-or-nothing - no incremental updates

1 participant