feat(search): incremental index updates by harlan-zw · Pull Request #41 · skilld-dev/skilld

harlan-zw · 2026-03-21T05:13:19Z

🔗 Linked issue

Resolves #28

❓ Type of change

📚 Description

indexResources previously had all-or-nothing behavior: if the search DB existed, it skipped entirely; if not, it rebuilt everything from scratch. Adding a single new issue to a package with 2000+ indexed docs meant either keeping a stale index or nuking and rebuilding the entire corpus.

Now when the DB exists, the function diffs incoming docs against stored IDs and only processes the delta. New docs get chunked, embedded, and stored; stale docs (and their chunks) get removed; unchanged docs are skipped. Uses node:sqlite directly to query raw chunk-level IDs from the DB, bypassing retriv's parent-ID deduplication so exact chunk IDs can be passed to remove(). Bumps retriv to 0.12.0 for listIds() support.

Summary by CodeRabbit

Release Notes

New Features
- Search index updates are now incremental, processing only new and changed documents instead of rebuilding the entire index—resulting in faster indexing operations
- Automatic detection and removal of stale documents from the search index
Dependencies
- Updated search indexing library to ^0.12.0

…uild When the search DB already exists, indexResources now diffs incoming docs against the stored index and only processes the delta: new docs get chunked/embedded/stored, stale docs and their chunks get removed, unchanged docs are skipped entirely. Uses node:sqlite directly to query raw chunk-level IDs from the DB (bypassing retriv's parent-ID deduplication) so exact chunk IDs can be passed to remove(). Bumps retriv to 0.12.0 for listIds() support. Resolves #28

coderabbitai · 2026-03-21T05:13:34Z

📝 Walkthrough

Walkthrough

The changes implement incremental search index updates, allowing the system to track existing indexed documents, compute diffs between incoming and stored documents, and perform targeted index modifications instead of full rebuilds.

Changes

Cohort / File(s)	Summary
Configuration & Dependencies `package.json`, `pnpm-workspace.yaml`	Updated `retriv` dependency resolution from `catalog:deps` to `catalog:` and bumped version from `^0.11.0` to `^0.12.0`.
Retriv Index APIs `src/retriv/index.ts`	Extended `createIndex` and `createIndexDirect` to accept optional `removeIds` for targeted deletions. Added `listIndexIds()` to query existing indexed document IDs and `removeFromIndex()` to remove documents by ID.
Worker & Pool Layer `src/retriv/pool.ts`, `src/retriv/worker.ts`	Updated worker communication to forward optional `removeIds` through `WorkerMessage`, and worker handler now conditionally removes stale documents before indexing.
Sync Command Logic `src/commands/sync-shared.ts`	Refactored `indexResources` to support incremental updates: lists existing index IDs, diffs incoming documents, and performs targeted `createIndex` calls with new/removed documents. Added helper functions `parentDocId()` and `capDocs()` for diff computation and document truncation.
Unit Tests `test/unit/sync-shared.test.ts`	Updated mocks and tests to cover three incremental scenarios: index up-to-date (no-op), new documents added, and stale documents removed.

Sequence Diagram

sequenceDiagram
    participant SyncCmd as sync-shared command
    participant ReIndex as retriv index.ts
    participant Pool as retriv pool.ts
    participant Worker as retriv worker
    participant DB as SQLite DB

    SyncCmd->>ReIndex: listIndexIds(config)
    ReIndex->>DB: SELECT id FROM documents_meta
    DB-->>ReIndex: existing IDs
    ReIndex-->>SyncCmd: existing IDs

    SyncCmd->>SyncCmd: compute diff:<br/>newDocs, removeIds
    
    alt Up to date
        SyncCmd-->>SyncCmd: emit "Search index up to date"
    else Has changes
        SyncCmd->>Pool: createIndexInWorker(newDocs, {removeIds})
        Pool->>Worker: post WorkerMessage {removeIds, docs}
        Worker->>DB: getDb(config)
        Worker->>DB: db.remove(removeIds)
        Worker->>DB: db.index(newDocs)
        Worker-->>Pool: indexing complete
        Pool-->>SyncCmd: done
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 Hops with joy through indexed fields,
No more full rebuilds on the wheel,
Diff the docs, keep what's the same,
Incremental updates—what a game! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat(search): incremental index updates' directly and clearly describes the main change: implementing incremental search index updates instead of all-or-nothing behavior.
Linked Issues check	✅ Passed	All coding requirements from issue `#28` are met: tracking existing document IDs via listIndexIds [`#28`], diffing incoming docs against stored IDs to identify new/stale docs [`#28`], chunking/embedding only the delta by passing removeIds to createIndex [`#28`], and supporting incremental updates in retriv layer [`#28`].
Out of Scope Changes check	✅ Passed	All changes are directly scoped to implementing incremental search indexing: dependency upgrade to retriv 0.12.0, additions to retriv APIs for listing and removing documents, sync-shared diff logic, worker/pool integration, and corresponding tests.
Docstring Coverage	✅ Passed	Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/incremental-search-index

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (3)

src/retriv/index.ts (1)

86-100: Consider handling the case when the DB file doesn't exist.

DatabaseSync will throw SQLITE_CANTOPEN if the database file doesn't exist. While the caller (indexResources) guards with existsSync(dbPath), this function's contract doesn't make that precondition explicit. If called independently without the file existing, it will throw rather than returning [].

Consider adding a file existence check or documenting the precondition:
🛡️ Optional: Add defensive file check
 export async function listIndexIds(
   config: Pick<IndexConfig, 'dbPath'>,
 ): Promise<string[]> {
   const nodeSqlite = globalThis.process?.getBuiltinModule?.('node:sqlite') as typeof import('node:sqlite') | undefined
   if (!nodeSqlite)
     return []
+  const { existsSync } = await import('node:fs')
+  if (!existsSync(config.dbPath))
+    return []
   const db = new nodeSqlite.DatabaseSync(config.dbPath, { open: true, readOnly: true })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/retriv/index.ts` around lines 86 - 100, The listIndexIds function
currently calls new nodeSqlite.DatabaseSync(config.dbPath) which throws if the
DB file is missing; update listIndexIds to defensively handle a missing DB by
checking fs.existsSync(config.dbPath) (or catching the SQLITE_CANTOPEN error)
before creating DatabaseSync and return [] when the file doesn't exist or cannot
be opened; reference the existing symbols listIndexIds and DatabaseSync when
locating the change and ensure the finally block still closes the DB only when
opened.

test/unit/sync-shared.test.ts (1)

696-739: Good test coverage for incremental indexing scenarios.

The tests cover the three key cases:

Up-to-date index (no changes needed)

New docs only (add without removals)

Stale docs (removals including chunk IDs)

Consider adding a test for the combined case where both new docs are added and stale docs are removed simultaneously, as this is the common real-world scenario.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/unit/sync-shared.test.ts` around lines 696 - 739, Add a new unit test
that covers the combined scenario where the DB exists and there are both new
docs to add and stale IDs to remove: mock existsSync to true, mock listIndexIds
to include existing IDs plus stale IDs (and their chunk IDs), mock resolvePkgDir
as in other tests, call indexResources with docs that include an existing doc
and at least one new doc, then assert createIndex was called, that the first
argument contains only the new doc(s) (by id) and the second argument's
removeIds lists the stale IDs (including chunk ids). Use the same helpers/mocks
referenced in nearby tests: indexResources, createIndex, listIndexIds,
existsSync, resolvePkgDir.

src/commands/sync-shared.ts (1)

824-835: Verify listIndexIds error handling for corrupt/unreadable DB.

listIndexIds can throw errors other than SearchDepsUnavailableError (e.g., SQLITE_CORRUPT, permission errors). These would propagate as uncaught exceptions.

Consider whether these should be caught and trigger a full rebuild instead:

🛡️ Optional: Fallback to full rebuild on read errors

   let existingIds: string[]
   try {
     existingIds = await listIndexIds({ dbPath })
   }
   catch (err) {
     if (err instanceof SearchDepsUnavailableError) {
       onProgress('Search indexing skipped (native deps unavailable)')
       return
     }
+    // DB unreadable (corrupt, permissions, etc.) - fall back to full rebuild
+    onProgress('Index unreadable, rebuilding')
+    rmSync(dbPath, { recursive: true, force: true })
+    existingIds = []
-    throw err
   }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/commands/sync-shared.ts` around lines 824 - 835, The call to listIndexIds
can throw DB read errors (e.g., SQLITE_CORRUPT, EACCES) that currently bubble
up; instead of rethrowing those, detect non-SearchDepsUnavailableError
exceptions inside the try/catch around listIndexIds and treat them as a
recoverable read-failure by logging via onProgress (include err.message) and
falling back to a full rebuild path — for example set existingIds to an empty
array or flip the incremental/full-rebuild flag so the subsequent incremental
update logic (which uses existingIds and dbPath) performs a full rebuild; keep
the existing behavior for SearchDepsUnavailableError (still skip indexing) and
do not swallow unexpected exceptions silently.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@src/commands/sync-shared.ts`:
- Around line 824-835: The call to listIndexIds can throw DB read errors (e.g.,
SQLITE_CORRUPT, EACCES) that currently bubble up; instead of rethrowing those,
detect non-SearchDepsUnavailableError exceptions inside the try/catch around
listIndexIds and treat them as a recoverable read-failure by logging via
onProgress (include err.message) and falling back to a full rebuild path — for
example set existingIds to an empty array or flip the incremental/full-rebuild
flag so the subsequent incremental update logic (which uses existingIds and
dbPath) performs a full rebuild; keep the existing behavior for
SearchDepsUnavailableError (still skip indexing) and do not swallow unexpected
exceptions silently.

In `@src/retriv/index.ts`:
- Around line 86-100: The listIndexIds function currently calls new
nodeSqlite.DatabaseSync(config.dbPath) which throws if the DB file is missing;
update listIndexIds to defensively handle a missing DB by checking
fs.existsSync(config.dbPath) (or catching the SQLITE_CANTOPEN error) before
creating DatabaseSync and return [] when the file doesn't exist or cannot be
opened; reference the existing symbols listIndexIds and DatabaseSync when
locating the change and ensure the finally block still closes the DB only when
opened.

In `@test/unit/sync-shared.test.ts`:
- Around line 696-739: Add a new unit test that covers the combined scenario
where the DB exists and there are both new docs to add and stale IDs to remove:
mock existsSync to true, mock listIndexIds to include existing IDs plus stale
IDs (and their chunk IDs), mock resolvePkgDir as in other tests, call
indexResources with docs that include an existing doc and at least one new doc,
then assert createIndex was called, that the first argument contains only the
new doc(s) (by id) and the second argument's removeIds lists the stale IDs
(including chunk ids). Use the same helpers/mocks referenced in nearby tests:
indexResources, createIndex, listIndexIds, existsSync, resolvePkgDir.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 85e3bd59-c515-43ca-bf20-eeaf4833a0f9

📥 Commits

Reviewing files that changed from the base of the PR and between 9f2b9b7 and 5cd8fb0.

⛔ Files ignored due to path filters (1)

pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml

📒 Files selected for processing (7)

package.json
pnpm-workspace.yaml
src/commands/sync-shared.ts
src/retriv/index.ts
src/retriv/pool.ts
src/retriv/worker.ts
test/unit/sync-shared.test.ts

coderabbitai bot reviewed Mar 21, 2026

View reviewed changes

harlan-zw merged commit fe66fe2 into main Mar 21, 2026
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(search): incremental index updates#41

feat(search): incremental index updates#41
harlan-zw merged 1 commit intomainfrom
feat/incremental-search-index

harlan-zw commented Mar 21, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Mar 21, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

harlan-zw commented Mar 21, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Linked issue

❓ Type of change

📚 Description

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Mar 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

harlan-zw commented Mar 21, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 21, 2026 •

edited

Loading