From ecd80243f74315cdbd0609906131163bf38c8d7a Mon Sep 17 00:00:00 2001
From: Ruben de Smet <ruben@lunascens.io>
Date: Mon, 18 May 2026 12:28:34 +0200
Subject: [PATCH] fix(boot): make rebuildIndex non-blocking so viewer + later
 boot steps run
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

mem::observe's boot flow had this sequence in main():

  1. registerSearchFunction / registerContextFunction / ...
     (sync — completes immediately)
  2. restore persisted vector index from disk
  3. await rebuildIndex(kv)        ← blocks here
  4. bootLog "Ready" / "REST API" / "MCP surface"
  5. startViewerServer(...)
  6. setInterval auto-forget / lesson decay / consolidation

rebuildIndex iterates every observation across every session and AWAITS
an embedding-provider call per record. On a large corpus + a rate-limited
embedding endpoint (e.g. 100 RPM), step 3 takes hours to days.
Everything that runs AFTER it — including startViewerServer — is
silently delayed for the same duration.
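
For a sense of scale, a back-of-the-envelope estimate, assuming one
embedding request per record and no retries. The 100,000-record corpus
size is a hypothetical illustration, not a measurement, and
estimateRebuildMinutes is not a function in this codebase:

```typescript
// Rough lower bound on rebuild time when every record costs one
// embedding request and the provider caps requests per minute.
// Skipped or retried requests (e.g. after a 429) are not modelled.
function estimateRebuildMinutes(
  records: number,
  requestsPerMinute: number,
): number {
  if (requestsPerMinute <= 0) {
    throw new RangeError("requestsPerMinute must be > 0");
  }
  return records / requestsPerMinute;
}

// Hypothetical corpus size, at the 100 RPM limit mentioned above.
const minutes = estimateRebuildMinutes(100_000, 100);
console.log(`${(minutes / 60).toFixed(1)} hours`); // prints "16.7 hours"
```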

Symptoms in the wild:
- http://localhost:3113/ unreachable (no listening socket on the viewer
  port) even on a freshly-started server
- `agentmemory doctor` reports "viewer-unreachable"
- log floods with `vector-index add: embed failed — skipping {429: ...}`
  from the still-running rebuild burning rate-limit budget
- no error message — the worker stays alive serving HTTP because
  sdk.registerFunction had already completed synchronously in step 1

Fix: detach rebuildIndex with `void` + .then/.catch instead of awaiting
it. The index fills in lazily over time and search degrades gracefully
(BM25 keeps working immediately, vector results fill in as the embed
queue drains), so the viewer comes up in seconds.
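
The detach pattern in isolation, as a minimal sketch. fakeRebuildIndex
and bootDetached are hypothetical stand-ins, not the real functions in
src/index.ts, and the promise chain is returned here only so a caller
can observe completion (the patch discards it with `void`):

```typescript
// Hypothetical stand-in for rebuildIndex: resolves with an entry count
// after delayMs, or rejects when shouldFail is set.
function fakeRebuildIndex(
  delayMs: number,
  shouldFail = false,
): Promise<number> {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      if (shouldFail) reject(new Error("embed failed"));
      else resolve(42);
    }, delayMs);
  });
}

// Starts the rebuild without awaiting it and records event order.
function bootDetached(delayMs: number, shouldFail = false) {
  const events: string[] = [];
  const rebuild: Promise<void> = fakeRebuildIndex(delayMs, shouldFail)
    .then((count) => {
      events.push(`rebuilt:${count}`);
    })
    .catch((err: unknown) => {
      // The trailing .catch handles rebuild failures as well as anything
      // thrown inside the .then handler, so no rejection goes unhandled.
      const msg = err instanceof Error ? err.message : String(err);
      events.push(`failed:${msg}`);
    });
  events.push("viewer-started"); // runs before the rebuild settles
  return { events, rebuild };
}
```

Calling bootDetached(10) returns with events equal to
["viewer-started"]; the "rebuilt:42" entry is appended only once the
timer fires, which mirrors the viewer coming up ahead of the index.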

Repro on the operator side:
1. import a sizeable jsonl corpus (`mem::replay::import-jsonl`)
2. clear the persisted vector index so rebuildIndex runs on next boot
3. restart agentmemory with EMBEDDING_PROVIDER pointed at a rate-limited
   endpoint (any OpenAI-compat with low RPM)
4. observe: REST API responds on :3111, but :3113 stays unbound and the
   doctor's "viewer-unreachable" check keeps firing until the rebuild
   finishes (hours-to-days for a 300+ session corpus)

The 5-second workaround (hard kill + restart) is not a fix: the next
boot just re-enters the same hang.

No tests added: main() isn't unit-tested today, and wiring up a fake
slow rebuildIndex + asserting that the post-rebuild boot lines run early
would need the full worker mock harness. The change is a single
statement (an awaited call becomes a detached promise chain) and the
failure mode is dramatic; visual review + integration smoke covers
the regression risk.
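
For reference, the shape such a test could take, as a sketch only. It
assumes the boot steps were factored into a function that receives its
dependencies, which is NOT how main() is structured today; BootDeps,
bootSteps and runSmokeCheck are hypothetical names:

```typescript
// Hypothetical dependency bundle for a testable boot sequence.
interface BootDeps {
  rebuildIndex: () => Promise<number>;
  bootLog: (line: string) => void;
  startViewerServer: () => void;
}

// Minimal model of the post-patch boot order: detach, then carry on.
function bootSteps(deps: BootDeps): void {
  void deps
    .rebuildIndex()
    .then((n) => {
      if (n > 0) deps.bootLog(`Search index rebuilt: ${n} entries`);
    })
    .catch((err: unknown) => {
      console.warn("rebuild failed:", err);
    });
  deps.bootLog("Ready");
  deps.startViewerServer();
}

// A rebuild that never settles models the worst case: the later boot
// steps must still run.
function runSmokeCheck(): string[] {
  const seen: string[] = [];
  bootSteps({
    rebuildIndex: () => new Promise<number>(() => {}),
    bootLog: (line) => {
      seen.push(line);
    },
    startViewerServer: () => {
      seen.push("viewer");
    },
  });
  return seen;
}
```

With the pre-patch awaited call, the same check would return an empty
list, because nothing after the rebuild would ever run.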
---
 src/index.ts | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/src/index.ts b/src/index.ts
index 904e08fe..69ab0c9a 100644
--- a/src/index.ts
+++ b/src/index.ts
@@ -412,16 +412,24 @@ async function main() {
   const needsRebuild = bm25Index.size === 0;
 
   if (needsRebuild) {
-    const indexCount = await rebuildIndex(kv).catch((err) => {
-      console.warn(`[agentmemory] Failed to rebuild search index:`, err);
-      return 0;
-    });
-    if (indexCount > 0) {
-      bootLog(
-        `Search index rebuilt: ${indexCount} entries`,
-      );
-      indexPersistence.scheduleSave();
-    }
+    // Fire-and-forget. rebuildIndex iterates every observation across
+    // every session and AWAITS an embedding-provider call per record.
+    // On a large corpus + rate-limited embedding endpoint that can
+    // take HOURS; awaiting it here blocks every subsequent boot step
+    // (including startViewerServer below, leaving the viewer port
+    // unbound for the duration). The index lazily fills in over time
+    // and search degrades gracefully — partial coverage > no viewer
+    // for hours. Errors still surface via the trailing .catch.
+    void rebuildIndex(kv)
+      .then((indexCount) => {
+        if (indexCount > 0) {
+          bootLog(`Search index rebuilt: ${indexCount} entries`);
+          indexPersistence.scheduleSave();
+        }
+      })
+      .catch((err) => {
+        console.warn(`[agentmemory] Failed to rebuild search index:`, err);
+      });
   } else {
     // Backfill memories into BM25 for users upgrading from <0.9.5: prior
     // versions of mem::remember never indexed memories, so the persisted