From ecd80243f74315cdbd0609906131163bf38c8d7a Mon Sep 17 00:00:00 2001
From: Ruben de Smet
Date: Mon, 18 May 2026 12:28:34 +0200
Subject: [PATCH] fix(boot): make rebuildIndex non-blocking so viewer + later boot steps run
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

mem::observe's boot flow had this sequence in main():

  1. registerSearchFunction / registerContextFunction / ...
     (sync — completes immediately)
  2. restore persisted vector index from disk
  3. await rebuildIndex(kv)   ← blocks here
  4. bootLog "Ready" / "REST API" / "MCP surface"
  5. startViewerServer(...)
  6. setInterval auto-forget / lesson decay / consolidation

rebuildIndex iterates every observation across every session and AWAITS
an embedding-provider call per record. On a large corpus + a
rate-limited embedding endpoint (e.g. 100 RPM), step 3 takes hours to
days. Everything that runs AFTER it — including startViewerServer — is
silently delayed for the same duration.

Symptoms in the wild:

  - http://localhost:3113/ unreachable (no listening socket on the
    viewer port) even on a freshly-started server
  - `agentmemory doctor` reports "viewer-unreachable"
  - log floods with `vector-index add: embed failed — skipping {429: ...}`
    from the still-running rebuild burning rate-limit budget
  - no error message — the worker stays alive serving HTTP because
    sdk.registerFunction had already completed synchronously in step 1

Fix: detach rebuildIndex with `void` + .then/.catch instead of awaiting.
The index lazily fills in over time, search degrades gracefully (BM25
keeps working immediately, vector results fill in as the embed queue
drains), and the viewer comes up in seconds.

Repro on the operator side:

  1. import a sizeable jsonl corpus (`mem::replay::import-jsonl`)
  2. clear the persisted vector index so rebuildIndex runs on next boot
  3. restart agentmemory with EMBEDDING_PROVIDER pointed at a
     rate-limited endpoint (any OpenAI-compat with low RPM)
  4. observe: REST API responds on :3111, but :3113 is never bound, and
     the doctor's "viewer-unreachable" check fires until the rebuild
     finishes (hours-to-days for a 300+ session corpus)

The 5-second non-fix workaround was a hard kill + restart; that just
re-entered the same hang.

No tests added — main() isn't unit-tested today and wiring up a fake
slow rebuildIndex + asserting the post-rebuild boot lines run early
would need the full worker mock harness. The change is one line and the
failure mode is dramatic; visual review + integration smoke covers the
regression risk.
---
 src/index.ts | 28 ++++++++++++++++++----------
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/src/index.ts b/src/index.ts
index 904e08fe..69ab0c9a 100644
--- a/src/index.ts
+++ b/src/index.ts
@@ -412,16 +412,24 @@ async function main() {
 
   const needsRebuild = bm25Index.size === 0;
   if (needsRebuild) {
-    const indexCount = await rebuildIndex(kv).catch((err) => {
-      console.warn(`[agentmemory] Failed to rebuild search index:`, err);
-      return 0;
-    });
-    if (indexCount > 0) {
-      bootLog(
-        `Search index rebuilt: ${indexCount} entries`,
-      );
-      indexPersistence.scheduleSave();
-    }
+    // Fire-and-forget. rebuildIndex iterates every observation across
+    // every session and AWAITS an embedding-provider call per record.
+    // On a large corpus + rate-limited embedding endpoint that can
+    // take HOURS; awaiting it here blocks every subsequent boot step
+    // (including startViewerServer below, leaving the viewer port
+    // unbound for the duration). The index lazily fills in over time
+    // and search degrades gracefully — partial coverage > no viewer
+    // for hours. Errors still surface via the inner .catch.
+    void rebuildIndex(kv)
+      .then((indexCount) => {
+        if (indexCount > 0) {
+          bootLog(`Search index rebuilt: ${indexCount} entries`);
+          indexPersistence.scheduleSave();
+        }
+      })
+      .catch((err) => {
+        console.warn(`[agentmemory] Failed to rebuild search index:`, err);
+      });
   } else {
     // Backfill memories into BM25 for users upgrading from <0.9.5: prior
     // versions of mem::remember never indexed memories, so the persisted
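
Reviewer note: the commit message says main() has no unit tests and that asserting the boot order would need the full worker harness. As a rough illustration only, and not code from this repository, the ordering change can be sketched in isolation. `fakeRebuildIndex` and `boot` are hypothetical stand-ins for rebuildIndex and main(), and the event strings are invented for the example. Unlike the patch, which discards the promise with `void`, this sketch returns a handle to it purely so a caller can observe when the rebuild settles.

```typescript
// Hypothetical stand-in for rebuildIndex: resolves with an entry count
// after delayMs, simulating a slow, rate-limited embedding pass.
function fakeRebuildIndex(delayMs: number, entries: number): Promise<number> {
  return new Promise((resolve) => setTimeout(() => resolve(entries), delayMs));
}

interface BootResult {
  events: string[];            // boot steps in the order they happened
  rebuildDone: Promise<void>;  // settles once the rebuild handlers have run
}

// Hypothetical miniature of main(): with detach=false the rebuild is awaited
// (old behaviour); with detach=true it runs in the background (patched
// behaviour) and the "viewer" step is reached immediately.
async function boot(
  detach: boolean,
  rebuild: () => Promise<number>,
): Promise<BootResult> {
  const events: string[] = [];

  const rebuildDone = rebuild()
    .then((count) => {
      if (count > 0) events.push(`rebuilt:${count}`);
    })
    .catch((err: unknown) => {
      // Errors are still reported, they just no longer block boot.
      events.push(`rebuild-failed:${String(err)}`);
    });

  if (!detach) {
    await rebuildDone; // old flow: everything below waits for the rebuild
  }

  events.push("viewer-started");
  return { events, rebuildDone };
}
```

With `detach` set to true, `events` holds only "viewer-started" when `boot` resolves, and the rebuild entry is appended later. With `detach` set to false, the rebuild entry comes first. A rejected rebuild is recorded by the `.catch` handler in both modes, so it never surfaces as an unhandled rejection.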