fix(boot): make rebuildIndex non-blocking so viewer + later boot steps run by efenex · Pull Request #500 · rohitg00/agentmemory

efenex · 2026-05-18T10:29:06Z

Summary

The worker boot flow in src/index.ts awaits rebuildIndex(kv) before reaching startViewerServer (and several other boot steps). rebuildIndex iterates every observation across every session and awaits an embedding-provider call per record. On a real corpus with a rate-limited embedding endpoint, the rebuild takes hours to days, and everything that runs after it is silently delayed for the same duration.

Symptom in the wild

Operator imports a sizable jsonl corpus (in our case 320 sessions / ~500k observations), restarts agentmemory with EMBEDDING_PROVIDER pointed at any rate-limited OpenAI-compat endpoint (Novita / DeepInfra / etc., typically 100 RPM on the cheap plans), and:

✅ REST API on :3111 responds normally
❌ Viewer on :3113 is never reachable (no listening socket)
❌ agentmemory doctor reports viewer-unreachable
🌊 Log floods with vector-index add: embed failed — skipping {429: ...} from the still-running rebuild burning the embedding rate limit
No error message — the worker stays alive serving HTTP because sdk.registerFunction calls had already completed synchronously before the rebuild hung

The "obvious" workaround of agentmemory stop && agentmemory just re-enters the same hang.

Root cause

// src/index.ts
const needsRebuild = bm25Index.size === 0;

if (needsRebuild) {
  const indexCount = await rebuildIndex(kv).catch(...);   // ← hours-to-days
  ...
}

// ...
const viewerServer = startViewerServer(viewerPort, ...);  // ← never reached

rebuildIndex(kv) (in src/functions/search.ts) awaits vectorIndexAddGuarded(...) once per record, and each call goes to the embedding provider. For a ~500k-observation corpus at 100 RPM, that is 500,000 / 100 = 5,000 minutes, or about 3.5 days. The viewer / auto-forget / lesson-decay / consolidation timers all sit behind it.
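The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: `estimateRebuildMinutes` is a hypothetical helper, not something in agentmemory, and it assumes one embedding request per record with the rate limit as the only bottleneck.

```typescript
// Hypothetical helper for illustration; not part of agentmemory.
// Assumes one embedding request per record, throttled only by the rate limit.
function estimateRebuildMinutes(records: number, requestsPerMinute: number): number {
  if (requestsPerMinute <= 0) {
    throw new RangeError("requestsPerMinute must be positive");
  }
  return records / requestsPerMinute;
}

const minutes = estimateRebuildMinutes(500_000, 100);
const days = minutes / (60 * 24);
console.log(`${minutes} minutes, about ${days.toFixed(1)} days`);
```

With the numbers from this PR the helper gives 5,000 minutes, which rounds to 3.5 days.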

Fix

Detach with void + .then/.catch. The index lazily fills in over hours; search degrades gracefully (BM25 keeps working immediately, vector results fill in as the embed queue drains); the viewer + everything else in main() come up in seconds.

void rebuildIndex(kv)
  .then((indexCount) => {
    if (indexCount > 0) {
      bootLog(`Search index rebuilt: ${indexCount} entries`);
      indexPersistence.scheduleSave();
    }
  })
  .catch((err) => {
    console.warn(`[agentmemory] Failed to rebuild search index:`, err);
  });
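To make the ordering difference concrete, here is a minimal, self-contained sketch that contrasts the awaited and detached shapes. Everything in it is a stand-in: `fakeRebuildIndex`, `bootAwaited`, and `bootDetached` are hypothetical names, and the 50 ms sleep only simulates the long-running embedding work.

```typescript
// Hypothetical stand-ins; none of these are agentmemory's real functions.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function fakeRebuildIndex(events: string[]): Promise<number> {
  await sleep(50); // stands in for hours of rate-limited embedding calls
  events.push("rebuild done");
  return 3;
}

// Old shape: boot waits for the rebuild before the viewer starts.
async function bootAwaited(): Promise<string[]> {
  const events: string[] = [];
  await fakeRebuildIndex(events);
  events.push("viewer started");
  return events;
}

// New shape: the rebuild is detached, so the viewer starts immediately.
async function bootDetached(): Promise<string[]> {
  const events: string[] = [];
  const rebuild = fakeRebuildIndex(events)
    .then((count) => {
      events.push(`index saved (${count})`);
    })
    .catch((err) => {
      events.push(`rebuild failed: ${String(err)}`);
    });
  events.push("viewer started");
  // Awaited here only so the demo can report the full order;
  // a real boot path would not wait on this promise.
  await rebuild;
  return events;
}

bootAwaited().then((order) => console.log("awaited: ", order.join(" -> ")));
bootDetached().then((order) => console.log("detached:", order.join(" -> ")));
```

In the awaited variant "viewer started" comes after "rebuild done"; in the detached variant it comes first. The `void` prefix in the actual change does not alter runtime behavior: it evaluates the expression and discards the result, which signals to readers and linters that the promise is intentionally not awaited.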

Verification

Tested live on the affected corpus before and after:

Before: every restart left lsof -ti :3113 empty, doctor reported viewer-unreachable, and the log showed 429s piling up indefinitely.

After: viewer binds within ~5 seconds of starting, returns the full 188 KB HTML payload on GET /. Rebuild continues in the background; vector search results improve over the following minutes.

Test plan

No unit tests added. main() isn't unit-tested today and wiring up a fake slow rebuildIndex + asserting the post-rebuild boot lines run early would need the full worker mock harness — disproportionate to a one-line behavior change. The failure mode is dramatic enough that visual review + integration smoke covers regression risk.

Files

src/index.ts — 18 insertions, 10 deletions (mostly the comment explaining the rationale)

Summary by CodeRabbit

Chores
- Improved application boot performance by optimizing search-index initialization to run in the background without blocking startup steps.

…s run

mem::observe's boot flow had this sequence in main():

1. registerSearchFunction / registerContextFunction / ... (sync, completes immediately)
2. restore persisted vector index from disk
3. await rebuildIndex(kv) ← blocks here
4. bootLog "Ready" / "REST API" / "MCP surface"
5. startViewerServer(...)
6. setInterval auto-forget / lesson decay / consolidation

rebuildIndex iterates every observation across every session and AWAITS an embedding-provider call per record. On a large corpus + a rate-limited embedding endpoint (e.g. 100 RPM), step 3 takes hours to days. Everything that runs AFTER it, including startViewerServer, is silently delayed for the same duration.

Symptoms in the wild:

- http://localhost:3113/ unreachable (no listening socket on the viewer port) even on a freshly-started server
- `agentmemory doctor` reports "viewer-unreachable"
- log floods with `vector-index add: embed failed — skipping {429: ...}` from the still-running rebuild burning rate-limit budget
- no error message: the worker stays alive serving HTTP because sdk.registerFunction had already completed synchronously in step 1

Fix: detach rebuildIndex with `void` + .then/.catch instead of awaiting. The index lazily fills in over time, search degrades gracefully (BM25 keeps working immediately, vector results fill in as the embed queue drains), and the viewer comes up in seconds.

Repro on the operator side:

1. import a sizeable jsonl corpus (`mem::replay::import-jsonl`)
2. clear the persisted vector index so rebuildIndex runs on next boot
3. restart agentmemory with EMBEDDING_PROVIDER pointed at a rate-limited endpoint (any OpenAI-compat with low RPM)
4. observe: REST API responds on :3111, but :3113 is never bound, and the doctor's "viewer-unreachable" check fires until the rebuild finishes (hours-to-days for a 300+ session corpus)

The 5-second non-fix workaround was a hard kill + restart; that just re-entered the same hang.

No tests added: main() isn't unit-tested today and wiring up a fake slow rebuildIndex + asserting the post-rebuild boot lines run early would need the full worker mock harness. The change is one line and the failure mode is dramatic; visual review + integration smoke covers the regression risk.

vercel · 2026-05-18T10:29:10Z

@efenex is attempting to deploy a commit to the rohitg00's projects Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai · 2026-05-18T10:29:14Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6c010713-b1c4-4ce1-b280-08f8ac21bfc9

📥 Commits

Reviewing files that changed from the base of the PR and between caa9f52 and ecd8024.

📒 Files selected for processing (1)

src/index.ts

📝 Walkthrough

The PR converts the search-index rebuild path in main() from an awaited, blocking call to non-blocking background execution. When needsRebuild is true, the rebuild promise is started without blocking boot; on success it logs and schedules persistence only when the count exceeds zero, and errors are handled in a separate catch handler.

Changes

Search Index Non-Blocking Rebuild

Layer / File(s)	Summary
Convert search index rebuild to non-blocking background operation `src/index.ts`	Rebuild now executes in the background via promise chains instead of awaiting, logs success and schedules persistence only when `indexCount > 0`, and catches errors without aborting boot.

Sequence Diagram(s)

The change modifies internal control flow within a single function without introducing multi-component interactions or new features, so no sequence diagram is generated.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 The index rebuilds now, a gentle ghost in the night,
No longer blocking the boot with its might!
Fire-and-forget, a non-blocking delight,
Persistence awaits when the count shines bright.
Errors caught gently, all systems stay light! 🌙

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly captures the main change: making rebuildIndex non-blocking to allow viewer and subsequent boot steps to run, which matches the primary objective of the PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.

…rpora) (#504)

* fix(rebuild): batch embed calls in rebuildIndex (25h → 3h on large corpora)

rebuildIndex called `await vectorIndexAddGuarded(...)` per memory and per observation. Each call is one HTTP round-trip to the embedding provider for a single input. On a 500k-observation imported corpus against an embedding endpoint with even modest latency, that's serial 100-200ms per call = 14-28 hours of wallclock. The new non-blocking rebuild path (#500) made this no longer block boot, but the rebuild itself still takes the same wallclock.

Add `vectorIndexAddBatchGuarded()` next to the existing per-item helper, accepting an array of items and calling `provider.embedBatch()` once. For batchable endpoints (vLLM, Triton, OpenAI's `/v1/embeddings` all accept an `input` array), latency for N items is roughly the latency of a single embed because network + GPU setup amortize.

Refactor `rebuildIndex` to accumulate items into a buffer and flush every REBUILD_EMBED_BATCH_SIZE (default 32). BM25 add stays per-item-synchronous; only the vector path is batched.

Validated against a vLLM Qwen3-Embedding-8B endpoint:

- single embed: 175ms
- batch-of-32: 737ms (= 23ms/item amortized, ~7.6× speedup)
- projected backfill time for 500k obs: 25h → 3h

Per-item failure shape is preserved:

- whole-batch network/provider error → all skipped, single warn line (vs N warns previously when the same error hit every item)
- per-item dimension mismatch → that item skipped, others continue
- rebuildIndex return value unchanged (count of attempted items)

Override knob:

- REBUILD_EMBED_BATCH_SIZE (default 32): set lower for endpoints with small per-request input limits, higher for endpoints that prefer larger batches. Set to 1 to fall back to the per-item path.

39/39 existing tests in search-index/vector-index/remember-bm25-index pass unchanged.

Related: #500 (non-blocking rebuildIndex), #503 (separate embedding base URL).

* fix(rebuild): per-item vi.add try/catch to preserve soft-fail

Restores the pre-batch soft-fail behavior: a single failing vi.add() no longer aborts the entire rebuild batch. Failures are logged and counted toward fail, just like dimension mismatches above.
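The batching and soft-fail behavior described in these commits can be sketched roughly as follows. This is not the code from #504: `Item`, `EmbedBatch`, `embedInBatches`, and the `onVector` callback are hypothetical names chosen for illustration, and the embedding call is injected so the sketch runs without a real provider.

```typescript
// Hypothetical types and names for illustration; not agentmemory's actual API.
type Item = { id: string; text: string };
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

async function embedInBatches(
  items: Item[],
  embedBatch: EmbedBatch,
  batchSize: number,
  onVector: (id: string, vector: number[]) => void,
): Promise<{ added: number; failed: number }> {
  let added = 0;
  let failed = 0;
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    let vectors: number[][];
    try {
      // One provider round-trip for the whole batch.
      vectors = await embedBatch(batch.map((item) => item.text));
    } catch (err) {
      // Whole-batch failure: skip every item in it, emit a single warning.
      console.warn(`embed batch failed, skipping ${batch.length} items:`, err);
      failed += batch.length;
      continue;
    }
    batch.forEach((item, i) => {
      try {
        const vector = vectors[i];
        if (!vector) throw new Error("missing vector");
        onVector(item.id, vector); // may throw, e.g. on a dimension mismatch
        added++;
      } catch {
        // Per-item failure: count it and keep going with the rest of the batch.
        failed++;
      }
    });
  }
  return { added, failed };
}
```

With the measurements quoted above, 737 ms / 32 is about 23 ms per item, and 175 ms / 23 ms is about 7.6×. For 500k items that works out to roughly 24 hours serially versus roughly 3.2 hours batched, consistent with the 25h → 3h projection.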

@cl0ckt0wer

Quality + integration wave. Bundles 11 PRs since v0.9.20:

Contributor feature:

- #237 OpenCode plugin with 22 auto-capture hooks (@cl0ckt0wer)

Bug fixes (9):

- #516 memory_recall endpoint + format/token_budget (@serhiizghama, closes #507/#440)
- #461 env-file AGENTMEMORY_DROP_STALE_INDEX flag honored (@honor2030, closes #456)
- #487 Windows hook path quoting (@honor2030, closes #477)
- #517 viewer IME composition guard (@jonathanzhan1975)
- #472 chunk large sessions for LLM context window (@efenex)
- #473 surface lessons in smart-search + diagnose tally (@efenex)
- #486 declare all Hermes plugin hooks (@honor2030)
- #500 rebuildIndex non-blocking on boot (@efenex)
- #504 batched embed in rebuildIndex (25h -> 3h) (@efenex)
- #491 cli skip onboarding without tty (@honor2030)

Upstream-installer revert:

- #546 drop --next workaround now that iii-hq/iii#1660 shipped

1067/1067 tests pass across 95 files.

efenex mentioned this pull request May 18, 2026

fix(rebuild): batch embed calls in rebuildIndex (25h → 3h on large corpora) #504

Merged

rohitg00 mentioned this pull request May 19, 2026

chore(release): v0.9.21 #550

Closed

rohitg00 merged commit c6a1fec into rohitg00:main May 19, 2026
1 of 2 checks passed

rohitg00 mentioned this pull request May 19, 2026

chore(release): v0.9.21 #551

Merged

rohitg00 merged 1 commit into rohitg00:main from efenex:fix/non-blocking-rebuild-index
