
fix(boot): make rebuildIndex non-blocking so viewer + later boot steps run#500

Merged: rohitg00 merged 1 commit into rohitg00:main from efenex:fix/non-blocking-rebuild-index on May 19, 2026

@efenex efenex commented May 18, 2026

Summary

The worker boot flow in src/index.ts awaits rebuildIndex(kv) before reaching startViewerServer (and several other boot steps). rebuildIndex iterates every observation across every session and awaits an embedding-provider call per record. On a large corpus with a rate-limited embedding endpoint, that takes hours to days, and everything that runs after it is silently delayed for the same duration.

Symptom in the wild

Operator imports a sizable jsonl corpus (in our case 320 sessions / ~500k observations), restarts agentmemory with EMBEDDING_PROVIDER pointed at any rate-limited OpenAI-compat endpoint (Novita / DeepInfra / etc., typically 100 RPM on the cheap plans), and:

  • ✅ REST API on :3111 responds normally
  • ❌ Viewer on :3113 is never reachable (no listening socket)
  • agentmemory doctor reports viewer-unreachable
  • 🌊 Log floods with vector-index add: embed failed — skipping {429: ...} from the still-running rebuild burning the embedding rate limit
  • No error message — the worker stays alive serving HTTP because sdk.registerFunction calls had already completed synchronously before the rebuild hung

The "obvious" workaround of agentmemory stop && agentmemory just re-enters the same hang.

Root cause

// src/index.ts
const needsRebuild = bm25Index.size === 0;

if (needsRebuild) {
  const indexCount = await rebuildIndex(kv).catch(...);   // ← hours-to-days
  ...
}

// ...
const viewerServer = startViewerServer(viewerPort, ...);  // ← never reached

rebuildIndex(kv) (in src/functions/search.ts) awaits vectorIndexAddGuarded(...) once per record, and each of those calls hits the embedding provider. For a ~500k-observation corpus at 100 RPM, that is 500,000 / 100 = 5,000 minutes, or about 3.5 days. The viewer and the auto-forget, lesson-decay, and consolidation timers all sit behind it.
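
The estimate above is record count divided by request rate. A minimal sketch of that arithmetic, assuming exactly one embedding request per observation and no retries or 429 backoff (the helper name is hypothetical and not part of agentmemory):

```typescript
// Hypothetical helper, not part of agentmemory.
// Assumes one embedding request per record and a hard requests-per-minute cap.
function estimateRebuildDays(records: number, requestsPerMinute: number): number {
  if (requestsPerMinute <= 0) {
    throw new RangeError("requestsPerMinute must be positive");
  }
  const minutes = records / requestsPerMinute; // 500,000 / 100 = 5,000 minutes
  return minutes / (60 * 24);
}

const days = estimateRebuildDays(500_000, 100);
console.log(days.toFixed(1)); // about 3.5 days
```

Real rebuilds would likely take longer than this lower bound, since rejected (429) requests still consume wallclock time.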

Fix

Detach with void + .then/.catch. The index lazily fills in over hours; search degrades gracefully (BM25 keeps working immediately, vector results fill in as the embed queue drains); the viewer + everything else in main() come up in seconds.

void rebuildIndex(kv)
  .then((indexCount) => {
    if (indexCount > 0) {
      bootLog(`Search index rebuilt: ${indexCount} entries`);
      indexPersistence.scheduleSave();
    }
  })
  .catch((err) => {
    console.warn(`[agentmemory] Failed to rebuild search index:`, err);
  });
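
To make the ordering effect concrete, here is a small self-contained sketch of the same detach pattern. `bootDetached` and `deferred` are hypothetical stand-ins for illustration; they are not the real main() or rebuildIndex from src/index.ts, and the "viewer started" event merely represents the later boot steps.

```typescript
// Illustrative only: hypothetical stand-ins, not code from src/index.ts.
function deferred<T>(): { promise: Promise<T>; resolve: (value: T) => void } {
  let resolve!: (value: T) => void;
  const promise = new Promise<T>((r) => {
    resolve = r;
  });
  return { promise, resolve };
}

function bootDetached(rebuild: () => Promise<number>, events: string[]): Promise<void> {
  // Start the rebuild but do not await it; completion is handled in the chain.
  const settled = rebuild()
    .then((indexCount) => {
      if (indexCount > 0) events.push(`index rebuilt: ${indexCount}`);
    })
    .catch((err) => {
      events.push(`rebuild failed: ${String(err)}`);
    });
  // Later boot steps run immediately, before the rebuild settles.
  events.push("viewer started");
  return settled;
}

const events: string[] = [];
const slowRebuild = deferred<number>();
const rebuildSettled = bootDetached(() => slowRebuild.promise, events);
// At this point only "viewer started" has been recorded.
slowRebuild.resolve(42); // simulate the rebuild finishing much later
```

The sketch returns the settled promise only so a caller or test can observe completion; the change in this PR discards it with `void`, which is reasonable when nothing downstream needs to wait on the rebuild.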

Verification

Tested live on the affected corpus before and after:

Before: every restart left lsof -ti :3113 empty, doctor reported viewer-unreachable, and the log showed 429s piling up indefinitely.

After: viewer binds within ~5 seconds of starting, returns the full 188 KB HTML payload on GET /. Rebuild continues in the background; vector search results improve over the following minutes.

Test plan

No unit tests added. main() isn't unit-tested today and wiring up a fake slow rebuildIndex + asserting the post-rebuild boot lines run early would need the full worker mock harness — disproportionate to a one-line behavior change. The failure mode is dramatic enough that visual review + integration smoke covers regression risk.

Files

  • src/index.ts — 18 insertions, 10 deletions (mostly the comment explaining the rationale)

Related

Surfaced while operating an agentmemory install against a 320-session bulk-imported corpus. The 429 floods that finally pointed at this had me chasing port-conflict / stale-process explanations first (see #474). The real cause is here.

Summary by CodeRabbit

  • Chores
    • Improved application boot performance by optimizing search-index initialization to run in the background without blocking startup steps.

Review Change Stack

…s run

mem::observe's boot flow had this sequence in main():

  1. registerSearchFunction / registerContextFunction / ...
     (sync — completes immediately)
  2. restore persisted vector index from disk
  3. await rebuildIndex(kv)        ← blocks here
  4. bootLog "Ready" / "REST API" / "MCP surface"
  5. startViewerServer(...)
  6. setInterval auto-forget / lesson decay / consolidation

rebuildIndex iterates every observation across every session and AWAITS
an embedding-provider call per record. On a large corpus + a rate-limited
embedding endpoint (e.g. 100 RPM), step 3 takes hours to days.
Everything that runs AFTER it — including startViewerServer — is
silently delayed for the same duration.

Symptoms in the wild:
- http://localhost:3113/ unreachable (no listening socket on the viewer
  port) even on a freshly-started server
- `agentmemory doctor` reports "viewer-unreachable"
- log floods with `vector-index add: embed failed — skipping {429: ...}`
  from the still-running rebuild burning rate-limit budget
- no error message — the worker stays alive serving HTTP because
  sdk.registerFunction had already completed synchronously in step 1

Fix: detach rebuildIndex with `void` + .then/.catch instead of awaiting.
The index lazily fills in over time, search degrades gracefully (BM25
keeps working immediately, vector results fill in as the embed queue
drains), and the viewer comes up in seconds.

Repro on the operator side:
1. import a sizeable jsonl corpus (`mem::replay::import-jsonl`)
2. clear the persisted vector index so rebuildIndex runs on next boot
3. restart agentmemory with EMBEDDING_PROVIDER pointed at a rate-limited
   endpoint (any OpenAI-compat with low RPM)
4. observe: REST API responds on :3111, but :3113 is never bound, and
   the doctor's "viewer-unreachable" check fires until the rebuild
   finishes (hours-to-days for a 300+ session corpus)

The 5-second non-fix workaround was a hard kill + restart; that just
re-entered the same hang.

No tests added — main() isn't unit-tested today and wiring up a fake
slow rebuildIndex + asserting the post-rebuild boot lines run early
would need the full worker mock harness. The change is one line and
the failure mode is dramatic; visual review + integration smoke covers
the regression risk.

vercel Bot commented May 18, 2026

@efenex is attempting to deploy a commit to the rohitg00's projects Team on Vercel.

A member of the Team first needs to authorize it.


coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6c010713-b1c4-4ce1-b280-08f8ac21bfc9

📥 Commits

Reviewing files that changed from the base of the PR and between caa9f52 and ecd8024.

📒 Files selected for processing (1)
  • src/index.ts

📝 Walkthrough

The PR converts the search-index rebuild path in main() from an awaited, boot-blocking call to non-blocking background execution. When needsRebuild is true, the rebuild promise is started without blocking boot. On success it logs and schedules persistence only when the count exceeds zero, and errors are handled in a separate catch handler.

Changes

Search Index Non-Blocking Rebuild

Layer: Convert search index rebuild to non-blocking background operation
File(s): src/index.ts
Summary: Rebuild now executes in the background via promise chains instead of awaiting, logs success and schedules persistence only when indexCount > 0, and catches errors without aborting boot.

Sequence Diagram(s)

The change modifies internal control flow within a single function without introducing multi-component interactions or new features, so no sequence diagram is generated.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 The index rebuilds now, a gentle ghost in the night,
No longer blocking the boot with its might!
Fire-and-forget, a non-blocking delight,
Persistence awaits when the count shines bright.
Errors caught gently, all systems stay light! 🌙

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed. Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title check: ✅ Passed. The title directly captures the main change: making rebuildIndex non-blocking to allow viewer and subsequent boot steps to run, which matches the primary objective of the PR.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.



efenex added a commit to efenex/agentmemory that referenced this pull request May 18, 2026
…rpora)

rebuildIndex called `await vectorIndexAddGuarded(...)` per memory and
per observation. Each call is one HTTP round-trip to the embedding
provider for a single input. On a 500k-observation imported corpus
against an embedding endpoint with even modest latency, that's
serial 100-200ms per call = 14-28 hours of wallclock. The new
non-blocking rebuild path (rohitg00#500) made this no longer block boot, but
the rebuild itself still takes the same wallclock.

Add `vectorIndexAddBatchGuarded()` next to the existing per-item
helper, accepting an array of items and calling `provider.embedBatch()`
once. For batchable endpoints (vLLM, Triton, OpenAI's `/v1/embeddings`
all accept an `input` array), latency for N items is far below N times
the latency of a single embed because network + GPU setup amortize.

Refactor `rebuildIndex` to accumulate items into a buffer and flush
every REBUILD_EMBED_BATCH_SIZE (default 32). BM25 add stays
per-item-synchronous; only the vector path is batched.
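
The accumulate-and-flush loop described above can be sketched as follows. This is a hypothetical illustration, not the code from the commit: `embedInBatches`, `EmbedBatch`, and the fake provider are invented names, and the BM25 path is omitted. Only the default of 32 comes from the text.

```typescript
// Hypothetical sketch of accumulate-and-flush; not the code from the commit.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

async function embedInBatches(
  texts: string[],
  embedBatch: EmbedBatch,
  batchSize: number = 32, // mirrors the REBUILD_EMBED_BATCH_SIZE default
): Promise<number[][]> {
  const vectors: number[][] = [];
  let buffer: string[] = [];

  const flush = async (): Promise<void> => {
    if (buffer.length === 0) return;
    const pending = buffer;
    buffer = [];
    vectors.push(...(await embedBatch(pending))); // one provider call per batch
  };

  for (const text of texts) {
    buffer.push(text);
    if (buffer.length >= batchSize) await flush();
  }
  await flush(); // flush the final partial batch
  return vectors;
}

// Fake provider that records how many items each call received.
const callSizes: number[] = [];
const fakeEmbedBatch: EmbedBatch = async (batch) => {
  callSizes.push(batch.length);
  return batch.map((t) => [t.length]);
};
```

With 70 inputs and the default batch size, the fake provider would be called three times, with 32, 32, and 6 items.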

Validated against a vLLM Qwen3-Embedding-8B endpoint:
  - single embed: 175ms
  - batch-of-32:  737ms (= 23ms/item amortized, ~7.6× speedup)
  - projected backfill time for 500k obs: 25h → 3h
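
The figures above can be rechecked with a few lines of arithmetic. The constants are the measurements quoted in this message; the variable names are hypothetical. The unrounded results come out at roughly 24.3 h and 3.2 h, consistent with the rounded 25h → 3h.

```typescript
// Measurements quoted above; names are illustrative only.
const SINGLE_EMBED_MS = 175;
const BATCH_OF_32_MS = 737;
const BATCH_SIZE = 32;
const OBSERVATIONS = 500_000;

const msToHours = (ms: number): number => ms / 3_600_000;

const perItemMs = BATCH_OF_32_MS / BATCH_SIZE; // about 23 ms per item
const speedup = SINGLE_EMBED_MS / perItemMs; // about 7.6x

// Serial path: one request per observation.
const serialHours = msToHours(OBSERVATIONS * SINGLE_EMBED_MS);
// Batched path: one request per batch of 32.
const batchedHours = msToHours(Math.ceil(OBSERVATIONS / BATCH_SIZE) * BATCH_OF_32_MS);

console.log(perItemMs.toFixed(0), speedup.toFixed(1), serialHours.toFixed(1), batchedHours.toFixed(1));
```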

Per-item failure shape is preserved:
  - whole-batch network/provider error → all skipped, single warn line
    (vs N warns previously when the same error hit every item)
  - per-item dimension mismatch → that item skipped, others continue
  - rebuildIndex return value unchanged (count of attempted items)
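
One way to get that failure shape is sketched below. It is an assumption-laden illustration rather than the actual `vectorIndexAddBatchGuarded`: the function name, the `BatchItem` and `BatchResult` types, and the `addVector` callback are all hypothetical.

```typescript
// Hypothetical sketch; the real vectorIndexAddBatchGuarded may differ.
interface BatchItem { id: string; text: string }
interface BatchResult { added: string[]; skipped: string[]; warnings: string[] }

async function addBatchGuarded(
  items: BatchItem[],
  embedBatch: (texts: string[]) => Promise<number[][]>,
  expectedDim: number,
  addVector: (id: string, vector: number[]) => void,
): Promise<BatchResult> {
  const result: BatchResult = { added: [], skipped: [], warnings: [] };
  let vectors: number[][];
  try {
    vectors = await embedBatch(items.map((item) => item.text));
  } catch (err) {
    // Whole-batch failure: skip every item and emit a single warning.
    result.skipped = items.map((item) => item.id);
    result.warnings.push(`embed batch failed, skipping ${items.length} items: ${String(err)}`);
    return result;
  }
  items.forEach((item, i) => {
    const vector = vectors[i];
    if (!vector || vector.length !== expectedDim) {
      // Per-item dimension mismatch: skip this item only.
      result.skipped.push(item.id);
      return;
    }
    try {
      addVector(item.id, vector);
      result.added.push(item.id);
    } catch (err) {
      // A single failing add does not abort the rest of the batch.
      result.skipped.push(item.id);
      result.warnings.push(`add failed for ${item.id}: ${String(err)}`);
    }
  });
  return result;
}
```

Returning a result object instead of logging directly is a choice made here so the behavior is easy to assert on; the commit describes warn lines rather than a structured result.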

Override knob:
  - REBUILD_EMBED_BATCH_SIZE (default 32) — set lower for endpoints
    with small per-request input limits, higher for endpoints that
    prefer larger batches. Set to 1 to fall back to the per-item path.
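
A sketch of how such a knob could be read, assuming invalid or missing values fall back to the default. Only the variable name and the default of 32 come from the text above; `resolveBatchSize` and its fallback behavior are hypothetical, and the real parsing may differ.

```typescript
// Hypothetical parser; only the variable name and default come from the text above.
function resolveBatchSize(
  env: Record<string, string | undefined>,
  fallback: number = 32,
): number {
  const raw = env["REBUILD_EMBED_BATCH_SIZE"];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  // Assumption: reject non-integers and values below 1 instead of guessing.
  return Number.isInteger(parsed) && parsed >= 1 ? parsed : fallback;
}

const perItemPath = resolveBatchSize({ REBUILD_EMBED_BATCH_SIZE: "1" }); // 1, the per-item path
const defaultSize = resolveBatchSize({}); // 32
```

In a Node process the first argument would typically be `process.env`; passing the map in explicitly keeps the helper easy to test.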

39/39 existing tests in search-index/vector-index/remember-bm25-index
pass unchanged.

Related: rohitg00#500 (non-blocking rebuildIndex), rohitg00#503 (separate embedding
base URL).
@rohitg00 rohitg00 mentioned this pull request May 19, 2026
@rohitg00 rohitg00 merged commit c6a1fec into rohitg00:main May 19, 2026
1 of 2 checks passed
rohitg00 pushed a commit that referenced this pull request May 19, 2026
…rpora) (#504)

* fix(rebuild): batch embed calls in rebuildIndex (25h → 3h on large corpora)

* fix(rebuild): per-item vi.add try/catch to preserve soft-fail

Restores the pre-batch soft-fail behavior — a single failing
vi.add() no longer aborts the entire rebuild batch. Failures
are logged and counted toward fail, just like dimension
mismatches above.
@rohitg00 rohitg00 mentioned this pull request May 19, 2026
rohitg00 added a commit that referenced this pull request May 19, 2026
Quality + integration wave. Bundles 11 PRs since v0.9.20:

Contributor feature:
- #237 OpenCode plugin with 22 auto-capture hooks (@cl0ckt0wer)

Bug fixes (10):
- #516 memory_recall endpoint + format/token_budget (@serhiizghama, closes #507/#440)
- #461 env-file AGENTMEMORY_DROP_STALE_INDEX flag honored (@honor2030, closes #456)
- #487 Windows hook path quoting (@honor2030, closes #477)
- #517 viewer IME composition guard (@jonathanzhan1975)
- #472 chunk large sessions for LLM context window (@efenex)
- #473 surface lessons in smart-search + diagnose tally (@efenex)
- #486 declare all Hermes plugin hooks (@honor2030)
- #500 rebuildIndex non-blocking on boot (@efenex)
- #504 batched embed in rebuildIndex (25h -> 3h) (@efenex)
- #491 cli skip onboarding without tty (@honor2030)

Upstream-installer revert:
- #546 drop --next workaround now that iii-hq/iii#1660 shipped

1067/1067 tests pass across 95 files.