Skip to content

fix: prevent process crash on iii-engine state::set timeout (#204)#209

Merged
rohitg00 merged 1 commit intomainfrom
fix/204-iii-timeout-crash
Apr 27, 2026
Merged

fix: prevent process crash on iii-engine state::set timeout (#204)#209
rohitg00 merged 1 commit intomainfrom
fix/204-iii-timeout-crash

Conversation

@rohitg00
Copy link
Copy Markdown
Owner

@rohitg00 rohitg00 commented Apr 27, 2026

Summary

Under sustained write load (Claude Code plugin hooks across multiple agents at ~75+ obs/hour), `IndexPersistence`'s 5s debounce flush could fire into an iii-engine queue that hadn't drained, hitting the SDK's 30s default timeout on `state::set`. The rejection escaped the `setTimeout(() => this.save(), DEBOUNCE_MS)` callback in `src/state/index-persistence.ts` because the returned promise was discarded — landing as an `unhandledRejection` that terminated the long-lived memory service.

Reproducible at ~5–15 crashes/hour on ~1.7K observations/day workloads (see #204 for trace).

Two layers of defense

  1. `IndexPersistence` swallows kv.set rejections internally. Scheduled-save callback now `.catch()`s the returned promise; `save()` wraps kv.set in try/catch. Failures logged via `logger.warn` with once-per-minute throttle so queue-pressure bursts don't spam the log. Recent index updates stay in memory and retry on next debounce flush.

  2. Top-level `unhandledRejection` handler in `src/index.ts`. Logs and continues for any other path we missed — defense in depth, also throttled.

Tests

  • 2 new regression cases in `test/index-persistence.test.ts`: scheduled save with rejecting kv.set must not raise unhandledRejection; `save()` must resolve (not throw) when kv.set rejects
  • Full suite: 829/829 passing

Thanks @bunke for the painstaking writeup — the journalctl trace pointing at `iii-sdk/dist/index.mjs:405` made the root cause obvious.

Closes #204

Test plan

  • Drive ~1 observation/sec via `POST /agentmemory/observe` for 10+ minutes — service must not exit
  • Verify the throttled `[agentmemory] warn index persistence: failed to save` log appears under timeout pressure

Summary by CodeRabbit

  • Bug Fixes

    • Improved process stability by preventing crashes from unhandled asynchronous errors
    • Enhanced data persistence error handling with better recovery mechanisms and logging
    • Added throttled error messages to help diagnose timeout and persistence issues without log spam
  • Tests

    • Added test coverage for error handling during data persistence operations

Under sustained write load (e.g. Claude Code plugin hooks across
multiple agents), `IndexPersistence`'s 5s debounce flush could fire
into an iii-engine queue that hadn't drained, hitting the SDK's 30s
default timeout on `state::set`. The rejection escaped the
`setTimeout(() => this.save(), DEBOUNCE_MS)` callback in
`src/state/index-persistence.ts` because the returned promise was
discarded — landing as an unhandledRejection that terminated the
long-lived memory service. Reproducible at ~5–15 crashes/hour on
~1.7K observations/day workloads.

Two layers of defense:

1. **`IndexPersistence` swallows kv.set rejections internally.** The
   scheduled-save callback now `.catch()`s the returned promise and
   `save()` wraps the kv.set calls in try/catch. Failures are logged
   via `logger.warn` with throttling (once per minute) so a queue-
   pressure burst doesn't spam the log. Recent index updates stay
   in memory and retry on the next debounce flush.

2. **Top-level `unhandledRejection` handler in `src/index.ts`.**
   Logs and continues for any other path we missed — defense in
   depth so a single SDK timeout can't take down the memory mesh.
   Also throttled.

Adds two regression tests in `test/index-persistence.test.ts`:
- scheduled save under a kv.set that rejects must not raise
  unhandledRejection
- direct `save()` must resolve (not throw) when kv.set rejects

Thanks @bunke for the painstaking writeup — the journalctl trace
pointing at iii-sdk/dist/index.mjs:405 made the root cause obvious.

Closes #204
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 27, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Walkthrough

The changes add resilience to handle uncaught promise rejections from state persistence operations. A top-level rejection handler is introduced in the service entrypoint, alongside improved error handling in IndexPersistence.scheduleSave() with throttled logging. Test coverage validates that timeouts are caught and don't crash the process.

Changes

Cohort / File(s) Summary
Global error safety net
src/index.ts
Adds process.on("unhandledRejection") handler to catch otherwise-unhandled rejections. Implements throttled warning logging (60s throttle window) that extracts and logs code, function_id, and message from rejection reasons.
Persistence error handling
src/state/index-persistence.ts
Wraps IndexPersistence.scheduleSave() promise with .catch() to prevent rejections from propagating. Updates save() method with try/catch around KV set operations. Adds logFailure() helper for throttled failure logging with TIMEOUT hint text.
Persistence error tests
test/index-persistence.test.ts
New test cases validate that TIMEOUT errors during scheduled saves are caught without emitting unhandled rejections, and that persistence.save() gracefully resolves to undefined on thrown errors.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A crash came swift, but now we're fleet,
With handlers catching timeouts neat,
No more rejections left untamed—
The process lives, the logs are framed! 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: preventing process crashes caused by iii-engine state::set timeouts, which is the core fix addressed in this PR.
Linked Issues check ✅ Passed All objectives from issue #204 are met: process crash prevention via dual-layer defense (IndexPersistence rejection handling + top-level unhandledRejection handler), throttled logging, in-memory index preservation, and comprehensive regression tests.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing issue #204: unhandled rejection handling in IndexPersistence, top-level process handler, and regression tests. No unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/204-iii-timeout-crash

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@rohitg00 rohitg00 merged commit bd4b74e into main Apr 27, 2026
2 of 4 checks passed
@rohitg00 rohitg00 deleted the fix/204-iii-timeout-crash branch April 27, 2026 10:08
@bunke
Copy link
Copy Markdown

bunke commented Apr 27, 2026

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Service crashes with uncaught IIIInvocationError TIMEOUT on state::set (v0.9.3)

2 participants