
fix: SSE robustness — race, idle timeout, replay, circuit breaker (#308) #332

Merged
OneStepAt4time merged 7 commits into main from fix/308-sse-robustness
Mar 27, 2026


@OneStepAt4time
Owner

Summary

Fixes 6 SSE/event bus robustness issues under production load:

  • Emitter cleanup race: the setTimeout scheduled by emitEnded no longer deletes fresh emitters created by new subscribers during the 1s cleanup window
  • Async emit: SessionEventBus.emit() now uses setImmediate to decouple delivery from the monitor loop, so slow SSE writes no longer block polling
  • Event ring buffer: a per-session buffer (50 events) with incrementing IDs enables Last-Event-ID replay on reconnection
  • Idle SSE timeout: zombie connections are detected and destroyed after 90s of inactivity (checked on each heartbeat tick)
  • Duplicate route removed: deleted the legacy /sessions/:id/events route; all routes now use the /v1/ prefix
  • Circuit breaker: ResilientEventSource wraps the browser EventSource with exponential backoff (1s→30s cap), a 5-min give-up, and a failure counter reset on success

Files changed:

  • src/events.ts — ring buffer, setImmediate emit, cleanup guard
  • src/server.ts — Last-Event-ID replay, idle timeout, removed duplicate route
  • dashboard/src/api/client.ts — uses ResilientEventSource
  • dashboard/src/api/resilient-eventsource.ts — new module
  • src/__tests__/sse-events.test.ts — ring buffer tests, async flush fixes
  • src/__tests__/hooks.test.ts — async flush fixes
  • src/__tests__/latency.test.ts — async flush fixes
  • src/__tests__/monitor-fixes.test.ts — async flush fixes
  • dashboard/src/__tests__/resilient-eventsource.test.ts — new test file

Test plan

  • npm test — 882/882 pass (root)
  • npm test — 9/9 pass (dashboard)
  • npx tsc --noEmit — clean
  • npm run build — clean

Closes #308

Generated by Hephaestus (Aegis dev agent)

The 1-second setTimeout after emitEnded blindly deleted whatever emitter
was mapped to the session ID, which could nuke a fresh emitter created
by a new subscriber during that window. Now captures the emitter reference
and only deletes if it's still the same instance.
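
The identity check described in this commit can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from src/events.ts: the Emitter type, the map passed in, and the deleteIfSame and scheduleCleanup helpers are hypothetical names chosen for the example.

```typescript
// Hypothetical stand-in for the per-session emitter object.
type Emitter = { listeners: Set<(event: string) => void> };

const CLEANUP_DELAY_MS = 1000; // the 1-second window mentioned in the commit

// Delete the mapping only if it still points at the captured instance.
function deleteIfSame(
  emitters: Map<string, Emitter>,
  sessionId: string,
  captured: Emitter,
): boolean {
  if (emitters.get(sessionId) !== captured) return false; // a fresh emitter took over
  return emitters.delete(sessionId);
}

// Capture the current emitter now; compare identities when the timer fires.
function scheduleCleanup(emitters: Map<string, Emitter>, sessionId: string): void {
  const captured = emitters.get(sessionId);
  if (captured === undefined) return;
  setTimeout(() => {
    deleteIfSame(emitters, sessionId, captured);
  }, CLEANUP_DELAY_MS);
}
```

Comparing object identity rather than the session ID is what makes the delayed cleanup safe: a subscriber that arrives inside the window installs a different object, so the stale timer becomes a no-op.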

SessionEventBus.emit() now uses setImmediate to deliver events
asynchronously, preventing slow SSE writes from blocking the monitor
loop. Updated all test files to flush async before asserting.
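
A rough sketch of the deferral and of the test-side flush, assuming a Node.js runtime. AsyncBus and flushAsync are hypothetical names for illustration; the real SessionEventBus may be structured differently.

```typescript
type Listener<E> = (event: E) => void;

// Illustrative bus: emit() returns immediately and delivers on a later
// turn of the event loop, so a slow listener cannot stall the caller.
class AsyncBus<E> {
  private readonly listeners = new Set<Listener<E>>();

  subscribe(listener: Listener<E>): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  emit(event: E): void {
    setImmediate(() => {
      for (const listener of this.listeners) listener(event);
    });
  }
}

// Test helper: resolves after callbacks already queued with setImmediate
// have run, so assertions see the delivered events.
const flushAsync = (): Promise<void> =>
  new Promise<void>((resolve) => {
    setImmediate(() => resolve());
  });
```

In a test, the pattern would be to call emit(), then await flushAsync(), then assert. Asserting right after emit() observes nothing, which is why the existing tests needed updating.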

Per-session ring buffer (50 events) with incrementing IDs enables
SSE clients to replay missed events after reconnection via the
standard Last-Event-ID header.
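
One way such a buffer could look, as a sketch only. EventRingBuffer, BufferedEvent, push, and since are hypothetical names, and the fixed-slot layout is an assumption; only the capacity of 50 and the incrementing IDs come from the commit message.

```typescript
interface BufferedEvent<T> {
  id: number;
  data: T;
}

// Fixed-capacity ring: event N lives in slot (N - 1) % capacity, so the
// oldest event is overwritten once the buffer is full.
class EventRingBuffer<T> {
  private readonly capacity: number;
  private readonly slots: Array<BufferedEvent<T> | undefined>;
  private nextId = 1;

  constructor(capacity = 50) {
    this.capacity = capacity;
    this.slots = new Array<BufferedEvent<T> | undefined>(capacity);
  }

  push(data: T): BufferedEvent<T> {
    const event: BufferedEvent<T> = { id: this.nextId++, data };
    this.slots[(event.id - 1) % this.capacity] = event;
    return event;
  }

  // Events still buffered whose ID is greater than lastEventId, oldest first.
  since(lastEventId: number): BufferedEvent<T>[] {
    const newest = this.nextId - 1;
    const oldest = Math.max(1, this.nextId - this.capacity);
    const result: BufferedEvent<T>[] = [];
    for (let id = Math.max(oldest, lastEventId + 1); id <= newest; id++) {
      const event = this.slots[(id - 1) % this.capacity];
      if (event !== undefined) result.push(event);
    }
    return result;
  }
}
```

A client that reconnects with Last-Event-ID: 40 would receive since(40). If more events than the capacity were emitted while it was away, the oldest ones are gone and the replay is necessarily partial.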

…te (#308)

- Per-session SSE handler now reads Last-Event-ID header and replays
  missed events from the ring buffer
- Both SSE handlers track last write time and destroy zombie connections
  after 90s of inactivity (checked on heartbeat tick)
- Removed duplicate /sessions/:id/events route (use /v1/ prefix)
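
The header handling and the idle check might be sketched like this. All names here (parseLastEventId, formatSseFrame, SseConnection, onHeartbeatTick) are hypothetical, and the rule that the last write time advances only when a write is accepted is an assumption of this sketch, not something the commit states.

```typescript
const IDLE_TIMEOUT_MS = 90_000; // 90s of inactivity, per the commit message

// Parse the Last-Event-ID request header; non-numeric values are ignored.
function parseLastEventId(header: string | string[] | undefined): number | undefined {
  const raw = Array.isArray(header) ? header[0] : header;
  if (raw === undefined || !/^\d+$/.test(raw)) return undefined;
  return Number(raw);
}

// One SSE frame. The "id:" field is what the browser echoes back as
// Last-Event-ID when it reconnects.
function formatSseFrame(id: number, data: unknown): string {
  return `id: ${id}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Minimal view of a connection for this sketch.
interface SseConnection {
  lastWriteAt: number; // ms timestamp of the last accepted write
  write(chunk: string): boolean; // false when the write was not accepted
  destroy(): void;
}

// Called on every heartbeat tick. Returns false if the connection was destroyed.
function onHeartbeatTick(conn: SseConnection, now: number): boolean {
  if (now - conn.lastWriteAt >= IDLE_TIMEOUT_MS) {
    conn.destroy(); // zombie: nothing has gone through for 90s
    return false;
  }
  if (conn.write(": heartbeat\n\n")) conn.lastWriteAt = now;
  return true;
}
```

Piggybacking the check on the heartbeat tick avoids a separate timer per connection; the cost is that detection can lag the 90s mark by up to one heartbeat interval.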

Wraps browser EventSource with exponential backoff (1s to 30s cap),
5-minute give-up timeout, and failure counter reset on successful
connection. subscribeSSE and subscribeGlobalSSE now use it.
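
The reconnection policy could be sketched as below. backoffDelay and ReconnectPolicy are hypothetical names, and the doubling factor is an assumption; the commit only specifies the 1s start, the 30s cap, the 5-minute give-up, and the reset on success.

```typescript
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 30_000;
const GIVE_UP_AFTER_MS = 5 * 60_000;

// Delay before the next attempt after `failures` consecutive failures:
// 1s, 2s, 4s, ... capped at 30s (doubling is assumed for this sketch).
function backoffDelay(failures: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** Math.max(0, failures - 1), MAX_DELAY_MS);
}

class ReconnectPolicy {
  private failures = 0;
  private firstFailureAt: number | undefined = undefined;

  // A successful connection resets the failure counter and the give-up clock.
  onSuccess(): void {
    this.failures = 0;
    this.firstFailureAt = undefined;
  }

  // Returns the delay in ms before the next attempt, or null to give up.
  onFailure(now: number): number | null {
    if (this.firstFailureAt === undefined) this.firstFailureAt = now;
    if (now - this.firstFailureAt >= GIVE_UP_AFTER_MS) return null;
    this.failures += 1;
    return backoffDelay(this.failures);
  }
}
```

A wrapper around EventSource would call onFailure(Date.now()) from the error handler, close the source, and schedule a new one after the returned delay; the open handler would call onSuccess().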

OneStepAt4time merged commit 436e97a into main on Mar 27, 2026
3 checks passed
OneStepAt4time added a commit that referenced this pull request Mar 27, 2026
Buffer last 50 global SSE events so clients connecting mid-session
get recent events replayed via Last-Event-ID header on the /v1/events
endpoint. Mirrors the per-session ring buffer pattern from PR #332.



Development

Successfully merging this pull request may close these issues.

SSE: Multiple robustness issues (cleanup races, no idle timeout, no reconnection handling)
