Skip to content

feat(kernel): port resilience (circuit breakers + retry/backoff) from daemon #33

@stackbilt-admin

Description

@stackbilt-admin

Problem

`web/src/landing.ts` makes two explicit product claims about aegis-core's resilience surface:

"Every query is classified by complexity and routed to the cheapest executor that can handle it. Procedural memory learns from outcomes and short-circuits future routing. A circuit breaker degrades executors that fail consecutively." — landing.ts:938

"Learns which executors succeed for which task patterns. Procedures graduate from learning to learned after consistent success. A circuit breaker degrades unreliable routes. Stale procedures decay after 14 days of disuse." — landing.ts:1028

But `grep -rn "circuit|resilience" src/` returns only those two prose strings. No implementation exists in core. Meanwhile the daemon runs 301 LOC of production-tested resilience logic at `web/src/kernel/resilience.ts` covering:

  • Circuit breaker state machine (closed / half-open / open) per external binding
  • Retry with exponential backoff
  • Concurrency limits per executor class
  • Stats exposure via `/health?format=json` (documented in CLAUDE.md)

Applied to: Groq, Anthropic, Brave, MCP, BizOps, service bindings.

Why this belongs in core

  • Landing page already promises it — current state is a credibility gap
  • Every downstream aegis variant will need the same guarantees for the same external substrates
  • Clean lift: the daemon implementation is self-contained (no Stackbilt-specific deps in the file)
  • Enables Phase C router/dispatch collapse — those shadows partially exist because daemon callers wrap core calls in resilience at the call site

Acceptance criteria

  • `web/src/kernel/resilience.ts` exists in core with circuit-breaker, retry-with-backoff, concurrency-limit primitives
  • Subpath export added to `package.json`: `"./kernel/resilience"`
  • Existing core call sites for external bindings (Groq, Anthropic, feed fetching) routed through the primitives
  • Daemon's `web/src/kernel/resilience.ts` becomes a re-export (or stays local if configuration differs, but imports from core)
  • Stats surface documented and tested
  • Unit tests cover state transitions + backoff jitter

Out of scope

  • Telemetry export to external systems (OpenTelemetry, Datadog) — separate
  • Distributed rate limiting across workers — separate

References

  • Daemon: `web/src/kernel/resilience.ts` (301 LOC, production-proven)
  • Landing page claims: `web/src/landing.ts:938`, `web/src/landing.ts:1028`
  • Memory: `project_daemon_kernel_shadow` — listed as daemon-only (not a shadow, true daemon asset that should upstream)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions