Configurable health policy, shared utils, failure-mode tests#6
Configurable health policy, shared utils, failure-mode tests#6Royal-lobster merged 7 commits intomainfrom
Conversation
Remove duplicate formatDuration implementations from health-manager.ts and discord/formatter.ts in favor of a single shared utility. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace hardcoded values in HealthManager (3 failures, 30s window, 10s drain, 3 retries, 1h expiry) with a typed HealthPolicy config. Defaults preserve existing behavior. Users can now tune these product decisions per deployment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Custom threshold: verifies unhealthyThreshold is respected - Custom expiry: verifies entryExpiryMs controls entry discard timing - Custom maxRetries: verifies entries are dropped after configured retries - Full recovery flow: unhealthy → queued → drain succeeds → recovery callback fires → adapter returns to healthy state Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a configurable HealthPolicy for the HealthManager, allowing customization of failure thresholds, health windows, drain intervals, retry limits, and entry expiration. It also refactors the formatDuration utility into a shared core file. Feedback focuses on refining the health check logic to better handle high-traffic failure bursts, improving the efficiency of clearing expired queue entries by scheduling immediate re-drains, and ensuring the recovery callback triggers correctly when adapters recover during background drain attempts.
| return !( | ||
| health.consecutiveFailures >= unhealthyThreshold && | ||
| Date.now() - health.lastSuccessAt > healthWindowMs | ||
| ) |
There was a problem hiding this comment.
The health check logic requires both the failure threshold to be met AND the time window to have elapsed. In high-traffic scenarios, an adapter could fail hundreds of times in a few seconds but still be considered 'healthy' because the healthWindowMs (default 30s) hasn't passed. This causes dispatch to continue attempting direct sends, which is inefficient and may overwhelm the failing adapter. Consider if a simpler threshold-only check or an alternative logic would better protect the system from failure bursts.
There was a problem hiding this comment.
The dual condition (threshold AND window) is intentional — it prevents transient blips (e.g., a brief network hiccup causing 3 fast failures) from marking an adapter unhealthy. During the burst, failed sends still get enqueued and retried via drain, so no alerts are lost. The adapter just isn't preemptively bypassed for direct sends. A threshold-only check would cause unnecessary queueing on every momentary glitch. Since both the threshold and window are now configurable via HealthPolicy, users who want aggressive failure detection can set healthWindowMs: 0.
| // Discard expired entries (>1 hour old) | ||
| if (Date.now() - entry.enqueuedAt > 3_600_000) { | ||
| // Discard expired entries | ||
| if (Date.now() - entry.enqueuedAt > this.config.policy.entryExpiryMs) { |
There was a problem hiding this comment.
When an expired entry is discarded, the function returns without scheduling a re-drain for the next item in the queue. If the queue contains many expired entries, they will only be cleared one by one at every drainIntervalMs (default 10s). This can cause the queue to remain full of stale data and block new alerts for an extended period. Consider scheduling an immediate re-drain after dequeuing an expired entry:
if (Date.now() - entry.enqueuedAt > this.config.policy.entryExpiryMs) {
health.queue.dequeue()
if (!health.queue.isEmpty) {
setTimeout(() => void this.drainOnce(adapter), 0)
}
return
}There was a problem hiding this comment.
Fixed in fa65565 — expired entries now schedule an immediate re-drain via setTimeout(() => void this.drainOnce(adapter), 0), flushing the entire stale queue in O(n) instead of O(n × drainIntervalMs).
| health.consecutiveFailures++ | ||
| entry.retryCount++ | ||
| if (entry.retryCount >= 3) { | ||
| if (entry.retryCount >= this.config.policy.maxRetries) { |
There was a problem hiding this comment.
The warnedAt timestamp, used to calculate downtime for onRecovery, is currently only set within the dispatch method. If an adapter becomes unhealthy during background drain attempts and then recovers before any new alerts are dispatched, onRecovery will not be triggered because warnedAt remains null. You should check the health status and update warnedAt (and log the warning) here when a send attempt fails.
if (!this.isHealthy(adapter) && health.warnedAt === null) {
console.warn(
"[alert-logger] " + adapter.name + " adapter is unhealthy (" + health.consecutiveFailures + " consecutive failures)",
)
health.warnedAt = Date.now()
}
if (entry.retryCount >= this.config.policy.maxRetries) {There was a problem hiding this comment.
Fixed in fa65565 — the drain failure path now checks health and sets warnedAt + logs the warning, so onRecovery fires correctly even when the adapter becomes unhealthy purely through drain retries. Added a dedicated test for this scenario (drain-only unhealthy: onRecovery fires even without new dispatches).
- Import formatDuration directly from utils.js instead of re-exporting through health-manager.ts - Add queueSize() method to HealthManager for observability - Add explicit queue state assertions to custom expiry and maxRetries tests so they verify behavior, not just absence of errors Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Set warnedAt during drain failures so onRecovery fires even when adapter becomes unhealthy purely through drain retries (no dispatch) - Schedule immediate re-drain after discarding expired entries to flush stale queues in O(n) time instead of O(n * drainIntervalMs) - Add test: drain-only unhealthy → recovery fires without new dispatches Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
formatDurationintosrc/core/utils.ts— removes duplicate implementations fromhealth-manager.tsanddiscord/formatter.tsHealthPolicytype — replaces hardcoded values (3 failures, 30s window, 10s drain, 3 retries, 1h expiry) with user-configurable settings while preserving defaultsMotivation
Prompted by code review feedback that
HealthManagerbakes in product decisions as implementation details. The hardcoded retry/health constants are deployment-specific — a high-traffic service may want a shorter drain interval, while a batch job may want a longer health window. These are now configurable viaAlertLoggerConfig.health.Test plan
pnpm test)pnpm biome check src/)🤖 Generated with Claude Code
Closes #15