
Configurable health policy, shared utils, failure-mode tests #6

Merged
Royal-lobster merged 7 commits into main from refactor/health-policy-and-utils
Apr 6, 2026

Conversation


@Royal-lobster (Member) commented Apr 6, 2026

Summary

  • Extract shared formatDuration into src/core/utils.ts — removes duplicate implementations from health-manager.ts and discord/formatter.ts
  • Make health/retry policy configurable via new HealthPolicy type — replaces hardcoded values (3 failures, 30s window, 10s drain, 3 retries, 1h expiry) with user-configurable settings while preserving defaults
  • Add failure-mode tests — custom threshold behavior, custom entry expiry, custom maxRetries, and full unhealthy→recovery→healthy flow
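The body of the shared utility is not shown in this PR, so the sketch below is illustrative only: a plausible formatDuration for src/core/utils.ts whose exact unit breakdown (seconds/minutes/hours) is an assumption, not the merged implementation.

```typescript
// Illustrative sketch only: the PR does not show formatDuration's body,
// so the unit-breakdown rules below are assumptions.
function formatDuration(ms: number): string {
  const s = Math.floor(ms / 1000)
  if (s < 60) return `${s}s`
  const m = Math.floor(s / 60)
  if (m < 60) return `${m}m ${s % 60}s`
  const h = Math.floor(m / 60)
  return `${h}h ${m % 60}m`
}

formatDuration(95_000)    // "1m 35s"
formatDuration(7_260_000) // "2h 1m"
```

Whatever the real formatting rules are, centralizing them in one module is what removes the drift risk between health-manager.ts and discord/formatter.ts.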

Motivation

Prompted by code review feedback that HealthManager bakes in product decisions as implementation details. The hardcoded retry/health constants are deployment-specific — a high-traffic service may want a shorter drain interval, while a batch job may want a longer health window. These are now configurable via AlertLoggerConfig.health.
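As a sketch of the resulting shape: the field names unhealthyThreshold, healthWindowMs, drainIntervalMs, maxRetries, and entryExpiryMs all appear in the review diffs below, but the exact interface layout here is an assumption.

```typescript
// Sketch of the HealthPolicy type this PR introduces; field names are taken
// from the review diffs, the layout itself is assumed.
interface HealthPolicy {
  unhealthyThreshold: number // consecutive failures before marking unhealthy (default 3)
  healthWindowMs: number     // failures must also span this window since last success (default 30s)
  drainIntervalMs: number    // cadence of background queue drains (default 10s)
  maxRetries: number         // drop a queued entry after this many retries (default 3)
  entryExpiryMs: number      // discard queued entries older than this (default 1h)
}

// Defaults preserving the previously hardcoded behavior.
const defaultHealthPolicy: HealthPolicy = {
  unhealthyThreshold: 3,
  healthWindowMs: 30_000,
  drainIntervalMs: 10_000,
  maxRetries: 3,
  entryExpiryMs: 3_600_000,
}
```

Existing consumers that pass no AlertLoggerConfig.health get these defaults, so the change is backward compatible.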

Test plan

  • All 125 existing + new tests pass (pnpm test)
  • Biome lint/format clean (pnpm biome check src/)
  • Verify default behavior unchanged (no config changes needed for existing consumers)
  • New tests validate custom policy thresholds, expiry, maxRetries, and full recovery flow

🤖 Generated with Claude Code

Closes #15

Royal-lobster and others added 4 commits April 6, 2026 09:46
Remove duplicate formatDuration implementations from health-manager.ts
and discord/formatter.ts in favor of a single shared utility.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace hardcoded values in HealthManager (3 failures, 30s window,
10s drain, 3 retries, 1h expiry) with a typed HealthPolicy config.
Defaults preserve existing behavior. Users can now tune these
product decisions per deployment.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Custom threshold: verifies unhealthyThreshold is respected
- Custom expiry: verifies entryExpiryMs controls entry discard timing
- Custom maxRetries: verifies entries are dropped after configured retries
- Full recovery flow: unhealthy → queued → drain succeeds → recovery
  callback fires → adapter returns to healthy state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a configurable HealthPolicy for the HealthManager, allowing customization of failure thresholds, health windows, drain intervals, retry limits, and entry expiration. It also refactors the formatDuration utility into a shared core file. Feedback focuses on refining the health check logic to better handle high-traffic failure bursts, improving the efficiency of clearing expired queue entries by scheduling immediate re-drains, and ensuring the recovery callback triggers correctly when adapters recover during background drain attempts.

Comment on lines +51 to +54
return !(
  health.consecutiveFailures >= unhealthyThreshold &&
  Date.now() - health.lastSuccessAt > healthWindowMs
)

Severity: medium

The health check logic requires both the failure threshold to be met AND the time window to have elapsed. In high-traffic scenarios, an adapter could fail hundreds of times in a few seconds but still be considered 'healthy' because the healthWindowMs (default 30s) hasn't passed. This causes dispatch to continue attempting direct sends, which is inefficient and may overwhelm the failing adapter. Consider if a simpler threshold-only check or an alternative logic would better protect the system from failure bursts.

@Royal-lobster (Member, Author) replied:

The dual condition (threshold AND window) is intentional — it prevents transient blips (e.g., a brief network hiccup causing 3 fast failures) from marking an adapter unhealthy. During the burst, failed sends still get enqueued and retried via drain, so no alerts are lost. The adapter just isn't preemptively bypassed for direct sends. A threshold-only check would cause unnecessary queueing on every momentary glitch. Since both the threshold and window are now configurable via HealthPolicy, users who want aggressive failure detection can set healthWindowMs: 0.
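The trade-off can be made concrete with a standalone sketch of the check quoted in the diff above (the AdapterHealth shape is assumed for illustration; the real method lives on HealthManager):

```typescript
// Standalone sketch of the dual-condition health check; the AdapterHealth
// shape is an assumption for illustration.
interface AdapterHealth {
  consecutiveFailures: number
  lastSuccessAt: number
}

function isHealthy(
  health: AdapterHealth,
  unhealthyThreshold: number,
  healthWindowMs: number,
  now: number,
): boolean {
  return !(
    health.consecutiveFailures >= unhealthyThreshold &&
    now - health.lastSuccessAt > healthWindowMs
  )
}

const now = 100_000
// A burst of 200 failures only 5s after the last success is still "healthy":
// the 30s window hasn't elapsed, so direct sends continue (failed ones queue).
isHealthy({ consecutiveFailures: 200, lastSuccessAt: now - 5_000 }, 3, 30_000, now) // true
// With healthWindowMs: 0 the check degenerates to threshold-only detection.
isHealthy({ consecutiveFailures: 3, lastSuccessAt: now - 1 }, 3, 0, now) // false
```

The second call shows the escape hatch mentioned in the reply: deployments that want aggressive failure detection set healthWindowMs: 0 and get pure threshold behavior.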

- // Discard expired entries (>1 hour old)
- if (Date.now() - entry.enqueuedAt > 3_600_000) {
+ // Discard expired entries
+ if (Date.now() - entry.enqueuedAt > this.config.policy.entryExpiryMs) {

Severity: medium

When an expired entry is discarded, the function returns without scheduling a re-drain for the next item in the queue. If the queue contains many expired entries, they will only be cleared one by one at every drainIntervalMs (default 10s). This can cause the queue to remain full of stale data and block new alerts for an extended period. Consider scheduling an immediate re-drain after dequeuing an expired entry:

if (Date.now() - entry.enqueuedAt > this.config.policy.entryExpiryMs) {
  health.queue.dequeue()
  if (!health.queue.isEmpty) {
    setTimeout(() => void this.drainOnce(adapter), 0)
  }
  return
}

@Royal-lobster (Member, Author) replied:

Fixed in fa65565 — expired entries now schedule an immediate re-drain via setTimeout(() => void this.drainOnce(adapter), 0), flushing the entire stale queue in O(n) instead of O(n × drainIntervalMs).
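The effect of that fix can be shown with a toy simulation (entry shape assumed); the setTimeout(…, 0) re-drain chain is modeled here as a loop, since each discarded entry immediately triggers the next drain:

```typescript
// Toy simulation of the expired-entry flush; the Entry shape is assumed.
// The real code chains setTimeout(() => void this.drainOnce(adapter), 0);
// a loop models that chain for clarity.
interface Entry {
  enqueuedAt: number
}

function flushExpired(queue: Entry[], entryExpiryMs: number, now: number): number {
  let discarded = 0
  while (queue.length > 0 && now - queue[0].enqueuedAt > entryExpiryMs) {
    queue.shift() // discard the stale head and "re-drain" immediately
    discarded++
  }
  return discarded
}

const q: Entry[] = [{ enqueuedAt: 0 }, { enqueuedAt: 1_000 }, { enqueuedAt: 50_000 }]
flushExpired(q, 10_000, 60_000) // discards the two stale entries; one fresh entry remains
```

Before the fix, each drain tick cleared at most one stale entry, so the same queue would have needed two drainIntervalMs ticks (20s at the default) just to reach the fresh entry.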

  health.consecutiveFailures++
  entry.retryCount++
- if (entry.retryCount >= 3) {
+ if (entry.retryCount >= this.config.policy.maxRetries) {

Severity: medium

The warnedAt timestamp, used to calculate downtime for onRecovery, is currently only set within the dispatch method. If an adapter becomes unhealthy during background drain attempts and then recovers before any new alerts are dispatched, onRecovery will not be triggered because warnedAt remains null. You should check the health status and update warnedAt (and log the warning) here when a send attempt fails.

        if (!this.isHealthy(adapter) && health.warnedAt === null) {
          console.warn(
            "[alert-logger] " + adapter.name + " adapter is unhealthy (" + health.consecutiveFailures + " consecutive failures)",
          )
          health.warnedAt = Date.now()
        }
        if (entry.retryCount >= this.config.policy.maxRetries) {

@Royal-lobster (Member, Author) replied:

Fixed in fa65565 — the drain failure path now checks health and sets warnedAt + logs the warning, so onRecovery fires correctly even when the adapter becomes unhealthy purely through drain retries. Added a dedicated test for this scenario (drain-only unhealthy: onRecovery fires even without new dispatches).
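A minimal simulation of the fixed state machine (field and callback shapes assumed; the real path also consults isHealthy's window condition, which is omitted here for brevity):

```typescript
// Minimal simulation of the fixed warnedAt bookkeeping; shapes are assumed,
// and the real code gates warnedAt on the full isHealthy check (threshold
// AND window), simplified here to threshold-only.
interface Health {
  consecutiveFailures: number
  warnedAt: number | null
}

type OnRecovery = (downtimeMs: number) => void

function onDrainFailure(h: Health, unhealthyThreshold: number, now: number): void {
  h.consecutiveFailures++
  // The fix: set warnedAt on the drain path too, not only in dispatch().
  if (h.consecutiveFailures >= unhealthyThreshold && h.warnedAt === null) {
    h.warnedAt = now
  }
}

function onDrainSuccess(h: Health, onRecovery: OnRecovery, now: number): void {
  h.consecutiveFailures = 0
  if (h.warnedAt !== null) {
    onRecovery(now - h.warnedAt) // downtime is measurable even with no dispatches
    h.warnedAt = null
  }
}

// Drain-only failures, then a successful drain: onRecovery fires with downtime.
const h: Health = { consecutiveFailures: 0, warnedAt: null }
let downtime = -1
onDrainFailure(h, 3, 1_000)
onDrainFailure(h, 3, 2_000)
onDrainFailure(h, 3, 3_000) // crosses threshold; warnedAt = 3_000
onDrainSuccess(h, (ms) => { downtime = ms }, 10_000) // downtime = 7_000
```

Before the fix, warnedAt would stay null through this whole sequence, so the recovery callback (and its downtime figure) was silently skipped.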

Royal-lobster and others added 3 commits April 6, 2026 09:54
- Import formatDuration directly from utils.js instead of re-exporting
  through health-manager.ts
- Add queueSize() method to HealthManager for observability
- Add explicit queue state assertions to custom expiry and maxRetries
  tests so they verify behavior, not just absence of errors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Set warnedAt during drain failures so onRecovery fires even when
  adapter becomes unhealthy purely through drain retries (no dispatch)
- Schedule immediate re-drain after discarding expired entries to
  flush stale queues in O(n) time instead of O(n * drainIntervalMs)
- Add test: drain-only unhealthy → recovery fires without new dispatches

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Build out @iqai/alert-logger — health policy, CJS compat, description field, fingerprint dedup
