Configurable health policy, shared utils, failure-mode tests by Royal-lobster · Pull Request #6 · IQAIcom/alert-logger

Royal-lobster · 2026-04-06T04:20:41Z

Summary

Extract shared formatDuration into src/core/utils.ts — removes duplicate implementations from health-manager.ts and discord/formatter.ts
Make health/retry policy configurable via new HealthPolicy type — replaces hardcoded values (3 failures, 30s window, 10s drain, 3 retries, 1h expiry) with user-configurable settings while preserving defaults
Add failure-mode tests — custom threshold behavior, custom entry expiry, custom maxRetries, and full unhealthy→recovery→healthy flow
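
For illustration, a shared duration formatter of the kind described in the first item might look like the sketch below. The PR page does not show the real implementation in src/core/utils.ts, so the output format (for example "1h 5m") and the rounding behavior here are assumptions.

```typescript
// Hypothetical sketch of a shared duration formatter. The actual
// formatDuration in src/core/utils.ts may format its output differently.
function formatDuration(ms: number): string {
  // Clamp negatives to zero and work in whole seconds.
  const totalSeconds = Math.max(0, Math.floor(ms / 1000))
  const hours = Math.floor(totalSeconds / 3600)
  const minutes = Math.floor((totalSeconds % 3600) / 60)
  const seconds = totalSeconds % 60

  // Show the two most significant units only.
  if (hours > 0) return `${hours}h ${minutes}m`
  if (minutes > 0) return `${minutes}m ${seconds}s`
  return `${seconds}s`
}
```

Keeping a single copy means health-manager.ts and discord/formatter.ts cannot drift apart in how they render downtime.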

Motivation

Prompted by code review feedback that HealthManager bakes in product decisions as implementation details. The hardcoded retry/health constants are deployment-specific — a high-traffic service may want a shorter drain interval, while a batch job may want a longer health window. These are now configurable via AlertLoggerConfig.health.
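
As a rough sketch of what such a policy could look like: the field names below are the identifiers quoted in this PR's review thread, and the defaults are the previously hardcoded values from the summary. The exact shape of the real HealthPolicy type, the constant name, and the resolveHealthPolicy helper are assumptions made for illustration.

```typescript
// Field names follow identifiers quoted in the review thread; the exact
// shape of the real HealthPolicy type is an assumption.
interface HealthPolicy {
  unhealthyThreshold: number // consecutive failures before an adapter can be unhealthy
  healthWindowMs: number // time since the last success before it can be unhealthy
  drainIntervalMs: number // delay between queue drain attempts
  maxRetries: number // retries before a queued entry is dropped
  entryExpiryMs: number // age at which a queued entry is discarded
}

// Defaults mirror the previously hardcoded values listed in the summary.
const DEFAULT_HEALTH_POLICY: HealthPolicy = {
  unhealthyThreshold: 3,
  healthWindowMs: 30_000,
  drainIntervalMs: 10_000,
  maxRetries: 3,
  entryExpiryMs: 3_600_000,
}

// Hypothetical helper: overlay user overrides on top of the defaults.
function resolveHealthPolicy(overrides: Partial<HealthPolicy> = {}): HealthPolicy {
  return { ...DEFAULT_HEALTH_POLICY, ...overrides }
}

// A high-traffic service shortening only the drain interval.
const policy = resolveHealthPolicy({ drainIntervalMs: 2_000 })
```

Because every field has a default, existing consumers that pass no health settings would keep the old behavior.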

Test plan

All 125 existing + new tests pass (pnpm test)
Biome lint/format clean (pnpm biome check src/)
Verify default behavior unchanged (no config changes needed for existing consumers)
New tests validate custom policy thresholds, expiry, maxRetries, and full recovery flow

🤖 Generated with Claude Code

Closes #15

Remove duplicate formatDuration implementations from health-manager.ts and discord/formatter.ts in favor of a single shared utility.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace hardcoded values in HealthManager (3 failures, 30s window, 10s drain, 3 retries, 1h expiry) with a typed HealthPolicy config. Defaults preserve existing behavior. Users can now tune these product decisions per deployment.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Custom threshold: verifies unhealthyThreshold is respected
- Custom expiry: verifies entryExpiryMs controls entry discard timing
- Custom maxRetries: verifies entries are dropped after configured retries
- Full recovery flow: unhealthy → queued → drain succeeds → recovery callback fires → adapter returns to healthy state
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces a configurable HealthPolicy for the HealthManager, allowing customization of failure thresholds, health windows, drain intervals, retry limits, and entry expiration. It also refactors the formatDuration utility into a shared core file. Feedback focuses on refining the health check logic to better handle high-traffic failure bursts, improving the efficiency of clearing expired queue entries by scheduling immediate re-drains, and ensuring the recovery callback triggers correctly when adapters recover during background drain attempts.

gemini-code-assist · 2026-04-06T04:23:44Z

+    return !(
+      health.consecutiveFailures >= unhealthyThreshold &&
+      Date.now() - health.lastSuccessAt > healthWindowMs
+    )


The health check logic requires both the failure threshold to be met AND the time window to have elapsed. In high-traffic scenarios, an adapter could fail hundreds of times in a few seconds but still be considered 'healthy' because the healthWindowMs (default 30s) hasn't passed. This causes dispatch to continue attempting direct sends, which is inefficient and may overwhelm the failing adapter. Consider if a simpler threshold-only check or an alternative logic would better protect the system from failure bursts.

The dual condition (threshold AND window) is intentional — it prevents transient blips (e.g., a brief network hiccup causing 3 fast failures) from marking an adapter unhealthy. During the burst, failed sends still get enqueued and retried via drain, so no alerts are lost. The adapter just isn't preemptively bypassed for direct sends. A threshold-only check would cause unnecessary queueing on every momentary glitch. Since both the threshold and window are now configurable via HealthPolicy, users who want aggressive failure detection can set healthWindowMs: 0.
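
To make the trade-off concrete, here is a minimal sketch of the dual-condition check from the diff above, run against a failure burst. The interface names and the explicit `now` parameter are assumptions added so the example is self-contained; the diff itself calls Date.now().

```typescript
interface AdapterHealth {
  consecutiveFailures: number
  lastSuccessAt: number // epoch ms of the last successful send
}

interface HealthThresholds {
  unhealthyThreshold: number
  healthWindowMs: number
}

// Unhealthy only when BOTH conditions hold, as in the diff above.
// `now` is passed in for determinism (an assumption of this sketch).
function isHealthy(health: AdapterHealth, policy: HealthThresholds, now: number): boolean {
  return !(
    health.consecutiveFailures >= policy.unhealthyThreshold &&
    now - health.lastSuccessAt > policy.healthWindowMs
  )
}

const now = 100_000
// 200 consecutive failures, last success only 5 seconds ago.
const burst: AdapterHealth = { consecutiveFailures: 200, lastSuccessAt: now - 5_000 }

// With a 30s window the burst still counts as healthy: the window has not elapsed.
const withWindow = isHealthy(burst, { unhealthyThreshold: 3, healthWindowMs: 30_000 }, now)
// With healthWindowMs: 0 the threshold decides as soon as any time has passed.
const noWindow = isHealthy(burst, { unhealthyThreshold: 3, healthWindowMs: 0 }, now)
```

Here `withWindow` is true and `noWindow` is false, which is the behavior difference the reviewer and the author are discussing.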

gemini-code-assist · 2026-04-06T04:23:44Z

-      // Discard expired entries (>1 hour old)
-      if (Date.now() - entry.enqueuedAt > 3_600_000) {
+      // Discard expired entries
+      if (Date.now() - entry.enqueuedAt > this.config.policy.entryExpiryMs) {


When an expired entry is discarded, the function returns without scheduling a re-drain for the next item in the queue. If the queue contains many expired entries, they will only be cleared one by one at every drainIntervalMs (default 10s). This can cause the queue to remain full of stale data and block new alerts for an extended period. Consider scheduling an immediate re-drain after dequeuing an expired entry:

    if (Date.now() - entry.enqueuedAt > this.config.policy.entryExpiryMs) {
      health.queue.dequeue()
      if (!health.queue.isEmpty) {
        setTimeout(() => void this.drainOnce(adapter), 0)
      }
      return
    }

Fixed in fa65565 — expired entries now schedule an immediate re-drain via setTimeout(() => void this.drainOnce(adapter), 0), flushing the entire stale queue in O(n) instead of O(n × drainIntervalMs).
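
The effect of the immediate re-drain can be sketched with a simplified queue. Everything here is a stand-in: the array-backed queue, the injected `schedule` callback (used in place of setTimeout so the example runs synchronously), and the `sent` list are assumptions, not the library's actual internals.

```typescript
interface QueuedEntry {
  id: number
  enqueuedAt: number // epoch ms
}

// Hypothetical stand-in for one drain pass. It handles at most one entry per
// call; `schedule` plays the role of setTimeout(..., 0) in the real fix.
function drainOnce(
  queue: QueuedEntry[],
  entryExpiryMs: number,
  now: number,
  schedule: (task: () => void) => void,
  sent: number[],
): void {
  const entry = queue[0]
  if (entry === undefined) return

  if (now - entry.enqueuedAt > entryExpiryMs) {
    queue.shift() // discard the expired entry
    if (queue.length > 0) {
      // Re-drain right away instead of waiting for the next drain interval.
      schedule(() => drainOnce(queue, entryExpiryMs, now, schedule, sent))
    }
    return
  }

  queue.shift()
  sent.push(entry.id) // stand-in for a successful adapter send
}

// Tiny synchronous scheduler: run queued tasks until none remain.
const pending: Array<() => void> = []
const runAll = (): void => {
  let task = pending.shift()
  while (task !== undefined) {
    task()
    task = pending.shift()
  }
}

const now = 10_000_000
const queue: QueuedEntry[] = [
  { id: 1, enqueuedAt: now - 4_000_000 }, // expired (older than 1h)
  { id: 2, enqueuedAt: now - 3_700_000 }, // expired
  { id: 3, enqueuedAt: now - 1_000 }, // fresh
]
const sent: number[] = []

drainOnce(queue, 3_600_000, now, (task) => pending.push(task), sent)
runAll()
```

After this runs, both stale entries are gone and only entry 3 has been sent, all without waiting for a single drain interval. Without the re-drain, each expired entry would cost one full drainIntervalMs.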

gemini-code-assist · 2026-04-06T04:23:44Z

        health.consecutiveFailures++
        entry.retryCount++
-        if (entry.retryCount >= 3) {
+        if (entry.retryCount >= this.config.policy.maxRetries) {


The warnedAt timestamp, used to calculate downtime for onRecovery, is currently only set within the dispatch method. If an adapter becomes unhealthy during background drain attempts and then recovers before any new alerts are dispatched, onRecovery will not be triggered because warnedAt remains null. You should check the health status and update warnedAt (and log the warning) here when a send attempt fails.

    if (!this.isHealthy(adapter) && health.warnedAt === null) {
      console.warn(
        "[alert-logger] " + adapter.name + " adapter is unhealthy (" + health.consecutiveFailures + " consecutive failures)",
      )
      health.warnedAt = Date.now()
    }
    if (entry.retryCount >= this.config.policy.maxRetries) {

Fixed in fa65565 — the drain failure path now checks health and sets warnedAt + logs the warning, so onRecovery fires correctly even when the adapter becomes unhealthy purely through drain retries. Added a dedicated test for this scenario (drain-only unhealthy: onRecovery fires even without new dispatches).
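
The drain-only scenario can be sketched as follows. The hook names, the fixed threshold constants, and the `onRecovery(downtimeMs)` signature are assumptions for illustration; only the idea of setting warnedAt on a failed drain attempt comes from the thread.

```typescript
interface DrainHealth {
  consecutiveFailures: number
  lastSuccessAt: number // epoch ms
  warnedAt: number | null // when the adapter was first seen unhealthy
}

const UNHEALTHY_THRESHOLD = 3
const HEALTH_WINDOW_MS = 30_000

function isHealthy(h: DrainHealth, now: number): boolean {
  return !(
    h.consecutiveFailures >= UNHEALTHY_THRESHOLD &&
    now - h.lastSuccessAt > HEALTH_WINDOW_MS
  )
}

// Hypothetical drain-failure hook: record when the adapter first went unhealthy.
function onDrainFailure(h: DrainHealth, now: number): void {
  h.consecutiveFailures++
  if (!isHealthy(h, now) && h.warnedAt === null) {
    h.warnedAt = now
  }
}

// Hypothetical drain-success hook: report downtime once, then reset state.
function onDrainSuccess(
  h: DrainHealth,
  now: number,
  onRecovery: (downtimeMs: number) => void,
): void {
  if (h.warnedAt !== null) {
    onRecovery(now - h.warnedAt)
    h.warnedAt = null
  }
  h.consecutiveFailures = 0
  h.lastSuccessAt = now
}

const health: DrainHealth = { consecutiveFailures: 0, lastSuccessAt: 0, warnedAt: null }
const downtimes: number[] = []

// Three failed drain attempts, 10s apart, with no dispatch in between.
onDrainFailure(health, 40_000)
onDrainFailure(health, 50_000)
onDrainFailure(health, 60_000) // threshold and window are both exceeded here
onDrainSuccess(health, 70_000, (ms) => downtimes.push(ms))
```

If warnedAt were set only in dispatch, it would still be null when the drain succeeds and the recovery callback would never run. In this sketch it fires once with a downtime of 10,000 ms.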

- Import formatDuration directly from utils.js instead of re-exporting through health-manager.ts
- Add queueSize() method to HealthManager for observability
- Add explicit queue state assertions to custom expiry and maxRetries tests so they verify behavior, not just absence of errors
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Set warnedAt during drain failures so onRecovery fires even when adapter becomes unhealthy purely through drain retries (no dispatch)
- Schedule immediate re-drain after discarding expired entries to flush stale queues in O(n) time instead of O(n * drainIntervalMs)
- Add test: drain-only unhealthy → recovery fires without new dispatches
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Royal-lobster and others added 4 commits April 6, 2026 09:46

Extract shared formatDuration into src/core/utils.ts

f7f4513


Fix import ordering (biome)

a8c7b27

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gemini-code-assist Bot reviewed Apr 6, 2026


Royal-lobster and others added 3 commits April 6, 2026 09:54

Add changeset for health policy configurability

4dcc267

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Royal-lobster merged commit 83c2811 into main Apr 6, 2026

Royal-lobster deleted the refactor/health-policy-and-utils branch April 6, 2026 04:29

github-actions Bot mentioned this pull request Apr 6, 2026

chore: version packages #7

Merged

Royal-lobster mentioned this pull request Apr 7, 2026

Build out @iqai/alert-logger — health policy, CJS compat, description field, fingerprint dedup #15

Closed

Royal-lobster merged 7 commits into main from refactor/health-policy-and-utils
