feat(health): Auto log collection mode with per-endpoint SSE->Periodic fallback #1063

Open
mkoci wants to merge 3 commits into NVIDIA:main from mkoci:feature/auto-log-collection-mode

Conversation

@mkoci (Contributor) commented Apr 21, 2026

Summary

Fixes #1005.

After the merge of #711, collectors.logs.mode is a single global choice. When an SSE-mode endpoint returns HealthError::SseNotAvailable, Collector::start_streaming logs the error and retries forever with exponential backoff (max 30s) instead of falling back to periodic polling. The LogCollectionMode doc string claimed periodic was an automatic fallback, which was a good idea and also a lie.

This PR introduces LogCollectionMode::Auto as the new default. It spawns SSE per endpoint and downgrades to periodic polling, with a tracing::warn! on the transition, when SSE is unsupported or repeatedly failing.

How it works

  • A per-endpoint failure budget classifies connect failures:
    • SseNotAvailable: the BMC doesn't support SSE subscriptions, so fall back to periodic (threshold 1 by default)
    • All other connection errors (network, auth, 5xx, TLS) are counted in a rolling window (default 5 failures in 5 min)
  • Hitting either threshold records that BMC's endpoint_key in a process-wide LogDowngradeRegistry, and the SSE task for that endpoint exits cleanly
  • In the discovery loop, prune_finished_logs() drops the exited slot before the respawn pass, and the Auto arm in spawn_collectors_for_endpoint checks registry.is_downgraded(&key) to pick the periodic LogsCollector instead of following normal reconnect logic
  • Downgrade state is in-memory only. To regain SSE (perhaps after a BMC firmware update), the Health process has to be restarted. Slightly controversial? BMCs that are failing for unrelated reasons could end up downgraded to periodic until the next restart.
  • Observability is a one-shot tracing::warn! on each downgrade transition keyed on endpoint_key + reason; existing _stream_reconnections_total / _stream_errors_total already surface per-endpoint activity pre-downgrade
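
The failure budget and registry described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FailureBudget`, `DowngradeReason`, `mark_downgraded`, and the sliding `VecDeque` window are assumptions made for the example; only `LogDowngradeRegistry`, `is_downgraded`, `reset_transient`, and the config knob names come from the description.

```rust
use std::collections::{HashSet, VecDeque};
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Why an endpoint was downgraded (hypothetical names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DowngradeReason {
    SseNotAvailable,
    RepeatedConnectFailures,
}

/// Per-endpoint budget (hypothetical type): terminal failures are counted
/// directly, transient failures are counted inside a rolling time window.
struct FailureBudget {
    sse_not_available_threshold: u32,
    connect_failure_threshold: usize,
    connect_failure_window: Duration,
    sse_not_available_seen: u32,
    transient: VecDeque<Instant>,
}

impl FailureBudget {
    fn new(sse_threshold: u32, connect_threshold: usize, window: Duration) -> Self {
        Self {
            sse_not_available_threshold: sse_threshold,
            connect_failure_threshold: connect_threshold,
            connect_failure_window: window,
            sse_not_available_seen: 0,
            transient: VecDeque::new(),
        }
    }

    /// Record an `SseNotAvailable` response; returns a reason once the threshold is hit.
    fn record_sse_not_available(&mut self) -> Option<DowngradeReason> {
        self.sse_not_available_seen += 1;
        (self.sse_not_available_seen >= self.sse_not_available_threshold)
            .then_some(DowngradeReason::SseNotAvailable)
    }

    /// Record any other connect failure observed at `now`.
    fn record_transient(&mut self, now: Instant) -> Option<DowngradeReason> {
        // Drop failures that have aged out of the rolling window.
        while let Some(&oldest) = self.transient.front() {
            if now.duration_since(oldest) > self.connect_failure_window {
                self.transient.pop_front();
            } else {
                break;
            }
        }
        self.transient.push_back(now);
        (self.transient.len() >= self.connect_failure_threshold)
            .then_some(DowngradeReason::RepeatedConnectFailures)
    }

    /// A successful connect clears the transient failures.
    fn reset_transient(&mut self) {
        self.transient.clear();
    }
}

/// Process-wide, in-memory set of downgraded endpoints.
#[derive(Default)]
struct LogDowngradeRegistry {
    downgraded: Mutex<HashSet<String>>,
}

impl LogDowngradeRegistry {
    /// Returns true only on the first mark, so the caller can warn exactly once.
    fn mark_downgraded(&self, endpoint_key: &str) -> bool {
        self.downgraded.lock().unwrap().insert(endpoint_key.to_string())
    }

    fn is_downgraded(&self, endpoint_key: &str) -> bool {
        self.downgraded.lock().unwrap().contains(endpoint_key)
    }
}
```

Returning `true` only on the first insert is one way to get the one-shot warning per downgrade transition; the actual registry may expose this differently.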

Config surface

[collectors.logs]
mode = "auto" # new default

[collectors.logs.periodic]
logs_collection_interval = "5m"
state_refresh_interval = "30m"
logs_state_file = "/tmp/logs_collector_{machine_id}.json"

[collectors.logs.auto]
sse_not_available_threshold = 1
connect_failure_window = "5m"
connect_failure_threshold = 5

mode = "auto" requires a [collectors.logs.periodic] section (the downgrade target). [collectors.logs.auto] is optional with sensible defaults. Existing mode = "sse" (no periodic) and mode = "periodic" configs keep working unchanged.
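
As a rough illustration of the optional `[collectors.logs.auto]` section, the defaults quoted above (threshold 1, 5 failures in 5 minutes) could be expressed as a `Default` impl. The field types and the `effective_auto_config` helper are assumptions for this sketch; the real `AutoModeConfig` may be shaped differently.

```rust
use std::time::Duration;

/// Hypothetical mirror of the `[collectors.logs.auto]` table.
#[derive(Debug, Clone, PartialEq, Eq)]
struct AutoModeConfig {
    sse_not_available_threshold: u32,
    connect_failure_window: Duration,
    connect_failure_threshold: u32,
}

impl Default for AutoModeConfig {
    fn default() -> Self {
        Self {
            // One SseNotAvailable response is enough to downgrade.
            sse_not_available_threshold: 1,
            // Other connect failures are counted over a 5 minute window.
            connect_failure_window: Duration::from_secs(5 * 60),
            connect_failure_threshold: 5,
        }
    }
}

/// Hypothetical helper: use the section when present, defaults otherwise.
fn effective_auto_config(section: Option<AutoModeConfig>) -> AutoModeConfig {
    section.unwrap_or_default()
}
```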

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed - ONGOING
  • No testing required (docs, internal refactor, etc.)

Historical context

@Copilot actually did something useful with a comment and suggested we create #1005 as a follow-up on PR #711 (the SSE streaming log support PR).

@mkoci mkoci requested a review from a team as a code owner April 21, 2026 17:53
Copilot AI review requested due to automatic review settings April 21, 2026 17:53
@github-actions

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🕐 Last updated: 2026-04-21 17:55:04 UTC | Commit: 2998de4

Copilot AI left a comment

Pull request overview

Adds a new per-endpoint “auto” log collection mode so the health service can start with SSE streaming and transparently downgrade individual endpoints to periodic polling when SSE is unsupported or repeatedly fails, matching the documented behavior and fixing the “retry forever” issue.

Changes:

  • Introduces LogCollectionMode::Auto (new default) plus AutoModeConfig downgrade thresholds and updated config/docs/validation.
  • Adds an in-memory LogDowngradeRegistry and an auto-mode SSE task (spawn_sse_log_auto) with a per-endpoint failure budget.
  • Updates the discovery loop to prune finished log collectors so downgraded endpoints can be respawned as periodic collectors.
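
The prune-then-respawn interaction can be illustrated with a simplified model. Everything below except the `prune_finished_logs` name is hypothetical (`DiscoveryContext` fields, `LogSlot`, `CollectorKind`, `spawn_for_endpoint`); it only shows why the finished slot has to be dropped before the respawn pass for the periodic collector to take its place.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CollectorKind {
    Sse,
    Periodic,
}

/// Hypothetical stand-in for a running log collector task.
struct LogSlot {
    kind: CollectorKind,
    finished: bool,
}

/// Simplified discovery state: one log slot per endpoint plus the downgrade set.
struct DiscoveryContext {
    log_slots: HashMap<String, LogSlot>,
    downgraded: HashSet<String>,
}

impl DiscoveryContext {
    fn new() -> Self {
        Self { log_slots: HashMap::new(), downgraded: HashSet::new() }
    }

    /// Drop slots whose task has exited so the respawn pass sees them as vacant.
    fn prune_finished_logs(&mut self) {
        self.log_slots.retain(|_, slot| !slot.finished);
    }

    /// Auto arm: fill a vacant slot with SSE, or periodic if the endpoint was
    /// downgraded. An occupied slot is left untouched.
    fn spawn_for_endpoint(&mut self, endpoint_key: &str) -> CollectorKind {
        let kind = if self.downgraded.contains(endpoint_key) {
            CollectorKind::Periodic
        } else {
            CollectorKind::Sse
        };
        self.log_slots
            .entry(endpoint_key.to_string())
            .or_insert(LogSlot { kind, finished: false })
            .kind
    }
}
```

Without the prune step, the exited SSE slot would still occupy the map entry and the endpoint would never be re-filled.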

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| crates/health/src/discovery/spawn.rs | Adds Auto mode spawning logic (SSE first, periodic after downgrade) and unit tests. |
| crates/health/src/discovery/iteration.rs | Prunes finished log collectors before respawn pass to enable SSE→periodic replacement. |
| crates/health/src/discovery/context.rs | Adds LogDowngradeRegistry to discovery context and implements prune_finished_logs(). |
| crates/health/src/config.rs | Adds Auto mode + AutoModeConfig, updates docs, defaults, and validation/tests. |
| crates/health/src/collectors/runtime.rs | Exposes internal helpers and adds Collector::is_finished() + Collector::spawn_task() for auto-mode tasks. |
| crates/health/src/collectors/mod.rs | Re-exports downgrade-related types and wires in spawn_sse_log_auto. |
| crates/health/src/collectors/logs/mod.rs | Adds auto + downgrade modules and exports registry/reason types. |
| crates/health/src/collectors/logs/downgrade.rs | New in-memory downgrade registry with idempotent marking + tests. |
| crates/health/src/collectors/logs/auto.rs | New auto-mode SSE task with failure budgeting + tests. |
| crates/health/example/config.example.toml | Updates example config to mode = "auto" with required periodic + optional auto knobs. |

Comments suppressed due to low confidence (1)

crates/health/src/config.rs:538

  • LogsCollectorConfig::validate enforces that periodic exists for auto, but it doesn't validate the AutoModeConfig knob values themselves. As written, sse_not_available_threshold = 0 or connect_failure_threshold = 0 will cause immediate downgrade on first failure (since the counter will always be >= 0), and connect_failure_window = "0s" will effectively prevent accumulation (window resets every record). Consider adding validation (when mode = Auto and auto is set) that thresholds are > 0 and connect_failure_window is non-zero, returning a clear error message for invalid values.
impl LogsCollectorConfig {
    pub fn validate(&self) -> Result<(), String> {
        match self.mode {
            LogCollectionMode::Auto if self.periodic.is_none() => Err(
                "[collectors.logs.periodic] is required when mode = \"auto\" (used as the \
                 downgrade target when SSE is unavailable)"
                    .to_string(),
            ),
            LogCollectionMode::Periodic if self.periodic.is_none() => {
                Err("[collectors.logs.periodic] is required when mode = \"periodic\"".to_string())
            }
            LogCollectionMode::Sse if self.periodic.is_some() => {
                Err("[collectors.logs.periodic] should not be set when mode = \"sse\"".to_string())
            }
            _ => Ok(()),
        }
    }
}

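One possible shape for the knob validation suggested in this comment is sketched below. It is an assumption-laden example, not code from the PR: the field types, the standalone `validate` method on `AutoModeConfig`, and the error wording are all made up for illustration.

```rust
use std::time::Duration;

/// Hypothetical mirror of the `[collectors.logs.auto]` knobs.
struct AutoModeConfig {
    sse_not_available_threshold: u32,
    connect_failure_window: Duration,
    connect_failure_threshold: u32,
}

impl AutoModeConfig {
    /// Reject values that would downgrade immediately or never accumulate failures.
    fn validate(&self) -> Result<(), String> {
        if self.sse_not_available_threshold == 0 {
            return Err(
                "[collectors.logs.auto] sse_not_available_threshold must be > 0".to_string(),
            );
        }
        if self.connect_failure_threshold == 0 {
            return Err(
                "[collectors.logs.auto] connect_failure_threshold must be > 0".to_string(),
            );
        }
        if self.connect_failure_window.is_zero() {
            return Err(
                "[collectors.logs.auto] connect_failure_window must be non-zero".to_string(),
            );
        }
        Ok(())
    }
}
```

A check like this would presumably be called from the `Auto` arm of `LogsCollectorConfig::validate` once the periodic section has been confirmed present.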
Comment thread crates/health/src/discovery/spawn.rs
Comment thread crates/health/src/collectors/logs/downgrade.rs
Comment thread crates/health/src/collectors/runtime.rs Outdated
@mkoci mkoci force-pushed the feature/auto-log-collection-mode branch 8 times, most recently from 53c8ef3 to 2e4ea14 Compare April 24, 2026 20:23
Comment thread crates/health/src/discovery/spawn.rs Outdated
… endpoint

Today `collectors.logs.mode` is a single global choice. When an SSE-mode
endpoint returns `HealthError::SseNotAvailable`, the streaming collector
logs the error and retries forever with exponential backoff instead of
falling back to periodic polling. The `LogCollectionMode` doc string
claimed periodic was an automatic fallback, which was not true.

Introduce `LogCollectionMode::Auto` (the new default) which spawns SSE
per endpoint and transparently downgrades to periodic polling when SSE
is unsupported or chronically failing:

- a per-endpoint failure budget classifies connect failures as terminal
  (`SseNotAvailable`, threshold 1 by default) or transient (N failures
  in a rolling window, default 5 in 5 minutes); hitting a threshold
  records the endpoint in a process-wide `LogDowngradeRegistry` and
  exits the SSE task cleanly.
- the discovery loop prunes finished log collectors before its respawn
  step, so the vacated slot is re-filled with a periodic `LogsCollector`
  without operator intervention.
- once downgraded, the decision sticks for the lifetime of the process;
  a rolling restart of the health service is the operator-visible
  "retry SSE" action (documented in `LogCollectionMode`).
- observability is a one-shot `tracing::warn!` per downgrade transition
  keyed on `endpoint_key` and `reason`; existing stream metrics already
  surface per-endpoint reconnect activity.

Config validation now requires a `[collectors.logs.periodic]` block
under `mode = "auto"` so the downgrade target is always configured.
`[collectors.logs.auto]` is optional and defaults apply when omitted.
The example config demonstrates all three sections and `mode = "sse"`
with no periodic block remains valid for fleets that guarantee SSE.

Tests:
- budget state machine: SSE-not-available path, transient window math,
  window reset, reset_transient on successful connect.
- downgrade registry: insert-if-absent, idempotency, multi-endpoint.
- spawn decision: Auto + pre-seeded downgrade -> periodic spawn;
  Auto + no data sink -> graceful skip.
- config validation covers the new Auto arm and keeps SSE/Periodic
  regressions.

Fixes NVIDIA#1005

Signed-off-by: mkoci <mkoci@nvidia.com>
@mkoci mkoci force-pushed the feature/auto-log-collection-mode branch from ecc5476 to 5b8e6f1 Compare April 25, 2026 13:06
…ig tests

Signed-off-by: mkoci <mkoci@nvidia.com>
@mkoci mkoci force-pushed the feature/auto-log-collection-mode branch from a12580f to 59c55a5 Compare April 29, 2026 14:42

Development

Successfully merging this pull request may close these issues.

feat: (health) add a per-endpoint SSE-> Periodic auto-fallback for BMCs that don't support SSE

3 participants