feat(health): Auto log collection mode with per-endpoint SSE->Periodic fallback by mkoci · Pull Request #1063 · NVIDIA/infra-controller-core

mkoci · 2026-04-21T17:53:19Z

Summary

After the merge of #711, collectors.logs.mode is a single global choice. When an SSE-mode endpoint returns HealthError::SseNotAvailable, Collector::start_streaming logs the error and retries forever with exponential backoff (max 30s) instead of falling back to periodic polling. The LogCollectionMode doc string claimed periodic was an automatic fallback, which was a good idea and also a lie.

This PR introduces LogCollectionMode::Auto as the new default. It spawns SSE per endpoint and downgrades to periodic polling when SSE is unsupported or repeatedly failing. The downgrade needs no operator action and is announced with a tracing::warn!.

How it works

- A per-endpoint failure budget classifies connect failures:
  - SseNotAvailable: the BMC doesn't support SSE subscriptions, so fall back to periodic (threshold 1 by default)
  - All other connection errors (network, auth, 5xx, TLS) are counted in a rolling window (default 5 failures in 5 min)
- Hitting either threshold for a BMC records its endpoint_key in a process-wide LogDowngradeRegistry, and the SSE task exits cleanly
- In the discovery loop, prune_finished_logs() drops the exited slot before the respawn pass, and the Auto arm in spawn_collectors_for_endpoint checks registry.is_downgraded(&key) to pick the periodic LogsCollector instead of following normal reconnect logic
- Downgrading is in-memory. To regain SSE (perhaps after a BMC firmware update) the Health process has to be restarted. Slightly controversial? If many BMCs are failing for unrelated reasons, they would all be downgraded to periodic and stay there until a restart.
- Observability is a one-shot tracing::warn! on each downgrade transition keyed on endpoint_key + reason; existing _stream_reconnections_total / _stream_errors_total already surface per-endpoint activity pre-downgrade
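
For reviewers who want the failure budget in concrete terms, here is a minimal sketch of the classification described above. It is not the code in this PR: `FailureBudget`, `ConnectFailure`, `record`, and the `DowngradeReason` variant names are illustrative assumptions, and the rolling window is modelled as a simple deque of timestamps, which may differ from the actual window bookkeeping in `auto.rs`.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Why an endpoint was downgraded (variant names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DowngradeReason {
    SseNotAvailable,
    RepeatedConnectFailures,
}

/// Hypothetical classification of a failed SSE connect attempt.
#[derive(Debug, Clone, Copy)]
pub enum ConnectFailure {
    /// The BMC reported that SSE subscriptions are unsupported.
    SseNotAvailable,
    /// Network, auth, 5xx, TLS, and similar errors.
    Other,
}

/// Per-endpoint failure budget (hypothetical type, one per SSE task).
pub struct FailureBudget {
    sse_not_available_threshold: u32,
    connect_failure_threshold: usize,
    connect_failure_window: Duration,
    sse_not_available_count: u32,
    transient: VecDeque<Instant>,
}

impl FailureBudget {
    pub fn new(
        sse_not_available_threshold: u32,
        connect_failure_threshold: usize,
        connect_failure_window: Duration,
    ) -> Self {
        Self {
            sse_not_available_threshold,
            connect_failure_threshold,
            connect_failure_window,
            sse_not_available_count: 0,
            transient: VecDeque::new(),
        }
    }

    /// Records a failure observed at `now`.
    /// Returns `Some(reason)` once a threshold has been reached.
    pub fn record(&mut self, failure: ConnectFailure, now: Instant) -> Option<DowngradeReason> {
        match failure {
            ConnectFailure::SseNotAvailable => {
                self.sse_not_available_count += 1;
                (self.sse_not_available_count >= self.sse_not_available_threshold)
                    .then_some(DowngradeReason::SseNotAvailable)
            }
            ConnectFailure::Other => {
                // Drop failures that have aged out of the rolling window.
                while let Some(&oldest) = self.transient.front() {
                    if now.duration_since(oldest) > self.connect_failure_window {
                        self.transient.pop_front();
                    } else {
                        break;
                    }
                }
                self.transient.push_back(now);
                (self.transient.len() >= self.connect_failure_threshold)
                    .then_some(DowngradeReason::RepeatedConnectFailures)
            }
        }
    }

    /// Clears the transient failures after a successful connect.
    pub fn reset_transient(&mut self) {
        self.transient.clear();
    }
}
```

With the defaults from this PR (`FailureBudget::new(1, 5, Duration::from_secs(300))`), a single `SseNotAvailable` trips the budget immediately, while other errors trip it only on the fifth failure inside five minutes.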

Config surface

[collectors.logs]
mode = "auto" # new default

[collectors.logs.periodic]
logs_collection_interval = "5m"
state_refresh_interval = "30m"
logs_state_file = "/tmp/logs_collector_{machine_id}.json"

[collectors.logs.auto]
sse_not_available_threshold = 1
connect_failure_window = "5m"
connect_failure_threshold = 5

mode = "auto" requires a [collectors.logs.periodic] section (the downgrade target). [collectors.logs.auto] is optional with sensible defaults. Existing mode = "sse" (no periodic) and mode = "periodic" configs keep working unchanged.
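
To make "optional with sensible defaults" concrete, the sketch below shows one way the defaults could be expressed. `AutoModeConfig` and the three field names come from this PR; the field types, the `Default` impl, and the `effective_auto_config` helper are assumptions for illustration (the real struct is presumably deserialized from TOML, which is left out here to keep the example dependency-free).

```rust
use std::time::Duration;

/// Downgrade thresholds for `mode = "auto"` (field types are assumptions).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AutoModeConfig {
    pub sse_not_available_threshold: u32,
    pub connect_failure_window: Duration,
    pub connect_failure_threshold: u32,
}

impl Default for AutoModeConfig {
    fn default() -> Self {
        Self {
            // One SseNotAvailable response is enough to downgrade.
            sse_not_available_threshold: 1,
            // Rolling window for all other connect errors: 5 minutes.
            connect_failure_window: Duration::from_secs(5 * 60),
            // Five failures inside the window trigger a downgrade.
            connect_failure_threshold: 5,
        }
    }
}

/// Hypothetical helper: use `[collectors.logs.auto]` when it is present,
/// otherwise fall back to the defaults above.
pub fn effective_auto_config(configured: Option<AutoModeConfig>) -> AutoModeConfig {
    configured.unwrap_or_default()
}
```

Omitting `[collectors.logs.auto]` entirely would then behave the same as writing out the three values shown in the config above.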

Testing

Unit tests added/updated
Integration tests added/updated
Manual testing performed - ONGOING
No testing required (docs, internal refactor, etc.)

Historical context

@Copilot actually did something useful with a comment and suggested we create #1005 as a follow-up on PR #711 (the SSE streaming log support PR).

github-actions · 2026-04-21T17:55:04Z

🔐 TruffleHog Secret Scan

✅ No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🕐 Last updated: 2026-04-21 17:55:04 UTC | Commit: 2998de4

Copilot

Pull request overview

Adds a new per-endpoint “auto” log collection mode so the health service can start with SSE streaming and transparently downgrade individual endpoints to periodic polling when SSE is unsupported or repeatedly fails, matching the documented behavior and fixing the “retry forever” issue.

Changes:

Introduces LogCollectionMode::Auto (new default) plus AutoModeConfig downgrade thresholds and updated config/docs/validation.
Adds an in-memory LogDowngradeRegistry and an auto-mode SSE task (spawn_sse_log_auto) with a per-endpoint failure budget.
Updates the discovery loop to prune finished log collectors so downgraded endpoints can be respawned as periodic collectors.
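
The interplay between the registry and the respawn pass can be sketched as follows. This is a simplified illustration, not the code under review: `is_downgraded` is the method named in the PR description, while `mark_downgraded`, `LogCollectorKind`, `pick_log_collector`, and the `DowngradeReason` variants are hypothetical names, and a `Mutex<HashMap>` is just one plausible backing store.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Why an endpoint was downgraded (variant names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DowngradeReason {
    SseNotAvailable,
    RepeatedConnectFailures,
}

/// In-memory, process-wide record of endpoints that gave up on SSE.
#[derive(Debug, Clone, Default)]
pub struct LogDowngradeRegistry {
    inner: Arc<Mutex<HashMap<String, DowngradeReason>>>,
}

impl LogDowngradeRegistry {
    /// Marks an endpoint as downgraded. Returns `true` only for the first
    /// mark, so the caller can emit its warning exactly once per endpoint.
    pub fn mark_downgraded(&self, endpoint_key: &str, reason: DowngradeReason) -> bool {
        let mut map = self.inner.lock().expect("registry lock poisoned");
        if map.contains_key(endpoint_key) {
            return false;
        }
        map.insert(endpoint_key.to_owned(), reason);
        true
    }

    pub fn is_downgraded(&self, endpoint_key: &str) -> bool {
        self.inner
            .lock()
            .expect("registry lock poisoned")
            .contains_key(endpoint_key)
    }
}

/// Which log collector the respawn pass should start (hypothetical enum).
#[derive(Debug, PartialEq, Eq)]
pub enum LogCollectorKind {
    SseAuto,
    Periodic,
}

/// Decision taken for an empty log-collector slot in `Auto` mode.
pub fn pick_log_collector(registry: &LogDowngradeRegistry, endpoint_key: &str) -> LogCollectorKind {
    if registry.is_downgraded(endpoint_key) {
        LogCollectorKind::Periodic
    } else {
        LogCollectorKind::SseAuto
    }
}
```

Because the first-mark return value is the only place a transition is reported, repeated marks for the same endpoint stay silent, which matches the "idempotent marking" described for `downgrade.rs`.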

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| crates/health/src/discovery/spawn.rs | Adds `Auto` mode spawning logic (SSE first, periodic after downgrade) and unit tests. |
| crates/health/src/discovery/iteration.rs | Prunes finished log collectors before respawn pass to enable SSE→periodic replacement. |
| crates/health/src/discovery/context.rs | Adds `LogDowngradeRegistry` to discovery context and implements `prune_finished_logs()`. |
| crates/health/src/config.rs | Adds `Auto` mode + `AutoModeConfig`, updates docs, defaults, and validation/tests. |
| crates/health/src/collectors/runtime.rs | Exposes internal helpers and adds `Collector::is_finished()` + `Collector::spawn_task()` for auto-mode tasks. |
| crates/health/src/collectors/mod.rs | Re-exports downgrade-related types and wires in `spawn_sse_log_auto`. |
| crates/health/src/collectors/logs/mod.rs | Adds `auto` + `downgrade` modules and exports registry/reason types. |
| crates/health/src/collectors/logs/downgrade.rs | New in-memory downgrade registry with idempotent marking + tests. |
| crates/health/src/collectors/logs/auto.rs | New auto-mode SSE task with failure budgeting + tests. |
| crates/health/example/config.example.toml | Updates example config to `mode = "auto"` with required periodic + optional auto knobs. |

Comments suppressed due to low confidence (1)

crates/health/src/config.rs:538

LogsCollectorConfig::validate enforces that periodic exists for auto, but it doesn't validate the AutoModeConfig knob values themselves. As written, sse_not_available_threshold = 0 or connect_failure_threshold = 0 will cause immediate downgrade on first failure (since the counter will always be >= 0), and connect_failure_window = "0s" will effectively prevent accumulation (window resets every record). Consider adding validation (when mode = Auto and auto is set) that thresholds are > 0 and connect_failure_window is non-zero, returning a clear error message for invalid values.

impl LogsCollectorConfig {
    pub fn validate(&self) -> Result<(), String> {
        match self.mode {
            LogCollectionMode::Auto if self.periodic.is_none() => Err(
                "[collectors.logs.periodic] is required when mode = \"auto\" (used as the \
                 downgrade target when SSE is unavailable)"
                    .to_string(),
            ),
            LogCollectionMode::Periodic if self.periodic.is_none() => {
                Err("[collectors.logs.periodic] is required when mode = \"periodic\"".to_string())
            }
            LogCollectionMode::Sse if self.periodic.is_some() => {
                Err("[collectors.logs.periodic] should not be set when mode = \"sse\"".to_string())
            }
            _ => Ok(()),
        }
    }
}
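
A possible shape for the suggested knob validation is sketched below. It is an assumption-laden illustration rather than a patch: the field types and the `AutoModeConfig::validate` method are hypothetical, and the error strings are only modelled on the existing messages in `LogsCollectorConfig::validate`.

```rust
use std::time::Duration;

/// Downgrade thresholds for `mode = "auto"` (field types are assumptions).
#[derive(Debug, Clone)]
pub struct AutoModeConfig {
    pub sse_not_available_threshold: u32,
    pub connect_failure_window: Duration,
    pub connect_failure_threshold: u32,
}

impl AutoModeConfig {
    /// Rejects values that would make the failure budget degenerate.
    pub fn validate(&self) -> Result<(), String> {
        if self.sse_not_available_threshold == 0 {
            return Err(
                "[collectors.logs.auto] sse_not_available_threshold must be greater than 0"
                    .to_string(),
            );
        }
        if self.connect_failure_threshold == 0 {
            return Err(
                "[collectors.logs.auto] connect_failure_threshold must be greater than 0"
                    .to_string(),
            );
        }
        if self.connect_failure_window.is_zero() {
            return Err(
                "[collectors.logs.auto] connect_failure_window must be non-zero".to_string(),
            );
        }
        Ok(())
    }
}
```

`LogsCollectorConfig::validate` could call this from the `Auto` arm whenever `[collectors.logs.auto]` is set, so a misconfigured threshold fails at startup instead of silently downgrading every endpoint.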

… endpoint

Today `collectors.logs.mode` is a single global choice. When an SSE-mode endpoint returns `HealthError::SseNotAvailable`, the streaming collector logs the error and retries forever with exponential backoff instead of falling back to periodic polling. The `LogCollectionMode` doc string claimed periodic was an automatic fallback, which was not true.

Introduce `LogCollectionMode::Auto` (the new default) which spawns SSE per endpoint and transparently downgrades to periodic polling when SSE is unsupported or chronically failing:

- a per-endpoint failure budget classifies connect failures as terminal (`SseNotAvailable`, threshold 1 by default) or transient (N failures in a rolling window, default 5 in 5 minutes); hitting a threshold records the endpoint in a process-wide `LogDowngradeRegistry` and exits the SSE task cleanly.
- the discovery loop prunes finished log collectors before its respawn step, so the vacated slot is re-filled with a periodic `LogsCollector` without operator intervention.
- once downgraded, the decision sticks for the lifetime of the process; a rolling restart of the health service is the operator-visible "retry SSE" action (documented in `LogCollectionMode`).
- observability is a one-shot `tracing::warn!` per downgrade transition keyed on `endpoint_key` and `reason`; existing stream metrics already surface per-endpoint reconnect activity.

Config validation now requires a `[collectors.logs.periodic]` block under `mode = "auto"` so the downgrade target is always configured. `[collectors.logs.auto]` is optional and defaults apply when omitted. The example config demonstrates all three sections and `mode = "sse"` with no periodic block remains valid for fleets that guarantee SSE.

Tests:

- budget state machine: SSE-not-available path, transient window math, window reset, reset_transient on successful connect.
- downgrade registry: insert-if-absent, idempotency, multi-endpoint.
- spawn decision: Auto + pre-seeded downgrade -> periodic spawn; Auto + no data sink -> graceful skip.
- config validation covers the new Auto arm and keeps SSE/Periodic regressions.

Fixes NVIDIA#1005

Signed-off-by: mkoci <mkoci@nvidia.com>

mkoci requested a review from a team as a code owner April 21, 2026 17:53

Copilot AI review requested due to automatic review settings April 21, 2026 17:53

Copilot started reviewing on behalf of mkoci April 21, 2026 17:53

Copilot AI reviewed Apr 21, 2026

Comment thread crates/health/src/discovery/spawn.rs

Comment thread crates/health/src/collectors/logs/downgrade.rs

Comment thread crates/health/src/collectors/runtime.rs Outdated

mkoci force-pushed the feature/auto-log-collection-mode branch 8 times, most recently from 53c8ef3 to 2e4ea14 Compare April 24, 2026 20:23

yoks reviewed Apr 24, 2026

Comment thread crates/health/src/discovery/spawn.rs Outdated

mkoci force-pushed the feature/auto-log-collection-mode branch from ecc5476 to 5b8e6f1 Compare April 25, 2026 13:06

lint(health): remove trailing blank line before closing brace in conf…

59c55a5

…ig tests Signed-off-by: mkoci <mkoci@nvidia.com>

mkoci force-pushed the feature/auto-log-collection-mode branch from a12580f to 59c55a5 Compare April 29, 2026 14:42

Merge branch 'main' into feature/auto-log-collection-mode

7d63907

mkoci wants to merge 3 commits into NVIDIA:main from mkoci:feature/auto-log-collection-mode

3 participants