feat(health): Auto log collection mode with per-endpoint SSE->Periodic fallback#1063
Open
mkoci wants to merge 3 commits intoNVIDIA:mainfrom
Open
feat(health): Auto log collection mode with per-endpoint SSE->Periodic fallback#1063mkoci wants to merge 3 commits intoNVIDIA:mainfrom
mkoci wants to merge 3 commits intoNVIDIA:mainfrom
Conversation
🔐 TruffleHog Secret Scan✅ No secrets or credentials found! Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉 🕐 Last updated: 2026-04-21 17:55:04 UTC | Commit: 2998de4 |
There was a problem hiding this comment.
Pull request overview
Adds a new per-endpoint “auto” log collection mode so the health service can start with SSE streaming and transparently downgrade individual endpoints to periodic polling when SSE is unsupported or repeatedly fails, matching the documented behavior and fixing the “retry forever” issue.
Changes:
- Introduces
LogCollectionMode::Auto(new default) plusAutoModeConfigdowngrade thresholds and updated config/docs/validation. - Adds an in-memory
LogDowngradeRegistryand an auto-mode SSE task (spawn_sse_log_auto) with a per-endpoint failure budget. - Updates the discovery loop to prune finished log collectors so downgraded endpoints can be respawned as periodic collectors.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/health/src/discovery/spawn.rs | Adds Auto mode spawning logic (SSE first, periodic after downgrade) and unit tests. |
| crates/health/src/discovery/iteration.rs | Prunes finished log collectors before respawn pass to enable SSE→periodic replacement. |
| crates/health/src/discovery/context.rs | Adds LogDowngradeRegistry to discovery context and implements prune_finished_logs(). |
| crates/health/src/config.rs | Adds Auto mode + AutoModeConfig, updates docs, defaults, and validation/tests. |
| crates/health/src/collectors/runtime.rs | Exposes internal helpers and adds Collector::is_finished() + Collector::spawn_task() for auto-mode tasks. |
| crates/health/src/collectors/mod.rs | Re-exports downgrade-related types and wires in spawn_sse_log_auto. |
| crates/health/src/collectors/logs/mod.rs | Adds auto + downgrade modules and exports registry/reason types. |
| crates/health/src/collectors/logs/downgrade.rs | New in-memory downgrade registry with idempotent marking + tests. |
| crates/health/src/collectors/logs/auto.rs | New auto-mode SSE task with failure budgeting + tests. |
| crates/health/example/config.example.toml | Updates example config to mode = "auto" with required periodic + optional auto knobs. |
Comments suppressed due to low confidence (1)
crates/health/src/config.rs:538
LogsCollectorConfig::validateenforces thatperiodicexists forauto, but it doesn't validate theAutoModeConfigknob values themselves. As written,sse_not_available_threshold = 0orconnect_failure_threshold = 0will cause immediate downgrade on first failure (since the counter will always be>= 0), andconnect_failure_window = "0s"will effectively prevent accumulation (window resets every record). Consider adding validation (whenmode = Autoandautois set) that thresholds are > 0 andconnect_failure_windowis non-zero, returning a clear error message for invalid values.
impl LogsCollectorConfig {
pub fn validate(&self) -> Result<(), String> {
match self.mode {
LogCollectionMode::Auto if self.periodic.is_none() => Err(
"[collectors.logs.periodic] is required when mode = \"auto\" (used as the \
downgrade target when SSE is unavailable)"
.to_string(),
),
LogCollectionMode::Periodic if self.periodic.is_none() => {
Err("[collectors.logs.periodic] is required when mode = \"periodic\"".to_string())
}
LogCollectionMode::Sse if self.periodic.is_some() => {
Err("[collectors.logs.periodic] should not be set when mode = \"sse\"".to_string())
}
_ => Ok(()),
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
53c8ef3 to
2e4ea14
Compare
yoks
reviewed
Apr 24, 2026
… endpoint Today `collectors.logs.mode` is a single global choice. When an SSE-mode endpoint returns `HealthError::SseNotAvailable`, the streaming collector logs the error and retries forever with exponential backoff instead of falling back to periodic polling. The `LogCollectionMode` doc string claimed periodic was an automatic fallback, which was not true. Introduce `LogCollectionMode::Auto` (the new default) which spawns SSE per endpoint and transparently downgrades to periodic polling when SSE is unsupported or chronically failing: - a per-endpoint failure budget classifies connect failures as terminal (`SseNotAvailable`, threshold 1 by default) or transient (N failures in a rolling window, default 5 in 5 minutes); hitting a threshold records the endpoint in a process-wide `LogDowngradeRegistry` and exits the SSE task cleanly. - the discovery loop prunes finished log collectors before its respawn step, so the vacated slot is re-filled with a periodic `LogsCollector` without operator intervention. - once downgraded, the decision sticks for the lifetime of the process; a rolling restart of the health service is the operator-visible "retry SSE" action (documented in `LogCollectionMode`). - observability is a one-shot `tracing::warn!` per downgrade transition keyed on `endpoint_key` and `reason`; existing stream metrics already surface per-endpoint reconnect activity. Config validation now requires a `[collectors.logs.periodic]` block under `mode = "auto"` so the downgrade target is always configured. `[collectors.logs.auto]` is optional and defaults apply when omitted. The example config demonstrates all three sections and `mode = "sse"` with no periodic block remains valid for fleets that guarantee SSE. Tests: - budget state machine: SSE-not-available path, transient window math, window reset, reset_transient on successful connect. - downgrade registry: insert-if-absent, idempotency, multi-endpoint. - spawn decision: Auto + pre-seeded downgrade -> periodic spawn; Auto + no data sink -> graceful skip. - config validation covers the new Auto arm and keeps SSE/Periodic regressions. Fixes NVIDIA#1005 Signed-off-by: mkoci <mkoci@nvidia.com>
ecc5476 to
5b8e6f1
Compare
…ig tests Signed-off-by: mkoci <mkoci@nvidia.com>
a12580f to
59c55a5
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #1005.
After the merge of #711,
collectors.logs.modeis a single global choice. When an SSE-mode endpoint returnsHealthError::SseNotAvailable,Collector::start_streaminglogs the error and retries forever with exponential backoff (max 30s) instead of falling back to periodic polling. TheLogCollectionModedoc string claimed periodic was an automatic fallback, which was a good idea and also a lie.This PR introduces
LogCollectionMode::Autoas the new default. It spawns SSE per-endpoint and transparently (tracing:warn!) downgrades to periodic polling when SSE is unsupported or repeatedly failing.How it works
SseNotAvailable- BMC doesn't support SSE subscriptions - fallback to periodicendpoint_keyin a process-wideLogDowngradeRegistryto downgrade the SSE taskprune_finished_logs()drops the exited slot before the respawn pass, and theAutoarm inspawn_collectors_for_endpointchecksregistry.is_downgraded(&key)to pick the periodicLogsCollectorinstead of following normal reconnect logictracing::warn!on each downgrade transition keyed onendpoint_key+reason; existing_stream_reconnections_total/_stream_errors_totalalready surface per-endpoint activity pre-downgradeConfig surface
mode = "auto"requires a[collectors.logs.periodic]section (the downgrade target).[collectors.logs.auto]is optional with sensible defaults. Existingmode = "sse"(no periodic) andmode = "periodic"configs keep working unchanged.Testing
Historical context
@Copilot actually did something useful with a comment and suggested we create #1005 as a follow-up on PR #711 (the SSE streaming log support PR).