feat(health): SSE streaming log support. OtlpSink. FileSync #711
yoks merged 7 commits into NVIDIA:main
Conversation
Matthias247
left a comment
I have not looked at everything so far.
But one high level question upfront: Is there a specific reason that SSE connections need to make the internal event handling asynchronous? My understanding is it could still be synchronous (the events are written to channels synchronously and can then be processed on the other side at arbitrary speed).
Is the reasoning here backpressure? If yes, it could likely be handled in other ways too. E.g. we could implement a channel where we just drop the newest (or oldest) events if the processing infrastructure can't keep up. But @yoks is certainly the expert on how it's designed so far and can provide more input.
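A minimal sketch of the drop-newest variant described here, using tokio's bounded mpsc and `try_send`; `HealthEvent` and `forward_drop_newest` are hypothetical names, not types from this PR:

```rust
use tokio::sync::mpsc;

/// Hypothetical stand-in for the health events discussed above.
#[derive(Debug)]
struct HealthEvent(String);

/// Forward an event without ever blocking the producer: if the bounded
/// channel is full, drop the newest event instead of applying backpressure.
fn forward_drop_newest(tx: &mpsc::Sender<HealthEvent>, event: HealthEvent) {
    if let Err(mpsc::error::TrySendError::Full(dropped)) = tx.try_send(event) {
        // A drop counter metric could be incremented here so the loss stays visible.
        eprintln!("event channel full, dropping newest event: {dropped:?}");
    }
}
```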
I think OTLP should be modelled as a sink; since it's async and can have backpressure, I think you can reuse
Yes, they should be synchronous as long as they are not hitting a slow sink (e.g. API/OTEL). I think the File sink should be fine as well. Answered above, but a design similar to how HealthOverride is handled looks like the way they can be handled.
Pull request overview
Adds experimental Redfish LogEntries streaming via SSE to reduce bursty periodic log scrapes, and introduces an async OTLP (gRPC) export path with bounded-channel backpressure to protect the health service under load.
Changes:
- Introduces an `EventPipeline` that preserves the existing synchronous sink behavior while optionally forwarding OTLP-relevant events through a bounded async channel for backpressure.
- Adds a streaming-collector runtime abstraction + SSE log collector, with logs collection now configurable as `sse` (default) or `periodic`.
- Implements OTLP log export (proto generation, event→OTLP conversion, drain task with batching/flush/retry); a rough shape of that drain loop is sketched below.
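A rough sketch of the batch/flush shape such a drain task usually has; the batch size, interval, and `String` event type are illustrative, not the values used in this PR:

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::interval;

const BATCH_SIZE: usize = 64;                          // illustrative value
const FLUSH_EVERY: Duration = Duration::from_secs(5);  // illustrative value

async fn drain_loop(mut rx: mpsc::Receiver<String>) {
    let mut batch = Vec::with_capacity(BATCH_SIZE);
    let mut ticker = interval(FLUSH_EVERY);

    loop {
        tokio::select! {
            maybe_event = rx.recv() => match maybe_event {
                Some(event) => {
                    batch.push(event);
                    if batch.len() >= BATCH_SIZE {
                        flush(&mut batch).await;
                    }
                }
                None => {
                    // Channel closed: final flush, then exit the task.
                    flush(&mut batch).await;
                    break;
                }
            },
            _ = ticker.tick() => flush(&mut batch).await,
        }
    }
}

async fn flush(batch: &mut Vec<String>) {
    if batch.is_empty() {
        return;
    }
    // In the real task this would build an OTLP ExportLogsServiceRequest
    // and send it over gRPC with retry/backoff.
    println!("flushing {} events", batch.len());
    batch.clear();
}
```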
Reviewed changes
Copilot reviewed 24 out of 26 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| Makefile.toml | Formatting adjustments for env vars and rustfmt tasks. |
| crates/ssh-console/Cargo.toml | Reformat tokio features list. |
| crates/health/src/processor/mod.rs | Adds handle_and_collect to capture original + derived events for downstream forwarding. |
| crates/health/src/pipeline.rs | New async EventPipeline that forwards OTLP-relevant events via bounded channel. |
| crates/health/src/otlp/mod.rs | New OTLP module with generated proto includes + re-exports. |
| crates/health/src/otlp/drain.rs | New OTLP drain task: connect, batch, flush, retry w/ backoff. |
| crates/health/src/otlp/convert.rs | Converts internal events/logs into OTLP ExportLogsServiceRequest. |
| crates/health/src/lib.rs | Wires EventPipeline, OTLP drain lifecycle, and SSE-related error type. |
| crates/health/src/discovery/spawn.rs | Spawns logs collectors based on mode (SSE vs periodic) and injects pipeline. |
| crates/health/src/discovery/iteration.rs | Threads EventPipeline through discovery iteration/spawn. |
| crates/health/src/config.rs | Adds OTLP sink config + logs mode config/validation and updates tests. |
| crates/health/src/collectors/sensors.rs | Switches sensor collector emission to async pipeline. |
| crates/health/src/collectors/runtime.rs | Adds streaming collector trait/runtime, SSE open helper, and backoff utilities. |
| crates/health/src/collectors/nvue/rest/collector.rs | Switches NVUE collector emission to async pipeline. |
| crates/health/src/collectors/nmxt.rs | Switches NMX-T collector emission to async pipeline. |
| crates/health/src/collectors/mod.rs | Re-exports streaming runtime types and new SSE log collector. |
| crates/health/src/collectors/logs/sse.rs | New SSE log collector mapping Redfish EventService payloads into log events. |
| crates/health/src/collectors/logs/periodic.rs | Periodic logs updated to use pipeline + optional file writer + rotation tweak. |
| crates/health/src/collectors/logs/mod.rs | New logs module split (periodic + sse). |
| crates/health/src/collectors/firmware.rs | Switches firmware collector emission to async pipeline. |
| crates/health/example/config.example.toml | Updates example to document collectors.logs.mode and periodic sub-table. |
| crates/health/Cargo.toml | Adds prost/tonic-prost deps and build dep for proto compilation. |
| crates/health/build.rs | New build script fetching/compiling OTLP protos for generated gRPC client. |
| crates/health/benches/processor_pipeline.rs | Adds bench for handle_and_collect overhead. |
| crates/bmc-explorer/Cargo.toml | Reformat nv-redfish feature list. |
| Cargo.lock | Locks new prost/tonic-prost deps (and related lockfile churn). |
Comments suppressed due to low confidence (1)
crates/health/src/collectors/logs/periodic.rs:612
`last_seen_ids` is updated even when writing the log batch to disk fails. If the log file is a required output, this can permanently skip entries on the next iteration (data loss on disk). Consider only advancing `last_seen_ids` (and `total_log_count`) after a successful write, or otherwise making the failure semantics explicit (e.g., retry on next run when the writer errors).
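One way to make that ordering explicit is to write first and advance the cursors only on success; the shapes below are hypothetical, based on the comment rather than the crate's actual code:

```rust
// Hypothetical sketch: advance bookkeeping only after a successful write,
// so a failed write is retried on the next iteration instead of skipped.
fn persist_batch(
    writer: &mut impl std::io::Write,
    batch: &[String],
    last_seen_ids: &mut Vec<String>,
    new_ids: Vec<String>,
    total_log_count: &mut usize,
) -> std::io::Result<()> {
    for line in batch {
        writeln!(writer, "{line}")?; // bail out before touching the cursors
    }
    writer.flush()?;

    // Only reached once the batch is durably written.
    *last_seen_ids = new_ids;
    *total_log_count += batch.len();
    Ok(())
}
```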
From the PR description I added after this comment:
If we don't care about dropping, this is much easier to implement, similar to the
kensimon
left a comment
Haven't looked at the code yet, but this really stuck out to me: I think we ought to vendor these files rather than shelling out to curl on every build.
kensimon
left a comment
These are all nitpicks and can be ignored, only the curl-in-build-rs issue is a "request changes" from me.
Add criterion benchmarks for OtlpSink enqueue throughput (low/high contention) and queue key construction latency. Remove the stale handle_and_collect benchmark that referenced a deleted method.
Signed-off-by: mkoci <mkoci@nvidia.com>
3ed7f31 to 8ab95ce
2c28312 to 2343042
- convert StreamingCollector::connect to #[async_trait]; drop the ConnectFuture Pin<Box> type alias. No blocker surfaced on re-evaluation -- async-trait is already a workspace dep and the trait is only used generically, never as dyn.
- replace the four tokio::select! blocks in Collector::start_streaming with CancellationToken::run_until_cancelled per kensimon's suggestions.
- consolidate batch.clear() to a single call site in OtlpDrainTask::flush(), right after build_export_request.
- prefer HealthReportProcessor::default() in benches (the struct derives Default).
Signed-off-by: mkoci <mkoci@nvidia.com>
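For reference, a minimal sketch of the `run_until_cancelled` shape mentioned in the second bullet, assuming tokio-util's `CancellationToken`; the item type and loop body are placeholders, not this PR's code:

```rust
use tokio_util::sync::CancellationToken;

/// Placeholder for pulling one streamed item; the real collector yields SSE events.
async fn next_item() -> Option<String> {
    Some("event".to_owned())
}

async fn stream_until_cancelled(cancel: CancellationToken) {
    loop {
        // Resolves to None if the token is cancelled before the future completes,
        // replacing an explicit tokio::select! over the cancellation branch.
        match cancel.run_until_cancelled(next_item()).await {
            Some(Some(item)) => println!("got {item}"),
            Some(None) => break, // stream ended
            None => break,       // cancelled
        }
    }
}
```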
Replace the per-endpoint connection_state Gauge<f64> (which encoded
browser EventSource readyState as 0/1/2) with a single service-level
active_sse_connections IntGauge on MetricsManager. This answers the
actual operational question -- "how many SSE streams are live?" -- as
a gauge read rather than a `count(connection_state == 1)` aggregation
across thousands of endpoints.
Reconnect-loop detection moves to the existing per-endpoint counters:
# live streams
carbide_hardware_health_active_sse_connections
# endpoints reconnecting but processing zero events
rate(carbide_hardware_health_stream_reconnections_total[5m]) > 0
and rate(carbide_hardware_health_stream_items_processed_total[5m]) == 0
Details:
- remove STREAM_STATE_{CONNECTING,OPEN,CLOSED} constants and the
connection_state Gauge from StreamMetrics + its registration.
- add active_sse_connections IntGauge to MetricsManager with an
active_sse_connections() accessor.
- add SseConnectionGuard (RAII): inc on construction, dec on Drop.
Construct inside the Ok(mut stream) arm of start_streaming so the
guard's lifetime is precisely the connected phase; drop covers every
exit path (cancel, stream error, stream end, task panic) and fires
before the reconnect backoff sleep.
- introduce StreamingCollectorStartContext, mirroring the existing
CollectorStartContext pattern, to keep start_streaming within the
clippy::too_many_arguments limit while threading the new gauge.
Addresses:
- Matthias247 NVIDIA#711 (r3017733441) -- oddly specific connection-state metric
- kensimon NVIDIA#711 (r3040326749) -- IntGauge for discrete state (field deleted entirely)
Signed-off-by: mkoci <mkoci@nvidia.com>
I don't think there's a great reason anymore. Just because we can, doesn't mean we should. "If you need to backpressure Redfish, you're already broken" -- Ed Tanous
Replace the per-endpoint connection_state Gauge<f64> (which encoded
browser EventSource readyState as 0/1/2) with a per-endpoint
stream_connected IntGauge that is 1 while the stream is live and 0
otherwise. The gauge is colocated with the existing per-stream counters
on StreamMetrics and carries the same {collector_type, endpoint_key}
labels, so all four SSE metrics are registered and torn down together
with the owning CollectorRegistry -- no new leak surface relative to
what already exists (tracked in NVIDIA#989).
Per endpoint remains consistent with yoks' guidance that "collectors
create per-endpoint, so this should be fine" and lets rack validation
filter on a specific endpoint_key to see connection state over time.
Service-wide count is derivable as `sum(stream_connected)`.
Details:
- remove STREAM_STATE_{CONNECTING,OPEN,CLOSED} constants.
- swap the connection_state Gauge<f64> field on StreamMetrics for an
IntGauge `connected` with the same const labels and same registry
lifecycle.
- add SseConnectionGuard (RAII): inc on construction, dec on Drop.
Construct inside the Ok(mut stream) arm of start_streaming so the
guard's lifetime is precisely the connected phase; Drop covers every
exit path (cancel, stream error, stream end, task panic) and fires
before the reconnect backoff sleep, so an endpoint in a reconnect
loop correctly reports 0 during backoff.
- introduce StreamingCollectorStartContext, mirroring the existing
CollectorStartContext pattern, to keep start_streaming readable.
Grafana:
# service-wide active count
sum(carbide_hardware_health_stream_connected)
# per-endpoint liveness
carbide_hardware_health_stream_connected{endpoint_key="..."}
# endpoints reconnecting but processing zero events
rate(carbide_hardware_health_stream_reconnections_total[5m]) > 0
and rate(carbide_hardware_health_stream_items_processed_total[5m]) == 0
Addresses:
- Matthias247 NVIDIA#711 (r3017733441) -- readyState integer encoding dropped
- kensimon NVIDIA#711 (r3040326749) -- IntGauge (0/1) instead of Gauge<f64>
Signed-off-by: mkoci <mkoci@nvidia.com>
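A minimal sketch of the RAII guard the commit describes, assuming the prometheus crate's `IntGauge`; only the inc-on-construct / dec-on-drop shape is shown, not the PR's actual wiring:

```rust
use prometheus::IntGauge;

/// RAII guard: the gauge reads 1 while a guard is alive and 0 once it drops,
/// so every exit path (cancel, stream error, stream end, panic unwind)
/// decrements before the reconnect backoff sleep.
struct SseConnectionGuard {
    connected: IntGauge,
}

impl SseConnectionGuard {
    fn new(connected: IntGauge) -> Self {
        connected.inc();
        Self { connected }
    }
}

impl Drop for SseConnectionGuard {
    fn drop(&mut self) {
        self.connected.dec();
    }
}
```

Constructing the guard inside the `Ok(mut stream)` arm ties its lifetime to the connected phase, so an endpoint stuck in a reconnect loop reports 0 during backoff.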
ea44a25 to b161caf
Finally found time to revisit this one! Re-requesting reviews from @Matthias247 @yoks @kensimon; perhaps @chet might take a look as well?
TL;DR: many nits were addressed, one remains. Need feedback from @kensimon here.
Issues Opened
kensimon
left a comment
LGTM, thanks for making those changes in build.rs.
Can this be merged now? @Matthias247 @yoks @kensimon
One thing for the future: look at the proto/gRPC. Most likely health will need to provide a streaming gRPC server for health data, so we will need to unify it. But that is not part of this PR; it is big as it is.
I can merge as soon as Matt says it is ready.
Do you mean health will be a standalone gRPC server for outputting telemetry? I have another stacked code change. After I refactor it, it will provide NVOS gNMI streaming support for Switch hosts - health as a gRPC client collector. |
Description
Adds support for streaming Redfish LogEntries through SSE (Server-Sent Events). Introduces an OTLP log export pipeline that batches and ships collected log events to an OpenTelemetry collector over gRPC.
Breaking Changes
Type of Change
Testing
Additional Notes
`tonic-prost-build` rather than pinning `tonic` to avoid the `opentelemetry-proto` conflict.
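A sketch of the vendored-proto direction discussed in review (compile checked-in protos instead of curl-ing them at build time), assuming tonic-prost-build exposes a `configure()`/`compile_protos()` builder like tonic-build does; the proto path below is illustrative:

```rust
// build.rs -- sketch only. Assumes the OTLP protos are vendored under proto/
// and that tonic-prost-build mirrors tonic-build's builder API.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    tonic_prost_build::configure()
        .build_server(false) // only the log-export gRPC client is needed
        .compile_protos(
            &["proto/opentelemetry/proto/collector/logs/v1/logs_service.proto"],
            &["proto"],
        )?;
    Ok(())
}
```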