
feat(health): NVUE gNMI streaming collector + OTLP metrics export #1097

Draft
mkoci wants to merge 6 commits into NVIDIA:main from mkoci:feature/gnmi_streaming


mkoci (Contributor) commented Apr 23, 2026

Description

Adds gRPC (gNMI) streaming telemetry collection from NVOS on NVLink Switches to the health service, plus OTLP metric export so the high-cardinality switch data can be pushed to an OTel Collector instead of accumulating in Prometheus gauges.

gNMI collector ([collectors.nvue.gnmi], disabled by default) subscribes to gNMI SAMPLE paths for /components, /interfaces, and /leak-sensors. It uses a long-lived bidirectional gRPC stream with reconnection support (exponential backoff + jitter) and fits into the existing discovery/sharding loop with proper spawn, cleanup, and cancellation. Metrics flow through the normal DataSink pipeline — no parallel GaugeVec registration.
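
For reviewers less familiar with the reconnection policy mentioned above, here is a minimal sketch of exponential backoff with full jitter. It is not the code in this PR: the `Backoff` type, its field names, and the choice of "full jitter" are assumptions made for illustration, and the random sample is passed in by the caller so the sketch needs no external crate.

```rust
use std::time::Duration;

/// Hypothetical reconnect policy. Names and values are illustrative only,
/// not taken from the PR.
#[derive(Debug, Clone, Copy)]
pub struct Backoff {
    pub base: Duration,
    pub max: Duration,
}

impl Backoff {
    /// Upper bound for a given attempt: base * 2^attempt, capped at `max`.
    pub fn ceiling(&self, attempt: u32) -> Duration {
        // Saturate instead of overflowing for very large attempt counts.
        let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
        self.base
            .checked_mul(factor)
            .unwrap_or(self.max)
            .min(self.max)
    }

    /// Full jitter: scale the ceiling by `unit`, a random sample in
    /// [0.0, 1.0] supplied by the caller. Non-finite samples count as 0.
    pub fn delay(&self, attempt: u32, unit: f64) -> Duration {
        let unit = if unit.is_finite() { unit.clamp(0.0, 1.0) } else { 0.0 };
        self.ceiling(attempt).mul_f64(unit)
    }
}
```

A reconnect loop would sleep for `delay(attempt, sample)` after each failed attempt and reset `attempt` to 0 once the stream is established. Jitter spreads reconnects out so that many subscribers do not retry against the same switch at the same instant.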

OTLP metrics export extends OtlpSink (introduced in #711 for logs) with a MetricsService drain alongside the existing LogsService drain. Both share one OTel Collector endpoint but use separate queues and drain tasks. PrometheusSink stays in the composite during the migration period so /metrics keeps serving.
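
The "separate queues and drain tasks" shape can be sketched roughly as below. This is an illustration only: `TwoQueueSink`, `LogRecord`, `MetricPoint`, and `start` are hypothetical names, and std threads with bounded channels stand in for whatever async tasks and export calls the real OtlpSink uses.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread::{self, JoinHandle};

// Hypothetical payload types, not the PR's actual event types.
#[derive(Debug, PartialEq)]
pub struct LogRecord(pub String);

#[derive(Debug, PartialEq)]
pub struct MetricPoint {
    pub name: String,
    pub value: f64,
}

/// One sink handle, two independent bounded queues.
pub struct TwoQueueSink {
    logs: SyncSender<LogRecord>,
    metrics: SyncSender<MetricPoint>,
}

impl TwoQueueSink {
    /// Non-blocking enqueue. Returns false if the item was dropped
    /// because the queue is full or its drain task has exited.
    pub fn push_log(&self, record: LogRecord) -> bool {
        self.logs.try_send(record).is_ok()
    }

    pub fn push_metric(&self, point: MetricPoint) -> bool {
        self.metrics.try_send(point).is_ok()
    }
}

/// Drain one queue until every sender is dropped; returns the item count.
fn drain<T: Send + 'static>(
    rx: Receiver<T>,
    mut export: impl FnMut(T) + Send + 'static,
) -> JoinHandle<usize> {
    thread::spawn(move || {
        let mut exported = 0;
        for item in rx {
            export(item);
            exported += 1;
        }
        exported
    })
}

/// Start both drain tasks. The `export_*` closures stand in for the
/// calls that would send batches to the collector endpoint.
pub fn start(
    capacity: usize,
    export_log: impl FnMut(LogRecord) + Send + 'static,
    export_metric: impl FnMut(MetricPoint) + Send + 'static,
) -> (TwoQueueSink, JoinHandle<usize>, JoinHandle<usize>) {
    let (log_tx, log_rx) = sync_channel(capacity);
    let (metric_tx, metric_rx) = sync_channel(capacity);
    let sink = TwoQueueSink { logs: log_tx, metrics: metric_tx };
    (sink, drain(log_rx, export_log), drain(metric_rx, export_metric))
}
```

Because each signal has its own queue, a burst of high-cardinality metrics can fill and drop on the metrics side without delaying or evicting log records, and the reverse holds too.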

Builds on #711 (SSE streaming + OtlpSink for logs). Protos vendored for reproducible offline builds, same as #711.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

  • gNMI subscriptions currently skip server certificate validation (AcceptAnyCertVerifier in tls.rs), but traffic remains encrypted. Code changes will be needed after we properly manage and install TLS certificates during switch configuration.
  • Needs rack testing against live NVOS switches with gNMI enabled.
  • The StreamingConnectionGuard (RAII inc/dec on the per-stream connected gauge) from #711 (feat(health): SSE streaming log support. OtlpSink. FileSync) is generalized here: the gNMI subscriber constructs one scoped to the READY phase of its gRPC stream so Drop covers every exit path.
  • Collector::spawn_task is a new generic spawn helper for streaming collectors that don't fit the StreamingCollector trait shape (gNMI subscriptions are bidirectional and multiplex paths on a single stream).
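
To illustrate the RAII guard idea from the notes above, here is a minimal sketch. It is not the StreamingConnectionGuard from this PR: `ConnectionGuard` and `run_stream` are hypothetical, and a plain `AtomicI64` stands in for the real per-stream connected gauge.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Hypothetical guard: increments the gauge on creation and decrements
/// it on drop, so the decrement cannot be skipped by an early return.
pub struct ConnectionGuard {
    gauge: Arc<AtomicI64>,
}

impl ConnectionGuard {
    pub fn new(gauge: Arc<AtomicI64>) -> Self {
        gauge.fetch_add(1, Ordering::Relaxed);
        Self { gauge }
    }
}

impl Drop for ConnectionGuard {
    fn drop(&mut self) {
        self.gauge.fetch_sub(1, Ordering::Relaxed);
    }
}

/// Hypothetical stream body: the guard lives only while the stream is
/// "ready". Both the `?` early return and the normal return drop it.
pub fn run_stream(
    gauge: &Arc<AtomicI64>,
    updates: &[Result<u32, &'static str>],
) -> Result<u32, &'static str> {
    let _guard = ConnectionGuard::new(Arc::clone(gauge));
    let mut sum = 0;
    for update in updates {
        sum += (*update)?; // an error exits here; the guard still drops
    }
    Ok(sum)
}
```

Scoping the guard to the ready phase means the gauge counts only established streams and returns to its previous value whether the stream ends cleanly, errors out, or is cancelled.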


copy-pr-bot Bot commented Apr 23, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

mkoci added 4 commits April 23, 2026 13:24
mkoci force-pushed the feature/gnmi_streaming branch from 0648153 to 1c442ee April 23, 2026 11:33
ajf (Collaborator) commented May 4, 2026

@yoks @poroh can you help to review this?

ajf requested review from poroh and yoks May 4, 2026 23:06
mkoci (Contributor Author) commented May 4, 2026

> @yoks @poroh can you help to review this?

It's a draft!

Need to wrap up a few small tweaks and write a proper description. Should land this week!

# Conflicts:
#	Cargo.lock
#	crates/health/src/sink/otlp.rs

copy-pr-bot Bot commented May 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

mkoci force-pushed the feature/gnmi_streaming branch from e1f2701 to 7c9e28c May 6, 2026 20:05