
Conversation

@rafabene (Contributor) commented Jan 2, 2026

Summary

Adds a comprehensive tracing and telemetry standard for all HyperFleet components (API, Sentinel, Adapters).

New Document: hyperfleet/standards/tracing.md

  • OpenTelemetry Adoption: SDK requirements, justification for vendor-neutral approach
  • Configuration: Environment variables following OTEL conventions
  • W3C Trace Context Propagation: HTTP headers and CloudEvents extensions
  • Required Spans: Per-component (API, Sentinel, Adapters) with naming conventions
  • Standard Span Attributes: Semantic conventions + HyperFleet-specific attributes
  • Sampling Strategy: Head-based vs tail-based comparison, environment-specific rates
  • Exporter Configuration: OTLP setup for Kubernetes and local development
  • Logging Integration: trace_id/span_id correlation with logs
  • Best Practices: Error handling, span lifecycle, context propagation

Cross-Reference Updates

| Document | Change |
|----------|--------|
| logging-specification.md | Added References section |
| metrics.md | Added tracing reference |
| error-model.md | Added tracing reference |
| adapter-frame-design.md | Aligned sample rates + added reference |
| adapter-deployment.md | Added tracing and logging references |

Sampling Rates (Aligned)

| Environment | Rate |
|-------------|------|
| Development | 1.0 (100%) |
| Staging | 0.1 (10%) |
| Production | 0.01 (1%) |

Acceptance Criteria

  • Standard documented in hyperfleet/standards/tracing.md
  • OpenTelemetry adoption decision documented
  • Required spans and attributes defined
  • Sampling strategy defined
  • Follow-up ticket created: HYPERFLEET-433

Related

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive Tracing & Telemetry Standard with OpenTelemetry guidance, exporter and sampling recommendations.
    • Updated typical trace sampling rates: Staging 0.5→0.1, Production 0.1→0.01 (Development unchanged).
    • Added tracing references across deployment, logging, metrics and error-model docs; added a References subsection in logging guidance.
    • No API, configuration, or behavioral changes.


@coderabbitai bot (Contributor) commented Jan 2, 2026

Walkthrough

Added a new HyperFleet Tracing and Telemetry Standard at hyperfleet/standards/tracing.md describing OpenTelemetry-based distributed tracing (goals, SDK requirements, env var configuration, per-component service names, resource attributes, HTTP and CloudEvents propagation, required spans/attributes, sampling strategy with environment-specific rates, OTLP exporter defaults, deployment guidance, logging correlation, and span lifecycle examples). Updated adapter-frame-design.md sample rates (Staging 0.5 → 0.1; Production 0.1 → 0.01) and added tracing references across multiple docs and the logging specification. All changes are documentation-only except the new tracing.md file.

Sequence Diagram(s)

```mermaid
sequenceDiagram
  rect rgb(248,249,251)
    participant Client
    participant Adapter
    participant OTLP_Collector as "OTLP Collector"
    participant Tracing_Backend as "Tracing Backend"
  end

  Client->>Adapter: HTTP request / CloudEvent (may carry trace context)
  Note right of Adapter: start/continue spans\napply sampling, add attributes
  Adapter->>OTLP_Collector: Export spans (OTLP gRPC/HTTP)
  OTLP_Collector->>Tracing_Backend: Forward traces for storage/analysis
  Tracing_Backend-->>OTLP_Collector: Ack / ingestion status
  OTLP_Collector-->>Adapter: Export result (async)
  Note left of Client: Logs/metrics correlated via trace IDs
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Suggested reviewers

  • ciaranRoche
  • xueli181114
  • rh-amarin

Pre-merge checks

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title clearly and concisely summarizes the main change: adding a comprehensive tracing and telemetry standard for HyperFleet. It is specific, directly related to the primary deliverable (tracing.md), and avoids vague terminology. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@rafabene force-pushed the HYPERFLEET-383-tracing-telemetry-standard branch from 985cc20 to ca745a3 on January 2, 2026 at 17:29
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
hyperfleet/standards/tracing.md (4)

54-64: Clarify usage of TRACING_ENABLED environment variable.

The config table introduces TRACING_ENABLED (line 63) but it is never referenced in code examples or exporter sections. Either remove it if unused, or add a brief explanation of when/how to use it and what happens when set to false.
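
For illustration only, a minimal sketch of how TRACING_ENABLED could gate SDK setup (using the OpenTelemetry Go SDK; the gating behavior is an assumption, since the standard does not yet define it). With the SDK left uninitialized, tracers are no-ops and nothing is exported.

```go
package telemetry

import (
	"context"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// SetupTracing is a hypothetical helper: when TRACING_ENABLED=false it skips
// SDK setup, so otel.Tracer(...) returns no-op tracers and no spans are exported.
func SetupTracing(ctx context.Context) (shutdown func(context.Context) error, err error) {
	if os.Getenv("TRACING_ENABLED") == "false" {
		return func(context.Context) error { return nil }, nil
	}
	exporter, err := otlptracegrpc.New(ctx) // endpoint taken from OTEL_EXPORTER_OTLP_* env vars
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	otel.SetTracerProvider(tp)
	return tp.Shutdown, nil
}
```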


282-289: Configuration example should include all three environments.

The example shows only Development and Production. For clarity, also include Staging with its 0.1 sampling rate to match the sampling strategy section.

🔎 Proposed addition
```diff
 # Development
 OTEL_TRACES_SAMPLER=always_on

+# Staging
+OTEL_TRACES_SAMPLER=parentbased_traceidratio
+OTEL_TRACES_SAMPLER_ARG=0.1
+
 # Production
 OTEL_TRACES_SAMPLER=parentbased_traceidratio
 OTEL_TRACES_SAMPLER_ARG=0.01
```

61-61: Baggage propagation mentioned in config but not documented in usage section.

The OTEL_PROPAGATORS config includes baggage (line 61), but there is no section explaining when/how to use W3C Baggage in HyperFleet spans or context. Consider either clarifying whether baggage propagation is required, or adding a brief section with examples.
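
If baggage propagation is kept, a hedged sketch of what its use could look like with the OpenTelemetry Go SDK (the key name hyperfleet.cluster_id is only an assumption for illustration):

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel/baggage"
)

// AttachClusterID shows W3C Baggage carrying a low-cardinality value across
// service boundaries; the configured "baggage" propagator injects and extracts
// it alongside traceparent.
func AttachClusterID(ctx context.Context, clusterID string) (context.Context, error) {
	member, err := baggage.NewMember("hyperfleet.cluster_id", clusterID)
	if err != nil {
		return ctx, err
	}
	bag, err := baggage.New(member)
	if err != nil {
		return ctx, err
	}
	return baggage.ContextWithBaggage(ctx, bag), nil
}

// ClusterIDFromContext reads the value back in a downstream service.
func ClusterIDFromContext(ctx context.Context) string {
	return baggage.FromContext(ctx).Member("hyperfleet.cluster_id").Value()
}
```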


303-342: Exporter configuration examples clear for both environments; consider adding TLS note for production.

The OTLP examples are practical for local development and Kubernetes. For completeness, consider a brief note in the Kubernetes section mentioning TLS configuration in production environments (though detailed TLS setup may belong in infrastructure docs).

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 985cc20 and ca745a3.

📒 Files selected for processing (6)
  • hyperfleet/components/adapter/framework/adapter-deployment.md
  • hyperfleet/components/adapter/framework/adapter-frame-design.md
  • hyperfleet/standards/error-model.md
  • hyperfleet/standards/logging-specification.md
  • hyperfleet/standards/metrics.md
  • hyperfleet/standards/tracing.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • hyperfleet/standards/metrics.md
  • hyperfleet/components/adapter/framework/adapter-deployment.md
  • hyperfleet/components/adapter/framework/adapter-frame-design.md
  • hyperfleet/standards/error-model.md
🔇 Additional comments (3)
hyperfleet/standards/logging-specification.md (1)

277-281: References section properly integrates tracing and metrics standards.

The new References section appropriately links to related standards and establishes bidirectional cross-references with the tracing and metrics documents.

hyperfleet/standards/tracing.md (2)

1-100: Excellent comprehensive tracing standard with clear OpenTelemetry guidance.

The document provides strong foundational guidance: vendor-neutral rationale, OTEL SDK requirements, configuration following conventions, proper W3C Trace Context propagation, and practical Go examples. Cross-references to logging and metrics are well-integrated.


254-300: Sampling strategy clearly explained with environment-specific rates.

The head-based vs tail-based comparison is helpful, and the environment-specific defaults (Dev 100%, Staging 10%, Prod 1%) align with the PR objectives and adapter-frame-design updates.

@rafabene force-pushed the HYPERFLEET-383-tracing-telemetry-standard branch 2 times, most recently from c532786 to 3adb216 on January 2, 2026 at 17:45
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
hyperfleet/components/adapter/framework/adapter-frame-design.md (1)

1273-1273: Clarify which environment the sampling rate example applies to.

The comment "10% sampling in production" is misleading. Per the HyperFleet Tracing Standard, 0.1 (10%) is the Staging sampling rate; Production uses 0.01 (1%). Clarify whether this line shows a Staging example or needs adjustment.

🧹 Nitpick comments (1)
hyperfleet/standards/tracing.md (1)

1-484: Comprehensive tracing standard that establishes clear OpenTelemetry adoption.

The complete document is well-structured with clear sections on adoption, configuration, propagation, spans, attributes, sampling, exporters, and best practices. Practical Go code examples and YAML configurations make the standard actionable. The sampling rates (Dev 1.0, Staging 0.1, Prod 0.01) align with adapter-frame-design.md updates. Cross-references to related standards provide good context.

One suggestion for future work: Consider adding a troubleshooting section (perhaps as part of follow-up ticket HYPERFLEET-433) with common tracing issues, debugging tips, and sample queries for visualizing traces in observability backends.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between c532786 and 3adb216.

📒 Files selected for processing (6)
  • hyperfleet/components/adapter/framework/adapter-deployment.md
  • hyperfleet/components/adapter/framework/adapter-frame-design.md
  • hyperfleet/standards/error-model.md
  • hyperfleet/standards/logging-specification.md
  • hyperfleet/standards/metrics.md
  • hyperfleet/standards/tracing.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • hyperfleet/standards/logging-specification.md
  • hyperfleet/standards/metrics.md
  • hyperfleet/components/adapter/framework/adapter-deployment.md
  • hyperfleet/standards/error-model.md
🔇 Additional comments (9)
hyperfleet/components/adapter/framework/adapter-frame-design.md (1)

1341-1346: Sampling rates and tracing standard reference look good.

The updated sampling rates (Staging 0.1, Production 0.01) and reference to the HyperFleet Tracing Standard are appropriately added. This provides good cross-documentation clarity.

hyperfleet/standards/tracing.md (8)

1-47: OpenTelemetry adoption section is well-motivated and actionable.

Clear goals, appropriate non-goals, and specific SDK requirements with concrete Go imports make this a strong foundation for the standard.


50-84: Configuration section is comprehensive and follows OTEL conventions.

Environment variables, service names, and resource attributes are clearly documented and align with OpenTelemetry standards. The defaults and examples are practical for Kubernetes deployment.


86-152: Trace context propagation is correctly implemented across HTTP and CloudEvents.

W3C Trace Context usage is appropriate, code examples are practical, and the propagation flow diagram clearly illustrates how traces flow through Sentinel→API→Pub/Sub→Adapter. The CloudEvents extension approach aligns with standard practices.


157-208: Required spans follow semantic conventions with clear component-specific guidance.

Naming patterns are consistent and practical (HTTP: {method} {route}, Database: {operation} {table}, etc.). Coverage of API, Sentinel, and Adapter spans is comprehensive. The guidance to use attributes for high-cardinality values prevents span explosion.


211-265: Span attributes comprehensively combine semantic conventions with HyperFleet-specific attributes.

Attributes for HTTP, database, and messaging are well-documented. HyperFleet-specific attributes (cluster_id, adapter, etc.) use clear namespacing. The DO/DON'T guidance prevents common issues (sensitive data, high-cardinality names, large payloads).


268-313: Sampling strategy clearly justifies head-based approach with environment-specific rates.

Trade-offs between head-based and tail-based sampling are well-explained. The chosen head-based approach with parent-based trace ID ratio sampler is practical. Environment-specific rates (Dev 1.0, Staging 0.1, Prod 0.01) are well-justified. The "Always Sample" section appropriately recommends custom samplers for error/latency-based sampling.


317-357: OTLP exporter configuration covers Kubernetes and local development scenarios.

Code examples use standard Go SDK APIs. Kubernetes deployment shows appropriate service discovery to OpenTelemetry Collector. Local development guidance with console and Jaeger options is helpful for debugging. Note: The collector service name (otel-collector.observability.svc) may vary by deployment; ensure it matches your observability infrastructure setup.


360-473: Logging integration, error handling, and span lifecycle sections provide actionable best practices.

Trace-log correlation via trace_id/span_id is clearly demonstrated with JSON examples. Error handling guidance is practical with context attributes. Span lifecycle shows both correct patterns and anti-patterns (e.g., the "Bad" example with context.Background() clearly illustrates trace breakage). The context propagation emphasis throughout the call chain is critical and well-highlighted.

@rafabene force-pushed the HYPERFLEET-383-tracing-telemetry-standard branch from 3adb216 to c886495 on January 2, 2026 at 17:49
Add comprehensive tracing standard document covering:
- OpenTelemetry SDK adoption and configuration
- W3C Trace Context propagation (HTTP and CloudEvents)
- Required spans for API, Sentinel, and Adapters
- Standard span attributes following semantic conventions
- Sampling strategy (head-based vs tail-based) with environment-specific rates
- OTLP exporter configuration
- Integration with logging via trace_id/span_id

Update related documents with cross-references:
- Add References section to logging-specification.md
- Add tracing reference to metrics.md and error-model.md
- Update adapter-frame-design.md with aligned sample rates and reference
- Add tracing and logging references to adapter-deployment.md

Follow-up ticket created: HYPERFLEET-433 (Claude plugin integration)

Co-Authored-By: Claude <noreply@anthropic.com>
@rafabene force-pushed the HYPERFLEET-383-tracing-telemetry-standard branch from c886495 to c68a672 on January 2, 2026 at 17:51

### Resource Attributes

Components SHOULD include these resource attributes via `OTEL_RESOURCE_ATTRIBUTES`:
Collaborator

Is a list missing here?
I guess the attributes are the ones in the example below?

If we want to avoid duplication, the service.version could be generated from the build SHA.
This is how it is done now in hyperfleet-api:

```makefile
build_version:=$(git_sha)$(git_dirty)
ldflags=-X github.com/openshift-hyperfleet/hyperfleet-api/pkg/api.Version=$(build_version)
```

The namespace can come from the Helm deployment.

But I wonder where the environment will come from. Would different environments collect different OTEL data? I mean:

  • Some production infra will collect OTEL data from production apps
  • Some dev/stage infra will collect OTEL data, possibly from multiple "environments"...
    • but in that case, could we use the k8s.namespace.name to differentiate the environment?
    • e.g. hyperfleet-stage, hyperfleet-shared, hyperfleet-rafabenede

Contributor Author

I've updated the section to:

  1. Added a clear attribute table listing service.version and k8s.namespace.name with their sources
  2. Clarified that k8s.namespace.name serves as the environment differentiator - different namespaces (e.g., hyperfleet-prod, hyperfleet-stage, hyperfleet-dev-rafael) naturally isolate traces in multi-tenant deployments
  3. Added a build example showing how to use git SHA for service.version (matching the existing pattern in hyperfleet-api)
  4. Removed deployment.environment as a required attribute since namespace-based isolation is sufficient for our use case

This aligns with your suggestion - we use k8s.namespace.name to differentiate environments rather than a separate deployment.environment attribute.
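
As a rough sketch of the build-time versioning idea (assuming the OpenTelemetry Go SDK; the ldflags path and helper name are illustrative, not taken from the repository):

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/sdk/resource"
)

// Version is overridden at build time, e.g.
//   go build -ldflags "-X <module>/pkg/telemetry.Version=$(git rev-parse HEAD)"
var Version = "dev"

// NewResource combines env-provided attributes (OTEL_RESOURCE_ATTRIBUTES,
// typically carrying k8s.namespace.name from the Helm chart / Downward API)
// with the build-time service.version.
func NewResource(ctx context.Context) (*resource.Resource, error) {
	return resource.New(ctx,
		resource.WithFromEnv(), // reads OTEL_RESOURCE_ATTRIBUTES and OTEL_SERVICE_NAME
		resource.WithAttributes(attribute.String("service.version", Version)),
	)
}
```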


> **Note:** Sentinel initiates traces during polling cycles. The trace context propagates through CloudEvents to Adapters, which include it when updating status back to the API.
Collaborator

nitpick: There is an additional call from Adapters to the API to get the cluster/nodepool status.

Contributor Author

Good point! Updated the diagram to show both API calls from the Adapter:

  1. GET status - to fetch current cluster/nodepool status
  2. PATCH status - to update status after reconciliation

Both calls propagate the trace context.
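
A minimal sketch of that propagation on the Adapter side, assuming the otelhttp instrumentation from opentelemetry-go-contrib (URLs and function names are placeholders):

```go
package telemetry

import (
	"context"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// The instrumented transport injects traceparent/tracestate on every request,
// so both the GET and PATCH status calls join the same trace as the CloudEvent.
var apiClient = &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}

func getClusterStatus(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	return apiClient.Do(req)
}
```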

| Poll cycle | `sentinel.{operation}` | `sentinel.poll` |
| Decision evaluation | `sentinel.{operation}` | `sentinel.evaluate` |
| Event publish | `{destination} {operation}` | `hyperfleet-clusters publish` |
| API call | `{method}` | `GET` |
Collaborator

Is the API name specified somehow?
How do we differentiate if Sentinel makes 2 API calls to 2 different APIs?

I mean, for the Event publish, the destination is specified, but not for API calls?

Contributor Author

Good question! Per https://opentelemetry.io/docs/specs/semconv/http/http-spans/#http-client, HTTP client span names should only include the method (e.g., GET), not the destination.

To differentiate calls to different APIs, use the span attributes:

  • server.address - the target API hostname
  • url.path - the request path

I've added a note clarifying this and referencing the HTTP Spans attributes section.
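
For example, a client span named only GET can still identify its target through attributes. A sketch under those conventions; hostname and path are placeholders:

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

func callHyperFleetAPI(ctx context.Context) {
	tracer := otel.Tracer("sentinel")
	ctx, span := tracer.Start(ctx, "GET", // span name carries only the HTTP method
		trace.WithSpanKind(trace.SpanKindClient),
		trace.WithAttributes(
			attribute.String("server.address", "hyperfleet-api.hyperfleet.svc"), // which API
			attribute.String("url.path", "/api/v1/clusters"),                    // which route
		),
	)
	defer span.End()

	_ = ctx // issue the actual request with ctx so propagation and child spans attach here
}
```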

|-----------|-------------------|---------|
| Event receive | `{destination} {operation}` | `hyperfleet-clusters receive` |
| Event process | `adapter.{operation}` | `adapter.process` |
| Cloud provider call | `{method}` | `POST` |
Collaborator

Should the cloud service being called also be specified? bucket, pubsub, gke....

Contributor Author

Good point! I've added:

  1. A note in the Adapters section referencing the new Cloud Provider Spans attributes
  2. A new Cloud Provider Spans section with attributes to identify the cloud service:
    - cloud.provider - the cloud provider (gcp, aws, azure)
    - cloud.service - the service name (gke, storage, pubsub)
    - server.address - the API endpoint
    - cloud.resource_id - the resource identifier

Also included a Go code example showing how to set these attributes for a GKE API call.
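
A rough sketch of those attributes on a client span for a GKE call (values are placeholders; the authoritative example is the one added to tracing.md):

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

func createNodePool(ctx context.Context) {
	tracer := otel.Tracer("adapter-gcp")
	ctx, span := tracer.Start(ctx, "POST",
		trace.WithSpanKind(trace.SpanKindClient),
		trace.WithAttributes(
			attribute.String("cloud.provider", "gcp"),
			attribute.String("cloud.service", "gke"),
			attribute.String("server.address", "container.googleapis.com"),
			attribute.String("cloud.resource_id", "projects/my-proj/locations/us-east1/clusters/demo"),
		),
	)
	defer span.End()

	_ = ctx // make the cloud provider API call with ctx
}
```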


### Default: Parent-Based Trace ID Ratio

HyperFleet uses `parentbased_traceidratio` as the default sampler:
Collaborator

Is there any reasoning for this decision?

ChatGPT says:

Most OTel SDKs and the OTel Collector docs suggest this as a default production sampler.

which could be a valid reason.

Contributor Author

I've added a "Why this sampler" section explaining the reasoning:

  1. Recommended by OpenTelemetry - Default production sampler suggested by OTel SDKs and Collector docs
  2. Trace completeness - Ensures all spans in a trace are sampled together, avoiding broken traces
  3. Distributed consistency - When Sentinel starts a trace, all downstream services (Adapters via Pub/Sub) respect the original sampling decision
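
Programmatically, the same default looks like this with the Go SDK (equivalent to OTEL_TRACES_SAMPLER=parentbased_traceidratio plus OTEL_TRACES_SAMPLER_ARG; shown only for illustration):

```go
package telemetry

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// NewSampler samples root spans at the given ratio (e.g. 0.01 in production),
// while child spans inherit the parent's decision so traces stay complete
// across Sentinel, the API, and Adapters.
func NewSampler(ratio float64) sdktrace.Sampler {
	return sdktrace.ParentBased(sdktrace.TraceIDRatioBased(ratio))
}
```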


### Always Sample

Certain operations SHOULD always be sampled regardless of the base rate:
Collaborator

Is this possible with the head-based approach selected previously?

Contributor Author

You're right - I've clarified this section.

With head-based sampling, you can only force 100% sampling for known operations at trace start (specific routes, resource types, debug headers).

Sampling based on outcomes (errors, latency) is not possible with head-based because the decision is made before the operation completes.

I've added a note explaining this limitation and that tail-based sampling would be required if we need to capture all errors/slow requests in the future.
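
A hypothetical sketch of the "known at trace start" case: a custom root sampler that forces sampling when a debug attribute is present at span creation and otherwise defers to the ratio-based default. The attribute name and wiring are assumptions, not part of the standard.

```go
package telemetry

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

type debugAwareSampler struct {
	fallback sdktrace.Sampler // e.g. sdktrace.TraceIDRatioBased(0.01)
}

func (s debugAwareSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	// Only information available at span start (name, kind, start attributes)
	// can drive this decision; outcomes like errors or latency cannot.
	for _, attr := range p.Attributes {
		if attr.Key == "hyperfleet.debug" && attr.Value.AsBool() {
			return sdktrace.SamplingResult{
				Decision:   sdktrace.RecordAndSample,
				Tracestate: trace.SpanContextFromContext(p.ParentContext).TraceState(),
			}
		}
	}
	return s.fallback.ShouldSample(p)
}

func (s debugAwareSampler) Description() string { return "DebugAwareSampler" }
```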

- Expand Resource Attributes section with clear attribute table
- Clarify k8s.namespace.name as environment differentiator
- Add service.version build example using git SHA
- Update propagation flow diagram to include Adapter GET status call
- Add note explaining HTTP client span naming convention
- Add Cloud Provider Spans section with attributes (cloud.provider, cloud.service)
- Document reasoning for parentbased_traceidratio sampler choice
- Clarify head-based sampling limitations for outcome-based sampling
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
hyperfleet/standards/tracing.md (2)

144-150: CloudEvents helper functions need implementation guidance.

Lines 144–150 reference traceparentFromContext() and contextFromTraceparent(), which are not shown. While a standards document doesn't need to provide all implementation details, developers implementing this may benefit from either a brief code example or a pointer to where these utilities come from (e.g., a shared helper library, community package, or custom implementation pattern).

Consider adding a brief note or example showing how to extract/reconstruct trace context from traceparent strings, or reference an existing helper (e.g., from an OTel community library or internal utility).
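
One possible shape for those helpers, wrapping the W3C TraceContext propagator over a map carrier (the function names mirror the ones referenced in the document; the bodies are an assumption, not the implementation):

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel/propagation"
)

var w3c propagation.TraceContext

// traceparentFromContext serializes the active span context into a traceparent
// header value suitable for a CloudEvents extension attribute.
func traceparentFromContext(ctx context.Context) string {
	carrier := propagation.MapCarrier{}
	w3c.Inject(ctx, carrier)
	return carrier.Get("traceparent")
}

// contextFromTraceparent rebuilds a context carrying the remote span context,
// so spans started from it continue the producer's trace.
func contextFromTraceparent(ctx context.Context, traceparent string) context.Context {
	return w3c.Extract(ctx, propagation.MapCarrier{"traceparent": traceparent})
}
```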


99-102: Clarify how Kubernetes namespace is populated in resource attributes.

The example at line 101 uses $(NAMESPACE) as a shell/templating variable. While the table at line 83 mentions "Helm/Downward API" as the source, the YAML snippet could be more explicit about how this variable is populated—e.g., whether it comes from a Downward API reference in the Helm chart or a hardcoded value. This will help teams implement it correctly across environments.

Expand the YAML example to show the Downward API pattern or a Helm-specific approach, e.g.:

Example Kubernetes Downward API approach:

```yaml
env:
  - name: NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "service.version=$(VERSION),k8s.namespace.name=$(NAMESPACE)"
```
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between c886495 and 46fce7e.

📒 Files selected for processing (6)
  • hyperfleet/components/adapter/framework/adapter-deployment.md
  • hyperfleet/components/adapter/framework/adapter-frame-design.md
  • hyperfleet/standards/error-model.md
  • hyperfleet/standards/logging-specification.md
  • hyperfleet/standards/metrics.md
  • hyperfleet/standards/tracing.md
🚧 Files skipped from review as they are similar to previous changes (5)
  • hyperfleet/components/adapter/framework/adapter-frame-design.md
  • hyperfleet/standards/metrics.md
  • hyperfleet/standards/logging-specification.md
  • hyperfleet/components/adapter/framework/adapter-deployment.md
  • hyperfleet/standards/error-model.md
🔇 Additional comments (1)
hyperfleet/standards/tracing.md (1)

1-547: Comprehensive tracing standard with clear OpenTelemetry adoption and practical guidance.

This document establishes a solid foundation for distributed tracing across HyperFleet components. The structure is logical, spanning goals, configuration, span requirements, sampling strategy, exporters, and integration patterns. OpenTelemetry patterns and semantic conventions are correctly applied throughout, and the past review feedback has been thoughtfully incorporated—particularly the clarified resource attributes (with build-time versioning and namespace-based environment differentiation), trace flow diagram showing both API calls, and detailed sampling rationale addressing the head-based limitations.

The configuration guidance is practical (environment variables, service names, Kubernetes examples), span naming conventions align with OTel standards, and the examples effectively illustrate patterns for context propagation, error handling, and span lifecycle. The logging integration section ties traces to logs via trace_id/span_id, and the best practices for span attributes (avoiding PII, high cardinality, large payloads) are sensible and well-highlighted.

@rh-amarin (Collaborator)

/lgtm

@ciaranRoche (Contributor) left a comment

/lgtm

@rafabene merged commit 5a6ad67 into openshift-hyperfleet:main on Jan 8, 2026
1 check passed
