HYPERFLEET-383: Add tracing and telemetry standard #64
Conversation
Walkthrough

Added a new HyperFleet Tracing and Telemetry Standard at hyperfleet/standards/tracing.md describing OpenTelemetry-based distributed tracing (goals, SDK requirements, env var configuration, per-component service names, resource attributes, HTTP and CloudEvents propagation, required spans/attributes, sampling strategy with environment-specific rates, OTLP exporter defaults, deployment guidance, logging correlation, and span lifecycle examples). Updated adapter-frame-design.md sample rates (Staging 0.5 → 0.1; Production 0.1 → 0.01) and added tracing references across multiple docs and the logging specification. All changes are documentation-only except the new tracing.md file.

Sequence Diagram(s)

sequenceDiagram
rect rgb(248,249,251)
participant Client
participant Adapter
participant OTLP_Collector as "OTLP Collector"
participant Tracing_Backend as "Tracing Backend"
end
Client->>Adapter: HTTP request / CloudEvent (may carry trace context)
Note right of Adapter: start/continue spans\napply sampling, add attributes
Adapter->>OTLP_Collector: Export spans (OTLP gRPC/HTTP)
OTLP_Collector->>Tracing_Backend: Forward traces for storage/analysis
Tracing_Backend-->>OTLP_Collector: Ack / ingestion status
OTLP_Collector-->>Adapter: Export result (async)
Note left of Client: Logs/metrics correlated via trace IDs
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

Pre-merge checks: ✅ Passed checks (3 passed)
Force-pushed 985cc20 to ca745a3
Actionable comments posted: 0
🧹 Nitpick comments (4)
hyperfleet/standards/tracing.md (4)
54-64: Clarify usage of `TRACING_ENABLED` environment variable.

The config table introduces `TRACING_ENABLED` (line 63), but it is never referenced in code examples or exporter sections. Either remove it if unused, or add a brief explanation of when/how to use it and what happens when it is set to `false`.
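For illustration, here is a minimal sketch (not taken from the PR) of how `TRACING_ENABLED=false` could short-circuit SDK setup, assuming the standard intends it as an on/off switch for span export; the helper name and structure are hypothetical:

```go
package telemetry

import (
	"context"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// initTracing returns a no-op shutdown func when TRACING_ENABLED=false,
// otherwise it installs an OTLP-exporting tracer provider configured from
// the standard OTEL_* environment variables.
func initTracing(ctx context.Context) (func(context.Context) error, error) {
	if os.Getenv("TRACING_ENABLED") == "false" {
		// Tracing disabled: keep the default no-op global tracer provider.
		return func(context.Context) error { return nil }, nil
	}
	exp, err := otlptracegrpc.New(ctx) // endpoint etc. come from OTEL_EXPORTER_OTLP_* env vars
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	otel.SetTracerProvider(tp)
	return tp.Shutdown, nil
}
```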
282-289: Configuration example should include all three environments.

The example shows only Development and Production. For clarity, also include Staging with its 0.1 sampling rate to match the sampling strategy section.
🔎 Proposed addition
 # Development
 OTEL_TRACES_SAMPLER=always_on

+# Staging
+OTEL_TRACES_SAMPLER=parentbased_traceidratio
+OTEL_TRACES_SAMPLER_ARG=0.1
+
 # Production
 OTEL_TRACES_SAMPLER=parentbased_traceidratio
 OTEL_TRACES_SAMPLER_ARG=0.01
61-61: Baggage propagation mentioned in config but not documented in usage section.

The `OTEL_PROPAGATORS` config includes `baggage` (line 61), but there is no section explaining when/how to use W3C Baggage in HyperFleet spans or context. Consider either clarifying whether baggage propagation is required, or adding a brief section with examples.
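If the standard does adopt W3C Baggage, such a usage section could look roughly like the sketch below, using the OTel Go `baggage` package; the `hyperfleet.cluster_id` entry name is a hypothetical example, not something defined in the document:

```go
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel/baggage"
)

func main() {
	ctx := context.Background()

	// Producer side: attach a baggage entry. With the "baggage" propagator
	// configured, it is carried alongside the trace context.
	member, _ := baggage.NewMember("hyperfleet.cluster_id", "abc123")
	bag, _ := baggage.New(member)
	ctx = baggage.ContextWithBaggage(ctx, bag)

	// Consumer side (after extraction in a downstream service): read the entry.
	fmt.Println(baggage.FromContext(ctx).Member("hyperfleet.cluster_id").Value())
}
```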
303-342: Exporter configuration examples clear for both environments; consider adding TLS note for production.

The OTLP examples are practical for local development and Kubernetes. For completeness, consider a brief note in the Kubernetes section mentioning TLS configuration in production environments (though detailed TLS setup may belong in infrastructure docs).
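A possible shape for such a TLS note, sketched with the OTel Go OTLP gRPC exporter; the endpoint value is only an example and depends on the observability setup:

```go
package telemetry

import (
	"context"
	"crypto/tls"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"google.golang.org/grpc/credentials"
)

// newProductionExporter builds an OTLP gRPC exporter that verifies the
// collector's certificate instead of using an insecure connection.
func newProductionExporter(ctx context.Context, endpoint string) (*otlptrace.Exporter, error) {
	creds := credentials.NewTLS(&tls.Config{MinVersion: tls.VersionTLS12})
	return otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(endpoint), // e.g. "otel-collector.observability.svc:4317"
		otlptracegrpc.WithTLSCredentials(creds),
	)
}
```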
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- hyperfleet/components/adapter/framework/adapter-deployment.md
- hyperfleet/components/adapter/framework/adapter-frame-design.md
- hyperfleet/standards/error-model.md
- hyperfleet/standards/logging-specification.md
- hyperfleet/standards/metrics.md
- hyperfleet/standards/tracing.md
🚧 Files skipped from review as they are similar to previous changes (4)
- hyperfleet/standards/metrics.md
- hyperfleet/components/adapter/framework/adapter-deployment.md
- hyperfleet/components/adapter/framework/adapter-frame-design.md
- hyperfleet/standards/error-model.md
🔇 Additional comments (3)
hyperfleet/standards/logging-specification.md (1)
277-281: References section properly integrates tracing and metrics standards.

The new References section appropriately links to related standards and establishes bidirectional cross-references with the tracing and metrics documents.
hyperfleet/standards/tracing.md (2)
1-100: Excellent comprehensive tracing standard with clear OpenTelemetry guidance.

The document provides strong foundational guidance: vendor-neutral rationale, OTEL SDK requirements, configuration following conventions, proper W3C Trace Context propagation, and practical Go examples. Cross-references to logging and metrics are well-integrated.
254-300: Sampling strategy clearly explained with environment-specific rates.

The head-based vs tail-based comparison is helpful, and the environment-specific defaults (Dev 100%, Staging 10%, Prod 1%) align with the PR objectives and adapter-frame-design updates.
Force-pushed c532786 to 3adb216
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
hyperfleet/components/adapter/framework/adapter-frame-design.md (1)
1273-1273: Clarify which environment the sampling rate example applies to.

The comment "10% sampling in production" is misleading. Per the HyperFleet Tracing Standard, `0.1` (10%) is the Staging sampling rate; Production uses `0.01` (1%). Clarify whether this line shows a Staging example or needs adjustment.
🧹 Nitpick comments (1)
hyperfleet/standards/tracing.md (1)
1-484: Comprehensive tracing standard that establishes clear OpenTelemetry adoption.

The complete document is well-structured with clear sections on adoption, configuration, propagation, spans, attributes, sampling, exporters, and best practices. Practical Go code examples and YAML configurations make the standard actionable. The sampling rates (Dev 1.0, Staging 0.1, Prod 0.01) align with adapter-frame-design.md updates. Cross-references to related standards provide good context.
One suggestion for future work: Consider adding a troubleshooting section (perhaps as part of follow-up ticket HYPERFLEET-433) with common tracing issues, debugging tips, and sample queries for visualizing traces in observability backends.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- hyperfleet/components/adapter/framework/adapter-deployment.md
- hyperfleet/components/adapter/framework/adapter-frame-design.md
- hyperfleet/standards/error-model.md
- hyperfleet/standards/logging-specification.md
- hyperfleet/standards/metrics.md
- hyperfleet/standards/tracing.md
🚧 Files skipped from review as they are similar to previous changes (4)
- hyperfleet/standards/logging-specification.md
- hyperfleet/standards/metrics.md
- hyperfleet/components/adapter/framework/adapter-deployment.md
- hyperfleet/standards/error-model.md
🔇 Additional comments (9)
hyperfleet/components/adapter/framework/adapter-frame-design.md (1)
1341-1346: Sampling rates and tracing standard reference look good.

The updated sampling rates (Staging 0.1, Production 0.01) and reference to the HyperFleet Tracing Standard are appropriately added. This provides good cross-documentation clarity.
hyperfleet/standards/tracing.md (8)
1-47: OpenTelemetry adoption section is well-motivated and actionable.

Clear goals, appropriate non-goals, and specific SDK requirements with concrete Go imports make this a strong foundation for the standard.
50-84: Configuration section is comprehensive and follows OTEL conventions.

Environment variables, service names, and resource attributes are clearly documented and align with OpenTelemetry standards. The defaults and examples are practical for Kubernetes deployment.
86-152: Trace context propagation is correctly implemented across HTTP and CloudEvents.

W3C Trace Context usage is appropriate, code examples are practical, and the propagation flow diagram clearly illustrates how traces flow through Sentinel→API→Pub/Sub→Adapter. The CloudEvents extension approach aligns with standard practices.
157-208: Required spans follow semantic conventions with clear component-specific guidance.

Naming patterns are consistent and practical (HTTP: `{method} {route}`, Database: `{operation} {table}`, etc.). Coverage of API, Sentinel, and Adapter spans is comprehensive. The guidance to use attributes for high-cardinality values prevents span explosion.
211-265: Span attributes comprehensively combine semantic conventions with HyperFleet-specific attributes.

Attributes for HTTP, database, and messaging are well-documented. HyperFleet-specific attributes (cluster_id, adapter, etc.) use clear namespacing. The DO/DON'T guidance prevents common issues (sensitive data, high-cardinality names, large payloads).
268-313: Sampling strategy clearly justifies head-based approach with environment-specific rates.

Trade-offs between head-based and tail-based sampling are well-explained. The chosen head-based approach with parent-based trace ID ratio sampler is practical. Environment-specific rates (Dev 1.0, Staging 0.1, Prod 0.01) are well-justified. The "Always Sample" section appropriately recommends custom samplers for error/latency-based sampling.
317-357: OTLP exporter configuration covers Kubernetes and local development scenarios.

Code examples use standard Go SDK APIs. Kubernetes deployment shows appropriate service discovery to OpenTelemetry Collector. Local development guidance with console and Jaeger options is helpful for debugging. Note: the collector service name (otel-collector.observability.svc) may vary by deployment; ensure it matches your observability infrastructure setup.
360-473: Logging integration, error handling, and span lifecycle sections provide actionable best practices.

Trace-log correlation via trace_id/span_id is clearly demonstrated with JSON examples. Error handling guidance is practical with context attributes. Span lifecycle shows both correct patterns and anti-patterns (e.g., the "Bad" example with context.Background() clearly illustrates trace breakage). The context propagation emphasis throughout the call chain is critical and well-highlighted.
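For readers of this thread, a condensed sketch of the good/bad lifecycle pattern being described, assuming the span name `adapter.process` from the standard (the `cloudprovider.call` child name and helper functions here are made up for illustration):

```go
package worker

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/codes"
)

var tracer = otel.Tracer("hyperfleet/adapter")

// Good: the span is ended with defer and the span's context is passed on,
// so child spans join the same trace.
func reconcile(ctx context.Context) error {
	ctx, span := tracer.Start(ctx, "adapter.process")
	defer span.End()

	if err := callCloudProvider(ctx); err != nil { // propagate ctx, not context.Background()
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}
	return nil
}

func callCloudProvider(ctx context.Context) error {
	// Bad (anti-pattern): starting a child with context.Background() would
	// detach it from the parent trace: tracer.Start(context.Background(), ...)
	_, span := tracer.Start(ctx, "cloudprovider.call")
	defer span.End()
	return nil
}
```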
Force-pushed 3adb216 to c886495
Add comprehensive tracing standard document covering:
- OpenTelemetry SDK adoption and configuration
- W3C Trace Context propagation (HTTP and CloudEvents)
- Required spans for API, Sentinel, and Adapters
- Standard span attributes following semantic conventions
- Sampling strategy (head-based vs tail-based) with environment-specific rates
- OTLP exporter configuration
- Integration with logging via trace_id/span_id

Update related documents with cross-references:
- Add References section to logging-specification.md
- Add tracing reference to metrics.md and error-model.md
- Update adapter-frame-design.md with aligned sample rates and reference
- Add tracing and logging references to adapter-deployment.md

Follow-up ticket created: HYPERFLEET-433 (Claude plugin integration)

Co-Authored-By: Claude <noreply@anthropic.com>
Force-pushed c886495 to c68a672
hyperfleet/standards/tracing.md (Outdated)

> ### Resource Attributes
>
> Components SHOULD include these resource attributes via `OTEL_RESOURCE_ATTRIBUTES`:
Is a list missing here?

I guess the attributes are the ones in the example below?

If we want to avoid duplication, the service.version could be generated out of the build SHA. This is how it is now in hyperfleet-api:

build_version := $(git_sha)$(git_dirty)
ldflags = -X github.com/openshift-hyperfleet/hyperfleet-api/pkg/api.Version=$(build_version)

The namespace can come from the Helm deployment.

But I wonder where the environment will come from. Would different environments collect different OTEL data? I mean:

- Some production infra will collect OTEL data from production apps
- Some dev/stage infra will collect OTEL data possibly from multiple "environments"
- But in that case, could we use the `k8s.namespace.name` to differentiate the environment? e.g. `hyperfleet-stage`, `hyperfleet-shared`, `hyperfleet-rafabenede`
I've updated the section to:
- Added a clear attribute table listing service.version and k8s.namespace.name with their sources
- Clarified that k8s.namespace.name serves as the environment differentiator - different namespaces (e.g., hyperfleet-prod, hyperfleet-stage, hyperfleet-dev-rafael) naturally isolate traces in multi-tenant deployments
- Added a build example showing how to use git SHA for service.version (matching the existing pattern in hyperfleet-api)
- Removed deployment.environment as a required attribute since namespace-based isolation is sufficient for our use case
This aligns with your suggestion - we use k8s.namespace.name to differentiate environments rather than a separate deployment.environment attribute.
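A sketch of how the agreed approach could look in Go, assuming `Version` is injected via `-ldflags` at build time and `NAMESPACE` via the Downward API; the package name, env var name, and semconv version path are illustrative, not taken from the repo:

```go
package telemetry

import (
	"os"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// Version is overridden at build time, e.g.:
//   go build -ldflags "-X <module>/telemetry.Version=$(git rev-parse --short HEAD)"
var Version = "dev"

// newResource attaches service.version and k8s.namespace.name so traces from
// different namespaces (environments) are naturally separated in the backend.
func newResource() (*resource.Resource, error) {
	return resource.Merge(resource.Default(), resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceVersion(Version),
		// Namespace injected by the Helm chart / Downward API; used to
		// differentiate environments instead of deployment.environment.
		attribute.String("k8s.namespace.name", os.Getenv("NAMESPACE")),
	))
}
```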
> **Note:** Sentinel initiates traces during polling cycles. The trace context propagates through CloudEvents to Adapters, which include it when updating status back to the API.
nitpick: There is an additional call from adapters to API to get the cluster/nodepool status
Good point! Updated the diagram to show both API calls from the Adapter:
- GET status - to fetch current cluster/nodepool status
- PATCH status - to update status after reconciliation
Both calls propagate the trace context.
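As a sketch of what "both calls propagate the trace context" can mean in Go, assuming the adapter uses the `otelhttp` contrib package for client instrumentation (function names and URLs are hypothetical):

```go
package adapter

import (
	"context"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// client wraps the default transport so outgoing requests get client spans
// and the traceparent header injected automatically.
var client = &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}

// getStatus and patchStatus both pass ctx, so the GET and the later PATCH
// join the trace that arrived with the CloudEvent.
func getStatus(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	return client.Do(req)
}

func patchStatus(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPatch, url, nil)
	if err != nil {
		return nil, err
	}
	return client.Do(req)
}
```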
> | Poll cycle | `sentinel.{operation}` | `sentinel.poll` |
> | Decision evaluation | `sentinel.{operation}` | `sentinel.evaluate` |
> | Event publish | `{destination} {operation}` | `hyperfleet-clusters publish` |
> | API call | `{method}` | `GET` |
Is the API name specified somehow?
How do we differentiate if Sentinel is doing 2 API calls to 2 different APIs?
I mean, for the Event publish, the destination is specified, but not for API calls?
Good question! Per https://opentelemetry.io/docs/specs/semconv/http/http-spans/#http-client, HTTP client span names should only include the method (e.g., GET), not the destination.
To differentiate calls to different APIs, use the span attributes:
- server.address - the target API hostname
- url.path - the request path
I've added a note clarifying this and referencing the HTTP Spans attributes section.
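A brief sketch of that convention: the span name stays just `GET`, and the target API is distinguished via attributes (host and path values below are hypothetical):

```go
package sentinel

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("hyperfleet/sentinel")

// callAPI starts a client span named after the method only; the target API
// is identified by server.address and url.path attributes.
func callAPI(ctx context.Context) {
	_, span := tracer.Start(ctx, "GET",
		trace.WithSpanKind(trace.SpanKindClient),
		trace.WithAttributes(
			attribute.String("server.address", "hyperfleet-api.hyperfleet.svc"),
			attribute.String("url.path", "/api/v1/clusters"),
		),
	)
	defer span.End()
	// ... perform the request ...
}
```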
> | Event receive | `{destination} {operation}` | `hyperfleet-clusters receive` |
> | Event process | `adapter.{operation}` | `adapter.process` |
> | Cloud provider call | `{method}` | `POST` |
Should the cloud service being called also be specified? bucket, pubsub, gke....
Good point! I've added:
- A note in the Adapters section referencing the new Cloud Provider Spans attributes
- A new Cloud Provider Spans section with attributes to identify the cloud service:
- cloud.provider - the cloud provider (gcp, aws, azure)
- cloud.service - the service name (gke, storage, pubsub)
- server.address - the API endpoint
- cloud.resource_id - the resource identifier
Also included a Go code example showing how to set these attributes for a GKE API call.
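The referenced Go example is not visible in this thread; a hedged approximation of what it might look like, using plain `attribute.String` calls for the listed attributes (the endpoint and span name are illustrative):

```go
package adapter

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("hyperfleet/adapter-gcp")

// createGKECluster shows a cloud provider call span tagged so different
// services (GKE, Pub/Sub, Storage) can be told apart in the backend.
func createGKECluster(ctx context.Context) {
	_, span := tracer.Start(ctx, "POST",
		trace.WithSpanKind(trace.SpanKindClient),
		trace.WithAttributes(
			attribute.String("cloud.provider", "gcp"),
			attribute.String("cloud.service", "gke"),
			attribute.String("server.address", "container.googleapis.com"),
		),
	)
	defer span.End()
	// ... call the GKE API ...
}
```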
hyperfleet/standards/tracing.md (Outdated)

> ### Default: Parent-Based Trace ID Ratio
>
> HyperFleet uses `parentbased_traceidratio` as the default sampler:
Is there any reasoning for this decision?
ChatGPT says:
Most OTel SDKs and the OTel Collector docs suggest this as a default production sampler.
which could be a valid reason.
I've added a "Why this sampler" section explaining the reasoning:
- Recommended by OpenTelemetry - Default production sampler suggested by OTel SDKs and Collector docs
- Trace completeness - Ensures all spans in a trace are sampled together, avoiding broken traces
- Distributed consistency - When Sentinel starts a trace, all downstream services (Adapters via Pub/Sub) respect the original sampling decision
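For reference, the same default expressed in code rather than env vars (`OTEL_TRACES_SAMPLER=parentbased_traceidratio`, `OTEL_TRACES_SAMPLER_ARG=0.01`) would look roughly like this with the Go SDK; the 0.01 ratio is the Production rate from the standard:

```go
package telemetry

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newSampler samples 1% of new traces, but always follows the parent's
// decision so distributed traces are either fully kept or fully dropped.
func newSampler() sdktrace.Sampler {
	return sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.01))
}
```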
hyperfleet/standards/tracing.md (Outdated)

> ### Always Sample
>
> Certain operations SHOULD always be sampled regardless of the base rate:
Is this possible with the head-based approach selected previously?
You're right - I've clarified this section.
With head-based sampling, you can only force 100% sampling for known operations at trace start (specific routes, resource types, debug headers).
Sampling based on outcomes (errors, latency) is not possible with head-based because the decision is made before the operation completes.
I've added a note explaining this limitation and that tail-based sampling would be required if we need to capture all errors/slow requests in the future.
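A sketch of what a head-based "always sample known operations" sampler could look like in Go; the `adapter.delete` prefix is a made-up example, and outcome-based rules (errors, latency) intentionally cannot be expressed here because the decision happens at span start:

```go
package telemetry

import (
	"strings"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

// debugSampler forces sampling for operations known at trace start (here:
// span names with a given prefix) and defers everything else to the base
// parent-based ratio sampler.
type debugSampler struct {
	base   sdktrace.Sampler
	prefix string
}

func (s debugSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	if strings.HasPrefix(p.Name, s.prefix) {
		psc := trace.SpanContextFromContext(p.ParentContext)
		return sdktrace.SamplingResult{
			Decision:   sdktrace.RecordAndSample,
			Tracestate: psc.TraceState(),
		}
	}
	return s.base.ShouldSample(p)
}

func (s debugSampler) Description() string { return "debugSampler" }

func newDebugSampler() sdktrace.Sampler {
	return debugSampler{
		base:   sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.01)),
		prefix: "adapter.delete", // hypothetical: always trace delete operations
	}
}
```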
- Expand Resource Attributes section with clear attribute table
- Clarify k8s.namespace.name as environment differentiator
- Add service.version build example using git SHA
- Update propagation flow diagram to include Adapter GET status call
- Add note explaining HTTP client span naming convention
- Add Cloud Provider Spans section with attributes (cloud.provider, cloud.service)
- Document reasoning for parentbased_traceidratio sampler choice
- Clarify head-based sampling limitations for outcome-based sampling
Actionable comments posted: 0
🧹 Nitpick comments (2)
hyperfleet/standards/tracing.md (2)
144-150: CloudEvents helper functions need implementation guidance.

Lines 144-150 reference `traceparentFromContext()` and `contextFromTraceparent()`, which are not shown. While a standards document doesn't need to provide all implementation details, developers implementing this may benefit from either a brief code example or a pointer to where these utilities come from (e.g., a shared helper library, community package, or custom implementation pattern). Consider adding a brief note or example showing how to extract/reconstruct trace context from traceparent strings, or reference an existing helper (e.g., from an OTel community library or internal utility).
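One plausible implementation of those helpers, assuming they are thin wrappers around the W3C `TraceContext` propagator; the names mirror the ones mentioned in the standard, but this is not the repo's actual code:

```go
package tracingutil

import (
	"context"

	"go.opentelemetry.io/otel/propagation"
)

var tc = propagation.TraceContext{}

// traceparentFromContext serializes the active span context into a W3C
// traceparent string, suitable for a CloudEvents extension attribute.
func traceparentFromContext(ctx context.Context) string {
	carrier := propagation.MapCarrier{}
	tc.Inject(ctx, carrier)
	return carrier.Get("traceparent")
}

// contextFromTraceparent rebuilds a context carrying the remote span context
// from a traceparent string received on a CloudEvent.
func contextFromTraceparent(ctx context.Context, traceparent string) context.Context {
	carrier := propagation.MapCarrier{"traceparent": traceparent}
	return tc.Extract(ctx, carrier)
}
```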
99-102: Clarify how Kubernetes namespace is populated in resource attributes.

The example at line 101 uses `$(NAMESPACE)` as a shell/templating variable. While the table at line 83 mentions "Helm/Downward API" as the source, the YAML snippet could be more explicit about how this variable is populated, e.g., whether it comes from a Downward API reference in the Helm chart or a hardcoded value. This will help teams implement it correctly across environments. Expand the YAML example to show the Downward API pattern or a Helm-specific approach, e.g.:
Example Kubernetes Downward API approach
env:
  - name: NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "service.version=$(VERSION),k8s.namespace.name=$(NAMESPACE)"
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- hyperfleet/components/adapter/framework/adapter-deployment.md
- hyperfleet/components/adapter/framework/adapter-frame-design.md
- hyperfleet/standards/error-model.md
- hyperfleet/standards/logging-specification.md
- hyperfleet/standards/metrics.md
- hyperfleet/standards/tracing.md
🚧 Files skipped from review as they are similar to previous changes (5)
- hyperfleet/components/adapter/framework/adapter-frame-design.md
- hyperfleet/standards/metrics.md
- hyperfleet/standards/logging-specification.md
- hyperfleet/components/adapter/framework/adapter-deployment.md
- hyperfleet/standards/error-model.md
🔇 Additional comments (1)
hyperfleet/standards/tracing.md (1)
1-547: Comprehensive tracing standard with clear OpenTelemetry adoption and practical guidance.

This document establishes a solid foundation for distributed tracing across HyperFleet components. The structure is logical, spanning goals, configuration, span requirements, sampling strategy, exporters, and integration patterns. OpenTelemetry patterns and semantic conventions are correctly applied throughout, and the past review feedback has been thoughtfully incorporated, particularly the clarified resource attributes (with build-time versioning and namespace-based environment differentiation), the trace flow diagram showing both API calls, and the detailed sampling rationale addressing the head-based limitations.

The configuration guidance is practical (environment variables, service names, Kubernetes examples), span naming conventions align with OTel standards, and the examples effectively illustrate patterns for context propagation, error handling, and span lifecycle. The logging integration section ties traces to logs via `trace_id`/`span_id`, and the best practices for span attributes (avoiding PII, high cardinality, large payloads) are sensible and well-highlighted.
/lgtm
ciaranRoche left a comment
/lgtm
Summary
Adds comprehensive tracing and telemetry standard for all HyperFleet components (API, Sentinel, Adapters).
New Document:
- hyperfleet/standards/tracing.md

Cross-Reference Updates:
- logging-specification.md
- metrics.md
- error-model.md
- adapter-frame-design.md
- adapter-deployment.md

Sampling Rates (Aligned)

Acceptance Criteria:
- hyperfleet/standards/tracing.md

Related
Summary by CodeRabbit