HYPERFLEET-383: Add tracing and telemetry standard #64
Conversation
Walkthrough

Added a new HyperFleet Tracing and Telemetry Standard at hyperfleet/standards/tracing.md describing OpenTelemetry-based distributed tracing (goals, SDK requirements, env var configuration, per-component service names, resource attributes, HTTP and CloudEvents propagation, required spans/attributes, sampling strategy with environment-specific rates, OTLP exporter defaults, deployment guidance, logging correlation, and span lifecycle examples). Updated adapter-frame-design.md sample rates (Staging 0.5 → 0.1; Production 0.1 → 0.01) and added tracing references across multiple docs and the logging specification. All changes are documentation-only except the new tracing.md file.

Sequence Diagram(s)

sequenceDiagram
rect rgb(248,249,251)
participant Client
participant Adapter
participant OTLP_Collector as "OTLP Collector"
participant Tracing_Backend as "Tracing Backend"
end
Client->>Adapter: HTTP request / CloudEvent (may carry trace context)
Note right of Adapter: start/continue spans\napply sampling, add attributes
Adapter->>OTLP_Collector: Export spans (OTLP gRPC/HTTP)
OTLP_Collector->>Tracing_Backend: Forward traces for storage/analysis
Tracing_Backend-->>OTLP_Collector: Ack / ingestion status
OTLP_Collector-->>Adapter: Export result (async)
Note left of Client: Logs/metrics correlated via trace IDs
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

Pre-merge checks: ✅ Passed checks (3 passed)
Force-pushed 985cc20 to ca745a3
Actionable comments posted: 0
🧹 Nitpick comments (4)
hyperfleet/standards/tracing.md (4)
54-64: Clarify usage of `TRACING_ENABLED` environment variable.

The config table introduces `TRACING_ENABLED` (line 63), but it is never referenced in code examples or exporter sections. Either remove it if unused, or add a brief explanation of when/how to use it and what happens when it is set to `false`.
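For illustration, here is a minimal sketch (not taken from the PR) of how `TRACING_ENABLED=false` could short-circuit SDK setup, assuming the standard intends it as an on/off switch for span export; the helper name and structure are hypothetical:

```go
package telemetry

import (
	"context"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// initTracing returns a no-op shutdown func when TRACING_ENABLED=false,
// otherwise it installs an OTLP-exporting tracer provider configured from
// the standard OTEL_* environment variables.
func initTracing(ctx context.Context) (func(context.Context) error, error) {
	if os.Getenv("TRACING_ENABLED") == "false" {
		// Tracing disabled: keep the default no-op global tracer provider.
		return func(context.Context) error { return nil }, nil
	}
	exp, err := otlptracegrpc.New(ctx) // endpoint etc. come from OTEL_EXPORTER_OTLP_* env vars
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	otel.SetTracerProvider(tp)
	return tp.Shutdown, nil
}
```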
282-289: Configuration example should include all three environments.

The example shows only Development and Production. For clarity, also include Staging with its 0.1 sampling rate to match the sampling strategy section.
🔎 Proposed addition
 # Development
 OTEL_TRACES_SAMPLER=always_on

+# Staging
+OTEL_TRACES_SAMPLER=parentbased_traceidratio
+OTEL_TRACES_SAMPLER_ARG=0.1
+
 # Production
 OTEL_TRACES_SAMPLER=parentbased_traceidratio
 OTEL_TRACES_SAMPLER_ARG=0.01
61-61: Baggage propagation mentioned in config but not documented in usage section.

The `OTEL_PROPAGATORS` config includes `baggage` (line 61), but there is no section explaining when/how to use W3C Baggage in HyperFleet spans or context. Consider either clarifying whether baggage propagation is required, or adding a brief section with examples.
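If the standard does adopt W3C Baggage, such a usage section could look roughly like the sketch below, using the OTel Go `baggage` package; the `hyperfleet.cluster_id` entry name is a hypothetical example, not something defined in the document:

```go
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel/baggage"
)

func main() {
	ctx := context.Background()

	// Producer side: attach a baggage entry. With the "baggage" propagator
	// configured, it is carried alongside the trace context.
	member, _ := baggage.NewMember("hyperfleet.cluster_id", "abc123")
	bag, _ := baggage.New(member)
	ctx = baggage.ContextWithBaggage(ctx, bag)

	// Consumer side (after extraction in a downstream service): read the entry.
	fmt.Println(baggage.FromContext(ctx).Member("hyperfleet.cluster_id").Value())
}
```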
303-342: Exporter configuration examples clear for both environments; consider adding TLS note for production.

The OTLP examples are practical for local development and Kubernetes. For completeness, consider a brief note in the Kubernetes section mentioning TLS configuration in production environments (though detailed TLS setup may belong in infrastructure docs).
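A possible shape for such a TLS note, sketched with the OTel Go OTLP gRPC exporter; the endpoint value is only an example and depends on the observability setup:

```go
package telemetry

import (
	"context"
	"crypto/tls"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"google.golang.org/grpc/credentials"
)

// newProductionExporter builds an OTLP gRPC exporter that verifies the
// collector's certificate instead of using an insecure connection.
func newProductionExporter(ctx context.Context, endpoint string) (*otlptrace.Exporter, error) {
	creds := credentials.NewTLS(&tls.Config{MinVersion: tls.VersionTLS12})
	return otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(endpoint), // e.g. "otel-collector.observability.svc:4317"
		otlptracegrpc.WithTLSCredentials(creds),
	)
}
```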
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- hyperfleet/components/adapter/framework/adapter-deployment.md
- hyperfleet/components/adapter/framework/adapter-frame-design.md
- hyperfleet/standards/error-model.md
- hyperfleet/standards/logging-specification.md
- hyperfleet/standards/metrics.md
- hyperfleet/standards/tracing.md
🚧 Files skipped from review as they are similar to previous changes (4)
- hyperfleet/standards/metrics.md
- hyperfleet/components/adapter/framework/adapter-deployment.md
- hyperfleet/components/adapter/framework/adapter-frame-design.md
- hyperfleet/standards/error-model.md
🔇 Additional comments (3)
hyperfleet/standards/logging-specification.md (1)
277-281: References section properly integrates tracing and metrics standards.

The new References section appropriately links to related standards and establishes bidirectional cross-references with the tracing and metrics documents.
hyperfleet/standards/tracing.md (2)
1-100: Excellent comprehensive tracing standard with clear OpenTelemetry guidance.

The document provides strong foundational guidance: vendor-neutral rationale, OTEL SDK requirements, configuration following conventions, proper W3C Trace Context propagation, and practical Go examples. Cross-references to logging and metrics are well-integrated.
254-300: Sampling strategy clearly explained with environment-specific rates.

The head-based vs tail-based comparison is helpful, and the environment-specific defaults (Dev 100%, Staging 10%, Prod 1%) align with the PR objectives and adapter-frame-design updates.
Force-pushed c532786 to 3adb216
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
hyperfleet/components/adapter/framework/adapter-frame-design.md (1)
1273-1273: Clarify which environment the sampling rate example applies to.

The comment "10% sampling in production" is misleading. Per the HyperFleet Tracing Standard, `0.1` (10%) is the Staging sampling rate; Production uses `0.01` (1%). Clarify whether this line shows a Staging example or needs adjustment.
🧹 Nitpick comments (1)
hyperfleet/standards/tracing.md (1)
1-484: Comprehensive tracing standard that establishes clear OpenTelemetry adoption.

The complete document is well-structured with clear sections on adoption, configuration, propagation, spans, attributes, sampling, exporters, and best practices. Practical Go code examples and YAML configurations make the standard actionable. The sampling rates (Dev 1.0, Staging 0.1, Prod 0.01) align with adapter-frame-design.md updates. Cross-references to related standards provide good context.
One suggestion for future work: Consider adding a troubleshooting section (perhaps as part of follow-up ticket HYPERFLEET-433) with common tracing issues, debugging tips, and sample queries for visualizing traces in observability backends.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- hyperfleet/components/adapter/framework/adapter-deployment.md
- hyperfleet/components/adapter/framework/adapter-frame-design.md
- hyperfleet/standards/error-model.md
- hyperfleet/standards/logging-specification.md
- hyperfleet/standards/metrics.md
- hyperfleet/standards/tracing.md
🚧 Files skipped from review as they are similar to previous changes (4)
- hyperfleet/standards/logging-specification.md
- hyperfleet/standards/metrics.md
- hyperfleet/components/adapter/framework/adapter-deployment.md
- hyperfleet/standards/error-model.md
🔇 Additional comments (9)
hyperfleet/components/adapter/framework/adapter-frame-design.md (1)
1341-1346: Sampling rates and tracing standard reference look good.

The updated sampling rates (Staging 0.1, Production 0.01) and reference to the HyperFleet Tracing Standard are appropriately added. This provides good cross-documentation clarity.
hyperfleet/standards/tracing.md (8)
1-47: OpenTelemetry adoption section is well-motivated and actionable.

Clear goals, appropriate non-goals, and specific SDK requirements with concrete Go imports make this a strong foundation for the standard.
50-84: Configuration section is comprehensive and follows OTEL conventions.

Environment variables, service names, and resource attributes are clearly documented and align with OpenTelemetry standards. The defaults and examples are practical for Kubernetes deployment.
86-152: Trace context propagation is correctly implemented across HTTP and CloudEvents.

W3C Trace Context usage is appropriate, code examples are practical, and the propagation flow diagram clearly illustrates how traces flow through Sentinel→API→Pub/Sub→Adapter. The CloudEvents extension approach aligns with standard practices.
157-208: Required spans follow semantic conventions with clear component-specific guidance.

Naming patterns are consistent and practical (HTTP: `{method} {route}`, Database: `{operation} {table}`, etc.). Coverage of API, Sentinel, and Adapter spans is comprehensive. The guidance to use attributes for high-cardinality values prevents span explosion.
211-265: Span attributes comprehensively combine semantic conventions with HyperFleet-specific attributes.

Attributes for HTTP, database, and messaging are well-documented. HyperFleet-specific attributes (cluster_id, adapter, etc.) use clear namespacing. The DO/DON'T guidance prevents common issues (sensitive data, high-cardinality names, large payloads).
268-313: Sampling strategy clearly justifies head-based approach with environment-specific rates.

Trade-offs between head-based and tail-based sampling are well-explained. The chosen head-based approach with parent-based trace ID ratio sampler is practical. Environment-specific rates (Dev 1.0, Staging 0.1, Prod 0.01) are well-justified. The "Always Sample" section appropriately recommends custom samplers for error/latency-based sampling.
317-357: OTLP exporter configuration covers Kubernetes and local development scenarios.

Code examples use standard Go SDK APIs. Kubernetes deployment shows appropriate service discovery to OpenTelemetry Collector. Local development guidance with console and Jaeger options is helpful for debugging. Note: the collector service name (otel-collector.observability.svc) may vary by deployment; ensure it matches your observability infrastructure setup.
360-473: Logging integration, error handling, and span lifecycle sections provide actionable best practices.

Trace-log correlation via trace_id/span_id is clearly demonstrated with JSON examples. Error handling guidance is practical with context attributes. Span lifecycle shows both correct patterns and anti-patterns (e.g., the "Bad" example with context.Background() clearly illustrates trace breakage). The context propagation emphasis throughout the call chain is critical and well-highlighted.
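For readers of this thread, a condensed sketch of the good/bad lifecycle pattern being described, assuming the span name `adapter.process` from the standard (the `cloudprovider.call` child name and helper functions here are made up for illustration):

```go
package worker

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/codes"
)

var tracer = otel.Tracer("hyperfleet/adapter")

// Good: the span is ended with defer and the span's context is passed on,
// so child spans join the same trace.
func reconcile(ctx context.Context) error {
	ctx, span := tracer.Start(ctx, "adapter.process")
	defer span.End()

	if err := callCloudProvider(ctx); err != nil { // propagate ctx, not context.Background()
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
		return err
	}
	return nil
}

func callCloudProvider(ctx context.Context) error {
	// Bad (anti-pattern): starting a child with context.Background() would
	// detach it from the parent trace: tracer.Start(context.Background(), ...)
	_, span := tracer.Start(ctx, "cloudprovider.call")
	defer span.End()
	return nil
}
```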
Force-pushed 3adb216 to c886495
Add comprehensive tracing standard document covering:
- OpenTelemetry SDK adoption and configuration
- W3C Trace Context propagation (HTTP and CloudEvents)
- Required spans for API, Sentinel, and Adapters
- Standard span attributes following semantic conventions
- Sampling strategy (head-based vs tail-based) with environment-specific rates
- OTLP exporter configuration
- Integration with logging via trace_id/span_id

Update related documents with cross-references:
- Add References section to logging-specification.md
- Add tracing reference to metrics.md and error-model.md
- Update adapter-frame-design.md with aligned sample rates and reference
- Add tracing and logging references to adapter-deployment.md

Follow-up ticket created: HYPERFLEET-433 (Claude plugin integration)

Co-Authored-By: Claude <noreply@anthropic.com>
Force-pushed c886495 to c68a672
hyperfleet/standards/tracing.md (Outdated)

> ### Resource Attributes
>
> Components SHOULD include these resource attributes via `OTEL_RESOURCE_ATTRIBUTES`:
Is a list missing here?

I guess the attributes are the ones in the example below?

If we want to avoid duplication, the service.version could be generated out of the build SHA. This is how it is now in hyperfleet-api:

build_version := $(git_sha)$(git_dirty)
ldflags = -X github.com/openshift-hyperfleet/hyperfleet-api/pkg/api.Version=$(build_version)

The namespace can come from the Helm deployment.

But I wonder where the environment will come from. Would different environments collect different OTEL data? I mean:

- Some production infra will collect OTEL data from production apps
- Some dev/stage infra will collect OTEL data possibly from multiple "environments"
- But in that case, could we use the `k8s.namespace.name` to differentiate the environment? e.g. `hyperfleet-stage`, `hyperfleet-shared`, `hyperfleet-rafabenede`
I've updated the section to:
- Added a clear attribute table listing service.version and k8s.namespace.name with their sources
- Clarified that k8s.namespace.name serves as the environment differentiator - different namespaces (e.g., hyperfleet-prod, hyperfleet-stage, hyperfleet-dev-rafael) naturally isolate traces in multi-tenant deployments
- Added a build example showing how to use git SHA for service.version (matching the existing pattern in hyperfleet-api)
- Removed deployment.environment as a required attribute since namespace-based isolation is sufficient for our use case
This aligns with your suggestion - we use k8s.namespace.name to differentiate environments rather than a separate deployment.environment attribute.
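A sketch of how the agreed approach could look in Go, assuming `Version` is injected via `-ldflags` at build time and `NAMESPACE` via the Downward API; the package name, env var name, and semconv version path are illustrative, not taken from the repo:

```go
package telemetry

import (
	"os"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// Version is overridden at build time, e.g.:
//   go build -ldflags "-X <module>/telemetry.Version=$(git rev-parse --short HEAD)"
var Version = "dev"

// newResource attaches service.version and k8s.namespace.name so traces from
// different namespaces (environments) are naturally separated in the backend.
func newResource() (*resource.Resource, error) {
	return resource.Merge(resource.Default(), resource.NewWithAttributes(
		semconv.SchemaURL,
		semconv.ServiceVersion(Version),
		// Namespace injected by the Helm chart / Downward API; used to
		// differentiate environments instead of deployment.environment.
		attribute.String("k8s.namespace.name", os.Getenv("NAMESPACE")),
	))
}
```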
> **Note:** Sentinel initiates traces during polling cycles. The trace context propagates through CloudEvents to Adapters, which include it when updating status back to the API.
nitpick: There is an additional call from adapters to API to get the cluster/nodepool status
Good point! Updated the diagram to show both API calls from the Adapter:
- GET status - to fetch current cluster/nodepool status
- PATCH status - to update status after reconciliation
Both calls propagate the trace context.
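As a sketch of what "both calls propagate the trace context" can mean in Go, assuming the adapter uses the `otelhttp` contrib package for client instrumentation (function names and URLs are hypothetical):

```go
package adapter

import (
	"context"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// client wraps the default transport so outgoing requests get client spans
// and the traceparent header injected automatically.
var client = &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}

// getStatus and patchStatus both pass ctx, so the GET and the later PATCH
// join the trace that arrived with the CloudEvent.
func getStatus(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	return client.Do(req)
}

func patchStatus(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPatch, url, nil)
	if err != nil {
		return nil, err
	}
	return client.Do(req)
}
```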
> | Poll cycle | `sentinel.{operation}` | `sentinel.poll` |
> | Decision evaluation | `sentinel.{operation}` | `sentinel.evaluate` |
> | Event publish | `{destination} {operation}` | `hyperfleet-clusters publish` |
> | API call | `{method}` | `GET` |
Is the API name specified somehow?
How do we differentiate if Sentinel is doing 2 API calls to 2 different APIs?
I mean, for the Event publish, the destination is specified, but not for API calls?
Good question! Per https://opentelemetry.io/docs/specs/semconv/http/http-spans/#http-client, HTTP client span names should only include the method (e.g., GET), not the destination.
To differentiate calls to different APIs, use the span attributes:
- server.address - the target API hostname
- url.path - the request path
I've added a note clarifying this and referencing the HTTP Spans attributes section.
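A brief sketch of that convention: the span name stays just `GET`, and the target API is distinguished via attributes (host and path values below are hypothetical):

```go
package sentinel

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("hyperfleet/sentinel")

// callAPI starts a client span named after the method only; the target API
// is identified by server.address and url.path attributes.
func callAPI(ctx context.Context) {
	_, span := tracer.Start(ctx, "GET",
		trace.WithSpanKind(trace.SpanKindClient),
		trace.WithAttributes(
			attribute.String("server.address", "hyperfleet-api.hyperfleet.svc"),
			attribute.String("url.path", "/api/v1/clusters"),
		),
	)
	defer span.End()
	// ... perform the request ...
}
```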
> | Event receive | `{destination} {operation}` | `hyperfleet-clusters receive` |
> | Event process | `adapter.{operation}` | `adapter.process` |
> | Cloud provider call | `{method}` | `POST` |
Should the cloud service being called also be specified? bucket, pubsub, gke....
Good point! I've added:
- A note in the Adapters section referencing the new Cloud Provider Spans attributes
- A new Cloud Provider Spans section with attributes to identify the cloud service:
- cloud.provider - the cloud provider (gcp, aws, azure)
- cloud.service - the service name (gke, storage, pubsub)
- server.address - the API endpoint
- cloud.resource_id - the resource identifier
Also included a Go code example showing how to set these attributes for a GKE API call.
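The referenced Go example is not visible in this thread; a hedged approximation of what it might look like, using plain `attribute.String` calls for the listed attributes (the endpoint and span name are illustrative):

```go
package adapter

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("hyperfleet/adapter-gcp")

// createGKECluster shows a cloud provider call span tagged so different
// services (GKE, Pub/Sub, Storage) can be told apart in the backend.
func createGKECluster(ctx context.Context) {
	_, span := tracer.Start(ctx, "POST",
		trace.WithSpanKind(trace.SpanKindClient),
		trace.WithAttributes(
			attribute.String("cloud.provider", "gcp"),
			attribute.String("cloud.service", "gke"),
			attribute.String("server.address", "container.googleapis.com"),
		),
	)
	defer span.End()
	// ... call the GKE API ...
}
```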
hyperfleet/standards/tracing.md (Outdated)

> ### Default: Parent-Based Trace ID Ratio
>
> HyperFleet uses `parentbased_traceidratio` as the default sampler:
Is there any reasoning for this decision?
ChatGPT says:
Most OTel SDKs and the OTel Collector docs suggest this as a default production sampler.
which could be a valid reason.
I've added a "Why this sampler" section explaining the reasoning:
- Recommended by OpenTelemetry - Default production sampler suggested by OTel SDKs and Collector docs
- Trace completeness - Ensures all spans in a trace are sampled together, avoiding broken traces
- Distributed consistency - When Sentinel starts a trace, all downstream services (Adapters via Pub/Sub) respect the original sampling decision
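For reference, the same default expressed in code rather than env vars (`OTEL_TRACES_SAMPLER=parentbased_traceidratio`, `OTEL_TRACES_SAMPLER_ARG=0.01`) would look roughly like this with the Go SDK; the 0.01 ratio is the Production rate from the standard:

```go
package telemetry

import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// newSampler samples 1% of new traces, but always follows the parent's
// decision so distributed traces are either fully kept or fully dropped.
func newSampler() sdktrace.Sampler {
	return sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.01))
}
```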
hyperfleet/standards/tracing.md (Outdated)

> ### Always Sample
>
> Certain operations SHOULD always be sampled regardless of the base rate:
Is this possible with the head-based approach selected previously?
You're right - I've clarified this section.
With head-based sampling, you can only force 100% sampling for known operations at trace start (specific routes, resource types, debug headers).
Sampling based on outcomes (errors, latency) is not possible with head-based because the decision is made before the operation completes.
I've added a note explaining this limitation and that tail-based sampling would be required if we need to capture all errors/slow requests in the future.
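A sketch of what a head-based "always sample known operations" sampler could look like in Go; the `adapter.delete` prefix is a made-up example, and outcome-based rules (errors, latency) intentionally cannot be expressed here because the decision happens at span start:

```go
package telemetry

import (
	"strings"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

// debugSampler forces sampling for operations known at trace start (here:
// span names with a given prefix) and defers everything else to the base
// parent-based ratio sampler.
type debugSampler struct {
	base   sdktrace.Sampler
	prefix string
}

func (s debugSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	if strings.HasPrefix(p.Name, s.prefix) {
		psc := trace.SpanContextFromContext(p.ParentContext)
		return sdktrace.SamplingResult{
			Decision:   sdktrace.RecordAndSample,
			Tracestate: psc.TraceState(),
		}
	}
	return s.base.ShouldSample(p)
}

func (s debugSampler) Description() string { return "debugSampler" }

func newDebugSampler() sdktrace.Sampler {
	return debugSampler{
		base:   sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.01)),
		prefix: "adapter.delete", // hypothetical: always trace delete operations
	}
}
```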
- Expand Resource Attributes section with clear attribute table
- Clarify k8s.namespace.name as environment differentiator
- Add service.version build example using git SHA
- Update propagation flow diagram to include Adapter GET status call
- Add note explaining HTTP client span naming convention
- Add Cloud Provider Spans section with attributes (cloud.provider, cloud.service)
- Document reasoning for parentbased_traceidratio sampler choice
- Clarify head-based sampling limitations for outcome-based sampling
Actionable comments posted: 0
🧹 Nitpick comments (2)
hyperfleet/standards/tracing.md (2)
144-150: CloudEvents helper functions need implementation guidance.

Lines 144-150 reference `traceparentFromContext()` and `contextFromTraceparent()`, which are not shown. While a standards document doesn't need to provide all implementation details, developers implementing this may benefit from either a brief code example or a pointer to where these utilities come from (e.g., a shared helper library, community package, or custom implementation pattern). Consider adding a brief note or example showing how to extract/reconstruct trace context from traceparent strings, or reference an existing helper (e.g., from an OTel community library or internal utility).
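One plausible implementation of those helpers, assuming they are thin wrappers around the W3C `TraceContext` propagator; the names mirror the ones mentioned in the standard, but this is not the repo's actual code:

```go
package tracingutil

import (
	"context"

	"go.opentelemetry.io/otel/propagation"
)

var tc = propagation.TraceContext{}

// traceparentFromContext serializes the active span context into a W3C
// traceparent string, suitable for a CloudEvents extension attribute.
func traceparentFromContext(ctx context.Context) string {
	carrier := propagation.MapCarrier{}
	tc.Inject(ctx, carrier)
	return carrier.Get("traceparent")
}

// contextFromTraceparent rebuilds a context carrying the remote span context
// from a traceparent string received on a CloudEvent.
func contextFromTraceparent(ctx context.Context, traceparent string) context.Context {
	carrier := propagation.MapCarrier{"traceparent": traceparent}
	return tc.Extract(ctx, carrier)
}
```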
99-102: Clarify how Kubernetes namespace is populated in resource attributes.

The example at line 101 uses `$(NAMESPACE)` as a shell/templating variable. While the table at line 83 mentions "Helm/Downward API" as the source, the YAML snippet could be more explicit about how this variable is populated, e.g., whether it comes from a Downward API reference in the Helm chart or a hardcoded value. This will help teams implement it correctly across environments. Expand the YAML example to show the Downward API pattern or a Helm-specific approach, e.g.:
Example Kubernetes Downward API approach
env:
  - name: NAMESPACE
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "service.version=$(VERSION),k8s.namespace.name=$(NAMESPACE)"
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- hyperfleet/components/adapter/framework/adapter-deployment.md
- hyperfleet/components/adapter/framework/adapter-frame-design.md
- hyperfleet/standards/error-model.md
- hyperfleet/standards/logging-specification.md
- hyperfleet/standards/metrics.md
- hyperfleet/standards/tracing.md
🚧 Files skipped from review as they are similar to previous changes (5)
- hyperfleet/components/adapter/framework/adapter-frame-design.md
- hyperfleet/standards/metrics.md
- hyperfleet/standards/logging-specification.md
- hyperfleet/components/adapter/framework/adapter-deployment.md
- hyperfleet/standards/error-model.md
🔇 Additional comments (1)
hyperfleet/standards/tracing.md (1)
1-547: Comprehensive tracing standard with clear OpenTelemetry adoption and practical guidance.

This document establishes a solid foundation for distributed tracing across HyperFleet components. The structure is logical, spanning goals, configuration, span requirements, sampling strategy, exporters, and integration patterns. OpenTelemetry patterns and semantic conventions are correctly applied throughout, and the past review feedback has been thoughtfully incorporated, particularly the clarified resource attributes (with build-time versioning and namespace-based environment differentiation), the trace flow diagram showing both API calls, and the detailed sampling rationale addressing the head-based limitations.

The configuration guidance is practical (environment variables, service names, Kubernetes examples), span naming conventions align with OTel standards, and the examples effectively illustrate patterns for context propagation, error handling, and span lifecycle. The logging integration section ties traces to logs via `trace_id`/`span_id`, and the best practices for span attributes (avoiding PII, high cardinality, large payloads) are sensible and well-highlighted.
/lgtm
ciaranRoche left a comment
/lgtm
Summary
Adds comprehensive tracing and telemetry standard for all HyperFleet components (API, Sentinel, Adapters).
New Document:
- hyperfleet/standards/tracing.md

Cross-Reference Updates:
- logging-specification.md
- metrics.md
- error-model.md
- adapter-frame-design.md
- adapter-deployment.md

Sampling Rates (Aligned)

Acceptance Criteria:
- hyperfleet/standards/tracing.md

Related
Summary by CodeRabbit