Add OpenTelemetry SDK as a parallel option to ddtrace#59
Conversation
Deployer picks one at runtime via DATADOG_TRACING_ENABLED or OTEL_TRACING_ENABLED. ddtrace stays first-class for users who want native APM; OpenTelemetry SDK opens any OTLP/HTTP backend (Tempo, Jaeger, Honeycomb, Grafana Cloud) for users who don't. OTel exposes the same call surface as ddtrace via a thin shim, so existing tracer.trace() / span.set_tag() / HTTPPropagator.inject() call sites are unchanged.
Greptile SummaryThis PR introduces a parallel OpenTelemetry tracing path alongside the existing ddtrace integration, controlled by a new Confidence Score: 5/5Safe to merge — all previous P1 findings are addressed and no new blocking issues found. All three P1 issues from previous review threads (BatchSpanProcessor daemon-thread flush on exit, SIGTERM not running atexit, and OTel error status not propagated to backends) are explicitly resolved in this revision. The signal-chaining logic uses a factory closure to correctly capture per-signal previous handlers. The deferred error-status pattern in _OtelSpanShim.finish() correctly handles the OTel set_status lock-after-first-call constraint. No new P1 issues were identified. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant EP as entrypoint.sh
participant PY as primitivemail_milter.py (import)
participant OTel as OTel SDK
participant Main as main()
participant Milter as PrimitiveMailMilter
participant Collector as OTLP Collector
EP->>PY: OTEL_TRACING_ENABLED=true
PY->>OTel: TracerProvider + BatchSpanProcessor + OTLPSpanExporter
PY->>PY: atexit.register(provider.shutdown)
PY->>PY: define _install_otel_shutdown_signals()
PY->>Main: module loaded
Main->>Main: signal.signal(SIGHUP, reload_config)
Main->>Main: _install_otel_shutdown_signals() [chains SIGTERM/SIGINT]
Milter->>OTel: tracer.trace("milter.process_email") → attach ctx_token
Milter->>OTel: tracer.trace("milter.storage_upload") → child span
Milter->>Milter: HTTPPropagator.inject(span.context, headers) [W3C traceparent]
Milter->>OTel: storage_span.finish() → set_status(ERROR?) + span.end() + detach
Milter->>OTel: tracer.trace("milter.webhook_call") → child span
Milter->>OTel: webhook_span.finish() → set_status(ERROR?) + span.end() + detach
Milter->>OTel: process_email span.finish() → detach root ctx_token
OTel-->>Collector: BatchSpanProcessor flush (background thread)
Note over Main,OTel: On SIGTERM (docker stop)
Main->>OTel: signal handler → provider.shutdown() [force flush]
Main->>Main: re-raise SIG_DFL → sys.exit
Main->>OTel: atexit fires → provider.shutdown() [idempotent]
Reviews (6): Last reviewed commit: "Address Greptile round 5: SpanKind on ou..." | Re-trigger Greptile |
… precedence - Register atexit hook for the OTel TracerProvider so BatchSpanProcessor flushes its in-flight buffer on container stop / process exit. Without it, spans buffered in the last export window are silently dropped. - When both DATADOG_TRACING_ENABLED and OTEL_TRACING_ENABLED are set and ddtrace loads successfully, log a warning that ddtrace took precedence and OTel was not initialised — previously this was silent.
- _OtelSpanShim.set_tag now translates ddtrace's set_tag('error', True)
to span.set_status(StatusCode.ERROR), and set_tag('error.message', ...)
to set_status with the error description. Without this, errored spans
rendered as successful in Tempo/Jaeger/Honeycomb because OTel reads
status, not a magic attribute, to flag failures.
- _OtelTracerShim.trace() no longer copies the service arg onto the span
as service.name. service.name belongs on the Resource (set via
OTEL_SERVICE_NAME env var); setting it per-span is redundant and can
confuse backends that key service identity off the Resource.
…warning - atexit alone doesn't fire on SIGTERM (what 'docker stop' delivers), so the BatchSpanProcessor still dropped its last buffer on graceful container stop. Install a SIGTERM (and SIGINT) handler that flushes the OTel provider, then chains to whatever handler was previously installed — preserves any pymilter/app-level signal handling. - When DATADOG_TRACING_ENABLED=true is set but ddtrace isn't installed, the OTel path quietly takes over (assuming OTEL_TRACING_ENABLED=true and the OTel packages are present). Log a warning so an operator expecting Datadog APM doesn't have to read source to find out why spans landed in the OTel backend.
…ls from main()
- _OtelSpanShim.set_tag now caches the error flag and message instead of
emitting set_status eagerly. OTel locks the status after the first
non-UNSET set_status call, so the ddtrace pattern of set_tag('error',
True) followed by set_tag('error.message', msg) was silently dropping
the description. finish() now emits a single combined status from the
cached components so order doesn't matter.
- The SIGTERM/SIGINT flush chain is now installed from main() after the
SIGHUP handler is wired, instead of at module load. Captures whatever
handler is bound at call time and chains to it. atexit remains the
belt-and-braces backup for the normal-exit path.
…ions
- _OtelTracerShim now infers SpanKind from the operation name:
process_email → SERVER, storage_upload + webhook_call → CLIENT, all
others → INTERNAL. OTel UIs and trace queries that filter by kind
will now see the outbound HTTP spans as client calls instead of
generic INTERNAL.
- Three error paths that called set_tag('error', True) without a paired
set_tag('error.message', ...) now include the description: storage
HTTP non-2xx response, webhook network error (timeout/connection),
webhook unexpected exception. The Status.description is what surfaces
the failure reason in OTel backends; the bare-error path was leaving
it blank.
Summary
OTEL_TRACING_ENABLEDruntime flag andINSTALL_OTELDockerfile build arg for shipping traces to any OTLP/HTTP backend (Tempo, Jaeger, Honeycomb, Grafana Cloud, etc.).Motivation
Tracing in primitivemail has so far been a single path: install ddtrace, set
DATADOG_TRACING_ENABLED=true, ship to a Datadog Agent. Self-hosters who don't run Datadog inherited a dead trace pipeline. Adding vanilla OpenTelemetry alongside (not replacing) ddtrace gives the OSS deployment story a real tracing option without forcing existing Datadog users onto an OTLP relay.Configuration
Build:
--build-arg INSTALL_DDTRACE=true— install ddtrace (existing)--build-arg INSTALL_OTEL=true— installopentelemetry-{api,sdk,exporter-otlp-proto-http}Runtime — pick one (don't set both):
```bash
Datadog (unchanged):
DATADOG_TRACING_ENABLED=true
DD_SERVICE=milter
DD_ENV=...
DD_TRACE_AGENT_URL=http://localhost:8126
OpenTelemetry:
OTEL_TRACING_ENABLED=true
OTEL_SERVICE_NAME=milter
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4318
OTEL_PROPAGATORS=tracecontext,baggage # default
```
Both SDKs speak W3C tracecontext on the wire, so cross-service trace continuity works regardless of which is selected.
Implementation note
Rather than branching every `tracer.trace()` / `span.set_tag()` / `HTTPPropagator.inject()` call site, the OTel path is wrapped in a thin shim that exposes the same surface ddtrace does. This keeps the diff confined to module-level setup and one `entrypoint.sh` block — no behavior change for the ddtrace path.
The shim manages OTel context attach/detach manually so child spans (storage, webhook) inherit the parent email span — ddtrace's `tracer.trace()` makes the new span active; OTel's `start_span()` does not.
Backwards compatibility
Test plan