Skip to content

Add OpenTelemetry SDK as a parallel option to ddtrace#59

Merged
prim-8 merged 6 commits into
mainfrom
feature/otel-sdk-parallel-option
Apr 28, 2026
Merged

Add OpenTelemetry SDK as a parallel option to ddtrace#59
prim-8 merged 6 commits into
mainfrom
feature/otel-sdk-parallel-option

Conversation

@prim-8
Copy link
Copy Markdown
Contributor

@prim-8 prim-8 commented Apr 28, 2026

Summary

  • Adds a parallel OTEL_TRACING_ENABLED runtime flag and INSTALL_OTEL Dockerfile build arg for shipping traces to any OTLP/HTTP backend (Tempo, Jaeger, Honeycomb, Grafana Cloud, etc.).
  • ddtrace path is unchanged. Deployers who run Datadog keep the native experience; deployers running OSS observability stacks now have a first-class path.
  • The two SDKs are mutually exclusive at runtime — ddtrace wins if both flags are set. Both can coexist as installed packages.

Motivation

Tracing in primitivemail has so far been a single path: install ddtrace, set DATADOG_TRACING_ENABLED=true, ship to a Datadog Agent. Self-hosters who don't run Datadog inherited a dead trace pipeline. Adding vanilla OpenTelemetry alongside (not replacing) ddtrace gives the OSS deployment story a real tracing option without forcing existing Datadog users onto an OTLP relay.

Configuration

Build:

  • --build-arg INSTALL_DDTRACE=true — install ddtrace (existing)
  • --build-arg INSTALL_OTEL=true — install opentelemetry-{api,sdk,exporter-otlp-proto-http}

Runtime — pick one (don't set both):

```bash

Datadog (unchanged):

DATADOG_TRACING_ENABLED=true
DD_SERVICE=milter
DD_ENV=...
DD_TRACE_AGENT_URL=http://localhost:8126

OpenTelemetry:

OTEL_TRACING_ENABLED=true
OTEL_SERVICE_NAME=milter
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4318
OTEL_PROPAGATORS=tracecontext,baggage # default
```

Both SDKs speak W3C tracecontext on the wire, so cross-service trace continuity works regardless of which is selected.

Implementation note

Rather than branching every `tracer.trace()` / `span.set_tag()` / `HTTPPropagator.inject()` call site, the OTel path is wrapped in a thin shim that exposes the same surface ddtrace does. This keeps the diff confined to module-level setup and one `entrypoint.sh` block — no behavior change for the ddtrace path.

The shim manages OTel context attach/detach manually so child spans (storage, webhook) inherit the parent email span — ddtrace's `tracer.trace()` makes the new span active; OTel's `start_span()` does not.

Backwards compatibility

  • `INSTALL_DDTRACE=false` (default) and `DATADOG_TRACING_ENABLED=false` (default) — no change.
  • `INSTALL_DDTRACE=true` only and `DATADOG_TRACING_ENABLED=true` — identical to today.
  • Both flags set — ddtrace wins, OTel branch skipped, log notes which provider loaded.

Test plan

  • `pytest` — existing tests pass (tracing branches default off; module-level changes are import-gated).
  • Build with `INSTALL_DDTRACE=true INSTALL_OTEL=true`, run with `DATADOG_TRACING_ENABLED=true` — confirm Datadog APM shows the milter spans (no behavior change).
  • Same image, run with `OTEL_TRACING_ENABLED=true` + `OTEL_EXPORTER_OTLP_ENDPOINT=...` pointing at a local OTel collector — confirm traces arrive with same span names (`milter.process_email`, `milter.storage_upload`, `milter.webhook_call`) and proper parent/child structure.
  • Outbound webhook from the OTel run carries a `traceparent` header and the downstream service joins the trace.
  • Log line `APM tracing enabled (provider=...)` reports the correct provider.

Deployer picks one at runtime via DATADOG_TRACING_ENABLED or OTEL_TRACING_ENABLED.
ddtrace stays first-class for users who want native APM; OpenTelemetry SDK
opens any OTLP/HTTP backend (Tempo, Jaeger, Honeycomb, Grafana Cloud) for
users who don't.

OTel exposes the same call surface as ddtrace via a thin shim, so existing
tracer.trace() / span.set_tag() / HTTPPropagator.inject() call sites are
unchanged.
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

This PR introduces a parallel OpenTelemetry tracing path alongside the existing ddtrace integration, controlled by a new OTEL_TRACING_ENABLED runtime flag and INSTALL_OTEL build arg. A ddtrace-shaped shim (_OtelTracerShim / _OtelSpanShim) wraps the OTel SDK so that call sites (tracer.trace, span.set_tag, span.finish, HTTPPropagator.inject) remain unchanged. All concerns raised in previous review threads — atexit + SIGTERM span flushing, OTel error status propagation, and misconfiguration logging — are addressed in this revision.

Confidence Score: 5/5

Safe to merge — all previous P1 findings are addressed and no new blocking issues found.

All three P1 issues from previous review threads (BatchSpanProcessor daemon-thread flush on exit, SIGTERM not running atexit, and OTel error status not propagated to backends) are explicitly resolved in this revision. The signal-chaining logic uses a factory closure to correctly capture per-signal previous handlers. The deferred error-status pattern in _OtelSpanShim.finish() correctly handles the OTel set_status lock-after-first-call constraint. No new P1 issues were identified.

No files require special attention.

Important Files Changed

Filename Overview
milter/primitivemail_milter.py Core OTel shim implementation: TracerProvider setup, BatchSpanProcessor with atexit + SIGTERM flush chain, _OtelSpanShim with deferred error status, and comprehensive misconfiguration logging — all well-implemented.
Dockerfile Adds optional INSTALL_OTEL build arg that pip-installs opentelemetry-api/sdk/exporter-otlp-proto-http with a pinned minor-version range; default is false so existing images are unaffected.
milter/entrypoint.sh Adds OTEL_TRACING_ENABLED shell block that sets OTEL_SERVICE_NAME, OTEL_EXPORTER_OTLP_ENDPOINT, and OTEL_PROPAGATORS defaults; comment clarifies ddtrace-wins precedence rule.

Sequence Diagram

sequenceDiagram
    participant EP as entrypoint.sh
    participant PY as primitivemail_milter.py (import)
    participant OTel as OTel SDK
    participant Main as main()
    participant Milter as PrimitiveMailMilter
    participant Collector as OTLP Collector

    EP->>PY: OTEL_TRACING_ENABLED=true
    PY->>OTel: TracerProvider + BatchSpanProcessor + OTLPSpanExporter
    PY->>PY: atexit.register(provider.shutdown)
    PY->>PY: define _install_otel_shutdown_signals()

    PY->>Main: module loaded
    Main->>Main: signal.signal(SIGHUP, reload_config)
    Main->>Main: _install_otel_shutdown_signals() [chains SIGTERM/SIGINT]

    Milter->>OTel: tracer.trace("milter.process_email") → attach ctx_token
    Milter->>OTel: tracer.trace("milter.storage_upload") → child span
    Milter->>Milter: HTTPPropagator.inject(span.context, headers) [W3C traceparent]
    Milter->>OTel: storage_span.finish() → set_status(ERROR?) + span.end() + detach
    Milter->>OTel: tracer.trace("milter.webhook_call") → child span
    Milter->>OTel: webhook_span.finish() → set_status(ERROR?) + span.end() + detach
    Milter->>OTel: process_email span.finish() → detach root ctx_token

    OTel-->>Collector: BatchSpanProcessor flush (background thread)

    Note over Main,OTel: On SIGTERM (docker stop)
    Main->>OTel: signal handler → provider.shutdown() [force flush]
    Main->>Main: re-raise SIG_DFL → sys.exit
    Main->>OTel: atexit fires → provider.shutdown() [idempotent]
Loading

Reviews (6): Last reviewed commit: "Address Greptile round 5: SpanKind on ou..." | Re-trigger Greptile

Comment thread milter/primitivemail_milter.py
Comment thread milter/primitivemail_milter.py
… precedence

- Register atexit hook for the OTel TracerProvider so BatchSpanProcessor
  flushes its in-flight buffer on container stop / process exit. Without
  it, spans buffered in the last export window are silently dropped.
- When both DATADOG_TRACING_ENABLED and OTEL_TRACING_ENABLED are set
  and ddtrace loads successfully, log a warning that ddtrace took
  precedence and OTel was not initialised — previously this was silent.
Comment thread milter/primitivemail_milter.py
- _OtelSpanShim.set_tag now translates ddtrace's set_tag('error', True)
  to span.set_status(StatusCode.ERROR), and set_tag('error.message', ...)
  to set_status with the error description. Without this, errored spans
  rendered as successful in Tempo/Jaeger/Honeycomb because OTel reads
  status, not a magic attribute, to flag failures.
- _OtelTracerShim.trace() no longer copies the service arg onto the span
  as service.name. service.name belongs on the Resource (set via
  OTEL_SERVICE_NAME env var); setting it per-span is redundant and can
  confuse backends that key service identity off the Resource.
Comment thread milter/primitivemail_milter.py
prim-8 added 3 commits April 28, 2026 11:14
…warning

- atexit alone doesn't fire on SIGTERM (what 'docker stop' delivers),
  so the BatchSpanProcessor still dropped its last buffer on graceful
  container stop. Install a SIGTERM (and SIGINT) handler that flushes
  the OTel provider, then chains to whatever handler was previously
  installed — preserves any pymilter/app-level signal handling.
- When DATADOG_TRACING_ENABLED=true is set but ddtrace isn't installed,
  the OTel path quietly takes over (assuming OTEL_TRACING_ENABLED=true
  and the OTel packages are present). Log a warning so an operator
  expecting Datadog APM doesn't have to read source to find out why
  spans landed in the OTel backend.
…ls from main()

- _OtelSpanShim.set_tag now caches the error flag and message instead of
  emitting set_status eagerly. OTel locks the status after the first
  non-UNSET set_status call, so the ddtrace pattern of set_tag('error',
  True) followed by set_tag('error.message', msg) was silently dropping
  the description. finish() now emits a single combined status from the
  cached components so order doesn't matter.
- The SIGTERM/SIGINT flush chain is now installed from main() after the
  SIGHUP handler is wired, instead of at module load. Captures whatever
  handler is bound at call time and chains to it. atexit remains the
  belt-and-braces backup for the normal-exit path.
…ions

- _OtelTracerShim now infers SpanKind from the operation name:
  process_email → SERVER, storage_upload + webhook_call → CLIENT, all
  others → INTERNAL. OTel UIs and trace queries that filter by kind
  will now see the outbound HTTP spans as client calls instead of
  generic INTERNAL.
- Three error paths that called set_tag('error', True) without a paired
  set_tag('error.message', ...) now include the description: storage
  HTTP non-2xx response, webhook network error (timeout/connection),
  webhook unexpected exception. The Status.description is what surfaces
  the failure reason in OTel backends; the bare-error path was leaving
  it blank.
@prim-8 prim-8 merged commit 56deae6 into main Apr 28, 2026
8 checks passed
@prim-8 prim-8 deleted the feature/otel-sdk-parallel-option branch April 28, 2026 19:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant