feat(tracing): Support sending traces to a generic OTLP trace endpoint #12223
ringerc wants to merge 11 commits into
Walkthrough
Adds a new OTLP tracing stack (OTLPTracer + OTLPTracerBase) for generic OTLP endpoints, refactors existing tracers to reuse the OTLP base and a shared provider lifecycle, centralizes provider shutdown/reset helpers, and adds extensive OTLP and W3C trace-context tests and minor setup cleanup.
Sequence Diagram(s)
sequenceDiagram
participant App as Application
participant OTLP as OTLPTracer
participant Provider as TracerProvider
participant Exporter as OTLP Exporter
participant Collector as OTLP Endpoint
App->>OTLP: __init__(trace_name, trace_type, ...)
OTLP->>OTLP: _validate_otlp_env()
OTLP->>Provider: get/create shared TracerProvider
OTLP->>Provider: start root span (store carrier)
App->>OTLP: add_trace(trace_id, name, inputs, metadata)
OTLP->>OTLP: _convert_to_otlp_dict(inputs/metadata)
OTLP->>Provider: start child span (using injected context)
App->>OTLP: end_trace(trace_id, outputs, logs, error)
OTLP->>OTLP: _convert_to_otlp_dict(outputs/logs)
OTLP->>Provider: set attributes/record_exception/end child span
App->>OTLP: end(inputs, outputs, ...)
OTLP->>Provider: set root attributes/record_exception/end root span
OTLP->>Provider: force_flush()/shutdown (shutdown_* helper)
Provider->>Exporter: export spans
Exporter->>Collector: send (gRPC or HTTP/protobuf)
sequenceDiagram
participant Client as HTTP Client
participant FastAPI as Server
participant Propagator as TraceContext Propagator
participant OTLP as OTLPTracer
participant Exporter as OTLP Exporter
Client->>FastAPI: Request + traceparent header
FastAPI->>Propagator: extract(traceparent)
Propagator->>FastAPI: set active context
FastAPI->>OTLP: instantiate tracer
OTLP->>OTLP: root span created inheriting active context
FastAPI->>OTLP: add_trace(...)
OTLP->>OTLP: child span created with parent linkage
FastAPI->>OTLP: end_trace(...), end(...)
OTLP->>Exporter: spans exported with preserved trace_id/parent_id
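The second diagram depends on the W3C `traceparent` request header. As a rough illustration of what a propagator reads out of that header, the sketch below parses a version-00 value using only the standard library. `TraceParent` and `parse_traceparent` are hypothetical names for this example; they are not part of Langflow or the OpenTelemetry SDK, which ships its own propagator.

```python
import re
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str   # 32 hex chars, shared by every span in the trace
    parent_id: str  # 16 hex chars, the caller's span id
    sampled: bool   # lowest bit of the flags byte


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero trace or parent ids are invalid per the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

A root span that inherits this context would reuse `trace_id` and record `parent_id` as its parent, which is the linkage the diagram describes.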
Estimated code review effort: 4 (Complex), ~60 minutes
Pre-merge checks: 6 passed, 3 failed (2 warnings, 1 inconclusive)
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/backend/base/langflow/services/tracing/traceloop.py (1)
111-117: ⚠️ Potential issue | 🟠 Major
Add Traceloop-specific metadata key prefixes for association properties.
Metadata spread into span attributes at lines 111-117 and 159-168 lacks the `traceloop.association.properties.*` prefix required by Traceloop. The `_convert_to_otlp_dict()` method only normalizes values and converts keys to strings; it does not add provider-specific prefixes. Custom metadata will be sent as plain span attributes and will not appear as association properties in Traceloop.
Wrap metadata keys with the `traceloop.association.properties.` prefix before spreading them into attributes:
    # Instead of:
    **self._convert_to_otlp_dict(metadata or {})
    # Use something like:
    **{f"traceloop.association.properties.{k}": v for k, v in self._convert_to_otlp_dict(metadata or {}).items()}
Applies to lines 111-117 (child spans) and 159-168 (root span).
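For readers who want to try the suggested key prefixing in isolation, here is a minimal standalone sketch. `prefix_association_properties` is a hypothetical helper name and is not taken from the PR; the only detail borrowed from the review comment is the prefix string itself. Skipping keys that are already prefixed is an extra assumption of this sketch.

```python
from typing import Any, Dict, Mapping

# Prefix quoted in the review comment above.
TRACELOOP_PREFIX = "traceloop.association.properties."


def prefix_association_properties(metadata: Mapping[Any, Any]) -> Dict[str, Any]:
    """Return a copy of metadata whose keys carry the Traceloop prefix.

    Keys that already start with the prefix are left alone so that calling
    the helper twice does not double-prefix them (an assumption of this
    sketch, not something the review asks for).
    """
    prefixed: Dict[str, Any] = {}
    for key, value in metadata.items():
        name = str(key)
        if not name.startswith(TRACELOOP_PREFIX):
            name = TRACELOOP_PREFIX + name
        prefixed[name] = value
    return prefixed
```

The result could then be spread into the attributes dict in place of the raw converted metadata.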
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/backend/base/langflow/services/tracing/traceloop.py` around lines 111 - 117, The metadata being spread into span attributes (the attributes dict built around trace_id/trace_name/trace_type/inputs) is not being prefixed for Traceloop; change the spread of self._convert_to_otlp_dict(metadata or {}) so each key is renamed with the "traceloop.association.properties." prefix before merging (e.g., map the dict returned by _convert_to_otlp_dict to new keys with that prefix), and apply the same change at both places where metadata is merged (the child-span attributes block around trace_id/trace_name/inputs and the root-span attributes block later); keep the conversion logic in _convert_to_otlp_dict but wrap its output keys with the required prefix when building the attributes dict.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/backend/base/langflow/services/tracing/otlp_base.py`:
- Around line 79-83: The metadata conversion currently leaves nested dicts/lists
intact in _convert_langflow_type and _convert_to_otlp_dict, which later get
passed to Span.set_attribute and will fail; update _convert_langflow_type to
detect nested containers (dict or list) and return a JSON-stringified
representation (e.g., json.dumps(value, ensure_ascii=False)) instead of the raw
container, or alternatively convert them into sequences/primitives only so that
_convert_to_otlp_dict always returns only str/bool/int/float or sequences of
primitives before anything is passed to Span.set_attribute; make sure to
import/use json and apply this change in _convert_langflow_type and any code
paths used by _convert_to_otlp_dict.
In `@src/backend/base/langflow/services/tracing/otlp.py`:
- Around line 109-120: The protocol selection currently reads only
OTEL_EXPORTER_OTLP_PROTOCOL and accepts an unsupported "http/json"; change it to
first read OTEL_EXPORTER_OTLP_TRACES_PROTOCOL (falling back to
OTEL_EXPORTER_OTLP_PROTOCOL) into the protocol variable, and restrict accepted
values to "grpc" and "http/protobuf" when importing OTLPSpanExporter; remove any
"http/json" branch, and when protocol is unrecognized, log a warning and default
to "http/protobuf" before importing
opentelemetry.exporter.otlp.proto.http.trace_exporter.
- Around line 123-137: The current manual parsing of OTEL_RESOURCE_ATTRIBUTES
should be removed and replaced by delegating parsing to the OpenTelemetry SDK:
call the SDK's environment-based resource detector (obtain an env_resource using
the SDK's environment parsing helper) and then merge it with your explicit base
resource that sets "service.name" from OTEL_SERVICE_NAME and
"langflow.project_name" (use Resource.merge with the base resource first so base
values take precedence over env values); update the code in otlp.py where
Resource.create and manual OTEL_RESOURCE_ATTRIBUTES parsing occur (use
Resource.merge and the SDK's env resource detector instead of
split(",")/split("=") logic).
- Around line 144-150: The close() method currently calls force_flush() on the
TracerProvider which leaves BatchSpanProcessor worker threads running; change
close() to call tracer_provider.shutdown() instead of force_flush() to ensure
processors and background threads are stopped, and update the end() method to
invoke close() after ending the root span so each OTLPTracer instance
(tracer_provider, BatchSpanProcessor, span_exporter) is properly shut down;
refer to the tracer_provider attribute, close() and end() methods, and the
BatchSpanProcessor/OTLPSpanExporter creation in the OTLP tracer class to locate
where to replace force_flush() with shutdown() and add the close() call in
end().
In `@src/backend/tests/unit/services/tracing/test_tracing_service.py`:
- Around line 652-662: The test flakes because it only patches a single OTEL env
var and leaves other OTEL_* keys that alter OTLPTracer behavior; update these
tests in test_tracing_service.py (the blocks creating OTLPTracer) to run with a
clean OTEL-related environment by using patch.dict("os.environ", {}, clear=True)
and then re-inject only the required OTEL_EXPORTER_OTLP_TRACES_ENDPOINT (or
alternatively explicitly remove/sanitize all keys matching OTEL_* before
constructing OTLPTracer) so the tracer.ready assertion always exercises the
intended code path.
In `@src/backend/tests/unit/services/tracing/test_w3c_trace_context.py`:
- Around line 156-166: The propagation tests patch.dict calls should create a
clean OTEL environment before setting the test-specific OTLP vars; change the
patch.dict(...) usage in test_w3c_trace_context.py (the blocks that currently
call patch.dict(os.environ, { "OTEL_EXPORTER_OTLP_ENDPOINT": ... })) to either
use patch.dict(os.environ, {}, clear=True) followed by another patch.dict that
sets only the required OTEL_* keys, or add clear=True to the existing patch.dict
call and include all OTEL_* keys you need (e.g., OTEL_EXPORTER_OTLP_ENDPOINT,
OTEL_EXPORTER_OTLP_PROTOCOL, OTEL_EXPORTER_OTLP_HEADERS,
OTEL_RESOURCE_ATTRIBUTES) so the CollectingExporter-based tests run
hermetically; apply the same change to the other two blocks noted (around the
ranges mentioned).
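The hermetic-environment suggestion in the last two items can be tried out with the standard library alone. The sketch below shows that `patch.dict(..., clear=True)` hides every pre-existing `OTEL_*` variable for the duration of the block and restores the original environment afterwards. The endpoint value and the helper names are assumptions made for illustration, not values from the PR's tests.

```python
import os
from typing import Dict
from unittest.mock import patch

# Assumed test endpoint; any URL would do for this demonstration.
REQUIRED_ENV = {"OTEL_EXPORTER_OTLP_TRACES_ENDPOINT": "http://localhost:4318/v1/traces"}


def visible_otel_vars() -> Dict[str, str]:
    """Return the OTEL_* variables currently visible in os.environ."""
    return {k: v for k, v in os.environ.items() if k.startswith("OTEL_")}


def otel_vars_inside_clean_patch() -> Dict[str, str]:
    """Report which OTEL_* variables code under test would see.

    clear=True empties os.environ before applying REQUIRED_ENV, and the
    original environment is restored when the with-block exits.
    """
    with patch.dict(os.environ, REQUIRED_ENV, clear=True):
        return visible_otel_vars()
```

Note that `clear=True` also removes unrelated variables such as `PATH` inside the block, so a test that shells out may prefer to strip only the `OTEL_*` keys instead.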
📒 Files selected for processing (8)
- src/backend/base/langflow/services/tracing/arize_phoenix.py
- src/backend/base/langflow/services/tracing/langwatch.py
- src/backend/base/langflow/services/tracing/otlp.py
- src/backend/base/langflow/services/tracing/otlp_base.py
- src/backend/base/langflow/services/tracing/service.py
- src/backend/base/langflow/services/tracing/traceloop.py
- src/backend/tests/unit/services/tracing/test_tracing_service.py
- src/backend/tests/unit/services/tracing/test_w3c_trace_context.py
63c0a5f to 9579a8a
The Arize/Phoenix tracer and the new generic OpenTelemetry tracer were constructing a new trace engine, batch span processor etc for every graph evaluation, then orphaning them. These would leak and remain running in the background. Instead, initialize the trace engine only once, and shut it down on service exit. For each graph only make a new tracer instance. This allows the OpenTelemetry SDK to efficiently batch traces, handle errors and retry, and otherwise operate as intended. These changes do not update the Langwatch exporter as it already uses a class-level singleton (though this lacks clean shutdown logic). The other otel-based provider, Traceloop, uses an internal singleton within its SDK so it does not need updating. Note that this commit was largely built using Claude Code based on an issue identified in PR review by CodeRabbit here: langflow-ai#12223 (comment) I evaluated the issue and found that it was legitimate.
3768126 to 32b3d28
8129732 to 8941fc7
4f96458 to c63cbf1
|
I've rebased this onto Some test failures exist, but they were pre-existing on the |
|
CI job failures appear unrelated; an error for |
|
@erichare @viktoravelino @Jkavia @schuellerf I've prepared a PR to add a generic OpenTelemetry destination for trace events. Is there any chance anyone would be willing to review this and consider it for merge? I've pinged the Discord a couple of times but didn't get a response so I thought I'd try tagging a few people I can see have been involved with tracing in Langflow based on #11689 and/or are active contributors. |
|
I think there's a workaround for this (as I've been able to get no traction at all on this PR): Configure Langflow to use Arize Phoenix or Langwatch but with a fake API key, and an API endpoint env-var that overrides the endpoint to point to a native OTLP trace API endpoint on an otel collector. There are plenty of problems with this of course; there's no sensible way to configure custom mutual TLS, for one thing, and it's a hack. But it's ... something. |
|
After further digging I have been able to determine that there are two partial workarounds for getting Langflow traces, but neither will deliver working trace-context propagation through downstreams due to missing support in Langflow itself. The Phoenix exporter cannot be used as it uses a custom span exporter, but the Arize exporter will work, and the Langwatch exporter will work when a workaround is applied on the otel-collector side.
Arize exporter
The Arize exporter can be used without otel-collector configuration changes if the otel-collector does not require mutual TLS (or TLS is handled via a service mesh like Istio). Configure a fake Arize endpoint to point to an otel-collector gRPC endpoint, like:
Omit the
The otel collector will ignore the suffix appended by the Arize exporter and accept the gRPC span pushes.
Limitations:
Langwatch exporter
A similar approach works with Langwatch but the otel-collector configuration must be customized to accept traces on the Langwatch endpoint:
    receivers:
      # ...
      otlp/langwatch:
        protocols:
          http:
            endpoint: ${env:MY_POD_IP}:4319
            traces_url_path: /api/otel/v1/traces
    pipelines:
      # ...
      receivers:
        # ...
        - otlp/langwatch
The k8s workload and Service will also need updating to add port 4319. Then the Langflow environment can be updated to add env-vars:
Limitations are the same as for Arize.
|
Currently revising this a little to tidy up integration with the prior trace context PR and add some more testing to guard against regressions in existing tracers. Edit: done. If you want I can rebase the changes into the original commit series to make it easier to review. |
…tracers The existing trace implementations for Arize, Langwatch and Traceloop are all OpenTelemetry based, and share a lot of common code. Mostly they differ in details of their endpoint discovery and authentication tokens. Extract the common functionality into a base class they all share. Signed-off-by: Craig Ringer <craig.ringer@enterprisedb.com>
For langflow-ai#12117
Add a generic OpenTelemetry tracer. This tracer is configured using the standard OpenTelemetry environment variables rather than by extending Langflow configuration explicitly. Key env-vars include
OTEL_SERVICE_NAME
OTEL_EXPORTER_OTLP_PROTOCOL
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
This tracer can be used to send traces to a standard trace tool like Jaeger or Grafana Tempo. It can also be used to route traces through an OpenTelemetry Collector for filtering and processing (e.g. k8sattributesprocessor) before forwarding to any appropriate trace sink.
See https://opentelemetry-python.readthedocs.io/en/latest/sdk/environment_variables.html
Signed-off-by: Craig Ringer <craig.ringer@enterprisedb.com>
Show that:
* Inbound w3c trace context headers are extracted into a trace context
* Trace context is propagated through the otel tracing providers
* Outbound requests have a trace context injected (if opentelemetry-instrumentation-httpx is available)
* Concurrent requests each get the correct inherited trace context
Signed-off-by: Craig Ringer <craig.ringer@enterprisedb.com>
The Arize/Phoenix tracer and the new generic OpenTelemetry tracer were constructing a new trace engine, batch span processor etc for every graph evaluation, then orphaning them. These would leak and remain running in the background. Instead, initialize the trace engine only once, and shut it down on service exit. For each graph only make a new tracer instance. This allows the OpenTelemetry SDK to efficiently batch traces, handle errors and retry, and otherwise operate as intended. These changes do not update the Langwatch exporter as it already uses a class-level singleton (though this lacks clean shutdown logic). The other otel-based provider, Traceloop, uses an internal singleton within its SDK so it does not need updating. Note that this commit was largely built using Claude Code based on an issue identified in PR review by CodeRabbit here: langflow-ai#12223 (comment) I evaluated the issue and found that it was legitimate. Signed-off-by: Craig Ringer <craig.ringer@enterprisedb.com>
Address issues identified with the Langwatch tracer implementation's initialization by ensuring that:
* Locking guards provider creation, preventing creation of multiple providers
* A shutdown hook is added, ensuring that the provider flushes spans and terminates when Langflow is shut down
* Tracing teardown is updated to shut down the langflow provider
Additionally:
* A reset hook is added for test use
* A module-level singleton is used to be consistent with the otlp and ArizePhoenix tracers
Signed-off-by: Craig Ringer <craig.ringer@enterprisedb.com>
Signed-off-by: Craig Ringer <craig.ringer@enterprisedb.com>
Address issue identified by Rabbit Signed-off-by: Craig Ringer <craig.ringer@enterprisedb.com>
Signed-off-by: Craig Ringer <craig.ringer@enterprisedb.com>
* Fix threading race in test case
* Fix unreachable return
Signed-off-by: Craig Ringer <craig.ringer@enterprisedb.com>
Move HTTP client instrumentation enable/disable to base class helper methods, reducing duplication across ArizePhoenix and LangWatch tracers. Also fixes a bug in LangWatch where non-existent self.tracer_provider was referenced. Adds regression tests verifying endpoint URIs and headers for each tracer remain unchanged by the refactoring. Documents why the generic OTLP tracer doesn't follow OpenTelemetry GenAI semantic conventions: it operates at workflow orchestration level, tracing component execution rather than individual LLM API calls where model details are available. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ngflow_type The refactoring in 0d5b073 simplified Data handling to always call get_text(), but the original ArizePhoenix code preserved structured dicts/lists. Also update tests to use the new method name. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
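Whatever shape the conversion takes, the values finally handed to `Span.set_attribute` need to be strings, booleans, numbers, or homogeneous sequences of those, which is the concern raised in the review above. The following is only a sketch of one way to guarantee that, along the lines of the reviewer's JSON-stringify suggestion. `to_otlp_value` and `to_otlp_attributes` are hypothetical names, and dropping `None` values is an assumption of this sketch rather than documented PR behaviour.

```python
import json
from typing import Any, Dict, Mapping

_PRIMITIVES = (str, bool, int, float)


def to_otlp_value(value: Any) -> Any:
    """Coerce a value into something usable as a span attribute."""
    if isinstance(value, _PRIMITIVES):
        return value
    if isinstance(value, (list, tuple)) and value:
        first_type = type(value[0])
        # Keep non-empty sequences whose items all share one primitive type.
        if first_type in _PRIMITIVES and all(type(v) is first_type for v in value):
            return list(value)
    # Nested dicts, mixed lists and arbitrary objects become JSON text;
    # default=str covers objects json cannot serialise natively.
    return json.dumps(value, ensure_ascii=False, default=str)


def to_otlp_attributes(data: Mapping[Any, Any]) -> Dict[str, Any]:
    """Stringify keys, coerce values, and drop None entries."""
    return {str(k): to_otlp_value(v) for k, v in data.items() if v is not None}
```

A provider that wants to preserve structure (as the commit above says the original ArizePhoenix code did) would need a different strategy, for example flattening nested keys, which this sketch does not attempt.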
Add support for sending OpenTelemetry traces to a generic OTLP trace destination using the standard OpenTelemetry trace configuration environment variables. Implements #12117
Changes
(Changes are split into logical commits for easier review.)
Logic common to the various otel-based tracing exporters is extracted from the existing providers `traceloop.py`, `arize_phoenix.py` and `langwatch.py` into a new `otlp_base.py` skeleton provider. This handles attribute transformation, trace-context propagation and other common logic.
A new provider `otlp.py` is added and registered with tracing `service.py`. The generic OpenTelemetry provider activates if either of the `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` or `OTEL_EXPORTER_OTLP_ENDPOINT` env-vars is set. Configuration follows the conventions set out in OTLP Exporter Configuration and the go otel SDK docs. Some test cover is added to exercise the tracer.
An additional test is added to check that trace context propagation works in the expected manner as requests flow through Langflow.
(Subsequent changes arising from automated review, rebases and merges):
Cache and reuse OpenTelemetry based tracers. The Arize/Phoenix tracer and the new generic OpenTelemetry tracer were constructing a new trace engine, batch span processor etc for every graph evaluation, then orphaning them. These would leak and remain running in the background, as well as failing to properly flush on shutdown. Address by using a module-level singleton to initialize the traceprovider only once, and only construct a new tracer for each invocation. Other providers are unaffected; either they internally use a singleton already, or they don't have a long-lived trace engine to preserve.
Update langwatch trace provider initialization to use the updated singleton pattern. The langwatch tracer already used a singleton, but it didn't have a shutdown hook or any locking to prevent races between concurrent initializations. Update it to work the same way as the other otel-based tracers.
Rebase on top of #12962 which added trace context propagation for the otel-flavoured exporters and de-duplicate functionality.
Update generic OTLP instrumentation to ensure it follows the otel semantic conventions for GenAI/LLM.
Extend test cover to validate that the expected attributes are emitted by each provider.
Usage
- Set the `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` or `OTEL_EXPORTER_OTLP_ENDPOINT` env-var, pointing to your OTLP endpoint
- Set `OTEL_EXPORTER_OTLP_PROTOCOL` to the exporter protocol (`grpc`, `http/protobuf` or `http/json`)
- Set `OTEL_SERVICE_NAME` to something suitable (`langflow` will do if nothing more specific applies in your environment).
Optionally define additional SDK env-vars to control server and client certificates, additional headers, timeouts, sampling, etc.
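To make the env-var precedence concrete, here is a small sketch of how the endpoint and protocol might be resolved. It follows the CodeRabbit suggestion earlier in this thread (traces-specific variables first, only `grpc` and `http/protobuf` accepted, anything else falls back with a warning) and does not claim to match the merged code. The function names and the `http/protobuf` default are assumptions of this sketch.

```python
import logging
from typing import Mapping, Optional

logger = logging.getLogger(__name__)

# Protocols this sketch accepts; anything else falls back to the default.
SUPPORTED_PROTOCOLS = ("grpc", "http/protobuf")
DEFAULT_PROTOCOL = "http/protobuf"  # assumed default for this sketch


def resolve_otlp_endpoint(env: Mapping[str, str]) -> Optional[str]:
    """Return the configured endpoint, or None if OTLP tracing is not enabled."""
    return (
        env.get("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT")
        or env.get("OTEL_EXPORTER_OTLP_ENDPOINT")
        or None
    )


def resolve_otlp_protocol(env: Mapping[str, str]) -> str:
    """Prefer the traces-specific protocol variable over the generic one."""
    raw = (
        env.get("OTEL_EXPORTER_OTLP_TRACES_PROTOCOL")
        or env.get("OTEL_EXPORTER_OTLP_PROTOCOL")
        or DEFAULT_PROTOCOL
    )
    protocol = raw.strip().lower()
    if protocol not in SUPPORTED_PROTOCOLS:
        logger.warning("Unsupported OTLP protocol %r, using %s", raw, DEFAULT_PROTOCOL)
        return DEFAULT_PROTOCOL
    return protocol
```

Passing `os.environ` as `env` gives the live configuration; passing a plain dict keeps the functions easy to test.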
Related issues
Fixes #12117
Context
This patch series was developed with assistance from Claude Code, but all non-test code changes have been manually reviewed and inspected to the best of my (not amazing) ability and modified where appropriate.
The instrumentation is explicitly created in code because the Langflow multi-provider tracing abstraction doesn't fit very well with the python otel SDK's auto-instrumentation and `opentelemetry-instrument` tool. I also didn't want to impose the requirement of wrapping the langflow backend execution in the `opentelemetry-instrument` command.
I omitted various additional, verbose test cover for things like asserting that all the otel env-vars are respected as they didn't seem useful enough or langflow-specific enough.
Testing
Using the existing tests for the current tracers and some additions made to ensure they properly cover the emitted attributes etc, run the tests against `release-1.10.0` and against this PR branch `HEAD`. Compare. REGRESSION_TEST_RESULTS.md
Summary by CodeRabbit
New Features
Bug Fixes / Reliability
Tests