fix: batch spans before OTLP export to preserve parent-child relationships #290
Conversation
The HoneyHiveSpanProcessor was exporting each span individually via `otlp_exporter.export([span])`, creating 1-span batches. The backend's `resolveParentIdsFromSpans()` can only resolve parent-child links within the same batch, so with single-span batches parent matching never succeeded and all events defaulted to `parent_id=session_id` (flat tree).

Changes:
- `span_processor.py`: Buffer spans in `_span_buffer` during `on_end()`; flush all buffered spans as a single batch in `_flush_span_buffer()`, which is called by `force_flush()` and `shutdown()`. Immediate mode (`disable_batch=True`) still exports spans individually.
- `otlp_exporter.py`: Group spans by instrumentation scope in `_spans_to_otlp_json_payload()` so multi-scope batches produce correct OTLP JSON with separate `scopeSpans` entries.
✅ Snyk checks have passed. No issues have been found so far.
Claude finished @devin-ai-integration[bot]'s task in 2m 19s

Review Complete

Reviewed both changed files. The core approach (batching spans before OTLP export) is correct, but found two must-fix issues before merge. See inline comments for details and suggested fixes.
📚 Documentation preview built.
```python
try:
    with self._buffer_lock:
        if not self._span_buffer:
            return True
        spans_to_flush = list(self._span_buffer)
        self._span_buffer.clear()

    if not self.otlp_exporter:
        self._safe_log("warning", "No OTLP exporter for flush")
        return False

    self._safe_log(
        "info",
        "Flushing %d buffered spans as single OTLP batch",
        len(spans_to_flush),
    )

    result = self.otlp_exporter.export(spans_to_flush)

    if hasattr(result, "name"):
        self._safe_log(
            "debug",
            "Batch OTLP export result: %s (span_count=%d)",
            result.name,
            len(spans_to_flush),
        )
    return True
```
Bug: Data loss on export failure & export result ignored
Two issues here:
1. Spans cleared before export succeeds (line 883): The buffer is cleared before `export()` is called (line 895). If `export()` raises an exception or returns `FAILURE`, those spans are permanently lost. The buffer should only be cleared after confirming a successful export.
2. Export result not checked (line 904): The method always returns `True` after calling export, regardless of whether the export actually succeeded (`SpanExportResult.SUCCESS` vs `FAILURE`). A failed export should return `False` or re-buffer the spans.
Suggested change (adds an explicit result check after the export):

```python
try:
    with self._buffer_lock:
        if not self._span_buffer:
            return True
        spans_to_flush = list(self._span_buffer)
        self._span_buffer.clear()

    if not self.otlp_exporter:
        self._safe_log("warning", "No OTLP exporter for flush")
        return False

    self._safe_log(
        "info",
        "Flushing %d buffered spans as single OTLP batch",
        len(spans_to_flush),
    )

    result = self.otlp_exporter.export(spans_to_flush)

    if hasattr(result, "name"):
        self._safe_log(
            "debug",
            "Batch OTLP export result: %s (span_count=%d)",
            result.name,
            len(spans_to_flush),
        )

    from opentelemetry.sdk.trace.export import SpanExportResult

    if result != SpanExportResult.SUCCESS:
        self._safe_log(
            "error",
            "Batch OTLP export failed, %d spans lost",
            len(spans_to_flush),
        )
        return False

    return True
```
Ideally on failure the spans should be re-added to the buffer for retry, but at minimum the result should be checked and reported honestly.
Addressed in f179131:

- Buffer only cleared after successful export: Moved `self._span_buffer.clear()` to after the `SpanExportResult.SUCCESS` check. New spans arriving during export are preserved via slice assignment.
- Export result now checked: `_flush_span_buffer()` returns `False` when `export()` doesn't return `SUCCESS`, and logs the failure.
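The clear-after-success contract described above can be sketched in isolation. This is a minimal illustration, not the actual processor: the `BufferedFlusher` class and the exporter stubs are hypothetical, though `_span_buffer`, `_buffer_lock`, and `_flush_span_buffer` mirror names from the PR, and the `SpanExportResult` enum stands in for the OpenTelemetry SDK's.

```python
import threading
from enum import Enum


class SpanExportResult(Enum):
    """Stand-in for opentelemetry.sdk.trace.export.SpanExportResult."""
    SUCCESS = 0
    FAILURE = 1


class BufferedFlusher:
    """Minimal sketch of the clear-after-success flush pattern."""

    def __init__(self, exporter):
        self._span_buffer = []
        self._buffer_lock = threading.Lock()
        self.otlp_exporter = exporter

    def on_end(self, span):
        with self._buffer_lock:
            self._span_buffer.append(span)

    def _flush_span_buffer(self) -> bool:
        # Snapshot the buffer WITHOUT clearing it: a failed export
        # must not lose spans.
        with self._buffer_lock:
            if not self._span_buffer:
                return True
            spans_to_flush = list(self._span_buffer)

        result = self.otlp_exporter.export(spans_to_flush)
        if result is not SpanExportResult.SUCCESS:
            return False  # spans stay buffered for the next attempt

        # Slice assignment drops only what was exported, preserving
        # spans appended while the export was in flight.
        with self._buffer_lock:
            self._span_buffer[: len(spans_to_flush)] = []
        return True
```

A failed flush leaves the buffer intact, so a later `force_flush()` or timer tick can retry the same spans.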
```python
# Group spans by instrumentation scope name
scope_groups: Dict[str, List[ReadableSpan]] = {}
for span in spans:
    scope_name = "unknown"
    if (
        hasattr(span, "instrumentation_scope")
        and span.instrumentation_scope
        and span.instrumentation_scope.name
    ):
        scope_name = span.instrumentation_scope.name
    scope_groups.setdefault(scope_name, []).append(span)
```
Nit: Good improvement — correct OTLP grouping by scope
This is a solid fix. The old code lumped all spans under a single instrumentation scope, which produced incorrect OTLP payloads when spans from different instrumentors (e.g., autogen-core + openinference.instrumentation.openai) were batched together.
One minor suggestion: consider using getattr chaining instead of nested hasattr checks for slightly cleaner code:
```python
scope = getattr(span, "instrumentation_scope", None)
scope_name = getattr(scope, "name", None) or "unknown"
```
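Applied to the grouping loop, the getattr chaining might look like this (a self-contained sketch; `SimpleNamespace` objects stand in for real `ReadableSpan` instances, and `group_spans_by_scope` is a hypothetical helper name):

```python
from types import SimpleNamespace


def group_spans_by_scope(spans):
    """Bucket spans per instrumentation scope: the shape needed to
    emit one scopeSpans entry per scope in the OTLP JSON payload."""
    scope_groups = {}
    for span in spans:
        # getattr chaining handles both a missing attribute and a
        # scope that is None or has an empty name.
        scope = getattr(span, "instrumentation_scope", None)
        scope_name = getattr(scope, "name", None) or "unknown"
        scope_groups.setdefault(scope_name, []).append(span)
    return scope_groups


# Spans from two instrumentors, plus one with no scope set at all
spans = [
    SimpleNamespace(instrumentation_scope=SimpleNamespace(name="autogen-core")),
    SimpleNamespace(
        instrumentation_scope=SimpleNamespace(
            name="openinference.instrumentation.openai"
        )
    ),
    SimpleNamespace(instrumentation_scope=None),
]
groups = group_spans_by_scope(spans)
```

Each key in `groups` becomes its own `scopeSpans` entry, so a mixed batch no longer collapses every span under a single scope.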
Review Summary

The core idea is sound — batching spans before OTLP export so the backend can resolve parent-child relationships within a single request is the correct fix for the flat-tree problem. The scope-grouping fix in `otlp_exporter.py` is a solid improvement. However, the current implementation has issues that should be addressed before merging; the Must Fix, Should Fix, and Notes items appear in the inline comments above.

Documentation

No public API surface changed, so no docs updates are needed.
- Add `_max_buffer_size` (512) threshold that triggers automatic flush
- Add periodic flush timer (5s) to prevent unbounded buffer growth
- Only clear buffer after confirmed successful export (`SpanExportResult.SUCCESS`)
- Cancel flush timer during shutdown to avoid resource leaks
- New spans arriving during export are preserved (not dropped)

Addresses code review feedback:
- P0: Unbounded buffer growth in long-running processes
- Bug: Data loss on export failure (buffer cleared before export confirmed)
- Bug: Export result never checked
Three tests expected immediate export on `_send_via_otlp` in batched mode. Updated to verify spans are buffered first, then exported on flush, matching the new batch export implementation.
Four tests expected immediate export in batched mode. Updated to verify spans are buffered first, then exported on flush.
fix: batch spans before OTLP export to preserve parent-child relationships
Summary
`HoneyHiveSpanProcessor._send_via_otlp()` was calling `self.otlp_exporter.export([span])` for every span individually as it completed. This meant each span arrived at the backend as a 1-span batch. The backend's `resolveParentIdsFromSpans()` can only resolve parent-child links within the same batch, so with single-span batches the parent lookup never succeeded — all events defaulted to `parent_id=session_id`, producing a completely flat tree.

Changes:

- `span_processor.py`: `_send_via_otlp()` now buffers spans in a thread-safe `_span_buffer` instead of exporting immediately. A periodic flush timer (5s) and a max buffer cap (512 spans) prevent unbounded growth. `_flush_span_buffer()` exports all buffered spans as a single batch, only clearing the buffer after a successful export; it is called by `force_flush()`, `shutdown()`, the periodic timer, and when the buffer reaches capacity. The `disable_batch=True` (immediate) mode still exports spans one at a time.
- `otlp_exporter.py`: `_spans_to_otlp_json_payload()` now groups spans by instrumentation scope into separate `scopeSpans` entries, instead of putting all spans under a single scope. This produces correct OTLP JSON when a batch contains spans from multiple instrumentors (e.g. `autogen-core` + `openinference.instrumentation.openai`).
- `autogen_integration.py`: Removed debug scaffolding (`capture_spans` import/usage, `verbose=True`) from the public example.
- Tests (`tests/` and `tests_v2/`): Updated to verify the new buffering behavior — spans are not exported immediately on `on_end`, but are held in `_span_buffer` and exported on explicit `_flush_span_buffer()`.

Root cause discovered during autogen-agentchat v0.7.5 integration validation: 73/81 raw spans had correct `parent_span_id`, but 81/82 ingested events had `parent_id=session_id`.

End-to-end validation result: After the fix, 42/78 ingested events now have proper parent-child nesting (vs 0/82 before). The remaining 36 root-level events are expected — they represent top-level test function spans and spans that crossed flush-timer boundaries.
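The two automatic flush triggers described in the changes (buffer cap and periodic timer) can be sketched as follows. This is a hypothetical stand-in, not the real HoneyHiveSpanProcessor; the 512-span cap and 5-second interval are the PR's stated defaults, exposed here as parameters so they can be shortened for testing.

```python
import threading


class CappedBuffer:
    """Sketch of a span buffer with two flush triggers: a size cap
    checked on every append, and a self-re-arming periodic timer."""

    def __init__(self, flush_fn, max_buffer_size=512, flush_interval=5.0):
        self._span_buffer = []
        self._buffer_lock = threading.Lock()
        self._max_buffer_size = max_buffer_size
        self._flush_interval = flush_interval
        self._flush_fn = flush_fn
        self._timer = threading.Timer(flush_interval, self._periodic_flush)
        self._timer.daemon = True
        self._timer.start()

    def _periodic_flush(self):
        self._flush_fn()
        # Re-arm so the buffer is drained at least once per interval.
        self._timer = threading.Timer(self._flush_interval, self._periodic_flush)
        self._timer.daemon = True
        self._timer.start()

    def on_end(self, span):
        with self._buffer_lock:
            self._span_buffer.append(span)
            full = len(self._span_buffer) >= self._max_buffer_size
        if full:
            self._flush_fn()  # cap reached: flush outside the lock

    def shutdown(self):
        self._timer.cancel()  # avoid leaking the timer thread
        self._flush_fn()      # final drain on shutdown
```

Cancelling the timer in `shutdown()` is what prevents the resource leak the review called out; without it, the re-armed timer keeps a daemon thread alive after the processor is discarded.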
Review & Testing Checklist for Human
- … `force_flush()` should be called more aggressively (e.g. after each top-level trace completes).
- `disable_batch=True` path: Verify this mode still works as expected (immediate single-span export, no buffering). This path is not exercised in the integration test.
- … `shutdown()` is called. The 5s timer + 512-span cap mitigate this but don't eliminate it.
Recommended test plan:

- … `disable_batch=True` and confirm spans still export correctly
- `force_flush()` drains the buffer when called explicitly

Notes
- `test_tracing_minimal_overhead_integration` (CV 379% vs 300% threshold), unrelated to these changes and not required for merge.
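A minimal version of the "force_flush drains the buffer" check from the test plan might look like this. The `RecordingExporter` and `BufferedProcessor` classes are stubs invented for illustration; the repository's real tests exercise `HoneyHiveSpanProcessor` directly.

```python
class RecordingExporter:
    """Stub exporter that records each export() batch it receives."""

    def __init__(self):
        self.batches = []

    def export(self, spans):
        self.batches.append(list(spans))
        return "SUCCESS"


class BufferedProcessor:
    """Stand-in for the buffering contract under test: on_end
    buffers, force_flush exports everything as one batch."""

    def __init__(self, exporter):
        self.exporter = exporter
        self._span_buffer = []

    def on_end(self, span):
        self._span_buffer.append(span)

    def force_flush(self):
        if self._span_buffer:
            self.exporter.export(self._span_buffer)
            self._span_buffer = []


def test_force_flush_drains_buffer_as_single_batch():
    exporter = RecordingExporter()
    proc = BufferedProcessor(exporter)
    for name in ("root", "child-1", "child-2"):
        proc.on_end(name)
    assert exporter.batches == []  # nothing exported on on_end
    proc.force_flush()
    # One batch containing all spans, so the backend can resolve
    # parent-child links within it
    assert exporter.batches == [["root", "child-1", "child-2"]]
    assert proc._span_buffer == []  # buffer drained


test_force_flush_drains_buffer_as_single_batch()
```

The key assertion is that all spans arrive in a single `export()` call rather than one call per span, which is exactly the property the backend's parent resolution depends on.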