
Python: [Bug]: executor.process spans intermittently never close, leaving child gen_ai.* spans orphan in App Insights #5577

@rickywck

Description


Summary

When running a multi-executor workflow with HITL (human-in-the-loop) pauses, a dropped-type-mismatch edge group, and per-row asyncio.gather work,
MAF's executor instrumentation intermittently fails to close executor.process spans. The spans are created (their span_ids appear as the parent_id
of child spans emitted inside the executor), but they are never .end()-ed, never exported via SimpleSpanProcessor, and never appear in App
Insights / OTLP collectors.

The visible symptom is broken trace trees: child spans (LLM calls, manually-instrumented gen_ai.* spans inside per-row work) appear in App
Insights with parent_id values pointing at executor.process spans that don't exist anywhere in the exported trace.

Environment

  • agent-framework==1.2.2
  • agent-framework-foundry==1.2.2
  • opentelemetry-sdk==1.40.0
  • azure-monitor-opentelemetry (latest as of 2026-04-29)
  • Python 3.13
  • Linux container (mcr.microsoft.com/devcontainers/python:3.13)

Verified NOT a duplicate of #5552

Tested both agent-framework==1.0.1 and agent-framework==1.2.2 — the bug reproduces identically. Upstream fix #5552 ("Fix observability spans not
being correctly nested when using streaming") does not address this.

Reproduction

Workflow shape that triggers the bug:

  1. 14+ executors built via WorkflowBuilder.
  2. One edge produces a dropped type mismatch (an intentional misroute whose message the delivery handler drops). In our case: prepare_contract_validation → ContractIntelligence (a different edge actually delivers the message; the type-mismatch edge is the one that drops).
  3. One or more validators that send a HITLRequest via ctx.send_message() and then return — the workflow pauses and resumes after an external response.
  4. Per-row work inside some validators uses asyncio.gather over a Semaphore-bounded set of LLM calls.
  5. OTel instrumentation: enable_instrumentation() called from the FastAPI lifespan; BatchSpanProcessor to Azure Monitor plus SimpleSpanProcessor to
    ConsoleExporter as a debug overlay (gated by an OTEL_DEBUG_CONSOLE-style flag).
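The per-row fan-out in step 4 boils down to the following pattern (a minimal sketch; validate_rows and validate_one are illustrative names, not MAF API, and the real per-row body awaits an LLM call):

```python
import asyncio

# Sketch of step 4: asyncio.gather over a Semaphore-bounded set of per-row
# calls. The semaphore caps how many rows are in flight at once.
async def validate_rows(rows, concurrency=6):
    sem = asyncio.Semaphore(concurrency)

    async def validate_one(row):
        async with sem:             # at most `concurrency` rows in flight
            await asyncio.sleep(0)  # stands in for the awaited LLM call
            return {"row": row, "status": "validated"}

    # gather preserves input order in its result list
    return await asyncio.gather(*(validate_one(r) for r in rows))

results = asyncio.run(validate_rows(list(range(10))))
```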

Run a representative input end-to-end and inspect stdout for the printed executor.process span bodies.

Expected behaviour

Every executor that processes a message during the workflow run should emit an executor.process <name> span via SimpleSpanProcessor (i.e.
printed to stdout) before workflow.run ends.

Actual behaviour

Several executor.process spans never appear in stdout despite their child spans being printed with valid parent_id references back to them.
Affected executors observed (varies between runs of the same input):

  • validate_unbilled_transactions (sync, no gather)
  • route_after_unbilled (sync, no gather)
  • validate_timecards (no LLM calls — pure dict aggregation)
  • validate_expenses (asyncio.gather + HITL)
  • validate_ap_invoices (asyncio.gather + HITL)

Early-pipeline executors (discover_and_prepare_files, extract_documents, persist_and_emit_result, prepare_contract_validation)
consistently DO emit their executor.process spans correctly. The break is specifically downstream of the first dropped-type-mismatch edge group.

Evidence

Excerpt from OTEL_DEBUG_CONSOLE=true stdout (parallelism = 6, single workflow run before HITL pause):

{
  "name": "chat gpt-5.3-chat",
  "context": {"trace_id": "0x185d...", "span_id": "0x89fe...d159"},
  "parent_id": "0x27e049d02c35e8b0",
  "...": "..."
}

The span_id 0x27e049d02c35e8b0 is the executor.process validate_expenses span. It is referenced as parent_id by 10 LLM-call child spans, but
never appears as a span_id in any printed JSON — neither before nor after workflow.run ends. The span is created and live during the
executor's handle() execution (otherwise child spans couldn't reference it as parent), but is never .end()-ed.
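The orphan pattern can be confirmed mechanically from the console exporter output: collect every printed span_id, then flag any parent_id that never appears in that set. A minimal sketch (assumes stdout has already been captured and split into the individual JSON bodies that ConsoleSpanExporter prints; the function name is ours):

```python
import json

def find_orphan_parents(span_jsons):
    """Given ConsoleSpanExporter JSON bodies, return parent_ids that are
    referenced by children but never printed as a span_id themselves."""
    spans = [json.loads(s) for s in span_jsons]
    printed = {s["context"]["span_id"] for s in spans}
    return {
        s["parent_id"]
        for s in spans
        if s.get("parent_id") and s["parent_id"] not in printed
    }
```

Run against the captured stdout of an affected run, this returns the missing executor.process span_ids (e.g. 0x27e049d02c35e8b0 above) while a healthy run returns an empty set.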

Ruled-out alternative explanations:

  • Not BSP queue saturation. Tested with default queue (2048) and tuned queue (8192). Reproduces identically.
  • Not OTel context propagation. All children correctly inherit trace_id from the workflow root and reference correct parent_id values.
  • Not asyncio.gather context loss. Reproduces in route_after_unbilled and validate_unbilled_transactions which don't use gather.
  • Not parallelism-driven. Reproduces at AGENT4_VALIDATOR_CONCURRENCY=6 (lowest tested) just as readily as at 10 or 30.
  • Not Anthropic/OpenAI SDK retries. The orphan spans are MAF's own executor.process spans, not SDK spans.

Request

  1. Confirm or deny that this is a known bug.
  2. If unknown, please investigate the executor.process span lifecycle in MAF's workflow runner — specifically how span closure interacts with: (a)
    HITL ctx.send_message(HITLRequest) paths that yield control to the runner, (b) executors downstream of a dropped type mismatch edge group,
    and (c) sync (non-async) executors triggered by InternalEdgeGroup messages.
  3. If reproducible, fix or document the workaround.

Workaround currently in use

None viable. Application-side wrappers (tracer.start_as_current_span around each Executor.handle()) were considered and rejected: the bug
surfaces in executors with diverse shapes (gather, no-gather, sync, async), so a wrapper strategy cannot be comprehensive without effectively
re-implementing enable_instrumentation(). We accept the trace gaps until an upstream fix lands.
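For reference, the rejected wrapper looked roughly like this (a sketch: traced_handle and the span name are ours, not MAF API; the tracer only needs a start_as_current_span(name) context manager, matching the OTel Tracer interface):

```python
import functools
import inspect

def traced_handle(tracer, span_name):
    """Wrap an executor's handle() (sync or async) in an application-side
    span so it is ended even when MAF's own span is not."""
    def decorate(handle):
        if inspect.iscoroutinefunction(handle):
            @functools.wraps(handle)
            async def async_wrapper(*args, **kwargs):
                with tracer.start_as_current_span(span_name):
                    return await handle(*args, **kwargs)
            return async_wrapper

        @functools.wraps(handle)
        def sync_wrapper(*args, **kwargs):
            with tracer.start_as_current_span(span_name):
                return handle(*args, **kwargs)
        return sync_wrapper
    return decorate
```

Even so, this only covers the handle() entry point; it cannot reproduce the attribute set or parenting behaviour that enable_instrumentation() applies internally, which is why we dropped it.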

Code Sample

Error Messages / Stack Traces

Package Versions

agent-framework==1.2.2, agent-framework-foundry==1.2.2

Python Version

Python 3.13

Additional Context

No response
