
Python: [Bug]: executor.process spans intermittently never close, leaving child gen_ai.* spans orphan in App Insights #5577

@rickywck

Description


Summary

When running a multi-executor workflow with HITL (human-in-the-loop) pauses, a dropped-type-mismatch edge group, and per-row asyncio.gather work,
MAF's executor instrumentation intermittently fails to close executor.process spans. The spans are created (their span_ids appear as the parent_id
of child spans emitted inside the executor), but they are never .end()-ed, never exported via SimpleSpanProcessor, and never appear in App
Insights / OTLP collectors.

The visible symptom is broken trace trees: child spans (LLM calls, manually-instrumented gen_ai.* spans inside per-row work) appear in App
Insights with parent_id values pointing at executor.process spans that don't exist anywhere in the exported trace.

Environment

  • agent-framework==1.2.2
  • agent-framework-foundry==1.2.2
  • opentelemetry-sdk==1.40.0
  • azure-monitor-opentelemetry (latest as of 2026-04-29)
  • Python 3.13
  • Linux container (mcr.microsoft.com/devcontainers/python:3.13)

Verified NOT a duplicate of #5552

Tested both agent-framework==1.0.1 and agent-framework==1.2.2 — the bug reproduces identically. Upstream fix #5552 ("Fix observability spans not
being correctly nested when using streaming") does not address this.

Reproduction

Workflow shape that triggers the bug:

  1. 14+ executors built via WorkflowBuilder.
  2. One edge produces a dropped type mismatch (an intentional misroute whose message the delivery handler drops). In our case: prepare_contract_validation → ContractIntelligence (a different edge actually delivers the message; the type-mismatch edge is the one that drops).
  3. One or more validators that send a HITLRequest via ctx.send_message() and then return — the workflow pauses and resumes after an external response.
  4. Per-row work inside some validators uses asyncio.gather over a Semaphore-bounded set of LLM calls.
  5. OTel instrumentation: enable_instrumentation() called from the FastAPI lifespan; BatchSpanProcessor to Azure Monitor plus SimpleSpanProcessor to
    ConsoleExporter as a debug overlay (gated by an OTEL_DEBUG_CONSOLE-style flag).
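The per-row fan-out in step 4 boils down to the following pattern (a minimal sketch; validate_rows and validate_one are illustrative names, not MAF API, and the real per-row body awaits an LLM call):

```python
import asyncio

# Sketch of step 4: asyncio.gather over a Semaphore-bounded set of per-row
# calls. The semaphore caps how many rows are in flight at once.
async def validate_rows(rows, concurrency=6):
    sem = asyncio.Semaphore(concurrency)

    async def validate_one(row):
        async with sem:             # at most `concurrency` rows in flight
            await asyncio.sleep(0)  # stands in for the awaited LLM call
            return {"row": row, "status": "validated"}

    # gather preserves input order in its result list
    return await asyncio.gather(*(validate_one(r) for r in rows))

results = asyncio.run(validate_rows(list(range(10))))
```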

Run a representative input end-to-end and inspect stdout for the printed executor.process span bodies.

Expected behaviour

Every executor that processes a message during the workflow run should emit an executor.process <name> span via SimpleSpanProcessor (i.e.
printed to stdout) before workflow.run ends.

Actual behaviour

Several executor.process spans never appear in stdout despite their child spans being printed with valid parent_id references back to them.
Affected executors observed (varies between runs of the same input):

  • validate_unbilled_transactions (sync, no gather)
  • route_after_unbilled (sync, no gather)
  • validate_timecards (no LLM calls — pure dict aggregation)
  • validate_expenses (asyncio.gather + HITL)
  • validate_ap_invoices (asyncio.gather + HITL)

Early-pipeline executors (discover_and_prepare_files, extract_documents, persist_and_emit_result, prepare_contract_validation)
consistently DO emit their executor.process spans correctly. The break is specifically downstream of the first dropped-type-mismatch edge group.

Evidence

Excerpt from OTEL_DEBUG_CONSOLE=true stdout (parallelism = 6, single workflow run before HITL pause):

{
  "name": "chat gpt-5.3-chat",
  "context": {"trace_id": "0x185d...", "span_id": "0x89fe...d159"},
  "parent_id": "0x27e049d02c35e8b0",
  "...": "..."
}

The span_id 0x27e049d02c35e8b0 is the executor.process validate_expenses span. It is referenced as parent_id by 10 LLM-call child spans, but
never appears as a span_id in any printed JSON — neither before nor after workflow.run ends. The span is created and live during the
executor's handle() execution (otherwise child spans couldn't reference it as parent), but is never .end()-ed.
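The orphan pattern can be confirmed mechanically from the console exporter output: collect every printed span_id, then flag any parent_id that never appears in that set. A minimal sketch (assumes stdout has already been captured and split into the individual JSON bodies that ConsoleSpanExporter prints; the function name is ours):

```python
import json

def find_orphan_parents(span_jsons):
    """Given ConsoleSpanExporter JSON bodies, return parent_ids that are
    referenced by children but never printed as a span_id themselves."""
    spans = [json.loads(s) for s in span_jsons]
    printed = {s["context"]["span_id"] for s in spans}
    return {
        s["parent_id"]
        for s in spans
        if s.get("parent_id") and s["parent_id"] not in printed
    }
```

Run against the captured stdout of an affected run, this returns the missing executor.process span_ids (e.g. 0x27e049d02c35e8b0 above) while a healthy run returns an empty set.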

Ruled-out alternative explanations:

  • Not BSP queue saturation. Tested with default queue (2048) and tuned queue (8192). Reproduces identically.
  • Not OTel context propagation. All children correctly inherit trace_id from the workflow root and reference correct parent_id values.
  • Not asyncio.gather context loss. Reproduces in route_after_unbilled and validate_unbilled_transactions which don't use gather.
  • Not parallelism-driven. Reproduces at AGENT4_VALIDATOR_CONCURRENCY=6 (lowest tested) just as readily as at 10 or 30.
  • Not Anthropic/OpenAI SDK retries. The orphan spans are MAF's own executor.process spans, not SDK spans.

Request

  1. Confirm or deny that this is a known bug.
  2. If unknown, please investigate the executor.process span lifecycle in MAF's workflow runner — specifically how span closure interacts with: (a)
    HITL ctx.send_message(HITLRequest) paths that yield control to the runner, (b) executors downstream of a dropped type mismatch edge group,
    and (c) sync (non-async) executors triggered by InternalEdgeGroup messages.
  3. If reproducible, fix or document the workaround.

Workaround currently in use

None viable. Application-side wrappers (tracer.start_as_current_span around each Executor.handle()) were considered and rejected: the bug
surfaces in executors with diverse shapes (gather, no-gather, sync, async), so a wrapper strategy cannot be comprehensive without effectively
re-implementing enable_instrumentation(). We accept the trace gaps until an upstream fix lands.
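For reference, the rejected wrapper looked roughly like this (a sketch: traced_handle and the span name are ours, not MAF API; the tracer only needs a start_as_current_span(name) context manager, matching the OTel Tracer interface):

```python
import functools
import inspect

def traced_handle(tracer, span_name):
    """Wrap an executor's handle() (sync or async) in an application-side
    span so it is ended even when MAF's own span is not."""
    def decorate(handle):
        if inspect.iscoroutinefunction(handle):
            @functools.wraps(handle)
            async def async_wrapper(*args, **kwargs):
                with tracer.start_as_current_span(span_name):
                    return await handle(*args, **kwargs)
            return async_wrapper

        @functools.wraps(handle)
        def sync_wrapper(*args, **kwargs):
            with tracer.start_as_current_span(span_name):
                return handle(*args, **kwargs)
        return sync_wrapper
    return decorate
```

Even so, this only covers the handle() entry point; it cannot reproduce the attribute set or parenting behaviour that enable_instrumentation() applies internally, which is why we dropped it.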

Code Sample

Error Messages / Stack Traces

Package Versions

agent-framework==1.2.2, agent-framework-foundry==1.2.2

Python Version

Python 3.13

Additional Context

No response
