fix: flush pending spans before HTTP response to prevent loss in hosted sandboxes #46154

Merged

ankitbko merged 2 commits into agentserver/responses from fix/agentserver-flush-spans-before-response on Apr 7, 2026
Conversation


@nagkumar91 nagkumar91 commented Apr 6, 2026

Problem

BatchSpanProcessor exports spans on a background timer (default 5 seconds). In hosted sandbox environments (Azure AI Foundry vNext), the platform may suspend the process immediately after an HTTP response is sent, before the batch timer fires. This causes short-lived spans to be lost.

What gets lost: Per-node invoke_agent spans from LangGraph/LangChain auto-instrumentation (<1ms each) that end just before the response is returned.

What survives without the fix: chat spans (3-8s) and execute_tool spans — these end during graph execution while subsequent LLM calls create enough wall-clock delay for the batch timer to fire.

Before vs After

Before (without force_flush) — 11 spans

request:  invoke_agent naarkalg-langgraph-travel-agent:1       (29s)
  └─ dependency: invoke_agent naarkalg-langgraph-travel-agent  (29s)
       ├─ gen_ai.retriever                                      (0ms)
       ├─ chat gpt-4o                                           (5.1s)
       ├─ chat gpt-4o                                           (2.6s)
       ├─ execute_tool search_flights                           (0ms)
       ├─ execute_tool search_hotels                            (0ms)
       ├─ execute_tool get_destination_weather                  (0ms)
       ├─ execute_tool estimate_trip_cost                       (1ms)
       ├─ chat gpt-4o                                           (9.3s)
       └─ chat gpt-4o                                           (12.3s)

All per-node invoke_agent spans (user_proxy, orchestrator, draft_plan, run_tools, finalize, etc.) are missing — lost in the BatchSpanProcessor buffer when the sandbox suspended.

After (with force_flush) — 19 spans ✅

request:  invoke_agent naarkalg-langgraph-travel-agent:1       (29s)
  └─ dependency: invoke_agent naarkalg-langgraph-travel-agent  (29s)
       ├─ invoke_agent user_proxy                               (0ms)
       ├─ invoke_agent orchestrator                             (0ms)
       ├─ invoke_agent retrieve_context                         (0ms)
       │  └─ gen_ai.retriever                                   (0ms)
       ├─ invoke_agent research_destination                     (0ms)
       ├─ invoke_agent draft_plan                               (5.1s)
       │  └─ chat gpt-4o                                        (5.1s)
       ├─ invoke_agent run_tools                                (11.9s)
       │  ├─ chat gpt-4o                                        (2.6s)
       │  ├─ execute_tool search_flights                        (0ms)
       │  ├─ execute_tool search_hotels                         (0ms)
       │  ├─ execute_tool get_destination_weather               (0ms)
       │  ├─ execute_tool estimate_trip_cost                    (1ms)
       │  └─ chat gpt-4o                                        (9.3s)
       ├─ invoke_agent evaluate_constraints                     (0ms)
       └─ invoke_agent finalize                                 (12.3s)
            └─ chat gpt-4o                                      (12.3s)

Full graph-node hierarchy preserved. For traces with replan loops (budget exceeded), span count grows to 31 with the full replan path visible.

Hosted validation (5 invokes, same agent image)

| Prompt | Total spans | invoke_agent | chat | tool |
|---|---|---|---|---|
| Tokyo 3-day, $3000 | 19 | 10 | 4 | 4 |
| Paris 5-day, $2500 | 31 | 14 | 8 | 8 |
| Rome weekend, $1800 | 31 | 14 | 8 | 8 |
| Seoul 6-day, $5000 | 19 | 10 | 4 | 4 |
| Bali honeymoon, $8000 | 19 | 10 | 4 | 4 |

Paris and Rome trigger a replan loop (budget exceeded → replan → re-run tools), producing 31 spans.

Fix

  • Add flush_spans() to the azure-ai-agentserver-core public API (_tracing.py)
  • Call it in _endpoint_handler.py finally block (covers all non-streaming exit paths)
  • Call it in trace_stream() finally block (covers the streaming path)
  • Gracefully no-ops when OTel SDK is not installed or provider does not support force_flush
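The actual SDK source is not shown in this PR view, but the bullets above imply roughly the following shape for flush_spans(). The function body below is an assumption consistent with those bullets; only the OpenTelemetry calls are real API:

```python
def flush_spans(timeout_millis: int = 10000) -> None:
    """Best-effort flush of pending spans (sketch; not the actual SDK code)."""
    try:
        from opentelemetry import trace  # optional dependency
    except ImportError:
        return  # no-op when the OTel SDK is not installed
    provider = trace.get_tracer_provider()
    # The default ProxyTracerProvider has no force_flush; only the SDK's
    # TracerProvider does, so feature-detect instead of isinstance-checking.
    force_flush = getattr(provider, "force_flush", None)
    if force_flush is None:
        return
    try:
        force_flush(timeout_millis)
    except Exception:
        pass  # telemetry must never fail the request
```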

Changes

| File | Change |
|---|---|
| core/_tracing.py | Add flush_spans() function + call from trace_stream |
| core/__init__.py | Export flush_spans |
| responses/hosting/_endpoint_handler.py | Import + call flush_spans() in finally block |
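Why the finally block is the right hook for the non-streaming path: it runs on every exit (return, exception, cancellation), so spans ended during the request are exported before the response completes. A self-contained sketch; handle_request and the recording flush_spans stand-in are illustrative, not the SDK's actual handler:

```python
import asyncio

flush_calls = []

def flush_spans():
    # Stand-in for azure-ai-agentserver-core's flush_spans(); records each call.
    flush_calls.append("flushed")

async def handle_request(fail: bool = False):
    try:
        if fail:
            raise RuntimeError("agent error")
        return {"status": "ok"}
    finally:
        # Runs on success, error, and cancellation alike, so pending spans
        # are flushed before the HTTP response is finalized.
        flush_spans()

result = asyncio.run(handle_request())
try:
    asyncio.run(handle_request(fail=True))
except RuntimeError:
    pass
print(flush_calls)  # one flush per exit path
```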

Environment

  • Agent: naarkalg-langgraph-travel-agent:1 on hosted-agents-evals-bugbash-wus2
  • Image: hostedagentsevals.azurecr.io/naarkalg-langgraph-travel-agent:20260406112724
  • Tracer: langchain-azure-ai[opentelemetry] from langchain-ai/langchain-azure@main
  • Model: gpt-4o

fix: flush pending spans before HTTP response to prevent loss in hosted sandboxes

BatchSpanProcessor exports spans on a background timer (default 5s).
In hosted sandbox environments the platform may suspend the process
immediately after the HTTP response is sent, before the timer fires.
This causes short-lived spans — such as LangGraph per-node invoke_agent
spans created by third-party tracers — to be lost.

Add flush_spans() to the core public API and call it:
- In _endpoint_handler.py's finally block (covers all non-streaming exits)
- In trace_stream's finally block (covers the streaming path)

Locally verified: same agent code produces 30 spans with flush vs 11
without flush, confirming the BatchSpanProcessor timing issue.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
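The streaming path uses the same mechanism through trace_stream's finally block: when the consumer finishes (or abandons) the stream, spans are flushed before the response stream closes. A stdlib-only sketch; the generator body and event values are illustrative:

```python
import asyncio

flushes = []

def flush_spans():
    flushes.append(True)  # stand-in for the real flush_spans()

async def trace_stream(events):
    # Wrap the event stream so that pending spans are flushed once
    # iteration ends, whether the stream completes or is abandoned.
    try:
        for event in events:
            yield event
    finally:
        flush_spans()

async def main():
    return [event async for event in trace_stream(["delta-1", "delta-2", "done"])]

chunks = asyncio.run(main())
```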
@github-actions github-actions bot added the Hosted Agents sdk/agentserver/* label Apr 6, 2026
AgentServerHost,
create_error_response,
end_span,
flush_spans,

@nagkumar91 (author) replied:

Done! #46181 is the other PR

@nagkumar91 nagkumar91 changed the base branch from agentserver/responses to agentserver/invoke April 7, 2026 14:17
@nagkumar91 nagkumar91 changed the base branch from agentserver/invoke to agentserver/responses April 7, 2026 14:18
@ankitbko ankitbko merged commit 925b491 into agentserver/responses Apr 7, 2026
2 checks passed
@ankitbko ankitbko deleted the fix/agentserver-flush-spans-before-response branch April 7, 2026 19:29
