fix(core): add flush_spans() to drain BatchSpanProcessor before sandbox suspend by nagkumar91 · Pull Request #46181 · Azure/azure-sdk-for-python

nagkumar91 · 2026-04-07T14:22:43Z

Problem

BatchSpanProcessor exports spans on a background timer (default 5 seconds). In hosted sandbox environments (Azure AI Foundry vNext), the platform may suspend the process immediately after an HTTP response is sent, before the batch timer fires. This causes short-lived spans to be lost.
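The failure mode can be illustrated with a small toy model. This is only a sketch: `ToyBatchProcessor`, `tick`, and `simulate` are hypothetical names invented for illustration and are not the OpenTelemetry implementation, which uses a background worker thread rather than an explicit `tick`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyBatchProcessor:
    """Toy stand-in for a timer-driven batch exporter (NOT the OTel class)."""

    schedule_delay_s: float = 5.0
    exported: List[str] = field(default_factory=list)
    _buffer: List[str] = field(default_factory=list)
    _last_export_at: float = 0.0

    def on_end(self, span_name: str) -> None:
        # Finished spans are only queued; nothing is exported yet.
        self._buffer.append(span_name)

    def tick(self, now_s: float) -> None:
        # Simulates the background timer: export once the delay has elapsed.
        if now_s - self._last_export_at >= self.schedule_delay_s:
            self.force_flush()
            self._last_export_at = now_s

    def force_flush(self) -> None:
        # Synchronously drain whatever is buffered.
        self.exported.extend(self._buffer)
        self._buffer.clear()


def simulate(flush_before_suspend: bool) -> List[str]:
    proc = ToyBatchProcessor()
    proc.on_end("invoke_agent draft_plan")
    proc.on_end("chat gpt-4o")
    proc.tick(now_s=0.2)  # response sent 200 ms in; the timer has not fired
    if flush_before_suspend:
        proc.force_flush()
    # The sandbox suspends the process here, so no later tick ever runs.
    return proc.exported
```

Without the explicit flush, both spans are still sitting in the buffer when the process is suspended.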

Split from #46154 per review feedback — this PR contains only the core package changes. The responses package changes (_endpoint_handler.py) remain in #46154.

Changes (core package only)

| File | Change |
| --- | --- |
| `core/_tracing.py` | Add `flush_spans()` function + call from `trace_stream` `finally` block |
| `core/__init__.py` | Export `flush_spans` in public API |
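A minimal sketch of the streaming-path change, assuming the wrapper is an async generator. The real `trace_stream` signature is not shown in this PR description, so `trace_stream_sketch`, its parameters, and `demo` are hypothetical stand-ins.

```python
import asyncio
from typing import AsyncIterator, Callable, List, Tuple


async def trace_stream_sketch(
    chunks: AsyncIterator[bytes],
    end_span: Callable[[], None],
    flush: Callable[[], None],
) -> AsyncIterator[bytes]:
    """Yield chunks unchanged; end the span and flush when the stream ends."""
    try:
        async for chunk in chunks:
            yield chunk
    finally:
        # Runs on normal completion, on error, and when the generator is closed early.
        end_span()
        flush()  # hypothetical hook standing in for flush_spans()


async def demo() -> Tuple[List[bytes], List[str]]:
    events: List[str] = []

    async def body() -> AsyncIterator[bytes]:
        yield b"data: 1\n\n"
        yield b"data: 2\n\n"

    wrapped = trace_stream_sketch(
        body(),
        end_span=lambda: events.append("end_span"),
        flush=lambda: events.append("flush"),
    )
    received = [chunk async for chunk in wrapped]
    return received, events


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Placing the flush in `finally` means it runs after the last chunk has been yielded, which is the point at which the platform may suspend the process.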

`flush_spans()` behavior

- Calls `TracerProvider.force_flush(timeout_millis=5000)` to synchronously drain the batch buffer
- No-op when the OTel SDK is not installed or the provider does not support `force_flush`
- Catches and logs (at debug level) any flush errors to avoid disrupting the response path
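The three behaviors above could be sketched roughly as follows. This is an illustration, not the code merged in this PR: `flush_spans_sketch`, its `provider` parameter, the logger name, and the boolean return value are assumptions added so the behavior can be exercised with a fake provider.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger("flush_spans_sketch")  # hypothetical logger name


def flush_spans_sketch(provider: Optional[Any] = None, timeout_millis: int = 5000) -> bool:
    """Best-effort synchronous flush; returns True only if a flush succeeded."""
    if provider is None:
        try:
            from opentelemetry import trace  # provided by opentelemetry-api
        except ImportError:
            return False  # OTel not installed: no-op
        provider = trace.get_tracer_provider()

    force_flush = getattr(provider, "force_flush", None)
    if not callable(force_flush):
        return False  # provider does not support force_flush: no-op

    try:
        return bool(force_flush(timeout_millis=timeout_millis))
    except Exception:  # telemetry must never break the response path
        logger.debug("force_flush failed", exc_info=True)
        return False
```

A handler would call this just before returning its response, accepting up to `timeout_millis` of added latency in exchange for not losing buffered spans.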

Before vs After

Before — 11 spans (per-node spans lost)

invoke_agent root → invoke_agent agent → chat × 4 + tool × 4 + retriever

After — 19+ spans (full node hierarchy) ✅

invoke_agent root → invoke_agent agent
  ├─ invoke_agent user_proxy
  ├─ invoke_agent orchestrator
  ├─ invoke_agent retrieve_context → gen_ai.retriever
  ├─ invoke_agent research_destination
  ├─ invoke_agent draft_plan → chat gpt-4o
  ├─ invoke_agent run_tools → chat + 4× execute_tool + chat
  ├─ invoke_agent evaluate_constraints
  └─ invoke_agent finalize → chat gpt-4o

Verified on hosted sandbox with 5 invokes producing 19-31 spans each.

…ox suspend BatchSpanProcessor exports spans on a background timer (default 5s). In hosted sandbox environments the platform may suspend the process immediately after an HTTP response is sent, before the timer fires. This causes short-lived spans (e.g. LangGraph per-node invoke_agent spans) to be lost. Add flush_spans() to the core public API and call it from trace_stream's finally block so the streaming path also flushes. Verified: same agent produces 19-31 spans with flush vs 11 without. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>


* Address pvaneck review comments from PR #45925
  - Added docstring for TracingHelper.__init__ connection_string param
  - Added enrichment processor dupe guard (_enrichment_configured flag)
  - Fixed InMemorySpanExporter import path in test_tracing.py
  - Fixed @app → @server variable name mismatch in test_tracing.py
  - Updated invocations CHANGELOG with 2.0.0b1 + kept 1.0.0b1 history
  - Fixed duplicate InvocationAgentServerHost imports in README
  - Fixed README titles to match Verify Readmes pattern
  - Fixed tracing tests to use TestClient and set APPLICATIONINSIGHTS env var
  - Fixed test pollution: OTel provider reused across test modules
  - Removed obsolete baggage test
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix CHANGELOGs: add 1.0.0b1 history to core, keep invocations at 1.0.0b1
  Core: added historical 1.0.0b1 entry below 2.0.0b1, removed stale leaf_customer_span_id from features.
  Invocations: reverted to single 1.0.0b1 entry (new package, no prior releases). Updated feature list to reflect InvocationAgentServerHost.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Refactor TracingHelper class → module functions + host method
  Replaced the TracingHelper class with:
  - configure_tracing(): standalone function for exporter setup, overridable via AgentServerHost(configure_tracing=my_func) or disabled with configure_tracing=None
  - request_span(): module-level context manager for span creation
  - end_span/record_error/trace_stream: module-level lifecycle helpers
  - AgentServerHost.request_span(): thin method that delegates with pre-populated host identity (agent_id, project_id)
  Protocol SDKs now use self.request_span() instead of self._tracing.request_span() with None checks. All functions are no-ops when opentelemetry-api is not installed. Removed TracingHelper from core __init__.py exports. Agent identity (name/version/project_id) moved to AgentServerHost.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Update core sample to use new tracing API (self.request_span)
  Removed TracingHelper import, contextlib.nullcontext pattern, and None checks. Now uses self.request_span() and _tracing.record_error().
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Remove get_logger(): use logging.getLogger directly
  get_logger() was a one-liner wrapping logging.getLogger('azure.ai.agentserver'). Replaced all usages with direct logging.getLogger() calls and deleted _logger.py. Removed get_logger from core __init__.py exports.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Make OTel a primary dependency, remove _HAS_OTEL guard
  Moved opentelemetry-api, opentelemetry-sdk, opentelemetry-exporter-otlp, and azure-monitor-opentelemetry-exporter from optional [tracing] extras to primary dependencies. Removed _HAS_OTEL flag, try/except import guard, and all conditional checks; OTel is always available.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Update sdk/agentserver/azure-ai-agentserver-invocations/tests/test_tracing.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update sdk/agentserver/azure-ai-agentserver-invocations/tests/test_tracing.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/_tracing.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* removed redundant packages
* Address PR review: expose tracing functions publicly, fix imports, update CHANGELOG
  - Exported end_span, record_error, trace_stream from core __init__.py (no more importing internal _tracing module from other packages)
  - Updated invocations to use public imports from core
  - Updated selfhosted sample to use public record_error import
  - Added None guard in _wrap_streaming_response for otel_span
  - Fixed test docstring mismatch (tracing_disabled_by_default)
  - Updated CHANGELOG to reflect TracingHelper → functions change
  - Fixed get_logger import in githubcopilot adapter
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Spec compliance: platform header, tracer scope, error attrs, baggage, isolation headers, HTTP/2
  - x-platform-server header now includes version and python runtime
  - Instrumentation scope: Azure.AI.AgentServer (core), .Invocations (invocations)
  - record_error() now sets error.type attribute per OTel semantic conventions
  - baggage header included in W3C trace context extraction
  - x-request-id propagated into span attributes
  - Platform isolation headers (x-agent-user-isolation-key, x-agent-chat-isolation-key) exposed via request.state
  - HTTP/2 disabled in Hypercorn config (spec requires HTTP/1.1 only)
  - Fixed get_logger import in githubcopilot adapter
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Spec compliance: baggage keys, SSE keep-alive, structured logging, SIGTERM forwarding
  - Re-added W3C baggage propagation for invocation_id/session_id
  - Added SSE_KEEPALIVE_INTERVAL env var and resolve_sse_keepalive_interval()
  - Added sse_keepalive_stream() as AgentServerHost static method (not in tracing)
  - Added _InvocationLogFilter for structured log scope with InvocationId/SessionId
  - Added SIGTERM handler in run() that logs and re-raises for Hypercorn
  - Separated trace_stream (tracing concern) from sse_keepalive_stream (transport)
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix PR review: contextvars for logging, SIGTERM restore, deduplicate log handler
  - Replaced per-request logger.addFilter/removeFilter with contextvars (_invocation_id_var, _session_id_var) for concurrency-safe structured logging. Filter installed once at module level, reads from contextvars.
  - SIGTERM handler now restored in finally block after run() exits.
  - _setup_otlp_log_export only adds LoggingHandler when Azure Monitor handler is not already configured, preventing duplicate log emission.
  - Fixed Black formatting in test_tracing.py dict literals.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Move azure-monitor-opentelemetry-exporter to optional dep
  The CI dev-build tool (process_requires) rewrites all azure-* dependency version specs for dev builds, transforming >=1.0.0b21 into >=1.0.0a1,<1.0.0b0 which is unresolvable. The exporter is imported lazily with try/except in _tracing.py so it works correctly when installed separately.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Revert "Move azure-monitor-opentelemetry-exporter to optional dep"
  This reverts commit 973a32b.
* Add azure-monitor-opentelemetry-exporter to CI Artifacts
  The CI dev-build tool (process_requires) rewrites azure-* dependency version specs. Adding the exporter to Artifacts ensures a compatible dev build is published to the dev feed alongside agentserver packages.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix PR comments: log filter comment, tracing docstring, changelog baggage
  - Updated _ensure_log_filter comment to say 'first request' not 'module load'
  - Updated _tracing.py docstring: OTel is required, not optional
  - Added W3C Baggage and structured logging to invocations CHANGELOG
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Revert "Add azure-monitor-opentelemetry-exporter to CI Artifacts"
  This reverts commit 15aa0c5.
* Fix PR comments: keepalive shield, SIGTERM comment, docstring, thread-safe filter
  - sse_keepalive_stream: use asyncio.shield to prevent cancelling upstream iterator on timeout. Reuse pending task across timeouts.
  - SIGTERM handler comment: clarified it logs and re-raises, not forwards.
  - request_span docstring: removed stale 'no-op when OTel not installed'.
  - _ensure_log_filter: added threading.Lock for double-checked locking.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix: duplicate imports, sanitize IDs in get/cancel endpoints, copilot adapter logging
  - Removed duplicate 'import logging' in _invocation.py
  - Added _sanitize_id for invocation_id and session_id in _traced_invocation_endpoint
  - Fixed _copilot_adapter.py: consolidated logging import, removed _logging alias
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Spec compliance: baggage propagator, x-request-id baggage, baggage-to-log processor
  - Replaced TraceContextTextMapPropagator with CompositePropagator (TraceContextTextMapPropagator + W3CBaggagePropagator) to properly extract inbound baggage header into OTel context.
  - x-request-id now set as both span attribute AND baggage entry for downstream propagation.
  - Added _BaggageLogRecordProcessor that copies all W3C Baggage entries into every OTel log record's attributes for end-to-end correlation. Registered on both Azure Monitor and OTLP log providers.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix import ordering: stdlib → third-party → local per ruff/isort
  - _base.py: moved 'import sys' to stdlib group
  - _tracing.py: moved opentelemetry imports to third-party group before constants
  - _invocation.py: moved contextvars to stdlib group, opentelemetry to third-party, removed duplicate import
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix _BaggageLogRecordProcessor: rename emit → on_emit per OTel SDK API
  The OTel SDK's LogRecordProcessor interface requires on_emit(), not emit(). This caused AttributeError when the log handler tried to process log records through the processor chain.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add AgentConfig dataclass on app.config, replace Constants
  New frozen dataclass AgentConfig populated from env vars at init time: app.config.agent_name, .agent_version, .agent_id, .project_id, .project_endpoint, .session_id, .port, .appinsights_connection_string, .otlp_endpoint, .sse_keepalive_interval
  - Replaced Constants class with private _ENV_* constants in _config.py
  - AgentServerHost stores self.config = AgentConfig.from_env()
  - Invocations uses self.config.session_id instead of os.environ.get()
  - Exported AgentConfig from core __init__.py
  - Updated all tests to use string literals for env var names
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Set service.name to agent_name, add operation_name to get/cancel spans
  - service.name span attribute now uses agent_name (falls back to 'azure.ai.agentserver' when agent_name is empty)
  - _traced_invocation_endpoint now passes operation_name for get_invocation and cancel_invocation spans
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Suppress noisy Azure SDK and OTel exporter INFO logs
  Set azure.core.pipeline.policies.http_logging_policy and azure.monitor.opentelemetry.exporter loggers to WARNING by default to avoid flooding stderr with HTTP request/response details and exporter transmission status.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Set cloud_RoleName to agent name via OTel Resource service.name
  The OTel Resource's service.name attribute maps to cloud_RoleName in App Insights. Now set to FOUNDRY_AGENT_NAME (falls back to 'azure.ai.agentserver' when not set). This ensures both spans and logs show the agent name as the cloud role in App Insights.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Use _config._ENV_FOUNDRY_AGENT_NAME instead of hardcoded string
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix(core): add flush_spans() to drain BatchSpanProcessor before sandbox suspend (#46181)
  BatchSpanProcessor exports spans on a background timer (default 5s). In hosted sandbox environments the platform may suspend the process immediately after an HTTP response is sent, before the timer fires. This causes short-lived spans (e.g. LangGraph per-node invoke_agent spans) to be lost.
  Add flush_spans() to the core public API and call it from trace_stream's finally block so the streaming path also flushes.
  Verified: same agent produces 19-31 spans with flush vs 11 without.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Update sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/_config.py
  Co-authored-by: Johan Stenberg (MSFT) <johan.stenberg@microsoft.com>
* Update sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/_config.py
  Co-authored-by: Johan Stenberg (MSFT) <johan.stenberg@microsoft.com>
* fix(agentserver-core): move agent identity attrs to _on_ending in FoundryEnrichmentSpanProcessor (#46186)
  * fix: move agent identity attrs to _on_ending in FoundryEnrichmentSpanProcessor
    Move gen_ai.agent.name, gen_ai.agent.version, and gen_ai.agent.id from on_start to _on_ending so underlying frameworks (LangChain, Semantic Kernel, etc.) cannot overwrite them. Uses guarded direct _attributes access as a workaround for opentelemetry-sdk <=1.40.0 spec-compliance gap where set_attribute() is a no-op during _on_ending.
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
  * fix: add shutdown() to _CollectorExporter for SDK compatibility
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
  * Apply suggestion from @Copilot
    Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
  * Remove redundant shutdown method definition
  ---------
  Co-authored-by: Neehar Duvvuri <neeharduvvuri@Neehars-MacBook-Pro.local>
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* fix: update type imports and add missing method return type annotation
* removed dataclass
* Fix cspell errors: coro→coroutine, reraises→re_raises, sess→session
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix mypy: add type ignore for _BaggageLogRecordProcessor arg-type
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix pylint docstrings: add types, keyword args, rtype, unused arg prefix
  - request_span (both _base.py and _tracing.py): added :type:, :keyword:, :paramtype:, :rtype:, and missing instrumentation_scope doc
  - _handle_sigterm: prefixed unused args with _ (_signum, _frame)
  - sse_keepalive_stream: added :type: and :rtype:
  - end_span, flush_spans, record_error, trace_stream: added :type:
  - _BaggageLogRecordProcessor.on_emit: added :param log_data:
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Bump minimum opentelemetry version to 1.33.0
  Ensures _on_ending span processor method and stable baggage/log APIs are available. Avoids edge cases with older SDK versions.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Bump opentelemetry dependencies to version 1.40.0
* Bump opentelemetry dependencies to version 1.40.0 in dev requirements
* Update sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/_base.py
  Co-authored-by: Johan Stenberg (MSFT) <johan.stenberg@microsoft.com>
* updated codeowner

---------

Co-authored-by: Ankit Sinha <anksinha@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Nagkumar Arkalgud <nagkumar91@users.noreply.github.com>
Co-authored-by: Johan Stenberg (MSFT) <johan.stenberg@microsoft.com>
Co-authored-by: Neehar Duvvuri <40341266+needuv@users.noreply.github.com>
Co-authored-by: Neehar Duvvuri <neeharduvvuri@Neehars-MacBook-Pro.local>

nagkumar91 requested review from JC-386 and lusu-msft as code owners April 7, 2026 14:22

github-actions bot added the Hosted Agents sdk/agentserver/* label Apr 7, 2026

nagkumar91 mentioned this pull request Apr 7, 2026

fix: flush pending spans before HTTP response to prevent loss in hosted sandboxes #46154

Merged

ankitbko merged commit c96b5fd into agentserver/invoke Apr 7, 2026
4 of 5 checks passed

ankitbko deleted the fix/agentserver-core-flush-spans branch April 7, 2026 16:18


