fix(core): add flush_spans() to drain BatchSpanProcessor before sandbox suspend #46181
Merged
ankitbko merged 1 commit into agentserver/invoke from Apr 7, 2026
…ox suspend

BatchSpanProcessor exports spans on a background timer (default 5s). In hosted sandbox environments the platform may suspend the process immediately after an HTTP response is sent, before the timer fires. This causes short-lived spans (e.g. LangGraph per-node invoke_agent spans) to be lost.

Add flush_spans() to the core public API and call it from trace_stream's finally block so the streaming path also flushes.

Verified: same agent produces 19-31 spans with flush vs 11 without.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ankitbko added a commit that referenced this pull request Apr 8, 2026
* Address pvaneck review comments from PR #45925
  - Added docstring for TracingHelper.__init__ connection_string param
  - Added enrichment processor dupe guard (_enrichment_configured flag)
  - Fixed InMemorySpanExporter import path in test_tracing.py
  - Fixed @app → @server variable name mismatch in test_tracing.py
  - Updated invocations CHANGELOG with 2.0.0b1 + kept 1.0.0b1 history
  - Fixed duplicate InvocationAgentServerHost imports in README
  - Fixed README titles to match Verify Readmes pattern
  - Fixed tracing tests to use TestClient and set APPLICATIONINSIGHTS env var
  - Fixed test pollution: OTel provider reused across test modules
  - Removed obsolete baggage test
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix CHANGELOGs: add 1.0.0b1 history to core, keep invocations at 1.0.0b1
  Core: added historical 1.0.0b1 entry below 2.0.0b1, removed stale leaf_customer_span_id from features.
  Invocations: reverted to single 1.0.0b1 entry (new package, no prior releases). Updated feature list to reflect InvocationAgentServerHost.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Refactor TracingHelper class → module functions + host method
  Replaced the TracingHelper class with:
  - configure_tracing(): standalone function for exporter setup, overridable via AgentServerHost(configure_tracing=my_func) or disabled with configure_tracing=None
  - request_span(): module-level context manager for span creation
  - end_span/record_error/trace_stream: module-level lifecycle helpers
  - AgentServerHost.request_span(): thin method that delegates with pre-populated host identity (agent_id, project_id)
  Protocol SDKs now use self.request_span() instead of self._tracing.request_span() with None checks. All functions are no-ops when opentelemetry-api is not installed. Removed TracingHelper from core __init__.py exports. Agent identity (name/version/project_id) moved to AgentServerHost.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Update core sample to use new tracing API (self.request_span)
  Removed TracingHelper import, contextlib.nullcontext pattern, and None checks. Now uses self.request_span() and _tracing.record_error().
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Remove get_logger(): use logging.getLogger directly
  get_logger() was a one-liner wrapping logging.getLogger('azure.ai.agentserver'). Replaced all usages with direct logging.getLogger() calls and deleted _logger.py. Removed get_logger from core __init__.py exports.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Make OTel a primary dependency, remove _HAS_OTEL guard
  Moved opentelemetry-api, opentelemetry-sdk, opentelemetry-exporter-otlp, and azure-monitor-opentelemetry-exporter from optional [tracing] extras to primary dependencies. Removed _HAS_OTEL flag, try/except import guard, and all conditional checks, so OTel is always available.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Update sdk/agentserver/azure-ai-agentserver-invocations/tests/test_tracing.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update sdk/agentserver/azure-ai-agentserver-invocations/tests/test_tracing.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/_tracing.py
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* removed redundant packages

* Address PR review: expose tracing functions publicly, fix imports, update CHANGELOG
  - Exported end_span, record_error, trace_stream from core __init__.py (no more importing internal _tracing module from other packages)
  - Updated invocations to use public imports from core
  - Updated selfhosted sample to use public record_error import
  - Added None guard in _wrap_streaming_response for otel_span
  - Fixed test docstring mismatch (tracing_disabled_by_default)
  - Updated CHANGELOG to reflect TracingHelper → functions change
  - Fixed get_logger import in githubcopilot adapter
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Spec compliance: platform header, tracer scope, error attrs, baggage, isolation headers, HTTP/2
  - x-platform-server header now includes version and python runtime
  - Instrumentation scope: Azure.AI.AgentServer (core), .Invocations (invocations)
  - record_error() now sets error.type attribute per OTel semantic conventions
  - baggage header included in W3C trace context extraction
  - x-request-id propagated into span attributes
  - Platform isolation headers (x-agent-user-isolation-key, x-agent-chat-isolation-key) exposed via request.state
  - HTTP/2 disabled in Hypercorn config (spec requires HTTP/1.1 only)
  - Fixed get_logger import in githubcopilot adapter
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Spec compliance: baggage keys, SSE keep-alive, structured logging, SIGTERM forwarding
  - Re-added W3C baggage propagation for invocation_id/session_id
  - Added SSE_KEEPALIVE_INTERVAL env var and resolve_sse_keepalive_interval()
  - Added sse_keepalive_stream() as AgentServerHost static method (not in tracing)
  - Added _InvocationLogFilter for structured log scope with InvocationId/SessionId
  - Added SIGTERM handler in run() that logs and re-raises for Hypercorn
  - Separated trace_stream (tracing concern) from sse_keepalive_stream (transport)
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix PR review: contextvars for logging, SIGTERM restore, deduplicate log handler
  - Replaced per-request logger.addFilter/removeFilter with contextvars (_invocation_id_var, _session_id_var) for concurrency-safe structured logging. Filter installed once at module level, reads from contextvars.
  - SIGTERM handler now restored in finally block after run() exits.
  - _setup_otlp_log_export only adds LoggingHandler when Azure Monitor handler is not already configured, preventing duplicate log emission.
  - Fixed Black formatting in test_tracing.py dict literals.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Move azure-monitor-opentelemetry-exporter to optional dep
  The CI dev-build tool (process_requires) rewrites all azure-* dependency version specs for dev builds, transforming >=1.0.0b21 into >=1.0.0a1,<1.0.0b0 which is unresolvable. The exporter is imported lazily with try/except in _tracing.py so it works correctly when installed separately.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Revert "Move azure-monitor-opentelemetry-exporter to optional dep"
  This reverts commit 973a32b.

* Add azure-monitor-opentelemetry-exporter to CI Artifacts
  The CI dev-build tool (process_requires) rewrites azure-* dependency version specs. Adding the exporter to Artifacts ensures a compatible dev build is published to the dev feed alongside agentserver packages.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix PR comments: log filter comment, tracing docstring, changelog baggage
  - Updated _ensure_log_filter comment to say 'first request' not 'module load'
  - Updated _tracing.py docstring: OTel is required, not optional
  - Added W3C Baggage and structured logging to invocations CHANGELOG
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Revert "Add azure-monitor-opentelemetry-exporter to CI Artifacts"
  This reverts commit 15aa0c5.

* Fix PR comments: keepalive shield, SIGTERM comment, docstring, thread-safe filter
  - sse_keepalive_stream: use asyncio.shield to prevent cancelling upstream iterator on timeout. Reuse pending task across timeouts.
  - SIGTERM handler comment: clarified it logs and re-raises, not forwards.
  - request_span docstring: removed stale 'no-op when OTel not installed'.
  - _ensure_log_filter: added threading.Lock for double-checked locking.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix: duplicate imports, sanitize IDs in get/cancel endpoints, copilot adapter logging
  - Removed duplicate 'import logging' in _invocation.py
  - Added _sanitize_id for invocation_id and session_id in _traced_invocation_endpoint
  - Fixed _copilot_adapter.py: consolidated logging import, removed _logging alias
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Spec compliance: baggage propagator, x-request-id baggage, baggage-to-log processor
  - Replaced TraceContextTextMapPropagator with CompositePropagator (TraceContextTextMapPropagator + W3CBaggagePropagator) to properly extract inbound baggage header into OTel context.
  - x-request-id now set as both span attribute AND baggage entry for downstream propagation.
  - Added _BaggageLogRecordProcessor that copies all W3C Baggage entries into every OTel log record's attributes for end-to-end correlation. Registered on both Azure Monitor and OTLP log providers.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix import ordering: stdlib → third-party → local per ruff/isort
  - _base.py: moved 'import sys' to stdlib group
  - _tracing.py: moved opentelemetry imports to third-party group before constants
  - _invocation.py: moved contextvars to stdlib group, opentelemetry to third-party, removed duplicate import
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix _BaggageLogRecordProcessor: rename emit → on_emit per OTel SDK API
  The OTel SDK's LogRecordProcessor interface requires on_emit(), not emit(). This caused AttributeError when the log handler tried to process log records through the processor chain.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add AgentConfig dataclass on app.config, replace Constants
  New frozen dataclass AgentConfig populated from env vars at init time: app.config.agent_name, .agent_version, .agent_id, .project_id, .project_endpoint, .session_id, .port, .appinsights_connection_string, .otlp_endpoint, .sse_keepalive_interval
  - Replaced Constants class with private _ENV_* constants in _config.py
  - AgentServerHost stores self.config = AgentConfig.from_env()
  - Invocations uses self.config.session_id instead of os.environ.get()
  - Exported AgentConfig from core __init__.py
  - Updated all tests to use string literals for env var names
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Set service.name to agent_name, add operation_name to get/cancel spans
  - service.name span attribute now uses agent_name (falls back to 'azure.ai.agentserver' when agent_name is empty)
  - _traced_invocation_endpoint now passes operation_name for get_invocation and cancel_invocation spans
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Suppress noisy Azure SDK and OTel exporter INFO logs
  Set azure.core.pipeline.policies.http_logging_policy and azure.monitor.opentelemetry.exporter loggers to WARNING by default to avoid flooding stderr with HTTP request/response details and exporter transmission status.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Set cloud_RoleName to agent name via OTel Resource service.name
  The OTel Resource's service.name attribute maps to cloud_RoleName in App Insights. Now set to FOUNDRY_AGENT_NAME (falls back to 'azure.ai.agentserver' when not set). This ensures both spans and logs show the agent name as the cloud role in App Insights.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Use _config._ENV_FOUNDRY_AGENT_NAME instead of hardcoded string
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(core): add flush_spans() to drain BatchSpanProcessor before sandbox suspend (#46181)
  BatchSpanProcessor exports spans on a background timer (default 5s). In hosted sandbox environments the platform may suspend the process immediately after an HTTP response is sent, before the timer fires. This causes short-lived spans (e.g. LangGraph per-node invoke_agent spans) to be lost.
  Add flush_spans() to the core public API and call it from trace_stream's finally block so the streaming path also flushes.
  Verified: same agent produces 19-31 spans with flush vs 11 without.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Update sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/_config.py
  Co-authored-by: Johan Stenberg (MSFT) <johan.stenberg@microsoft.com>

* Update sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/_config.py
  Co-authored-by: Johan Stenberg (MSFT) <johan.stenberg@microsoft.com>

* fix(agentserver-core): move agent identity attrs to _on_ending in FoundryEnrichmentSpanProcessor (#46186)

  * fix: move agent identity attrs to _on_ending in FoundryEnrichmentSpanProcessor
    Move gen_ai.agent.name, gen_ai.agent.version, and gen_ai.agent.id from on_start to _on_ending so underlying frameworks (LangChain, Semantic Kernel, etc.) cannot overwrite them. Uses guarded direct _attributes access as a workaround for opentelemetry-sdk <=1.40.0 spec-compliance gap where set_attribute() is a no-op during _on_ending.
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

  * fix: add shutdown() to _CollectorExporter for SDK compatibility
    Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

  * Apply suggestion from @Copilot
    Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

  * Remove redundant shutdown method definition

  ---------
  Co-authored-by: Neehar Duvvuri <neeharduvvuri@Neehars-MacBook-Pro.local>
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
  Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix: update type imports and add missing method return type annotation

* removed dataclass

* Fix cspell errors: coro→coroutine, reraises→re_raises, sess→session
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix mypy: add type ignore for _BaggageLogRecordProcessor arg-type
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix pylint docstrings: add types, keyword args, rtype, unused arg prefix
  - request_span (both _base.py and _tracing.py): added :type:, :keyword:, :paramtype:, :rtype:, and missing instrumentation_scope doc
  - _handle_sigterm: prefixed unused args with _ (_signum, _frame)
  - sse_keepalive_stream: added :type: and :rtype:
  - end_span, flush_spans, record_error, trace_stream: added :type:
  - _BaggageLogRecordProcessor.on_emit: added :param log_data:
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Bump minimum opentelemetry version to 1.33.0
  Ensures _on_ending span processor method and stable baggage/log APIs are available. Avoids edge cases with older SDK versions.
  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Bump opentelemetry dependencies to version 1.40.0

* Bump opentelemetry dependencies to version 1.40.0 in dev requirements

* Update sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/_base.py
  Co-authored-by: Johan Stenberg (MSFT) <johan.stenberg@microsoft.com>

* updated codeowner

---------
Co-authored-by: Ankit Sinha <anksinha@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Nagkumar Arkalgud <nagkumar91@users.noreply.github.com>
Co-authored-by: Johan Stenberg (MSFT) <johan.stenberg@microsoft.com>
Co-authored-by: Neehar Duvvuri <40341266+needuv@users.noreply.github.com>
Co-authored-by: Neehar Duvvuri <neeharduvvuri@Neehars-MacBook-Pro.local>
Problem
BatchSpanProcessor exports spans on a background timer (default 5 seconds). In hosted sandbox environments (Azure AI Foundry vNext), the platform may suspend the process immediately after an HTTP response is sent, before the batch timer fires. This causes short-lived spans to be lost.

Split from #46154 per review feedback; this PR contains only the core package changes. The responses package changes (_endpoint_handler.py) remain in #46154.

Changes (core package only)
- core/_tracing.py: flush_spans() function + call from trace_stream finally block
- core/__init__.py: flush_spans in public API

flush_spans() behavior
- TracerProvider.force_flush(timeout_millis=5000) to synchronously drain the batch buffer
- force_flush

Before vs After
Before — 11 spans (per-node spans lost)
After — 19+ spans (full node hierarchy) ✅
Verified on hosted sandbox with 5 invokes producing 19-31 spans each.
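
Sketch of the flush pattern

For readers without the diff open, the pattern described above can be sketched roughly as follows. This is not the code from core/_tracing.py: the function signatures, the explicit provider parameter, the exception handling, and the FakeProvider class are all assumptions made to keep the example self-contained. In the real package the provider would presumably come from opentelemetry.trace.get_tracer_provider(), and the SDK TracerProvider exposes force_flush(timeout_millis=...), while the API-level default provider does not, hence the getattr guard.

```python
from __future__ import annotations

import asyncio
from typing import Any, AsyncIterator


def flush_spans(provider: Any, timeout_millis: int = 5000) -> bool:
    """Synchronously drain buffered spans if the provider supports it.

    Hypothetical signature: the provider is passed in explicitly here so the
    sketch runs without OpenTelemetry installed.
    """
    force_flush = getattr(provider, "force_flush", None)
    if force_flush is None:
        # Providers without force_flush have no batch buffer to drain.
        return False
    try:
        return bool(force_flush(timeout_millis=timeout_millis))
    except Exception:
        # Assumption: a telemetry failure should not break the request path.
        return False


async def trace_stream(
    chunks: AsyncIterator[bytes], provider: Any
) -> AsyncIterator[bytes]:
    """Pass chunks through unchanged and flush when the stream ends.

    The finally block runs on normal completion, on error, and when the
    consumer closes the stream early.
    """
    try:
        async for chunk in chunks:
            yield chunk
    finally:
        flush_spans(provider)


class FakeProvider:
    """Hypothetical stand-in for an SDK TracerProvider; records flush calls."""

    def __init__(self) -> None:
        self.flush_calls: list[int] = []

    def force_flush(self, timeout_millis: int = 30000) -> bool:
        self.flush_calls.append(timeout_millis)
        return True


async def _demo() -> tuple[list[bytes], list[int]]:
    async def body() -> AsyncIterator[bytes]:
        yield b"a"
        yield b"b"

    provider = FakeProvider()
    received = [chunk async for chunk in trace_stream(body(), provider)]
    return received, provider.flush_calls


received, flush_calls = asyncio.run(_demo())
```

Placing the flush in a finally block means the streaming path drains the buffer once per response, after the last chunk, rather than waiting for the background timer that the sandbox may never let fire.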