
fix(core): add flush_spans() to drain BatchSpanProcessor before sandbox suspend #46181

Merged
ankitbko merged 1 commit into agentserver/invoke from fix/agentserver-core-flush-spans
Apr 7, 2026


@nagkumar91

Problem

BatchSpanProcessor exports spans on a background timer (default 5 seconds). In hosted sandbox environments (Azure AI Foundry vNext), the platform may suspend the process immediately after an HTTP response is sent, before the batch timer fires. This causes short-lived spans to be lost.

Split from #46154 per review feedback — this PR contains only the core package changes. The responses package changes (_endpoint_handler.py) remain in #46154.

Changes (core package only)

File               Change
core/_tracing.py   Add flush_spans() function + call from trace_stream finally block
core/__init__.py   Export flush_spans in public API
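
To illustrate why the flush sits in a finally block, here is a minimal sketch of a streaming wrapper that runs a flush callback once the stream ends, whether it finishes normally, raises, or is closed early. The name stream_with_flush and its flush parameter are hypothetical and chosen for this example only; the real trace_stream signature is not shown in this PR description.

```python
import asyncio
from typing import AsyncIterator, Callable, TypeVar

T = TypeVar("T")


async def stream_with_flush(
    stream: AsyncIterator[T], flush: Callable[[], None]
) -> AsyncIterator[T]:
    """Hypothetical stand-in for trace_stream: re-yield chunks, then flush."""
    try:
        async for chunk in stream:
            yield chunk
    finally:
        # Runs on normal completion, on error, and on early close, so spans
        # are drained before the platform can suspend the process.
        flush()


async def _demo():
    calls = []

    async def chunks():
        yield "a"
        yield "b"

    out = [c async for c in stream_with_flush(chunks(), lambda: calls.append("flushed"))]
    return out, calls


demo_output, demo_calls = asyncio.run(_demo())
```

In this sketch the callback is injected so the wrapper can be exercised without OpenTelemetry; in the PR the equivalent call is flush_spans().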

flush_spans() behavior

  • Calls TracerProvider.force_flush(timeout_millis=5000) to synchronously drain the batch buffer
  • No-op when OTel SDK is not installed or provider does not support force_flush
  • Catches and logs (debug level) any flush errors to avoid disrupting the response path
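
The three behaviors above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: the optional provider argument and the boolean return value are added here so the sketch can run without OpenTelemetry installed, and the logger name is taken from the commit log further down.

```python
import logging

logger = logging.getLogger("azure.ai.agentserver")


def flush_spans(provider=None, timeout_millis=5000):
    """Sketch: synchronously drain buffered spans; never raise."""
    if provider is None:
        try:
            from opentelemetry import trace
        except ImportError:
            return False  # OTel not installed: no-op
        provider = trace.get_tracer_provider()

    # API-only providers (no SDK) do not expose force_flush.
    force_flush = getattr(provider, "force_flush", None)
    if not callable(force_flush):
        return False

    try:
        return bool(force_flush(timeout_millis))
    except Exception:
        # Debug level only: a failed flush must not disrupt the response path.
        logger.debug("flush_spans failed", exc_info=True)
        return False


class _FakeProvider:
    """Test double standing in for an SDK TracerProvider (assumption)."""

    def __init__(self, fail=False):
        self.fail = fail
        self.timeouts = []

    def force_flush(self, timeout_millis=30000):
        self.timeouts.append(timeout_millis)
        if self.fail:
            raise RuntimeError("exporter unavailable")
        return True


ok_provider = _FakeProvider()
flushed = flush_spans(ok_provider)
failed = flush_spans(_FakeProvider(fail=True))
unsupported = flush_spans(object())
```

Bounding the flush at 5000 ms keeps a slow or unreachable exporter from holding the response path open indefinitely.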

Before vs After

Before — 11 spans (per-node spans lost)

invoke_agent root → invoke_agent agent → chat × 4 + tool × 4 + retriever

After — 19+ spans (full node hierarchy) ✅

invoke_agent root → invoke_agent agent
  ├─ invoke_agent user_proxy
  ├─ invoke_agent orchestrator
  ├─ invoke_agent retrieve_context → gen_ai.retriever
  ├─ invoke_agent research_destination
  ├─ invoke_agent draft_plan → chat gpt-4o
  ├─ invoke_agent run_tools → chat + 4× execute_tool + chat
  ├─ invoke_agent evaluate_constraints
  └─ invoke_agent finalize → chat gpt-4o

Verified on hosted sandbox with 5 invokes producing 19-31 spans each.

…ox suspend

BatchSpanProcessor exports spans on a background timer (default 5s).
In hosted sandbox environments the platform may suspend the process
immediately after an HTTP response is sent, before the timer fires.
This causes short-lived spans (e.g. LangGraph per-node invoke_agent
spans) to be lost.

Add flush_spans() to the core public API and call it from
trace_stream's finally block so the streaming path also flushes.

Verified: same agent produces 19-31 spans with flush vs 11 without.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions github-actions bot added the Hosted Agents sdk/agentserver/* label Apr 7, 2026
@ankitbko ankitbko merged commit c96b5fd into agentserver/invoke Apr 7, 2026
4 of 5 checks passed
@ankitbko ankitbko deleted the fix/agentserver-core-flush-spans branch April 7, 2026 16:18
ankitbko added a commit that referenced this pull request Apr 8, 2026
* Address pvaneck review comments from PR #45925

- Added docstring for TracingHelper.__init__ connection_string param
- Added enrichment processor dupe guard (_enrichment_configured flag)
- Fixed InMemorySpanExporter import path in test_tracing.py
- Fixed @app → @server variable name mismatch in test_tracing.py
- Updated invocations CHANGELOG with 2.0.0b1 + kept 1.0.0b1 history
- Fixed duplicate InvocationAgentServerHost imports in README
- Fixed README titles to match Verify Readmes pattern
- Fixed tracing tests to use TestClient and set APPLICATIONINSIGHTS env var
- Fixed test pollution: OTel provider reused across test modules
- Removed obsolete baggage test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix CHANGELOGs: add 1.0.0b1 history to core, keep invocations at 1.0.0b1

Core: added historical 1.0.0b1 entry below 2.0.0b1, removed stale
leaf_customer_span_id from features.

Invocations: reverted to single 1.0.0b1 entry (new package, no prior
releases). Updated feature list to reflect InvocationAgentServerHost.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Refactor TracingHelper class → module functions + host method

Replaced the TracingHelper class with:
- configure_tracing() — standalone function for exporter setup,
  overridable via AgentServerHost(configure_tracing=my_func) or
  disabled with configure_tracing=None
- request_span() — module-level context manager for span creation
- end_span/record_error/trace_stream — module-level lifecycle helpers
- AgentServerHost.request_span() — thin method that delegates with
  pre-populated host identity (agent_id, project_id)

Protocol SDKs now use self.request_span() instead of
self._tracing.request_span() with None checks. All functions are
no-ops when opentelemetry-api is not installed.

Removed TracingHelper from core __init__.py exports.
Agent identity (name/version/project_id) moved to AgentServerHost.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Update core sample to use new tracing API (self.request_span)

Removed TracingHelper import, contextlib.nullcontext pattern, and
None checks. Now uses self.request_span() and _tracing.record_error().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Remove get_logger() — use logging.getLogger directly

get_logger() was a one-liner wrapping logging.getLogger('azure.ai.agentserver').
Replaced all usages with direct logging.getLogger() calls and deleted _logger.py.
Removed get_logger from core __init__.py exports.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Make OTel a primary dependency, remove _HAS_OTEL guard

Moved opentelemetry-api, opentelemetry-sdk, opentelemetry-exporter-otlp,
and azure-monitor-opentelemetry-exporter from optional [tracing] extras
to primary dependencies. Removed _HAS_OTEL flag, try/except import
guard, and all conditional checks — OTel is always available.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Update sdk/agentserver/azure-ai-agentserver-invocations/tests/test_tracing.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update sdk/agentserver/azure-ai-agentserver-invocations/tests/test_tracing.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/_tracing.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* removed redundant packages

* Address PR review: expose tracing functions publicly, fix imports, update CHANGELOG

- Exported end_span, record_error, trace_stream from core __init__.py
  (no more importing internal _tracing module from other packages)
- Updated invocations to use public imports from core
- Updated selfhosted sample to use public record_error import
- Added None guard in _wrap_streaming_response for otel_span
- Fixed test docstring mismatch (tracing_disabled_by_default)
- Updated CHANGELOG to reflect TracingHelper → functions change
- Fixed get_logger import in githubcopilot adapter

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Spec compliance: platform header, tracer scope, error attrs, baggage, isolation headers, HTTP/2

- x-platform-server header now includes version and python runtime
- Instrumentation scope: Azure.AI.AgentServer (core), .Invocations (invocations)
- record_error() now sets error.type attribute per OTel semantic conventions
- baggage header included in W3C trace context extraction
- x-request-id propagated into span attributes
- Platform isolation headers (x-agent-user-isolation-key, x-agent-chat-isolation-key) exposed via request.state
- HTTP/2 disabled in Hypercorn config (spec requires HTTP/1.1 only)
- Fixed get_logger import in githubcopilot adapter

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Spec compliance: baggage keys, SSE keep-alive, structured logging, SIGTERM forwarding

- Re-added W3C baggage propagation for invocation_id/session_id
- Added SSE_KEEPALIVE_INTERVAL env var and resolve_sse_keepalive_interval()
- Added sse_keepalive_stream() as AgentServerHost static method (not in tracing)
- Added _InvocationLogFilter for structured log scope with InvocationId/SessionId
- Added SIGTERM handler in run() that logs and re-raises for Hypercorn
- Separated trace_stream (tracing concern) from sse_keepalive_stream (transport)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix PR review: contextvars for logging, SIGTERM restore, deduplicate log handler

- Replaced per-request logger.addFilter/removeFilter with contextvars
  (_invocation_id_var, _session_id_var) for concurrency-safe structured
  logging. Filter installed once at module level, reads from contextvars.
- SIGTERM handler now restored in finally block after run() exits.
- _setup_otlp_log_export only adds LoggingHandler when Azure Monitor
  handler is not already configured, preventing duplicate log emission.
- Fixed Black formatting in test_tracing.py dict literals.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Move azure-monitor-opentelemetry-exporter to optional dep

The CI dev-build tool (process_requires) rewrites all azure-* dependency
version specs for dev builds, transforming >=1.0.0b21 into >=1.0.0a1,<1.0.0b0
which is unresolvable. The exporter is imported lazily with try/except
in _tracing.py so it works correctly when installed separately.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Revert "Move azure-monitor-opentelemetry-exporter to optional dep"

This reverts commit 973a32b.

* Add azure-monitor-opentelemetry-exporter to CI Artifacts

The CI dev-build tool (process_requires) rewrites azure-* dependency
version specs. Adding the exporter to Artifacts ensures a compatible
dev build is published to the dev feed alongside agentserver packages.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix PR comments: log filter comment, tracing docstring, changelog baggage

- Updated _ensure_log_filter comment to say 'first request' not 'module load'
- Updated _tracing.py docstring: OTel is required, not optional
- Added W3C Baggage and structured logging to invocations CHANGELOG

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Revert "Add azure-monitor-opentelemetry-exporter to CI Artifacts"

This reverts commit 15aa0c5.

* Fix PR comments: keepalive shield, SIGTERM comment, docstring, thread-safe filter

- sse_keepalive_stream: use asyncio.shield to prevent cancelling upstream
  iterator on timeout. Reuse pending task across timeouts.
- SIGTERM handler comment: clarified it logs and re-raises, not forwards.
- request_span docstring: removed stale 'no-op when OTel not installed'.
- _ensure_log_filter: added threading.Lock for double-checked locking.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix: duplicate imports, sanitize IDs in get/cancel endpoints, copilot adapter logging

- Removed duplicate 'import logging' in _invocation.py
- Added _sanitize_id for invocation_id and session_id in _traced_invocation_endpoint
- Fixed _copilot_adapter.py: consolidated logging import, removed _logging alias

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Spec compliance: baggage propagator, x-request-id baggage, baggage-to-log processor

- Replaced TraceContextTextMapPropagator with CompositePropagator
  (TraceContextTextMapPropagator + W3CBaggagePropagator) to properly
  extract inbound baggage header into OTel context.
- x-request-id now set as both span attribute AND baggage entry for
  downstream propagation.
- Added _BaggageLogRecordProcessor that copies all W3C Baggage entries
  into every OTel log record's attributes for end-to-end correlation.
  Registered on both Azure Monitor and OTLP log providers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix import ordering: stdlib → third-party → local per ruff/isort

- _base.py: moved 'import sys' to stdlib group
- _tracing.py: moved opentelemetry imports to third-party group before constants
- _invocation.py: moved contextvars to stdlib group, opentelemetry to third-party, removed duplicate import

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix _BaggageLogRecordProcessor: rename emit → on_emit per OTel SDK API

The OTel SDK's LogRecordProcessor interface requires on_emit(),
not emit(). This caused AttributeError when the log handler tried
to process log records through the processor chain.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add AgentConfig dataclass on app.config, replace Constants

New frozen dataclass AgentConfig populated from env vars at init time:
  app.config.agent_name, .agent_version, .agent_id, .project_id,
  .project_endpoint, .session_id, .port, .appinsights_connection_string,
  .otlp_endpoint, .sse_keepalive_interval

- Replaced Constants class with private _ENV_* constants in _config.py
- AgentServerHost stores self.config = AgentConfig.from_env()
- Invocations uses self.config.session_id instead of os.environ.get()
- Exported AgentConfig from core __init__.py
- Updated all tests to use string literals for env var names

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Set service.name to agent_name, add operation_name to get/cancel spans

- service.name span attribute now uses agent_name (falls back to
  'azure.ai.agentserver' when agent_name is empty)
- _traced_invocation_endpoint now passes operation_name for
  get_invocation and cancel_invocation spans

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Suppress noisy Azure SDK and OTel exporter INFO logs

Set azure.core.pipeline.policies.http_logging_policy and
azure.monitor.opentelemetry.exporter loggers to WARNING by default
to avoid flooding stderr with HTTP request/response details and
exporter transmission status.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Set cloud_RoleName to agent name via OTel Resource service.name

The OTel Resource's service.name attribute maps to cloud_RoleName in
App Insights. Now set to FOUNDRY_AGENT_NAME (falls back to
'azure.ai.agentserver' when not set). This ensures both spans and
logs show the agent name as the cloud role in App Insights.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Use _config._ENV_FOUNDRY_AGENT_NAME instead of hardcoded string

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix(core): add flush_spans() to drain BatchSpanProcessor before sandbox suspend (#46181)

BatchSpanProcessor exports spans on a background timer (default 5s).
In hosted sandbox environments the platform may suspend the process
immediately after an HTTP response is sent, before the timer fires.
This causes short-lived spans (e.g. LangGraph per-node invoke_agent
spans) to be lost.

Add flush_spans() to the core public API and call it from
trace_stream's finally block so the streaming path also flushes.

Verified: same agent produces 19-31 spans with flush vs 11 without.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Update sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/_config.py

Co-authored-by: Johan Stenberg (MSFT) <johan.stenberg@microsoft.com>

* Update sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/_config.py

Co-authored-by: Johan Stenberg (MSFT) <johan.stenberg@microsoft.com>

* fix(agentserver-core): move agent identity attrs to _on_ending in FoundryEnrichmentSpanProcessor (#46186)

* fix: move agent identity attrs to _on_ending in FoundryEnrichmentSpanProcessor

Move gen_ai.agent.name, gen_ai.agent.version, and gen_ai.agent.id from
on_start to _on_ending so underlying frameworks (LangChain, Semantic
Kernel, etc.) cannot overwrite them.

Uses guarded direct _attributes access as a workaround for
opentelemetry-sdk <=1.40.0 spec-compliance gap where set_attribute()
is a no-op during _on_ending.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* fix: add shutdown() to _CollectorExporter for SDK compatibility

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Apply suggestion from @Copilot

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Remove redundant shutdown method definition

---------

Co-authored-by: Neehar Duvvuri <neeharduvvuri@Neehars-MacBook-Pro.local>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix: update type imports and add missing method return type annotation

* removed dataclass

* Fix cspell errors: coro→coroutine, reraises→re_raises, sess→session

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix mypy: add type ignore for _BaggageLogRecordProcessor arg-type

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix pylint docstrings: add types, keyword args, rtype, unused arg prefix

- request_span (both _base.py and _tracing.py): added :type:, :keyword:,
  :paramtype:, :rtype:, and missing instrumentation_scope doc
- _handle_sigterm: prefixed unused args with _ (_signum, _frame)
- sse_keepalive_stream: added :type: and :rtype:
- end_span, flush_spans, record_error, trace_stream: added :type:
- _BaggageLogRecordProcessor.on_emit: added :param log_data:

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Bump minimum opentelemetry version to 1.33.0

Ensures _on_ending span processor method and stable baggage/log APIs
are available. Avoids edge cases with older SDK versions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Bump opentelemetry dependencies to version 1.40.0

* Bump opentelemetry dependencies to version 1.40.0 in dev requirements

* Update sdk/agentserver/azure-ai-agentserver-core/azure/ai/agentserver/core/_base.py

Co-authored-by: Johan Stenberg (MSFT) <johan.stenberg@microsoft.com>

* updated codeowner

---------

Co-authored-by: Ankit Sinha <anksinha@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Nagkumar Arkalgud <nagkumar91@users.noreply.github.com>
Co-authored-by: Johan Stenberg (MSFT) <johan.stenberg@microsoft.com>
Co-authored-by: Neehar Duvvuri <40341266+needuv@users.noreply.github.com>
Co-authored-by: Neehar Duvvuri <neeharduvvuri@Neehars-MacBook-Pro.local>