
fix: improve event type detection for orchestration spans (semantic-kernel)#289

Open
devin-ai-integration[bot] wants to merge 2 commits into federated-sdk-release-candidate from devin/1772446053-fix-sk-event-type-detection

Conversation


@devin-ai-integration devin-ai-integration bot commented Mar 2, 2026

fix: improve event type detection for orchestration spans

Summary

Fixes event type misclassification for semantic-kernel orchestration spans by adding a two-phase detection approach in _detect_from_span_name_dynamically:

  1. Phase 1 (new): Check for chain/orchestration patterns first — matches spans like AutoFunctionInvocationLoop and agent_runtime process GroupChatManagerActor_* as "chain".
  2. Phase 2: Check for model/LLM patterns — with "chat" replaced by more specific variants (chatcompletion, chat.completion, chat_completion) to prevent false positives.

Root cause: The standalone "chat" substring indicator matched "GroupChatManagerActor", causing agent runtime orchestration spans to be classified as "model". Similarly, AutoFunctionInvocationLoop had no matching pattern and fell through to the default "tool".
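The two phases can be sketched in Python as follows. This is a simplified illustration built only from the indicators quoted in this PR, not the actual implementation of _detect_from_span_name_dynamically; the real indicator lists are longer.

```python
def detect_event_type(span_name: str) -> str:
    """Simplified sketch of the two-phase detection described above."""
    name = span_name.lower()

    # Phase 1: chain/orchestration patterns run first, so orchestration
    # spans never reach the model-pattern check below.
    chain_indicators = ["invocationloop", "orchestrat", "agent_runtime",
                        "manageractor", "groupchat", "workflow", "pipeline"]
    if any(ind in name for ind in chain_indicators):
        return "chain"

    # Phase 2: model/LLM patterns, with the bare "chat" substring replaced
    # by more specific variants so "GroupChatManagerActor" no longer matches.
    llm_indicators = ["chatcompletion", "chat.completion", "chat_completion",
                      "llm", "model", "generate", "inference",
                      "openai", "anthropic", "generativeai"]
    if any(ind in name for ind in llm_indicators):
        return "model"

    return "tool"  # default fallback

print(detect_event_type("AutoFunctionInvocationLoop"))                     # chain
print(detect_event_type("agent_runtime process GroupChatManagerActor_x"))  # chain
print(detect_event_type("ChatCompletion"))                                 # model
```

Because Phase 1 returns before Phase 2 is consulted, restoring broad LLM indicators such as "model" is safe for orchestration span names.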

Updates since initial revision

  • Restored "model" to llm_indicators — it was accidentally removed in the first commit and has been added back. Phase 1 chain detection runs first, so orchestration spans won't false-positive on "model".
  • "google" remains intentionally removed: "generativeai" already covers Google AI spans, and standalone "google" is too broad (it would match e.g. "google.cloud.storage").

Validation Results (semantic-kernel v1.39.4)

Re-ran the integration test (CAPTURE_SPANS=true) against session 3436c49e-fd29-476e-be94-6ad21701b6cb (96 spans, 97 ingested events). Results:

Span Name Pattern                               Before   After    Status
AutoFunctionInvocationLoop (×4)                 "tool"   "chain"  Fixed
agent_runtime * GroupChatManagerActor_* (×6)    "model"  "chain"  Fixed
ChatCompletion (×25)                            "model"  "model"  No regression

Event type distribution: chain: 59, model: 25, tool: 12, session: 1

Note: Parent hierarchy (Issue #1) still flat — all 96 non-session events parented to session_id. This is expected because the server-side fix (hive-kube PR #2515) has not been deployed yet.

Review & Testing Checklist for Human

  • Chain indicators are broad substring matches: "selector", "pipeline", and "workflow" could cause false positives for legitimate tool/model spans in other frameworks (LangGraph, CrewAI, etc.). These run at higher priority than model detection. Consider whether any existing framework integration could be impacted.
  • "google" was removed from llm_indicators — verify this doesn't regress event type detection for Google-related spans that lack openinference.span.kind and don't contain "generativeai" in their name (e.g., hypothetical "google.vertex.predict" spans). Coverage depends on Google spans always routing through "generate", "completion", or "generativeai" patterns.
  • get_model_patterns() diverges from internal detection — the public function (lines 20-64) still lists "chat" and "google.generativeai" as patterns, but the internal _detect_from_span_name_dynamically now uses different indicators. Verify no external consumers depend on consistency between these.
  • No unit tests for chain detection phase — only validated via integration test on semantic-kernel spans. Edge cases (e.g., a span named "model_selector" or "data_pipeline") are untested.
  • Spot-check other framework integration tests — Run OpenAI, Anthropic, LangChain, etc. integration tests to verify event types remain correct.

Notes

  • Part of the semantic-kernel v1.39.4 tracing pipeline validation (paired with hive-kube PR #2515 for cross-batch parent resolution)
  • No unit tests added for the new chain detection phase — deferred to follow-up
  • Requested by: unknown ()
  • Devin session

- Add chain/orchestration detection patterns (Phase 1) before model
  detection (Phase 2) in _detect_from_span_name_dynamically
- Fixes AutoFunctionInvocationLoop being classified as 'tool' instead
  of 'chain' (Issue #3)
- Fixes agent_runtime GroupChatManagerActor being classified as 'model'
  due to overly broad 'chat' substring match (Issue #4)
- Replace standalone 'chat' indicator with specific patterns:
  'chatcompletion', 'chat.completion', 'chat_completion'
- Add chain indicators: invocationloop, orchestrat, workflow, pipeline,
  agent_runtime, manageractor, groupchat, roundrobin, selector, swarm

Co-Authored-By: unknown <>
@devin-ai-integration devin-ai-integration bot requested a review from a team March 2, 2026 10:11
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring


claude bot commented Mar 2, 2026

Claude finished @devin-ai-integration[bot]'s task in 2m 45s.


Review of PR #289 complete

  • Gather context and read changed files
  • Review the diff against base branch
  • Post inline code feedback (4 inline comments)
  • Post top-level review summary

Summary: The two-phase detection approach is a sound design for fixing the semantic-kernel misclassification. I flagged 4 issues:

  1. Regression risk: "model" and "google" were removed from llm_indicators unnecessarily — Phase 1 already prevents the false positives, so these should be restored
  2. Overly broad chain indicators: "selector", "pipeline", "workflow" could cause false positives in other frameworks
  3. Missing tests: No unit tests for the new chain detection phase
  4. API inconsistency: get_model_patterns() diverges from the internal detection logic


snyk-io bot commented Mar 2, 2026

Snyk checks have passed. No issues have been found so far.

Scanner                Critical  High  Medium  Low  Total (0)
Open Source Security   0         0     0       0    0 issues
Licenses               0         0     0       0    0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

The 'model' indicator is a valid LLM span name pattern (e.g., 'model.invoke')
and should not have been removed. The 'google' standalone indicator is
intentionally excluded since 'generativeai' already covers Google AI spans
without the false positive risk.

Co-Authored-By: unknown <>

github-actions bot commented Mar 2, 2026

📚 Documentation preview built — Download artifact

Review instructions & validation status

How to Review

  1. Download the artifact from the link above
  2. Extract the files
  3. Open index.html in your browser

Validation Status

  • API validation: ✅ Passed
  • Build process: ✅ Successful
  • Import tests: ✅ All imports working

# it causes false positives (e.g., "GroupChatManagerActor" is not an LLM call).
# Instead, we use more specific patterns like "chatcompletion" and "chat.completion".
llm_indicators = [
    "llm",
Contributor:

Bug: Removal of "model" is a regression risk

This indicator was removed without explanation. Spans like "model.invoke", "model.predict", or "model.run" will no longer be classified as "model" — they'll fall through to the "tool" default.

Note that get_model_patterns() (line 41) still lists "model." as a pattern, creating an internal inconsistency.

Since Phase 1 chain detection now runs first, orchestration spans won't false-positive on "model" anyway — so the original concern that motivated this removal doesn't apply. I'd recommend restoring "model" here.

Contributor Author:

Already restored "model" in commit d3eab04 — good catch. Since Phase 1 chain detection runs first, orchestration spans won't false-positive on "model" anyway.

    "generate",
    "inference",
    "openai",
    "anthropic",
Contributor:

Bug: Removal of "google" is a regression risk

Same concern as "model" above. While "google.generativeai" still matches via "generativeai", other Google-related spans (e.g., "google.ai.studio", "google.vertex") would no longer be detected as "model".

get_model_patterns() (line 35) still lists "google.generativeai" as a provider pattern, reinforcing the inconsistency.

Recommend restoring "google" here — Phase 1 chain detection already guards against orchestration false positives.

Contributor Author:

Intentional removal. "google" as a standalone indicator is very broad — it would match any span with "google" in the name regardless of whether it's an LLM call. The specific pattern "generativeai" already covers Google AI spans (e.g., google.generativeai.generate), and "google.vertex" would match via "generate" or "completion" in the actual API call span names. The risk of false positives (e.g., "google.cloud.storage") outweighs the benefit here.

Comment on lines +422 to +434

chain_indicators = [
    "invocationloop",
    "orchestrat",
    "workflow",
    "pipeline",
    "agent_runtime",
    "manageractor",
    "groupchat",
    "roundrobin",
    "selector",
    "swarm",
    "magentic",
]
Contributor:

Concern: Overly broad chain indicators may cause false positives

Several of these substrings are very generic and could match spans from other frameworks that are not orchestration:

  • "selector" — could match CSS selectors, model selectors, data selectors, etc. (e.g., a span named "feature_selector" or "model_selector" would be misclassified as "chain" instead of "tool" or "model")
  • "pipeline" — could match "data_pipeline", "ml_pipeline", "preprocessing_pipeline" which are often tool-like operations
  • "workflow" — could match "workflow_step" or custom tool spans

Consider using more specific patterns:

  • "selector" → "selectorstrategy" or "selectorspeaker" (semantic-kernel specific)
  • "pipeline" → "orchestration_pipeline" or scope it to known frameworks
  • Or add a comment documenting that these broad matches are intentional tradeoffs for the semantic-kernel use case

Contributor Author:

Fair point on breadth. These patterns are intentionally broad to cover orchestration spans across frameworks (semantic-kernel, autogen, magentic-one). In practice, "selector", "pipeline", and "workflow" are unlikely to appear in model/tool span names from LLM instrumentors — those tend to use specific API names. That said, this is a tradeoff worth revisiting if false positives surface in other framework integrations. Leaving as-is for now since the priority is fixing the semantic-kernel misclassification.
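The breadth concern is easy to demonstrate with a plain substring check over the indicator list quoted above; the non-orchestration span names below are hypothetical examples drawn from this review discussion, not observed spans:

```python
# Indicator list as quoted in the inline comment above.
chain_indicators = ["invocationloop", "orchestrat", "workflow", "pipeline",
                    "agent_runtime", "manageractor", "groupchat",
                    "roundrobin", "selector", "swarm", "magentic"]

def is_chain(span_name: str) -> bool:
    """Return True if any chain indicator appears in the span name."""
    name = span_name.lower()
    return any(ind in name for ind in chain_indicators)

# Orchestration span this PR targets: correctly matched.
print(is_chain("AutoFunctionInvocationLoop"))  # True

# Hypothetical non-orchestration spans that would also match,
# because "selector" and "pipeline" are broad substrings:
print(is_chain("model_selector"))              # True (would be misclassified)
print(is_chain("data_pipeline"))               # True (would be misclassified)
```

Since chain detection runs before model detection, any such match wins outright, which is why scoping these substrings matters.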

span_name_lower = span_name.lower()

# Dynamic LLM/Model detection patterns - more flexible matching
# Phase 1: Chain/orchestration detection (higher priority)
Contributor:

Nit: Missing test coverage for the new Phase 1 chain detection

No unit tests were added for the new chain detection behavior. At minimum, the existing TestEventTypeDetection class should be updated with:

  1. Tests for the specific semantic-kernel patterns this PR targets (AutoFunctionInvocationLoop, agent_runtime process GroupChatManagerActor_xyz)
  2. Tests verifying that chain indicators don't cause false positives for model spans (e.g., "ChatCompletion" should still be "model", not "chain")
  3. Regression tests for removed indicators ("model", "google", "chat") — or update existing tests if the removals are intentional

The existing test at line 173 (("model_inference", "model")) still passes only because "inference" matches — but a span like "model.invoke" would now fail silently.

Contributor Author:

Agreed that test coverage would be valuable. Deferring to a follow-up since this PR is focused on the semantic-kernel validation fix. The integration test re-run will serve as functional validation for the target patterns.
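A minimal sketch of what such follow-up tests could look like. The detector below is a placeholder standing in for the real _detect_from_span_name_dynamically (not shown on this page), using the indicator lists quoted in this PR; it exists only so the cases are runnable:

```python
# Placeholder detector: NOT the real implementation, just the two-phase
# logic described in this PR so the test cases below can run.
def detect_event_type(span_name: str) -> str:
    name = span_name.lower()
    chain = ["invocationloop", "orchestrat", "workflow", "pipeline",
             "agent_runtime", "manageractor", "groupchat",
             "roundrobin", "selector", "swarm", "magentic"]
    llm = ["chatcompletion", "chat.completion", "chat_completion",
           "llm", "model", "generate", "inference"]
    if any(i in name for i in chain):
        return "chain"
    if any(i in name for i in llm):
        return "model"
    return "tool"

CASES = [
    # Semantic-kernel patterns this PR targets
    ("AutoFunctionInvocationLoop", "chain"),
    ("agent_runtime process GroupChatManagerActor_xyz", "chain"),
    # Chain indicators must not shadow genuine model spans
    ("ChatCompletion", "model"),
    # Regression check for the restored "model" indicator
    ("model.invoke", "model"),
    # Documents the broad-indicator edge case flagged in review
    ("model_selector", "chain"),
]

def test_event_type_detection():
    for span_name, expected in CASES:
        assert detect_event_type(span_name) == expected, span_name

test_event_type_detection()
```

The "model_selector" case is included deliberately: it pins down the current broad-substring behavior so a future narrowing of the chain indicators shows up as an intentional test change.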


claude bot commented Mar 2, 2026

Review Summary

The two-phase detection approach is a sound design. Checking chain/orchestration patterns before model/LLM patterns is a clean way to prevent the "chat" substring false positive in GroupChatManagerActor. The "chat" → "chatcompletion" / "chat.completion" / "chat_completion" refinement is the right fix for the root cause.

However, there are issues that need to be addressed before merging:

  1. Regression risk from removed indicators: "model" and "google" were removed from llm_indicators without clear justification. Since Phase 1 chain detection now runs first, these removals are unnecessary — orchestration spans will be caught before they reach Phase 2. Restoring them avoids breaking spans like "model.invoke" or "google.vertex". See inline comments for details.

  2. Overly broad chain indicators: "selector", "pipeline", and "workflow" are very generic substrings that could match non-orchestration spans from other frameworks (LangChain, CrewAI, custom instrumentation). These should be scoped more narrowly or documented as intentional tradeoffs.

  3. No test coverage for new behavior — The chain detection phase is entirely untested. At minimum, tests should cover the semantic-kernel patterns this PR targets and verify no regressions for the removed indicators.

  4. get_model_patterns() inconsistency — The public function get_model_patterns() (lines 20-64, exported in __all__) still includes "chat", "model.", and "google.generativeai" but the internal detection function now diverges. Even though get_model_patterns() isn't called by the detection logic, this API inconsistency could confuse consumers.

Documentation

No public API signatures changed, so no doc updates are needed. The behavior change in event classification is internal.


github-actions bot commented Mar 2, 2026

📚 Documentation preview built — Download artifact

