fix: improve event type detection for orchestration spans (semantic-kernel) #289
Conversation
- Add chain/orchestration detection patterns (Phase 1) before model detection (Phase 2) in `_detect_from_span_name_dynamically`
- Fixes `AutoFunctionInvocationLoop` being classified as 'tool' instead of 'chain' (Issue #3)
- Fixes agent_runtime `GroupChatManagerActor` being classified as 'model' due to overly broad 'chat' substring match (Issue #4)
- Replace standalone 'chat' indicator with specific patterns: 'chatcompletion', 'chat.completion', 'chat_completion'
- Add chain indicators: invocationloop, orchestrat, workflow, pipeline, agent_runtime, manageractor, groupchat, roundrobin, selector, swarm

Co-Authored-By: unknown <>
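The two-phase ordering described in the commit message can be sketched as follows. This is a simplified illustration; the indicator lists are abridged from the diff in this PR, and `detect_event_type` is a stand-in name for the repo's internal `_detect_from_span_name_dynamically`:

```python
# Phase 1 indicators: chain/orchestration span names (checked first).
CHAIN_INDICATORS = [
    "invocationloop", "orchestrat", "workflow", "pipeline", "agent_runtime",
    "manageractor", "groupchat", "roundrobin", "selector", "swarm",
]

# Phase 2 indicators: model/LLM span names. Standalone "chat" is replaced
# with the specific completion-style patterns to avoid false positives.
LLM_INDICATORS = [
    "llm", "chatcompletion", "chat.completion", "chat_completion",
    "generate", "inference", "openai", "anthropic",
]

def detect_event_type(span_name: str) -> str:
    name = span_name.lower()
    # Phase 1: orchestration patterns take priority, so spans like
    # "GroupChatManagerActor" are not misread as model calls via "chat".
    if any(ind in name for ind in CHAIN_INDICATORS):
        return "chain"
    # Phase 2: model/LLM patterns.
    if any(ind in name for ind in LLM_INDICATORS):
        return "model"
    # Default: everything else is treated as a tool invocation.
    return "tool"
```

With this ordering, `AutoFunctionInvocationLoop` matches `invocationloop` in Phase 1 and never reaches the model check, while `ChatCompletion` still resolves to `"model"` via `chatcompletion`.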
Claude finished @devin-ai-integration[bot]'s task in 2m 45s — Review of PR #289 complete
Summary: The two-phase detection approach is a sound design for fixing the semantic-kernel misclassification. I flagged 4 issues:
✅ Snyk checks have passed. No issues have been found so far.
The 'model' indicator is a valid LLM span name pattern (e.g., 'model.invoke') and should not have been removed. The 'google' standalone indicator is intentionally excluded since 'generativeai' already covers Google AI spans without the false positive risk. Co-Authored-By: unknown <>
📚 Documentation preview built — Download artifact
```python
# it causes false positives (e.g., "GroupChatManagerActor" is not an LLM call).
# Instead, we use more specific patterns like "chatcompletion" and "chat.completion".
llm_indicators = [
    "llm",
```
Bug: Removal of "model" is a regression risk
This indicator was removed without explanation. Spans like "model.invoke", "model.predict", or "model.run" will no longer be classified as "model" — they'll fall through to the "tool" default.
Note that get_model_patterns() (line 41) still lists "model." as a pattern, creating an internal inconsistency.
Since Phase 1 chain detection now runs first, orchestration spans won't false-positive on "model" anyway — so the original concern that motivated this removal doesn't apply. I'd recommend restoring "model" here.
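The fall-through this comment describes can be reproduced with a simplified substring check. The `classify` helper and the indicator lists below are illustrative, not the repo's actual API:

```python
# Illustrative indicator lists: "model" is absent from the first and present
# in the second, mirroring the removal and the proposed restoration.
indicators_without_model = ["llm", "chatcompletion", "generate", "inference"]
indicators_with_model = indicators_without_model + ["model"]

def classify(span_name: str, indicators: list[str]) -> str:
    lowered = span_name.lower()
    return "model" if any(ind in lowered for ind in indicators) else "tool"

# "model.invoke" matches no other indicator, so it silently degrades:
print(classify("model.invoke", indicators_without_model))  # tool
print(classify("model.invoke", indicators_with_model))     # model

# "model_inference" still passes either way, but only via "inference":
print(classify("model_inference", indicators_without_model))  # model
```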
Already restored "model" in commit d3eab04 — good catch. Since Phase 1 chain detection runs first, orchestration spans won't false-positive on "model" anyway.
| "generate", | ||
| "inference", | ||
| "openai", | ||
| "anthropic", |
Bug: Removal of "google" is a regression risk
Same concern as "model" above. While "google.generativeai" still matches via "generativeai", other Google-related spans (e.g., "google.ai.studio", "google.vertex") would no longer be detected as "model".
get_model_patterns() (line 35) still lists "google.generativeai" as a provider pattern, reinforcing the inconsistency.
Recommend restoring "google" here — Phase 1 chain detection already guards against orchestration false positives.
Intentional removal. "google" as a standalone indicator is very broad — it would match any span with "google" in the name regardless of whether it's an LLM call. The specific pattern "generativeai" already covers Google AI spans (e.g., google.generativeai.generate), and "google.vertex" would match via "generate" or "completion" in the actual API call span names. The risk of false positives (e.g., "google.cloud.storage") outweighs the benefit here.
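The tradeoff argued in this reply can be checked directly with substring matching. This is a toy check, not the repo's code:

```python
def matches(span_name: str, indicator: str) -> bool:
    # Mirrors the detector's case-insensitive substring matching.
    return indicator in span_name.lower()

# Standalone "google" would flag non-LLM spans:
print(matches("google.cloud.storage.upload", "google"))        # True (false positive)

# The narrower "generativeai" still covers Google AI spans:
print(matches("google.generativeai.generate", "generativeai")) # True
print(matches("google.cloud.storage.upload", "generativeai"))  # False
```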
```python
chain_indicators = [
    "invocationloop",
    "orchestrat",
    "workflow",
    "pipeline",
    "agent_runtime",
    "manageractor",
    "groupchat",
    "roundrobin",
    "selector",
    "swarm",
    "magentic",
]
```
Concern: Overly broad chain indicators may cause false positives
Several of these substrings are very generic and could match spans from other frameworks that are not orchestration:
"selector"— could match CSS selectors, model selectors, data selectors, etc. (e.g., a span named"feature_selector"or"model_selector"would be misclassified as"chain"instead of"tool"or"model")"pipeline"— could match"data_pipeline","ml_pipeline","preprocessing_pipeline"which are often tool-like operations"workflow"— could match"workflow_step"or custom tool spans
Consider using more specific patterns:
"selector"→"selectorstrategy"or"selectorspeaker"(semantic-kernel specific)"pipeline"→"orchestration_pipeline"or scope it to known frameworks- Or add a comment documenting that these broad matches are intentional tradeoffs for the semantic-kernel use case
Fair point on breadth. These patterns are intentionally broad to cover orchestration spans across frameworks (semantic-kernel, autogen, magentic-one). In practice, "selector", "pipeline", and "workflow" are unlikely to appear in model/tool span names from LLM instrumentors — those tend to use specific API names. That said, this is a tradeoff worth revisiting if false positives surface in other framework integrations. Leaving as-is for now since the priority is fixing the semantic-kernel misclassification.
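Both sides of this exchange can be made concrete with a toy chain check. The indicator list is copied from the diff above; the `is_chain` helper is illustrative:

```python
chain_indicators = [
    "invocationloop", "orchestrat", "workflow", "pipeline", "agent_runtime",
    "manageractor", "groupchat", "roundrobin", "selector", "swarm", "magentic",
]

def is_chain(span_name: str) -> bool:
    lowered = span_name.lower()
    return any(ind in lowered for ind in chain_indicators)

# The reviewer's concern: generic names from other frameworks match.
print(is_chain("feature_selector"))   # True: would become "chain", not "tool"
print(is_chain("data_pipeline"))      # True

# The reply's counterpoint: typical LLM instrumentor span names do not.
print(is_chain("openai.chat.completion"))  # False
print(is_chain("ChatCompletion"))          # False
```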
```python
span_name_lower = span_name.lower()

# Dynamic LLM/Model detection patterns - more flexible matching
# Phase 1: Chain/orchestration detection (higher priority)
```
Nit: Missing test coverage for the new Phase 1 chain detection
No unit tests were added for the new chain detection behavior. At minimum, the existing TestEventTypeDetection class should be updated with:
- Tests for the specific semantic-kernel patterns this PR targets (`AutoFunctionInvocationLoop`, `agent_runtime process GroupChatManagerActor_xyz`)
- Tests verifying that chain indicators don't cause false positives for model spans (e.g., `"ChatCompletion"` should still be `"model"`, not `"chain"`)
- Regression tests for removed indicators (`"model"`, `"google"`, `"chat"`) — or update existing tests if the removals are intentional

The existing test at line 173 (`("model_inference", "model")`) still passes only because `"inference"` matches — but a span like `"model.invoke"` would now fail silently.
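A sketch of the suggested tests, runnable standalone. The stand-in detector below is an assumption so the sketch is self-contained; the real tests would instead import `_detect_from_span_name_dynamically` (or its public wrapper) from this repo:

```python
# Minimal stand-in detector so this sketch runs on its own; replace with an
# import of the repo's real detection function in the actual test file.
CHAIN_INDICATORS = ["invocationloop", "agent_runtime", "manageractor", "groupchat"]
MODEL_INDICATORS = ["llm", "chatcompletion", "model", "generate", "inference"]

def detect_event_type(span_name: str) -> str:
    lowered = span_name.lower()
    if any(ind in lowered for ind in CHAIN_INDICATORS):
        return "chain"
    if any(ind in lowered for ind in MODEL_INDICATORS):
        return "model"
    return "tool"

def test_event_type_detection():
    # Semantic-kernel patterns this PR targets:
    assert detect_event_type("AutoFunctionInvocationLoop") == "chain"
    assert detect_event_type("agent_runtime process GroupChatManagerActor_xyz") == "chain"
    # Chain indicators must not shadow genuine model spans:
    assert detect_event_type("ChatCompletion") == "model"
    # Regression guards for removed/restored indicators:
    assert detect_event_type("model.invoke") == "model"
    assert detect_event_type("model_inference") == "model"

test_event_type_detection()
```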
Agreed that test coverage would be valuable. Deferring to a follow-up since this PR is focused on the semantic-kernel validation fix. The integration test re-run will serve as functional validation for the target patterns.
Review Summary

The two-phase detection approach is a sound design. Checking chain/orchestration patterns before model/LLM patterns is a clean way to prevent the false positives this PR targets. However, there are issues that need to be addressed before merging:
Documentation

No public API signatures changed, so no doc updates are needed. The behavior change in event classification is internal.
fix: improve event type detection for orchestration spans
Summary
Fixes event type misclassification for semantic-kernel orchestration spans by adding a two-phase detection approach in `_detect_from_span_name_dynamically`:

- `AutoFunctionInvocationLoop` and `agent_runtime process GroupChatManagerActor_*` are now classified as `"chain"`.
- The standalone `"chat"` indicator is replaced by more specific variants (`chatcompletion`, `chat.completion`, `chat_completion`) to prevent false positives.

Root cause: The standalone `"chat"` substring indicator matched `"GroupChatManagerActor"`, causing agent runtime orchestration spans to be classified as `"model"`. Similarly, `AutoFunctionInvocationLoop` had no matching pattern and fell through to the default `"tool"`.

Updates since initial revision

- Restored `"model"` to `llm_indicators` — it was accidentally removed in the first commit and has been added back. Phase 1 chain detection runs first, so orchestration spans won't false-positive on `"model"`.
- `"google"` remains intentionally removed — `"generativeai"` already covers Google AI spans, and standalone `"google"` is too broad (would match e.g. `"google.cloud.storage"`).

Validation Results (semantic-kernel v1.39.4)
Re-ran the integration test (`CAPTURE_SPANS=true`) against session `3436c49e-fd29-476e-be94-6ad21701b6cb` (96 spans, 97 ingested events). Results:

| Span pattern | Before | After |
| --- | --- | --- |
| `AutoFunctionInvocationLoop` (×4) | `"tool"` | `"chain"` |
| `agent_runtime * GroupChatManagerActor_*` (×6) | `"model"` | `"chain"` |
| `ChatCompletion` (×25) | `"model"` | `"model"` |

Event type distribution: chain: 59, model: 25, tool: 12, session: 1
Note: Parent hierarchy (Issue #1) still flat — all 96 non-session events parented to session_id. This is expected because the server-side fix (hive-kube PR #2515) has not been deployed yet.
Review & Testing Checklist for Human
"selector","pipeline","workflow"could cause false positives for legitimate tool/model spans in other frameworks (LangGraph, CrewAI, etc.). These run at higher priority than model detection. Consider whether any existing framework integration could be impacted."google"was removed fromllm_indicators— verify this doesn't regress event type detection for Google-related spans that lackopeninference.span.kindand don't contain"generativeai"in their name (e.g., hypothetical"google.vertex.predict"spans). Coverage depends on Google spans always routing through"generate","completion", or"generativeai"patterns.get_model_patterns()diverges from internal detection — the public function (lines 20-64) still lists"chat"and"google.generativeai"as patterns, but the internal_detect_from_span_name_dynamicallynow uses different indicators. Verify no external consumers depend on consistency between these."model_selector"or"data_pipeline") are untested.Notes