
fix: improve event type detection for orchestration spans (semantic-kernel)#289

Open
devin-ai-integration[bot] wants to merge 2 commits into federated-sdk-release-candidate from devin/1772446053-fix-sk-event-type-detection

Conversation


@devin-ai-integration devin-ai-integration bot commented Mar 2, 2026

fix: improve event type detection for orchestration spans

Summary

Fixes event type misclassification for semantic-kernel orchestration spans by adding a two-phase detection approach in _detect_from_span_name_dynamically:

  1. Phase 1 (new): Check for chain/orchestration patterns first — matches spans like AutoFunctionInvocationLoop and agent_runtime process GroupChatManagerActor_* as "chain".
  2. Phase 2: Check for model/LLM patterns — with "chat" replaced by more specific variants (chatcompletion, chat.completion, chat_completion) to prevent false positives.

Root cause: The standalone "chat" substring indicator matched "GroupChatManagerActor", causing agent runtime orchestration spans to be classified as "model". Similarly, AutoFunctionInvocationLoop had no matching pattern and fell through to the default "tool".
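The two phases can be sketched in Python as follows. This is a simplified illustration built only from the indicators quoted in this PR, not the actual implementation of _detect_from_span_name_dynamically; the real indicator lists are longer.

```python
def detect_event_type(span_name: str) -> str:
    """Simplified sketch of the two-phase detection described above."""
    name = span_name.lower()

    # Phase 1: chain/orchestration patterns run first, so orchestration
    # spans never reach the model-pattern check below.
    chain_indicators = ["invocationloop", "orchestrat", "agent_runtime",
                        "manageractor", "groupchat", "workflow", "pipeline"]
    if any(ind in name for ind in chain_indicators):
        return "chain"

    # Phase 2: model/LLM patterns, with the bare "chat" substring replaced
    # by more specific variants so "GroupChatManagerActor" no longer matches.
    llm_indicators = ["chatcompletion", "chat.completion", "chat_completion",
                      "llm", "model", "generate", "inference",
                      "openai", "anthropic", "generativeai"]
    if any(ind in name for ind in llm_indicators):
        return "model"

    return "tool"  # default fallback

print(detect_event_type("AutoFunctionInvocationLoop"))                     # chain
print(detect_event_type("agent_runtime process GroupChatManagerActor_x"))  # chain
print(detect_event_type("ChatCompletion"))                                 # model
```

Because Phase 1 returns before Phase 2 is consulted, restoring broad LLM indicators such as "model" is safe for orchestration span names.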

Updates since initial revision

  • Restored "model" to llm_indicators — it was accidentally removed in the first commit and has been added back. Phase 1 chain detection runs first, so orchestration spans won't false-positive on "model".
  • "google" remains intentionally removed: "generativeai" already covers Google AI spans, and standalone "google" is too broad (it would match e.g. "google.cloud.storage").

Validation Results (semantic-kernel v1.39.4)

Re-ran the integration test (CAPTURE_SPANS=true) against session 3436c49e-fd29-476e-be94-6ad21701b6cb (96 spans, 97 ingested events). Results:

Span Name Pattern                               Before   After    Status
AutoFunctionInvocationLoop (×4)                 "tool"   "chain"  Fixed
agent_runtime * GroupChatManagerActor_* (×6)    "model"  "chain"  Fixed
ChatCompletion (×25)                            "model"  "model"  No regression

Event type distribution: chain: 59, model: 25, tool: 12, session: 1

Note: Parent hierarchy (Issue #1) still flat — all 96 non-session events parented to session_id. This is expected because the server-side fix (hive-kube PR #2515) has not been deployed yet.

Review & Testing Checklist for Human

  • Chain indicators are broad substring matches: "selector", "pipeline", and "workflow" could cause false positives for legitimate tool/model spans in other frameworks (LangGraph, CrewAI, etc.). These run at higher priority than model detection. Consider whether any existing framework integration could be impacted.
  • "google" was removed from llm_indicators — verify this doesn't regress event type detection for Google-related spans that lack openinference.span.kind and don't contain "generativeai" in their name (e.g., hypothetical "google.vertex.predict" spans). Coverage depends on Google spans always routing through "generate", "completion", or "generativeai" patterns.
  • get_model_patterns() diverges from internal detection — the public function (lines 20-64) still lists "chat" and "google.generativeai" as patterns, but the internal _detect_from_span_name_dynamically now uses different indicators. Verify no external consumers depend on consistency between these.
  • No unit tests for chain detection phase — only validated via integration test on semantic-kernel spans. Edge cases (e.g., a span named "model_selector" or "data_pipeline") are untested.
  • Spot-check other framework integration tests — Run OpenAI, Anthropic, LangChain, etc. integration tests to verify event types remain correct.

Notes

  • Part of the semantic-kernel v1.39.4 tracing pipeline validation (paired with hive-kube PR #2515 for cross-batch parent resolution)
  • No unit tests added for the new chain detection phase — deferred to follow-up
  • Requested by: unknown ()
  • Devin session

- Add chain/orchestration detection patterns (Phase 1) before model
  detection (Phase 2) in _detect_from_span_name_dynamically
- Fixes AutoFunctionInvocationLoop being classified as 'tool' instead
  of 'chain' (Issue #3)
- Fixes agent_runtime GroupChatManagerActor being classified as 'model'
  due to overly broad 'chat' substring match (Issue #4)
- Replace standalone 'chat' indicator with specific patterns:
  'chatcompletion', 'chat.completion', 'chat_completion'
- Add chain indicators: invocationloop, orchestrat, workflow, pipeline,
  agent_runtime, manageractor, groupchat, roundrobin, selector, swarm

Co-Authored-By: unknown <>
@devin-ai-integration devin-ai-integration bot requested a review from a team March 2, 2026 10:11
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring


claude bot commented Mar 2, 2026

Claude finished @devin-ai-integration[bot]'s task in 2m 45s.


Review of PR #289 complete

  • Gather context and read changed files
  • Review the diff against base branch
  • Post inline code feedback (4 inline comments)
  • Post top-level review summary

Summary: The two-phase detection approach is a sound design for fixing the semantic-kernel misclassification. I flagged 4 issues:

  1. Regression risk: "model" and "google" were removed from llm_indicators unnecessarily — Phase 1 already prevents the false positives, so these should be restored
  2. Overly broad chain indicators: "selector", "pipeline", "workflow" could cause false positives in other frameworks
  3. Missing tests: No unit tests for the new chain detection phase
  4. API inconsistency: get_model_patterns() diverges from the internal detection logic


snyk-io bot commented Mar 2, 2026

Snyk checks have passed. No issues have been found so far.

Scanner                Critical  High  Medium  Low  Total (0)
Open Source Security   0         0     0       0    0 issues
Licenses               0         0     0       0    0 issues

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

The 'model' indicator is a valid LLM span name pattern (e.g., 'model.invoke')
and should not have been removed. The 'google' standalone indicator is
intentionally excluded since 'generativeai' already covers Google AI spans
without the false positive risk.

Co-Authored-By: unknown <>

github-actions bot commented Mar 2, 2026

📚 Documentation preview built — Download artifact

Review instructions & validation status

How to Review

  1. Download the artifact from the link above
  2. Extract the files
  3. Open index.html in your browser

Validation Status

  • API validation: ✅ Passed
  • Build process: ✅ Successful
  • Import tests: ✅ All imports working

# it causes false positives (e.g., "GroupChatManagerActor" is not an LLM call).
# Instead, we use more specific patterns like "chatcompletion" and "chat.completion".
llm_indicators = [
    "llm",
Contributor:

Bug: Removal of "model" is a regression risk

This indicator was removed without explanation. Spans like "model.invoke", "model.predict", or "model.run" will no longer be classified as "model" — they'll fall through to the "tool" default.

Note that get_model_patterns() (line 41) still lists "model." as a pattern, creating an internal inconsistency.

Since Phase 1 chain detection now runs first, orchestration spans won't false-positive on "model" anyway — so the original concern that motivated this removal doesn't apply. I'd recommend restoring "model" here.

Contributor Author:

Already restored "model" in commit d3eab04 — good catch. Since Phase 1 chain detection runs first, orchestration spans won't false-positive on "model" anyway.

    "generate",
    "inference",
    "openai",
    "anthropic",
Contributor:

Bug: Removal of "google" is a regression risk

Same concern as "model" above. While "google.generativeai" still matches via "generativeai", other Google-related spans (e.g., "google.ai.studio", "google.vertex") would no longer be detected as "model".

get_model_patterns() (line 35) still lists "google.generativeai" as a provider pattern, reinforcing the inconsistency.

Recommend restoring "google" here — Phase 1 chain detection already guards against orchestration false positives.

Contributor Author:

Intentional removal. "google" as a standalone indicator is very broad — it would match any span with "google" in the name regardless of whether it's an LLM call. The specific pattern "generativeai" already covers Google AI spans (e.g., google.generativeai.generate), and "google.vertex" would match via "generate" or "completion" in the actual API call span names. The risk of false positives (e.g., "google.cloud.storage") outweighs the benefit here.

Comment on lines +422 to +434

chain_indicators = [
    "invocationloop",
    "orchestrat",
    "workflow",
    "pipeline",
    "agent_runtime",
    "manageractor",
    "groupchat",
    "roundrobin",
    "selector",
    "swarm",
    "magentic",
]
Contributor:

Concern: Overly broad chain indicators may cause false positives

Several of these substrings are very generic and could match spans from other frameworks that are not orchestration:

  • "selector" — could match CSS selectors, model selectors, data selectors, etc. (e.g., a span named "feature_selector" or "model_selector" would be misclassified as "chain" instead of "tool" or "model")
  • "pipeline" — could match "data_pipeline", "ml_pipeline", "preprocessing_pipeline" which are often tool-like operations
  • "workflow" — could match "workflow_step" or custom tool spans

Consider using more specific patterns:

  • "selector" → "selectorstrategy" or "selectorspeaker" (semantic-kernel specific)
  • "pipeline" → "orchestration_pipeline" or scope it to known frameworks
  • Or add a comment documenting that these broad matches are intentional tradeoffs for the semantic-kernel use case

Contributor Author:

Fair point on breadth. These patterns are intentionally broad to cover orchestration spans across frameworks (semantic-kernel, autogen, magentic-one). In practice, "selector", "pipeline", and "workflow" are unlikely to appear in model/tool span names from LLM instrumentors — those tend to use specific API names. That said, this is a tradeoff worth revisiting if false positives surface in other framework integrations. Leaving as-is for now since the priority is fixing the semantic-kernel misclassification.
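The breadth concern is easy to demonstrate with a plain substring check over the indicator list quoted above; the non-orchestration span names below are hypothetical examples drawn from this review discussion, not observed spans:

```python
# Indicator list as quoted in the inline comment above.
chain_indicators = ["invocationloop", "orchestrat", "workflow", "pipeline",
                    "agent_runtime", "manageractor", "groupchat",
                    "roundrobin", "selector", "swarm", "magentic"]

def is_chain(span_name: str) -> bool:
    """Return True if any chain indicator appears in the span name."""
    name = span_name.lower()
    return any(ind in name for ind in chain_indicators)

# Orchestration span this PR targets: correctly matched.
print(is_chain("AutoFunctionInvocationLoop"))  # True

# Hypothetical non-orchestration spans that would also match,
# because "selector" and "pipeline" are broad substrings:
print(is_chain("model_selector"))              # True (would be misclassified)
print(is_chain("data_pipeline"))               # True (would be misclassified)
```

Since chain detection runs before model detection, any such match wins outright, which is why scoping these substrings matters.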

span_name_lower = span_name.lower()

# Dynamic LLM/Model detection patterns - more flexible matching
# Phase 1: Chain/orchestration detection (higher priority)
Contributor:

Nit: Missing test coverage for the new Phase 1 chain detection

No unit tests were added for the new chain detection behavior. At minimum, the existing TestEventTypeDetection class should be updated with:

  1. Tests for the specific semantic-kernel patterns this PR targets (AutoFunctionInvocationLoop, agent_runtime process GroupChatManagerActor_xyz)
  2. Tests verifying that chain indicators don't cause false positives for model spans (e.g., "ChatCompletion" should still be "model", not "chain")
  3. Regression tests for removed indicators ("model", "google", "chat") — or update existing tests if the removals are intentional

The existing test at line 173 (("model_inference", "model")) still passes only because "inference" matches — but a span like "model.invoke" would now fail silently.

Contributor Author:

Agreed that test coverage would be valuable. Deferring to a follow-up since this PR is focused on the semantic-kernel validation fix. The integration test re-run will serve as functional validation for the target patterns.
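A minimal sketch of what such follow-up tests could look like. The detector below is a placeholder standing in for the real _detect_from_span_name_dynamically (not shown on this page), using the indicator lists quoted in this PR; it exists only so the cases are runnable:

```python
# Placeholder detector: NOT the real implementation, just the two-phase
# logic described in this PR so the test cases below can run.
def detect_event_type(span_name: str) -> str:
    name = span_name.lower()
    chain = ["invocationloop", "orchestrat", "workflow", "pipeline",
             "agent_runtime", "manageractor", "groupchat",
             "roundrobin", "selector", "swarm", "magentic"]
    llm = ["chatcompletion", "chat.completion", "chat_completion",
           "llm", "model", "generate", "inference"]
    if any(i in name for i in chain):
        return "chain"
    if any(i in name for i in llm):
        return "model"
    return "tool"

CASES = [
    # Semantic-kernel patterns this PR targets
    ("AutoFunctionInvocationLoop", "chain"),
    ("agent_runtime process GroupChatManagerActor_xyz", "chain"),
    # Chain indicators must not shadow genuine model spans
    ("ChatCompletion", "model"),
    # Regression check for the restored "model" indicator
    ("model.invoke", "model"),
    # Documents the broad-indicator edge case flagged in review
    ("model_selector", "chain"),
]

def test_event_type_detection():
    for span_name, expected in CASES:
        assert detect_event_type(span_name) == expected, span_name

test_event_type_detection()
```

The "model_selector" case is included deliberately: it pins down the current broad-substring behavior so a future narrowing of the chain indicators shows up as an intentional test change.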


claude bot commented Mar 2, 2026

Review Summary

The two-phase detection approach is a sound design. Checking chain/orchestration patterns before model/LLM patterns is a clean way to prevent the "chat" substring false positive in GroupChatManagerActor. The "chat" → "chatcompletion" / "chat.completion" / "chat_completion" refinement is the right fix for the root cause.

However, there are issues that need to be addressed before merging:

  1. Regression risk from removed indicators: "model" and "google" were removed from llm_indicators without clear justification. Since Phase 1 chain detection now runs first, these removals are unnecessary — orchestration spans will be caught before they reach Phase 2. Restoring them avoids breaking spans like "model.invoke" or "google.vertex". See inline comments for details.

  2. Overly broad chain indicators: "selector", "pipeline", and "workflow" are very generic substrings that could match non-orchestration spans from other frameworks (LangChain, CrewAI, custom instrumentation). These should be scoped more narrowly or documented as intentional tradeoffs.

  3. No test coverage for new behavior — The chain detection phase is entirely untested. At minimum, tests should cover the semantic-kernel patterns this PR targets and verify no regressions for the removed indicators.

  4. get_model_patterns() inconsistency — The public function get_model_patterns() (lines 20-64, exported in __all__) still includes "chat", "model.", and "google.generativeai" but the internal detection function now diverges. Even though get_model_patterns() isn't called by the detection logic, this API inconsistency could confuse consumers.

Documentation

No public API signatures changed, so no doc updates are needed. The behavior change in event classification is internal.


github-actions bot commented Mar 2, 2026

📚 Documentation preview built — Download artifact

