Clean up CLI output, fix classification bug #118
Merged
krisztianfekete merged 1 commit into main on Apr 14, 2026
Conversation
Pull request overview
Updates the CLI’s performance-metrics presentation to be more informative for small per-trace sample sizes, while fixing a span-classification bug that was inflating LLM counts and masking tool calls for Strands-style traces.
Changes:
- Per-trace metrics now report min/median/max (+count) and surface models, call counts, tool names, and cache token usage (while keeping legacy percentile keys for JSON compatibility).
- Overall (cross-trace) metrics add aggregate counts/model list and keep p50/p95/p99 where percentiles are meaningful.
- Fixes is_llm_span()/classify_span() so tool spans aren't misclassified as LLM spans.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_trace_metrics.py | Adds coverage for summary-stats calculation, performance-metric extraction, and table/JSON output expectations. |
| tests/test_extraction.py | Updates tests to ensure gen_ai.input.messages alone no longer marks a span as LLM. |
| tests/test_api.py | Updates API test fixtures to include new additive performance-metrics fields and summary stats. |
| src/agentevals/trace_metrics.py | Introduces _calc_summary_stats() and enriches extracted metrics with counts/models/tool names and cache tokens. |
| src/agentevals/runner.py | Aggregates overall performance metrics across traces (counts, models, cache tokens, per-trace latency percentiles). |
| src/agentevals/output.py | Refreshes CLI table output to show model/call counts and min/median/max latencies, and improves token/cache display. |
| src/agentevals/extraction.py | Tightens LLM span detection to require gen_ai.request.model and classifies tool spans before LLM spans. |
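The runner.py row describes folding per-trace metrics into one cross-trace summary. A rough sketch of that aggregation — the dictionary keys and function name here are hypothetical, not the project's actual API:

```python
from collections import Counter

def aggregate_overall(per_trace_metrics):
    """Fold per-trace performance metrics into one cross-trace summary.

    Assumed per-trace shape: 'llm_calls', 'tool_calls', 'models' (list),
    and 'cache_read_tokens' — names chosen for illustration only.
    """
    overall = {"llm_calls": 0, "tool_calls": 0, "cache_read_tokens": 0}
    models = Counter()
    for m in per_trace_metrics:
        overall["llm_calls"] += m.get("llm_calls", 0)
        overall["tool_calls"] += m.get("tool_calls", 0)
        overall["cache_read_tokens"] += m.get("cache_read_tokens", 0)
        models.update(m.get("models", []))
    overall["models"] = sorted(models)  # aggregate, de-duplicated model list
    return overall
```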
This PR cleans up the performance metrics output in the CLI so it actually tells you something useful instead of dumping statistically meaningless percentiles.
Previously, when you evaluated a trace with 3 LLM calls, showing p50/p95/p99 was nonsense (p95 is just the biggest number). The output was also missing obvious stuff like how many LLM/tool calls happened, which model was used, and cache token info that was already being computed but never shown.
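To see why high percentiles are meaningless at n=3, here's a tiny nearest-rank percentile demo (not the project's implementation):

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [120, 450, 980]  # three LLM calls in one trace
print(percentile(latencies_ms, 50))  # 450
print(percentile(latencies_ms, 95))  # 980 -- just the max
print(percentile(latencies_ms, 99))  # 980 -- also just the max
```

With three samples, every percentile above ~67 lands on the largest value, so p95 and p99 carry no information that max doesn't already.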
What changed:
Plus a bug fix: while validating the output against real Strands traces, I found that is_llm_span() was matching on gen_ai.input.messages alone, which Strands attaches to event loop cycles and tool spans too. This inflated LLM counts (12 instead of 5) and hid tool calls entirely. Fixed by tightening the check to require gen_ai.request.model and reordering classify_span() to check tool before LLM.
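The shape of the fix, sketched under assumptions: gen_ai.request.model and gen_ai.input.messages are the attributes named in this PR, but the tool-detection condition and the attrs-dict interface are illustrative, not the project's actual code.

```python
def is_llm_span(attrs):
    # gen_ai.input.messages alone is not enough: Strands attaches it to
    # event-loop-cycle and tool spans too. Require the request model.
    return "gen_ai.request.model" in attrs

def classify_span(attrs):
    # Check tool before LLM, so a tool span that also carries LLM-ish
    # attributes is not swallowed by the LLM branch.
    if attrs.get("gen_ai.operation.name") == "execute_tool":  # illustrative check
        return "tool"
    if is_llm_span(attrs):
        return "llm"
    return "other"
```

The ordering matters because the two predicates are not mutually exclusive on Strands traces; classifying tool first makes the LLM branch a fallback rather than a greedy default.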