
Clean up cli output, fix classification bug #118

Merged

krisztianfekete merged 1 commit into main from feature/enhance-perf-metrics-in-cli on Apr 14, 2026

Conversation

@krisztianfekete
Contributor

This PR cleans up the performance metrics output in the CLI so it actually tells you something useful instead of dumping statistically meaningless percentiles.

Previously, when you evaluated a trace with 3 LLM calls, the CLI showed p50/p95/p99, which is nonsense at that sample size (p95 is just the largest value). The output was also missing obvious stuff like how many LLM/tool calls happened, which model was used, and cache token info that was already being computed but never shown.

What changed:

  • Per-trace metrics now show min/median/max instead of p50/p95/p99 (honest stats for small N), plus model name, call counts, and cache tokens when present (see the sketch after this list)
  • The overall section across traces keeps p50/p95/p99 (where percentiles actually make sense with many traces) and adds aggregate counts and model list
  • Removed the confusing "Per LLM Call" token percentile line that nobody could act on
  • All existing JSON keys are preserved for backwards compatibility; new fields are additive
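
For the per-trace stats, the change roughly boils down to a helper like the one below. This is a minimal sketch, assuming latencies arrive as a plain list of floats; the actual _calc_summary_stats() in trace_metrics.py may use different key names and handle more fields.

```python
import statistics


def _calc_summary_stats(values: list[float]) -> dict:
    """Honest summary stats for small samples: count, min, median, max."""
    if not values:
        return {"count": 0, "min": None, "median": None, "max": None}
    return {
        "count": len(values),
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }
```

The legacy percentile keys stay in the JSON output alongside these, so existing consumers keep working.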

Plus a bug fix: while validating the output against real Strands traces, I found that is_llm_span() was matching on gen_ai.input.messages alone, which Strands attaches to event loop cycle and tool spans too. This inflated LLM counts (12 instead of 5) and hid tool calls entirely. Fixed by tightening the check to require gen_ai.request.model and reordering classify_span() to check for tool spans before LLM spans.
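
For reference, the shape of the fix as a sketch only: the real extraction.py may read span attributes differently, and is_tool_span() here is a hypothetical stand-in for however tool spans are actually detected.

```python
def is_llm_span(span) -> bool:
    # gen_ai.input.messages alone is not enough: Strands also attaches it to
    # event loop cycle and tool spans. Require the request model attribute too.
    return "gen_ai.request.model" in span.attributes


def is_tool_span(span) -> bool:
    # Hypothetical helper for this sketch; the real check may key off other attributes.
    return "gen_ai.tool.name" in span.attributes


def classify_span(span) -> str:
    # Check tool before LLM so tool spans that carry gen_ai.* attributes
    # are no longer misclassified as LLM calls.
    if is_tool_span(span):
        return "tool"
    if is_llm_span(span):
        return "llm"
    return "other"
```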


Copilot AI left a comment

Pull request overview

Updates the CLI’s performance-metrics presentation to be more informative for small per-trace sample sizes, while fixing a span-classification bug that was inflating LLM counts and masking tool calls for Strands-style traces.

Changes:

  • Per-trace metrics now report min/median/max (+count) and surface models, call counts, tool names, and cache token usage (while keeping legacy percentile keys for JSON compatibility).
  • Overall (cross-trace) metrics add aggregate counts/model list and keep p50/p95/p99 where percentiles are meaningful (sketched after this list).
  • Fixes is_llm_span() / classify_span() so tool spans aren’t misclassified as LLM spans.
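
The cross-trace percentile aggregation can be pictured roughly like this. It is a sketch under assumptions, not the actual runner.py code; the hypothetical overall_latency_percentiles() uses the stdlib statistics module, whereas the real implementation may compute percentiles differently.

```python
import statistics


def overall_latency_percentiles(per_trace_latencies: list[float]) -> dict:
    """p50/p95/p99 across traces, where the sample is big enough for percentiles to mean something."""
    if len(per_trace_latencies) < 2:
        # Too few traces for a distribution; fall back to the single value (or None).
        value = per_trace_latencies[0] if per_trace_latencies else None
        return {"p50": value, "p95": value, "p99": value}
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(per_trace_latencies, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```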

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Summary per file:

  • tests/test_trace_metrics.py: Adds coverage for summary-stats calculation, performance-metric extraction, and table/JSON output expectations.
  • tests/test_extraction.py: Updates tests to ensure gen_ai.input.messages alone no longer marks a span as LLM.
  • tests/test_api.py: Updates API test fixtures to include the new additive performance-metrics fields and summary stats.
  • src/agentevals/trace_metrics.py: Introduces _calc_summary_stats() and enriches extracted metrics with counts/models/tool names and cache tokens.
  • src/agentevals/runner.py: Aggregates overall performance metrics across traces (counts, models, cache tokens, per-trace latency percentiles).
  • src/agentevals/output.py: Refreshes the CLI table output to show model/call counts and min/median/max latencies, and improves token/cache display.
  • src/agentevals/extraction.py: Tightens LLM span detection to require gen_ai.request.model and classifies tool spans before LLM spans.


krisztianfekete merged commit 04eb6ee into main on Apr 14, 2026
8 checks passed
krisztianfekete deleted the feature/enhance-perf-metrics-in-cli branch on April 14, 2026 at 20:47