Clean up CLI output, fix classification bug #118
Merged
krisztianfekete merged 1 commit into main on Apr 14, 2026
Conversation
Pull request overview
Updates the CLI’s performance-metrics presentation to be more informative for small per-trace sample sizes, while fixing a span-classification bug that was inflating LLM counts and masking tool calls for Strands-style traces.
Changes:
- Per-trace metrics now report min/median/max (+count) and surface models, call counts, tool names, and cache token usage (while keeping legacy percentile keys for JSON compatibility).
- Overall (cross-trace) metrics add aggregate counts/model list and keep p50/p95/p99 where percentiles are meaningful.
- Fixes is_llm_span()/classify_span() so tool spans aren't misclassified as LLM spans.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_trace_metrics.py | Adds coverage for summary-stats calculation, performance-metric extraction, and table/JSON output expectations. |
| tests/test_extraction.py | Updates tests to ensure gen_ai.input.messages alone no longer marks a span as LLM. |
| tests/test_api.py | Updates API test fixtures to include new additive performance-metrics fields and summary stats. |
| src/agentevals/trace_metrics.py | Introduces _calc_summary_stats() and enriches extracted metrics with counts/models/tool names and cache tokens. |
| src/agentevals/runner.py | Aggregates overall performance metrics across traces (counts, models, cache tokens, per-trace latency percentiles). |
| src/agentevals/output.py | Refreshes CLI table output to show model/call counts and min/median/max latencies, and improves token/cache display. |
| src/agentevals/extraction.py | Tightens LLM span detection to require gen_ai.request.model and classifies tool spans before LLM spans. |
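The runner.py row describes folding per-trace metrics into one cross-trace summary. A rough sketch of that aggregation — the dictionary keys and function name here are hypothetical, not the project's actual API:

```python
from collections import Counter

def aggregate_overall(per_trace_metrics):
    """Fold per-trace performance metrics into one cross-trace summary.

    Assumed per-trace shape: 'llm_calls', 'tool_calls', 'models' (list),
    and 'cache_read_tokens' — names chosen for illustration only.
    """
    overall = {"llm_calls": 0, "tool_calls": 0, "cache_read_tokens": 0}
    models = Counter()
    for m in per_trace_metrics:
        overall["llm_calls"] += m.get("llm_calls", 0)
        overall["tool_calls"] += m.get("tool_calls", 0)
        overall["cache_read_tokens"] += m.get("cache_read_tokens", 0)
        models.update(m.get("models", []))
    overall["models"] = sorted(models)  # aggregate, de-duplicated model list
    return overall
```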
This PR cleans up the performance metrics output in the CLI so it actually tells you something useful instead of dumping statistically meaningless percentiles.
Previously, when you evaluated a trace with 3 LLM calls, showing p50/p95/p99 was nonsense (p95 is just the biggest number). The output was also missing obvious stuff like how many LLM/tool calls happened, which model was used, and cache token info that was already being computed but never shown.
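To see why high percentiles are meaningless at n=3, here's a tiny nearest-rank percentile demo (not the project's implementation):

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [120, 450, 980]  # three LLM calls in one trace
print(percentile(latencies_ms, 50))  # 450
print(percentile(latencies_ms, 95))  # 980 -- just the max
print(percentile(latencies_ms, 99))  # 980 -- also just the max
```

With three samples, every percentile above ~67 lands on the largest value, so p95 and p99 carry no information that max doesn't already.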
What changed:
Plus a bug fix: while validating the output against real Strands traces, I found that is_llm_span() was matching on gen_ai.input.messages alone, which Strands attaches to event loop cycles and tool spans too. This inflated LLM counts (12 instead of 5) and hid tool calls entirely. Fixed by tightening the check to require gen_ai.request.model and reordering classify_span() to check tool before LLM.
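The shape of the fix, sketched under assumptions: gen_ai.request.model and gen_ai.input.messages are the attributes named in this PR, but the tool-detection condition and the attrs-dict interface are illustrative, not the project's actual code.

```python
def is_llm_span(attrs):
    # gen_ai.input.messages alone is not enough: Strands attaches it to
    # event-loop-cycle and tool spans too. Require the request model.
    return "gen_ai.request.model" in attrs

def classify_span(attrs):
    # Check tool before LLM, so a tool span that also carries LLM-ish
    # attributes is not swallowed by the LLM branch.
    if attrs.get("gen_ai.operation.name") == "execute_tool":  # illustrative check
        return "tool"
    if is_llm_span(attrs):
        return "llm"
    return "other"
```

The ordering matters because the two predicates are not mutually exclusive on Strands traces; classifying tool first makes the LLM branch a fallback rather than a greedy default.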