add per eval time metrics to CLI #75
Merged
krisztianfekete merged 2 commits into main on Mar 30, 2026
Pull request overview
Adds per-metric evaluation timing to the CLI output by capturing wall-clock duration for each metric evaluation and surfacing it in table/summary/JSON formats.
Changes:
- Add `duration_ms` to `MetricResult` and record it for built-in + custom metric evaluation runs.
- Add duration formatting and a new `Time` column / suffix in CLI output (table + summary + JSON).
- Add/extend unit tests covering duration capture and output formatting.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/test_runner.py | Asserts that metric results include non-negative `duration_ms`. |
| tests/test_output.py | New tests for duration formatting and output inclusion across table/json/summary. |
| src/agentevals/runner.py | Adds `duration_ms` to `MetricResult` and measures per-metric execution time via `time.monotonic()`. |
| src/agentevals/output.py | Formats durations and displays them in CLI outputs (table column + summary suffix + JSON field). |
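The runner.py change is the core of the PR; below is a minimal sketch of capturing per-metric timing with `time.monotonic()`, assuming a simplified `MetricResult` (the real dataclass carries more fields such as `per_invocation_scores` and `details`, and `evaluate_metric` plus the `metric.evaluate(trace)` interface are hypothetical names used only for illustration):

```python
import time
from dataclasses import dataclass

# Simplified stand-in: the real MetricResult in src/agentevals/runner.py
# has more fields; duration_ms is the field this PR adds.
@dataclass
class MetricResult:
    metric_name: str
    score: float | None = None
    eval_status: str = "PASSED"
    error: str | None = None
    duration_ms: float = 0.0

def evaluate_metric(metric, trace) -> MetricResult:
    # time.monotonic() is unaffected by system clock adjustments, so the
    # delta is a trustworthy wall-clock duration for this evaluation.
    start = time.monotonic()
    try:
        result = MetricResult(metric.name, score=metric.evaluate(trace))
    except Exception as exc:  # a failed eval still gets a duration recorded
        result = MetricResult(metric.name, eval_status="FAILED", error=str(exc))
    result.duration_ms = (time.monotonic() - start) * 1000.0
    return result
```

Because `time.monotonic()` never goes backwards, the measured delta is guaranteed non-negative, which is what the assertion in tests/test_runner.py relies on.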
This PR adds per-eval timing metrics, so users can make an educated guess about how much time certain evals take.
Example outputs:
```
Invocations: 1

        Metric                     Score    Status    Per-Invocation    Time    Error
------  -------------------------  -------  --------  ----------------  ------  -------
[PASS]  tool_trajectory_avg_score  1        PASSED    1                 3ms
[PASS]  bertscore                  0.8635   PASSED    0.8635            2.5s
[PASS]  response_similarity        0.8438   PASSED    0.8438            4.7s

Performance Metrics:
  Overall Latency: p50=4164ms, p95=4164ms, p99=4164ms
  LLM Latency: p50=2072ms, p95=2344ms, p99=2344ms
  Tool Latency: p50=57ms, p95=57ms, p99=57ms
  Tokens: 3906 total (3776 prompt + 130 output)
    Per LLM Call: p50=1953, p95=2073, p99=2073

Overall Performance:
  Total Tokens: 3906 (3776 prompt + 130 output)
  Avg per Trace: 3776 prompt, 130 output
```

```json
{
  "traces": [
    {
      "trace_id": "3e289017fe03ffd7c4145316d2eb3d0d",
      "num_invocations": 1,
      "conversion_warnings": [],
      "metrics": [
        {
          "metric_name": "tool_trajectory_avg_score",
          "score": 1.0,
          "eval_status": "PASSED",
          "per_invocation_scores": [1.0],
          "duration_ms": 4.154182999627665,
          "error": null,
          "details": {
            "comparisons": [
              {
                "invocation_id": "581f1448d659341f",
                "expected": [{"name": "helm_list_releases", "args": {}}],
                "actual": [{"name": "helm_list_releases", "args": {}}],
                "matched": true
              }
            ]
          }
        },
        {
          "metric_name": "bertscore",
          "score": 0.8635173602486865,
          "eval_status": "PASSED",
          "per_invocation_scores": [0.8635173602486865],
          "duration_ms": 2438.765358994715,
          "error": null
        },
        {
          "metric_name": "response_similarity",
          "score": 0.8437956204379562,
          "eval_status": "PASSED",
          "per_invocation_scores": [0.8437956204379562],
          "duration_ms": 2637.799336996977,
          "error": null,
          "details": {
            "openai_eval_id": "eval_69caa7f6c7048191813e3d60ca41ee5a",
            "openai_run_id": "evalrun_69caa7f71d9c8191a13270d96a1b433c",
            "evaluation_metric": "fuzzy_match",
            "result_counts": {"passed": 1, "failed": 0, "total": 1},
            "per_testing_criteria": [
              {
                "name": "response_similarity-f01e8335-34a6-4a8c-aac2-f93e8cca0b72",
                "passed": 1,
                "failed": 0
              }
            ]
          }
        }
      ],
      "performance_metrics": {
        "latency": {
          "overall": {"p50": 4163.849, "p95": 4163.849, "p99": 4163.849},
          "llm_calls": {"p50": 2071.5425, "p95": 2343.756, "p99": 2343.756},
          "tool_executions": {"p50": 57.091, "p95": 57.091, "p99": 57.091}
        },
        "tokens": {
          "total_prompt": 3776,
          "total_output": 130,
          "total": 3906,
          "per_llm_call": {"p50": 1953.0, "p95": 2073, "p99": 2073}
        }
      }
    }
  ],
  "errors": [],
  "performance_metrics": {
    "tokens": {
      "total_prompt": 3776,
      "total_output": 130,
      "total": 3906,
      "avg_per_trace": {"prompt": 3776.0, "output": 130.0}
    },
    "trace_count": 1
  }
}
```
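The `Time` values in the table output (3ms, 2.5s, 4.7s) suggest a threshold-based formatter: whole milliseconds below one second, seconds with one decimal place above it. A minimal sketch of such a helper, assuming the actual formatter in src/agentevals/output.py follows the same rule (`format_duration` is a hypothetical name):

```python
def format_duration(duration_ms: float) -> str:
    # Whole milliseconds under one second ("3ms"); seconds with one
    # decimal place at or above it ("2.5s").
    if duration_ms < 1000:
        return f"{duration_ms:.0f}ms"
    return f"{duration_ms / 1000:.1f}s"
```

Under this rule, the `duration_ms` values in the JSON above would render as 4ms, 2.4s, and 2.6s; the table and the JSON were presumably captured from separate runs, hence the slightly different numbers.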