
add per eval time metrics to CLI #75

Merged

krisztianfekete merged 2 commits into main from feature/add-meta-eval-time-metric on Mar 30, 2026

Conversation

@krisztianfekete
Contributor

This PR adds a per-metric evaluation timing metric so users can get an informed estimate of how long certain evals take (a rough sketch of how the timing is captured follows the example outputs).

Example outputs:

Invocations: 1
        Metric                       Score  Status      Per-Invocation  Time    Error
------  -------------------------  -------  --------  ----------------  ------  -------
[PASS]  tool_trajectory_avg_score   1       PASSED              1       3ms
[PASS]  bertscore                   0.8635  PASSED              0.8635  2.5s
[PASS]  response_similarity         0.8438  PASSED              0.8438  4.7s

  Performance Metrics:
    Overall Latency: p50=4164ms, p95=4164ms, p99=4164ms
    LLM Latency:     p50=2072ms, p95=2344ms, p99=2344ms
    Tool Latency:    p50=57ms, p95=57ms, p99=57ms
    Tokens: 3906 total (3776 prompt + 130 output)
    Per LLM Call:    p50=1953, p95=2073, p99=2073

Overall Performance:
  Total Tokens: 3906 (3776 prompt + 130 output)
  Avg per Trace: 3776 prompt, 130 output
{
  "traces": [
    {
      "trace_id": "3e289017fe03ffd7c4145316d2eb3d0d",
      "num_invocations": 1,
      "conversion_warnings": [],
      "metrics": [
        {
          "metric_name": "tool_trajectory_avg_score",
          "score": 1.0,
          "eval_status": "PASSED",
          "per_invocation_scores": [
            1.0
          ],
          "duration_ms": 4.154182999627665,
          "error": null,
          "details": {
            "comparisons": [
              {
                "invocation_id": "581f1448d659341f",
                "expected": [
                  {
                    "name": "helm_list_releases",
                    "args": {}
                  }
                ],
                "actual": [
                  {
                    "name": "helm_list_releases",
                    "args": {}
                  }
                ],
                "matched": true
              }
            ]
          }
        },
        {
          "metric_name": "bertscore",
          "score": 0.8635173602486865,
          "eval_status": "PASSED",
          "per_invocation_scores": [
            0.8635173602486865
          ],
          "duration_ms": 2438.765358994715,
          "error": null
        },
        {
          "metric_name": "response_similarity",
          "score": 0.8437956204379562,
          "eval_status": "PASSED",
          "per_invocation_scores": [
            0.8437956204379562
          ],
          "duration_ms": 2637.799336996977,
          "error": null,
          "details": {
            "openai_eval_id": "eval_69caa7f6c7048191813e3d60ca41ee5a",
            "openai_run_id": "evalrun_69caa7f71d9c8191a13270d96a1b433c",
            "evaluation_metric": "fuzzy_match",
            "result_counts": {
              "passed": 1,
              "failed": 0,
              "total": 1
            },
            "per_testing_criteria": [
              {
                "name": "response_similarity-f01e8335-34a6-4a8c-aac2-f93e8cca0b72",
                "passed": 1,
                "failed": 0
              }
            ]
          }
        }
      ],
      "performance_metrics": {
        "latency": {
          "overall": {
            "p50": 4163.849,
            "p95": 4163.849,
            "p99": 4163.849
          },
          "llm_calls": {
            "p50": 2071.5425,
            "p95": 2343.756,
            "p99": 2343.756
          },
          "tool_executions": {
            "p50": 57.091,
            "p95": 57.091,
            "p99": 57.091
          }
        },
        "tokens": {
          "total_prompt": 3776,
          "total_output": 130,
          "total": 3906,
          "per_llm_call": {
            "p50": 1953.0,
            "p95": 2073,
            "p99": 2073
          }
        }
      }
    }
  ],
  "errors": [],
  "performance_metrics": {
    "tokens": {
      "total_prompt": 3776,
      "total_output": 130,
      "total": 3906,
      "avg_per_trace": {
        "prompt": 3776.0,
        "output": 130.0
      }
    },
    "trace_count": 1
  }
}
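
For reference, the timing is captured with time.monotonic() around each metric evaluation and stored as duration_ms on the MetricResult. A minimal sketch of the idea, assuming a dataclass-style MetricResult and an illustrative run_metric helper (the metric.evaluate interface and the field names other than duration_ms are hypothetical):

import time
from dataclasses import dataclass


@dataclass
class MetricResult:
    metric_name: str
    score: float | None = None
    eval_status: str = "NOT_EVALUATED"
    duration_ms: float | None = None  # wall-clock evaluation time in milliseconds
    error: str | None = None


def run_metric(metric, invocations) -> MetricResult:
    # Evaluate a single metric and record how long the evaluation took.
    start = time.monotonic()
    try:
        result = metric.evaluate(invocations)  # hypothetical metric interface
    except Exception as exc:  # surface the failure without losing the timing
        result = MetricResult(metric_name=metric.name, eval_status="ERROR", error=str(exc))
    result.duration_ms = (time.monotonic() - start) * 1000.0
    return result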


Copilot AI left a comment


Pull request overview

Adds per-metric evaluation timing to the CLI output by capturing wall-clock duration for each metric evaluation and surfacing it in table/summary/JSON formats.

Changes:

  • Add duration_ms to MetricResult and record it for built-in + custom metric evaluation runs.
  • Add duration formatting and a new Time column / suffix in CLI output (table + summary + JSON).
  • Add/extend unit tests covering duration capture and output formatting.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Changed files:

  • tests/test_runner.py: asserts that metric results include a non-negative duration_ms.
  • tests/test_output.py: new tests for duration formatting and output inclusion across table/json/summary.
  • src/agentevals/runner.py: adds duration_ms to MetricResult and measures per-metric execution time via time.monotonic().
  • src/agentevals/output.py: formats durations and displays them in CLI outputs (table column + summary suffix + JSON field); see the sketch below.
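
The display rules for the Time column aren't spelled out in this summary; a plausible formatting helper, consistent with the example output above (the one-second threshold and the rounding are assumptions), could look like:

def format_duration(duration_ms: float | None) -> str:
    # Render a millisecond duration for the Time column: "3ms" below one second, "2.5s" at or above.
    if duration_ms is None:
        return ""
    if duration_ms < 1000:
        return f"{duration_ms:.0f}ms"
    return f"{duration_ms / 1000:.1f}s"

With a rule like this, the bertscore duration of 2438.77 ms would render as 2.4s in the table, while the raw duration_ms value stays unrounded in the JSON output.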


Comment thread on src/agentevals/output.py (outdated)
krisztianfekete merged commit 106d078 into main on Mar 30, 2026
4 checks passed
krisztianfekete deleted the feature/add-meta-eval-time-metric branch on March 30, 2026 at 16:57