Testbench: Framework-Independent Agent Evaluation Platform #26

@fmallmann

Description

Goal: Transform the Testbench from a RAGAS-coupled evaluation tool into a framework-agnostic, fully observable, Kubernetes-native agent evaluation platform.

Description:
The Testbench currently has deep coupling to RAGAS: flat JSONL data format tied to RAGAS's LocalJSONLBackend, a MetricsRegistry hardcoded to ragas.metrics.BaseMetric, and CLI arguments designed around RAGAS conventions. This story replaces all RAGAS-specific abstractions with framework-agnostic alternatives and delivers the infrastructure and documentation needed for broad adoption across the Agentic Layer platform.


Key Deliverables

1. Hierarchical JSON Data Model

Replace flat RAGAS JSONL with a three-level hierarchy validated against formal JSON schemas at each pipeline phase:

Experiment → Scenarios → Steps

Three schema phases:

  • experiment.schema.json — User input (test definitions with metrics config)
  • executed_experiment.schema.json — After agent execution (adds IDs, trace_id, turns)
  • evaluated_experiment.schema.json — After metric evaluation (adds evaluations with nested { metric, result })
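To make the phase progression concrete, a fully evaluated step might look roughly like this (illustrative values only; the authoritative shape is `evaluated_experiment.schema.json`). `metrics[]` comes from the user-authored experiment, `id` and `turns` are added after execution, and `evaluations[]` after evaluation:

```json
{
  "id": "step-3f9c",
  "input": "What is the refund policy?",
  "metrics": [{ "metric_name": "answer_relevancy", "threshold": 0.7 }],
  "turns": [
    { "type": "human", "content": "What is the refund policy?" },
    { "type": "agent", "content": "Refunds are issued within 14 days." }
  ],
  "evaluations": [
    {
      "metric": { "metric_name": "answer_relevancy", "threshold": 0.7 },
      "result": { "score": 0.82 }
    }
  ]
}
```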

Pydantic model hierarchy:

Step(input, reference, custom_values, metrics)
  └→ ExecutedStep(+id, turns)
       └→ EvaluatedStep(+evaluations: list[Evaluation])

Scenario(name, steps, evaluations)
  └→ ExecutedScenario(+id, trace_id)
       └→ EvaluatedScenario(+evaluations)

Experiment(llm_as_a_judge_model, default_threshold, scenarios)
  └→ ExecutedExperiment(+id)
       └→ EvaluatedExperiment

Key model details:

  • ToolCall — Unified model with name and args fields, shared by both Reference and Turn tool_calls
  • Turn — Conversation turn with type enum: "human" | "agent" | "tool"
  • Metric — Config object with metric_name, threshold, parameters
  • Evaluation — Wraps metric: Metric + result: Result (nested structure, not flat)
  • Step-level metric configs use metrics[]; evaluation results (post-evaluate) use evaluations[]
  • Scenario-level evaluations use evaluations[] at both config and result stages

Concept doc: concepts/data_structur_concept.md
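A minimal, dependency-free sketch of the step-level part of this hierarchy (the real implementation uses Pydantic; stdlib dataclasses are used here for illustration, and exact field types are assumptions — field names follow the issue text):

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class TurnType(str, Enum):
    # Turn.type enum per the spec: "human" | "agent" | "tool" (not "ai")
    HUMAN = "human"
    AGENT = "agent"
    TOOL = "tool"


@dataclass
class ToolCall:
    # Unified model shared by Reference and Turn tool_calls; "args", not "arguments"
    name: str
    args: dict[str, Any] = field(default_factory=dict)


@dataclass
class Turn:
    type: TurnType
    content: str
    tool_calls: list[ToolCall] = field(default_factory=list)


@dataclass
class Metric:
    # Step/scenario metric configuration
    metric_name: str
    threshold: float | None = None
    parameters: dict[str, Any] = field(default_factory=dict)


@dataclass
class Result:
    score: float


@dataclass
class Evaluation:
    # Nested { metric, result } structure, not flat
    metric: Metric
    result: Result


@dataclass
class Step:
    input: str
    reference: str | None = None
    custom_values: dict[str, Any] = field(default_factory=dict)
    metrics: list[Metric] = field(default_factory=list)


@dataclass
class ExecutedStep(Step):
    # Added after agent execution
    id: str = ""
    turns: list[Turn] = field(default_factory=list)


@dataclass
class EvaluatedStep(ExecutedStep):
    # Added after metric evaluation
    evaluations: list[Evaluation] = field(default_factory=list)
```

Scenario and Experiment follow the same extend-by-inheritance pattern, so each pipeline phase consumes the previous phase's type and only adds fields.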

2. GenericMetricsRegistry with FrameworkAdapters

Replace RAGAS-specific MetricsRegistry with a framework-agnostic adapter pattern:

GenericMetricsRegistry
    ├── RagasFrameworkAdapter (default)
    └── [future] DeepEvalAdapter, custom adapters...

Core protocol:

@dataclass
class MetricResult:
    score: float
    reason: str | None = None

class MetricCallable(Protocol):
    async def __call__(self, sample: ExecutedStep, **metric_args: Any) -> MetricResult: ...

  • MetricResult dataclass (not Result, to avoid collision with schema.models.Result)
  • MetricCallable takes ExecutedStep (not Step) since metrics need access to turns
  • FrameworkAdapter ABC with discover_metrics(), create_callable(), framework_name
  • Lazy-loading of adapters via registry

Concept doc: concepts/generic-metrics-registry-concept.md
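The adapter pattern and lazy loading described above can be sketched as follows. Method names (`discover_metrics`, `create_callable`, `framework_name`) come from the issue; the registry internals (factory registration, instance caching) are assumptions about one reasonable implementation:

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class MetricResult:
    score: float
    reason: str | None = None


class FrameworkAdapter(ABC):
    """Bridges one evaluation framework to the generic registry."""

    @property
    @abstractmethod
    def framework_name(self) -> str: ...

    @abstractmethod
    def discover_metrics(self) -> list[str]:
        """List the metric names this framework exposes."""

    @abstractmethod
    def create_callable(self, metric_name: str) -> Callable[..., Any]:
        """Build an async MetricCallable for the given metric name."""


class GenericMetricsRegistry:
    def __init__(self) -> None:
        # Adapters are registered as zero-arg factories and instantiated
        # lazily on first use, so importing the registry does not pull in
        # heavy frameworks such as RAGAS.
        self._factories: dict[str, Callable[[], FrameworkAdapter]] = {}
        self._adapters: dict[str, FrameworkAdapter] = {}

    def register(self, name: str, factory: Callable[[], FrameworkAdapter]) -> None:
        self._factories[name] = factory

    def get_adapter(self, name: str) -> FrameworkAdapter:
        if name not in self._adapters:
            self._adapters[name] = self._factories[name]()
        return self._adapters[name]
```

With this shape, adding DeepEval or a custom framework later is a matter of registering another factory; no caller code changes.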

3. RagasFrameworkAdapter as Default Adapter

  • Translates ExecutedStep into RAGAS EvaluationDataset samples
  • Supports both single-turn and multi-turn input formats
  • Routes LLM access through AI Gateway (LiteLLM) via OPENAI_API_BASE
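The single-turn translation might look like the sketch below. It is written against plain dicts to stay dependency-free; the real adapter builds RAGAS `EvaluationDataset` samples, and the convention of taking the last agent turn as the response is an assumption, not confirmed by the issue:

```python
from __future__ import annotations

from typing import Any


def to_single_turn_sample(step: dict[str, Any]) -> dict[str, Any]:
    """Flatten an ExecutedStep-like record into the fields a RAGAS
    single-turn sample expects (user_input, response, reference)."""
    turns = step.get("turns", [])
    # Assumption: the last "agent" turn carries the final answer.
    response = next(
        (t["content"] for t in reversed(turns) if t["type"] == "agent"),
        "",
    )
    return {
        "user_input": step["input"],
        "response": response,
        "reference": step.get("reference"),
    }
```

LLM-backed metrics would then be pointed at the AI Gateway by setting `OPENAI_API_BASE` to the LiteLLM endpoint, per the routing note above.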

4. OTLP Metrics Publishing & Grafana Dashboards

Publish per-step evaluation scores to observability backend via OpenTelemetry with rich hierarchical labels:

Target OTLP labels:

  • name, workflow_name, execution_id, execution_number
  • experiment_id, scenario_id, scenario_name
  • step_id, step_index, trace_id
  • threshold, result, user_input_truncated

Two Grafana dashboards:

  1. Trends Dashboard — Monitor quality trends over time, spot regressions after deployments
  2. Execution Details Dashboard — Investigate specific execution failures with trace linking to Tempo

Concept doc: concepts/grafana_visualization_concept.md
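Assembling the label set for one step's score could look like this dependency-free helper (actual publishing would go through the OpenTelemetry metrics SDK in `publish.py`; the truncation length and pass/fail derivation are assumptions):

```python
from __future__ import annotations

from typing import Any


def build_otlp_attributes(
    experiment: dict[str, Any],
    scenario: dict[str, Any],
    step: dict[str, Any],
    step_index: int,
    threshold: float,
    score: float,
    max_input_len: int = 100,  # truncation length is an assumption
) -> dict[str, Any]:
    """Collect the hierarchical OTLP labels listed above for one
    per-step evaluation score."""
    return {
        "experiment_id": experiment["id"],
        "scenario_id": scenario["id"],
        "scenario_name": scenario["name"],
        "step_id": step["id"],
        "step_index": step_index,
        "trace_id": scenario["trace_id"],  # enables Grafana → Tempo linking
        "threshold": threshold,
        "result": "pass" if score >= threshold else "fail",
        "user_input_truncated": step["input"][:max_input_len],
    }
```

Carrying `trace_id` as a label is what lets the Execution Details Dashboard link a failing step straight to its trace in Tempo.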


Acceptance Criteria

Data Model

  • JSON schemas (common.schema.json, experiment.schema.json, executed_experiment.schema.json, evaluated_experiment.schema.json) defined and consistent with Pydantic models
  • ToolCall model uses args (not arguments) for both Reference and Turn tool_calls
  • Turn.type enum uses "human" | "agent" | "tool" (not "ai")
  • Step-level metric configs use metrics[] field; evaluation results use evaluations[] with nested { metric, result } structure
  • Content-based deterministic ID generation for experiments, scenarios, and steps
  • 4-phase pipeline (setup.py → run.py → evaluate.py → publish.py) reads/writes correct schema at each phase
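One way to satisfy the content-based deterministic ID criterion is hashing the canonical JSON of a definition, so re-running setup on unchanged input yields stable IDs; the hash function and truncation here are assumptions, not the confirmed implementation:

```python
import hashlib
import json
from typing import Any


def content_id(payload: dict[str, Any]) -> str:
    """Derive a stable ID from content: identical definitions always
    produce identical IDs, regardless of dict key order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```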

Metrics Framework

  • GenericMetricsRegistry with FrameworkAdapter ABC
  • MetricCallable protocol with ExecutedStep input and MetricResult output
  • RagasFrameworkAdapter implementing adapter interface
  • Metric discovery, callable creation, and LLM injection working through adapter

Observability

  • OTLP metrics published with all hierarchical labels (experiment_id, scenario_id, scenario_name, step_index, threshold, result)
  • Grafana Trends Dashboard template
  • Grafana Execution Details Dashboard template
  • Trace linking from Grafana to Tempo via scenario trace_id

Testing & Quality

  • Unit tests for all pipeline phases with mocked external dependencies
  • E2E test validating complete 4-phase pipeline
  • mypy, bandit, ruff checks passing
  • Concept docs aligned with implementation

Deployment

  • TestWorkflowTemplate CRDs for each phase
  • Combined evaluation workflow for Testkube
  • Docker image build and local run support
  • Tilt local development environment with all operators

Implementation Status

Updated based on current codebase state (Feb 2026)

| Deliverable | Status | Notes |
| --- | --- | --- |
| Hierarchical JSON data model | ✅ Done | Pydantic models + schemas implemented |
| GenericMetricsRegistry | ✅ Done | Adapter pattern with RAGAS adapter |
| 4-phase pipeline | ✅ Done | setup, run, evaluate, publish |
| HTML visualization | ✅ Done | Optional visualize.py phase |
| Unit tests | ✅ Done | 120 passing, 1 skipped |
| Quality checks | ✅ Done | mypy, bandit, ruff clean |
| Concept docs aligned | ✅ Done | Updated to match code |
| Enhanced OTLP labels | 🔲 Pending | publish.py missing experiment_id, scenario_id, scenario_name, step_index, threshold, result labels |
| Grafana dashboard templates | 🔲 Pending | Concept defined, templates not yet created |
| Testkube workflow templates | ✅ Done | CRDs in deploy/base/templates/ |
