Testbench: Framework-Independent Agent Evaluation Platform #26

@fmallmann

Description

Goal: Transform the Testbench from a RAGAS-coupled evaluation tool into a framework-agnostic, fully observable, Kubernetes-native agent evaluation platform.

Description:
The Testbench currently has deep coupling to RAGAS: flat JSONL data format tied to RAGAS's LocalJSONLBackend, a MetricsRegistry hardcoded to ragas.metrics.BaseMetric, and CLI arguments designed around RAGAS conventions. This story replaces all RAGAS-specific abstractions with framework-agnostic alternatives and delivers the infrastructure and documentation needed for broad adoption across the Agentic Layer platform.


Key Deliverables

1. Hierarchical JSON Data Model

Replace flat RAGAS JSONL with a three-level hierarchy validated against formal JSON schemas at each pipeline phase:

Experiment → Scenarios → Steps

Three schema phases:

  • experiment.schema.json — User input (test definitions with metrics config)
  • executed_experiment.schema.json — After agent execution (adds IDs, trace_id, turns)
  • evaluated_experiment.schema.json — After metric evaluation (adds evaluations with nested { metric, result })
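To make the phase progression concrete, a fully evaluated step might look roughly like this (illustrative values only; the authoritative shape is `evaluated_experiment.schema.json`). `metrics[]` comes from the user-authored experiment, `id` and `turns` are added after execution, and `evaluations[]` after evaluation:

```json
{
  "id": "step-3f9c",
  "input": "What is the refund policy?",
  "metrics": [{ "metric_name": "answer_relevancy", "threshold": 0.7 }],
  "turns": [
    { "type": "human", "content": "What is the refund policy?" },
    { "type": "agent", "content": "Refunds are issued within 14 days." }
  ],
  "evaluations": [
    {
      "metric": { "metric_name": "answer_relevancy", "threshold": 0.7 },
      "result": { "score": 0.82 }
    }
  ]
}
```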

Pydantic model hierarchy:

Step(input, reference, custom_values, metrics)
  └→ ExecutedStep(+id, turns)
       └→ EvaluatedStep(+evaluations: list[Evaluation])

Scenario(name, steps, evaluations)
  └→ ExecutedScenario(+id, trace_id)
       └→ EvaluatedScenario(+evaluations)

Experiment(llm_as_a_judge_model, default_threshold, scenarios)
  └→ ExecutedExperiment(+id)
       └→ EvaluatedExperiment

Key model details:

  • ToolCall — Unified model with name and args fields, shared by both Reference and Turn tool_calls
  • Turn — Conversation turn with type enum: "human" | "agent" | "tool"
  • Metric — Config object with metric_name, threshold, parameters
  • Evaluation — Wraps metric: Metric + result: Result (nested structure, not flat)
  • Step-level metric configs use metrics[]; evaluation results (post-evaluate) use evaluations[]
  • Scenario-level evaluations use evaluations[] at both config and result stages

Concept doc: concepts/data_structur_concept.md
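A minimal, dependency-free sketch of the step-level part of this hierarchy (the real implementation uses Pydantic; stdlib dataclasses are used here for illustration, and exact field types are assumptions — field names follow the issue text):

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class TurnType(str, Enum):
    # Turn.type enum per the spec: "human" | "agent" | "tool" (not "ai")
    HUMAN = "human"
    AGENT = "agent"
    TOOL = "tool"


@dataclass
class ToolCall:
    # Unified model shared by Reference and Turn tool_calls; "args", not "arguments"
    name: str
    args: dict[str, Any] = field(default_factory=dict)


@dataclass
class Turn:
    type: TurnType
    content: str
    tool_calls: list[ToolCall] = field(default_factory=list)


@dataclass
class Metric:
    # Step/scenario metric configuration
    metric_name: str
    threshold: float | None = None
    parameters: dict[str, Any] = field(default_factory=dict)


@dataclass
class Result:
    score: float


@dataclass
class Evaluation:
    # Nested { metric, result } structure, not flat
    metric: Metric
    result: Result


@dataclass
class Step:
    input: str
    reference: str | None = None
    custom_values: dict[str, Any] = field(default_factory=dict)
    metrics: list[Metric] = field(default_factory=list)


@dataclass
class ExecutedStep(Step):
    # Added after agent execution
    id: str = ""
    turns: list[Turn] = field(default_factory=list)


@dataclass
class EvaluatedStep(ExecutedStep):
    # Added after metric evaluation
    evaluations: list[Evaluation] = field(default_factory=list)
```

Scenario and Experiment follow the same extend-by-inheritance pattern, so each pipeline phase consumes the previous phase's type and only adds fields.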

2. GenericMetricsRegistry with FrameworkAdapters

Replace RAGAS-specific MetricsRegistry with a framework-agnostic adapter pattern:

GenericMetricsRegistry
    ├── RagasFrameworkAdapter (default)
    └── [future] DeepEvalAdapter, custom adapters...

Core protocol:

@dataclass
class MetricResult:
    score: float
    reason: str | None = None

class MetricCallable(Protocol):
    async def __call__(self, sample: ExecutedStep, **metric_args: Any) -> MetricResult: ...

  • MetricResult dataclass (not Result, to avoid collision with schema.models.Result)
  • MetricCallable takes ExecutedStep (not Step) since metrics need access to turns
  • FrameworkAdapter ABC with discover_metrics(), create_callable(), framework_name
  • Lazy-loading of adapters via registry

Concept doc: concepts/generic-metrics-registry-concept.md
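The adapter pattern and lazy loading described above can be sketched as follows. Method names (`discover_metrics`, `create_callable`, `framework_name`) come from the issue; the registry internals (factory registration, instance caching) are assumptions about one reasonable implementation:

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class MetricResult:
    score: float
    reason: str | None = None


class FrameworkAdapter(ABC):
    """Bridges one evaluation framework to the generic registry."""

    @property
    @abstractmethod
    def framework_name(self) -> str: ...

    @abstractmethod
    def discover_metrics(self) -> list[str]:
        """List the metric names this framework exposes."""

    @abstractmethod
    def create_callable(self, metric_name: str) -> Callable[..., Any]:
        """Build an async MetricCallable for the given metric name."""


class GenericMetricsRegistry:
    def __init__(self) -> None:
        # Adapters are registered as zero-arg factories and instantiated
        # lazily on first use, so importing the registry does not pull in
        # heavy frameworks such as RAGAS.
        self._factories: dict[str, Callable[[], FrameworkAdapter]] = {}
        self._adapters: dict[str, FrameworkAdapter] = {}

    def register(self, name: str, factory: Callable[[], FrameworkAdapter]) -> None:
        self._factories[name] = factory

    def get_adapter(self, name: str) -> FrameworkAdapter:
        if name not in self._adapters:
            self._adapters[name] = self._factories[name]()
        return self._adapters[name]
```

With this shape, adding DeepEval or a custom framework later is a matter of registering another factory; no caller code changes.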

3. RagasFrameworkAdapter as Default Adapter

  • Translates ExecutedStep into RAGAS EvaluationDataset samples
  • Supports both single-turn and multi-turn input formats
  • Routes LLM access through AI Gateway (LiteLLM) via OPENAI_API_BASE
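The single-turn translation might look like the sketch below. It is written against plain dicts to stay dependency-free; the real adapter builds RAGAS `EvaluationDataset` samples, and the convention of taking the last agent turn as the response is an assumption, not confirmed by the issue:

```python
from __future__ import annotations

from typing import Any


def to_single_turn_sample(step: dict[str, Any]) -> dict[str, Any]:
    """Flatten an ExecutedStep-like record into the fields a RAGAS
    single-turn sample expects (user_input, response, reference)."""
    turns = step.get("turns", [])
    # Assumption: the last "agent" turn carries the final answer.
    response = next(
        (t["content"] for t in reversed(turns) if t["type"] == "agent"),
        "",
    )
    return {
        "user_input": step["input"],
        "response": response,
        "reference": step.get("reference"),
    }
```

LLM-backed metrics would then be pointed at the AI Gateway by setting `OPENAI_API_BASE` to the LiteLLM endpoint, per the routing note above.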

4. OTLP Metrics Publishing & Grafana Dashboards

Publish per-step evaluation scores to observability backend via OpenTelemetry with rich hierarchical labels:

Target OTLP labels:

  • name, workflow_name, execution_id, execution_number
  • experiment_id, scenario_id, scenario_name
  • step_id, step_index, trace_id
  • threshold, result, user_input_truncated

Two Grafana dashboards:

  1. Trends Dashboard — Monitor quality trends over time, spot regressions after deployments
  2. Execution Details Dashboard — Investigate specific execution failures with trace linking to Tempo

Concept doc: concepts/grafana_visualization_concept.md
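Assembling the label set for one step's score could look like this dependency-free helper (actual publishing would go through the OpenTelemetry metrics SDK in `publish.py`; the truncation length and pass/fail derivation are assumptions):

```python
from __future__ import annotations

from typing import Any


def build_otlp_attributes(
    experiment: dict[str, Any],
    scenario: dict[str, Any],
    step: dict[str, Any],
    step_index: int,
    threshold: float,
    score: float,
    max_input_len: int = 100,  # truncation length is an assumption
) -> dict[str, Any]:
    """Collect the hierarchical OTLP labels listed above for one
    per-step evaluation score."""
    return {
        "experiment_id": experiment["id"],
        "scenario_id": scenario["id"],
        "scenario_name": scenario["name"],
        "step_id": step["id"],
        "step_index": step_index,
        "trace_id": scenario["trace_id"],  # enables Grafana → Tempo linking
        "threshold": threshold,
        "result": "pass" if score >= threshold else "fail",
        "user_input_truncated": step["input"][:max_input_len],
    }
```

Carrying `trace_id` as a label is what lets the Execution Details Dashboard link a failing step straight to its trace in Tempo.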


Acceptance Criteria

Data Model

  • JSON schemas (common.schema.json, experiment.schema.json, executed_experiment.schema.json, evaluated_experiment.schema.json) defined and consistent with Pydantic models
  • ToolCall model uses args (not arguments) for both Reference and Turn tool_calls
  • Turn.type enum uses "human" | "agent" | "tool" (not "ai")
  • Step-level metric configs use metrics[] field; evaluation results use evaluations[] with nested { metric, result } structure
  • Content-based deterministic ID generation for experiments, scenarios, and steps
  • 4-phase pipeline (setup.py → run.py → evaluate.py → publish.py) reads/writes correct schema at each phase
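One way to satisfy the content-based deterministic ID criterion is hashing the canonical JSON of a definition, so re-running setup on unchanged input yields stable IDs; the hash function and truncation here are assumptions, not the confirmed implementation:

```python
import hashlib
import json
from typing import Any


def content_id(payload: dict[str, Any]) -> str:
    """Derive a stable ID from content: identical definitions always
    produce identical IDs, regardless of dict key order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```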

Metrics Framework

  • GenericMetricsRegistry with FrameworkAdapter ABC
  • MetricCallable protocol with ExecutedStep input and MetricResult output
  • RagasFrameworkAdapter implementing adapter interface
  • Metric discovery, callable creation, and LLM injection working through adapter

Observability

  • OTLP metrics published with all hierarchical labels (experiment_id, scenario_id, scenario_name, step_index, threshold, result)
  • Grafana Trends Dashboard template
  • Grafana Execution Details Dashboard template
  • Trace linking from Grafana to Tempo via scenario trace_id

Testing & Quality

  • Unit tests for all pipeline phases with mocked external dependencies
  • E2E test validating complete 4-phase pipeline
  • mypy, bandit, ruff checks passing
  • Concept docs aligned with implementation

Deployment

  • TestWorkflowTemplate CRDs for each phase
  • Combined evaluation workflow for Testkube
  • Docker image build and local run support
  • Tilt local development environment with all operators

Implementation Status

Updated based on current codebase state (Feb 2026)

| Deliverable | Status | Notes |
| --- | --- | --- |
| Hierarchical JSON data model | ✅ Done | Pydantic models + schemas implemented |
| GenericMetricsRegistry | ✅ Done | Adapter pattern with RAGAS adapter |
| 4-phase pipeline | ✅ Done | setup, run, evaluate, publish |
| HTML visualization | ✅ Done | Optional visualize.py phase |
| Unit tests | ✅ Done | 120 passing, 1 skipped |
| Quality checks | ✅ Done | mypy, bandit, ruff clean |
| Concept docs aligned | ✅ Done | Updated to match code |
| Enhanced OTLP labels | 🔲 Pending | publish.py missing experiment_id, scenario_id, scenario_name, step_index, threshold, result labels |
| Grafana dashboard templates | 🔲 Pending | Concept defined, templates not yet created |
| Testkube workflow templates | ✅ Done | CRDs in deploy/base/templates/ |
