Goal: Transform the Testbench from a RAGAS-coupled evaluation tool into a framework-agnostic, fully observable, Kubernetes-native agent evaluation platform.
Description:
The Testbench currently has deep coupling to RAGAS: flat JSONL data format tied to RAGAS's LocalJSONLBackend, a MetricsRegistry hardcoded to ragas.metrics.BaseMetric, and CLI arguments designed around RAGAS conventions. This story replaces all RAGAS-specific abstractions with framework-agnostic alternatives and delivers the infrastructure and documentation needed for broad adoption across the Agentic Layer platform.
Key Deliverables
1. Hierarchical JSON Data Model
Replace flat RAGAS JSONL with a three-level hierarchy validated against formal JSON schemas at each pipeline phase:
Experiment → Scenarios → Steps
Three schema phases:
- `experiment.schema.json`: user input (test definitions with metrics config)
- `executed_experiment.schema.json`: after agent execution (adds IDs, trace_id, turns)
- `evaluated_experiment.schema.json`: after metric evaluation (adds evaluations with nested `{ metric, result }`)
Pydantic model hierarchy:
Step(input, reference, custom_values, metrics)
└→ ExecutedStep(+id, turns)
└→ EvaluatedStep(+evaluations: list[Evaluation])
Scenario(name, steps, evaluations)
└→ ExecutedScenario(+id, trace_id)
└→ EvaluatedScenario(+evaluations)
Experiment(llm_as_a_judge_model, default_threshold, scenarios)
└→ ExecutedExperiment(+id)
└→ EvaluatedExperiment
Key model details:
- `ToolCall`: unified model with `name` and `args` fields, shared by both Reference and Turn tool_calls
- `Turn`: conversation turn with `type` enum: `"human" | "agent" | "tool"`
- `Metric`: config object with `metric_name`, `threshold`, `parameters`
- `Evaluation`: wraps `metric: Metric` + `result: Result` (nested structure, not flat)
- Step-level metric configs use `metrics[]`; evaluation results (post-evaluate) use `evaluations[]`
- Scenario-level evaluations use `evaluations[]` at both config and result stages
Concept doc: concepts/data_structur_concept.md
2. GenericMetricsRegistry with FrameworkAdapters
Replace RAGAS-specific MetricsRegistry with a framework-agnostic adapter pattern:
GenericMetricsRegistry
├── RagasFrameworkAdapter (default)
└── [future] DeepEvalAdapter, custom adapters...
Core protocol:
```python
@dataclass
class MetricResult:
    score: float
    reason: str | None = None

class MetricCallable(Protocol):
    async def __call__(self, sample: ExecutedStep, **metric_args: Any) -> MetricResult: ...
```

- `MetricResult` dataclass (not `Result`, to avoid collision with `schema.models.Result`)
- `MetricCallable` takes `ExecutedStep` (not `Step`) since metrics need access to `turns`
- `FrameworkAdapter` ABC with `discover_metrics()`, `create_callable()`, `framework_name`
- Lazy loading of adapters via the registry
Concept doc: concepts/generic-metrics-registry-concept.md
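A minimal sketch of the registry's lazy adapter loading, assuming adapters are registered as zero-argument factories (the factory shape and method signatures are assumptions, not the actual API):

```python
from abc import ABC, abstractmethod
from typing import Callable

class FrameworkAdapter(ABC):
    """Per-framework adapter: discovers metrics and wraps them as callables."""

    @property
    @abstractmethod
    def framework_name(self) -> str: ...

    @abstractmethod
    def discover_metrics(self) -> list[str]: ...

    @abstractmethod
    def create_callable(self, metric_name: str) -> Callable: ...

class GenericMetricsRegistry:
    def __init__(self) -> None:
        # Factories are cheap to register; adapters are built lazily on first
        # use, so e.g. ragas is only imported when a RAGAS metric is requested.
        self._factories: dict[str, Callable[[], FrameworkAdapter]] = {}
        self._adapters: dict[str, FrameworkAdapter] = {}

    def register(self, name: str, factory: Callable[[], FrameworkAdapter]) -> None:
        self._factories[name] = factory

    def get_adapter(self, name: str) -> FrameworkAdapter:
        if name not in self._adapters:
            self._adapters[name] = self._factories[name]()
        return self._adapters[name]
```

Registering factories rather than instances is what makes a future `DeepEvalAdapter` free until first use: no framework import cost is paid for adapters that a run never touches.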
3. RagasFrameworkAdapter as Default Adapter
- Translates `ExecutedStep` into RAGAS `EvaluationDataset` samples
- Supports both single-turn and multi-turn input formats
- Routes LLM access through AI Gateway (LiteLLM) via `OPENAI_API_BASE`
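The translation could look roughly like the following. The role mapping and the single-turn/multi-turn split are assumptions; RAGAS's own dataset classes are deliberately not imported, so the field names (`user_input`, `response`, `reference`) only mirror RAGAS conventions:

```python
from typing import Any

# Hypothetical mapping from the Testbench Turn.type enum to RAGAS message roles.
ROLE_MAP = {"human": "user", "agent": "assistant", "tool": "tool"}

def turns_to_messages(turns: list[dict[str, Any]]) -> list[dict[str, Any]]:
    return [{"role": ROLE_MAP[t["type"]], "content": t["content"]} for t in turns]

def to_ragas_sample(step: dict[str, Any]) -> dict[str, Any]:
    """Map an ExecutedStep-like dict onto RAGAS-style sample fields."""
    turns = step.get("turns", [])
    if len(turns) <= 2:
        # One human turn plus one agent turn: treat as a single-turn sample.
        response = next((t["content"] for t in turns if t["type"] == "agent"), "")
        return {"user_input": step["input"], "response": response,
                "reference": step.get("reference")}
    # Longer conversations become multi-turn samples carrying the full exchange.
    return {"user_input": turns_to_messages(turns), "reference": step.get("reference")}
```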
4. OTLP Metrics Publishing & Grafana Dashboards
Publish per-step evaluation scores to observability backend via OpenTelemetry with rich hierarchical labels:
Target OTLP labels:
- `name`, `workflow_name`, `execution_id`, `execution_number`
- `experiment_id`, `scenario_id`, `scenario_name`
- `step_id`, `step_index`, `trace_id`
- `threshold`, `result`, `user_input_truncated`
Two Grafana dashboards:
- Trends Dashboard — Monitor quality trends over time, spot regressions after deployments
- Execution Details Dashboard — Investigate specific execution failures with trace linking to Tempo
Concept doc: concepts/grafana_visualization_concept.md
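As an illustration, a label-building helper for the per-step data points might look like this. The function name and signature are hypothetical, as is the 100-character truncation; workflow-level labels (`name`, `workflow_name`, `execution_id`, `execution_number`) would come from the Testkube execution environment and are omitted:

```python
from typing import Any

def build_otlp_labels(experiment: dict[str, Any], scenario: dict[str, Any],
                      step: dict[str, Any], step_index: int,
                      threshold: float, passed: bool) -> dict[str, Any]:
    # Hierarchical labels attached to each per-step evaluation score, so
    # Grafana can filter by experiment, scenario, or step and jump to Tempo
    # via the scenario's trace_id.
    return {
        "experiment_id": experiment["id"],
        "scenario_id": scenario["id"],
        "scenario_name": scenario["name"],
        "step_id": step["id"],
        "step_index": step_index,
        "trace_id": scenario["trace_id"],
        "threshold": threshold,
        "result": "pass" if passed else "fail",
        "user_input_truncated": step["input"][:100],
    }
```

Truncating `user_input` keeps label cardinality and size bounded while still letting a dashboard show which prompt a failing score belongs to.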
Acceptance Criteria
Data Model
- JSON schemas (`common.schema.json`, `experiment.schema.json`, `executed_experiment.schema.json`, `evaluated_experiment.schema.json`) defined and consistent with Pydantic models
- `ToolCall` model uses `args` (not `arguments`) for both Reference and Turn tool_calls
- `Turn.type` enum uses `"human" | "agent" | "tool"` (not `"ai"`)
- Step-level metric configs use `metrics[]` field; evaluation results use `evaluations[]` with nested `{ metric, result }` structure
- Content-based deterministic ID generation for experiments, scenarios, and steps
- 4-phase pipeline (`setup.py` → `run.py` → `evaluate.py` → `publish.py`) reads/writes correct schema at each phase
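Content-based deterministic IDs can be derived by hashing a canonical JSON serialization of the entity. This is one plausible scheme, not necessarily the implemented one:

```python
import hashlib
import json

def content_id(payload: dict, prefix: str) -> str:
    # Canonical JSON (sorted keys, no whitespace) makes the hash independent
    # of key order, so identical content always yields the identical ID and
    # re-running setup on an unchanged experiment is a no-op.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}-{digest[:12]}"
```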
Metrics Framework
- `GenericMetricsRegistry` with `FrameworkAdapter` ABC
- `MetricCallable` protocol with `ExecutedStep` input and `MetricResult` output
- `RagasFrameworkAdapter` implementing the adapter interface
- Metric discovery, callable creation, and LLM injection working through the adapter
Observability
- OTLP metrics published with all hierarchical labels (experiment_id, scenario_id, scenario_name, step_index, threshold, result)
- Grafana Trends Dashboard template
- Grafana Execution Details Dashboard template
- Trace linking from Grafana to Tempo via scenario `trace_id`
Testing & Quality
- Unit tests for all pipeline phases with mocked external dependencies
- E2E test validating complete 4-phase pipeline
- mypy, bandit, ruff checks passing
- Concept docs aligned with implementation
Deployment
- TestWorkflowTemplate CRDs for each phase
- Combined evaluation workflow for Testkube
- Docker image build and local run support
- Tilt local development environment with all operators
Implementation Status
Updated based on current codebase state (Feb 2026)
| Deliverable | Status | Notes |
|---|---|---|
| Hierarchical JSON data model | ✅ Done | Pydantic models + schemas implemented |
| GenericMetricsRegistry | ✅ Done | Adapter pattern with RAGAS adapter |
| 4-phase pipeline | ✅ Done | setup, run, evaluate, publish |
| HTML visualization | ✅ Done | Optional visualize.py phase |
| Unit tests | ✅ Done | 120 passing, 1 skipped |
| Quality checks | ✅ Done | mypy, bandit, ruff clean |
| Concept docs aligned | ✅ Done | Updated to match code |
| Enhanced OTLP labels | 🔲 Pending | publish.py missing experiment_id, scenario_id, scenario_name, step_index, threshold, result labels |
| Grafana dashboard templates | 🔲 Pending | Concept defined, templates not yet created |
| Testkube workflow templates | ✅ Done | CRDs in deploy/base/templates/ |