Goal: Implement a Kubernetes operator that watches Experiment Custom Resources and reconciles them into fully functional Testkube TestWorkflows executing the 4-phase evaluation pipeline (setup → run → evaluate → publish).
Description:
Today, running the testbench evaluation pipeline in Kubernetes requires manually creating a ConfigMap with the experiment definition, a TestWorkflow chaining the phase templates, and optionally a TestTrigger for automatic execution on agent changes (see deploy/local/). This story introduces a testbench-operator — analogous to the agent-runtime-operator — that automates this entire process. Users define an Experiment CR with their evaluation configuration, and the operator generates all required Kubernetes resources.
Related: #21
Key Deliverables
1. Experiment CRD Specification
Define an Experiment Custom Resource that captures all evaluation parameters. The CRD supports two data source modes: inline scenarios (defined directly in the spec) and external datasets (loaded from S3 or other sources via the setup phase).
```yaml
apiVersion: testbench.agentic-layer.ai/v1alpha1
kind: Experiment
metadata:
  name: weather-agent-evaluation
  namespace: testkube
spec:
  # Reference to the agent under test
  agentRef:
    name: weather-agent
    namespace: sample-agents
  # External dataset source (triggers setup phase)
  dataset:
    s3:
      bucket: evaluation-datasets
      key: weather-agent/dataset.csv
    # Alternative: url: "http://data-server:8000/dataset.csv"
  # LLM configuration for evaluation
  llmAsAJudgeModel: "gemini-2.5-flash-lite"
  defaultThreshold: 0.9
  # Inline scenario definitions (alternative to dataset, skips setup phase)
  scenarios:
    - name: "Weather Query - New York"
      steps:
        - input: "What is the weather in New York?"
          reference:
            toolCalls:
              - name: get_weather
                args:
                  city: "New York"
  metrics:
    - metricName: AgentGoalAccuracyWithoutReference
    - metricName: ToolCallAccuracy
    - metricName: TopicAdherence
      threshold: 0.8
      parameters:
        mode: precision
  # Automatic trigger configuration
  trigger:
    enabled: true
    event: modified  # Trigger on agent deployment changes
    concurrencyPolicy: allow
```
Key design decisions:
- `agentRef` references an `Agent` CR (resolved to its A2A endpoint by the operator)
- `dataset` and `scenarios` are mutually exclusive: if `dataset` is set, the setup phase runs; if `scenarios` is set inline, the operator generates the `experiment.json` ConfigMap directly
- `trigger` controls automatic TestTrigger creation, watching the referenced agent's Deployment
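A minimal sketch of the corresponding Go API types (field names mirror the YAML example above; the type shapes, optional-pointer choices, and validation markers are illustrative assumptions, not the final schema):

```go
// In a real operator these types would live in api/v1alpha1 and embed
// metav1.TypeMeta/ObjectMeta; this sketch keeps them dependency-free.
package main

// ExperimentSpec mirrors the YAML example above.
type ExperimentSpec struct {
	// AgentRef points at the Agent CR under test.
	AgentRef ObjectRef `json:"agentRef"`

	// Dataset configures an external data source; mutually exclusive
	// with Scenarios.
	// +optional
	Dataset *Dataset `json:"dataset,omitempty"`

	LLMAsAJudgeModel string `json:"llmAsAJudgeModel"`

	// +kubebuilder:validation:Minimum=0
	// +kubebuilder:validation:Maximum=1
	DefaultThreshold float64 `json:"defaultThreshold"`

	// Inline scenarios; mutually exclusive with Dataset.
	// +optional
	Scenarios []Scenario `json:"scenarios,omitempty"`

	Metrics []Metric `json:"metrics"`

	// +optional
	Trigger *Trigger `json:"trigger,omitempty"`
}

type ObjectRef struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
}

type Dataset struct {
	// +optional
	S3 *S3Source `json:"s3,omitempty"`
	// +optional
	URL string `json:"url,omitempty"`
}

type S3Source struct {
	Bucket string `json:"bucket"`
	Key    string `json:"key"`
}

type Scenario struct {
	Name  string `json:"name"`
	Steps []Step `json:"steps"`
}

type Step struct {
	Input string `json:"input"`
}

type Metric struct {
	MetricName string            `json:"metricName"`
	Threshold  *float64          `json:"threshold,omitempty"`
	Parameters map[string]string `json:"parameters,omitempty"`
}

type Trigger struct {
	Enabled           bool   `json:"enabled"`
	Event             string `json:"event,omitempty"`
	ConcurrencyPolicy string `json:"concurrencyPolicy,omitempty"`
}

func main() {}
```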
2. Operator Implementation (Go + Operator SDK)
Built with Go and Operator SDK, consistent with agent-runtime-operator.
Reconciliation logic — on each Experiment CR change, the operator ensures:
| Resource | Purpose | Condition |
|---|---|---|
| ConfigMap (`{name}-experiment`) | Stores serialized `experiment.json` | Always (from inline scenarios or as empty placeholder for setup phase) |
| TestWorkflow (`{name}-workflow`) | Chains phase templates: setup → run → evaluate → publish → visualize | Always |
| TestTrigger (`{name}-trigger`) | Watches agent Deployment for changes, triggers workflow | Only when `spec.trigger.enabled: true` |
Phase template selection:
- If `spec.dataset` is set → include `setup-template` as first phase (downloads external dataset)
- If `spec.scenarios` is set → skip setup, inject `experiment.json` via ConfigMap
- Always include: `run-template`, `evaluate-template`, `publish-template`, `visualize-template`

Config parameter mapping:
- `spec.agentRef` → resolved to agent URL → `run-template.config.agentUrl`
- `spec.dataset.s3` → `setup-template.config.bucket` + `setup-template.config.key`
- `spec.llmAsAJudgeModel` → embedded in `experiment.json`
- OTEL endpoint → inherited from cluster ConfigMap (`otel-config`)
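The template selection rule can be sketched as a pure function (template names are taken from the list above; a real reconciler would build `testworkflows.testkube.io` objects rather than return names):

```go
package main

import "fmt"

// phaseTemplates returns the ordered TestWorkflow template chain for an
// Experiment, based on its data source mode. Illustrative sketch only.
func phaseTemplates(hasDataset bool) []string {
	templates := []string{}
	if hasDataset {
		// External dataset: the setup phase downloads it first.
		templates = append(templates, "setup-template")
	}
	// Inline scenarios skip setup; experiment.json comes from the ConfigMap.
	return append(templates,
		"run-template", "evaluate-template", "publish-template", "visualize-template")
}

func main() {
	fmt.Println(phaseTemplates(true))
	// [setup-template run-template evaluate-template publish-template visualize-template]
	fmt.Println(phaseTemplates(false))
	// [run-template evaluate-template publish-template visualize-template]
}
```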
3. Status Reporting
The Experiment CR status subresource reports reconciliation and workflow state:
```yaml
status:
  conditions:
    - type: Ready
      status: "True"
      reason: Reconciled
      message: "All resources created successfully"
    - type: WorkflowReady
      status: "True"
      reason: Created
  generatedResources:
    configMap: weather-agent-evaluation-experiment
    testWorkflow: weather-agent-evaluation-workflow
    testTrigger: weather-agent-evaluation-trigger
  lastExecution:
    id: "exec-abc123"
    status: "passed"
    timestamp: "2026-02-23T10:00:00Z"
```
4. Garbage Collection & Ownership
- All generated resources use `ownerReferences` pointing to the `Experiment` CR
- Deleting an `Experiment` CR cascades deletion to ConfigMap, TestWorkflow, and TestTrigger
- Updates to the `Experiment` CR trigger re-reconciliation (update existing resources)
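In controller-runtime this is typically done by calling `controllerutil.SetControllerReference(experiment, child, scheme)` before creating each child. A dependency-free sketch of what that call stamps onto the child (field names match `metav1.OwnerReference`; the struct here is a local stand-in, not the real API type):

```go
package main

import "fmt"

// OwnerReference is a local stand-in for metav1.OwnerReference.
type OwnerReference struct {
	APIVersion         string
	Kind               string
	Name               string
	UID                string
	Controller         bool
	BlockOwnerDeletion bool
}

// setControllerReference mimics what controllerutil.SetControllerReference
// does: it stamps the child with a controller owner reference so that
// deleting the Experiment cascades to the child via garbage collection.
func setControllerReference(ownerName, ownerUID string, childOwners *[]OwnerReference) {
	*childOwners = append(*childOwners, OwnerReference{
		APIVersion:         "testbench.agentic-layer.ai/v1alpha1",
		Kind:               "Experiment",
		Name:               ownerName,
		UID:                ownerUID,
		Controller:         true, // exactly one controller owner per object
		BlockOwnerDeletion: true, // owner cannot be deleted out from under the child
	})
}

func main() {
	var configMapOwners []OwnerReference
	// "uid-1234" is a placeholder; the real UID comes from the live CR.
	setControllerReference("weather-agent-evaluation", "uid-1234", &configMapOwners)
	fmt.Printf("%+v\n", configMapOwners[0])
}
```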
5. Helm Chart Integration
- CRD installed via Helm chart (following agent-runtime-operator pattern)
- Operator Deployment with RBAC (ServiceAccount, ClusterRole, ClusterRoleBinding)
- Webhook configuration for CRD validation (optional, stretch goal)
- Values configurable: image, namespace, resource limits, testworkflow template names
Acceptance Criteria
CRD & Validation
- `Experiment` CRD schema defined with OpenAPI validation
- `dataset` and `scenarios` are mutually exclusive (validation webhook or CEL rule)
- `agentRef` must reference an existing Agent CR in the specified namespace
- `defaultThreshold` validated to 0-1 range
- Metric `threshold` validated to 0-1 range
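The mutual-exclusivity and range rules can be enforced either declaratively or in a webhook. A sketch showing the kubebuilder CEL marker (real marker syntax; the surrounding type is omitted) alongside an equivalent plain-Go check as a validating webhook might run it:

```go
package main

import (
	"errors"
	"fmt"
)

// With controller-gen, a CEL rule on the spec type could read:
//
//	// +kubebuilder:validation:XValidation:rule="!(has(self.dataset) && has(self.scenarios))",message="dataset and scenarios are mutually exclusive"
//
// The same checks in plain Go (boolean inputs stand in for the spec fields):
func validateSpec(hasDataset, hasScenarios bool, defaultThreshold float64) error {
	if hasDataset && hasScenarios {
		return errors.New("dataset and scenarios are mutually exclusive")
	}
	if defaultThreshold < 0 || defaultThreshold > 1 {
		return errors.New("defaultThreshold must be in [0, 1]")
	}
	return nil
}

func main() {
	fmt.Println(validateSpec(true, true, 0.9))
	// dataset and scenarios are mutually exclusive
}
```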
Reconciliation
- Creating an `Experiment` CR generates ConfigMap + TestWorkflow + TestTrigger
- Updating an `Experiment` CR updates generated resources accordingly
- Deleting an `Experiment` CR cascades deletion via ownerReferences
- TestWorkflow correctly chains phase templates based on data source mode
- Agent URL resolved from `agentRef` (Agent CR status or service DNS)
- Experiment JSON correctly serialized into ConfigMap from inline scenarios
Trigger
- TestTrigger created when `spec.trigger.enabled: true`
- TestTrigger watches correct Deployment (derived from agent reference)
- TestTrigger not created when `spec.trigger.enabled: false` or omitted
Status
- Status subresource updated with reconciliation conditions
- Generated resource names tracked in status
- Last execution status populated (stretch goal)
Testing & Quality
- Unit tests for reconciliation logic (envtest or mocked client)
- E2E test: create Experiment CR → verify generated resources
- golangci-lint passing
- CRD validation tests
Deployment
- Helm chart with CRD, operator Deployment, RBAC
- Integration into Tilt local development environment
- Docker image build via Makefile
Implementation Status
| Deliverable | Status | Notes |
|---|---|---|
| Experiment CRD specification | 🔲 Pending | |
| Operator scaffolding (Operator SDK) | 🔲 Pending | |
| Reconciler: ConfigMap generation | 🔲 Pending | |
| Reconciler: TestWorkflow generation | 🔲 Pending | |
| Reconciler: TestTrigger generation | 🔲 Pending | |
| Status reporting | 🔲 Pending | |
| Garbage collection (ownerReferences) | 🔲 Pending | |
| Validation webhook | 🔲 Pending | |
| Helm chart | 🔲 Pending | |
| Unit tests | 🔲 Pending | |
| E2E tests | 🔲 Pending | |
| Tilt integration | 🔲 Pending |
References
- Issue Testbench Operator #21 — Original operator request
- agent-runtime-operator — Reference operator implementation (Go + Operator SDK)
- `deploy/local/example-workflow.yaml` — Current manual TestWorkflow definition
- `deploy/local/experiment.yaml` — Current manual experiment ConfigMap
- `deploy/local/example-workflow-trigger.yaml` — Current manual TestTrigger
- `chart/templates/` — Existing TestWorkflowTemplate Helm templates
Sub-Tasks
- Operator: Scaffold Go project with Operator SDK and define Experiment CRD types (#28)
- Operator: Reconciler implementation, status reporting, and tests (#29)
- Operator: Helm chart, Tilt integration, and showcase agent validation (#30)
- Operator: Documentation (#31)
Supersedes
Testbench Operator #21 (closed, fully covered by this issue)