
Testbench Operator: Reconcile Experiment CRD to TestWorkflows #27

@fmallmann

Description


Goal: Implement a Kubernetes operator that watches Experiment Custom Resources and reconciles them into fully functional Testkube TestWorkflows executing the 4-phase evaluation pipeline (setup → run → evaluate → publish).

Description:
Today, running the testbench evaluation pipeline in Kubernetes requires manually creating a ConfigMap with the experiment definition, a TestWorkflow chaining the phase templates, and optionally a TestTrigger for automatic execution on agent changes (see deploy/local/). This story introduces a testbench-operator — analogous to the agent-runtime-operator — that automates this entire process. Users define an Experiment CR with their evaluation configuration, and the operator generates all required Kubernetes resources.

Related: #21


Key Deliverables

1. Experiment CRD Specification

Define an Experiment Custom Resource that captures all evaluation parameters. The CRD supports two data source modes: inline scenarios (defined directly in the spec) and external datasets (loaded from S3 or other sources via the setup phase).

```yaml
apiVersion: testbench.agentic-layer.ai/v1alpha1
kind: Experiment
metadata:
  name: weather-agent-evaluation
  namespace: testkube
spec:
  # Reference to the agent under test
  agentRef:
    name: weather-agent
    namespace: sample-agents

  # External dataset source (triggers setup phase)
  dataset:
    s3:
      bucket: evaluation-datasets
      key: weather-agent/dataset.csv
    # Alternative: url: "http://data-server:8000/dataset.csv"

  # LLM configuration for evaluation
  llmAsAJudgeModel: "gemini-2.5-flash-lite"
  defaultThreshold: 0.9

  # Inline scenario definitions (alternative to dataset, skips setup phase)
  scenarios:
    - name: "Weather Query - New York"
      steps:
        - input: "What is the weather in New York?"
          reference:
            toolCalls:
              - name: get_weather
                args:
                  city: "New York"
          metrics:
            - metricName: AgentGoalAccuracyWithoutReference
            - metricName: ToolCallAccuracy
            - metricName: TopicAdherence
              threshold: 0.8
              parameters:
                mode: precision

  # Automatic trigger configuration
  trigger:
    enabled: true
    event: modified       # Trigger on agent deployment changes
    concurrencyPolicy: allow
```

Key design decisions:

  • agentRef references an Agent CR (resolved to its A2A endpoint by the operator)
  • dataset and scenarios are mutually exclusive — if dataset is set, the setup phase runs; if scenarios is set inline, the operator generates the experiment.json ConfigMap directly
  • trigger controls automatic TestTrigger creation, watching the referenced agent's Deployment

2. Operator Implementation (Go + Operator SDK)

Built with Go and Operator SDK, consistent with agent-runtime-operator.

Reconciliation logic — on each Experiment CR change, the operator ensures:

| Resource | Purpose | Condition |
| --- | --- | --- |
| ConfigMap (`{name}-experiment`) | Stores serialized `experiment.json` | Always (from inline scenarios, or as an empty placeholder for the setup phase) |
| TestWorkflow (`{name}-workflow`) | Chains phase templates: setup → run → evaluate → publish → visualize | Always |
| TestTrigger (`{name}-trigger`) | Watches the agent Deployment for changes and triggers the workflow | Only when `spec.trigger.enabled: true` |

Phase template selection:

  • If spec.dataset is set → include setup-template as first phase (downloads external dataset)
  • If spec.scenarios is set → skip setup, inject experiment.json via ConfigMap
  • Always include: run-template, evaluate-template, publish-template, visualize-template
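The selection rule above is small enough to sketch as a pure function — a minimal illustration, assuming the template names match the existing `chart/templates/` definitions:

```go
package main

import "fmt"

// selectPhases returns the ordered TestWorkflowTemplate names for an
// Experiment based on its data source mode. Template names are assumed
// to match the existing chart/templates/ definitions.
func selectPhases(hasDataset bool) []string {
	phases := []string{}
	if hasDataset {
		// External dataset: download it first via the setup template.
		phases = append(phases, "setup-template")
	}
	// The core pipeline phases are always included.
	return append(phases,
		"run-template", "evaluate-template", "publish-template", "visualize-template")
}

func main() {
	fmt.Println(selectPhases(true))  // dataset mode: setup runs first
	fmt.Println(selectPhases(false)) // inline scenarios: setup skipped
}
```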

Config parameter mapping:

  • spec.agentRef → resolved to agent URL → run-template.config.agentUrl
  • spec.dataset.s3 → setup-template.config.bucket + setup-template.config.key
  • spec.llmAsAJudgeModel → embedded in experiment.json
  • OTEL endpoint → inherited from cluster ConfigMap (otel-config)
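Putting the mapping together, the generated TestWorkflow could look roughly like this — a sketch, not the final manifest: the step/template layout follows Testkube's TestWorkflow schema, and the resolved agent URL (including service port) is a placeholder assumption:

```yaml
apiVersion: testworkflows.testkube.io/v1
kind: TestWorkflow
metadata:
  name: weather-agent-evaluation-workflow
  namespace: testkube
spec:
  steps:
    - name: setup                 # only present when spec.dataset is set
      template:
        name: setup-template
        config:
          bucket: evaluation-datasets      # from spec.dataset.s3.bucket
          key: weather-agent/dataset.csv   # from spec.dataset.s3.key
    - name: run
      template:
        name: run-template
        config:
          # resolved from spec.agentRef; hostname/port shown here are illustrative
          agentUrl: http://weather-agent.sample-agents.svc.cluster.local:8080
    - name: evaluate
      template:
        name: evaluate-template
    - name: publish
      template:
        name: publish-template
    - name: visualize
      template:
        name: visualize-template
```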

3. Status Reporting

The Experiment CR status subresource reports reconciliation and workflow state:

```yaml
status:
  conditions:
    - type: Ready
      status: "True"
      reason: Reconciled
      message: "All resources created successfully"
    - type: WorkflowReady
      status: "True"
      reason: Created
  generatedResources:
    configMap: weather-agent-evaluation-experiment
    testWorkflow: weather-agent-evaluation-workflow
    testTrigger: weather-agent-evaluation-trigger
  lastExecution:
    id: "exec-abc123"
    status: "passed"
    timestamp: "2026-02-23T10:00:00Z"
```

4. Garbage Collection & Ownership

  • All generated resources use ownerReferences pointing to the Experiment CR
  • Deleting an Experiment CR cascades deletion to ConfigMap, TestWorkflow, and TestTrigger
  • Updates to the Experiment CR trigger re-reconciliation (update existing resources)
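For illustration, a generated ConfigMap carrying its owner reference might look like this (the `uid` is filled in from the live Experiment object at creation time; controller-runtime's `controllerutil.SetControllerReference` produces exactly this shape):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: weather-agent-evaluation-experiment
  namespace: testkube
  ownerReferences:
    - apiVersion: testbench.agentic-layer.ai/v1alpha1
      kind: Experiment
      name: weather-agent-evaluation
      uid: <experiment-uid>        # set by the operator from the live CR
      controller: true
      blockOwnerDeletion: true
data:
  experiment.json: |
    { "...": "serialized experiment definition" }
```

Note that owner references only cascade within a namespace, so generated resources must live in the same namespace as the Experiment CR.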

5. Helm Chart Integration

  • CRD installed via Helm chart (following agent-runtime-operator pattern)
  • Operator Deployment with RBAC (ServiceAccount, ClusterRole, ClusterRoleBinding)
  • Webhook configuration for CRD validation (optional, stretch goal)
  • Values configurable: image, namespace, resource limits, testworkflow template names

Acceptance Criteria

CRD & Validation

  • Experiment CRD schema defined with OpenAPI validation
  • dataset and scenarios are mutually exclusive (validation webhook or CEL rule)
  • agentRef must reference an existing Agent CR in the specified namespace
  • defaultThreshold validated to 0-1 range
  • Metric threshold validated to 0-1 range
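If CEL rules are chosen over a webhook, the mutual-exclusivity and range checks above could be expressed as `x-kubernetes-validations` entries on the spec schema — a sketch only (with kubebuilder, the equivalent `+kubebuilder:validation:XValidation` markers would generate these):

```yaml
x-kubernetes-validations:
  - rule: "!(has(self.dataset) && has(self.scenarios))"
    message: "dataset and scenarios are mutually exclusive"
  - rule: "!has(self.defaultThreshold) || (self.defaultThreshold >= 0.0 && self.defaultThreshold <= 1.0)"
    message: "defaultThreshold must be between 0 and 1"
```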

Reconciliation

  • Creating an Experiment CR generates ConfigMap + TestWorkflow + TestTrigger
  • Updating an Experiment CR updates generated resources accordingly
  • Deleting an Experiment CR cascades deletion via ownerReferences
  • TestWorkflow correctly chains phase templates based on data source mode
  • Agent URL resolved from agentRef (Agent CR status or service DNS)
  • Experiment JSON correctly serialized into ConfigMap from inline scenarios

Trigger

  • TestTrigger created when spec.trigger.enabled: true
  • TestTrigger watches correct Deployment (derived from agent reference)
  • TestTrigger not created when spec.trigger.enabled: false or omitted
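A generated TestTrigger could look roughly like the following — a sketch based on Testkube's TestTrigger schema, with the target Deployment name assumed to match the agent name:

```yaml
apiVersion: tests.testkube.io/v1
kind: TestTrigger
metadata:
  name: weather-agent-evaluation-trigger
  namespace: testkube
spec:
  resource: deployment
  resourceSelector:
    name: weather-agent          # derived from spec.agentRef
    namespace: sample-agents
  event: modified                # from spec.trigger.event
  action: run
  execution: testworkflow
  testSelector:
    name: weather-agent-evaluation-workflow
  concurrencyPolicy: allow       # from spec.trigger.concurrencyPolicy
```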

Status

  • Status subresource updated with reconciliation conditions
  • Generated resource names tracked in status
  • Last execution status populated (stretch goal)

Testing & Quality

  • Unit tests for reconciliation logic (envtest or mocked client)
  • E2E test: create Experiment CR → verify generated resources
  • golangci-lint passing
  • CRD validation tests

Deployment

  • Helm chart with CRD, operator Deployment, RBAC
  • Integration into Tilt local development environment
  • Docker image build via Makefile

Implementation Status

| Deliverable | Status | Notes |
| --- | --- | --- |
| Experiment CRD specification | 🔲 Pending | |
| Operator scaffolding (Operator SDK) | 🔲 Pending | |
| Reconciler: ConfigMap generation | 🔲 Pending | |
| Reconciler: TestWorkflow generation | 🔲 Pending | |
| Reconciler: TestTrigger generation | 🔲 Pending | |
| Status reporting | 🔲 Pending | |
| Garbage collection (ownerReferences) | 🔲 Pending | |
| Validation webhook | 🔲 Pending | |
| Helm chart | 🔲 Pending | |
| Unit tests | 🔲 Pending | |
| E2E tests | 🔲 Pending | |
| Tilt integration | 🔲 Pending | |

References

  • Issue Testbench Operator #21 — Original operator request
  • agent-runtime-operator — Reference operator implementation (Go + Operator SDK)
  • deploy/local/example-workflow.yaml — Current manual TestWorkflow definition
  • deploy/local/experiment.yaml — Current manual experiment ConfigMap
  • deploy/local/example-workflow-trigger.yaml — Current manual TestTrigger
  • chart/templates/ — Existing TestWorkflowTemplate Helm templates

