
Testbench Operator: Reconcile Experiment CRD to TestWorkflows #27

@fmallmann

Description


Goal: Implement a Kubernetes operator that watches Experiment Custom Resources and reconciles them into fully functional Testkube TestWorkflows executing the 4-phase evaluation pipeline (setup → run → evaluate → publish).

Description:
Today, running the testbench evaluation pipeline in Kubernetes requires manually creating a ConfigMap with the experiment definition, a TestWorkflow chaining the phase templates, and optionally a TestTrigger for automatic execution on agent changes (see deploy/local/). This story introduces a testbench-operator — analogous to the agent-runtime-operator — that automates this entire process. Users define an Experiment CR with their evaluation configuration, and the operator generates all required Kubernetes resources.

Related: #21


Key Deliverables

1. Experiment CRD Specification

Define an Experiment Custom Resource that captures all evaluation parameters. The CRD supports two data source modes: inline scenarios (defined directly in the spec) and external datasets (loaded from S3 or other sources via the setup phase).

```yaml
apiVersion: testbench.agentic-layer.ai/v1alpha1
kind: Experiment
metadata:
  name: weather-agent-evaluation
  namespace: testkube
spec:
  # Reference to the agent under test
  agentRef:
    name: weather-agent
    namespace: sample-agents

  # External dataset source (triggers setup phase)
  dataset:
    s3:
      bucket: evaluation-datasets
      key: weather-agent/dataset.csv
    # Alternative: url: "http://data-server:8000/dataset.csv"

  # LLM configuration for evaluation
  llmAsAJudgeModel: "gemini-2.5-flash-lite"
  defaultThreshold: 0.9

  # Inline scenario definitions (alternative to dataset, skips setup phase)
  scenarios:
    - name: "Weather Query - New York"
      steps:
        - input: "What is the weather in New York?"
          reference:
            toolCalls:
              - name: get_weather
                args:
                  city: "New York"
          metrics:
            - metricName: AgentGoalAccuracyWithoutReference
            - metricName: ToolCallAccuracy
            - metricName: TopicAdherence
              threshold: 0.8
              parameters:
                mode: precision

  # Automatic trigger configuration
  trigger:
    enabled: true
    event: modified       # Trigger on agent deployment changes
    concurrencyPolicy: allow
```

Key design decisions:

  • agentRef references an Agent CR (resolved to its A2A endpoint by the operator)
  • dataset and scenarios are mutually exclusive — if dataset is set, the setup phase runs; if scenarios is set inline, the operator generates the experiment.json ConfigMap directly
  • trigger controls automatic TestTrigger creation, watching the referenced agent's Deployment

2. Operator Implementation (Go + Operator SDK)

Built with Go and Operator SDK, consistent with agent-runtime-operator.

Reconciliation logic — on each Experiment CR change, the operator ensures:

| Resource | Purpose | Condition |
| --- | --- | --- |
| ConfigMap (`{name}-experiment`) | Stores serialized `experiment.json` | Always (from inline scenarios, or as an empty placeholder for the setup phase) |
| TestWorkflow (`{name}-workflow`) | Chains phase templates: setup → run → evaluate → publish → visualize | Always |
| TestTrigger (`{name}-trigger`) | Watches the agent Deployment for changes and triggers the workflow | Only when `spec.trigger.enabled: true` |

Phase template selection:

  • If spec.dataset is set → include setup-template as first phase (downloads external dataset)
  • If spec.scenarios is set → skip setup, inject experiment.json via ConfigMap
  • Always include: run-template, evaluate-template, publish-template, visualize-template
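The selection rule above is small enough to sketch as a pure function — a minimal illustration, assuming the template names match the existing `chart/templates/` definitions:

```go
package main

import "fmt"

// selectPhases returns the ordered TestWorkflowTemplate names for an
// Experiment based on its data source mode. Template names are assumed
// to match the existing chart/templates/ definitions.
func selectPhases(hasDataset bool) []string {
	phases := []string{}
	if hasDataset {
		// External dataset: download it first via the setup template.
		phases = append(phases, "setup-template")
	}
	// The core pipeline phases are always included.
	return append(phases,
		"run-template", "evaluate-template", "publish-template", "visualize-template")
}

func main() {
	fmt.Println(selectPhases(true))  // dataset mode: setup runs first
	fmt.Println(selectPhases(false)) // inline scenarios: setup skipped
}
```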

Config parameter mapping:

  • spec.agentRef → resolved to agent URL → run-template.config.agentUrl
  • spec.dataset.s3 → setup-template.config.bucket + setup-template.config.key
  • spec.llmAsAJudgeModel → embedded in experiment.json
  • OTEL endpoint → inherited from cluster ConfigMap (otel-config)
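Putting the mapping together, the generated TestWorkflow could look roughly like this — a sketch, not the final manifest: the step/template layout follows Testkube's TestWorkflow schema, and the resolved agent URL (including service port) is a placeholder assumption:

```yaml
apiVersion: testworkflows.testkube.io/v1
kind: TestWorkflow
metadata:
  name: weather-agent-evaluation-workflow
  namespace: testkube
spec:
  steps:
    - name: setup                 # only present when spec.dataset is set
      template:
        name: setup-template
        config:
          bucket: evaluation-datasets      # from spec.dataset.s3.bucket
          key: weather-agent/dataset.csv   # from spec.dataset.s3.key
    - name: run
      template:
        name: run-template
        config:
          # resolved from spec.agentRef; hostname/port shown here are illustrative
          agentUrl: http://weather-agent.sample-agents.svc.cluster.local:8080
    - name: evaluate
      template:
        name: evaluate-template
    - name: publish
      template:
        name: publish-template
    - name: visualize
      template:
        name: visualize-template
```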

3. Status Reporting

The Experiment CR status subresource reports reconciliation and workflow state:

```yaml
status:
  conditions:
    - type: Ready
      status: "True"
      reason: Reconciled
      message: "All resources created successfully"
    - type: WorkflowReady
      status: "True"
      reason: Created
  generatedResources:
    configMap: weather-agent-evaluation-experiment
    testWorkflow: weather-agent-evaluation-workflow
    testTrigger: weather-agent-evaluation-trigger
  lastExecution:
    id: "exec-abc123"
    status: "passed"
    timestamp: "2026-02-23T10:00:00Z"
```

4. Garbage Collection & Ownership

  • All generated resources use ownerReferences pointing to the Experiment CR
  • Deleting an Experiment CR cascades deletion to ConfigMap, TestWorkflow, and TestTrigger
  • Updates to the Experiment CR trigger re-reconciliation (update existing resources)
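For illustration, a generated ConfigMap carrying its owner reference might look like this (the `uid` is filled in from the live Experiment object at creation time; controller-runtime's `controllerutil.SetControllerReference` produces exactly this shape):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: weather-agent-evaluation-experiment
  namespace: testkube
  ownerReferences:
    - apiVersion: testbench.agentic-layer.ai/v1alpha1
      kind: Experiment
      name: weather-agent-evaluation
      uid: <experiment-uid>        # set by the operator from the live CR
      controller: true
      blockOwnerDeletion: true
data:
  experiment.json: |
    { "...": "serialized experiment definition" }
```

Note that owner references only cascade within a namespace, so generated resources must live in the same namespace as the Experiment CR.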

5. Helm Chart Integration

  • CRD installed via Helm chart (following agent-runtime-operator pattern)
  • Operator Deployment with RBAC (ServiceAccount, ClusterRole, ClusterRoleBinding)
  • Webhook configuration for CRD validation (optional, stretch goal)
  • Values configurable: image, namespace, resource limits, testworkflow template names

Acceptance Criteria

CRD & Validation

  • Experiment CRD schema defined with OpenAPI validation
  • dataset and scenarios are mutually exclusive (validation webhook or CEL rule)
  • agentRef must reference an existing Agent CR in the specified namespace
  • defaultThreshold validated to 0-1 range
  • Metric threshold validated to 0-1 range
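If CEL rules are chosen over a webhook, the mutual-exclusivity and range checks above could be expressed as `x-kubernetes-validations` entries on the spec schema — a sketch only (with kubebuilder, the equivalent `+kubebuilder:validation:XValidation` markers would generate these):

```yaml
x-kubernetes-validations:
  - rule: "!(has(self.dataset) && has(self.scenarios))"
    message: "dataset and scenarios are mutually exclusive"
  - rule: "!has(self.defaultThreshold) || (self.defaultThreshold >= 0.0 && self.defaultThreshold <= 1.0)"
    message: "defaultThreshold must be between 0 and 1"
```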

Reconciliation

  • Creating an Experiment CR generates ConfigMap + TestWorkflow + TestTrigger
  • Updating an Experiment CR updates generated resources accordingly
  • Deleting an Experiment CR cascades deletion via ownerReferences
  • TestWorkflow correctly chains phase templates based on data source mode
  • Agent URL resolved from agentRef (Agent CR status or service DNS)
  • Experiment JSON correctly serialized into ConfigMap from inline scenarios

Trigger

  • TestTrigger created when spec.trigger.enabled: true
  • TestTrigger watches correct Deployment (derived from agent reference)
  • TestTrigger not created when spec.trigger.enabled: false or omitted
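A generated TestTrigger could look roughly like the following — a sketch based on Testkube's TestTrigger schema, with the target Deployment name assumed to match the agent name:

```yaml
apiVersion: tests.testkube.io/v1
kind: TestTrigger
metadata:
  name: weather-agent-evaluation-trigger
  namespace: testkube
spec:
  resource: deployment
  resourceSelector:
    name: weather-agent          # derived from spec.agentRef
    namespace: sample-agents
  event: modified                # from spec.trigger.event
  action: run
  execution: testworkflow
  testSelector:
    name: weather-agent-evaluation-workflow
  concurrencyPolicy: allow       # from spec.trigger.concurrencyPolicy
```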

Status

  • Status subresource updated with reconciliation conditions
  • Generated resource names tracked in status
  • Last execution status populated (stretch goal)

Testing & Quality

  • Unit tests for reconciliation logic (envtest or mocked client)
  • E2E test: create Experiment CR → verify generated resources
  • golangci-lint passing
  • CRD validation tests

Deployment

  • Helm chart with CRD, operator Deployment, RBAC
  • Integration into Tilt local development environment
  • Docker image build via Makefile

Implementation Status

| Deliverable | Status | Notes |
| --- | --- | --- |
| Experiment CRD specification | 🔲 Pending | |
| Operator scaffolding (Operator SDK) | 🔲 Pending | |
| Reconciler: ConfigMap generation | 🔲 Pending | |
| Reconciler: TestWorkflow generation | 🔲 Pending | |
| Reconciler: TestTrigger generation | 🔲 Pending | |
| Status reporting | 🔲 Pending | |
| Garbage collection (ownerReferences) | 🔲 Pending | |
| Validation webhook | 🔲 Pending | |
| Helm chart | 🔲 Pending | |
| Unit tests | 🔲 Pending | |
| E2E tests | 🔲 Pending | |
| Tilt integration | 🔲 Pending | |

References

  • Issue Testbench Operator #21 — Original operator request
  • agent-runtime-operator — Reference operator implementation (Go + Operator SDK)
  • deploy/local/example-workflow.yaml — Current manual TestWorkflow definition
  • deploy/local/experiment.yaml — Current manual experiment ConfigMap
  • deploy/local/example-workflow-trigger.yaml — Current manual TestTrigger
  • chart/templates/ — Existing TestWorkflowTemplate Helm templates

