
Add Test-Driven as the second autoloop strategy (red/green/refactor playbook) #184

@mrjf

Description

Summary

Ship a second strategy — Test-Driven — alongside AlphaEvolve. Where AlphaEvolve is for optimization (evolve a self-contained artifact against a scalar fitness), Test-Driven is for specification — pin the desired behaviour in a failing test, then drive implementation until it passes. Between the two strategies, autoloop covers most of its useful problem space.

Why

Two concrete pain points in tsessebe today:

  1. PRs like #174 ([Autoloop] Add timedelta_range (pandas feature port)) land with red CI because the agent writes tests against methods that don't exist, or writes code that the sandbox can't type-check. Under a TDD strategy, the iteration starts by pinning behaviour in a failing test, and acceptance ends with CI green — the two halves compose directly.
  2. The pandas-port program is half-TDD already. The agent is supposed to write tests for each ported feature. Making TDD explicit turns that from prose into structure: red → green → refactor, with a test harness tracked in state.

Bug-fixing becomes a first-class autoloop use case: a future tsb-bugfix program picks a bug from an issue label, writes the failing repro, makes it green. The strategy handles the rest.

Scope

Mirrors #181's shape so the two strategies feel like siblings:

.autoloop/strategies/
├── alphaevolve/                       # already shipped
└── test-driven/                       # NEW
    ├── strategy.md
    ├── CUSTOMIZE.md
    └── prompts/
        ├── write-test.md
        ├── make-green.md
        └── refactor.md

No changes to .github/workflows/autoloop.md are needed — the "Strategy discovery" prompt section added in #181 already handles any strategy whose playbook is pointed at from program.md's ## Evolution Strategy section.

Content to ship — .autoloop/strategies/test-driven/

strategies/test-driven/strategy.md

# Test-Driven Strategy — <CUSTOMIZE: program-name>

This file is the **runtime playbook** for this program. The autoloop agent reads it at the start of every iteration and follows it literally. It supersedes the generic "Analyze and Propose" / "Accept or Reject" steps in the default autoloop iteration loop — all other steps (state read, branch management, state file updates, CI gating) still apply.

## Problem framing

<CUSTOMIZE: 2–4 sentences describing what the program specifies. What is the target artifact? What does "correct" mean — are we implementing an API against a reference, fixing bugs against a repro, adding behaviour against a spec document? Name the source of truth the agent checks whenever ambiguity arises (e.g. "pandas' `Series.sort_values` semantics are authoritative; when tsb behaviour diverges, pandas wins unless the divergence is documented").>

## Per-iteration loop

### Step 1. Load state

1. Read `program.md` — Goal, Target, Evaluation.
2. Read the program's state file from the repo-memory folder (`{program-name}.md`). Locate the `## ✅ Test Harness` subsection. If it does not exist, create it using the schema in [Test Harness schema](#test-harness-schema).
3. Read <CUSTOMIZE: the source-of-truth references the agent consults — e.g. pandas docs for the current target function, the issue whose bug we're fixing, a spec document in docs/>.

### Step 2. Pick target

Pick **one** unit of work — a single behaviour to pin or fix. Size it so that the entire red → green → refactor cycle fits in one iteration:

- <CUSTOMIZE: concrete guidance for how to size work in this program. E.g. "one method signature" for an API-porting program, "one failing repro" for a bug-fixing program, "one spec bullet" for a spec-driven program.>

Deterministic overrides (apply *before* free choice):

- If the Test Harness has any entry with status `failing` that is **not** marked `blocked`, pick that one. A failing test is an obligation — you don't add new tests while old ones are still red.
- If the most recent 3 iterations were all `error` (validity pre-check failed, test didn't even compile), force a `rethink-test` iteration — the problem is the test, not the implementation. See Step 4's rethink branch.

Record the chosen target in the iteration's reasoning.

### Step 3. Red — write the failing test

Use `strategy/prompts/write-test.md` as framing.

Before writing the test, state (in visible reasoning):

1. What behaviour you are pinning. One sentence, specific.
2. The source-of-truth reference (pandas doc, spec bullet, issue reproducer).
3. The minimum set of assertions that captures "this is correct" without over-specifying implementation details.
4. Edge cases the test must include (empty inputs, NaN, dtype boundaries — whatever's applicable).

Then write the test file (or append to an existing one). Before continuing: **run the test and confirm it fails with a useful error message**. If it passes already, you picked wrong — either the target is already implemented (pick a different one) or the test is too weak (rewrite).

Record the new test in the Test Harness with status `failing` and the iteration number.
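A minimal sketch of the red-phase shape, using plain assertions instead of a test runner for brevity. `sortValues`, its option type, and the throwing stub are all hypothetical stand-ins for a not-yet-implemented target:

```typescript
// Hypothetical target: a tsb-style sortValues that doesn't exist yet.
// The stub lets the red phase produce an inspectable failure message.
type NaPosition = "first" | "last";

function sortValues(xs: number[], opts: { naPosition: NaPosition }): number[] {
  throw new Error(`sortValues: naPosition "${opts.naPosition}" not implemented`);
}

// One behaviour, named after the behaviour, not the method.
function sortValuesOrdersNaNLastWhenNaPositionIsLast(): string | null {
  try {
    const out = sortValues([NaN, 2, 1], { naPosition: "last" });
    if (!Number.isNaN(out[2])) return "expected NaN at index 2, got " + out[2];
    return null; // green
  } catch (e) {
    return (e as Error).message; // red, with a message a human can act on
  }
}

const failure = sortValuesOrdersNaNLastWhenNaPositionIsLast();
console.log(failure); // reads as "this behaviour is missing", not a stack trace
```

The useful property here is that the failure message names the missing behaviour, which is exactly what the "useful error message" requirement asks for.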

### Step 4. Green — implement until the test passes

Use `strategy/prompts/make-green.md` as framing.

Before writing any implementation code, state:

1. Parent state of the target file(s) — one-line summary of what exists now.
2. The **minimum** change needed to make the failing test pass. Resist scope creep; the test defines the requirement, nothing else.
3. Which invariants of the existing tests must continue to hold (list them).

Then write the implementation. Run the full test suite (not just the new test): **every existing test must still pass, and the new one must now pass too.**

If the test still fails after implementation:
- **Attempt ≤ 3**: re-analyze what's missing and try again (stay in Step 4).
- **Attempt ≥ 4**: consider that the test itself may be wrong — re-enter the `rethink-test` branch. Read the source of truth again, weaken/rewrite the test to match the *real* spec, then restart Step 4. Document the change in the Test Harness entry as a `test-revised` note.
- **After 5 total attempts in the same iteration**: stop. Mark the target `blocked` in the Test Harness with a `blocked_reason`. Set `paused: true` on the state file with `pause_reason: "td-stuck: <target>"`. End the iteration.
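The attempt budget above can be read as a small decision function. A sketch: the phase names and thresholds are the playbook's; the function itself is illustrative, not part of any shipped file:

```typescript
// What to do after the nth failed attempt in Step 4 (n counts total
// attempts in this iteration, per the budget rules above).
type Step4Action = "retry" | "rethink-test" | "block";

function afterFailedAttempt(n: number): Step4Action {
  if (n >= 5) return "block"; // budget exhausted: mark blocked, pause
  if (n >= 4) return "rethink-test"; // suspect the test, re-read the spec
  return "retry"; // attempts 1-3: re-analyze, stay in Step 4
}
```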

### Step 5. Refactor (optional, gated on green)

Only if the test suite is fully green, consider a refactor. Use `strategy/prompts/refactor.md` as framing.

Pick a refactor only if you can name a concrete clarity/complexity improvement. Cosmetic changes are not refactors — they are diffs in search of a justification. If nothing is worth refactoring, skip this step. Record the choice in reasoning either way.

After any refactor, the full test suite must still be green. If it isn't, revert the refactor and continue without it.

### Step 6. Evaluate

Run the evaluation command from `program.md`. For most TDD programs this is simply "the full test suite passes" — a boolean, not a scalar. Emit `{"metric": <count>, "passing": N, "failing": 0}`, where `metric` is the passing-test count (higher is better).

Some TDD programs have a secondary metric (bundle size, coverage percentage). In that case `metric` can be the secondary metric, with the hard constraint that `failing == 0` — no reduction in coverage counts as progress if tests are red.
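A sketch of the payload construction, assuming the suite run yields pass/fail counts. The field names follow the playbook's JSON shape; `SuiteResult` and the function are illustrative:

```typescript
// Build the Step 6 evaluation payload from a suite run.
interface SuiteResult { passing: number; failing: number }

function evaluationPayload(r: SuiteResult, secondaryMetric?: number): string {
  // Hard constraint: nothing counts as progress while tests are red.
  if (r.failing !== 0) throw new Error(`suite is red: ${r.failing} failing`);
  return JSON.stringify({
    metric: secondaryMetric ?? r.passing, // default: passing count, higher is better
    passing: r.passing,
    failing: 0,
  });
}

console.log(evaluationPayload({ passing: 42, failing: 0 }));
// → {"metric":42,"passing":42,"failing":0}
```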

### Step 7. Update the Test Harness

Append the iteration's actions to `## ✅ Test Harness`:

- New test → add entry with status `passing` (it was just made green).
- Existing failing test became green → flip its status.
- A test became blocked → set status `blocked`, fill `blocked_reason`.

Enforce size discipline: keep at most <CUSTOMIZE: harness_size_cap, default 100> test entries visible; older entries can collapse into compressed range summaries (`### Tests 40–80 — ✅ passing (N batch additions for X feature): brief summary`).

### Step 8. Fold through to the default loop

Continue with the normal autoloop Step 5 (Accept or Reject → commit / discard, update state file's Machine State, Iteration History, Lessons Learned, etc.) as defined in the workflow. The only additional requirements from Test-Driven are:

- The Iteration History entry must include `phase` (red / green / refactor / rethink-test), `target`, `new_tests` count, `existing_tests_status` (all-green / regression-introduced-and-fixed).
- Lessons Learned additions should be phrased as *transferable heuristics* about the problem space (e.g. "Pandas' NaN-handling for `sort_values` treats NaN as always-last-or-first regardless of `ascending`; tsb implementations must branch on `naPosition` independently of the sort direction") — not iteration reports.

## Test Harness schema

The harness lives in the state file `{program-name}.md` on the `memory/autoloop` branch as a subsection:

```markdown
## ✅ Test Harness

> 🤖 *Managed by the Test-Driven strategy. One entry per pinned behaviour. Newest first.*

### <test-name or describe-block>  ·  gen <N>

- **Status**: ✅ passing / ❌ failing / 🚧 blocked
- **Target**: <file:line or target entity being specified>
- **Spec source**: <URL or reference — pandas doc / issue / spec bullet>
- **Added iteration**: <N>
- **Made green iteration**: <M> (if applicable)
- **Blocked reason**: <one-line, if blocked>
- **Notes**: <one sentence on what the test pins, not how>

---
```

Identifiers:
- Test names should match the test file's names verbatim so they're greppable.
- Status `blocked` means attempts have been exhausted and a human must intervene. The test is still present in the test file but may be `skip`ped (with a TODO comment pointing at the blocked entry here).

## Invariants the agent must not violate

- **Never loosen an existing test to make a new one pass.** If an existing test fails because of your implementation change, the change is wrong — fix the implementation, not the test.
- **Never skip a failing test** to get CI green (except via `blocked` with a recorded reason, and only after the 5-attempt budget).
- **Never delete a test.** Revise the assertions if the spec has genuinely changed (and document the change), but don't remove coverage.
- **Tests are the acceptance criterion.** CI green on the pushed commit is the accept signal (composes with #176's CI-gated acceptance). If CI has tests passing but the Test Harness shows any `failing`, the state file is out of sync — reconcile before ending the iteration.
- **Tests pin behaviour, not implementation.** Don't write tests that check private state, internal method names, or structural details a future refactor would legitimately change.

strategies/test-driven/prompts/write-test.md

# Write-test prompt — <CUSTOMIZE: program-name>

Framing for the **red** phase of an iteration. Read before writing the failing test.

---

You are writing a test that **pins desired behaviour**. The point is to capture a specification the implementation must satisfy — not to exhaustively probe the current implementation's structure.

## Domain knowledge

<CUSTOMIZE: 5–15 bullets with the high-leverage facts for *this* domain. Examples:
- What the source-of-truth reference is and where to find it (pandas docs URL, spec document path, issue number).
- The edge cases this problem domain is known to have (NaN in sort order, Unicode collation, timezone transitions, integer overflow boundaries).
- Conventions for this test suite (naming, imports, fixture helpers).
These should be things an expert in *this* problem would remind a junior engineer on day one.>

## How to write a good failing test

1. **One behaviour per test.** If you catch yourself writing `&&` in a single assertion, split the test.
2. **Name the test after the behaviour, not the method.** "`sortValues orders NaN last when naPosition is last`" is useful; "`test sort 3`" is not.
3. **Prefer property assertions over fixtures** for cross-cutting invariants (lengths, permutations, idempotence, ordering). Use concrete fixtures for the canonical happy path and the known-tricky cases.
4. **Test the behaviour, not the implementation.** Checking that `result.length === input.length` is a behaviour. Checking that `result._internalSortAlgorithm === "quicksort"` is implementation.
5. **Assert on both "what's there" and "what's not there"** where both matter. E.g. for a sort, both "elements are in order" AND "no elements lost or duplicated."
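Points 3-5 can be made concrete with a toy sort. Everything here is illustrative; the contrast is between assertions on behaviour and assertions on structure:

```typescript
// Toy implementation under test (any correct sort would do).
function sorted(xs: number[]): number[] {
  return [...xs].sort((a, b) => a - b);
}

const input = [3, 1, 2];
const result = sorted(input);

// Behaviour assertions: length preserved, elements in order, same multiset.
const lengthPreserved = result.length === input.length;
const inOrder = result.every((v, i) => i === 0 || result[i - 1] <= v);
const samePermutation =
  [...result].sort((a, b) => a - b).join(",") ===
  [...input].sort((a, b) => a - b).join(",");

// Implementation details (do NOT assert these): which algorithm .sort() uses,
// whether a private helper exists, how many comparisons were made.
console.log(lengthPreserved, inOrder, samePermutation); // true true true
```

All three checks survive any correct reimplementation of `sorted`, which is the test for whether an assertion pins behaviour rather than structure.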

## Red-phase checklist

Before moving on:

- The new test file compiles (type-check it, or use your test runner's dry-run mode if it has one).
- Running the test produces **one clear failure message** — not a parse error, not a compile error, not a stack trace from uninitialized state. The failure should read to a human as "this behaviour is missing" or "this behaviour is wrong in this specific way."
- The failure message names the expected vs. actual value in a form a future contributor can act on.
- The test would *still pass* under a reasonable future refactor of the implementation (no implementation-detail coupling).

## What the reasoning output must contain

Before writing the test:

- Target: what behaviour are you pinning?
- Spec source: where does the desired behaviour come from? URL / reference.
- Edge cases included: list the specific cases this test covers.
- Edge cases intentionally excluded: list what this test *doesn't* cover and why (separate tests, out of scope, etc.).

After writing the test:

- A "Red summary" line: 10–20 words, suitable for the Test Harness and Iteration History.
- The concrete failure message observed when the test runs.

strategies/test-driven/prompts/make-green.md

# Make-green prompt — <CUSTOMIZE: program-name>

Framing for the **green** phase of an iteration. Read before writing the implementation.

---

You are making a failing test pass with the **minimum** change. Scope creep is the enemy — the test defines the requirement, nothing else.

## Domain knowledge

<CUSTOMIZE: same facts as write-test.md. Keep them in sync when one is updated.>

## How to make a test pass without scope creep

1. **Re-read the failing test.** Don't skim. The exact assertions tell you exactly what must change.
2. **Identify the smallest code change** that would make the failing test pass without breaking any existing test. Name it concretely.
3. **Write only that change.** If a helper would make the code cleaner, note it for a later refactor iteration — don't add it now.
4. **Never modify existing tests to make the new test pass.** If the change you're considering breaks an existing test, something is wrong with your change, not with the old test.
5. **Run the full test suite, not just the new test.** Regressions in unrelated tests must be fixed before the iteration is accepted.

## Anti-patterns to avoid

- **Overfitting to the test.** Don't hard-code the test's expected value in the implementation. If the test expects `42` for input `6`, your implementation must compute `6 * 7` or equivalent, not `return 42`.
- **Speculative generality.** "While I'm here, let me also handle <edge case the test doesn't cover>." No — that edge case gets its own test.
- **Parallel implementations.** If the existing implementation has a branch your new behaviour doesn't fit into, think carefully before adding a `if (<new case>)` branch. Sometimes the right answer is to unify the code paths; sometimes it's to keep them separate. Don't default to "add another branch."
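The overfitting bullet, made concrete with a toy function. The names and values are illustrative:

```typescript
// Suppose the new test pins: f(6) === 42.

// Overfit: hard-codes the test's expected value.
const fOverfit = (_n: number): number => 42;

// Minimal but general: computes the relationship the test actually pins.
const fGeneral = (n: number): number => n * 7;

// Both pass the single pinned assertion...
console.log(fOverfit(6) === 42, fGeneral(6) === 42); // true true
// ...but only the general one survives the next pinned behaviour:
console.log(fOverfit(5) === 35, fGeneral(5) === 35); // false true
```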

## Green-phase checklist

Before moving on:

- The new test passes.
- Every existing test still passes.
- The code change is as small as possible while satisfying the above two.
- You can explain in one sentence *why* the change makes the test pass — not just *that* it does.

## What the reasoning output must contain

Before writing the implementation:

- Target file(s) and what exists there now (one-line summary).
- The minimum change you're about to make.
- Existing-test invariants that must continue to hold.

After writing the implementation:

- A "Green summary" line.
- Confirmation that the full suite is green (attach the relevant pass count).
- Any follow-up refactor candidates you noted but deliberately did *not* implement in this iteration.

strategies/test-driven/prompts/refactor.md

# Refactor prompt — <CUSTOMIZE: program-name>

Framing for the optional **refactor** phase. Read only after the green phase is complete and the full test suite is passing.

---

You are deciding whether — and how — to refactor. The bar is: can you name a concrete clarity or complexity improvement that a future reader of this code will thank you for? If not, skip the refactor.

## When to refactor

- Duplicated logic that just appeared — the change in this iteration introduced a near-copy of existing code. Deduplicate now, while the context is fresh.
- A function that grew past its natural boundaries — if the target file now has a function with more than one responsibility that used to have one, extract.
- Naming that drifted — the variable or function name no longer reflects what the code does after the change. Rename.

## When NOT to refactor

- The code looks fine. Skip.
- You want to rearrange for aesthetic preference. Skip.
- You want to add abstractions you might need later. Strongly skip — the test suite is your safety net; speculative abstraction destroys that safety net's value.
- The existing code is old and has nothing to do with this iteration. Not your job, not this iteration.

## Refactor-phase checklist

Before committing the refactor:

- Every test that was green at the end of the green phase is still green.
- You can summarize the refactor in one line ("Extracted `compareWithNaN` from inline into a private helper because it's now used in two places").
- The diff is smaller than the green phase's diff. If it's not, you're probably doing something that wants its own iteration (or its own issue).
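A sketch of the checklist's example refactor. The helper name `compareWithNaN` is taken from the checklist line above; the surrounding code is illustrative:

```typescript
// Extracted helper: NaN placement is decided independently of value order.
function compareWithNaN(a: number, b: number, naLast: boolean): number {
  const aNa = Number.isNaN(a);
  const bNa = Number.isNaN(b);
  if (aNa && bNa) return 0;
  if (aNa) return naLast ? 1 : -1;
  if (bNa) return naLast ? -1 : 1;
  return a - b;
}

// The two call sites that previously duplicated the NaN branching inline:
const naLastSorted = [2, NaN, 1].sort((a, b) => compareWithNaN(a, b, true));
const naFirstSorted = [2, NaN, 1].sort((a, b) => compareWithNaN(a, b, false));

console.log(naLastSorted);  // [1, 2, NaN]
console.log(naFirstSorted); // [NaN, 1, 2]
```

This is the "one-line summarizable" shape: extracted because it is now used in two places, with every previously green test still green afterwards.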

## What the reasoning output must contain

If you chose to refactor:

- The one-line improvement.
- Before/after of the specific thing you changed (not the whole file).
- Confirmation all tests still pass.

If you chose not to refactor:

- One sentence on why. This is a valid outcome — "no refactor needed" should be reportable with the same weight as a refactor.

strategies/test-driven/CUSTOMIZE.md

# Test-Driven — Customization Guide (read by the program-creator agent)

This file is **not** copied into programs. It tells the creator agent how to turn the generic Test-Driven template into a problem-specific strategy in a new program directory.

## When to pick this strategy

Pick Test-Driven when **all** of the following hold:

- The goal is defined by **behaviour that can be captured in tests** — an API to implement, a bug to fix, a spec to satisfy.
- Progress is additive: each iteration pins one more piece of behaviour and makes it work, without regressing the pieces already pinned.
- There is a **source of truth** for correctness — a reference implementation (e.g. pandas), a spec document, a failing repro on an issue.
- Validity is cheap (the test suite runs in seconds to a few minutes).

Do **not** pick Test-Driven for: optimizing a scalar metric (use AlphaEvolve), exploratory research without a clear spec, or pure refactoring tasks (tests should already exist before a refactor; TDD is for when they don't).

## What to copy

Copy these files from this template into the new program directory at `.autoloop/programs/<program-name>/strategy/`:

- `strategy.md` → `strategy/test-driven.md`
- `prompts/write-test.md` → `strategy/prompts/write-test.md`
- `prompts/make-green.md` → `strategy/prompts/make-green.md`
- `prompts/refactor.md` → `strategy/prompts/refactor.md`

The `## ✅ Test Harness` subsection is not a filesystem concept — it lives in the program's state file on `memory/autoloop`.

## What to customize

Every `<CUSTOMIZE: …>` marker must be resolved before enabling the program.

In `strategy/test-driven.md`:

- **Problem framing** — 2–4 sentences on the target artifact, source of truth for correctness, what makes a test "good" in this domain.
- **Step 2 target-sizing guidance** — the concrete unit of work per iteration (one method, one bug, one spec bullet).
- **Harness size cap** — default 100, bump for large porting efforts.

In `strategy/prompts/write-test.md`, `make-green.md`, `refactor.md`:

- **Domain knowledge block** — the facts an expert would put on a whiteboard. Source-of-truth URLs, edge-case conventions, test-suite conventions. Keep these three in sync: copy the domain block to all three files.

## What goes in program.md

Replace the program's `## Evolution Strategy` section (yes, "Evolution Strategy" is a misnomer for TDD but keeps the section consistent across strategies — consider renaming to `## Strategy` in a future pass) with a pointer block:

```markdown
## Evolution Strategy

This program uses the **Test-Driven** strategy. On every iteration, read `strategy/test-driven.md` and follow it literally — it supersedes the generic analyze/accept/reject steps in the default autoloop loop.

Support files:
- `strategy/test-driven.md` — the runtime playbook (red/green/refactor phases, rethink-test rule).
- `strategy/prompts/write-test.md` — framing for the red phase.
- `strategy/prompts/make-green.md` — framing for the green phase.
- `strategy/prompts/refactor.md` — framing for the optional refactor phase.

Test Harness lives in the state file on the `memory/autoloop` branch under the `## ✅ Test Harness` subsection (see the playbook for the schema).
```

## What NOT to put in the program directory

- Do not duplicate the Test Harness into a file in the program dir — it lives in the state file.
- Do not copy `CUSTOMIZE.md` — creator-time only.
- Do not invent new phases beyond red/green/refactor/rethink-test. If a program needs more structure, it probably wants a different strategy.

Authoring a TD-based program

  1. Create .autoloop/programs/<program-name>/ with program.md + any code/ scaffolding.
  2. Copy .autoloop/strategies/test-driven/{strategy.md,prompts/*} into .autoloop/programs/<program-name>/strategy/ (renaming strategy.md → test-driven.md).
  3. Resolve every <CUSTOMIZE: …> marker.
  4. Replace program.md's ## Evolution Strategy section with the pointer block above.

Composes with

Acceptance

  • .autoloop/strategies/test-driven/ exists with strategy.md, CUSTOMIZE.md, prompts/write-test.md, prompts/make-green.md, prompts/refactor.md.
  • At least one proof-of-concept TD-based program exists (suggestion: a tsb-pandas-tests program that ports pandas' own test suite for one function at a time — target src/core/series.ts, evaluation = test suite passes, increment = count of pandas tests now mirrored and passing).
  • The Strategy discovery prompt from #181 (Add strategy system to autoloop — ship AlphaEvolve as the first specialized iteration playbook) picks up the TD strategy without any further wiring (sanity check: a program with strategy/test-driven.md runs under TD; a program with no strategy/ dir runs under the default loop).

Out of scope

  • A CLI for picking strategies at program-creation time.
  • Cross-strategy harnesses (e.g., an AlphaEvolve program that also tracks a test harness). Each program picks one strategy.
  • Mutation testing / fuzz-like TD variants. Future strategies.
