111 changes: 111 additions & 0 deletions .autoloop/strategies/test-driven/CUSTOMIZE.md
# Adopting the Test-Driven strategy for a new program

This file is a **creator-time guide** — it is read by the maintainer (or a "create program" agent) **once**, when authoring a new program that wants to use Test-Driven. It is **not** copied into the program's `strategy/` directory and is **not** read by the iteration agent at runtime.

If you are an iteration agent and have somehow ended up here: stop, go back to `strategy/test-driven.md` in the program directory, and follow that.

## When to pick Test-Driven

Test-Driven is the right strategy when **all** of the following are true:

- The program is about **specifying behaviour**, not optimizing a metric. The question is "is this correct?", not "is this faster?".
- "Correct" can be expressed as **executable assertions** — unit tests, property tests, integration tests, repros — that run as part of CI.
- Iterations **accumulate**: each iteration pins one more behaviour (or fixes one more bug), and the work product grows monotonically. You're not searching for a single best artifact; you're building up a body of pinned behaviour.
- There exists a **source of truth** the agent can consult when ambiguity arises (a reference implementation, a spec document, an issue with a reproducer, etc.).

If the program is "make this faster" or "minimize this scalar", **do not use Test-Driven**. Use AlphaEvolve (`.autoloop/strategies/alphaevolve/`).

If the program is genuinely "do whatever the agent thinks is best", neither strategy fits — use the default loop.

### Canonical use cases

- **API porting** (e.g., the pandas → tsb migration): each iteration pins one method's behaviour from the reference and implements it. The Test Harness becomes the coverage map.
- **Bug fixing** (e.g., a future `tsb-bugfix` program): each iteration picks a bug from a label, writes the failing repro as a test, makes it green.
- **Spec-driven development**: each iteration pins one bullet from a spec document as a test, then implements it.

## Steps to adopt

1. Create `.autoloop/programs/<program-name>/` with the usual layout: a `program.md`, and any source-of-truth references the program needs (a `docs/` directory, a pinned spec, etc.).
2. Copy the strategy template into the program:

```bash
mkdir -p .autoloop/programs/<program-name>/strategy/prompts
cp .autoloop/strategies/test-driven/strategy.md \
.autoloop/programs/<program-name>/strategy/test-driven.md
cp .autoloop/strategies/test-driven/prompts/write-test.md \
.autoloop/programs/<program-name>/strategy/prompts/write-test.md
cp .autoloop/strategies/test-driven/prompts/make-green.md \
.autoloop/programs/<program-name>/strategy/prompts/make-green.md
cp .autoloop/strategies/test-driven/prompts/refactor.md \
.autoloop/programs/<program-name>/strategy/prompts/refactor.md
```

3. Resolve every `<CUSTOMIZE: …>` marker in `strategy/test-driven.md` and the three prompt files. See the marker-by-marker guidance below.
4. Add the `## Evolution Strategy` pointer block to `program.md` (template below).
5. Sanity-check: `grep -R "<CUSTOMIZE" .autoloop/programs/<program-name>/strategy/` should return **nothing**.

## The pointer block for `program.md`

Replace (or add) `program.md`'s `## Evolution Strategy` section with exactly this:

```markdown
## Evolution Strategy

This program uses the **Test-Driven** strategy. On every iteration, read `strategy/test-driven.md` and follow it literally — it supersedes the generic analyze/accept/reject steps in the default autoloop loop.

Support files:
- `strategy/test-driven.md` — the runtime playbook (red → green → refactor loop, Test Harness rules).
- `strategy/prompts/write-test.md` — framing for the **red** phase: what makes a good failing test for this problem.
- `strategy/prompts/make-green.md` — framing for the **green** phase: minimum-change discipline.
- `strategy/prompts/refactor.md` — framing for the optional **refactor** phase, gated on a green suite.

Test Harness state lives in the state file on the `memory/autoloop` branch under the `## ✅ Test Harness` subsection (see the playbook for the schema).
```

## Marker-by-marker guidance

### `strategy.md` markers

- **`# Test-Driven Strategy — <CUSTOMIZE: program-name>`** — the program name as it appears in the file path.
- **`## Problem framing`** — 2–4 sentences. State the artifact under test, what "correct" means, and the source of truth. The agent reads this every iteration; make it dense and unambiguous about which reference wins on conflict.
- **Step 1 source-of-truth list** — name the *specific* references the agent should consult. Be concrete: not "pandas docs", but "pandas docs for the function currently in scope, plus the corresponding numpy doc when behaviour is delegated to numpy". Vague references mean the agent will skip them.
- **Step 2 sizing guidance** — the most important customization. Make it impossible to pick work too big to finish in one iteration. Examples:
- API porting: "one method signature, with all overloads listed in the reference doc, but no more than one method per iteration."
- Bug fixing: "one bug per iteration, identified by issue number; if the bug decomposes into sub-bugs, file new issues for the sub-bugs and pick one."
- Spec-driven: "one MUST-bullet from the spec; SHOULD-bullets get separate iterations once all MUSTs are green."
- **`harness_size_cap`** — default 100 is fine for most programs. Lower it if your tests are large and the state file balloons; raise it only if you have a reason older entries should stay individually visible.

### `prompts/write-test.md` markers

This prompt frames the **red** phase. Customize:

- **Test framework setup** — the exact command to run a single test, the file naming convention, the import paths. The agent reads this every iteration; don't make it guess.
- **Domain knowledge** — anything about the source of truth that's easy to get wrong. (E.g. "pandas treats NaN as always-last-or-first regardless of `ascending`; the test must include both `ascending=True` and `ascending=False` with NaN to pin this." or "the issue's repro is in Python — translate carefully, NaN ≠ undefined.")
- **Anti-patterns** — failure modes you've seen the agent hit before in this program (over-specifying implementation details, asserting on internal data structures, snapshot tests of huge outputs).

### `prompts/make-green.md` markers

This prompt frames the **green** phase. Customize:

- **Minimum-change examples** — 2–3 worked examples of "this is the minimum change to make this test pass". Counter-examples are also helpful: "the agent was tempted to also handle <X>; that gets its own test."
- **Don't-modify list** — the files / tests / fixtures that the green phase must never touch in pursuit of a passing test (typically: existing tests, the source of truth, the state file).
- **Domain knowledge** — same facts as `write-test.md`. Keep the two files in sync when one is updated; the agent reads both every iteration.

### `prompts/refactor.md` markers

This prompt frames the optional **refactor** phase. Customize:

- **Refactor vocabulary** — 5–10 concrete refactoring moves that make sense for this program (extract helper, collapse duplicate dispatch, replace `switch` with table). Acts as a menu the agent samples from.
- **What's not a refactor** — explicit list of cosmetic-only changes the agent must reject as "not a refactor" (renaming for taste, formatting, comment polish).
- **Stop conditions** — when the agent should *not* attempt a refactor this iteration (suite is green but only just; the previous iteration was also a refactor; the file was rewritten substantially this iteration).

## A tiny worked example

Suppose you are creating `tsb-bugfix` to chew through bugs filed against tsb.

- Source of truth per iteration: the issue body and any reproducer in it.
- Sizing: one issue per iteration. If an issue has multiple unrelated repros, the agent files sub-issues and picks one.
- Test-writing convention: each repro becomes a `tests/regressions/issue-<NNN>.test.ts` file, named after the issue number, with the issue link in a top-of-file comment.
- Acceptance: the new regression test passes; the full suite is still green; the issue can be closed by referencing the merged commit.

That's the kind of fill-in to aim for — the agent should never have to guess the convention, the source of truth, or the size of the work.
53 changes: 53 additions & 0 deletions .autoloop/strategies/test-driven/prompts/make-green.md
# Make-green prompt — <CUSTOMIZE: program-name>

Framing for the **green** phase of an iteration. Read before writing the implementation.

---

You are making a failing test pass with the **minimum** change. Scope creep is the enemy — the test defines the requirement, nothing else.

## Domain knowledge

<CUSTOMIZE: same facts as write-test.md. Keep them in sync when one is updated.>

- <CUSTOMIZE: e.g. "pandas treats NaN as always-last-or-first regardless of `ascending` — the test must cover both directions.">
- <CUSTOMIZE: e.g. "Reproducers in issues may use Python idioms that don't translate one-to-one to TypeScript — translate the *behaviour*, not the syntax.">
- <CUSTOMIZE: e.g. "Empty-input behaviour is rarely documented in the reference; check the reference's source if the doc is silent before guessing.">

## How to make a test pass without scope creep

1. **Re-read the failing test.** Don't skim. The exact assertions tell you exactly what must change.
2. **Identify the smallest code change** that would make the failing test pass without breaking any existing test. Name it concretely.
3. **Write only that change.** If a helper would make the code cleaner, note it for a later refactor iteration — don't add it now.
4. **Never modify existing tests to make the new test pass.** If the change you're considering breaks an existing test, something is wrong with your change, not with the old test.
5. **Run the full test suite, not just the new test.** Regressions in unrelated tests must be fixed before the iteration is accepted.

## Files you must not touch in the green phase

<CUSTOMIZE: replace with the program's don't-modify list — typically the existing tests, the source-of-truth references, the state file. Be exhaustive; the agent will read this every iteration.>

- `tests/**` other than the test file added in Step 3 of this iteration.
- <CUSTOMIZE: e.g. "the reference doc snapshots in `.autoloop/programs/<program-name>/reference/`">
- The state file on the `memory/autoloop` branch — that gets updated in Step 7, not the green phase.

## Anti-patterns to avoid

- ❌ **Overfitting to the test.** Don't hard-code the test's expected value in the implementation. If the test expects `42` for input `6`, your implementation must compute `6 * 7` or equivalent, not `return 42`.
- ❌ **Speculative generality.** "While I'm here, let me also handle <edge case the test doesn't cover>." No — that edge case gets its own test.
- ❌ **Parallel implementations.** If the existing implementation has a branch your new behaviour doesn't fit into, think carefully before adding an `if (<new case>) { ... } else { <old code> }` next to it. Often the right answer is to fold the new case into the existing dispatch, not to grow a parallel one.
- ❌ **Weakening tests to make the change smaller.** If the test is right and the implementation is hard, the implementation is what's wrong. The test moves only via an explicit `rethink-test` iteration.
- ❌ **Skipping the full-suite run.** "The new test passes" is not the bar. "The new test passes *and nothing else broke*" is the bar.

## What the reasoning output must contain

Before writing the implementation:

- **Parent state**: one-line summary of what the target file does today.
- **Minimum change**: a concrete description of the smallest diff that would make the failing test pass.
- **Invariants to preserve**: the named tests / behaviours that must keep working.

After writing the implementation:

- A "Green summary" line: 10–20 words, suitable for the Iteration History.
- Confirmation that the full test suite is green, with the test count (`N passing, 0 failing`).
- Any new lesson worth promoting to the state file's Lessons Learned (phrased as a transferable heuristic, not an iteration report).
64 changes: 64 additions & 0 deletions .autoloop/strategies/test-driven/prompts/refactor.md
# Refactor prompt — <CUSTOMIZE: program-name>

Framing for the optional **refactor** phase of an iteration. Read only after the suite is fully green.

---

Refactoring is a *gated* step. You earn the right to refactor by getting to green with no regressions; you don't earn the right to refactor every iteration. If nothing is worth refactoring, skip this step and say so explicitly in the iteration's reasoning.

A refactor in Test-Driven has one rule that overrides everything else: **the test suite must remain green, with the same set of tests, before and after the refactor**. If the diff requires changing a test to stay green, it isn't a refactor — it's a behaviour change, and behaviour changes go through Step 3 (red), not Step 5.

## What counts as a refactor (vocabulary)

These are the moves available for this problem (<CUSTOMIZE: rewrite this list with 5–10 problem-specific refactors>):

- <CUSTOMIZE: e.g. "Extract a duplicated dtype-dispatch block into a single helper called from both call sites.">
- <CUSTOMIZE: e.g. "Replace a chained `if/else if` over dtype with a lookup table keyed on dtype.">
- <CUSTOMIZE: e.g. "Collapse two near-identical loop bodies (forward and reverse) into a single loop parameterised by direction.">
- <CUSTOMIZE: e.g. "Inline a one-call-site helper whose name no longer adds clarity.">
- <CUSTOMIZE: e.g. "Replace a hand-rolled `for` loop with the existing `iterRows` helper that already handles the edge case correctly.">
- <CUSTOMIZE: e.g. "Hoist a repeated property access out of a hot path into a local — only when there's a measured reason, otherwise it's noise.">

## What is *not* a refactor

These changes look like refactors but are not. The agent must reject them in this phase:

- ❌ **Renaming for taste.** `userID` → `userId` with no other change is a diff in search of a justification. Skip.
- ❌ **Reformatting.** Biome runs in CI; manual whitespace edits are noise.
- ❌ **Comment polish.** Improving a comment is fine, but it doesn't justify a refactor iteration on its own.
- ❌ **Reordering functions in a file** without changing call relationships.
- ❌ **Adding speculative abstractions.** "What if we needed three implementations of this someday?" → no, you don't, until you do.
- ❌ **Anything that changes behaviour.** If a test's assertion would change, this is a red-phase iteration, not a refactor.

## When to skip the refactor step entirely

Refactoring this iteration is the wrong call when:

- The previous iteration was also a refactor (give the codebase a beat).
- The file you'd touch was substantially rewritten *this* iteration in the green phase (it hasn't earned a refactor yet — let it sit through one or two more behavioural iterations first).
- The refactor would touch files outside the green-phase target, expanding the iteration's blast radius.
- You can't name a concrete clarity or complexity improvement in one sentence.

If any of these is true, write "skipping refactor: <reason>" in the reasoning and proceed to Step 6.

## Reasoning template

Before writing any refactor diff, fill in (in your visible reasoning):

1. **Move**: which refactor from the vocabulary above (or a novel one — describe it).
2. **Files touched**: the exact list. Refactor diffs that grow beyond the green-phase target need a strong reason.
3. **Improvement claimed**: one concrete sentence. "Removes the only remaining duplicated dispatch block, so a future dtype only needs to be added in one place." A vague "cleaner" or "more idiomatic" is not enough.
4. **Suite-stability claim**: which tests run, and the prediction that all of them stay green with no test edits.

After applying the refactor:

- Run the full test suite. **Same set of tests, same results — all green.**
- If anything went red, **revert the refactor** and continue without it. Don't try to "fix" a refactor by editing tests; that's the line that separates refactor from behaviour change.
- A "Refactor summary" line: 10–20 words, suitable for the Iteration History.

## Anti-patterns to avoid

- ❌ **Refactor + behaviour change in one diff.** If you change behaviour during a refactor, you can no longer tell which part broke a test. Split into separate iterations.
- ❌ **Editing a test to keep a refactor green.** Hard stop. Revert.
- ❌ **Sweeping cosmetic passes** dressed up as refactors. They are noise; CI lint catches the things that matter.
- ❌ **Refactors that touch unrelated modules.** A refactor whose blast radius exceeds the green-phase target needs its own justification — usually it should be its own iteration with no green-phase work.