diff --git a/.autoloop/strategies/test-driven/CUSTOMIZE.md b/.autoloop/strategies/test-driven/CUSTOMIZE.md
new file mode 100644
index 00000000..2013fe0a
--- /dev/null
+++ b/.autoloop/strategies/test-driven/CUSTOMIZE.md
@@ -0,0 +1,111 @@
# Adopting the Test-Driven strategy for a new program

This file is a **creator-time guide** — it is read by the maintainer (or a "create program" agent) **once**, when authoring a new program that wants to use Test-Driven. It is **not** copied into the program's `strategy/` directory and is **not** read by the iteration agent at runtime.

If you are an iteration agent and have somehow ended up here: stop, go back to `strategy/test-driven.md` in the program directory, and follow that.

## When to pick Test-Driven

Test-Driven is the right strategy when **all** of the following are true:

- The program is about **specifying behaviour**, not optimizing a metric. The question is "is this correct?", not "is this faster?".
- "Correct" can be expressed as **executable assertions** — unit tests, property tests, integration tests, repros — that run as part of CI.
- Iterations **accumulate**: each iteration pins one more behaviour (or fixes one more bug), and the work product grows monotonically. You're not searching for a single best artifact; you're building up a body of pinned behaviour.
- There exists a **source of truth** the agent can consult when ambiguity arises (a reference implementation, a spec document, an issue with a reproducer, etc.).

If the program is "make this faster" or "minimize this scalar", **do not use Test-Driven**. Use AlphaEvolve (`.autoloop/strategies/alphaevolve/`).

If the program is genuinely "do whatever the agent thinks is best", neither strategy fits — use the default loop.

### Canonical use cases

- **API porting** (e.g., the pandas → tsb migration): each iteration pins one method's behaviour from the reference and implements it. The Test Harness becomes the coverage map.
- **Bug fixing** (e.g., a future `tsb-bugfix` program): each iteration picks a bug from a label, writes the failing repro as a test, makes it green.
- **Spec-driven development**: each iteration pins one bullet from a spec document as a test, then implements it.

## Steps to adopt

1. Create `.autoloop/programs/<program-name>/` with the usual layout: a `program.md`, and any source-of-truth references the program needs (a `docs/` directory, a pinned spec, etc.).
2. Copy the strategy template into the program:

   ```bash
   mkdir -p .autoloop/programs/<program-name>/strategy/prompts
   cp .autoloop/strategies/test-driven/strategy.md \
      .autoloop/programs/<program-name>/strategy/test-driven.md
   cp .autoloop/strategies/test-driven/prompts/write-test.md \
      .autoloop/programs/<program-name>/strategy/prompts/write-test.md
   cp .autoloop/strategies/test-driven/prompts/make-green.md \
      .autoloop/programs/<program-name>/strategy/prompts/make-green.md
   cp .autoloop/strategies/test-driven/prompts/refactor.md \
      .autoloop/programs/<program-name>/strategy/prompts/refactor.md
   ```

3. Resolve every `<...>` marker in `strategy/test-driven.md` and the three prompt files — see the marker-by-marker guidance below, and the example after this list.
4. Add the `## Evolution Strategy` pointer block to `program.md` (template below).
5. Sanity-check: `grep -R "<" .autoloop/programs/<program-name>/strategy/` should return **nothing**.
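For concreteness, here is what resolving one marker might look like — the program name `tsb-port` is hypothetical; the real name comes from your program directory:

```markdown
<!-- before (template) -->
# Test-Driven Strategy — <program-name>

<!-- after (resolved) -->
# Test-Driven Strategy — tsb-port
```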
## The pointer block for `program.md`

Replace (or add) `program.md`'s `## Evolution Strategy` section with exactly this:

```markdown
## Evolution Strategy

This program uses the **Test-Driven** strategy. On every iteration, read `strategy/test-driven.md` and follow it literally — it supersedes the generic analyze/accept/reject steps in the default autoloop loop.

Support files:
- `strategy/test-driven.md` — the runtime playbook (red → green → refactor loop, Test Harness rules).
- `strategy/prompts/write-test.md` — framing for the **red** phase: what makes a good failing test for this problem.
- `strategy/prompts/make-green.md` — framing for the **green** phase: minimum-change discipline.
- `strategy/prompts/refactor.md` — framing for the optional **refactor** phase, gated on a green suite.

Test Harness state lives in the state file on the `memory/autoloop` branch under the `## ✅ Test Harness` subsection (see the playbook for the schema).
```

## Marker-by-marker guidance

### `strategy.md` markers

- **`# Test-Driven Strategy — <program-name>`** — the program name as it appears in the file path.
- **`## Problem framing`** — 2–4 sentences. State the artifact under test, what "correct" means, and the source of truth. The agent reads this every iteration; make it dense and unambiguous about which reference wins on conflict.
- **Step 1 source-of-truth list** — name the *specific* references the agent should consult. Be concrete: not "pandas docs", but "pandas docs for the function currently in scope, plus the corresponding numpy doc when behaviour is delegated to numpy". Vague references mean the agent will skip them.
- **Step 2 sizing guidance** — the most important customization. Make it impossible to pick work too big to finish in one iteration. Examples:
  - API porting: "one method signature, with all overloads listed in the reference doc, but no more than one method per iteration."
  - Bug fixing: "one bug per iteration, identified by issue number; if the bug decomposes into sub-bugs, file new issues for the sub-bugs and pick one."
  - Spec-driven: "one MUST-bullet from the spec; SHOULD-bullets get separate iterations once all MUSTs are green."
- **`harness_size_cap`** — the default of 100 is fine for most programs. Lower it if your tests are large and the state file balloons; raise it only if you have a reason older entries should stay individually visible.

### `prompts/write-test.md` markers

This prompt frames the **red** phase. Customize:

- **Test framework setup** — the exact command to run a single test, the file naming convention, the import paths. The agent reads this every iteration; don't make it guess.
- **Domain knowledge** — anything about the source of truth that's easy to get wrong. (E.g. "pandas treats NaN as always-last-or-first regardless of `ascending`; the test must include both `ascending=True` and `ascending=False` with NaN to pin this." or "the issue's repro is in Python — translate carefully, NaN ≠ undefined.")
- **Anti-patterns** — failure modes you've seen the agent hit before in this program (over-specifying implementation details, asserting on internal data structures, snapshot tests of huge outputs).

### `prompts/make-green.md` markers

This prompt frames the **green** phase. Customize (a sketch of one worked example follows this list):

- **Minimum-change examples** — 2–3 worked examples of "this is the minimum change to make this test pass". Counter-examples are also helpful: "the agent was tempted to also handle <the adjacent case>; that gets its own test."
- **Don't-modify list** — the files / tests / fixtures that the green phase must never touch in pursuit of a passing test (typically: existing tests, the source of truth, the state file).
- **Domain knowledge** — same facts as `write-test.md`. Keep the two files in sync when one is updated; the agent reads both every iteration.
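As a sketch of the kind of worked example the **minimum-change** marker wants — the `clip` method and its shape are hypothetical, not part of the template:

```typescript
// Failing test pins: clip(lower) replaces values below `lower`, leaves the rest.
// Minimum change: one guard in the existing element-wise map — nothing else.
function clip(values: number[], lower: number): number[] {
  return values.map((v) => (v < lower ? lower : v));
}

// Scope creep (don't): also adding an `upper` bound "while you're here".
// `upper` has no failing test yet; it gets its own red → green iteration.
```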
### `prompts/refactor.md` markers

This prompt frames the optional **refactor** phase. Customize:

- **Refactor vocabulary** — 5–10 concrete refactoring moves that make sense for this program (extract helper, collapse duplicate dispatch, replace `switch` with table). Acts as a menu the agent samples from.
- **What's not a refactor** — explicit list of cosmetic-only changes the agent must reject as "not a refactor" (renaming for taste, formatting, comment polish).
- **Stop conditions** — when the agent should *not* attempt a refactor this iteration (suite is green but only just; the previous iteration was also a refactor; the file was rewritten substantially this iteration).

## A tiny worked example

Suppose you are creating `tsb-bugfix` to chew through bugs filed against tsb.

- Source of truth per iteration: the issue body and any reproducer in it.
- Sizing: one issue per iteration. If an issue has multiple unrelated repros, the agent files sub-issues and picks one.
- Test-writing convention: each repro becomes a `tests/regressions/issue-<n>.test.ts` file, named after the issue number, with the issue link in a top-of-file comment.
- Acceptance: the new regression test passes; the full suite is still green; the issue can be closed by referencing the merged commit.

That's the kind of fill-in to aim for — the agent should never have to guess the convention, the source of truth, or the size of the work.

diff --git a/.autoloop/strategies/test-driven/prompts/make-green.md b/.autoloop/strategies/test-driven/prompts/make-green.md
new file mode 100644
index 00000000..7bf9ab8d
--- /dev/null
+++ b/.autoloop/strategies/test-driven/prompts/make-green.md
@@ -0,0 +1,53 @@
# Make-green prompt — <program-name>

Framing for the **green** phase of an iteration. Read before writing the implementation.

---

You are making a failing test pass with the **minimum** change. Scope creep is the enemy — the test defines the requirement, nothing else.

## Domain knowledge

<creator: the facts about this problem space that are easy to get wrong — same facts as write-test.md, kept in sync>

- <fact 1>
- <fact 2>
- <fact 3>

## How to make a test pass without scope creep

1. **Re-read the failing test.** Don't skim. The exact assertions tell you exactly what must change.
2. **Identify the smallest code change** that would make the failing test pass without breaking any existing test. Name it concretely.
3. **Write only that change.** If a helper would make the code cleaner, note it for a later refactor iteration — don't add it now.
4. **Never modify existing tests to make the new test pass.** If the change you're considering breaks an existing test, something is wrong with your change, not with the old test.
5. **Run the full test suite, not just the new test.** Regressions in unrelated tests must be fixed before the iteration is accepted.

## Files you must not touch in the green phase

<creator: extend this list with program-specific paths>

- `tests/**` other than the test file added in Step 3 of this iteration.
- <the source-of-truth reference files — e.g. the pinned docs under `docs/reference/`>
- The state file on the `memory/autoloop` branch — that gets updated in Step 7, not the green phase.

## Anti-patterns to avoid

- ❌ **Overfitting to the test.** Don't hard-code the test's expected value in the implementation. If the test expects `42` for input `6`, your implementation must compute `6 * 7` or equivalent, not `return 42`.
- ❌ **Speculative generality.** "While I'm here, let me also handle <the adjacent case>." No — that edge case gets its own test.
- ❌ **Parallel implementations.** If the existing implementation has a branch your new behaviour doesn't fit into, think carefully before adding an `if (<new case>) { ... } else { <old path> }` next to it. Often the right answer is to fold the new case into the existing dispatch, not to grow a parallel one.
- ❌ **Weakening tests to make the change smaller.** If the test is right and the implementation is hard, the implementation is what's wrong. The test moves only via an explicit `rethink-test` iteration.
- ❌ **Skipping the full-suite run.** "The new test passes" is not the bar. "The new test passes *and nothing else broke*" is the bar.

## What the reasoning output must contain

Before writing the implementation:

- **Parent state**: one-line summary of what the target file does today.
- **Minimum change**: a concrete description of the smallest diff that would make the failing test pass.
- **Invariants to preserve**: the named tests / behaviours that must keep working.

After writing the implementation:

- A "Green summary" line: 10–20 words, suitable for the Iteration History.
- Confirmation that the full test suite is green, with the test count (`N passing, 0 failing`).
- Any new lesson worth promoting to the state file's Lessons Learned (phrased as a transferable heuristic, not an iteration report).

diff --git a/.autoloop/strategies/test-driven/prompts/refactor.md b/.autoloop/strategies/test-driven/prompts/refactor.md
new file mode 100644
index 00000000..da1933fd
--- /dev/null
+++ b/.autoloop/strategies/test-driven/prompts/refactor.md
@@ -0,0 +1,64 @@
# Refactor prompt — <program-name>

Framing for the optional **refactor** phase of an iteration. Read only after the suite is fully green.

---

Refactoring is a *gated* step. You earn the right to refactor by getting to green with no regressions; you don't earn the right to refactor every iteration. If nothing is worth refactoring, skip this step and say so explicitly in the iteration's reasoning.

A refactor in Test-Driven has one rule that overrides everything else: **the test suite must remain green, with the same set of tests, before and after the refactor**. If the diff requires changing a test to stay green, it isn't a refactor — it's a behaviour change, and behaviour changes go through Step 3 (red), not Step 5.

## What counts as a refactor (vocabulary)

These are the moves available for this problem (<program-name>); for a concrete instance of one move, see the sketch after the next section:

- <move 1 — e.g. extract helper>
- <move 2 — e.g. collapse duplicate dispatch>
- <move 3 — e.g. replace `switch` with table>
- <move 4>
- <move 5>
- <move 6>

## What is *not* a refactor

These changes look like refactors but are not. The agent must reject them in this phase:

- ❌ **Renaming for taste.** `userID` → `userId` with no other change is a diff in search of a justification. Skip.
- ❌ **Reformatting.** Biome runs in CI; manual whitespace edits are noise.
- ❌ **Comment polish.** Improving a comment is fine, but it doesn't justify a refactor iteration on its own.
- ❌ **Reordering functions in a file** without changing call relationships.
- ❌ **Adding speculative abstractions.** "What if we needed three implementations of this someday?" → no, you don't, until you do.
- ❌ **Anything that changes behaviour.** If a test's assertion would change, this is a red-phase iteration, not a refactor.
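As one concrete instance of the vocabulary above — "replace `switch` with table" — here is a behaviour-preserving sketch; the `formatCell` function and dtype names are hypothetical:

```typescript
// Before: every new dtype grows the switch.
function formatCellBefore(dtype: string, value: unknown): string {
  switch (dtype) {
    case "int64":
      return String(value);
    case "bool":
      return value ? "True" : "False";
    default:
      return JSON.stringify(value);
  }
}

// After: the dispatch is a table; a new dtype is one new entry.
const formatters: Record<string, (value: unknown) => string> = {
  int64: (value) => String(value),
  bool: (value) => (value ? "True" : "False"),
};

function formatCell(dtype: string, value: unknown): string {
  const format = formatters[dtype];
  return format ? format(value) : JSON.stringify(value);
}
```

Same observable behaviour, same tests — but the next dtype is one table entry instead of another `switch` arm.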
## When to skip the refactor step entirely

Refactoring this iteration is the wrong call when:

- The previous iteration was also a refactor (give the codebase a beat).
- The file you'd touch was substantially rewritten *this* iteration in the green phase (it hasn't earned a refactor yet — let it sit through one or two more behavioural iterations first).
- The refactor would touch files outside the green-phase target, expanding the iteration's blast radius.
- You can't name a concrete clarity or complexity improvement in one sentence.

If any of these is true, write "skipping refactor: <reason>" in the reasoning and proceed to Step 6.

## Reasoning template

Before writing any refactor diff, fill in (in your visible reasoning):

1. **Move**: which refactor from the vocabulary above (or a novel one — describe it).
2. **Files touched**: the exact list. Refactor diffs that grow beyond the green-phase target need a strong reason.
3. **Improvement claimed**: one concrete sentence. "Removes the only remaining duplicated dispatch block, so a future dtype only needs to be added in one place." A vague "cleaner" or "more idiomatic" is not enough.
4. **Suite-stability claim**: which tests run, and the prediction that all of them stay green with no test edits.

After applying the refactor:

- Run the full test suite. **Same set of tests, same results — all green.**
- If anything went red, **revert the refactor** and continue without it. Don't try to "fix" a refactor by editing tests; that's the line that separates refactor from behaviour change.
- A "Refactor summary" line: 10–20 words, suitable for the Iteration History.

## Anti-patterns to avoid

- ❌ **Refactor + behaviour change in one diff.** If you change behaviour during a refactor, you can no longer tell which part broke a test. Split into separate iterations.
- ❌ **Editing a test to keep a refactor green.** Hard stop. Revert.
- ❌ **Sweeping cosmetic passes** dressed up as refactors. They are noise; CI lint catches the things that matter.
- ❌ **Refactors that touch unrelated modules.** A refactor whose blast radius exceeds the green-phase target needs its own justification — usually it should be its own iteration with no green-phase work.

diff --git a/.autoloop/strategies/test-driven/prompts/write-test.md b/.autoloop/strategies/test-driven/prompts/write-test.md
new file mode 100644
index 00000000..75313515
--- /dev/null
+++ b/.autoloop/strategies/test-driven/prompts/write-test.md
@@ -0,0 +1,69 @@
# Write-test prompt — <program-name>

Framing for the **red** phase of an iteration. Read before writing the failing test.

---

You are pinning one behaviour as an executable assertion. The test you write will outlive this iteration and is the contract every future implementation must satisfy. Treat it as a spec, not as scratch work.

## Domain knowledge

Things you, the agent, should keep in mind about this specific problem space (<program-name>):

- <fact 1>
- <fact 2>
- <fact 3>
- <fact 4>

## Test framework setup

<creator: the file layout, naming convention, and import paths for tests — a hypothetical filled-in example follows this section>

```
<test file layout / naming convention>
```

Run a single test with: <command>.
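As a sketch of what a filled-in setup plus a first red-phase test might look like — vitest as the runner and the `tsb` `Series` API shape are assumptions, not part of the template:

```typescript
// tests/series/sort-values.test.ts
// Spec source: pandas docs for Series.sort_values (na_position behaviour).
import { describe, expect, it } from "vitest";
import { Series } from "tsb";

describe("Series.sortValues", () => {
  it("places NaN last regardless of ascending", () => {
    const s = new Series([3, NaN, 1]);
    // Expected values traced to the pandas reference, not intuition.
    expect(s.sortValues({ ascending: false }).toArray()).toEqual([3, 1, NaN]);
  });
});
```

A matching single-test command would then be something like `npx vitest run tests/series/sort-values.test.ts`.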
## What makes a good failing test for this problem

- **One behaviour per test.** If you find yourself writing more than one assertion that exercises a different code path, split into multiple `it(...)` blocks.
- **Sourced from the reference, not from intuition.** The expected values in the test must be traceable to the source of truth — quote the reference (URL, line number, example output) in a comment above the test if it isn't obvious.
- **Cover the named edge cases.** The playbook listed the edge cases you decided this test must include — none of them are optional. If you cut one, justify it in the Test Harness `Notes` field.
- **Doesn't couple to implementation details.** Assert on the observable result, not on the data structure used internally. A test that breaks when the implementation switches from `Map` to `Object` is testing the wrong thing.
- **Fails for the right reason on first run.** Run the test before writing any implementation. The failure must read as "this behaviour is missing" or "this behaviour is wrong in *this specific way*", not "Cannot read property X of undefined" or "module not found". A confusing first-failure message will mislead future you.
- **Would still pass under a reasonable refactor.** If renaming an internal helper would break the test, the test is too coupled. Refactor the test before going to green.

## Validity checklist

Before declaring the test done and moving to Step 4 (green), confirm:

- The test is in the right file (per the framework setup above).
- The test imports from the public API surface (`tsb`, not deep `src/...` paths) unless the program explicitly says otherwise.
- Running the test produces **one clear failure message** — not a parse error, not a compile error, not a stack trace from uninitialized state. The failure should read to a human as "this behaviour is missing" or "this behaviour is wrong in this specific way."
- The failure message names the expected vs. actual value in a form a future contributor can act on.
- The test would *still pass* under a reasonable future refactor of the implementation (no implementation-detail coupling).

## What the reasoning output must contain

Before writing the test:

- **Target**: what behaviour are you pinning?
- **Spec source**: where does the desired behaviour come from? URL / reference.
- **Edge cases included**: list the specific cases this test covers.
- **Edge cases intentionally excluded**: list what this test *doesn't* cover and why (separate tests, out of scope, etc.).

After writing the test:

- A "Red summary" line: 10–20 words, suitable for the Test Harness and Iteration History.
- The concrete failure message observed when the test runs.

## Anti-patterns to avoid

- ❌ **Testing the implementation, not the behaviour.** "Calls `_sortInternal` with these args" is not a behavioural test.
- ❌ **Snapshot tests of huge outputs.** A snapshot of a 10k-row table makes regression triage impossible. Snapshot a *summary* (length, dtype, first/last N rows) instead.
- ❌ **Asserting on error message strings verbatim.** Error wording changes; the *type* of error and the *fact* that it was thrown rarely do.
- ❌ **Tests that depend on each other.** Each `it(...)` must run in isolation. No shared mutable fixtures.
- ❌ **Skipping the run-and-confirm-it-fails step.** A test that has never been seen to fail is not yet a test.

diff --git a/.autoloop/strategies/test-driven/strategy.md b/.autoloop/strategies/test-driven/strategy.md
new file mode 100644
index 00000000..0aa09b25
--- /dev/null
+++ b/.autoloop/strategies/test-driven/strategy.md
@@ -0,0 +1,133 @@
# Test-Driven Strategy — <program-name>

This file is the **runtime playbook** for this program. The autoloop agent reads it at the start of every iteration and follows it literally. It supersedes the generic "Analyze and Propose" / "Accept or Reject" steps in the default autoloop iteration loop — all other steps (state read, branch management, state file updates, CI gating) still apply.

## Problem framing

<creator: 2–4 sentences — the artifact under test, what "correct" means, and which source of truth wins on conflict>

## Per-iteration loop

### Step 1. Load state

1. Read `program.md` — Goal, Target, Evaluation.
2. Read the program's state file from the repo-memory folder (`{program-name}.md`). Locate the `## ✅ Test Harness` subsection. If it does not exist, create it using the schema in [Test Harness schema](#test-harness-schema).
3. Read <the program's specific source-of-truth references>.
4. Read the red- and green-phase prompt templates in `strategy/prompts/` (`write-test.md` and `make-green.md`; `refactor.md` matters only if you reach Step 5). They frame how you reason about writing tests and making them pass for this specific problem.
### Step 2. Pick target

Pick **one** unit of work — a single behaviour to pin or fix. Size it so that the entire red → green → refactor cycle fits in one iteration:

- <creator: program-specific sizing guidance — see CUSTOMIZE.md for examples>

Deterministic overrides (apply *before* free choice):

- If the Test Harness has any entry with status `failing` that is **not** marked `blocked`, pick that one. A failing test is an obligation — you don't add new tests while old ones are still red.
- If the most recent 3 iterations were all `error` (validity pre-check failed, test didn't even compile), force a `rethink-test` iteration — the problem is the test, not the implementation. See Step 4's rethink branch.

Record the chosen target in the iteration's reasoning.

### Step 3. Red — write the failing test

Use `strategy/prompts/write-test.md` as framing.

Before writing the test, state (in visible reasoning):

1. What behaviour you are pinning. One sentence, specific.
2. The source-of-truth reference (pandas doc, spec bullet, issue reproducer).
3. The minimum set of assertions that captures "this is correct" without over-specifying implementation details.
4. Edge cases the test must include (empty inputs, NaN, dtype boundaries — whatever's applicable).

Then write the test file (or append to an existing one). Before continuing: **run the test and confirm it fails with a useful error message**. If it passes already, you picked wrong — either the target is already implemented (pick a different one) or the test is too weak (rewrite).

Record the new test in the Test Harness with status `failing` and the iteration number.

### Step 4. Green — implement until the test passes

Use `strategy/prompts/make-green.md` as framing.

Before writing any implementation code, state:

1. Parent state of the target file(s) — one-line summary of what exists now.
2. The **minimum** change needed to make the failing test pass. Resist scope creep; the test defines the requirement, nothing else.
3. Which invariants of the existing tests must continue to hold (list them).

Then write the implementation. Run the full test suite (not just the new test): **every existing test must still pass, and the new one must now pass too.**

If the test still fails after implementation:

- **Attempt ≤ 3**: re-analyze what's missing and try again (stay in Step 4).
- **Attempt ≥ 4**: consider that the test itself may be wrong — re-enter the `rethink-test` branch. Read the source of truth again, weaken/rewrite the test to match the *real* spec, then restart Step 4. Document the change in the Test Harness entry as a `test-revised` note.
- **After 5 total attempts in the same iteration**: stop. Mark the target `blocked` in the Test Harness with a `blocked_reason`. Set `paused: true` on the state file with `pause_reason: "td-stuck: <target>"`. End the iteration.

### Step 5. Refactor (optional, gated on green)

Only if the test suite is fully green, consider a refactor. Use `strategy/prompts/refactor.md` as framing.

Pick a refactor only if you can name a concrete clarity/complexity improvement. Cosmetic changes are not refactors — they are diffs in search of a justification. If nothing is worth refactoring, skip this step. Record the choice in reasoning either way.

After any refactor, the full test suite must still be green. If it isn't, revert the refactor and continue without it.

### Step 6. Evaluate

Run the evaluation command from `program.md`. For most TDD programs this is simply "the full test suite passes" — a boolean, not a scalar. Emit `{"metric": <passing>, "passing": N, "failing": 0}` where `metric` is `passing` (higher is better); a concrete sketch follows below.

Some TDD programs have a secondary metric (bundle size, coverage percentage). In that case `metric` can be the secondary metric, with the hard constraint that `failing == 0` — no reduction in coverage counts as progress if tests are red.
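For concreteness, a hypothetical emission for a suite of 42 tests, where the passing count is the metric — the numbers are illustrative:

```json
{"metric": 42, "passing": 42, "failing": 0}
```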
### Step 7. Update the Test Harness

Append the iteration's actions to `## ✅ Test Harness`:

- New test → add entry with status `passing` (it was just made green).
- Existing failing test became green → flip its status.
- A test became blocked → set status `blocked`, fill `blocked_reason`.

Enforce size discipline: keep at most `harness_size_cap` test entries visible; older entries can collapse into compressed range summaries (`### Tests t040–t080 — ✅ passing (N batch additions for X feature): brief summary`).

### Step 8. Fold through to the default loop

Continue with the normal autoloop Step 5 (Accept or Reject → commit / discard, update state file's Machine State, Iteration History, Lessons Learned, etc.) as defined in the workflow. The only additional requirements from Test-Driven are:

- The Iteration History entry must include `phase` (red / green / refactor / rethink-test), `target`, `new_tests` count, `existing_tests_status` (all-green / regression-introduced-and-fixed).
- Lessons Learned additions should be phrased as *transferable heuristics* about the problem space (e.g. "Pandas' NaN-handling for `sort_values` treats NaN as always-last-or-first regardless of `ascending`; tsb implementations must branch on `naPosition` independently of the sort direction") — not iteration reports.

## Test Harness schema

The harness lives in the state file `{program-name}.md` on the `memory/autoloop` branch as a subsection. Use this exact layout so maintainers can read and edit it (a filled-in example follows the schema):

```markdown
## ✅ Test Harness

> 🤖 *Managed by the Test-Driven strategy. One entry per pinned behaviour. Newest first.*

### <id> · <status> · iter <iter>

- **Target**: <one-line behaviour>
- **Spec source**: <URL / doc / issue reference>
- **Test file**: <path>::<test name>
- **Phase added**: red / green / refactor / rethink-test
- **Edge cases covered**: <list>
- **Notes**: <optional>
- **Blocked reason**: <only when status is `blocked`>

---
```

Identifiers:

- `<id>` is `t{NNN}` zero-padded, monotonically increasing across the program's lifetime.
- `<iter>` is the iteration number from the Machine State table.
- Status transitions: `failing → passing` on green; `passing → failing` only if a regression is introduced (and must be fixed before the iteration is accepted); `* → blocked` only via the 5-attempt cap in Step 4.

When compressing older entries under the `harness_size_cap`, **never** delete an individual entry's metadata in isolation — collapse a contiguous range into a single summary header (`### Tests t040–t080 — ✅ passing (N batch additions for X feature): brief summary`) and remove the per-entry bodies in that range. The summary keeps the count and theme so future iterations can see what's already covered.
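For concreteness, a hypothetical filled-in entry — the id, target, and paths are illustrative, not prescribed:

```markdown
### t017 · passing · iter 23

- **Target**: Series.sortValues places NaN last regardless of sort direction
- **Spec source**: pandas docs, Series.sort_values (na_position)
- **Test file**: tests/series/sort-values.test.ts::places NaN last regardless of ascending
- **Phase added**: red
- **Edge cases covered**: NaN with ascending=true, NaN with ascending=false, empty series
- **Notes**: —
- **Blocked reason**: —

---
```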
## Acceptance checklist

An iteration is acceptable iff **all** of the following hold:

- The new (or previously-failing) test now passes.
- Every previously-passing test still passes — no regressions.
- The Test Harness entry for the target has been updated to reflect the new status.
- The Iteration History entry records `phase`, `target`, `new_tests`, and `existing_tests_status`.
- If the iteration was a `refactor`, the diff was justified by a named clarity/complexity improvement and the suite is still green.

If any of these fail, the iteration is rejected and the working tree is reset, exactly as in the default loop.
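For concreteness, the four Test-Driven fields of a hypothetical accepted iteration's history entry might read as follows — the default loop defines the rest of the entry's shape:

```markdown
- **phase**: green
- **target**: Series.sortValues NaN ordering (t017)
- **new_tests**: 1
- **existing_tests_status**: all-green
```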