Gate iteration acceptance on CI green, with a bounded fix-and-retry loop #37

@mrjf

Summary

Today an iteration is marked ✅ Accepted as soon as the sandbox-computed metric improves. CI runs after the push and is not on the acceptance path — which means a broken commit can land on the long-running branch and stack under subsequent iterations, with no revert or retry.

This matters because the agent sandbox cannot reliably install common toolchains — bun, tsc, cargo, go, pytest, etc. — due to firewall restrictions on asset hosts like releaseassets.githubusercontent.com. Sandbox self-evaluation is therefore structurally unreliable. Real validation has to come from CI on the pushed HEAD commit.

Concrete failure this prevents

A long-running branch has an open draft PR with dozens of TypeScript compile errors because the agent wrote tests against a method that doesn't exist on the target class. The sandbox couldn't run tsc (the toolchain install was blocked), so the iteration's self-evaluation rubber-stamped the change as Accepted. Nothing reverts or retries; the branch stays red and subsequent iterations pile on top of it. This is the default failure mode whenever the sandbox can't run the project's type-check or test suite, which in practice is the case for most projects.

Proposed flow (replaces current Step 5)

Split the current "Step 5: Accept or Reject" into three sub-steps with an explicit CI gate between push and accept:

Step 5a: Push and wait for CI

After committing, push to autoloop/{program-name} and wait for the CI on the new HEAD:

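# Resolve the PR for the long-running branch unless the caller already set EXISTING_PR.
PR=${EXISTING_PR:-$(gh pr list --head autoloop/{program-name} --json number -q '.[0].number')}
# Block until checks finish; a non-zero exit (failed or pending checks) is classified below.
gh pr checks "$PR" --watch --interval 30 || true
# `gh pr checks --json` exposes `state` but no `conclusion` field (as of recent gh releases);
# `state` carries the conclusion for completed checks and the status for running ones.
status=$(gh pr checks "$PR" --json state -q '.[].state' \
  | awk '
      BEGIN { r = "success" }
      /^(FAILURE|ERROR|CANCELLED|TIMED_OUT|ACTION_REQUIRED|STARTUP_FAILURE|STALE)$/ { r = "failure" }
      /^(PENDING|QUEUED|IN_PROGRESS|WAITING|REQUESTED)$/ { if (r == "success") r = "pending" }
      END { print r }')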

Three outcomes: success, failure, pending. pending should be rare once --watch has returned; the awk classification is a defensive fallback for that case.
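The same three-way classification can be sketched in Python, which may be convenient if the helper logic moves into a committed script. This is a minimal sketch rather than the project's implementation: the function name `classify_checks` is hypothetical, and the state names mirror the awk patterns in this step plus `ERROR`, which is an added assumption (commit statuses can report it).

```python
from typing import Iterable

# State names mirror the awk patterns in Step 5a; ERROR is an added assumption.
FAILURE_STATES = frozenset({
    "FAILURE", "ERROR", "CANCELLED", "TIMED_OUT",
    "ACTION_REQUIRED", "STARTUP_FAILURE", "STALE",
})
PENDING_STATES = frozenset({
    "PENDING", "QUEUED", "IN_PROGRESS", "WAITING", "REQUESTED",
})


def classify_checks(states: Iterable[str]) -> str:
    """Collapse per-check states into 'success', 'failure', or 'pending'."""
    result = "success"
    for state in states:
        state = state.strip().upper()
        if state in FAILURE_STATES:
            return "failure"  # any failure outranks pending and success
        if state in PENDING_STATES:
            result = "pending"
    return result
```

Like the awk version, an empty list of checks classifies as success, so a caller may want to treat "no checks reported yet" as a separate case before trusting the result.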

Step 5b: Fix loop (up to 5 attempts per iteration)

If status == "failure", fix and retry — do not revert, do not accept:

  1. Fetch the failing check-run logs for the pushed SHA via gh run view --log or the Checks API.
  2. Extract a structured failure summary:
    • Failing job names and their first error lines.
    • A failure signature — a stable, normalized fingerprint of the failures (e.g., sorted failing-test names + the top error code, like TS2339:fromArrays:tests/stats/eval_query.test.ts). Used by the no-progress guard.
  3. No-progress guard: if this attempt's failure signature matches the previous attempt's signature, stop; the agent is stuck in a repeat loop. Set paused: true on the state file with pause_reason: "stuck in CI fix loop: <signature>", comment on the program issue with the signature and up to three of the most recent attempts, and end the iteration.
  4. Attempt the fix: feed the structured failure summary back to the agent as the next task ("CI failed on <sha>. Here are the failures: <…>. Fix them and push again"). The agent commits the fix and pushes.
  5. Loop back to Step 5a with the new HEAD.
  6. Budget: 5 fix attempts per iteration. If the 5th attempt still leaves CI red, paused: true with pause_reason: "ci-fix-exhausted: <signature>".
  7. Wall-clock cap: 60 min per iteration including CI waits. If exceeded mid-fix, set paused: true with pause_reason: "ci-timeout" and leave the current state in place.
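The control flow of Step 5b can be sketched as a small loop with the CI wait and the agent call injected as callables. Everything named here (`Failure`, `failure_signature`, `run_fix_loop`, the regex used for normalization) is hypothetical and only illustrates the guard, budget, and wall-clock cap. Three choices are assumptions rather than part of the proposal: a success is accepted even if it arrives after the cap, the no-progress guard is checked before the budget, and a pending status simply waits again.

```python
import re
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

MAX_FIX_ATTEMPTS = 5        # Step 5b budget
WALL_CLOCK_CAP_S = 60 * 60  # 60-minute cap, CI waits included


@dataclass(frozen=True)
class Failure:
    job: str          # failing job name
    first_error: str  # first error line from that job's log


def failure_signature(failures: List[Failure]) -> str:
    """Stable fingerprint: sorted, de-duplicated 'job:detail' pairs."""
    parts = set()
    for f in failures:
        # Assumed normalization: keep an error code such as TS2339 if present,
        # otherwise mask digits so line numbers do not change the signature.
        m = re.search(r"\b[A-Z]{1,4}\d{3,5}\b", f.first_error)
        detail = m.group(0) if m else re.sub(r"\d+", "N", f.first_error.strip())
        parts.add(f"{f.job}:{detail}")
    return "|".join(sorted(parts))


def run_fix_loop(
    wait_for_ci: Callable[[], Tuple[str, List[Failure]]],
    attempt_fix: Callable[[List[Failure]], None],
    now: Callable[[], float] = time.monotonic,
) -> Tuple[bool, Optional[str], int]:
    """Return (accepted, pause_reason, fix_attempts)."""
    deadline = now() + WALL_CLOCK_CAP_S
    previous_signature: Optional[str] = None
    attempts = 0
    while True:
        status, failures = wait_for_ci()  # Step 5a
        if status == "success":
            return True, None, attempts   # Step 5c
        if now() > deadline:
            return False, "ci-timeout", attempts
        if status == "pending":
            continue  # assumption: keep waiting until the cap is hit
        signature = failure_signature(failures)
        if signature == previous_signature:
            return False, f"stuck in CI fix loop: {signature}", attempts
        if attempts == MAX_FIX_ATTEMPTS:
            return False, f"ci-fix-exhausted: {signature}", attempts
        previous_signature = signature
        attempt_fix(failures)  # agent commits the fix and pushes
        attempts += 1
```

Injecting `wait_for_ci`, `attempt_fix`, and `now` keeps the loop testable without touching GitHub: a test can feed a scripted sequence of CI results and a fake clock.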

Step 5c: Accept

Only when status == "success":

  • Mark iteration accepted. Update state file's Machine State, push/update PR, comment on program issue with metric delta and fix-attempt count if > 0.

Why no revert?

The naive alternative is "revert on red, retry next iteration." Fix-and-retry is strictly better:

  • Reverting throws away real work. The agent's change is usually 80% right; re-deriving the fix from scratch next iteration is wasteful.
  • Reverting creates commit-history churn — the branch ends up with commit-revert-commit triples that are hard to audit.
  • Fix-and-retry produces a single clean commit on accept. Multiple fix attempts within an iteration are local to that iteration; if it succeeds, only the final commit is on the branch.
  • Edge case of fundamentally wrong direction: caught by the 5-attempt budget plus the no-progress guard. Program auto-pauses with a loud, structured pause_reason. Humans (or a PR-health-keeper workflow) can reset.

New Machine State values to document

Add to the pause-reason vocabulary:

  • ci-fix-exhausted: <signature> — 5 fix attempts didn't fix CI.
  • stuck in CI fix loop: <signature> — no-progress guard tripped (same failure twice in a row).
  • ci-timeout — 60-min wall-clock cap hit.
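Because two of these reasons embed a signature after a fixed prefix, anything that reads pause_reason (a human-facing report or a companion workflow) has to split the string back apart. A minimal sketch of such a parser follows; `PauseReason` and `parse_pause_reason` are hypothetical names, and the `": "` separator is assumed from the examples in this list.

```python
from typing import NamedTuple, Optional

# Prefixes taken from the pause-reason vocabulary in this issue.
_KINDS_WITH_SIGNATURE = ("ci-fix-exhausted", "stuck in CI fix loop")
_KINDS_WITHOUT_SIGNATURE = ("ci-timeout",)


class PauseReason(NamedTuple):
    kind: str
    signature: Optional[str]


def parse_pause_reason(reason: str) -> Optional[PauseReason]:
    """Split a CI-related pause_reason into (kind, signature).

    Returns None for reasons outside the CI vocabulary.
    """
    reason = reason.strip()
    if reason in _KINDS_WITHOUT_SIGNATURE:
        return PauseReason(reason, None)
    for kind in _KINDS_WITH_SIGNATURE:
        prefix = kind + ": "
        if reason.startswith(prefix):
            # Everything after the prefix is the signature, which may
            # itself contain colons (e.g. TS2339:fromArrays:...).
            return PauseReason(kind, reason[len(prefix):])
    return None
```

Matching on the full prefix, rather than splitting on the first colon, keeps signatures that contain colons intact.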

Add to the recent_statuses vocabulary:

  • ci-fix-exhausted — alongside accepted, rejected, error.

Coordination with PR-health-keeper workflows

If a repo ships a companion PR-health-keeper workflow (e.g., an "Evergreen" workflow that fixes failing CI on open PRs), it should be able to pick up paused Autoloop PRs using the same rules as human-authored PRs. The handoff is via the pause_reason field. Absent such a workflow, the loud pause + structured reason gives a human enough signal to intervene.

Related

  • Depends on sibling issue #34, "Extract the scheduler from the inline heredoc into a committed script" (scheduler extraction), for the fix-loop helper code (failure-signature extraction is ~30 lines of Python).
  • Related root cause of sandbox unreliability: gh-aw's firewall blocks releaseassets.githubusercontent.com, preventing tools like bun from installing. Even if that's fixed upstream, a CI gate is the correct acceptance criterion regardless.
