Summary
Today an iteration is marked ✅ Accepted as soon as the sandbox-computed metric improves. CI runs after the push and is not on the acceptance path — which means a broken commit can land on the long-running branch and stack under subsequent iterations, with no revert or retry.
This matters because the agent sandbox cannot reliably install common toolchains — bun, tsc, cargo, go, pytest, etc. — due to firewall restrictions on asset hosts like releaseassets.githubusercontent.com. Sandbox self-evaluation is therefore structurally unreliable. Real validation has to come from CI on the pushed HEAD commit.
Concrete failure this prevents
A long-running branch has an open draft PR with dozens of TypeScript compile errors because the agent wrote tests against a method that doesn't exist on the target class. The sandbox couldn't run tsc (toolchain install blocked), so the iteration's self-evaluation rubber-stamped the change as Accepted. Nothing reverts or retries; the branch stays red and subsequent iterations pile on top of the broken commit. This is the default failure mode whenever the sandbox can't run the project's type-check/test suite, which is the common case in practice.
Proposed flow (replaces current Step 5)
Split the current "Step 5: Accept or Reject" into three sub-steps with an explicit CI gate between push and accept:
Step 5a: Push and wait for CI
After committing, push to autoloop/{program-name} and wait for the CI on the new HEAD:
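# Resolve the PR number for the long-running branch unless EXISTING_PR is already set.
PR=${EXISTING_PR:-$(gh pr list --head autoloop/{program-name} --json number -q '.[0].number')}
# Block until the checks finish; a non-zero exit here must not abort the step.
gh pr checks "$PR" --watch --interval 30 || true
# Collapse all check results into one of: success, failure, pending.
status=$(gh pr checks "$PR" --json conclusion,state -q '.[] | (.conclusion // .state // "")' | awk ' BEGIN { r = "success" } /^(FAILURE|CANCELLED|TIMED_OUT|ACTION_REQUIRED|STARTUP_FAILURE|STALE)$/ { r = "failure" } /^(PENDING|QUEUED|IN_PROGRESS|WAITING|REQUESTED)$/ { if (r == "success") r = "pending" } END { print r }')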
Three outcomes: success, failure, pending. pending should be rare when --watch is used, since the watch blocks until the checks finish; the awk classification handles it defensively.
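To make the precedence between the three outcomes concrete, here is a minimal sketch that wraps the same awk rules in a shell function and feeds it canned input instead of live gh output. The function name classify_checks and the sample values are illustrative assumptions; the rule set is copied from the command above.

```shell
#!/usr/bin/env bash
# classify_checks: hypothetical helper name, not part of gh or Autoloop.
# Reads one check conclusion/state per line on stdin and prints
# success, failure, or pending using the same rules as the awk above.
classify_checks() {
  awk '
    BEGIN { r = "success" }
    /^(FAILURE|CANCELLED|TIMED_OUT|ACTION_REQUIRED|STARTUP_FAILURE|STALE)$/ { r = "failure" }
    /^(PENDING|QUEUED|IN_PROGRESS|WAITING|REQUESTED)$/ { if (r == "success") r = "pending" }
    END { print r }
  '
}

# Sample inputs (made up for illustration):
printf 'SUCCESS\nSUCCESS\n' | classify_checks            # prints: success
printf 'SUCCESS\nPENDING\n' | classify_checks            # prints: pending
printf 'PENDING\nFAILURE\nSUCCESS\n' | classify_checks   # prints: failure
```

A failure always wins over a pending check, regardless of order. Note that an empty list of checks also classifies as success under these rules, so a caller may want to treat "no checks reported" separately.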
Step 5b: Fix loop (up to 5 attempts per iteration)
If status == "failure", fix and retry — do not revert, do not accept:
Fetch the failing check-run logs for the pushed SHA via gh run view --log or the Checks API.
Extract a structured failure summary:
Failing job names and their first error lines.
A failure signature — a stable, normalized fingerprint of the failures (e.g., sorted failing-test names + the top error code, like TS2339:fromArrays:tests/stats/eval_query.test.ts). Used by the no-progress guard.
No-progress guard: if this attempt's failure signature matches the previous attempt's signature, stop. The agent is stuck in a repeat-loop. Set paused: true on the state file with pause_reason: "stuck in CI fix loop: <signature>", comment on the program issue with the signature and the three most recent attempts, and end the iteration.
Attempt the fix: feed the structured failure summary back to the agent as the next task ("CI failed on . Here are the failures: <…>. Fix them and push again"). Agent commits the fix and pushes.
Loop back to Step 5a with the new HEAD.
Budget: 5 fix attempts per iteration. If the 5th attempt still leaves CI red, set paused: true with pause_reason: "ci-fix-exhausted: <signature>".
Wall-clock cap: 60 min per iteration including CI waits. If exceeded mid-fix, set paused: true with pause_reason: "ci-timeout" and leave the current state in place.
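The failure signature and the no-progress guard described above could be sketched as follows. This is one possible normalization, not the proposal's definitive format: it assumes tsc-style error lines of the form file(line,col): error TSxxxx: message, and it builds a simpler CODE:FILE fingerprint than the example given earlier. The helper names failure_signature and no_progress and the log excerpts are hypothetical.

```shell
#!/usr/bin/env bash
# failure_signature: hypothetical helper. Reads tsc-style error lines on stdin,
#   path/to/file.ts(LINE,COL): error TS1234: message
# and prints sorted, de-duplicated CODE:FILE pairs joined by "|".
# Line and column numbers are dropped on purpose, so an edit that only
# moves the same error to another line does not change the signature.
failure_signature() {
  sed -n -E 's/^([^(]+)\([0-9]+,[0-9]+\): error (TS[0-9]+):.*$/\2:\1/p' \
    | LC_ALL=C sort -u \
    | paste -sd '|' -
}

# no_progress: exit status 0 when both signatures are non-empty and equal,
# i.e. when the no-progress guard should trip.
no_progress() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Made-up log excerpts from two consecutive fix attempts:
prev_log="tests/stats/eval_query.test.ts(12,20): error TS2339: Property 'fromArrays' does not exist on type 'Q'.
tests/stats/eval_query.test.ts(40,20): error TS2339: Property 'fromArrays' does not exist on type 'Q'."
curr_log="tests/stats/eval_query.test.ts(15,20): error TS2339: Property 'fromArrays' does not exist on type 'Q'."

prev=$(printf '%s\n' "$prev_log" | failure_signature)
curr=$(printf '%s\n' "$curr_log" | failure_signature)

echo "$curr"   # prints: TS2339:tests/stats/eval_query.test.ts
if no_progress "$prev" "$curr"; then
  echo "stuck in CI fix loop: $curr"
fi
```

Here the second attempt removed one error and shifted the other, but the fingerprint is unchanged, so the guard trips. A coarser or finer signature (for example, including the missing symbol name) trades false "stuck" verdicts against missed ones.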
Step 5c: Accept
Only when status == "success":
Mark the iteration accepted. Update the state file's Machine State, push/update the PR, and comment on the program issue with the metric delta and, if greater than 0, the fix-attempt count.
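Putting Steps 5a to 5c together, the control flow might look like the sketch below. The three hooks (wait_for_ci, current_signature, attempt_fix) are hypothetical and stubbed with canned data so the sketch runs on its own; in a real workflow they would wrap the gh commands, the log summarizer, and the agent call. The order of the guard checks (no-progress before budget) is an assumption, since the proposal does not specify it.

```shell
#!/usr/bin/env bash
# Canned data for the stubs: one CI outcome and one signature per push.
ci_results=(failure failure success)
ci_sigs=("TS2339:a.test.ts" "TS2304:a.test.ts" "")
push_count=0

# Hypothetical hooks. They set variables in the caller's scope.
wait_for_ci()       { status=${ci_results[$push_count]:-failure}; }
current_signature() { sig=${ci_sigs[$push_count]:-unknown}; }
attempt_fix()       { push_count=$((push_count + 1)); }

MAX_FIX_ATTEMPTS=5
WALL_CLOCK_CAP=$((60 * 60))   # 60 minutes, in seconds

# Sets $outcome to "accepted" or to a pause_reason string.
run_ci_gate() {
  local start=$SECONDS attempts=0 prev_sig="" status sig
  while :; do
    wait_for_ci                                   # Step 5a
    if [ "$status" = "success" ]; then            # Step 5c
      outcome="accepted"; return
    fi
    if [ $((SECONDS - start)) -ge "$WALL_CLOCK_CAP" ]; then
      outcome="ci-timeout"; return
    fi
    [ "$status" = "pending" ] && continue         # wait again, no fix attempt
    current_signature                             # Step 5b starts here
    if [ -n "$prev_sig" ] && [ "$sig" = "$prev_sig" ]; then
      outcome="stuck in CI fix loop: $sig"; return
    fi
    if [ "$attempts" -ge "$MAX_FIX_ATTEMPTS" ]; then
      outcome="ci-fix-exhausted: $sig"; return
    fi
    prev_sig=$sig
    attempts=$((attempts + 1))
    attempt_fix                                   # agent commits and pushes
  done
}

run_ci_gate
echo "$outcome after $push_count fix attempt(s)"  # prints: accepted after 2 fix attempt(s)
```

Every exit path yields either "accepted" or one of the structured pause reasons, so the caller only has to decide between marking the iteration accepted and writing paused: true with the returned string.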
Why no revert?
The naive alternative is "revert on red, retry next iteration." Fix-and-retry is strictly better:
Reverting throws away real work. The agent's change is usually 80% right; re-deriving the fix from scratch next iteration is wasteful.
Reverting creates commit-history churn — the branch ends up with commit-revert-commit triples that are hard to audit.
Fix-and-retry produces a single clean commit on accept. Multiple fix attempts within an iteration are local to that iteration; if it succeeds, only the final commit is on the branch.
Edge case of fundamentally wrong direction: caught by the 5-attempt budget plus the no-progress guard. Program auto-pauses with a loud, structured pause_reason. Humans (or a PR-health-keeper workflow) can reset.
Coordination with PR-health-keeper workflows
If a repo ships a companion PR-health-keeper workflow (e.g., an "Evergreen" workflow that fixes failing CI on open PRs), it should be able to pick up paused Autoloop PRs using the same rules as human-authored PRs. The handoff is via the pause_reason field. Absent such a workflow, the loud pause + structured reason gives a human enough signal to intervene.
Related root cause of sandbox unreliability: gh-aw's firewall blocks releaseassets.githubusercontent.com, preventing tools like bun from installing. Even if that's fixed upstream, a CI gate is the correct acceptance criterion regardless.
New Machine State values to document
Add to the pause-reason vocabulary:
ci-fix-exhausted: <signature> - 5 fix attempts didn't fix CI.
stuck in CI fix loop: <signature> - no-progress guard tripped (same failure twice in a row).
ci-timeout - 60-min wall-clock cap hit.
Add to the recent_statuses vocabulary:
ci-fix-exhausted - alongside accepted, rejected, error.
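Because the handoff to a human or a PR-health-keeper workflow runs through pause_reason, a consumer needs to tell the three values apart and recover the signature. A minimal sketch of such a parser follows; the helper names pause_kind and pause_signature and the short kind labels are hypothetical, and only the three prefixes come from the vocabulary above.

```shell
#!/usr/bin/env bash
# pause_kind: hypothetical helper that maps a pause_reason string to a short
# kind. "other" covers any pause reason outside this proposal's vocabulary.
pause_kind() {
  case "$1" in
    "ci-fix-exhausted: "*)     echo "exhausted" ;;
    "stuck in CI fix loop: "*) echo "stuck" ;;
    "ci-timeout")              echo "timeout" ;;
    *)                         echo "other" ;;
  esac
}

# pause_signature: prints the <signature> part, or nothing if there is none.
pause_signature() {
  case "$1" in
    "ci-fix-exhausted: "*)     printf '%s\n' "${1#ci-fix-exhausted: }" ;;
    "stuck in CI fix loop: "*) printf '%s\n' "${1#stuck in CI fix loop: }" ;;
  esac
}

pause_kind "ci-timeout"   # prints: timeout
pause_signature "stuck in CI fix loop: TS2339:fromArrays:tests/stats/eval_query.test.ts"
# prints: TS2339:fromArrays:tests/stats/eval_query.test.ts
```

Matching on the full prefix including the trailing ": " keeps the signature itself free to contain colons, as the example signature does.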