Problem
Today an autoloop iteration is marked ✅ Accepted as soon as the agent's in-sandbox self-evaluation returns a non-zero metric. CI runs after the push and is not in the acceptance path. When CI fails (as in #174: 24 TS2339 errors because the agent wrote tests against DataFrame.fromArrays(...), which doesn't exist), the autoloop just keeps going:
The broken commit stays on the long-running branch.
The next iteration stacks new work on top of a red branch.
Compounding factor: the agent sandbox can't reliably run bun/tsc because releaseassets.githubusercontent.com is firewall-blocked, so the sandbox self-evaluation is unreliable by construction. Real validation must come from CI.
Proposed behavior — fix-and-retry loop until green
Change the iteration acceptance criterion from "agent self-evaluated OK" to "CI is green on the pushed commit". If CI is red, the agent keeps iterating on the fix in-place until it's green. No revert.
Per-iteration flow
1. Agent proposes change, commits, and pushes to autoloop/<program>.
2. Wait for CI on the new HEAD commit:
     gh pr checks <pr> --watch --interval 30
     (or poll `/repos/.../commits/:sha/check-runs` and `/check-suites`).
3. If CI is green:
     → Mark iteration accepted. Comment on the program issue. Update state. Done.
4. If CI is red:
     → Fetch the failing check-run logs and a structured failure summary
       (compile errors / failing test names / first N lines of stack).
     → Feed that back to the agent as the next task:
       "CI failed on <sha>. Here are the failures: <...>. Fix them and push again."
     → Agent commits the fix and pushes again.
     → Go to step 2.
5. Keep looping until CI is green OR the fix-retry budget is exhausted.
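The flow above can be read as a small driver loop. The following is a minimal sketch under stated assumptions, not the autoloop implementation: `wait_for_ci`, `fetch_failures`, and `apply_fix` are hypothetical callables standing in for the `gh` calls and the agent, and the `reason` labels are illustrative.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class IterationResult:
    accepted: bool
    fix_attempts: int
    reason: str


def run_iteration(
    wait_for_ci: Callable[[], bool],      # hypothetical: True when CI is green on HEAD
    fetch_failures: Callable[[], str],    # hypothetical: structured failure summary
    apply_fix: Callable[[str], None],     # hypothetical: agent commits and pushes a fix
    max_fix_attempts: int = 5,
) -> IterationResult:
    """Accept only on green CI; otherwise feed failures back until the budget runs out."""
    attempts = 0
    while True:
        if wait_for_ci():                 # steps 2 and 3: green means accepted
            return IterationResult(True, attempts, "ci-green")
        if attempts >= max_fix_attempts:  # step 5: budget exhausted, never revert
            return IterationResult(False, attempts, "ci-fix-exhausted")
        summary = fetch_failures()        # step 4: red, so collect the failures
        apply_fix(
            f"CI failed. Here are the failures: {summary}. Fix them and push again."
        )
        attempts += 1
```

With the default budget, a run that never goes green checks CI six times (the original push plus five fixes) before giving up.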
Bounds (to avoid runaway cost / infinite loops)
Per-iteration fix budget: up to 5 fix attempts. Each fix attempt has its own bounded wall-clock and token budget.
Per-iteration wall clock: hard cap at 60 min total (including CI waits). If the cap is hit mid-fix, record the current state, leave the PR in a failing state, and surface it loudly — don't silently abandon.
No-progress guard: if two consecutive fix attempts produce the same failing check signature, stop looping — the agent is clearly stuck. Record paused with pause_reason: "stuck in CI fix loop: <signature>" and surface for human review.
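The three bounds can be checked in one place before each fix attempt. This is a sketch only: the issue does not define what a "failing check signature" is, so the definition here (a hash over the sorted failing check names and error codes) is an assumption, as are the function names.

```python
import hashlib
from typing import List, Optional


def failure_signature(failing_checks: List[str], error_codes: List[str]) -> str:
    """Order-independent fingerprint of a red CI run (assumed definition)."""
    parts = sorted(set(failing_checks)) + ["--"] + sorted(set(error_codes))
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()[:12]


def stop_reason(
    signatures: List[str],        # one signature per red CI run, oldest first
    attempts_made: int,           # fix attempts already pushed this iteration
    elapsed_seconds: float,       # wall clock since the iteration started, CI waits included
    max_fix_attempts: int = 5,
    max_seconds: float = 60 * 60,
) -> Optional[str]:
    """Return why the fix loop must stop, or None if another attempt is allowed."""
    if elapsed_seconds >= max_seconds:
        return "wall-clock cap hit"
    if len(signatures) >= 2 and signatures[-1] == signatures[-2]:
        return "stuck in CI fix loop: " + signatures[-1]
    if attempts_made >= max_fix_attempts:
        return "ci-fix-exhausted: " + (signatures[-1] if signatures else "unknown")
    return None
```

Hashing sorted, de-duplicated inputs means a reordered log with the same failures still counts as "no progress".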
When the budget is exhausted
Prefer loud failure over silent corruption:
Set paused: true on the program state file with the pause reason.
Do not revert. Leaving the broken commit in place lets a human (or Evergreen — see below) see what's wrong and continue from there. Reverting would lose the partial progress the agent made.
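As a rough illustration of the pause step, the sketch below treats the program state as a plain dictionary. The real state file format is not described in this issue, so the dictionary shape, the function name, and the comment wording are all assumptions; only the `paused` and `pause_reason` keys come from the text.

```python
import copy
from typing import Tuple


def pause_program(state: dict, reason: str, head_sha: str) -> Tuple[dict, str]:
    """Return (new_state, issue_comment). Touches no git history: no revert."""
    new_state = copy.deepcopy(state)      # leave the caller's state untouched
    new_state["paused"] = True
    new_state["pause_reason"] = reason
    comment = (
        f"Autoloop paused: {reason}. "
        f"Branch left at {head_sha} (not reverted) for human review."
    )
    return new_state, comment
```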
Evergreen extension
Currently Evergreen ("PR Health Keeper") is scoped to merge-conflict and CI-failure fixing on human-authored PRs. Extend it to also pick up autoloop PRs whose long-running branch is red. Two cases where Evergreen would help:
Autoloop hit its fix-retry budget and paused. Evergreen picks up the PR, attempts its own fix with fresh context. If green → un-pause the program.
The fix loop is in-flight but stalled (wall-clock cap hit). Evergreen resumes on the next tick.
Keep Evergreen's 5-attempts-per-SHA rule so it also gives up eventually rather than burning cycles.
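One possible reading of the 5-attempts-per-SHA rule is a counter keyed by the head commit. The sketch below assumes that reading; how Evergreen actually stores attempts is not stated here, and both function names are hypothetical.

```python
from typing import Dict


def evergreen_should_attempt(
    ci_red: bool,
    head_sha: str,
    attempts_by_sha: Dict[str, int],
    max_attempts_per_sha: int = 5,
) -> bool:
    """Attempt a fix only on a red PR whose head SHA still has budget left."""
    if not ci_red:
        return False
    return attempts_by_sha.get(head_sha, 0) < max_attempts_per_sha


def record_attempt(attempts_by_sha: Dict[str, int], head_sha: str) -> Dict[str, int]:
    """Return a new counter map with one more attempt recorded for head_sha."""
    updated = dict(attempts_by_sha)
    updated[head_sha] = updated.get(head_sha, 0) + 1
    return updated
```

Under this reading the cap only limits attempts against an unchanged head commit; a pushed fix produces a new SHA with its own counter.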
Implementation sketch in .github/workflows/autoloop.md
The agent prompt already has a Step 5 (Accept/Reject). Replace the current accept logic with:
### Step 5a: Push and wait for CI
After committing, push to `autoloop/{program-name}`. Then wait for CI on the new HEAD commit:
```bash
# Block until all checks finish (or the first one fails); the exit code is ignored here.
gh pr checks "$EXISTING_PR" --watch --interval 30 --fail-fast || true
# Collapse the per-check results into one status. `bucket` is one of
# pass, fail, pending, skipping, cancel (`conclusion` is not a `gh pr checks` JSON field).
status=$(gh pr checks "$EXISTING_PR" --json bucket -q '.[].bucket' \
  | awk 'BEGIN{r="success"} /fail|cancel/{r="failure"} END{print r}')
```
### Step 5b: Fix loop (up to 5 attempts)
If `$status == "failure"`:
- Fetch failing check-run logs: `gh run view --log` for each failing job.
- Extract the first 50 lines of error output and the set of failing test names.
- Treat this as the new task description and implement a fix.
- Commit, push, go back to Step 5a.
- If you are on the 5th attempt and still failing, stop. Set `paused=true` on the state file with `pause_reason: "ci-fix-exhausted: <signature>"`, comment on the program issue, and end the iteration.
### Step 5c: Accept
Only when `$status == "success"` do you mark the iteration accepted: update the state file, post the accepted comment on the program issue, and finish.
The missing_tool safe-output is irrelevant here — this runs in the agent's bash tool, not via an MCP tool.
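The "first 50 lines of error output and the set of failing test names" in Step 5b could be extracted along these lines. This is a sketch under assumptions: it expects `tsc`-style diagnostics containing `error TS<number>:` and test-runner lines of the form `(fail) <name> [<time>]`; the exact log format produced by this repository's CI is not shown in the issue, and the function name is hypothetical.

```python
import re
from typing import Dict, List

# Assumed log shapes; adjust to whatever the CI jobs actually print.
TSC_ERROR = re.compile(r"error (TS\d+):")
TEST_FAIL = re.compile(r"\(fail\) (.+?)(?: \[[^\]]*\])?$")


def summarize_failures(log: str, max_lines: int = 50) -> dict:
    """Build a structured failure summary from raw CI log text."""
    error_codes: Dict[str, int] = {}
    failing_tests: List[str] = []
    error_lines: List[str] = []
    for raw in log.splitlines():
        line = raw.rstrip()
        compile_error = TSC_ERROR.search(line)
        if compile_error:
            code = compile_error.group(1)
            error_codes[code] = error_codes.get(code, 0) + 1
            error_lines.append(line)
            continue
        failed_test = TEST_FAIL.search(line)
        if failed_test:
            failing_tests.append(failed_test.group(1))
            error_lines.append(line)
    return {
        "error_codes": error_codes,               # e.g. how many TS2339 errors
        "failing_tests": failing_tests,           # test names, in log order
        "first_errors": error_lines[:max_lines],  # capped excerpt for the agent
    }
```

The returned dictionary is small enough to paste into the next task description, and its error codes and test names are natural inputs for a failure signature.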
Why not just revert?
The original proposal was "revert if red, let the next iteration try again". The user explicitly prefers fix-and-retry because:
Reverting throws away real work. The agent's proposed change is usually 80% right; it's better to fix the 20% than re-derive it.
Reverting creates churn in the commit history — harder to audit what actually got tried.
Fix-and-retry produces a single clean commit on accept rather than a commit + revert + new commit.
Edge case: if the agent's change is fundamentally wrong (e.g., wrong architectural direction), the fix loop will exhaust its budget, the program will pause, and a human can manually reset the branch. That's a deliberate loud-failure path, not silent corruption.
Acceptance
An iteration cannot be accepted while CI is red.
When CI is red, the agent fetches the failure details and pushes a fix (up to 5 attempts).
If all attempts fail, the program auto-pauses with a clear pause_reason and a comment on the program issue — not a revert.
No autoloop PR sits with persistent red CI unnoticed.
Evergreen picks up paused autoloop PRs (same rules as human-authored PRs: max 5 attempts per SHA).
Context
Related firewall issue making sandbox self-evaluation unreliable: the agent can't install bun because releaseassets.githubusercontent.com is blocked (see comment on Build tsb: pandas → TypeScript migration #1, iteration 233). That's why CI-based validation is the right anchor: the sandbox can't be trusted to catch these errors on its own.