## Symptom

Every iteration comment on an OpenEvolve program (e.g. tsb-perf-evolve's program issue #189, and its draft PR #190) is stuck showing ⏳ Pending CI with `Metric: pending CI (sandbox has no bun)`. The comments never get updated, even when CI finishes and is green.
Concrete example from #189 iteration 1 (2026-04-23 04:03 UTC):
> 🤖 Iteration 1 — ⏳ Pending CI (link to run 24815695413)
>
> - Operator: Exploration (first run — seeding Island 1 diversity)
> - Island: 1 — parallel-typed-arrays · comparison
> - Change: Replace boxed {v,i} pair allocation with indirect Uint32Array index sort + NaN pre-partition
> - Metric: pending CI (sandbox has no bun)
> - Commit: 24bbe85
Meanwhile the actual CI for PR #190's latest commit b230a018 shows:

- Test & Lint — SUCCESS
- Validate Python Examples — SUCCESS
- Build — SUCCESS
The iteration is correct — tests pass, types check, build succeeds. The comment is 90 minutes stale. There is no path that ever updates it.
## Root cause — two separate problems

### Problem 1: iteration comment posts before CI, never reconciles
The agent's per-iteration flow is currently:
1. Push the commit.
2. Attempt in-sandbox self-evaluation.
3. The sandbox has no `bun` (the firewall blocks `releaseassets.githubusercontent.com`), so fitness comes back `null`.
4. Post the iteration comment to the program issue + PR with status ⏳ Pending CI and a metric placeholder.
5. End the workflow run.
CI runs after the workflow ends. Nothing loops back to read CI's outcome and rewrite the iteration comment. The "pending" label is frozen.
This was meant to be handled by the CI-gated acceptance loop from #176 ("Gate autoloop iteration acceptance on CI green with a bounded fix-retry loop"). That change landed on 2026-04-21. But the OpenEvolve strategy playbook (.autoloop/strategies/openevolve/strategy.md) overrides Step 5 of the generic loop with its own "post iteration comment" sequence, and that override does not inherit the CI wait. OpenEvolve's Step 8 says "fold through to the default loop" — but by the time we get there the comment has already been posted with pending status.
### Problem 2: fitness isn't measured in CI either
Even if we fix Problem 1 and the workflow waits for CI before posting, CI currently only runs Test & Lint / Validate Python Examples / Build. It does not run the benchmark from .autoloop/programs/tsb-perf-evolve/code/benchmark.ts + benchmark.py. So "CI green" tells us correctness held — not what the fitness ratio is.
For OpenEvolve programs the fitness is the whole point. A comment that says "correctness: ✅" but leaves fitness: pending measurement is not useful. The population subsection in the state file has the same problem — candidate entries go in with fitness: null, and the MAP-Elites eviction rule can't decide who to keep when fitnesses are unknown.
The benchmark requires bun + python3 + pandas on a real runner, not the agent sandbox. The sandbox can't run it, and CI isn't configured to either.
## Fix
Two coordinated changes.
### Fix 1 — make iteration comments reconcile with CI
Pick one of these approaches:
**Option A (preferred)** — block the workflow until CI finishes.
Make the OpenEvolve playbook's Step 5 ("post iteration comment") explicitly call the CI-wait helper before posting. The agent's flow becomes:
1. Push the commit.
2. Resolve the PR number for the pushed branch (`existing_pr` from `autoloop.json`, or `gh pr list --head` as fallback).
3. `gh pr checks <pr> --watch --interval 30 --fail-fast` — block until every required check terminates.
4. Parse the conclusion (SUCCESS / FAILURE / CANCELLED / …).
5. On FAILURE, enter the bounded fix-retry loop (up to 5 attempts).
6. Only then post the iteration comment with the real status (✅ Accepted / ❌ Rejected / ⚠️ Error) and real metrics.
Downside: the autoloop workflow run now takes as long as CI. Fine for a program on a 6-hour schedule; less fine for one that runs every 30 minutes. The playbook already has a 60-min wall-clock cap, so we're bounded.
Sketch of the prompt addition in .autoloop/strategies/openevolve/strategy.md between Step 6 (Evaluate) and Step 7 (Update the population):
### Step 6.5. Wait for CI
The in-sandbox evaluation from Step 6 cannot verify correctness or measure
fitness in realistic conditions (no `bun`, no `pandas` reliably installed).
Real validation is CI on the pushed commit. Before recording the candidate in
the population or posting any iteration comment, wait for CI:
```bash
# Resolve the PR for the pushed branch.
PR=$(jq -r '.existing_pr // empty' /tmp/gh-aw/autoloop.json)
if [ -z "$PR" ]; then
  PR=$(gh pr list --head autoloop/{program-name} --json number -q '.[0].number')
fi

# Block until every check terminates (`|| true`: a red check must not abort the agent).
gh pr checks "$PR" --watch --interval 30 --fail-fast || true

# Collapse the per-check states into a single overall status.
# (`gh pr checks --json` exposes `state`, not `conclusion`.)
status=$(gh pr checks "$PR" --json state -q '.[].state' \
  | awk 'BEGIN { r="success" }
         /^(FAILURE|CANCELLED|TIMED_OUT|ACTION_REQUIRED|STARTUP_FAILURE|STALE)$/ { r="failure" }
         /^(PENDING|QUEUED|IN_PROGRESS|WAITING|REQUESTED)$/ { if (r=="success") r="pending" }
         END { print r }')
```
Branch on `$status`:

- `success` → record the candidate with fitness from the CI benchmark artifact (see Fix 2 below), proceed to Step 7.
- `failure` → enter the fix-retry loop from the default workflow; do NOT post "accepted" comments. On exhausted budget, mark the candidate `status: error` in the population with `fitness: null` and `pause_reason: "ci-fix-exhausted: <signature>"`.
- `pending` (ran out of wait budget) → don't post a speculative comment. Record the candidate with `fitness: null`, `status: pending-ci`, and leave a single reconciliation-pending comment the next run will overwrite.
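The three-way branch can be sketched as a small dispatcher. This is illustrative only: the echoed action names are placeholders for the steps described above, not real playbook commands.

```shell
#!/bin/sh
# Map the overall CI status from Step 6.5 to the next action.
# The echoed action names are placeholders, not real playbook commands.
dispatch_status() {
  case "$1" in
    success) echo "record-candidate-with-ci-fitness" ;;   # proceed to Step 7
    failure) echo "enter-fix-retry-loop" ;;               # never post "accepted"
    pending) echo "defer-comment-to-next-run" ;;          # fitness: null, status: pending-ci
    *)       echo "unknown-status" ;;
  esac
}

dispatch_status success   # → record-candidate-with-ci-fitness
dispatch_status failure   # → enter-fix-retry-loop
```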
**Option B** — post speculatively, reconcile on `workflow_run`.

Add a new workflow `.github/workflows/autoloop-reconcile.yml` triggered on `workflow_run: { workflows: [CI], types: [completed] }`. It finds the iteration comment for the head SHA (grep for the SHA in the bot's recent comments on the associated PR + program issue), reads the CI conclusion, and edits the comment in place.
More moving parts; doesn't stall the main workflow. Useful if Option A's wait pushes the 30-min-schedule programs over their wall-clock cap.
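If Option B is ever needed, the reconcile workflow could look roughly like this. A sketch only: the comment-lookup logic is elided, and the workflow/job names are assumptions; the `head_sha` and `conclusion` fields come from the `workflow_run` event payload.

```yaml
name: Autoloop reconcile
on:
  workflow_run:
    workflows: ["CI"]
    types: [completed]

permissions:
  issues: write
  pull-requests: write

jobs:
  reconcile:
    runs-on: ubuntu-latest
    steps:
      - name: Rewrite the stale iteration comment
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          SHA: ${{ github.event.workflow_run.head_sha }}
          CONCLUSION: ${{ github.event.workflow_run.conclusion }}
        run: |
          # Find the bot comment containing $SHA on the associated PR + program
          # issue, then PATCH its body with $CONCLUSION (lookup logic elided).
          echo "would reconcile $SHA with conclusion=$CONCLUSION"
```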
Prefer A. Fall back to B only if wait budgets become a real problem.
### Fix 2 — run the benchmark in CI and expose fitness
Add a job to `.github/workflows/ci.yml` (or a new `benchmark.yml` that runs on PRs touching `autoloop/*-evolve` branches) that installs the benchmark toolchain (`bun`, `python3`, `pandas`) and runs the evaluation:
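A sketch of what that job could look like. The action versions, the `if` filter, and the fitness-reporting step are assumptions; the check name must match whatever the agent later greps for, and the exact mechanism for surfacing `fitness=` to the check-run output still needs deciding.

```yaml
# Sketch only — a fragment to add under jobs: in ci.yml. Names/versions assumed.
benchmark:
  name: OpenEvolve benchmark
  if: startsWith(github.head_ref, 'autoloop/') && endsWith(github.head_ref, '-evolve')
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: oven-sh/setup-bun@v2        # CI runners see the open internet
    - run: pip install pandas
    - name: Run benchmark and expose fitness
      run: |
        fitness=$(.autoloop/programs/tsb-perf-evolve/evaluate.sh)
        echo "fitness=$fitness" | tee -a "$GITHUB_STEP_SUMMARY"
```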
Requires `evaluate.sh` (or the `## Evaluation` script) to be invocable standalone. The current `program.md` has a bash block that works if extracted into `.autoloop/programs/<name>/evaluate.sh` — do that as part of this issue so both the autoloop agent and CI use the exact same command.
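The extraction can be a small one-off. The helper below is hypothetical (name and approach are not from the repo); the program path in the usage comment is the one from this issue.

````shell
# Hypothetical helper: print the body of the first ```bash fence in a markdown file.
extract_bash_fence() {
  awk '/^```bash/ { f = 1; next }   # opening fence: start capturing
       /^```/     { f = 0 }         # any other fence line: stop capturing
       f' "$1"
}

# One-off usage against the program file would then be roughly:
#   extract_bash_fence .autoloop/programs/tsb-perf-evolve/program.md > evaluate.sh
#   chmod +x evaluate.sh
````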
Then the autoloop agent reads the fitness from the check-run output (or the uploaded artifact) when reconciling in Step 6.5. Matching API:
```bash
fitness=$(gh api "repos/$GITHUB_REPOSITORY/commits/$SHA/check-runs" \
  --jq '.check_runs[] | select(.name == "OpenEvolve benchmark") | .output.title' \
  | sed -n 's/^fitness=//p')
```
## Ordering and dependencies
Fix 2 is independent and can land first — it just adds fitness measurement capability; nothing consumes it yet.
Fix 1 depends on Fix 2 being live for the "record fitness in the population" part to work end-to-end. Without Fix 2, Fix 1 would wait for CI and then still have no fitness to record. That's still better than today (at least the correctness status is real and timely) — so Fix 1 can ship partially before Fix 2 is done.
Both depend on the firewall allowlist fix (releaseassets.githubusercontent.com in network.allowed:) being in place so bun is installable — but that's a CI runner concern, and CI runners are unrestricted. The sandbox firewall only affects in-agent runs; CI sees the open internet. So this is a non-issue for CI.
## Cleanup of already-posted stale comments
Existing iteration comments on #189 and PR #190 (e.g. iteration 1 at 04:03 UTC) are permanently stale. Options:
- Leave them. Historical record; they accurately reflect what the agent knew at the time.
- Edit them retroactively once Fix 1 lands, prepending an `[Updated: CI result <conclusion>]` marker to the title. Cheap one-off script using the GitHub API.
Prefer leaving them. The bug is the ongoing one — new iterations get real results from Fix 1 onward.
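If the retroactive edit is ever chosen, the one-off reduces to a body rewrite. The pure helper below is hypothetical, and the `gh` calls in the comment are shown for shape only.

```shell
# Hypothetical helper: prepend the update marker to an existing comment body.
prepend_ci_marker() {
  printf '[Updated: CI result %s]\n\n%s\n' "$1" "$2"
}

# The real one-off would wrap it with the GitHub API, roughly:
#   body=$(gh api "repos/$GITHUB_REPOSITORY/issues/comments/$ID" --jq .body)
#   gh api -X PATCH "repos/$GITHUB_REPOSITORY/issues/comments/$ID" \
#     -f body="$(prepend_ci_marker "$CONCLUSION" "$body")"

# Prints the marker line, a blank line, then the original body.
prepend_ci_marker SUCCESS '🤖 Iteration 1 — ⏳ Pending CI'
```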
## Acceptance
- The next iteration comment on an OpenEvolve program issue shows ✅ Accepted (or ❌ Rejected, etc.) with a numeric fitness value — no more `pending CI (sandbox has no bun)`.
- The state file's `## 🧬 Population` section has `fitness: <number>` on every candidate, no more `null`.
- CI on autoloop `*-evolve` PRs runs an `OpenEvolve benchmark` check alongside the existing Test/Lint/Build/Validate checks.
## Related

- The program's `## Observability` section already noted that fitness would need CI measurement — this issue is the formal fix.
- Firewall allowlist for `releaseassets.githubusercontent.com` — parallel concern; not a blocker for CI-side measurement (CI runners are unrestricted).