
OpenEvolve iteration comments stuck on 'Pending CI'; fitness never populated — wait for CI + run benchmark in CI #196

@mrjf

Description

Symptom

Every iteration comment on an OpenEvolve program (e.g. tsb-perf-evolve's program issue #189, and its draft PR #190) is stuck showing ⏳ Pending CI with Metric: pending CI (sandbox has no bun). The comments never get updated, even when CI finishes and is green.

Concrete example from #189 iteration 1 (2026-04-23 04:03 UTC):

🤖 Iteration 1 — ⏳ Pending CI (link to run 24815695413)
- Operator: Exploration (first run — seeding Island 1 diversity)
- Island: 1 — parallel-typed-arrays · comparison
- Change: Replace boxed {v,i} pair allocation with indirect Uint32Array index sort + NaN pre-partition
- Metric: pending CI (sandbox has no bun)
- Commit: 24bbe85

Meanwhile the actual CI for PR #190's latest commit b230a018 shows:

Test & Lint               SUCCESS
Validate Python Examples  SUCCESS
Build                     SUCCESS

The iteration is correct — tests pass, types check, build succeeds. The comment is 90 minutes stale. There is no path that ever updates it.

Root cause — two separate problems

Problem 1: iteration comment posts before CI, never reconciles

The agent's per-iteration flow is currently:

  1. Push commit.
  2. Attempt in-sandbox self-evaluation.
  3. Sandbox has no bun (firewall blocks releaseassets.githubusercontent.com), so fitness comes back null.
  4. Post iteration comment to the program issue + PR with status ⏳ Pending CI and metric placeholder.
  5. End the workflow run.

CI runs after the workflow ends. Nothing loops back to read CI's outcome and rewrite the iteration comment. The "pending" label is frozen.

This was meant to be handled by the CI-gated acceptance loop from #176 ("Gate autoloop iteration acceptance on CI green with a bounded fix-retry loop"). That change landed on 2026-04-21. But the OpenEvolve strategy playbook (.autoloop/strategies/openevolve/strategy.md) overrides Step 5 of the generic loop with its own "post iteration comment" sequence, and that override does not inherit the CI wait. OpenEvolve's Step 8 says "fold through to the default loop" — but by the time we get there the comment has already been posted with pending status.

Problem 2: fitness isn't measured in CI either

Even if we fix Problem 1 and the workflow waits for CI before posting, CI currently only runs Test & Lint / Validate Python Examples / Build. It does not run the benchmark from .autoloop/programs/tsb-perf-evolve/code/benchmark.ts + benchmark.py. So "CI green" tells us correctness held — not what the fitness ratio is.

For OpenEvolve programs the fitness is the whole point. A comment that says "correctness: ✅" but leaves fitness: pending measurement is not useful. The population subsection in the state file has the same problem — candidate entries go in with fitness: null, and the MAP-Elites eviction rule can't decide who to keep when fitnesses are unknown.
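
To make the eviction ambiguity concrete, here is a jq sketch of a naive per-cell "keep the best" rule (the field names are assumptions about the population file's shape, not its actual schema):

```shell
# Two candidates competing for the same MAP-Elites cell, one unmeasured.
# jq orders null below every number, so max_by silently treats the
# unmeasured candidate as worst -- it gets evicted even if it is better.
echo '[{"id":"a","fitness":1.8},{"id":"b","fitness":null}]' \
  | jq -r 'max_by(.fitness) | .id'
```

Any rule built on naive comparison makes this silent, wrong choice; the eviction logic needs an explicit "unmeasured" state instead.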

The benchmark requires bun + python3 + pandas on a real runner, not the agent sandbox. The sandbox can't run it, and CI isn't configured to either.

Fix

Two coordinated changes.

Fix 1 — make iteration comments reconcile with CI

Pick one of these approaches:

Option A (preferred) — block the workflow until CI finishes.

Make the OpenEvolve playbook's Step 5 ("post iteration comment") explicitly call the CI-wait helper before posting. The agent's flow becomes:

  1. Push commit.
  2. Resolve the PR number for the pushed branch (existing_pr from autoloop.json, or gh pr list --head as fallback).
  3. gh pr checks <pr> --watch --interval 30 --fail-fast — block until every required check terminates.
  4. Parse the conclusion (SUCCESS / FAILURE / CANCELLED / …).
  5. If FAILURE, enter the fix-retry loop from #176 (up to 5 attempts).
  6. Only now post the iteration comment with real status (✅ Accepted / ❌ Rejected / ⚠️ Error) and real metrics.

Downside: the autoloop workflow run now takes as long as CI. Fine for a 6h-schedule program; less fine for every 30m. The playbook already has a 60-min wall-clock cap, so we're bounded.

Sketch of the prompt addition in .autoloop/strategies/openevolve/strategy.md between Step 6 (Evaluate) and Step 7 (Update the population):

### Step 6.5. Wait for CI

The in-sandbox evaluation from Step 6 cannot verify correctness or measure
fitness in realistic conditions (no `bun`, no `pandas` reliably installed).
Real validation is CI on the pushed commit. Before recording the candidate in
the population or posting any iteration comment, wait for CI:

```bash
PR=$(jq -r '.existing_pr // empty' /tmp/gh-aw/autoloop.json)
if [ -z "$PR" ]; then
  PR=$(gh pr list --head autoloop/{program-name} --json number -q '.[0].number')
fi
gh pr checks "$PR" --watch --interval 30 --fail-fast || true
status=$(gh pr checks "$PR" --json conclusion,state -q '.[] | (.conclusion // .state // "")' \
  | awk 'BEGIN { r="success" }
         /^(FAILURE|CANCELLED|TIMED_OUT|ACTION_REQUIRED|STARTUP_FAILURE|STALE)$/ { r="failure" }
         /^(PENDING|QUEUED|IN_PROGRESS|WAITING|REQUESTED)$/ { if (r=="success") r="pending" }
         END { print r }')
```

Branch on `$status`:

- `success` → record the candidate with fitness from the CI benchmark artifact (see Fix 2 below), proceed to Step 7.
- `failure` → enter the fix-retry loop from the default workflow; do NOT post "accepted" comments. On exhausted budget, mark the candidate `status: error` in the population with `fitness: null` and `pause_reason: "ci-fix-exhausted: <signature>"`.
- `pending` (ran out of wait budget) → don't post a speculative comment. Record the candidate with `fitness: null`, `status: pending-ci`, and leave a single reconciliation-pending comment the next run will overwrite.
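
The three-way branch can be sketched as a shell case over `$status` (the echo bodies stand in for the real playbook actions, which are described in the bullets above):

```shell
# $status comes from the awk pipeline in Step 6.5; default to pending
# if it was never set (e.g. the checks query itself failed).
status="${status:-pending}"

case "$status" in
  success)
    echo "record candidate with CI-measured fitness; proceed to Step 7" ;;
  failure)
    echo "enter the bounded fix-retry loop; never post 'accepted'" ;;
  pending)
    echo "record fitness: null, status: pending-ci; no speculative comment" ;;
  *)
    echo "unknown status '$status' -- treat as failure" >&2 ;;
esac
```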

Option B — post speculatively, reconcile on workflow_run.

Add a new workflow .github/workflows/autoloop-reconcile.yml triggered on workflow_run: { workflows: [CI], types: [completed] }. It finds the iteration comment for the head SHA (grep for the SHA in the bot's recent comments on the associated PR + program issue), reads the CI conclusion, and edits the comment in place.

More moving parts; doesn't stall the main workflow. Useful if Option A's wait pushes the 30-min-schedule programs over their wall-clock cap.
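
A minimal shape for that reconciliation workflow (the permissions block and the comment-lookup step are assumptions; only the `workflow_run` trigger comes from the description above):

```yaml
# .github/workflows/autoloop-reconcile.yml -- sketch only
name: Autoloop reconcile
on:
  workflow_run:
    workflows: [CI]
    types: [completed]

permissions:
  issues: write
  pull-requests: write

jobs:
  reconcile:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/github-script@v7
        with:
          script: |
            const sha = context.payload.workflow_run.head_sha;
            const conclusion = context.payload.workflow_run.conclusion;
            // Find the bot's iteration comment containing `sha` on the
            // associated PR + program issue, then PATCH its body so the
            // "Pending CI" line reflects `conclusion`. The lookup and
            // edit details are left to the implementation.
```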

Prefer A. Fall back to B only if wait budgets become a real problem.

Fix 2 — run the benchmark in CI and expose fitness

Add a job to .github/workflows/ci.yml (or a new benchmark.yml that runs on PRs touching autoloop/*-evolve branches):

  benchmark:
    if: startsWith(github.head_ref, 'autoloop/') && contains(github.head_ref, '-evolve')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v2
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install pandas numpy

      - name: Run OpenEvolve benchmark
        id: bench
        run: |
          # Resolve the program directory from the branch name:
          #   autoloop/<program-name>  →  .autoloop/programs/<program-name>/
          PROGRAM="${GITHUB_HEAD_REF#autoloop/}"
          PROGRAM_DIR=".autoloop/programs/$PROGRAM"
          [ -f "$PROGRAM_DIR/code/benchmark.ts" ] || {
            echo "No benchmark.ts for program $PROGRAM — skipping."
            exit 0
          }

          # Run the evaluation command from program.md — for OpenEvolve
          # programs this emits {"fitness": <ratio>, "tsb_mean_ms": …, "pandas_mean_ms": …}
          # on stdout.
          result_json=$(bash "$PROGRAM_DIR/evaluate.sh" 2>/tmp/bench-stderr || true)
          echo "$result_json" > /tmp/bench-result.json
          cat /tmp/bench-result.json

          fitness=$(jq -r '.fitness // "null"' /tmp/bench-result.json)
          echo "fitness=$fitness" >> "$GITHUB_OUTPUT"
          echo "result_json=$(jq -c . /tmp/bench-result.json)" >> "$GITHUB_OUTPUT"

      - name: Upload benchmark result
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-result
          path: /tmp/bench-result.json
          # the early "skipping" exit leaves no result file behind
          if-no-files-found: ignore

      - name: Attach fitness as check-run annotation
        uses: actions/github-script@v7
        with:
          script: |
            const fitness = "${{ steps.bench.outputs.fitness }}" || "null";
            // result_json is empty when the benchmark step was skipped
            const result = JSON.parse(`${{ steps.bench.outputs.result_json }}` || "{}");
            await github.rest.checks.create({
              ...context.repo,
              name: "OpenEvolve benchmark",
              head_sha: context.sha,
              status: "completed",
              conclusion: fitness === "null" ? "neutral" : "success",
              output: {
                title: `fitness=${fitness}`,
                summary: "```json\n" + JSON.stringify(result, null, 2) + "\n```",
              },
            });

Requires evaluate.sh (or the ## Evaluation script) to be invocable standalone. The current program.md has a bash block that works if extracted into .autoloop/programs/<name>/evaluate.sh — do that as part of this issue so both the autoloop agent and CI use the exact same command.
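
A minimal shape for that extracted script, assuming the two benchmarks each print a single mean-milliseconds number and that fitness is the pandas/tsb ratio (both assumptions; match whatever the `## Evaluation` block in program.md actually does):

```shell
#!/usr/bin/env bash
# .autoloop/programs/<name>/evaluate.sh -- sketch. Emits the
# {"fitness": ..., "tsb_mean_ms": ..., "pandas_mean_ms": ...} JSON
# that the CI benchmark step parses with jq.
set -euo pipefail
cd "$(dirname "$0")"

if command -v bun >/dev/null && command -v python3 >/dev/null \
   && [ -f code/benchmark.ts ]; then
  tsb_ms=$(bun run code/benchmark.ts)      # assumed: prints mean ms only
  pandas_ms=$(python3 code/benchmark.py)   # assumed: prints mean ms only
  # fitness = pandas/tsb ratio (assumption; >1 means faster than pandas)
  jq -n --argjson t "$tsb_ms" --argjson p "$pandas_ms" \
    '{fitness: ($p / $t), tsb_mean_ms: $t, pandas_mean_ms: $p}'
else
  # Toolchain or benchmark missing (e.g. the agent sandbox): report null
  # so callers can tell "unmeasured" apart from a real ratio.
  jq -n '{fitness: null, tsb_mean_ms: null, pandas_mean_ms: null}'
fi
```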

Then the autoloop agent reads the fitness from the check-run output (or the uploaded artifact) when reconciling in Step 6.5. Matching API:

  fitness=$(gh api "repos/$GITHUB_REPOSITORY/commits/$SHA/check-runs" \
    --jq '.check_runs[] | select(.name == "OpenEvolve benchmark") | .output.title' \
    | sed -n 's/^fitness=//p')

Ordering and dependencies

  • Fix 2 is independent and can land first — it just adds fitness measurement capability; nothing consumes it yet.
  • Fix 1 depends on Fix 2 being live for the "record fitness in the population" part to work end-to-end. Without Fix 2, Fix 1 would wait for CI and then still have no fitness to record. That's still better than today (at least the correctness status is real and timely) — so Fix 1 can ship partially before Fix 2 is done.
  • Both depend on the firewall allowlist fix (releaseassets.githubusercontent.com in network.allowed:) being in place so bun is installable — but that's a CI runner concern, and CI runners are unrestricted. The sandbox firewall only affects in-agent runs; CI sees the open internet. So this is a non-issue for CI.

Cleanup of already-posted stale comments

Existing iteration comments on #189 and PR #190 (e.g. iteration 1 at 04:03 UTC) are permanently stale. Options:

  • Leave them. Historical record; they accurately reflect what the agent knew at the time.
  • Edit them retroactively once Fix 1 lands, prepending an [Updated: CI result <conclusion>] marker to the title. Cheap one-off script using the GitHub API.

Prefer leaving them. The bug is the ongoing one — new iterations get real results from Fix 1 onward.

Acceptance

  • The next iteration comment on an OpenEvolve program issue shows ✅ Accepted (or ❌ Rejected, etc.) with a numeric fitness value — no more "pending CI (sandbox has no bun)".
  • The state file's ## 🧬 Population section has fitness: <number> on every candidate, no more null.
  • CI on autoloop-evolve PRs runs an OpenEvolve benchmark check alongside the existing Test/Lint/Build/Validate checks.
  • Fix-retry loop engages when CI is red (this part is already in #176; verify it still works under the new Step 6.5 ordering).
