
OpenEvolve iteration comments stuck on 'Pending CI'; fitness never populated — wait for CI + run benchmark in CI #196

@mrjf

Description

Symptom

Every iteration comment on an OpenEvolve program (e.g. tsb-perf-evolve's program issue #189, and its draft PR #190) is stuck showing ⏳ Pending CI with Metric: pending CI (sandbox has no bun). The comments never get updated, even when CI finishes and is green.

Concrete example from #189 iteration 1 (2026-04-23 04:03 UTC):

🤖 Iteration 1 — ⏳ Pending CI (link to run 24815695413)
- Operator: Exploration (first run — seeding Island 1 diversity)
- Island: 1 — parallel-typed-arrays · comparison
- Change: Replace boxed {v,i} pair allocation with indirect Uint32Array index sort + NaN pre-partition
- Metric: pending CI (sandbox has no bun)
- Commit: 24bbe85

Meanwhile the actual CI for PR #190's latest commit b230a018 shows:

Test & Lint               SUCCESS
Validate Python Examples  SUCCESS
Build                     SUCCESS

The iteration is correct — tests pass, types check, build succeeds. The comment is 90 minutes stale. There is no path that ever updates it.

Root cause — two separate problems

Problem 1: iteration comment posts before CI, never reconciles

The agent's per-iteration flow is currently:

  1. Push commit.
  2. Attempt in-sandbox self-evaluation.
  3. Sandbox has no bun (firewall blocks releaseassets.githubusercontent.com), so fitness comes back null.
  4. Post iteration comment to the program issue + PR with status ⏳ Pending CI and metric placeholder.
  5. End the workflow run.

CI runs after the workflow ends. Nothing loops back to read CI's outcome and rewrite the iteration comment. The "pending" label is frozen.

This was meant to be handled by the CI-gated acceptance loop from #176 ("Gate autoloop iteration acceptance on CI green with a bounded fix-retry loop"). That change landed on 2026-04-21. But the OpenEvolve strategy playbook (.autoloop/strategies/openevolve/strategy.md) overrides Step 5 of the generic loop with its own "post iteration comment" sequence, and that override does not inherit the CI wait. OpenEvolve's Step 8 says "fold through to the default loop" — but by the time we get there the comment has already been posted with pending status.

Problem 2: fitness isn't measured in CI either

Even if we fix Problem 1 and the workflow waits for CI before posting, CI currently only runs Test & Lint / Validate Python Examples / Build. It does not run the benchmark from .autoloop/programs/tsb-perf-evolve/code/benchmark.ts + benchmark.py. So "CI green" tells us correctness held — not what the fitness ratio is.

For OpenEvolve programs the fitness is the whole point. A comment that says "correctness: ✅" but leaves fitness: pending measurement is not useful. The population subsection in the state file has the same problem — candidate entries go in with fitness: null, and the MAP-Elites eviction rule can't decide who to keep when fitnesses are unknown.
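
To make the eviction ambiguity concrete, here is a jq sketch of a naive per-cell "keep the best" rule (the field names are assumptions about the population file's shape, not its actual schema):

```shell
# Two candidates competing for the same MAP-Elites cell, one unmeasured.
# jq orders null below every number, so max_by silently treats the
# unmeasured candidate as worst -- it gets evicted even if it is better.
echo '[{"id":"a","fitness":1.8},{"id":"b","fitness":null}]' \
  | jq -r 'max_by(.fitness) | .id'
```

Any rule built on naive comparison makes this silent, wrong choice; the eviction logic needs an explicit "unmeasured" state instead.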

The benchmark requires bun + python3 + pandas on a real runner, not the agent sandbox. The sandbox can't run it, and CI isn't configured to either.

Fix

Two coordinated changes.

Fix 1 — make iteration comments reconcile with CI

Pick one of these approaches:

Option A (preferred) — block the workflow until CI finishes.

Make the OpenEvolve playbook's Step 5 ("post iteration comment") explicitly call the CI-wait helper before posting. The agent's flow becomes:

  1. Push commit.
  2. Resolve the PR number for the pushed branch (existing_pr from autoloop.json, or gh pr list --head as fallback).
  3. gh pr checks <pr> --watch --interval 30 --fail-fast — block until every required check terminates.
  4. Parse the conclusion (SUCCESS / FAILURE / CANCELLED / …).
  5. If FAILURE, enter the fix-retry loop from #176 (up to 5 attempts).
  6. Only now post the iteration comment with real status (✅ Accepted / ❌ Rejected / ⚠️ Error) and real metrics.

Downside: the autoloop workflow run now takes as long as CI. Fine for a 6h-schedule program; less fine for every 30m. The playbook already has a 60-min wall-clock cap, so we're bounded.

Sketch of the prompt addition in .autoloop/strategies/openevolve/strategy.md between Step 6 (Evaluate) and Step 7 (Update the population):

### Step 6.5. Wait for CI

The in-sandbox evaluation from Step 6 cannot verify correctness or measure
fitness in realistic conditions (no `bun`, no `pandas` reliably installed).
Real validation is CI on the pushed commit. Before recording the candidate in
the population or posting any iteration comment, wait for CI:

```bash
PR=$(jq -r '.existing_pr // empty' /tmp/gh-aw/autoloop.json)
if [ -z "$PR" ]; then
  PR=$(gh pr list --head autoloop/{program-name} --json number -q '.[0].number')
fi
gh pr checks "$PR" --watch --interval 30 --fail-fast || true
status=$(gh pr checks "$PR" --json conclusion,state -q '.[] | (.conclusion // .state // "")' \
  | awk 'BEGIN { r="success" }
         /^(FAILURE|CANCELLED|TIMED_OUT|ACTION_REQUIRED|STARTUP_FAILURE|STALE)$/ { r="failure" }
         /^(PENDING|QUEUED|IN_PROGRESS|WAITING|REQUESTED)$/ { if (r=="success") r="pending" }
         END { print r }')
```

Branch on `$status`:

- `success` → record the candidate with fitness from the CI benchmark artifact (see Fix 2 below), proceed to Step 7.
- `failure` → enter the fix-retry loop from the default workflow; do NOT post "accepted" comments. On exhausted budget, mark the candidate `status: error` in the population with `fitness: null` and `pause_reason: "ci-fix-exhausted: <signature>"`.
- `pending` (ran out of wait budget) → don't post a speculative comment. Record the candidate with `fitness: null`, `status: pending-ci`, and leave a single reconciliation-pending comment the next run will overwrite.
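
The three-way branch can be sketched as a shell case over `$status` (the echo bodies stand in for the real playbook actions, which are described in the bullets above):

```shell
# $status comes from the awk pipeline in Step 6.5; default to pending
# if it was never set (e.g. the checks query itself failed).
status="${status:-pending}"

case "$status" in
  success)
    echo "record candidate with CI-measured fitness; proceed to Step 7" ;;
  failure)
    echo "enter the bounded fix-retry loop; never post 'accepted'" ;;
  pending)
    echo "record fitness: null, status: pending-ci; no speculative comment" ;;
  *)
    echo "unknown status '$status' -- treat as failure" >&2 ;;
esac
```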

Option B — post speculatively, reconcile on workflow_run.

Add a new workflow .github/workflows/autoloop-reconcile.yml triggered on workflow_run: { workflows: [CI], types: [completed] }. It finds the iteration comment for the head SHA (grep for the SHA in the bot's recent comments on the associated PR + program issue), reads the CI conclusion, and edits the comment in place.

More moving parts; doesn't stall the main workflow. Useful if Option A's wait pushes the 30-min-schedule programs over their wall-clock cap.
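
A minimal shape for that reconciliation workflow (the permissions block and the comment-lookup step are assumptions; only the `workflow_run` trigger comes from the description above):

```yaml
# .github/workflows/autoloop-reconcile.yml -- sketch only
name: Autoloop reconcile
on:
  workflow_run:
    workflows: [CI]
    types: [completed]

permissions:
  issues: write
  pull-requests: write

jobs:
  reconcile:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/github-script@v7
        with:
          script: |
            const sha = context.payload.workflow_run.head_sha;
            const conclusion = context.payload.workflow_run.conclusion;
            // Find the bot's iteration comment containing `sha` on the
            // associated PR + program issue, then PATCH its body so the
            // "Pending CI" line reflects `conclusion`. The lookup and
            // edit details are left to the implementation.
```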

Prefer A. Fall back to B only if wait budgets become a real problem.

Fix 2 — run the benchmark in CI and expose fitness

Add a job to .github/workflows/ci.yml (or a new benchmark.yml that runs on PRs touching autoloop/*-evolve branches):

  benchmark:
    if: startsWith(github.head_ref, 'autoloop/') && contains(github.head_ref, '-evolve')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v2
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install pandas numpy

      - name: Run OpenEvolve benchmark
        id: bench
        run: |
          # Resolve the program directory from the branch name:
          #   autoloop/<program-name>  →  .autoloop/programs/<program-name>/
          PROGRAM="${GITHUB_HEAD_REF#autoloop/}"
          PROGRAM_DIR=".autoloop/programs/$PROGRAM"
          [ -f "$PROGRAM_DIR/code/benchmark.ts" ] || {
            echo "No benchmark.ts for program $PROGRAM — skipping."
            exit 0
          }

          # Run the evaluation command from program.md — for OpenEvolve
          # programs this emits {"fitness": <ratio>, "tsb_mean_ms": …, "pandas_mean_ms": …}
          # on stdout.
          result_json=$(bash "$PROGRAM_DIR/evaluate.sh" 2>/tmp/bench-stderr || true)
          echo "$result_json" > /tmp/bench-result.json
          cat /tmp/bench-result.json

          fitness=$(jq -r '.fitness // "null"' /tmp/bench-result.json)
          echo "fitness=$fitness" >> "$GITHUB_OUTPUT"
          echo "result_json=$(jq -c . /tmp/bench-result.json)" >> "$GITHUB_OUTPUT"

      - name: Upload benchmark result
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-result
          path: /tmp/bench-result.json
          # the early "skipping" exit leaves no result file behind
          if-no-files-found: ignore

      - name: Attach fitness as check-run annotation
        uses: actions/github-script@v7
        with:
          script: |
            const fitness = "${{ steps.bench.outputs.fitness }}" || "null";
            // result_json is empty when the benchmark step was skipped
            const result = JSON.parse(`${{ steps.bench.outputs.result_json }}` || "{}");
            await github.rest.checks.create({
              ...context.repo,
              name: "OpenEvolve benchmark",
              head_sha: context.sha,
              status: "completed",
              conclusion: fitness === "null" ? "neutral" : "success",
              output: {
                title: `fitness=${fitness}`,
                summary: "```json\n" + JSON.stringify(result, null, 2) + "\n```",
              },
            });

Requires evaluate.sh (or the ## Evaluation script) to be invocable standalone. The current program.md has a bash block that works if extracted into .autoloop/programs/<name>/evaluate.sh — do that as part of this issue so both the autoloop agent and CI use the exact same command.
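
A minimal shape for that extracted script, assuming the two benchmarks each print a single mean-milliseconds number and that fitness is the pandas/tsb ratio (both assumptions; match whatever the `## Evaluation` block in program.md actually does):

```shell
#!/usr/bin/env bash
# .autoloop/programs/<name>/evaluate.sh -- sketch. Emits the
# {"fitness": ..., "tsb_mean_ms": ..., "pandas_mean_ms": ...} JSON
# that the CI benchmark step parses with jq.
set -euo pipefail
cd "$(dirname "$0")"

if command -v bun >/dev/null && command -v python3 >/dev/null \
   && [ -f code/benchmark.ts ]; then
  tsb_ms=$(bun run code/benchmark.ts)      # assumed: prints mean ms only
  pandas_ms=$(python3 code/benchmark.py)   # assumed: prints mean ms only
  # fitness = pandas/tsb ratio (assumption; >1 means faster than pandas)
  jq -n --argjson t "$tsb_ms" --argjson p "$pandas_ms" \
    '{fitness: ($p / $t), tsb_mean_ms: $t, pandas_mean_ms: $p}'
else
  # Toolchain or benchmark missing (e.g. the agent sandbox): report null
  # so callers can tell "unmeasured" apart from a real ratio.
  jq -n '{fitness: null, tsb_mean_ms: null, pandas_mean_ms: null}'
fi
```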

Then the autoloop agent reads the fitness from the check-run output (or the uploaded artifact) when reconciling in Step 6.5. Matching API:

  fitness=$(gh api "repos/$GITHUB_REPOSITORY/commits/$SHA/check-runs" \
    --jq '.check_runs[] | select(.name == "OpenEvolve benchmark") | .output.title' \
    | sed -n 's/^fitness=//p')

Ordering and dependencies

  • Fix 2 is independent and can land first — it just adds fitness measurement capability; nothing consumes it yet.
  • Fix 1 depends on Fix 2 being live for the "record fitness in the population" part to work end-to-end. Without Fix 2, Fix 1 would wait for CI and then still have no fitness to record. That's still better than today (at least the correctness status is real and timely) — so Fix 1 can ship partially before Fix 2 is done.
  • Both depend on the firewall allowlist fix (releaseassets.githubusercontent.com in network.allowed:) being in place so bun is installable — but that's a CI runner concern, and CI runners are unrestricted. The sandbox firewall only affects in-agent runs; CI sees the open internet. So this is a non-issue for CI.

Cleanup of already-posted stale comments

Existing iteration comments on #189 and PR #190 (e.g. iteration 1 at 04:03 UTC) are permanently stale. Options:

  • Leave them. Historical record; they accurately reflect what the agent knew at the time.
  • Edit them retroactively once Fix 1 lands, prepending an [Updated: CI result <conclusion>] marker to the title. Cheap one-off script using the GitHub API.

Prefer leaving them. The bug is the ongoing one — new iterations get real results from Fix 1 onward.

Acceptance

  • The next iteration comment on an OpenEvolve program issue shows ✅ Accepted (or ❌ Rejected, etc.) with a numeric fitness value — no more "pending CI (sandbox has no bun)".
  • The state file's ## 🧬 Population section has fitness: <number> on every candidate, no more null.
  • CI on autoloop-evolve PRs runs an OpenEvolve benchmark check alongside the existing Test/Lint/Build/Validate checks.
  • Fix-retry loop engages when CI is red (this part is already in #176; verify it still works under the new Step 6.5 ordering).
