41 changes: 41 additions & 0 deletions .autoloop/programs/tsb-perf-evolve/evaluate.sh
@@ -0,0 +1,41 @@
#!/usr/bin/env bash
# Evaluator for the tsb-perf-evolve OpenEvolve program.
#
# Both the autoloop agent (Step 6 of the OpenEvolve playbook) and CI (the
# `benchmark` job in .github/workflows/ci.yml) invoke this script so they
# produce comparable fitness numbers from identical commands.
#
# Output: a single JSON line on stdout with one of these shapes
# {"fitness": <number>, "tsb_mean_ms": <number>, "pandas_mean_ms": <number>}
# {"fitness": null, "rejected_reason": "<string>"}
#
# Exit code is always 0 — failures are encoded in the JSON so callers can
# parse the result uniformly. Diagnostics go to stderr.

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "$SCRIPT_DIR/../../.." && pwd)"

cd "$REPO_ROOT"

# 1. Validity — existing tests for sortValues must still pass.
if ! bun test tests/core/series.sortValues.test.ts >/tmp/perf-evolve-tests.log 2>&1; then
  echo '{"fitness": null, "rejected_reason": "tests failed"}'
  exit 0
fi

# 2. Benchmark — tsb side. Guard the pipeline so an unexpected crash still
# honors the "always exit 0, failures in JSON" contract above.
tsb_ms=$(bun run "$SCRIPT_DIR/code/benchmark.ts" \
  | python3 -c "import json,sys; print(json.load(sys.stdin)['mean_ms'])") || {
  echo '{"fitness": null, "rejected_reason": "tsb benchmark failed"}'
  exit 0
}

# 3. Benchmark — pandas side. Skip gracefully if pandas isn't available.
if ! python3 -c 'import pandas' 2>/dev/null; then
  pip3 install pandas --quiet 2>/dev/null || true
fi
pd_ms=$(python3 "$SCRIPT_DIR/code/benchmark.py" \
  | python3 -c "import json,sys; print(json.load(sys.stdin)['mean_ms'])") || {
  echo '{"fitness": null, "rejected_reason": "pandas benchmark failed"}'
  exit 0
}

# 4. Fitness = ratio. Lower is better.
ratio=$(python3 -c "print(${tsb_ms} / ${pd_ms})") || {
  echo '{"fitness": null, "rejected_reason": "ratio computation failed"}'
  exit 0
}
echo "{\"fitness\": ${ratio}, \"tsb_mean_ms\": ${tsb_ms}, \"pandas_mean_ms\": ${pd_ms}}"
39 changes: 19 additions & 20 deletions .autoloop/programs/tsb-perf-evolve/program.md
@@ -55,26 +55,25 @@ Population state lives in the state file on the `memory/autoloop` branch under t
## Evaluation

```bash
set -euo pipefail

# 1. Validity — existing tests for sortValues must still pass.
bun test tests/core/series.sortValues.test.ts >/tmp/perf-evolve-tests.log 2>&1 || {
  echo '{"fitness": null, "rejected_reason": "tests failed"}'
  exit 0
}

# 2. Benchmark — tsb side.
tsb_ms=$(bun run .autoloop/programs/tsb-perf-evolve/code/benchmark.ts | python3 -c "import json,sys; print(json.load(sys.stdin)['mean_ms'])")

# 3. Benchmark — pandas side. Skip gracefully if pandas isn't available.
if ! python3 -c 'import pandas' 2>/dev/null; then
  pip3 install pandas --quiet 2>/dev/null || true
fi
pd_ms=$(python3 .autoloop/programs/tsb-perf-evolve/code/benchmark.py | python3 -c "import json,sys; print(json.load(sys.stdin)['mean_ms'])")

# 4. Fitness = ratio. Lower is better.
ratio=$(python3 -c "print(${tsb_ms} / ${pd_ms})")
echo "{\"fitness\": ${ratio}, \"tsb_mean_ms\": ${tsb_ms}, \"pandas_mean_ms\": ${pd_ms}}"
bash .autoloop/programs/tsb-perf-evolve/evaluate.sh
```

The actual evaluator lives in `evaluate.sh` next to this file so the autoloop
agent (Step 6 of the OpenEvolve playbook) and CI (the `benchmark` job in
`.github/workflows/ci.yml`) invoke the **exact same** command and produce
comparable fitness numbers. See that script for details.

It runs the validity tests, then the tsb and pandas benchmarks, and prints a
single JSON line on stdout:

```json
{"fitness": <number>, "tsb_mean_ms": <number>, "pandas_mean_ms": <number>}
```

or, if validity failed:

```json
{"fitness": null, "rejected_reason": "tests failed"}
```

The metric is `fitness` (= `tsb_mean_ms / pandas_mean_ms`). **Lower is better.** A value below `1.0` means tsb is now faster than pandas on this workload.
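
For a concrete reading of the metric, here is a worked example with hypothetical numbers (nothing below is a measured result):

```bash
# Hypothetical means, for illustration only; real numbers come from evaluate.sh.
tsb_ms=42.0
pd_ms=56.0
python3 -c "print(${tsb_ms} / ${pd_ms})"   # prints 0.75: fitness 0.75, so tsb is 25% faster
```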
48 changes: 47 additions & 1 deletion .autoloop/strategies/openevolve/strategy.md
@@ -75,6 +75,51 @@ Edit only the files listed in `program.md`'s Target section. The diff style for

Run the evaluation command from `program.md`. Parse the metric.

The in-sandbox evaluation is a *cheap pre-filter only* — the agent sandbox often cannot install `bun`, run `python3 -c 'import pandas'`, or otherwise reproduce realistic conditions (the `releaseassets.githubusercontent.com` firewall block is the common culprit). A null/missing metric here is **not** grounds for rejecting the candidate; that decision is deferred to Step 6.5.
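
As a rough sketch of how Step 6 can treat a null metric as informational rather than disqualifying (the fallback JSON and the deferral message here are illustrative, not part of the evaluator's contract):

```bash
# Minimal Step 6 pre-filter sketch. Assumes the evaluation command from
# program.md prints its usual single-line JSON on stdout.
result=$(bash .autoloop/programs/{program-name}/evaluate.sh 2>/dev/null) \
  || result='{"fitness": null, "rejected_reason": "sandbox evaluation crashed"}'
fitness=$(echo "$result" | jq -r '.fitness // "null"')
if [ "$fitness" = "null" ]; then
  # Not a rejection: push the candidate anyway and let CI measure it in Step 6.5.
  echo "in-sandbox fitness unavailable; deferring to CI" >&2
fi
```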

### Step 6.5. Wait for CI

Before recording the candidate in the population (Step 7) or posting *any* iteration comment on the program issue / PR, wait for CI on the pushed commit. CI is the authoritative source of both correctness (Test & Lint / Build / Validate Python Examples) and fitness (the `OpenEvolve benchmark` check, which runs `bash .autoloop/programs/{program-name}/evaluate.sh` on a real runner with `bun` + `python3` + `pandas` installed).

This step extends — and ties into — the generic `Step 5a → 5b → 5c` flow described in the autoloop workflow. OpenEvolve's only added requirement is that you reach Step 5c (or the budget-exhausted handler) **before** writing the iteration comment; never write the comment speculatively right after a push.

```bash
# Resolve the PR — prefer the pre-step lookup, fall back to gh.
PR=$(jq -r '.existing_pr // empty' /tmp/gh-aw/autoloop.json 2>/dev/null || true)
if [ -z "$PR" ]; then
  PR=$(gh pr list --head autoloop/{program-name} --json number -q '.[0].number')
fi

# Block until every required check terminates (or the wall-clock cap fires).
gh pr checks "$PR" --watch --interval 30 --fail-fast || true

# Determine an aggregate status. Same awk classifier as Step 5a in the
# generic autoloop playbook — keep them in sync.
status=$(gh pr checks "$PR" --json conclusion,state \
  -q '.[] | (.conclusion // .state // "")' \
  | awk '
    BEGIN { r = "success" }
    /^(FAILURE|CANCELLED|TIMED_OUT|ACTION_REQUIRED|STARTUP_FAILURE|STALE)$/ { r = "failure" }
    /^(PENDING|QUEUED|IN_PROGRESS|WAITING|REQUESTED)$/ { if (r == "success") r = "pending" }
    END { print r }')

# Read the fitness from the OpenEvolve benchmark check-run (created by the
# `benchmark` job in .github/workflows/ci.yml). Title format: `fitness=<num>`
# or `fitness=null`. SHA = the HEAD of the PR after the latest push/fix.
SHA=$(gh pr view "$PR" --json headRefOid -q '.headRefOid')
fitness=$(gh api "repos/${GITHUB_REPOSITORY}/commits/${SHA}/check-runs" \
  --jq '.check_runs[] | select(.name == "OpenEvolve benchmark") | .output.title' \
  | sed -n 's/^fitness=//p' | head -n1)
```

Branch on `$status`:

- **`success`** → record the candidate in the population with `fitness: <number>` from the check-run (or `fitness: null` only if the `OpenEvolve benchmark` check explicitly reported it that way — e.g., correctness held but the benchmark itself errored). Proceed to Step 7. The iteration comment is `✅ Accepted` with the real numeric fitness.
- **`failure`** → enter the fix-retry loop from the generic autoloop Step 5b (up to 5 attempts, no-progress guard, 60-min wall-clock cap). Do **not** post an "accepted" comment. On a successful fix, loop back through the `gh pr checks --watch` block above on the new HEAD. On exhausted budget, mark the candidate `status: error` in the population with `fitness: null` and `pause_reason: "ci-fix-exhausted: <signature>"`, and post a `❌ Rejected` (or `⚠️ Error`) iteration comment that links to the failing run.
- **`pending`** (the wall-clock cap fired before CI concluded) → don't post a speculative `⏳ Pending CI` comment. Record the candidate in the population with `fitness: null` and `status: pending-ci`, and leave a single reconciliation-pending comment on the PR/issue that the next iteration's Step 6.5 is allowed to overwrite when it reads the now-concluded status for this same SHA.

In all three branches, the iteration comment posted to the program issue and PR must reflect *terminal* state — never `⏳ Pending CI` as a permanent label. Comments live forever; the pending placeholder is what produced the bug this step exists to fix.
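
Condensed as a sketch (the `record_candidate`, `post_iteration_comment`, `run_ci_fix_retry_loop`, and `post_reconciliation_pending_comment` helpers are hypothetical stand-ins for the state-file and `gh` operations described above):

```bash
# Sketch only; all helper functions are hypothetical.
case "$status" in
  success)
    record_candidate "fitness=${fitness}"            # numeric, or null if the check-run said so
    post_iteration_comment "✅ Accepted (fitness=${fitness})"
    ;;
  failure)
    run_ci_fix_retry_loop "$PR"                      # generic Step 5b: up to 5 attempts, 60-min cap;
    ;;                                               # on exhaustion: status error, ❌/⚠️ comment
  pending)
    record_candidate "fitness=null status=pending-ci"
    post_reconciliation_pending_comment "$PR"        # next iteration's Step 6.5 may overwrite
    ;;
esac
```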

### Step 7. Update the population

Regardless of whether the iteration is accepted or rejected at the branch level, the candidate has been tried and should be recorded in the population — the population is a memory of what's been explored, not just what's been kept.
@@ -88,7 +133,8 @@ Append a new entry to the `## 🧬 Population` subsection in the state file usin

Continue with the normal autoloop Step 5 (Accept or Reject → commit / discard, update state file's Machine State, Iteration History, Lessons Learned, etc.) as defined in the workflow. The only additional requirements from OpenEvolve are:

- The Iteration History entry must include `operator`, `parent_id(s)`, `island`, and `fitness` fields (in addition to the normal status/change/metric/notes).
- The Iteration History entry must include `operator`, `parent_id(s)`, `island`, and `fitness` fields (in addition to the normal status/change/metric/notes). The `fitness` value comes from the `OpenEvolve benchmark` check-run resolved in Step 6.5 — never from the in-sandbox Step 6 estimate.
- The iteration comment posted to the program issue and PR must use the terminal status from Step 6.5 (`✅ Accepted` / `❌ Rejected` / `⚠️ Error` / `⏸ Pending-CI` only when the wall-clock cap genuinely fired). Never post `⏳ Pending CI` as a final state — that placeholder is what Step 6.5 exists to eliminate.
- Lessons Learned additions should be phrased as *transferable heuristics* about the problem space, not as reports of what this iteration did. (E.g. "Hex layouts dominate grid layouts above n=20" — not "Iteration 17 tried a hex layout.")
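
For instance, a hypothetical Iteration History entry carrying the required fields might look like this (the shape and the `$STATE_FILE` path are assumptions; the authoritative template lives in the state file itself):

```bash
# Hypothetical entry shape appended to the Iteration History section.
cat >> "$STATE_FILE" <<'EOF'
- iteration: 18
  status: accepted
  change: "branchless comparator in sortValues merge step"
  operator: mutate
  parent_ids: [cand-012]
  island: 2
  fitness: 0.94   # from the OpenEvolve benchmark check-run (Step 6.5), not Step 6
EOF
```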

## Feature dimensions
109 changes: 109 additions & 0 deletions .github/workflows/ci.yml
@@ -11,6 +11,7 @@ on:

permissions:
  contents: read
  checks: write

jobs:
  test:
@@ -76,3 +77,111 @@ jobs:

      - name: Validate Python playground examples
        run: python scripts/validate-python-examples.py playground/

  benchmark:
    # Run the OpenEvolve benchmark for autoloop *-evolve PRs so the autoloop
    # agent can read a real fitness number from CI (see .autoloop/strategies/
    # openevolve/strategy.md, Step 6.5). The sandbox the agent runs in cannot
    # install bun reliably and so cannot measure fitness itself.
    name: OpenEvolve benchmark
    if: |
      (github.event_name == 'pull_request' && startsWith(github.head_ref, 'autoloop/') && contains(github.head_ref, '-evolve'))
      || (github.event_name == 'push' && startsWith(github.ref_name, 'autoloop/') && contains(github.ref_name, '-evolve'))
    runs-on: ubuntu-latest
    permissions:
      contents: read
      checks: write
    steps:
      - uses: actions/checkout@v4

      - name: Setup Bun
        uses: oven-sh/setup-bun@v2
        with:
          bun-version: latest

      - name: Install dependencies
        run: bun install

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install Python dependencies
        run: pip install pandas numpy

      - name: Resolve program directory
        id: program
        run: |
          # Resolve the program directory from the branch name:
          #   autoloop/<program-name> → .autoloop/programs/<program-name>/
          BRANCH="${GITHUB_HEAD_REF:-${GITHUB_REF_NAME}}"
          PROGRAM="${BRANCH#autoloop/}"
          PROGRAM_DIR=".autoloop/programs/${PROGRAM}"
          echo "program=${PROGRAM}" >> "$GITHUB_OUTPUT"
          echo "program_dir=${PROGRAM_DIR}" >> "$GITHUB_OUTPUT"
          if [ -x "${PROGRAM_DIR}/evaluate.sh" ]; then
            echo "has_evaluator=true" >> "$GITHUB_OUTPUT"
          else
            echo "No evaluate.sh for program '${PROGRAM}' — skipping benchmark." >&2
            echo "has_evaluator=false" >> "$GITHUB_OUTPUT"
          fi

      - name: Run OpenEvolve benchmark
        id: bench
        if: steps.program.outputs.has_evaluator == 'true'
        run: |
          PROGRAM_DIR="${{ steps.program.outputs.program_dir }}"
          # evaluate.sh is contracted to always exit 0 and encode failures in
          # the JSON, but we tolerate non-zero exits anyway and fall back to a
          # null fitness so the check-run still gets created.
          set +e
          bash "${PROGRAM_DIR}/evaluate.sh" >/tmp/bench-result.json 2>/tmp/bench-stderr
          rc=$?
          set -e
          if [ ! -s /tmp/bench-result.json ]; then
            echo "{\"fitness\": null, \"rejected_reason\": \"evaluator produced no output (exit ${rc})\"}" \
              > /tmp/bench-result.json
          fi
          cat /tmp/bench-result.json
          fitness=$(jq -r '.fitness // "null"' /tmp/bench-result.json)
          echo "fitness=${fitness}" >> "$GITHUB_OUTPUT"
          # Compact JSON for the check-run output below.
          echo "result_json=$(jq -c . /tmp/bench-result.json)" >> "$GITHUB_OUTPUT"

      - name: Upload benchmark result
        if: steps.program.outputs.has_evaluator == 'true'
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-result
          path: /tmp/bench-result.json

      - name: Attach fitness as check-run
        if: steps.program.outputs.has_evaluator == 'true'
        uses: actions/github-script@v7
        env:
          FITNESS: ${{ steps.bench.outputs.fitness }}
          RESULT_JSON: ${{ steps.bench.outputs.result_json }}
        with:
          script: |
            const fitness = process.env.FITNESS;
            let result;
            try {
              result = JSON.parse(process.env.RESULT_JSON);
            } catch {
              result = { raw: process.env.RESULT_JSON };
            }
            const sha = context.payload.pull_request
              ? context.payload.pull_request.head.sha
              : context.sha;
            await github.rest.checks.create({
              ...context.repo,
              name: "OpenEvolve benchmark",
              head_sha: sha,
              status: "completed",
              conclusion: fitness === "null" ? "neutral" : "success",
              output: {
                title: `fitness=${fitness}`,
                summary: "```json\n" + JSON.stringify(result, null, 2) + "\n```",
              },
            });