With #182 merged, the tsb-perf-evolve program is fully scaffolded:
.autoloop/programs/tsb-perf-evolve/program.md — Goal, Target, Evaluation, validity invariants.
.autoloop/programs/tsb-perf-evolve/strategy/alphaevolve.md — customized playbook: 5 islands (comparison sort, indirect typed-array, packed-keys DSU, non-comparison/radix, hybrid), feature dimensions (storage × algorithm), rut-breaking overrides.
.autoloop/programs/tsb-perf-evolve/strategy/prompts/{mutation,crossover}.md — problem-specific framing.
.autoloop/programs/tsb-perf-evolve/code/{benchmark.ts,benchmark.py,config.yaml} — the evaluator.
But the scaffold doesn't run end-to-end yet. This issue enumerates the concrete gaps that must close before the program can enter normal autoloop scheduling, and walks through the first few expected iterations so maintainers know what to watch for.
Prerequisite gaps
1. The validity-oracle test file doesn't exist
The Evaluation step runs bun test tests/core/series.sortValues.test.ts as the validity check. That file is not in the repo — tests/core/ currently holds only natsort.test.ts and searchsorted.test.ts. Every candidate will fail validity on the first command.
Action — write tests/core/series.sortValues.test.ts before enabling the program. It must cover the invariants that strategy/alphaevolve.md lists in its validity pre-check:
Numeric with NaN — both naPosition: "first" and naPosition: "last".
Ascending and descending for numeric, string, and mixed-dtype Series.
Empty Series returns an empty Series (same dtype, same name).
Index alignment: for every element of the output, the index at that position must be the original index of the input row that the value came from. Not "the sorted index array," the originating row's index.
Public signature unchanged: sortValues(ascending = true, naPosition: "first" | "last" = "last"): Series<T> — use an expectTypeOf / assertType check if the framework supports it.
The tests should be written against the current behaviour of src/core/series.ts:714. Use pandas' Series.sort_values as a semantic reference where behaviour is ambiguous (NaN ordering, stability, etc.) — the program's whole premise is that tsb should match pandas semantics.
Suggested layout:

```typescript
// tests/core/series.sortValues.test.ts
import { describe, expect, it } from "bun:test";
import { Series } from "../../src/index.ts";

describe("Series.sortValues — numeric with NaN", () => {
  it("ascending, naPosition='last' (default)", () => { /* … */ });
  it("ascending, naPosition='first'", () => { /* … */ });
  it("descending, naPosition='last'", () => { /* … */ });
  it("descending, naPosition='first'", () => { /* … */ });
  it("preserves original indices in output", () => { /* … */ });
});

describe("Series.sortValues — string", () => { /* … */ });
describe("Series.sortValues — mixed dtype", () => { /* … */ });
describe("Series.sortValues — empty Series", () => { /* … */ });

describe("Series.sortValues — property checks", () => {
  it("output length equals input length", () => { /* … */ });
  it("output is a permutation of the input", () => { /* … */ });
  it("applying sortValues twice is idempotent up to ties", () => { /* … */ });
});
```
Aim for 20–30 test cases. The more thorough this file is, the harder it is for a candidate that improves speed at the cost of correctness to slip through.
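The index-alignment invariant is the easiest one to get wrong, so here it is expressed over plain arrays (the real test would exercise `Series.sortValues` instead of this inline sort; the helper name is illustrative):

```typescript
// Index-alignment invariant, shown on plain arrays: sorting values must carry
// each value's ORIGINAL index label along, not independently sort the labels.
function sortWithIndex(values: number[], index: string[]) {
  const order = values.map((_, i) => i).sort((a, b) => values[a] - values[b]);
  return {
    values: order.map((i) => values[i]),
    index: order.map((i) => index[i]), // originating row's label, in output order
  };
}
```

For `values = [30, 10, 20]` with `index = ["a", "b", "c"]`, the sorted output must pair `10` with `"b"`, `20` with `"c"`, and `30` with `"a"`.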
2. State machinery assumes "higher is better"; this program is "lower is better"
The autoloop state file's best_metric is updated with max-wins semantics by default. tsb-perf-evolve's fitness is tsb_mean_ms / pandas_mean_ms — lower is better, with < 1.0 meaning tsb beats pandas.
Action — add a frontmatter field to program.md that the scheduler/agent reads to flip the comparison direction:
```yaml
---
schedule: every 6h
metric_direction: lower  # default is `higher`; this program ratchets downward
---
```
And update .autoloop/programs/<name>/program.md's convention (if not already documented) so other programs can opt in. The scheduler and the agent prompt both need to honour this:
Scheduler: when computing "is this iteration an improvement?" use new < best if metric_direction: lower, else new > best.
Agent prompt (state-file update step): when appending to Iteration History and computing the delta, sign it accordingly.
If this isn't yet supported, file as a blocker in .github/workflows/scripts/autoloop_scheduler.py + the state-file-update section of .github/workflows/autoloop.md.
3. Bun must be installable in the autoloop sandbox
benchmark.ts runs under bun run. The agent sandbox still can't install bun reliably because releaseassets.githubusercontent.com is firewall-blocked — the exact problem that blocked iteration 233 of build-tsb. See comment on issue #1.
Action — fix the firewall allowlist before enabling this program. Add releaseassets.githubusercontent.com to network.allowed in .github/workflows/autoloop.md:
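Assuming the allowlist is a YAML list (the other entries shown are placeholders, not the file's actual contents), the change could look like:

```yaml
network:
  allowed:
    # …existing entries…
    - releaseassets.githubusercontent.com  # bun release binaries
```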
Without this, every evaluation falls back to the tsc-only path and bun run fails silently, making every candidate look rejected. This program is pointless without a real bun runtime.
4. Pandas must be available on the runner
benchmark.py imports pandas. The Evaluation script's fallback pip3 install pandas --quiet 2>/dev/null || true swallows install failures — meaning in a sandbox without network access the benchmark run fails to start and the emitted pd_ms= field ends up empty. The evaluator should fail loudly in that case, not silently emit a broken JSON line.
Action — make benchmark.py assert pandas is importable, and update the Evaluation script to exit 1 if pandas can't be installed:
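A fail-loud sketch for benchmark.py; the helper name is an assumption, and the real script would call it before `import pandas as pd`:

```python
import importlib.util
import sys


def require_module(name: str) -> None:
    """Exit non-zero if `name` is not importable, instead of letting the
    benchmark limp on and emit a JSON line with an empty pd_ms field."""
    if importlib.util.find_spec(name) is None:
        print(f"FATAL: {name} is not importable; aborting benchmark", file=sys.stderr)
        sys.exit(1)


# In benchmark.py: require_module("pandas"), then import pandas as pd.
```

On the Evaluation-script side, the corresponding change is to drop the `|| true` so a failed `pip3 install pandas` exits 1 instead of being swallowed.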
This plays well with the strategy's "status: error with fitness=null" path — candidates aren't silently accepted with bogus numbers.
5. Scheduler must actually pick this program
Currently the scheduler can starve non-file-based programs (see #162) and has a tiebreak that always picks the same program when state is absent. The tsb-perf-evolve program lands on schedule: every 6h, alongside perf-comparison (every 30m) and build-tsb-pandas-typescript-migration (every 30m). If #162 isn't merged yet, the new program will never run.
Action — confirm #162's fix is merged first, or manually run the program via workflow_dispatch with the program: tsb-perf-evolve input for the first few iterations to seed state.
First-run bootstrap (Iteration 1)
The population is empty on first run. strategy/alphaevolve.md's Step 2 deterministic override says: "If the population is empty or has one member → force Exploration." But before any exploration, we need a baseline — otherwise fitness values are meaningless with nothing to compare against.
Bootstrap protocol for the very first iteration:
Do not modify src/core/series.ts. The current implementation is the baseline.
Run the evaluator with the existing code to get baseline_fitness = tsb_mean_ms / pandas_mean_ms. Record it.
Create ## 🧬 Population in the state file and append Candidate c001 as:
### Candidate c001 · island 0 · fitness <baseline_fitness> · gen 1

- **Operator**: (baseline — not produced by an operator)
- **Parent(s)**: []
- **Feature cell**: boxed-pairs · comparison
- **Approach**: Current `Array.prototype.sort` over `{v, i}` pairs with NaN-aware comparator.
- **Status**: ✅ accepted (baseline)
- **Notes**: Baseline established; tsb=<tsb_ms>ms / pandas=<pd_ms>ms / ratio=<baseline_fitness>.

Code:

\`\`\`typescript
// Current body of sortValues, copied verbatim from src/core/series.ts
\`\`\`

---
Set best_metric = baseline_fitness in the Machine State table.
Set Iteration Count = 1, last_run = now, recent_statuses = ["accepted"].
Do not commit any code change. This iteration is a state-file-only commit (plus the new test file — see gap 1).
The strategy playbook handles iterations 2+ normally.
Expected first ~10 iterations (sanity check)
Use this as a correctness check for whoever watches the first runs:
| Iter | Op override fires? | Expected operator | Expected target island | What the agent does |
|------|--------------------|-------------------|------------------------|---------------------|
| 1 | — | bootstrap | 0 (comparison) | Seed baseline (above). |
| 2 | pop size == 1 → force Exploration | Exploration | 1 (indirect typed-array) | Rewrite with Float64Array values + Uint32Array index, sort indices by values[i] - values[j], gather. |
| 3 | — | sample (likely Exploitation or Exploration) | whichever island won so far | If island 1 now has the best fitness, refine it; else try island 2. |
| 4 | — | sample | likely island 2 (packed DSU) | Try BigInt-packed (value, index) keys, single sort, split. |
| 5 | — | sample | likely island 3 (radix) | Dtype-dispatched radix sort for float64. |
| 6–7 | — | exploitation dominant | best island | Tune the winner. |
| 8 | — | possibly Crossover | two islands | Combine, e.g., radix's speed for large n with indirect's simplicity for small n → hybrid. |
| 9 | — | sample | — | — |
| 10 | — | maybe Migration | cross-island | Port the fastest algorithm's invariant trick (e.g., "values in a typed array avoid boxing") into another island's candidate. |
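The iteration-2 rewrite can be sketched as follows. This is a hedged illustration: `argsortNumeric` is a hypothetical helper, only the naPosition='last' path is shown, and the real change would live inside `Series.sortValues`:

```typescript
// Island-1 "indirect typed-array" idea: sort an index array by the values it
// points at, then gather values and index labels in one pass (no boxed pairs).
function argsortNumeric(values: Float64Array, ascending = true): Uint32Array {
  const n = values.length;
  const idx = new Uint32Array(n);
  for (let i = 0; i < n; i++) idx[i] = i;
  return idx.sort((a, b) => {
    const va = values[a], vb = values[b];
    const aNaN = Number.isNaN(va), bNaN = Number.isNaN(vb);
    // NaNs sink to the end regardless of direction (naPosition='last')
    if (aNaN || bNaN) return aNaN && bNaN ? a - b : aNaN ? 1 : -1;
    if (va !== vb) return (va < vb ? -1 : 1) * (ascending ? 1 : -1);
    return a - b; // equal values: index tiebreak keeps the order stable
  });
}
```

The gather step then builds the output value and index arrays by reading input positions idx[0..n-1] in order, which is exactly the index-alignment invariant the test file must enforce.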
If by iteration 10 the population doesn't span at least 3 of the 5 islands, something's wrong — most likely the agent is ignoring the operator override and just re-doing exploitation. Investigate by reading the Iteration History entries, which must record operator, island, parent_id(s), fitness.
Success criteria
Short-term (≤ 20 iterations): population spans ≥ 3 islands; at least one candidate is faster than the baseline (fitness < baseline_fitness); no correctness-breaking candidates accepted (all accepts passed the existing test suite).
Medium-term (≤ 50 iterations): best fitness < 1.0 on at least one dataset configuration — tsb is genuinely faster than pandas for this function on one workload.
Long-term: the lessons-learned section of the state file contains transferable heuristics ("typed-array index sort beats object-pair sort above n≈10k") rather than iteration logs. These heuristics become the seeds for similar programs on other hot functions.
Observability — where to watch
State file on the memory/autoloop branch: tsb-perf-evolve.md.
## ⚙️ Machine State — current best, iteration count, paused?
## 🧬 Population — candidates tried, with fitness and feature cells.
## 📚 Lessons Learned — transferable heuristics.
Code branch autoloop/tsb-perf-evolve — accumulated accepted commits.
When to expand beyond sortValues
Once tsb-perf-evolve's loop is stable and producing wins, the natural next step is to apply the same pattern to other hot functions. Two options:
One program per function (tsb-perf-evolve-sortvalues, tsb-perf-evolve-merge, …). Each has its own population, islands, lessons. Strong isolation; lots of scheduler load.
One rotating program (tsb-perf-evolve) that cycles through a list of functions. Lessons cross-pollinate; but per-function state is cramped.
Defer that decision until this first instance shows results.
Acceptance
tests/core/series.sortValues.test.ts exists and covers the invariants listed above.
program.md has metric_direction: lower and the scheduler honours it.
Firewall allowlist includes releaseassets.githubusercontent.com; bun install succeeds in the sandbox.
benchmark.py / Evaluation script fails loudly on missing pandas rather than emitting null fitness.
The first iteration seeds Candidate c001 as baseline (state-file-only commit) and the second iteration begins real evolution.
After 20 iterations: population spans ≥ 3 islands, at least one candidate strictly improves on baseline.
Out of scope for this issue
Porting the strategy to other hot functions (covered by a follow-up after first results).
Automatic detection of "which function is hottest" (for now, we pick sortValues manually).
Cross-function lesson sharing via a shared lessons-learned bank.