With #182 merged, the tsb-perf-evolve program is fully scaffolded:
.autoloop/programs/tsb-perf-evolve/program.md — Goal, Target, Evaluation, validity invariants.
.autoloop/programs/tsb-perf-evolve/strategy/alphaevolve.md — customized playbook: 5 islands (comparison sort, indirect typed-array, packed-keys DSU, non-comparison/radix, hybrid), feature dimensions (storage × algorithm), rut-breaking overrides.
.autoloop/programs/tsb-perf-evolve/strategy/prompts/{mutation,crossover}.md — problem-specific framing.
.autoloop/programs/tsb-perf-evolve/code/{benchmark.ts,benchmark.py,config.yaml} — the evaluator.
But the scaffold doesn't run end-to-end yet. This issue enumerates the concrete gaps that must close before the program can enter normal autoloop scheduling, and walks through the first few expected iterations so maintainers know what to watch for.
Prerequisite gaps
1. The validity-oracle test file doesn't exist
The Evaluation step runs bun test tests/core/series.sortValues.test.ts as the validity check. That file is not in the repo — tests/core/ currently holds only natsort.test.ts and searchsorted.test.ts. Every candidate will fail validity on the first command.
Action — write tests/core/series.sortValues.test.ts before enabling the program. It must cover the invariants that strategy/alphaevolve.md lists in its validity pre-check:
Numeric with NaN — both naPosition: "first" and naPosition: "last".
Ascending and descending for numeric, string, and mixed-dtype Series.
Empty Series returns an empty Series (same dtype, same name).
Index alignment: for every element of the output, the index at that position must be the original index of the input row that the value came from. Not "the sorted index array," the originating row's index.
Public signature unchanged: sortValues(ascending = true, naPosition: "first" | "last" = "last"): Series<T> — use an expectTypeOf / assertType check if the framework supports it.
The tests should be written against the current behaviour of src/core/series.ts:714. Use pandas' Series.sort_values as a semantic reference where behaviour is ambiguous (NaN ordering, stability, etc.) — the program's whole premise is that tsb should match pandas semantics.
Suggested layout:

```typescript
// tests/core/series.sortValues.test.ts
import { describe, expect, it } from "bun:test";
import { Series } from "../../src/index.ts";

describe("Series.sortValues — numeric with NaN", () => {
  it("ascending, naPosition='last' (default)", () => { /* … */ });
  it("ascending, naPosition='first'", () => { /* … */ });
  it("descending, naPosition='last'", () => { /* … */ });
  it("descending, naPosition='first'", () => { /* … */ });
  it("preserves original indices in output", () => { /* … */ });
});

describe("Series.sortValues — string", () => { /* … */ });
describe("Series.sortValues — mixed dtype", () => { /* … */ });
describe("Series.sortValues — empty Series", () => { /* … */ });

describe("Series.sortValues — property checks", () => {
  it("output length equals input length", () => { /* … */ });
  it("output is a permutation of the input", () => { /* … */ });
  it("applying sortValues twice is idempotent up to ties", () => { /* … */ });
});
```
Aim for 20–30 test cases. The more thorough this file is, the harder it is for a candidate that improves speed at the cost of correctness to slip through.
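The index-alignment invariant is the easiest one to get wrong, so here it is expressed over plain arrays (the real test would exercise `Series.sortValues` instead of this inline sort; the helper name is illustrative):

```typescript
// Index-alignment invariant, shown on plain arrays: sorting values must carry
// each value's ORIGINAL index label along, not independently sort the labels.
function sortWithIndex(values: number[], index: string[]) {
  const order = values.map((_, i) => i).sort((a, b) => values[a] - values[b]);
  return {
    values: order.map((i) => values[i]),
    index: order.map((i) => index[i]), // originating row's label, in output order
  };
}
```

For `values = [30, 10, 20]` with `index = ["a", "b", "c"]`, the sorted output must pair `10` with `"b"`, `20` with `"c"`, and `30` with `"a"`.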
2. State machinery assumes "higher is better"; this program is "lower is better"
The autoloop state file's best_metric is updated with max-wins semantics by default. tsb-perf-evolve's fitness is tsb_mean_ms / pandas_mean_ms — lower is better, with < 1.0 meaning tsb beats pandas.
Action — add a frontmatter field to program.md that the scheduler/agent reads to flip the comparison direction:
```yaml
---
schedule: every 6h
metric_direction: lower  # default is `higher`; this program ratchets downward
---
```
And update .autoloop/programs/<name>/program.md's convention (if not already documented) so other programs can opt in. The scheduler and the agent prompt both need to honour this:
Scheduler: when computing "is this iteration an improvement?" use new < best if metric_direction: lower, else new > best.
Agent prompt (state-file update step): when appending to Iteration History and computing the delta, sign it accordingly.
If this isn't yet supported, file as a blocker in .github/workflows/scripts/autoloop_scheduler.py + the state-file-update section of .github/workflows/autoloop.md.
3. Bun must be installable in the autoloop sandbox
benchmark.ts runs under bun run. The agent sandbox still can't install bun reliably because releaseassets.githubusercontent.com is firewall-blocked — the exact problem that blocked iteration 233 of build-tsb. See comment on issue #1.
Action — fix the firewall allowlist before enabling this program. Add releaseassets.githubusercontent.com to network.allowed in .github/workflows/autoloop.md:
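Assuming the allowlist is a YAML list (the other entries shown are placeholders, not the file's actual contents), the change could look like:

```yaml
network:
  allowed:
    # …existing entries…
    - releaseassets.githubusercontent.com  # bun release binaries
```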
Without this, every evaluation falls back to the tsc-only path and bun run fails silently, making every candidate look rejected. This program is pointless without a real bun runtime.
4. Pandas must be available on the runner
benchmark.py imports pandas. The Evaluation script's fallback pip3 install pandas --quiet 2>/dev/null || true swallows install failures — meaning in a sandbox without network access the benchmark run fails to start and the emitted pd_ms= field ends up empty. The evaluator should fail loudly in that case, not silently emit a broken JSON line.
Action — make benchmark.py assert pandas is importable, and update the Evaluation script to exit 1 if pandas can't be installed:
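A fail-loud sketch for benchmark.py; the helper name is an assumption, and the real script would call it before `import pandas as pd`:

```python
import importlib.util
import sys


def require_module(name: str) -> None:
    """Exit non-zero if `name` is not importable, instead of letting the
    benchmark limp on and emit a JSON line with an empty pd_ms field."""
    if importlib.util.find_spec(name) is None:
        print(f"FATAL: {name} is not importable; aborting benchmark", file=sys.stderr)
        sys.exit(1)


# In benchmark.py: require_module("pandas"), then import pandas as pd.
```

On the Evaluation-script side, the corresponding change is to drop the `|| true` so a failed `pip3 install pandas` exits 1 instead of being swallowed.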
This plays well with the strategy's "status: error with fitness=null" path — candidates aren't silently accepted with bogus numbers.
5. Scheduler must actually pick this program
Currently the scheduler can starve non-file-based programs (see #162) and has a tiebreak that always picks the same program when state is absent. The tsb-perf-evolve program lands on schedule: every 6h, alongside perf-comparison (every 30m) and build-tsb-pandas-typescript-migration (every 30m). If #162 isn't merged yet, the new program will never run.
Action — confirm #162's fix is merged first, or manually run the program via workflow_dispatch with the program: tsb-perf-evolve input for the first few iterations to seed state.
First-run bootstrap (Iteration 1)
The population is empty on first run. strategy/alphaevolve.md's Step 2 deterministic override says: "If the population is empty or has one member → force Exploration." But before any exploration, we need a baseline — otherwise fitness values are meaningless with nothing to compare against.
Bootstrap protocol for the very first iteration:
Do not modify src/core/series.ts. The current implementation is the baseline.
Run the evaluator with the existing code to get baseline_fitness = tsb_mean_ms / pandas_mean_ms. Record it.
Create ## 🧬 Population in the state file and append Candidate c001 as:
### Candidate c001 · island 0 · fitness <baseline_fitness> · gen 1

- **Operator**: (baseline — not produced by an operator)
- **Parent(s)**: []
- **Feature cell**: boxed-pairs · comparison
- **Approach**: Current `Array.prototype.sort` over `{v, i}` pairs with NaN-aware comparator.
- **Status**: ✅ accepted (baseline)
- **Notes**: Baseline established; tsb=<tsb_ms>ms / pandas=<pd_ms>ms / ratio=<baseline_fitness>.

Code:

\`\`\`typescript
// Current body of sortValues, copied verbatim from src/core/series.ts
\`\`\`

---
Set best_metric = baseline_fitness in the Machine State table.
Set Iteration Count = 1, last_run = now, recent_statuses = ["accepted"].
Do not commit any code change. This iteration is a state-file-only commit (plus the new test file — see gap 1).
The strategy playbook handles iterations 2+ normally.
Expected first ~10 iterations (sanity check)
Use this as a correctness check for whoever watches the first runs:
| Iter | Op override fires? | Expected operator | Expected target island | What the agent does |
|------|--------------------|-------------------|------------------------|---------------------|
| 1 | — | bootstrap | 0 (comparison) | Seed baseline (above). |
| 2 | pop size == 1 → force Exploration | Exploration | 1 (indirect typed-array) | Rewrite with Float64Array values + Uint32Array index, sort indices by values[i] - values[j], gather. |
| 3 | — | sample (likely Exploitation or Exploration) | whichever island won so far | If island 1 now has the best fitness, refine it; else try island 2. |
| 4 | — | sample | likely island 2 (packed DSU) | Try BigInt-packed (value, index) keys, single sort, split. |
| 5 | — | sample | likely island 3 (radix) | Dtype-dispatched radix sort for float64. |
| 6–7 | — | exploitation dominant | best island | Tune the winner. |
| 8 | — | possibly Crossover | two islands | Combine, e.g., radix's speed for large n with indirect's simplicity for small n → hybrid. |
| 9 | — | sample | — | — |
| 10 | — | maybe Migration | cross-island | Port the fastest algorithm's invariant trick (e.g., "values in a typed array avoid boxing") into another island's candidate. |
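The iteration-2 rewrite can be sketched as follows. This is a hedged illustration: `argsortNumeric` is a hypothetical helper, only the naPosition='last' path is shown, and the real change would live inside `Series.sortValues`:

```typescript
// Island-1 "indirect typed-array" idea: sort an index array by the values it
// points at, then gather values and index labels in one pass (no boxed pairs).
function argsortNumeric(values: Float64Array, ascending = true): Uint32Array {
  const n = values.length;
  const idx = new Uint32Array(n);
  for (let i = 0; i < n; i++) idx[i] = i;
  return idx.sort((a, b) => {
    const va = values[a], vb = values[b];
    const aNaN = Number.isNaN(va), bNaN = Number.isNaN(vb);
    // NaNs sink to the end regardless of direction (naPosition='last')
    if (aNaN || bNaN) return aNaN && bNaN ? a - b : aNaN ? 1 : -1;
    if (va !== vb) return (va < vb ? -1 : 1) * (ascending ? 1 : -1);
    return a - b; // equal values: index tiebreak keeps the order stable
  });
}
```

The gather step then builds the output value and index arrays by reading input positions idx[0..n-1] in order, which is exactly the index-alignment invariant the test file must enforce.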
If by iteration 10 the population doesn't span at least 3 of the 5 islands, something's wrong — most likely the agent is ignoring the operator override and just re-doing exploitation. Investigate by reading the Iteration History entries, which must record operator, island, parent_id(s), fitness.
Success criteria
Short-term (≤ 20 iterations): population spans ≥ 3 islands; at least one candidate is faster than the baseline (fitness < baseline_fitness); no correctness-breaking candidates accepted (all accepts passed the existing test suite).
Medium-term (≤ 50 iterations): best fitness < 1.0 on at least one dataset configuration — tsb is genuinely faster than pandas for this function on one workload.
Long-term: the lessons-learned section of the state file contains transferable heuristics ("typed-array index sort beats object-pair sort above n≈10k") rather than iteration logs. These heuristics become the seeds for similar programs on other hot functions.
Observability — where to watch
State file on the memory/autoloop branch: tsb-perf-evolve.md.
## ⚙️ Machine State — current best, iteration count, paused?
## 🧬 Population — candidates tried, with fitness and feature cells.
## 📚 Lessons Learned — transferable heuristics.
Code branch autoloop/tsb-perf-evolve — accumulated accepted commits.
When to expand beyond sortValues
Once tsb-perf-evolve's loop is stable and producing wins, the natural next step is to apply the same pattern to other hot functions. Two options:
One program per function (tsb-perf-evolve-sortvalues, tsb-perf-evolve-merge, …). Each has its own population, islands, lessons. Strong isolation; lots of scheduler load.
One rotating program (tsb-perf-evolve) that cycles through a list of functions. Lessons cross-pollinate; but per-function state is cramped.
Defer that decision until this first instance shows results.
Acceptance
tests/core/series.sortValues.test.ts exists and covers the invariants listed above.
program.md has metric_direction: lower and the scheduler honours it.
Firewall allowlist includes releaseassets.githubusercontent.com; bun install succeeds in the sandbox.
benchmark.py / Evaluation script fails loudly on missing pandas rather than emitting null fitness.
The first iteration seeds Candidate c001 as baseline (state-file-only commit) and the second iteration begins real evolution.
After 20 iterations: population spans ≥ 3 islands, at least one candidate strictly improves on baseline.
Out of scope for this issue
Porting the strategy to other hot functions (covered by a follow-up after first results).
Automatic detection of "which function is hottest" (for now, we pick sortValues manually).
Cross-function lesson sharing via a shared lessons-learned bank.