From ea53ad6e65be1340a1620ee7b535dbd7062ed35d Mon Sep 17 00:00:00 2001
From: Tim Neutkens
Date: Mon, 13 Apr 2026 13:34:53 +0200
Subject: [PATCH 1/2] bench: add A/B branch comparison workflow to BENCHMARKING.md

---
 bench/BENCHMARKING.md | 59 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 57 insertions(+), 2 deletions(-)

diff --git a/bench/BENCHMARKING.md b/bench/BENCHMARKING.md
index ea8195457384..813a4428b27b 100644
--- a/bench/BENCHMARKING.md
+++ b/bench/BENCHMARKING.md
@@ -138,7 +138,62 @@ for (const b of base) {
 NODE baseline-run candidate-run
 ```
 
-## 8. Noise control rules
+## 8. A/B branch comparison
+
+When comparing two branches (e.g. canary vs a PR), follow this workflow to get reliable numbers.
+
+**Start with a focused route, not the full suite.** The full suite takes ~3 minutes per run. Pick the route where your change has the largest proportional impact — typically the lightest route (`/`) for per-request overhead changes, or a specific streaming route for render pipeline changes.
+
+**Increase request counts for fast routes.** The default `--serial-requests=120` is too noisy for sub-2ms routes. Use at least 500 serial and 5000 load requests:
+
+```bash
+pnpm bench:render-pipeline \
+  --scenario=e2e \
+  --stream-mode=node \
+  --build=false \
+  --routes=/ \
+  --serial-requests=500 \
+  --load-requests=5000 \
+  --load-concurrency=80 \
+  --json-out=bench/render-pipeline/artifacts//results.json \
+  --artifact-dir=bench/render-pipeline/artifacts/
+```
+
+**Run at least 3 times per side.** A single run can swing 10–15% on light routes due to JIT warmup variance and system noise. Three runs let you average out outliers and spot whether a delta is real.
+
+**Compare absolute req/s, not just deltas.** Percentage deltas from a single pair of runs can be misleading. Line up the raw numbers side by side across all runs to see the full picture.
+
+**Watch for system state drift.** Running all baseline runs first, then all candidate runs, means the later runs may be affected by thermal throttling or background processes. If results look suspicious, interleave runs (baseline, candidate, baseline, candidate) to control for this.
+
+Example workflow:
+
+```bash
+# 1. Checkout baseline, build, run 3 times
+git checkout canary
+pnpm --filter=next build
+for i in 1 2 3; do
+  pnpm bench:render-pipeline --scenario=e2e --stream-mode=node --build=false \
+    --routes=/ --serial-requests=500 --load-requests=5000 --load-concurrency=80 \
+    --json-out=bench/render-pipeline/artifacts/baseline-$i/results.json \
+    --artifact-dir=bench/render-pipeline/artifacts/baseline-$i
+done
+
+# 2. Checkout candidate, build, run 3 times
+git checkout
+pnpm --filter=next build
+for i in 1 2 3; do
+  pnpm bench:render-pipeline --scenario=e2e --stream-mode=node --build=false \
+    --routes=/ --serial-requests=500 --load-requests=5000 --load-concurrency=80 \
+    --json-out=bench/render-pipeline/artifacts/candidate-$i/results.json \
+    --artifact-dir=bench/render-pipeline/artifacts/candidate-$i
+done
+
+# 3. Compare averages across runs
+```
+
+**Only run the full route suite once you've confirmed a signal on focused routes.** Use the full suite as a final check that the change doesn't regress other routes, not as the primary measurement.
+
+## 9. Noise control rules
 
 Use these rules to keep measurements trustworthy:
 
@@ -149,7 +204,7 @@ Use these rules to keep measurements trustworthy:
 - Prefer relative deltas across multiple runs over one-off absolute numbers.
 - When comparing e2e vs minimal-server scenarios, remember that e2e includes the full router-server overhead.
 
-## 9. Suggested iteration loop
+## 10. Suggested iteration loop
 
 1. Change one thing.
 2. Build (`pnpm --filter=next build`).

From d02a6e3b856b3f7545136d566bed690f038ec772 Mon Sep 17 00:00:00 2001
From: Tim Neutkens
Date: Mon, 13 Apr 2026 13:41:31 +0200
Subject: [PATCH 2/2] Update BENCHMARKING.md

---
 bench/BENCHMARKING.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/bench/BENCHMARKING.md b/bench/BENCHMARKING.md
index 813a4428b27b..c4605d0a6390 100644
--- a/bench/BENCHMARKING.md
+++ b/bench/BENCHMARKING.md
@@ -161,7 +161,7 @@ pnpm bench:render-pipeline \
 
 **Run at least 3 times per side.** A single run can swing 10–15% on light routes due to JIT warmup variance and system noise. Three runs let you average out outliers and spot whether a delta is real.
 
-**Compare absolute req/s, not just deltas.** Percentage deltas from a single pair of runs can be misleading. Line up the raw numbers side by side across all runs to see the full picture.
+**Compare absolute req/s, not deltas.** Percentage deltas from a single pair of runs can be misleading. Line up the raw numbers side by side across all runs to see the full picture.
 
 **Watch for system state drift.** Running all baseline runs first, then all candidate runs, means the later runs may be affected by thermal throttling or background processes. If results look suspicious, interleave runs (baseline, candidate, baseline, candidate) to control for this.
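
Review note on step 3 of the example workflow ("Compare averages across runs"): the patch leaves the comparison itself unspecified. Below is a minimal sketch of one way to line up the raw req/s numbers, offered as an illustration rather than as part of the bench tooling. The layout of `results.json` is not shown in this patch, so the sketch assumes the req/s value for each run has already been read off by hand. `mean` and `compareRuns` are hypothetical helper names, and the sample numbers are made up.

```javascript
// Hypothetical helper, not part of the bench tooling.
// Input: req/s values read off each run by hand (results.json layout is
// not assumed here).

function mean(values) {
  if (values.length === 0) throw new Error('need at least one run')
  return values.reduce((sum, v) => sum + v, 0) / values.length
}

function compareRuns(baseline, candidate) {
  const baselineMean = mean(baseline)
  const candidateMean = mean(candidate)
  return {
    baselineMean,
    candidateMean,
    // Spread (max - min) as a fraction of the mean: a crude noise estimate.
    baselineSpread:
      (Math.max(...baseline) - Math.min(...baseline)) / baselineMean,
    candidateSpread:
      (Math.max(...candidate) - Math.min(...candidate)) / candidateMean,
    // Relative change of the candidate mean versus the baseline mean.
    delta: (candidateMean - baselineMean) / baselineMean,
    // True when the two sets of runs do not overlap at all, i.e. every
    // candidate run is above (or below) every baseline run.
    separated:
      Math.min(...candidate) > Math.max(...baseline) ||
      Math.max(...candidate) < Math.min(...baseline),
  }
}

// Made-up req/s values for illustration only, not measured results.
const summary = compareRuns([1000, 1040, 960], [1100, 1090, 1110])
console.log(summary)
```

Reporting the spread and the `separated` flag next to the delta follows the advice in the patch: a delta smaller than the run-to-run spread, or runs that overlap, suggests the difference may be noise and that more or interleaved runs are needed.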