Bench vs p3 nightly publish by jotabulacios · Pull Request #590 · yetanotherco/lambda_vm

jotabulacios · 2026-05-18T19:24:03Z

New bench_vs_plonky3/ proves a shared Fibonacci AIR on both Lambda and Plonky3. Plonky3's MMCS is configured scalar (no vector lanes) so Merkle compression is one Keccak-f per call on both sides. bench_vs_plonky3/run.sh runs the bench, reports median + CV, and writes TSV + metrics to --report-dir.
Workflow runs the bench after the SP1 + ethrex steps and appends a "Lambda VM vs Plonky3" section to the same Slack post.

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

…ch_vs

…tion to bench_vs

--steps flag for nightly 1M/2M/4M/8M benchmarks

CLI flag

bench_vs/run.sh by switching to sed

… sweeps)

github-actions · 2026-05-19T19:11:57Z

Codex Code Review

Findings

Medium - Security / supply chain: bench_vs_plonky3/Cargo.toml:15 depends on Plonky3 git main, and the workflow explicitly runs cargo update to latest upstream before building on a self-hosted runner at .github/workflows/bench-vs-nightly.yml:65. That executes mutable third-party code in CI. Pin a reviewed rev for normal CI, or run this “latest upstream” job in an isolated runner with no durable local state/secrets.
Medium - Cryptographic correctness: lambda_fibonacci_pair.rs:273 builds boundary constraints from pub_inputs.initial_values.len() instead of self.num_sequences. A proof with fewer public inputs silently verifies a weaker statement with unconstrained sequences. Add an assert_eq!(pub_inputs.initial_values.len(), self.num_sequences) or make the public input shape fixed.
Low - Benchmark correctness: prove_bench.rs:360 tracks each tracing span with a single active_since. If Plonky3/rayon enters the same span concurrently, the first exit clears the timer and later exits are ignored, undercounting breakdown spans. Use an active-enter count and accumulate only when it drops to zero. The same pattern is duplicated in the ignored test.

I ran bash -n on the changed shell scripts and git diff --check; both passed. I could not run cargo check because rustup tried to write under read-only /home/runner/.rustup.

claude · 2026-05-19T19:13:51Z

+            echo -e "  ${RED}[${prover}] FAILED on log-rows=${log_rows} run ${run_i}${NC}"
+            cat "$out_file"


Medium — diagnostic output swallowed by command substitution

run_prover is called as lambda_median=$(run_prover lambda "$lr"). Everything written to stdout inside the function is captured, not printed. When the prover fails, these two lines write to stdout, so the error message and log file contents are silently discarded. The caller then exits with no explanation.

Redirect to stderr:

Suggested change

echo -e " ${RED}[${prover}] FAILED on log-rows=${log_rows} run ${run_i}${NC}"

cat "$out_file"

echo -e " ${RED}[${prover}] FAILED on log-rows=${log_rows} run ${run_i}${NC}" >&2

cat "$out_file" >&2

Same fix is needed for the second cat "$out_file" at line 370.

claude · 2026-05-19T19:14:01Z

+            && entry.active_since.is_none()
+        {
+            entry.active_since = Some(Instant::now());
+        }
+    }
+
+    fn on_exit(&self, id: &tracing::span::Id, _ctx: tracing_subscriber::layer::Context<'_, S>) {
+        if let Some(entry) = self.spans.lock().unwrap().get_mut(&id.into_u64())
+            && let Some(start) = entry.active_since.take()
+        {
+            entry.accumulated += start.elapsed();
+        }
+    }
+
+    fn on_close(&self, id: tracing::span::Id, _ctx: tracing_subscriber::layer::Context<'_, S>) {
+        if let Some(entry) = self.spans.lock().unwrap().remove(&id.into_u64()) {
+            let mut total = entry.accumulated;
+            if let Some(start) = entry.active_since {


Medium — P3TimingLayer systematically undercounts time under Rayon parallelism

When Rayon re-enters a span concurrently on N threads, on_enter is called N times in parallel. Only the first call sees active_since.is_none() and records a start time; the other N−1 threads skip. On on_exit, only the first thread finds active_since set and accumulates a duration; the remaining N−1 exits are no-ops. For parallel phases (FFT, Merkle), the reported span time is 1/N of the actual wall-clock contribution across all threads, making the breakdown numbers misleading.

The comment describes the intent ("only start timing on the first enter after each exit") but that intent was designed for a single-threaded re-entrant span, not for Rayon's concurrent multi-thread entry. For accurate wall-clock measurement in parallel code, consider using on_new_span/on_close instead, which fire exactly once per span lifetime regardless of how many threads enter it.

claude · 2026-05-19T19:14:11Z

+    let fri_params = FriParameters {
+        log_blowup: blowup.trailing_zeros() as usize,
+        log_final_poly_len: 0,
+        max_log_arity: 1,


Medium — hardcoded max_log_arity = 1 makes Plonky3 artificially slow in FRI

max_log_arity = 1 forces binary FRI folding (radix-2). With the nightly config (--log-rows 21, blowup=2), the LDE domain is 2^22, requiring 22 binary fold rounds. Typical Plonky3 production deployments use max_log_arity = 3 or 4 (arity 8/16), reducing fold rounds to ~7–8 and cutting FRI commit time significantly.

The README documents the scalar MMCS choice but not the folding arity choice. The published Lambda/P3 timing ratio is influenced by this sub-optimal P3 configuration that P3 would never use in production. Please either raise this value to match realistic P3 usage, or add a clear note in the README documenting why radix-2 was chosen and its impact on the reported ratio.

claude · 2026-05-19T19:14:18Z

+    let maxrss = unsafe { usage.assume_init().ru_maxrss };
+    #[cfg(target_os = "macos")]
+    {
+        Some((maxrss as u64).div_ceil(1024))


Low — maxrss cast from i64 to u64 without sign check

On Linux, ru_maxrss is i64. If it ever returns a negative value (kernel oddity, 32-bit platform, or future ABI change), the as u64 cast wraps silently to a huge number, producing a wildly incorrect RSS reading.

Suggested change

let maxrss = unsafe { usage.assume_init().ru_maxrss };

#[cfg(target_os = "macos")]

{

Some((maxrss as u64).div_ceil(1024))

let maxrss = unsafe { usage.assume_init().ru_maxrss };

if maxrss < 0 {

return None;

}

#[cfg(target_os = "macos")]

return Some((maxrss as u64).div_ceil(1024));

#[cfg(not(target_os = "macos"))]

return Some(maxrss as u64);

claude · 2026-05-19T19:14:24Z

+        struct SpanState {
+            name: String,
+            active_since: Option<std::time::Instant>,
+            accumulated: std::time::Duration,
+        }
+
+        struct P3TimingLayer {
+            spans: Mutex<HashMap<u64, SpanState>>,
+            results: SpanResults,
+        }
+
+        impl<
+            S: tracing::Subscriber + for<'lookup> tracing_subscriber::registry::LookupSpan<'lookup>,
+        > tracing_subscriber::Layer<S> for P3TimingLayer
+        {
+            fn on_new_span(
+                &self,
+                attrs: &tracing::span::Attributes<'_>,
+                id: &tracing::span::Id,
+                _ctx: tracing_subscriber::layer::Context<'_, S>,
+            ) {
+                let name = attrs.metadata().name().to_string();
+                self.spans.lock().unwrap().insert(
+                    id.into_u64(),
+                    SpanState {
+                        name,
+                        active_since: None,
+                        accumulated: std::time::Duration::ZERO,
+                    },
+                );
+            }
+
+            // Rayon can re-enter a span across threads, so only start timing on
+            // the first enter after each exit; accumulate every interval.
+            fn on_enter(
+                &self,
+                id: &tracing::span::Id,
+                _ctx: tracing_subscriber::layer::Context<'_, S>,
+            ) {
+                if let Some(entry) = self.spans.lock().unwrap().get_mut(&id.into_u64())
+                    && entry.active_since.is_none()
+                {
+                    entry.active_since = Some(std::time::Instant::now());
+                }
+            }
+
+            fn on_exit(
+                &self,
+                id: &tracing::span::Id,
+                _ctx: tracing_subscriber::layer::Context<'_, S>,
+            ) {
+                if let Some(entry) = self.spans.lock().unwrap().get_mut(&id.into_u64())
+                    && let Some(start) = entry.active_since.take()


Low — P3TimingLayer is duplicated verbatim from prove_bench.rs

SpanState and P3TimingLayer (plus the full tracing_subscriber::Layer impl) are copy-pasted from src/bin/prove_bench.rs:330–394, with only minor std::time::Instant qualification differences. These two copies will diverge. The implementation should live here in lib.rs (or a dedicated module) and be re-exported for use in the binary.

claude · 2026-05-19T19:14:45Z

Review Summary

This PR adds a new bench_vs_plonky3 crate with a Fibonacci STARK benchmark comparing Lambda VM against Plonky3, plus CI automation.

Medium

Diagnostic output swallowed by command substitution (run.sh:357–358): run_prover is called as lambda_median=$(run_prover ...). Failure messages and log output written to stdout are captured and discarded. Redirect to stderr. (See inline comment.)
P3TimingLayer undercounts time under Rayon parallelism (prove_bench.rs:362–379): When Rayon enters a span concurrently on N threads, only the first thread's interval is recorded; the other N−1 are silently dropped. CPU-parallel phases (FFT, Merkle) will show ~1/N of their real wall-clock contribution in the breakdown. (See inline comment.)
max_log_arity = 1 is non-representative of Plonky3 production config (plonky3_config.rs:66): Forcing radix-2 FRI folding at 2^22 domain requires 22 fold rounds; typical P3 production configs use arity 8–16 (~7–8 rounds). The headline Lambda/P3 ratio is influenced by this sub-optimal P3 setting. Please document the reasoning or adjust the arity. (See inline comment.)

Low

maxrss as u64 cast without sign check (prove_bench.rs:413–416): On Linux ru_maxrss is i64; a negative return value wraps silently to a huge number. Add a guard. (See inline comment.)
P3TimingLayer is duplicated between prove_bench.rs:330–394 and lib.rs:215–267; the two copies will diverge. Consolidate in lib.rs and re-export. (See inline comment.)

jotabulacios · 2026-05-22T17:54:03Z

/claude /codex

github-actions · 2026-05-22T17:56:12Z

Codex Code Review

Findings

High Security: The nightly workflow builds and runs unpinned code from Plonky3 main on a self-hosted runner. The new manifest uses git dependencies without a pinned rev, and the workflow explicitly refreshes them before building. Cargo build scripts or dependency code from a compromised/upstream-changed repo would execute on the persistent self-hosted runner. Pin the Plonky3 revision for CI, or run this job in an isolated disposable environment without sensitive filesystem/network access.
bench_vs_plonky3/Cargo.toml, .github/workflows/bench-vs-nightly.yml
Medium Bug: The benchmark intentionally updates Plonky3 to latest main, but the generated artifact only records the Lambda repo SHA and git_tree=dirty; it does not record the resolved Plonky3 commit. That makes nightly results non-reproducible and hard to audit, especially if performance changes or upstream breaks. Record the resolved git revision from Cargo.lock/cargo metadata, or archive the post-update lockfile in bench_vs_artifacts.
.github/workflows/bench-vs-nightly.yml, bench_vs_plonky3/run.sh

I attempted cargo check -p bench-vs-plonky3 --bin prove_bench --benches, but it could not run because rustup tried to create files under the read-only /home/runner/.rustup.

claude · 2026-05-22T17:57:56Z

+    let proof = p3_prove(&config, &air, trace, &pis);
+
+    group.bench_with_input(BenchmarkId::new("fibonacci", TRACE_LABEL), &ROWS, |b, _| {
+        b.iter(|| {


Medium — Benchmark measurement skew

matched_params_config() is called inside b.iter, so config construction (several hash/compressor objects + DFT + FRI params + PCS) is measured as part of every verify iteration. The Lambda verify bench constructs only a lightweight DefaultTranscript per iteration, making the two numbers not directly comparable.

Move config construction outside b.iter:

Suggested change

b.iter(|| {

group.bench_with_input(BenchmarkId::new("fibonacci", TRACE_LABEL), &ROWS, |b, _| {

let config = plonky3_config::matched_params_config();

b.iter(|| {

p3_verify(&config, &air, &proof, &pis).unwrap();

});

});

claude · 2026-05-22T17:58:04Z

+            m = s / n
+            v = (ss / n) - (m * m)
+            if (v < 0) v = 0
+            sd = sqrt(v)


Low — Population variance instead of sample variance

v = (ss / n) - (m * m) is the biased population variance (denominator n). The unbiased sample variance uses denominator n-1. For the default --runs 10, this underestimates the standard deviation by a factor of sqrt(n/(n-1)) ≈ 1.054, reporting CV ~5% lower than the statistically correct value.

For n this small, the difference is visible. Consider:

Suggested change

sd = sqrt(v)

v = (ss - n * m * m) / (n - 1)

(guard against n < 2 similarly to the existing n == 0 check)

claude · 2026-05-22T17:58:13Z

+p3-field = { git = "https://github.com/Plonky3/Plonky3.git" }
+p3-goldilocks = { git = "https://github.com/Plonky3/Plonky3.git" }
+p3-matrix = { git = "https://github.com/Plonky3/Plonky3.git" }
+p3-commit = { git = "https://github.com/Plonky3/Plonky3.git" }


Info — Unpinned git deps, intentional but worth documenting

All p3-* deps resolve to whatever HEAD on Plonky3's main is at cargo update time. The Cargo.lock pins a specific commit (de83ef4), and the nightly workflow explicitly runs cargo update -p p3-* before the bench step, so this is by design.

The only risk is that cargo build in a dev checkout will use the locked commit, which drifts from the nightly. That's generally fine for a benchmark crate, but it means "works locally" and "nightly" can diverge silently. Adding a comment near the deps block explaining that cargo update is expected before nightly runs would help future contributors.

claude · 2026-05-22T17:58:30Z

Review: Bench vs Plonky3

Overall this is a well-structured, well-documented benchmarking crate. The fairness choices (scalar MMCS, matching AIR shape, same blowup/queries) are clearly reasoned and the README is thorough. A few issues worth fixing:

Medium

Verify benchmark skew (benches/stark_comparison.rs:154): matched_params_config() is called inside b.iter, so config construction is measured on every verify iteration. The Lambda verify bench only constructs a lightweight DefaultTranscript per iteration. Move the config outside the iter closure — see inline comment.

Low

Population vs sample variance (run.sh cv_pct_file): Uses biased population variance (÷ n) rather than sample variance (÷ n−1). For the default 10 runs this reports CV ~5% lower than the statistically correct value — inline comment has a one-line fix.

Info

Unpinned git deps (Cargo.toml): Intentional for nightly tracking, but a brief comment near [dependencies] would help future contributors understand that cargo update -p p3-* is expected before nightly runs.
P3TimingLayer in parallel proofs: Rayon tasks can create multiple spans with the same name concurrently. Each closes independently, so --breakdown output shows repeated span names rather than one aggregated line. Fine for a debug tool, but worth knowing when reading breakdown output.

No security or correctness issues found in the AIR, prover, or verifier wiring.

MauroToscano and others added 30 commits February 25, 2026 19:00

bench vs others

f7f842b

Standarize guest

472ddc3

Fix decimals

88171d0

Benchmark vs sp1

ef7d5bb

Add lambda program

e11346a

Remove stray blank line from .gitignore

687d8d8

Apply suggestion from @gabrielbosio

0069659

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Add instruments

dabe54c

Merge branch 'main' of github.com:yetanotherco/lambda_vm

19e569e

Merge branch 'main' of github.com:yetanotherco/lambda_

3955884

Merge branch 'main' into bench_vs

15984c1

Merge branch 'bench_vs' of github.com:yetanotherco/lambda_vm into ben…

c2dca55

…ch_vs

Merge branch 'main' into bench_vs

8d043eb

Merge branch 'main' into bench_vs

4858ad8

Merge

5605c6b

Fix syscall number

af6aab3

Merge branch 'main' into bench_vs

a1169d9

Add runtime private inputs, nightly CI workflow, and 500M step projec…

687af82

…tion to bench_vs

Merge branch 'main' into bench_vs

5ea0c40

save work

0315a80

Fix bench_vs 5x projection inflation and add

3f3c7ac

--steps flag for nightly 1M/2M/4M/8M benchmarks

Merge branch 'main' into bench_vs

1288f27

Switch bench_vs projection to measured cycles and add --cycles

e9e9fbd

CLI flag

Fix grep pipefail in

1e1459e

bench_vs/run.sh by switching to sed

Merge branch 'main' into bench_vs

8d5c152

Add bench_vs_plonky3

2e65d5a

lint

1fd1a58

Replace summary.md with metrics.txt in bench_vs_plonky3 and add README

c9f9df9

remove line

ba4c7cd

add README

f7fe4d3

jotabulacios added 14 commits April 20, 2026 16:30

rm old files

d3d41b0

Merge branch 'main' into bench_vs_p3

60bf9a4

adress comments

3010b38

Merge branch 'main' into bench_vs_p3

734ebdc

Merge branch 'main' into bench_vs_p3

4bcd877

Migrate FibonacciPair AIR constraints

4b19d25

Add full metrics and phase breakdown to bench_vs_plonky3

7ee4cbf

Add Lambda-vs-Plonky3 sections to the nightly Slack post

6b57545

Trim Lambda-vs-Plonky3 nightly to headline-only (drop size and column…

82db83b

… sweeps)

solve conflicts

5f723f5

fix bench

c84a12c

Refresh Plonky3 to latest main

e9d5abb

cleanup

18f1f0c

Merge branch 'main' into bench_vs_p3_nightly_publish

14a42df

jotabulacios marked this pull request as ready for review May 19, 2026 19:08

claude Bot reviewed May 19, 2026

View reviewed changes

jotabulacios added 2 commits May 20, 2026 11:50

address comments

bb7d8ba

add missing file

cd41dd7

claude Bot reviewed May 22, 2026

View reviewed changes

		echo -e " ${RED}[${prover}] FAILED on log-rows=${log_rows} run ${run_i}${NC}"
		cat "$out_file"

Conversation

jotabulacios commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 19, 2026

Codex Code Review

Uh oh!

claude Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented May 19, 2026

Review Summary

Medium

Low

Uh oh!

jotabulacios commented May 22, 2026

Uh oh!

github-actions Bot commented May 22, 2026

Codex Code Review

Uh oh!

claude Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented May 22, 2026

Review: Bench vs Plonky3

Medium

Low

Info

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jotabulacios commented May 18, 2026 •

edited

Loading