Skip to content

Bench vs p3 nightly publish#590

Open
jotabulacios wants to merge 50 commits into
mainfrom
bench_vs_p3_nightly_publish
Open

Bench vs p3 nightly publish#590
jotabulacios wants to merge 50 commits into
mainfrom
bench_vs_p3_nightly_publish

Conversation

@jotabulacios
Copy link
Copy Markdown
Collaborator

@jotabulacios jotabulacios commented May 18, 2026

New bench_vs_plonky3/ proves a shared Fibonacci AIR on both Lambda and Plonky3. Plonky3's MMCS is configured scalar (no vector lanes) so Merkle compression is one Keccak-f per call on both sides. bench_vs_plonky3/run.sh runs the bench, reports median + CV, and writes TSV + metrics to --report-dir.
Workflow runs the bench after the SP1 + ethrex steps and appends a "Lambda VM vs Plonky3" section to the same Slack post.

@jotabulacios jotabulacios marked this pull request as ready for review May 19, 2026 19:08
@github-actions
Copy link
Copy Markdown

Codex Code Review

Findings

  1. Medium - Security / supply chain: bench_vs_plonky3/Cargo.toml:15 depends on Plonky3 git main, and the workflow explicitly runs cargo update to latest upstream before building on a self-hosted runner at .github/workflows/bench-vs-nightly.yml:65. That executes mutable third-party code in CI. Pin a reviewed rev for normal CI, or run this “latest upstream” job in an isolated runner with no durable local state/secrets.

  2. Medium - Cryptographic correctness: lambda_fibonacci_pair.rs:273 builds boundary constraints from pub_inputs.initial_values.len() instead of self.num_sequences. A proof with fewer public inputs silently verifies a weaker statement with unconstrained sequences. Add an assert_eq!(pub_inputs.initial_values.len(), self.num_sequences) or make the public input shape fixed.

  3. Low - Benchmark correctness: prove_bench.rs:360 tracks each tracing span with a single active_since. If Plonky3/rayon enters the same span concurrently, the first exit clears the timer and later exits are ignored, undercounting breakdown spans. Use an active-enter count and accumulate only when it drops to zero. The same pattern is duplicated in the ignored test.

I ran bash -n on the changed shell scripts and git diff --check; both passed. I could not run cargo check because rustup tried to write under read-only /home/runner/.rustup.

Comment thread bench_vs_plonky3/run.sh Outdated
Comment on lines +357 to +358
echo -e " ${RED}[${prover}] FAILED on log-rows=${log_rows} run ${run_i}${NC}"
cat "$out_file"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — diagnostic output swallowed by command substitution

run_prover is called as lambda_median=$(run_prover lambda "$lr"). Everything written to stdout inside the function is captured, not printed. When the prover fails, these two lines write to stdout, so the error message and log file contents are silently discarded. The caller then exits with no explanation.

Redirect to stderr:

Suggested change
echo -e " ${RED}[${prover}] FAILED on log-rows=${log_rows} run ${run_i}${NC}"
cat "$out_file"
echo -e " ${RED}[${prover}] FAILED on log-rows=${log_rows} run ${run_i}${NC}" >&2
cat "$out_file" >&2

Same fix is needed for the second cat "$out_file" at line 370.

Comment thread bench_vs_plonky3/src/bin/prove_bench.rs Outdated
Comment on lines +362 to +379
&& entry.active_since.is_none()
{
entry.active_since = Some(Instant::now());
}
}

fn on_exit(&self, id: &tracing::span::Id, _ctx: tracing_subscriber::layer::Context<'_, S>) {
if let Some(entry) = self.spans.lock().unwrap().get_mut(&id.into_u64())
&& let Some(start) = entry.active_since.take()
{
entry.accumulated += start.elapsed();
}
}

fn on_close(&self, id: tracing::span::Id, _ctx: tracing_subscriber::layer::Context<'_, S>) {
if let Some(entry) = self.spans.lock().unwrap().remove(&id.into_u64()) {
let mut total = entry.accumulated;
if let Some(start) = entry.active_since {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — P3TimingLayer systematically undercounts time under Rayon parallelism

When Rayon re-enters a span concurrently on N threads, on_enter is called N times in parallel. Only the first call sees active_since.is_none() and records a start time; the other N−1 threads skip. On on_exit, only the first thread finds active_since set and accumulates a duration; the remaining N−1 exits are no-ops. For parallel phases (FFT, Merkle), the reported span time is 1/N of the actual wall-clock contribution across all threads, making the breakdown numbers misleading.

The comment describes the intent ("only start timing on the first enter after each exit") but that intent was designed for a single-threaded re-entrant span, not for Rayon's concurrent multi-thread entry. For accurate wall-clock measurement in parallel code, consider using on_new_span/on_close instead, which fire exactly once per span lifetime regardless of how many threads enter it.

let fri_params = FriParameters {
log_blowup: blowup.trailing_zeros() as usize,
log_final_poly_len: 0,
max_log_arity: 1,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — hardcoded max_log_arity = 1 makes Plonky3 artificially slow in FRI

max_log_arity = 1 forces binary FRI folding (radix-2). With the nightly config (--log-rows 21, blowup=2), the LDE domain is 2^22, requiring 22 binary fold rounds. Typical Plonky3 production deployments use max_log_arity = 3 or 4 (arity 8/16), reducing fold rounds to ~7–8 and cutting FRI commit time significantly.

The README documents the scalar MMCS choice but not the folding arity choice. The published Lambda/P3 timing ratio is influenced by this sub-optimal P3 configuration that P3 would never use in production. Please either raise this value to match realistic P3 usage, or add a clear note in the README documenting why radix-2 was chosen and its impact on the reported ratio.

Comment thread bench_vs_plonky3/src/bin/prove_bench.rs Outdated
Comment on lines +413 to +416
let maxrss = unsafe { usage.assume_init().ru_maxrss };
#[cfg(target_os = "macos")]
{
Some((maxrss as u64).div_ceil(1024))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low — maxrss cast from i64 to u64 without sign check

On Linux, ru_maxrss is i64. If it ever returns a negative value (kernel oddity, 32-bit platform, or future ABI change), the as u64 cast wraps silently to a huge number, producing a wildly incorrect RSS reading.

Suggested change
let maxrss = unsafe { usage.assume_init().ru_maxrss };
#[cfg(target_os = "macos")]
{
Some((maxrss as u64).div_ceil(1024))
let maxrss = unsafe { usage.assume_init().ru_maxrss };
if maxrss < 0 {
return None;
}
#[cfg(target_os = "macos")]
return Some((maxrss as u64).div_ceil(1024));
#[cfg(not(target_os = "macos"))]
return Some(maxrss as u64);

Comment thread bench_vs_plonky3/src/lib.rs Outdated
Comment on lines +215 to +267
struct SpanState {
name: String,
active_since: Option<std::time::Instant>,
accumulated: std::time::Duration,
}

struct P3TimingLayer {
spans: Mutex<HashMap<u64, SpanState>>,
results: SpanResults,
}

impl<
S: tracing::Subscriber + for<'lookup> tracing_subscriber::registry::LookupSpan<'lookup>,
> tracing_subscriber::Layer<S> for P3TimingLayer
{
fn on_new_span(
&self,
attrs: &tracing::span::Attributes<'_>,
id: &tracing::span::Id,
_ctx: tracing_subscriber::layer::Context<'_, S>,
) {
let name = attrs.metadata().name().to_string();
self.spans.lock().unwrap().insert(
id.into_u64(),
SpanState {
name,
active_since: None,
accumulated: std::time::Duration::ZERO,
},
);
}

// Rayon can re-enter a span across threads, so only start timing on
// the first enter after each exit; accumulate every interval.
fn on_enter(
&self,
id: &tracing::span::Id,
_ctx: tracing_subscriber::layer::Context<'_, S>,
) {
if let Some(entry) = self.spans.lock().unwrap().get_mut(&id.into_u64())
&& entry.active_since.is_none()
{
entry.active_since = Some(std::time::Instant::now());
}
}

fn on_exit(
&self,
id: &tracing::span::Id,
_ctx: tracing_subscriber::layer::Context<'_, S>,
) {
if let Some(entry) = self.spans.lock().unwrap().get_mut(&id.into_u64())
&& let Some(start) = entry.active_since.take()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low — P3TimingLayer is duplicated verbatim from prove_bench.rs

SpanState and P3TimingLayer (plus the full tracing_subscriber::Layer impl) are copy-pasted from src/bin/prove_bench.rs:330–394, with only minor std::time::Instant qualification differences. These two copies will diverge. The implementation should live here in lib.rs (or a dedicated module) and be re-exported for use in the binary.

@claude
Copy link
Copy Markdown

claude Bot commented May 19, 2026

Review Summary

This PR adds a new bench_vs_plonky3 crate with a Fibonacci STARK benchmark comparing Lambda VM against Plonky3, plus CI automation.

Medium

  • Diagnostic output swallowed by command substitution (run.sh:357–358): run_prover is called as lambda_median=$(run_prover ...). Failure messages and log output written to stdout are captured and discarded. Redirect to stderr. (See inline comment.)

  • P3TimingLayer undercounts time under Rayon parallelism (prove_bench.rs:362–379): When Rayon enters a span concurrently on N threads, only the first thread's interval is recorded; the other N−1 are silently dropped. CPU-parallel phases (FFT, Merkle) will show ~1/N of their real wall-clock contribution in the breakdown. (See inline comment.)

  • max_log_arity = 1 is non-representative of Plonky3 production config (plonky3_config.rs:66): Forcing radix-2 FRI folding at 2^22 domain requires 22 fold rounds; typical P3 production configs use arity 8–16 (~7–8 rounds). The headline Lambda/P3 ratio is influenced by this sub-optimal P3 setting. Please document the reasoning or adjust the arity. (See inline comment.)

Low

  • maxrss as u64 cast without sign check (prove_bench.rs:413–416): On Linux ru_maxrss is i64; a negative return value wraps silently to a huge number. Add a guard. (See inline comment.)

  • P3TimingLayer is duplicated between prove_bench.rs:330–394 and lib.rs:215–267; the two copies will diverge. Consolidate in lib.rs and re-export. (See inline comment.)

@jotabulacios
Copy link
Copy Markdown
Collaborator Author

/claude /codex

@github-actions
Copy link
Copy Markdown

Codex Code Review

Findings

  1. High Security: The nightly workflow builds and runs unpinned code from Plonky3 main on a self-hosted runner. The new manifest uses git dependencies without a pinned rev, and the workflow explicitly refreshes them before building. Cargo build scripts or dependency code from a compromised/upstream-changed repo would execute on the persistent self-hosted runner. Pin the Plonky3 revision for CI, or run this job in an isolated disposable environment without sensitive filesystem/network access.
    bench_vs_plonky3/Cargo.toml, .github/workflows/bench-vs-nightly.yml

  2. Medium Bug: The benchmark intentionally updates Plonky3 to latest main, but the generated artifact only records the Lambda repo SHA and git_tree=dirty; it does not record the resolved Plonky3 commit. That makes nightly results non-reproducible and hard to audit, especially if performance changes or upstream breaks. Record the resolved git revision from Cargo.lock/cargo metadata, or archive the post-update lockfile in bench_vs_artifacts.
    .github/workflows/bench-vs-nightly.yml, bench_vs_plonky3/run.sh

I attempted cargo check -p bench-vs-plonky3 --bin prove_bench --benches, but it could not run because rustup tried to create files under the read-only /home/runner/.rustup.

let proof = p3_prove(&config, &air, trace, &pis);

group.bench_with_input(BenchmarkId::new("fibonacci", TRACE_LABEL), &ROWS, |b, _| {
b.iter(|| {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — Benchmark measurement skew

matched_params_config() is called inside b.iter, so config construction (several hash/compressor objects + DFT + FRI params + PCS) is measured as part of every verify iteration. The Lambda verify bench constructs only a lightweight DefaultTranscript per iteration, making the two numbers not directly comparable.

Move config construction outside b.iter:

Suggested change
b.iter(|| {
group.bench_with_input(BenchmarkId::new("fibonacci", TRACE_LABEL), &ROWS, |b, _| {
let config = plonky3_config::matched_params_config();
b.iter(|| {
p3_verify(&config, &air, &proof, &pis).unwrap();
});
});

Comment thread bench_vs_plonky3/run.sh
m = s / n
v = (ss / n) - (m * m)
if (v < 0) v = 0
sd = sqrt(v)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low — Population variance instead of sample variance

v = (ss / n) - (m * m) is the biased population variance (denominator n). The unbiased sample variance uses denominator n-1. For the default --runs 10, this underestimates the standard deviation by a factor of sqrt(n/(n-1)) ≈ 1.054, reporting CV ~5% lower than the statistically correct value.

For n this small, the difference is visible. Consider:

Suggested change
sd = sqrt(v)
v = (ss - n * m * m) / (n - 1)

(guard against n < 2 similarly to the existing n == 0 check)

p3-field = { git = "https://github.com/Plonky3/Plonky3.git" }
p3-goldilocks = { git = "https://github.com/Plonky3/Plonky3.git" }
p3-matrix = { git = "https://github.com/Plonky3/Plonky3.git" }
p3-commit = { git = "https://github.com/Plonky3/Plonky3.git" }
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Info — Unpinned git deps, intentional but worth documenting

All p3-* deps resolve to whatever HEAD on Plonky3's main is at cargo update time. The Cargo.lock pins a specific commit (de83ef4), and the nightly workflow explicitly runs cargo update -p p3-* before the bench step, so this is by design.

The only risk is that cargo build in a dev checkout will use the locked commit, which drifts from the nightly. That's generally fine for a benchmark crate, but it means "works locally" and "nightly" can diverge silently. Adding a comment near the deps block explaining that cargo update is expected before nightly runs would help future contributors.

@claude
Copy link
Copy Markdown

claude Bot commented May 22, 2026

Review: Bench vs Plonky3

Overall this is a well-structured, well-documented benchmarking crate. The fairness choices (scalar MMCS, matching AIR shape, same blowup/queries) are clearly reasoned and the README is thorough. A few issues worth fixing:

Medium

  • Verify benchmark skew (benches/stark_comparison.rs:154): matched_params_config() is called inside b.iter, so config construction is measured on every verify iteration. The Lambda verify bench only constructs a lightweight DefaultTranscript per iteration. Move the config outside the iter closure — see inline comment.

Low

  • Population vs sample variance (run.sh cv_pct_file): Uses biased population variance (÷ n) rather than sample variance (÷ n−1). For the default 10 runs this reports CV ~5% lower than the statistically correct value — inline comment has a one-line fix.

Info

  • Unpinned git deps (Cargo.toml): Intentional for nightly tracking, but a brief comment near [dependencies] would help future contributors understand that cargo update -p p3-* is expected before nightly runs.
  • P3TimingLayer in parallel proofs: Rayon tasks can create multiple spans with the same name concurrently. Each closes independently, so --breakdown output shows repeated span names rather than one aggregated line. Fine for a debug tool, but worth knowing when reading breakdown output.

No security or correctness issues found in the AIR, prover, or verifier wiring.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants