Bench vs p3 nightly publish #590
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
--steps flag for nightly 1M/2M/4M/8M benchmarks
bench_vs/run.sh by switching to sed
Codex Code Review
Findings
I ran |
echo -e " ${RED}[${prover}] FAILED on log-rows=${log_rows} run ${run_i}${NC}"
cat "$out_file"
Medium — diagnostic output swallowed by command substitution
run_prover is called as lambda_median=$(run_prover lambda "$lr"). Everything written to stdout inside the function is captured, not printed. When the prover fails, these two lines write to stdout, so the error message and log file contents are silently discarded. The caller then exits with no explanation.
Redirect to stderr:
- echo -e " ${RED}[${prover}] FAILED on log-rows=${log_rows} run ${run_i}${NC}"
- cat "$out_file"
+ echo -e " ${RED}[${prover}] FAILED on log-rows=${log_rows} run ${run_i}${NC}" >&2
+ cat "$out_file" >&2
Same fix is needed for the second cat "$out_file" at line 370.
            && entry.active_since.is_none()
        {
            entry.active_since = Some(Instant::now());
        }
    }

    fn on_exit(&self, id: &tracing::span::Id, _ctx: tracing_subscriber::layer::Context<'_, S>) {
        if let Some(entry) = self.spans.lock().unwrap().get_mut(&id.into_u64())
            && let Some(start) = entry.active_since.take()
        {
            entry.accumulated += start.elapsed();
        }
    }

    fn on_close(&self, id: tracing::span::Id, _ctx: tracing_subscriber::layer::Context<'_, S>) {
        if let Some(entry) = self.spans.lock().unwrap().remove(&id.into_u64()) {
            let mut total = entry.accumulated;
            if let Some(start) = entry.active_since {
Medium — P3TimingLayer systematically undercounts time under Rayon parallelism
When Rayon re-enters a span concurrently on N threads, on_enter is called N times in parallel. Only the first call sees active_since.is_none() and records a start time; the other N−1 calls skip. In on_exit, only the first thread to exit finds active_since set and accumulates a duration; the remaining N−1 exits are no-ops. The recorded interval therefore ends at the first exit, so any time the span stays active on other threads afterwards is dropped, and the total is roughly 1/N of the time summed across all threads when they run for similar durations. For parallel phases (FFT, Merkle), this makes the breakdown numbers misleading.
The comment describes the intent ("only start timing on the first enter after each exit") but that intent was designed for a single-threaded re-entrant span, not for Rayon's concurrent multi-thread entry. For accurate wall-clock measurement in parallel code, consider using on_new_span/on_close instead, which fire exactly once per span lifetime regardless of how many threads enter it.
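The difference between stopping at the first exit and tracking concurrent entries can be sketched with a counting scheme. The `SpanClock` type below is hypothetical (it is not code from this PR): it counts how many threads are currently inside a span and accumulates the interval between the 0→1 and 1→0 transitions, which is the wall-clock time the span was active on at least one thread. Timestamps are passed in as `Duration` offsets so the logic can be checked deterministically; a real layer would presumably call `Instant::now()` inside on_enter/on_exit under the existing mutex.

```rust
use std::time::Duration;

/// Hypothetical per-span state that tolerates concurrent entry.
#[derive(Default)]
struct SpanClock {
    /// Number of threads currently inside the span.
    active: u32,
    /// Timestamp of the most recent 0 -> 1 transition.
    started_at: Option<Duration>,
    /// Total time during which at least one thread was inside the span.
    accumulated: Duration,
}

impl SpanClock {
    /// Record an entry at time `now` (an offset from some fixed origin).
    fn enter(&mut self, now: Duration) {
        if self.active == 0 {
            self.started_at = Some(now);
        }
        self.active += 1;
    }

    /// Record an exit at time `now`; only the last exit closes the interval.
    fn exit(&mut self, now: Duration) {
        if self.active == 0 {
            return; // unmatched exit: ignore rather than underflow
        }
        self.active -= 1;
        if self.active == 0 {
            if let Some(start) = self.started_at.take() {
                self.accumulated += now.saturating_sub(start);
            }
        }
    }
}
```

With entries at 0 ms and 1 ms and exits at 5 ms and 9 ms, this accumulates 9 ms, whereas a scheme that stops at the first exit would record 5 ms.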
let fri_params = FriParameters {
    log_blowup: blowup.trailing_zeros() as usize,
    log_final_poly_len: 0,
    max_log_arity: 1,
Medium — hardcoded max_log_arity = 1 makes Plonky3 artificially slow in FRI
max_log_arity = 1 forces binary FRI folding (radix-2). With the nightly config (--log-rows 21, blowup=2), the LDE domain is 2^22, requiring 22 binary fold rounds. Typical Plonky3 production deployments use max_log_arity = 3 or 4 (arity 8/16), reducing fold rounds to roughly 6–8 and cutting FRI commit time significantly.
The README documents the scalar MMCS choice but not the folding arity choice. The published Lambda/P3 timing ratio is influenced by this sub-optimal P3 configuration that P3 would never use in production. Please either raise this value to match realistic P3 usage, or add a clear note in the README documenting why radix-2 was chosen and its impact on the reported ratio.
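The round counts quoted above can be reproduced with a short calculation. This is a rough sketch, not Plonky3's actual folding schedule: the helper `fold_rounds` is hypothetical, and it assumes every round folds by the maximum arity (with a smaller final round if needed) until the domain reaches 2^log_final.

```rust
/// Hypothetical helper: number of folding rounds needed to shrink a domain of
/// size 2^log_domain to 2^log_final when each round removes at most
/// `max_log_arity` bits.
fn fold_rounds(log_domain: usize, log_final: usize, max_log_arity: usize) -> usize {
    assert!(max_log_arity > 0, "arity must fold at least one bit per round");
    // Bits that still have to be folded away.
    let bits = log_domain.saturating_sub(log_final);
    // Ceiling division: a partial last round still counts as a round.
    bits.div_ceil(max_log_arity)
}
```

Under these assumptions, a 2^22 domain folded down to size 1 needs 22 rounds at max_log_arity = 1, 8 rounds at 3, and 6 rounds at 4.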
let maxrss = unsafe { usage.assume_init().ru_maxrss };
#[cfg(target_os = "macos")]
{
    Some((maxrss as u64).div_ceil(1024))
Low — maxrss cast from i64 to u64 without sign check
On Linux, ru_maxrss is a C long (i64 on 64-bit targets). If it ever returns a negative value (kernel oddity, 32-bit platform, or future ABI change), the as u64 cast wraps silently to a huge number, producing a wildly incorrect RSS reading.
- let maxrss = unsafe { usage.assume_init().ru_maxrss };
- #[cfg(target_os = "macos")]
- {
-     Some((maxrss as u64).div_ceil(1024))
+ let maxrss = unsafe { usage.assume_init().ru_maxrss };
+ if maxrss < 0 {
+     return None;
+ }
+ #[cfg(target_os = "macos")]
+ return Some((maxrss as u64).div_ceil(1024));
+ #[cfg(not(target_os = "macos"))]
+ return Some(maxrss as u64);
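The same guard can also be written as a checked conversion, which avoids the `as` cast entirely. The function below is an illustrative sketch with a hypothetical name and signature (`maxrss_kib`, with the platform passed as a flag so it can be exercised on any host). It relies on getrusage reporting ru_maxrss in bytes on macOS and in kilobytes on Linux.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: convert a raw `ru_maxrss` reading to KiB.
/// `reports_bytes` is true where the OS reports bytes (macOS) and false
/// where it already reports KiB (Linux).
fn maxrss_kib(raw: i64, reports_bytes: bool) -> Option<u64> {
    // A negative reading fails the conversion and yields None instead of wrapping.
    let value = u64::try_from(raw).ok()?;
    Some(if reports_bytes {
        value.div_ceil(1024) // round bytes up to whole KiB
    } else {
        value
    })
}
```

A negative input returns None on both paths, so callers never see a wrapped value.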
struct SpanState {
    name: String,
    active_since: Option<std::time::Instant>,
    accumulated: std::time::Duration,
}

struct P3TimingLayer {
    spans: Mutex<HashMap<u64, SpanState>>,
    results: SpanResults,
}

impl<
    S: tracing::Subscriber + for<'lookup> tracing_subscriber::registry::LookupSpan<'lookup>,
> tracing_subscriber::Layer<S> for P3TimingLayer
{
    fn on_new_span(
        &self,
        attrs: &tracing::span::Attributes<'_>,
        id: &tracing::span::Id,
        _ctx: tracing_subscriber::layer::Context<'_, S>,
    ) {
        let name = attrs.metadata().name().to_string();
        self.spans.lock().unwrap().insert(
            id.into_u64(),
            SpanState {
                name,
                active_since: None,
                accumulated: std::time::Duration::ZERO,
            },
        );
    }

    // Rayon can re-enter a span across threads, so only start timing on
    // the first enter after each exit; accumulate every interval.
    fn on_enter(
        &self,
        id: &tracing::span::Id,
        _ctx: tracing_subscriber::layer::Context<'_, S>,
    ) {
        if let Some(entry) = self.spans.lock().unwrap().get_mut(&id.into_u64())
            && entry.active_since.is_none()
        {
            entry.active_since = Some(std::time::Instant::now());
        }
    }

    fn on_exit(
        &self,
        id: &tracing::span::Id,
        _ctx: tracing_subscriber::layer::Context<'_, S>,
    ) {
        if let Some(entry) = self.spans.lock().unwrap().get_mut(&id.into_u64())
            && let Some(start) = entry.active_since.take()
Low — P3TimingLayer is duplicated verbatim from prove_bench.rs
SpanState and P3TimingLayer (plus the full tracing_subscriber::Layer impl) are copy-pasted from src/bin/prove_bench.rs:330–394, with only minor std::time::Instant qualification differences. These two copies will diverge. The implementation should live here in lib.rs (or a dedicated module) and be re-exported for use in the binary.
Review Summary
This PR adds a new
Medium
Low
/claude /codex
Codex Code Review
Findings
I attempted |
let proof = p3_prove(&config, &air, trace, &pis);

group.bench_with_input(BenchmarkId::new("fibonacci", TRACE_LABEL), &ROWS, |b, _| {
    b.iter(|| {
Medium — Benchmark measurement skew
matched_params_config() is called inside b.iter, so config construction (several hash/compressor objects + DFT + FRI params + PCS) is measured as part of every verify iteration. The Lambda verify bench constructs only a lightweight DefaultTranscript per iteration, making the two numbers not directly comparable.
Move config construction outside b.iter:
- b.iter(|| {
+ group.bench_with_input(BenchmarkId::new("fibonacci", TRACE_LABEL), &ROWS, |b, _| {
+     let config = plonky3_config::matched_params_config();
+     b.iter(|| {
+         p3_verify(&config, &air, &proof, &pis).unwrap();
+     });
+ });
m = s / n
v = (ss / n) - (m * m)
if (v < 0) v = 0
sd = sqrt(v)
Low — Population variance instead of sample variance
v = (ss / n) - (m * m) is the biased population variance (denominator n). The unbiased sample variance uses denominator n-1. For the default --runs 10, this underestimates the standard deviation by a factor of sqrt(n/(n-1)) ≈ 1.054, reporting CV ~5% lower than the statistically correct value.
For n this small, the difference is visible. Consider:
- sd = sqrt(v)
+ v = (ss - n * m * m) / (n - 1)
(guard against n < 2 similarly to the existing n == 0 check)
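The gap between the two estimators is easy to check numerically. The sketch below uses hypothetical helper names and is written in Rust rather than the script's awk. It sums squared deviations from the mean (a two-pass form, swapped in for the `ss - n * m * m` shortcut because it is less prone to cancellation) and returns None when there are too few samples.

```rust
/// Mean and unbiased sample standard deviation (denominator n - 1).
/// Returns None for fewer than two samples, where the estimate is undefined.
fn mean_and_sample_sd(xs: &[f64]) -> Option<(f64, f64)> {
    let n = xs.len();
    if n < 2 {
        return None;
    }
    let nf = n as f64;
    let mean = xs.iter().sum::<f64>() / nf;
    // Sum of squared deviations from the mean.
    let sum_sq_dev: f64 = xs.iter().map(|x| (x - mean).powi(2)).sum();
    Some((mean, (sum_sq_dev / (nf - 1.0)).sqrt()))
}

/// Biased population standard deviation (denominator n), for comparison.
fn population_sd(xs: &[f64]) -> Option<f64> {
    if xs.is_empty() {
        return None;
    }
    let nf = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / nf;
    let sum_sq_dev: f64 = xs.iter().map(|x| (x - mean).powi(2)).sum();
    Some((sum_sq_dev / nf).sqrt())
}
```

For the same input the two results differ by exactly sqrt(n/(n-1)), so the CV computed from the sample estimate is larger by that factor.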
p3-field = { git = "https://github.com/Plonky3/Plonky3.git" }
p3-goldilocks = { git = "https://github.com/Plonky3/Plonky3.git" }
p3-matrix = { git = "https://github.com/Plonky3/Plonky3.git" }
p3-commit = { git = "https://github.com/Plonky3/Plonky3.git" }
Info — Unpinned git deps, intentional but worth documenting
All p3-* deps resolve to whatever HEAD on Plonky3's main is at cargo update time. The Cargo.lock pins a specific commit (de83ef4), and the nightly workflow explicitly runs cargo update -p p3-* before the bench step, so this is by design.
The only risk is that cargo build in a dev checkout will use the locked commit, which drifts from the nightly. That's generally fine for a benchmark crate, but it means "works locally" and "nightly" can diverge silently. Adding a comment near the deps block explaining that cargo update is expected before nightly runs would help future contributors.
Review: Bench vs Plonky3
Overall this is a well-structured, well-documented benchmarking crate. The fairness choices (scalar MMCS, matching AIR shape, same blowup/queries) are clearly reasoned and the README is thorough. A few issues worth fixing:
Medium
Low
Info
No security or correctness issues found in the AIR, prover, or verifier wiring.
New bench_vs_plonky3/ proves a shared Fibonacci AIR on both Lambda and Plonky3. Plonky3's MMCS is configured scalar (no vector lanes) so Merkle compression is one Keccak-f per call on both sides.
bench_vs_plonky3/run.sh runs the bench, reports median + CV, and writes TSV + metrics to --report-dir.
Workflow runs the bench after the SP1 + ethrex steps and appends a "Lambda VM vs Plonky3" section to the same Slack post.