Add Noir's Execution Success test suite #411
Conversation
Pull request overview
Adds an automated harness to run Noir’s upstream execution_success test corpus through provekit-cli, with both a vendoring script for local use and a GitHub Actions workflow for on-demand runs.
Changes:
- Add a vendoring script to copy Noir’s `test_programs/{execution_success,test_libraries}` into `test-programs/noir/` with source metadata.
- Add a runner script that discovers runnable Noir test projects, compiles them with `nargo`, then runs `provekit-cli prepare/prove/verify`, producing logs and a grouped failure report.
- Add a `workflow_dispatch` GitHub Actions workflow to fetch Noir test programs via sparse checkout and execute the suite on a self-hosted runner.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| scripts/vendor_noir_execution_success.sh | Vendors Noir test programs into the repo for local/offline runs. |
| scripts/run_noir_execution_success.sh | Discovers and executes Noir execution_success tests via nargo + provekit-cli, generating logs/reports. |
| .github/workflows/noir-execution-success.yml | Workflow to fetch Noir tests (sparse) and run the suite in CI with artifacts. |
Force-pushed 8132422 to 61fb151
Force-pushed ea55439 to dbe013c
/claude-review
Code Review

Summary

This PR adds CI/tooling infrastructure for running Noir's execution_success suite.

Review

Correctness & Soundness

No issues with ZK proof system correctness; this is purely CI/test tooling. One logic issue in … Also at line 817, …

Code Quality

The bash script is well-structured: … The Python script handles live-fetch failures gracefully with a hardcoded fallback and a 15s timeout. The merge dict order (…) … One maintenance concern: …

Security

Two best-practice issues with the workflow: …
Testing

This PR is itself the test infrastructure; validation happens by running it. The Python script is pure data transformation (CSV to markdown) and is simple enough for manual review to suffice. One edge case handled correctly: if …

Suggestions
Verdict
Force-pushed 04664f5 to 13dcba6
Pull request overview
Adds an integration workflow and supporting scripts to run Noir’s test_programs/execution_success suite through provekit-cli, collect logs/artifacts, and generate a ProveKit vs Mavros witness-count comparison report for CI visibility.
Changes:
- Introduces a new GitHub Actions workflow to build ProveKit, install Noir, fetch Noir test programs via sparse checkout, run the suite, and upload artifacts (plus PR reporting).
- Adds a suite runner + helpers (bash + python) to discover tests, run compile/prepare/prove/verify, and produce grouped failure reports + witness CSV.
- Adds vendoring + skip-list + witness comparison tooling (including live STATUS.md fetch with fallback data).
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| .github/workflows/noir-execution-success.yml | New CI workflow to execute the Noir suite, upload artifacts, and post/update a sticky PR comment. |
| .github/scripts/build_noir_pr_comment.py | Builds a size-bounded sticky PR comment body from artifacts (grouped failures + witness comparison). |
| scripts/run_noir_execution_success.sh | Main runner to discover Noir execution_success tests and run ProveKit prepare/prove/verify with logging and CSV emission. |
| scripts/noir_execution_helpers.py | Python helpers for discovery, Prover.toml resolution, and grouped failure report generation. |
| scripts/generate_witness_comparison.py | Generates a markdown comparison table between ProveKit witness counts and Mavros STATUS.md “Cols”. |
| scripts/noir_skip_tests.txt | Shared skip list for unsupported-blackbox tests (runner + witness comparison). |
| scripts/vendor_noir_execution_success.sh | Sparse-vendors Noir execution_success and test_libraries into the repo with source metadata. |
```bash
# --- Unimplemented blackbox skip list: no log, no noise ---
if [[ "${SKIP_SET["${leaf_name}"]:-}" == "1" ]]; then
  echo "SKIP (blackbox): ${test_name}"
  (( skipped += 1 ))
  continue
fi
```

```bash
test_log="${LOG_DIR}/per_test/${safe_test_name}.log"
```
```python
lines = [
    "# Mavros vs Provekit Witnesses Count",
    "",
    f"Union {len(all_names)} circuits: {len(comparable)} comparable, "
    f"{missing_in_provekit} missing in Provekit, {missing_in_mavros} missing in Mavros.",
    f"Among comparable: {equal} equal, {mavros_better} Mavros better, "
    f"{provekit_better} Provekit better.",
    "",
    "| Test | Mavros Cols | Provekit Post-GE | Delta | Better | Factor |",
    "|------|-------------|------------------|-------|--------|--------|",
]
```
```yaml
default: "v1.0.0-beta.19"
```

```yaml
permissions:
  contents: read
```

```yaml
on:
  pull_request:
  workflow_dispatch:
    inputs:
```

```yaml
cancel-in-progress: true
```

```yaml
jobs:
  noir-execution-success:
```
Noir execution_success report
Failing circuits (15)
Drop the provekit_witness_report.md table from the sticky PR comment and add a "Failing circuits (N)" details section parsed from the [stages] block of grouped_error_report.txt, alongside the grouped report itself.
Adds a separate Noir-CSP benchmark suite that runs prepare/prove/verify on every circuit under noir-examples/csp-benchmarks/ and posts a sticky PR comment with prover time, peak RSS, peak heap, verifier time, proof size, and PKP size, averaged over BENCH_RUNS (default 3) iterations.

- scripts/run_csp_benchmarks.sh: bench runner. Per circuit: nargo compile + provekit-cli prepare once, then prove and verify each three times under /usr/bin/time -f '%e %M', capturing prover stderr so the helper can extract the largest "peak memory" entry from the span_stats trace.
- scripts/csp_benchmark_helpers.py: parses /usr/bin/time output and the SI-formatted peak-memory entries (with U+202F separator), converts to bytes, and emits a single averaged CSV row per circuit.
- .github/scripts/build_csp_pr_comment.py: renders results.csv as a markdown table and wraps it with a <!-- csp-benchmarks-report --> marker so it lives in its own sticky comment (no race with the noir-execution-success workflow).
- .github/workflows/csp-benchmarks.yml: ubuntu-24.04-arm, runs on PRs from this repo and on workflow_dispatch (with a configurable bench_runs override).
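A minimal sketch of the parsing described above. The function names and the exact `peak memory` log format are assumptions for illustration, not the real `csp_benchmark_helpers.py` API; the commit only specifies `/usr/bin/time -f '%e %M'` output and SI units separated by U+202F.

```python
import re

# SI multipliers for the formatted peak-memory entries (assumed unit set).
_SI = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}

def parse_time_output(line: str) -> tuple[float, int]:
    """Parse `/usr/bin/time -f '%e %M'` output: elapsed seconds, max RSS (KiB)."""
    elapsed, max_rss = line.split()
    return float(elapsed), int(max_rss)

def peak_memory_bytes(text: str) -> int:
    """Find the largest SI-formatted 'peak memory' entry (number and unit
    separated by U+202F, a narrow no-break space) and convert it to bytes."""
    pat = re.compile(r"peak memory:\s*([\d.]+)\u202f(B|KB|MB|GB)")
    peaks = [float(num) * _SI[unit] for num, unit in pat.findall(text)]
    return int(max(peaks)) if peaks else 0

print(parse_time_output("12.34 204800"))
print(peak_memory_bytes("peak memory: 1.5\u202fGB ... peak memory: 900\u202fMB"))
```

Averaging then reduces each circuit's three prove/verify samples to one CSV row.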
Force-pushed 2cce98f to 5f0a319
CSP benchmarks

Prover time, peak RSS, peak heap, and verifier time are arithmetic means across the iterations. Peak heap comes from the largest `peak memory` entry in the span_stats trace. No baseline available yet; deltas will appear once this workflow has produced at least one successful run on main.

Results
- scripts/csp_benchmark_helpers.py: spell "unparsable" the way crate-ci/typos expects, fixing the Spell check job that was failing on this PR.
- scripts/run_noir_execution_success.sh: don't exit 1 when only some circuits fail. The PR sticky comment surfaces failing circuits, so the workflow shouldn't block PRs on known provekit-cli limitations. Local callers that want the old behaviour can opt back in via STRICT_FAIL=1.
- build_csp_pr_comment.py: drop the "Runs" column (the iteration count is already shown in the metric metadata table) and accept an optional --baseline-csv. When present, each metric cell shows a percentage delta against the latest successful CSP-benchmarks run on main. Circuits absent from the baseline are flagged "(new)".
- csp-benchmarks.yml: also run on push to main (so each main commit publishes a new baseline artifact) and add a best-effort step that uses gh to look up the latest successful main run, download its results.csv into csp-bench-logs/baseline/, and pass it through to the renderer along with the baseline run id for the comment link.
- Add actions:read so the workflow can list runs and download artifacts from previous runs.
- build_noir_pr_comment.py: stop rendering the raw grouped_error_report.txt details block. The failing-circuits list and PASS/FAIL/SKIP counts are parsed from the report and shown directly, so the raw dump is redundant. Truncation infrastructure (clip_tail, build_with_truncation, the sanitize_code_fence helper) is removed since the comment is now small enough to fit comfortably; a single hard-cut guard remains.
- build_csp_pr_comment.py: remove the trailing "_This comment is automatically updated_" footer.
```python
provekit: dict[str, int] = {}
with csv_path.open() as f:
    for row in csv.DictReader(f):
        leaf = row["test_name"].split("/")[-1]
```
Key normalization is inconsistent here: Mavros keeps full relative paths (workspace/crates/a) while ProveKit reduces CSV names to the last path segment (a). Nested/workspace tests won’t match correctly in the comparison output.
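One way to resolve the mismatch flagged above is to normalize both sides to the same key before intersecting. This is an illustrative sketch, not the actual generate_witness_comparison.py code; names and sample data are invented.

```python
def leaf_name(name: str) -> str:
    """Reduce 'workspace/crates/a' and 'a' alike to the shared leaf key 'a'."""
    return name.rstrip("/").split("/")[-1]

# Hypothetical inputs: Mavros keeps relative paths, ProveKit keeps leaves.
mavros = {"workspace/crates/a": 120, "b": 45}
provekit = {"a": 118, "b": 45}

# Re-key the Mavros side the same way the ProveKit CSV names are reduced.
mavros_by_leaf = {leaf_name(k): v for k, v in mavros.items()}
comparable = sorted(set(mavros_by_leaf) & set(provekit))
print(comparable)
```

Re-keying to leaves assumes leaf names are unique across the suite; a collision check before merging would be a sensible guard.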
```yaml
env:
  CARGO_TERM_COLOR: always
  NOIR_REF: ${{ github.event_name == 'workflow_dispatch' && (github.event.inputs.noir_ref != '' && github.event.inputs.noir_ref || 'v1.0.0-beta.19') || 'v1.0.0-beta.19' }}
```
Here and at line 31
Deduplicate NOIR_REF: the default v1.0.0-beta.19 is hardcoded three times, in the env expression, the job name:, and the workflow_dispatch input default. Simplify to:

```yaml
env:
  NOIR_REF: ${{ github.event.inputs.noir_ref || 'v1.0.0-beta.19' }}
```

and reference ${{ env.NOIR_REF }} in the job name, giving a single source of truth for when the default bumps.
```bash
rm -f "${test_log}"
done
```

```bash
attempted=$((passed + failed + skipped))
```
attempted count overstates by blackbox-skip count. total is deliberately not incremented for blackbox skips (line 183 comment confirms intent). But here attempted=$((passed + failed + skipped)) re-adds them. Observed run: PASS=332, FAIL=15, SKIP=20 (all 20 blackbox) → reports Attempted=367, actual=347. Fix: attempted=${total}.
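The overcount can be reproduced with the observed run's numbers; variable names mirror the runner but the snippet itself is illustrative:

```bash
#!/usr/bin/env bash
# Observed run: total deliberately excludes the 20 blackbox skips.
passed=332; failed=15; skipped=20; total=347

# Buggy: re-adds the blackbox skips that `total` already excluded.
attempted_bad=$((passed + failed + skipped))   # 367

# Fixed: report what was actually attempted.
attempted=${total}                             # 347

echo "bad=${attempted_bad} fixed=${attempted}"
```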
Reads the counts from `provekit-cli prove`'s `Read Noir proof scheme` log line (already captured per run); no CLI changes required. Renders both as integer columns with thousands separators and the same delta-vs-main treatment as the perf metrics. Backwards-compatible with older baseline CSVs lacking the columns.
- Deduplicate NOIR_REF: collapse the nested ternary in env to a single
`||` fallback and have the job name read `${{ env.NOIR_REF }}` so
the literal version string only needs to bump in two places.
- Fix attempted overcount: blackbox skips bump `skipped` without
bumping `total`, so summing passed+failed+skipped double-counts
them. Track `attempted=${total}` instead. Observed 332 PASS / 15
FAIL / 20 SKIP run now reports 347 (was 367).
Summary
This PR adds a dedicated Noir execution_success integration workflow for ProveKit, plus reporting that compares ProveKit witness counts against Mavros. The goal is to make it easy to run the suite on demand, preserve actionable logs as artifacts, and provide an always-updated comparison table in CI output.
What This PR Adds
1) New GitHub Action workflow for Noir execution-success tests
- `.github/workflows/noir-execution-success.yml`
- Trigger: `workflow_dispatch` (manual run) with input `noir_ref` (default `v1.0.0-beta.19`)
- Runs on a self-hosted runner (`Linux`, `ARM64`, `provekit-build`)
- Builds `provekit-cli`, installs the Noir toolchain, sparse-clones Noir test programs, runs the suite script, uploads the logs artifact, and cleans up the temp clone.

2) New runner script for suite execution
- `scripts/run_noir_execution_success.sh`
- Discovers tests under `test_programs/execution_success`
- Runs `nargo compile` (with optional `-Zenums` fallback), then `provekit-cli prepare`, `provekit-cli prove`, `provekit-cli verify`
- Prover.toml resolution with `tomllib` support

3) New vendoring helper
- `scripts/vendor_noir_execution_success.sh`
- Vendors `execution_success` + `test_libraries`

4) Witness comparison generator
- `scripts/generate_witness_comparison.py`
- Fetches `STATUS.md` (reilabs/mavros/main) and parses only `noir/test_programs/execution_success/*` rows
- `witness_comparison.md` reports the union of names from both sides, with `missing_provekit` / `missing_mavros` markers

Artifact Contents (What gets uploaded)
Workflow uploads `noir-execution-logs/` as an artifact containing:
- `grouped_error_report.txt`
- `provekit_witness_counts.csv`
- `witness_comparison.md`
- `per_test/*.log` for failed/skipped tests

Retention: 7 days.
Note: a circuit is reported as `missing_provekit` if present on the Mavros side only.