Add Noir's Execution Success test suite by x-senpai-x · Pull Request #411 · worldfnd/provekit

x-senpai-x · 2026-04-15T18:27:06Z

Summary

This PR adds a dedicated Noir execution_success integration workflow for ProveKit, plus reporting that compares ProveKit witness counts against Mavros.

The goal is to make it easy to run the suite on demand, preserve actionable logs as artifacts, and provide an always-updated comparison table in CI output.

What This PR Adds

1) New GitHub Action workflow for Noir execution-success tests

Adds .github/workflows/noir-execution-success.yml
Trigger: workflow_dispatch (manual run)
Input: noir_ref (default v1.0.0-beta.19)
Runner: self-hosted (Linux, ARM64, provekit-build)
Builds provekit-cli, installs Noir toolchain, sparse-clones Noir test programs, runs suite script, uploads logs artifact, and cleans up temp clone.

2) New runner script for suite execution

Adds scripts/run_noir_execution_success.sh
Discovers runnable tests under Noir test_programs/execution_success
Runs:
- nargo compile (with optional -Zenums fallback)
- provekit-cli prepare
- provekit-cli prove
- provekit-cli verify
Produces:
- per-test logs (failures/skips retained, pass logs deleted to keep artifacts lean)
- grouped failure report
- witness count CSV
- witness comparison markdown
Enforces required nargo version and Python tomllib support.
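
The per-test stage sequence described above can be pictured with a minimal Python sketch. It is not the runner script itself (which is a bash script). The stage names follow the PR description, while `run_stages`, `default_runner`, and the shape of the `commands` mapping are hypothetical; the actual command lines are supplied by the caller because the real arguments are not shown in this PR text.

```python
import subprocess
from typing import Callable, Optional, Sequence

# Stage names follow the PR description. The argument lists are supplied by
# the caller, since the real invocations are not shown here (assumption).
STAGES = ("compile", "prepare", "prove", "verify")

Runner = Callable[[Sequence[str]], int]


def default_runner(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(list(cmd), check=False).returncode


def run_stages(commands: dict[str, Sequence[str]],
               run: Runner = default_runner) -> Optional[str]:
    """Run the stages in order; return the first failing stage, or None if all pass."""
    for stage in STAGES:
        if run(commands[stage]) != 0:
            return stage  # later stages are not attempted
    return None
```

Returning the name of the first failing stage is one way a grouped failure report could bucket tests by stage; whether the real script groups this way is an assumption.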

3) New vendoring helper

Adds scripts/vendor_noir_execution_success.sh
Sparse-vendors Noir execution_success + test_libraries
Writes source metadata file

4) Witness comparison generator

Adds scripts/generate_witness_comparison.py
Reads ProveKit witness CSV
Fetches live Mavros STATUS.md (reilabs/mavros/main) and parses only:
- noir/test_programs/execution_success/*
Uses hardcoded fallback map if live fetch/parse fails.
Comparison includes full gap visibility
witness_comparison.md reports the union of names from both sides.
If ProveKit or Mavros fails or skips a test, its row is marked as:
- missing_provekit
- missing_mavros
Summary includes:
- total union count
- comparable count
- missing counts on each side
- equal / better breakdown among comparable rows.
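
The union-based comparison above can be sketched roughly as follows. This is an illustration, not the code in generate_witness_comparison.py: `compare_witnesses` is a hypothetical name, the `equal` / `provekit_better` / `mavros_better` labels are illustrative, and the sketch assumes that a lower witness count is the better one.

```python
def compare_witnesses(provekit: dict[str, int],
                      mavros: dict[str, int]) -> list[tuple[str, str]]:
    """Classify every test name that appears on either side.

    Assumption: fewer witnesses is better. Status labels other than
    missing_provekit / missing_mavros are illustrative.
    """
    rows = []
    for name in sorted(set(provekit) | set(mavros)):  # union of both sides
        if name not in provekit:
            status = "missing_provekit"
        elif name not in mavros:
            status = "missing_mavros"
        elif provekit[name] == mavros[name]:
            status = "equal"
        elif provekit[name] < mavros[name]:
            status = "provekit_better"
        else:
            status = "mavros_better"
        rows.append((name, status))
    return rows
```

Iterating over the union rather than the intersection is what keeps gaps visible: a test that only one side produced still gets a row.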

Artifact Contents (What gets uploaded)

Workflow uploads noir-execution-logs/ as artifact:

grouped_error_report.txt
provekit_witness_counts.csv
witness_comparison.md
per_test/*.log for failed/skipped tests

Retention: 7 days.

Note:

Prepare failures do not produce a ProveKit witness row, but those circuits appear in the comparison as missing_provekit if they are present on the Mavros side.

Copilot

Pull request overview

Adds an automated harness to run Noir’s upstream execution_success test corpus through provekit-cli, with both a vendoring script for local use and a GitHub Actions workflow for on-demand runs.

Changes:

Add a vendoring script to copy Noir’s test_programs/{execution_success,test_libraries} into test-programs/noir/ with source metadata.
Add a runner script that discovers runnable Noir test projects, compiles them with nargo, then runs provekit-cli prepare/prove/verify, producing logs and a grouped failure report.
Add a workflow_dispatch GitHub Actions workflow to fetch Noir test programs via sparse checkout and execute the suite on a self-hosted runner.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
|------|-------------|
| scripts/vendor_noir_execution_success.sh | Vendors Noir test programs into the repo for local/offline runs. |
| scripts/run_noir_execution_success.sh | Discovers and executes Noir `execution_success` tests via `nargo` + `provekit-cli`, generating logs/reports. |
| .github/workflows/noir-execution-success.yml | Workflow to fetch Noir tests (sparse) and run the suite in CI with artifacts. |


Bisht13 · 2026-04-21T08:46:33Z

/claude-review

github-actions · 2026-04-21T08:50:33Z

Code Review

Summary

This PR adds CI/tooling infrastructure for running Noir's execution_success test suite against ProveKit: a GitHub Actions workflow, a bash test runner, a Python witness-count comparison generator, and a vendoring helper. No Rust proof-system code is touched, so ZK soundness and transcript invariants from CLAUDE.md are N/A.

Review

Correctness & Soundness

No issues with ZK proof system correctness — this is purely CI/test tooling.

One logic issue in run_noir_execution_success.sh: the (( total += 1 )) increment at line 779 happens before the blackbox skip check at line 790. When MAX_TESTS is set, blackbox-skipped tests consume cap slots. If the sorted test list begins with many blackbox tests, the effective number of prove/verify tests that run could be much less than MAX_TESTS. Moving the blackbox-skip check before the total increment would give more intuitive behaviour. Non-blocking since MAX_TESTS is not set in the workflow.
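
The effect of the ordering can be illustrated with a small Python model of the loop. It is a sketch only: the real runner is a bash script, `select_runnable` is a hypothetical name, and the exact point at which the cap is checked is assumed.

```python
from typing import Iterable, Optional


def select_runnable(names: Iterable[str], skip: set[str],
                    max_tests: Optional[int] = None,
                    skip_before_count: bool = False) -> list[str]:
    """Return the tests that would reach prove/verify under a MAX_TESTS-style cap."""
    total = 0
    runnable: list[str] = []
    for name in names:
        if max_tests is not None and total >= max_tests:
            break
        if skip_before_count and name in skip:
            continue  # skipped tests never consume a cap slot
        total += 1
        if name in skip:
            continue  # counted first, then skipped: a cap slot is used up
        runnable.append(name)
    return runnable
```

With a sorted list that starts with two blackbox tests and a cap of 3, the count-first order leaves a single runnable test, while the skip-first order yields three.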

Also at line 817, grep -qr targets a specific file — the -r (recursive) flag is unnecessary and semantically incorrect. Should be grep -q 'test_libraries' "${test_dir}/Nargo.toml".

Code Quality

The bash script is well-structured: set -euo pipefail, sandbox isolation per test with proper trap EXIT INT TERM cleanup, mapfile -t for safe test discovery, and graceful degradation when TEST_LIB_ROOT is absent. Good.

The Python script handles live-fetch failures gracefully with a hardcoded fallback and a 15s timeout. The merge dict order ({**MAVROS_COLS, **live}) correctly lets live data take precedence over the hardcoded table. Good.
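
As a minimal sketch of the precedence behaviour described here: in a dict merge written as `{**fallback, **live}`, keys from the right-hand operand win. The `merged_cols` helper and the injected `fetch_live` callable are hypothetical, used only so the fallback path can be exercised without a network call.

```python
from typing import Callable


def merged_cols(fallback: dict[str, int],
                fetch_live: Callable[[], dict[str, int]]) -> dict[str, int]:
    """Merge live data over a hardcoded table, tolerating fetch failures."""
    try:
        live = fetch_live()
    except Exception:
        live = {}  # fall back to the hardcoded table only
    # Later keys override earlier ones, so live values take precedence.
    return {**fallback, **live}
```

Swapping the operands to `{**live, **fallback}` would silently pin every overlapping entry to the hardcoded value, which is why the order matters.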

One maintenance concern: SKIP_TESTS is defined identically in both generate_witness_comparison.py (line 319) and run_noir_execution_success.sh (line 531). The # Keep in sync comment helps, but this is a divergence hazard. A single source of truth — e.g., a shared text file read by both scripts, or a CI assertion diffing the two lists — would be more robust. Non-blocking since they currently match exactly.
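
One way the shared-text-file option could look on the Python side is sketched below. The file format (one test name per line, `#` starting a comment) and the `load_skip_tests` name are assumptions, not something this review specifies.

```python
from pathlib import Path


def load_skip_tests(path) -> set[str]:
    """Read a skip list: one test name per line, '#' starts a comment (assumed format)."""
    names: set[str] = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        entry = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if entry:
            names.add(entry)
    return names
```

A shell script can read the same file line by line, so both consumers would see identical names without a "keep in sync" comment.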

Security

Two best-practice issues with the workflow:

No explicit permissions block (.github/workflows/noir-execution-success.yml). Without it, the job inherits the repository's default token permissions. Adding an explicit minimal block prevents accidental scope creep:
```
permissions:
  contents: read
```
Actions pinned to floating tags (actions/checkout@v4, moonrepo/setup-rust@v1, noir-lang/noirup@v0.1.2, actions/upload-artifact@v4) rather than SHA digests. SHA-pinning with a version comment is the hardened approach. Non-blocking for a manual-dispatch-only internal workflow, but worth noting.

Testing

This PR is itself the test infrastructure; validation happens by running it. The Python script is pure data transformation (CSV to markdown) and is simple enough for manual review to suffice.

One edge case handled correctly: if provekit_witness_counts.csv is empty (all tests failed/skipped), the comparison script produces all Mavros rows marked missing_provekit — valid and informative output.

Suggestions

(Non-blocking) Move the blackbox-skip check before (( total += 1 )) so MAX_TESTS counts only runnable tests.
(Non-blocking) Drop -r from grep -qr at run_noir_execution_success.sh line 817.
(Non-blocking) Add permissions: contents: read to the workflow job.
(Non-blocking) Consider SHA-pinning third-party actions.
(Non-blocking) Establish a single source of truth for SKIP_TESTS shared between the shell and Python scripts.
(Nitpick) _MAVROS_STATUS_URL is pinned to the main branch of reilabs/mavros. If that branch is renamed, the live fetch silently falls back to hardcoded data — acceptable since the fallback is handled correctly.

Verdict

⚠️ Approve with suggestions — the PR is functionally correct and well-engineered. All suggestions above are non-blocking; the most impactful to address before merge is adding the explicit permissions block to the workflow.

Copilot

Pull request overview

Adds an integration workflow and supporting scripts to run Noir’s test_programs/execution_success suite through provekit-cli, collect logs/artifacts, and generate a ProveKit vs Mavros witness-count comparison report for CI visibility.

Changes:

Introduces a new GitHub Actions workflow to build ProveKit, install Noir, fetch Noir test programs via sparse checkout, run the suite, and upload artifacts (plus PR reporting).
Adds a suite runner + helpers (bash + python) to discover tests, run compile/prepare/prove/verify, and produce grouped failure reports + witness CSV.
Adds vendoring + skip-list + witness comparison tooling (including live STATUS.md fetch with fallback data).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.


| File | Description |
|------|-------------|
| `.github/workflows/noir-execution-success.yml` | New CI workflow to execute the Noir suite, upload artifacts, and post/update a sticky PR comment. |
| `.github/scripts/build_noir_pr_comment.py` | Builds a size-bounded sticky PR comment body from artifacts (grouped failures + witness comparison). |
| `scripts/run_noir_execution_success.sh` | Main runner to discover Noir `execution_success` tests and run ProveKit prepare/prove/verify with logging and CSV emission. |
| `scripts/noir_execution_helpers.py` | Python helpers for discovery, Prover.toml resolution, and grouped failure report generation. |
| `scripts/generate_witness_comparison.py` | Generates a markdown comparison table between ProveKit witness counts and Mavros STATUS.md “Cols”. |
| `scripts/noir_skip_tests.txt` | Shared skip list for unsupported-blackbox tests (runner + witness comparison). |
| `scripts/vendor_noir_execution_success.sh` | Sparse-vendors Noir `execution_success` and `test_libraries` into the repo with source metadata. |


+  # --- Unimplemented blackbox skip list: no log, no noise ---
+  if [[ "${SKIP_SET["${leaf_name}"]:-}" == "1" ]]; then
+    echo "SKIP (blackbox): ${test_name}"
+    (( skipped += 1 ))
+    continue
+  fi
+
+  test_log="${LOG_DIR}/per_test/${safe_test_name}.log"
+


+    lines = [
+        "# Mavros vs Provekit Witnesses Count",
+        "",
+        f"Union {len(all_names)} circuits: {len(comparable)} comparable, "
+        f"{missing_in_provekit} missing in Provekit, {missing_in_mavros} missing in Mavros.",
+        f"Among comparable: {equal} equal, {mavros_better} Mavros better, "
+        f"{provekit_better} Provekit better.",
+        "",
+        "| Test | Mavros Cols | Provekit Post-GE | Delta | Better | Factor |",
+        "|------|-------------|------------------|-------|--------|--------|",


+        default: "v1.0.0-beta.19"
+
+permissions:
+  contents: read


+on:
+  pull_request:
+  workflow_dispatch:
+    inputs:


+  cancel-in-progress: true
+
+jobs:
+  noir-execution-success:


github-actions · 2026-04-21T19:25:07Z

Noir execution_success report

| Metric | Value |
|--------|-------|
| Workflow status | [PASS] success |
| Noir ref | `v1.0.0-beta.19` |
| Commit | `de63eae7835d` |
| Run | #25046805681 |
| PASS | 332 |
| FAIL | 15 |
| SKIP | 20 |

Failing circuits (15)

databus
databus_composite_calldata
databus_two_calldata
databus_two_calldata_simple
fold_2_to_17
fold_after_inlined_calls
fold_basic
fold_basic_nested_call
fold_call_witness_condition
fold_complex_outputs
fold_distinct_return
fold_fibonacci
fold_numeric_generic_poseidon
regression_7143
regression_7612

…list Drop the provekit_witness_report.md table from the sticky PR comment and add a "Failing circuits (N)" details section parsed from the [stages] block of grouped_error_report.txt, alongside the grouped report itself.

Adds a separate Noir-CSP benchmark suite that runs prepare/prove/verify on every circuit under noir-examples/csp-benchmarks/ and posts a sticky PR comment with prover time, peak RSS, peak heap, verifier time, proof size, and PKP size, averaged over BENCH_RUNS (default 3) iterations.

- scripts/run_csp_benchmarks.sh: bench runner. Per circuit: nargo compile + provekit-cli prepare once, then prove and verify each three times under /usr/bin/time -f '%e %M', capturing prover stderr so the helper can extract the largest "peak memory" entry from the span_stats trace.
- scripts/csp_benchmark_helpers.py: parses /usr/bin/time output and the SI-formatted peak-memory entries (with U+202F separator), converts to bytes, and emits a single averaged CSV row per circuit.
- .github/scripts/build_csp_pr_comment.py: renders results.csv as a markdown table and wraps it with a marker so it lives in its own sticky comment (no race with the noir-execution-success workflow).
- .github/workflows/csp-benchmarks.yml: ubuntu-24.04-arm, runs on PRs from this repo and on workflow_dispatch (with a configurable bench_runs override).
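
To make the timing capture concrete, here is a small sketch of parsing and averaging `/usr/bin/time -f '%e %M'` output. It is not the code in csp_benchmark_helpers.py: `parse_time_line` and `average_runs` are hypothetical names, and the sketch assumes each sample is a single line of the form `<elapsed seconds> <max RSS>` with the RSS value in units of 1024 bytes (GNU time documents `%M` as kilobytes).

```python
from statistics import fmean


def parse_time_line(line: str) -> tuple[float, int]:
    """Parse '<elapsed seconds> <max RSS in KiB>' into (seconds, bytes)."""
    elapsed, max_rss_kib = line.split()
    return float(elapsed), int(max_rss_kib) * 1024  # assumed 1024-byte units


def average_runs(lines: list[str]) -> tuple[float, float]:
    """Arithmetic mean of elapsed time and peak RSS across iterations."""
    samples = [parse_time_line(line) for line in lines]
    mean_seconds = fmean(seconds for seconds, _ in samples)
    mean_rss_bytes = fmean(rss for _, rss in samples)
    return mean_seconds, mean_rss_bytes
```

Averaging a peak value across runs reports the typical peak rather than the worst case; taking the maximum instead would be an equally defensible choice for memory.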

github-actions · 2026-04-28T08:47:00Z

CSP benchmarks

| Metric | Value |
|--------|-------|
| Workflow status | [PASS] success |
| Commit | `89bf505994a2` |
| Run | #25048427741 |
| Circuits benchmarked | 21 |
| Iterations averaged per circuit | 3 |

Prover time, peak RSS, peak heap, and verifier time are arithmetic means across the iterations. Peak heap comes from the largest peak memory entry in provekit-cli prove's tracing output; peak RSS is reported by /usr/bin/time -v (max-resident-set-size).

No baseline available yet — deltas will appear once this workflow has produced at least one successful main run.

Results

| Circuit | Constraints | Witnesses | Prover time | Peak RSS | Peak heap | Verifier time | Proof size | PKP size |
|---------|-------------|-----------|-------------|----------|-----------|---------------|------------|----------|
| `ecdsa_p256` | 143,282 | 258,158 | 2.98 s | 258 MB | 225 MB | 340 ms | 2.80 MB | 810 KB |
| `keccak_1024` | 822,870 | 1,543,366 | 6.34 s | 985 MB | 953 MB | 843 ms | 3.07 MB | 6.07 MB |
| `keccak_128` | 163,058 | 313,707 | 2.12 s | 273 MB | 242 MB | 370 ms | 2.80 MB | 1.22 MB |
| `keccak_2048` | 1,575,606 | 2,945,822 | 11.82 s | 1.81 GB | 1.80 GB | 1.42 s | 3.30 MB | 12.36 MB |
| `keccak_256` | 256,206 | 487,012 | 2.31 s | 327 MB | 290 MB | 407 ms | 2.83 MB | 1.97 MB |
| `keccak_512` | 445,094 | 839,130 | 3.62 s | 594 MB | 509 MB | 553 ms | 3.01 MB | 3.40 MB |
| `poseidon2_12` | 479 | 563 | 350 ms | 23.52 MB | 14.69 MB | 100 ms | 1.01 MB | 436 KB |
| `poseidon2_16` | 556 | 719 | 353 ms | 23.88 MB | 14.88 MB | 100 ms | 1.05 MB | 530 KB |
| `poseidon2_2` | 231 | 278 | 340 ms | 22.69 MB | 14.11 MB | 100 ms | 1.05 MB | 108 KB |
| `poseidon2_4` | 529 | 535 | 340 ms | 22.98 MB | 14.31 MB | 100 ms | 1.04 MB | 31.67 KB |
| `poseidon2_8` | 363 | 423 | 350 ms | 23.69 MB | 14.50 MB | 103 ms | 1.05 MB | 365 KB |
| `poseidon_12` | 504 | 524 | 350 ms | 23.82 MB | 14.69 MB | 103 ms | 1.04 MB | 410 KB |
| `poseidon_16` | 609 | 633 | 350 ms | 23.71 MB | 14.97 MB | 100 ms | 1.04 MB | 536 KB |
| `poseidon_2` | 240 | 249 | 343 ms | 22.50 MB | 14.02 MB | 100 ms | 1.03 MB | 53.79 KB |
| `poseidon_4` | 297 | 309 | 340 ms | 23.10 MB | 14.31 MB | 100 ms | 1.02 MB | 210 KB |
| `poseidon_8` | 402 | 418 | 350 ms | 23.12 MB | 14.50 MB | 100 ms | 1.06 MB | 305 KB |
| `sha256_1024` | 196,940 | 339,764 | 2.19 s | 302 MB | 273 MB | 420 ms | 2.80 MB | 1.87 MB |
| `sha256_128` | 46,398 | 80,974 | 1.08 s | 99.90 MB | 83.67 MB | 260 ms | 2.53 MB | 513 KB |
| `sha256_2048` | 345,399 | 612,724 | 3.57 s | 547 MB | 484 MB | 600 ms | 2.97 MB | 3.02 MB |
| `sha256_256` | 67,904 | 117,944 | 1.37 s | 151 MB | 130 MB | 290 ms | 2.65 MB | 720 KB |
| `sha256_512` | 110,916 | 191,884 | 1.50 s | 181 MB | 158 MB | 320 ms | 2.65 MB | 1.09 MB |

- scripts/csp_benchmark_helpers.py: spell "unparsable" the way crate-ci/typos expects, fixing the Spell check job that was failing on this PR.
- scripts/run_noir_execution_success.sh: don't exit 1 when only some circuits fail. The PR sticky comment surfaces failing circuits, so the workflow shouldn't block PRs on known provekit-cli limitations. Local callers that want the old behaviour can opt back in via STRICT_FAIL=1.

- build_csp_pr_comment.py: drop the "Runs" column (the iteration count is already shown in the metric metadata table) and accept an optional --baseline-csv. When present, each metric cell shows a percentage delta against the latest successful CSP-benchmarks run on main. Circuits absent from the baseline are flagged "(new)".
- csp-benchmarks.yml: also run on push to main (so each main commit publishes a new baseline artifact) and add a best-effort step that uses gh to look up the latest successful main run, download its results.csv into csp-bench-logs/baseline/, and pass it through to the renderer along with the baseline run id for the comment link.
- Add actions:read so the workflow can list runs and download artifacts from previous runs.
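
The percentage-delta rendering can be sketched as below. This is an illustration under assumptions rather than the renderer's actual code: `format_delta` is a hypothetical name, a missing baseline is modelled as `None`, and the one-decimal format and the `n/a` label for a zero baseline are choices made here.

```python
from typing import Optional


def format_delta(current: float, baseline: Optional[float]) -> str:
    """Render a percentage change against a baseline value."""
    if baseline is None:
        return "(new)"  # no baseline entry for this circuit
    if baseline == 0:
        return "n/a"    # avoid dividing by zero (label is an assumption)
    pct = (current - baseline) / baseline * 100.0
    return f"{pct:+.1f}%"  # explicit sign so regressions stand out
```

For metrics such as prover time or peak memory, a positive delta is a regression, so a renderer may want to annotate direction per metric.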

- build_noir_pr_comment.py: stop rendering the raw grouped_error_report.txt details block. The failing-circuits list and PASS/FAIL/SKIP counts are parsed from the report and shown directly, so the raw dump is redundant. Truncation infrastructure (clip_tail, build_with_truncation, the sanitize_code_fence helper) is removed since the comment is now small enough to fit comfortably; a single hard-cut guard remains.
- build_csp_pr_comment.py: remove the trailing "_This comment is automatically updated_" footer.

Bisht13 · 2026-04-22T08:15:29Z

+    provekit: dict[str, int] = {}
+    with csv_path.open() as f:
+        for row in csv.DictReader(f):
+            leaf = row["test_name"].split("/")[-1]


Key normalization is inconsistent here: Mavros keeps full relative paths (workspace/crates/a) while ProveKit reduces CSV names to the last path segment (a). Nested/workspace tests won’t match correctly in the comparison output.
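
One possible fix is to normalise both sides to the same relative-path key instead of the last segment. The sketch below assumes names may or may not carry a prefix ending in `execution_success`; `normalize_key` is a hypothetical helper, not a function from the PR.

```python
def normalize_key(test_name: str, root: str = "execution_success") -> str:
    """Return the path relative to the suite root, keeping nested segments."""
    parts = [p for p in test_name.replace("\\", "/").split("/") if p]
    if root in parts:
        # Drop everything up to and including the suite root directory.
        parts = parts[parts.index(root) + 1:]
    return "/".join(parts)
```

Applied to both the CSV names and the Mavros names, `workspace/crates/a` stays distinct from a top-level test called `a`, which a leaf-only key would merge.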

Bisht13 · 2026-04-22T08:17:48Z

+
+env:
+  CARGO_TERM_COLOR: always
+  NOIR_REF: ${{ github.event_name == 'workflow_dispatch' && (github.event.inputs.noir_ref != '' && github.event.inputs.noir_ref || 'v1.0.0-beta.19') || 'v1.0.0-beta.19' }}


Here and at line 31: deduplicate NOIR_REF. The default v1.0.0-beta.19 is hardcoded three times: in the env expression, the job name:, and the workflow_dispatch input default. Simplify to:

env:
  NOIR_REF: ${{ github.event.inputs.noir_ref || 'v1.0.0-beta.19' }}

and reference ${{ env.NOIR_REF }} in the job name. Single source of truth for when the default bumps.

Bisht13 · 2026-04-22T08:18:23Z

+  rm -f "${test_log}"
+done
+
+attempted=$((passed + failed + skipped))


attempted count overstates by blackbox-skip count. total is deliberately not incremented for blackbox skips (line 183 comment confirms intent). But here attempted=$((passed + failed + skipped)) re-adds them. Observed run: PASS=332, FAIL=15, SKIP=20 (all 20 blackbox) → reports Attempted=367, actual=347. Fix: attempted=${total}.
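
The arithmetic for the observed run can be checked with a few lines of Python. The variable names are illustrative, and the sketch assumes, as stated above, that all 20 skips were blackbox skips that never incremented `total`.

```python
# Counts from the observed run quoted above.
passed, failed, blackbox_skipped = 332, 15, 20

# total is only incremented for tests that get past the blackbox skip check.
total = passed + failed

attempted_old = passed + failed + blackbox_skipped  # re-adds the skips
attempted_new = total                               # proposed fix

print(attempted_old, attempted_new)
```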

Reads the counts from `provekit-cli prove`'s `Read Noir proof scheme` log line (already captured per run); no CLI changes required. Renders both as integer columns with thousands separators and the same delta-vs-main treatment as the perf metrics. Backwards-compatible with older baseline CSVs lacking the columns.

- Deduplicate NOIR_REF: collapse the nested ternary in env to a single `||` fallback and have the job name read `${{ env.NOIR_REF }}` so the literal version string only needs to bump in two places.
- Fix attempted overcount: blackbox skips bump `skipped` without bumping `total`, so summing passed+failed+skipped double-counts them. Track `attempted=${total}` instead. Observed 332 PASS / 15 FAIL / 20 SKIP run now reports 347 (was 367).

Copilot AI review requested due to automatic review settings April 15, 2026 18:27

Copilot started reviewing on behalf of x-senpai-x April 15, 2026 18:27

Copilot AI reviewed Apr 15, 2026


x-senpai-x requested a review from Bisht13 April 16, 2026 07:06

x-senpai-x force-pushed the us/bump-noir branch from 8132422 to 61fb151 Compare April 19, 2026 21:25

Base automatically changed from us/bump-noir to main April 20, 2026 11:50

x-senpai-x force-pushed the us/noir-integration-tests branch from ea55439 to dbe013c Compare April 20, 2026 11:53

x-senpai-x force-pushed the us/noir-integration-tests branch from 04664f5 to 13dcba6 Compare April 21, 2026 18:28

x-senpai-x requested a review from Copilot April 21, 2026 18:29

Copilot started reviewing on behalf of x-senpai-x April 21, 2026 18:30

Copilot AI reviewed Apr 21, 2026


x-senpai-x and others added 11 commits April 28, 2026 13:46

noir_execution_success test suite in workflow

c031e6d

copilot issues addressed

120dd27

added mavros comparison

3541e6c

made v19 defauilt in vendor script

688d761

added missing tests in one of them as well

4d24b89

refactored shell script

fe608a9

minor nits

65fd51a

switched runner to ubuntu-24.04-arm

8e01106

removed mavros comparison

d7eca99

Bisht13 force-pushed the us/noir-integration-tests branch from 2cce98f to 5f0a319 Compare April 28, 2026 08:22

Bisht13 added 3 commits April 28, 2026 14:54

Bisht13 approved these changes Apr 28, 2026


Bisht13 added 2 commits April 28, 2026 16:16

Bisht13 merged commit 1c18dc5 into main Apr 28, 2026
8 of 9 checks passed

Bisht13 deleted the us/noir-integration-tests branch April 28, 2026 11:17

Bisht13 merged 16 commits into main from us/noir-integration-tests