Add Noir's Execution Success test suite#411

Merged
Bisht13 merged 16 commits into main from us/noir-integration-tests
Apr 28, 2026

Collaborator

@x-senpai-x x-senpai-x commented Apr 15, 2026

Summary

This PR adds a dedicated Noir execution_success integration workflow for ProveKit, plus reporting that compares ProveKit witness counts against Mavros.

The goal is to make it easy to run the suite on demand, preserve actionable logs as artifacts, and provide an always-updated comparison table in CI output.


What This PR Adds

1) New GitHub Action workflow for Noir execution-success tests

  • Adds .github/workflows/noir-execution-success.yml
  • Trigger: workflow_dispatch (manual run)
  • Input: noir_ref (default v1.0.0-beta.19)
  • Runner: self-hosted (Linux, ARM64, provekit-build)
  • Builds provekit-cli, installs Noir toolchain, sparse-clones Noir test programs, runs suite script, uploads logs artifact, and cleans up temp clone.

2) New runner script for suite execution

  • Adds scripts/run_noir_execution_success.sh
  • Discovers runnable tests under Noir test_programs/execution_success
  • Runs:
    • nargo compile (with optional -Zenums fallback)
    • provekit-cli prepare
    • provekit-cli prove
    • provekit-cli verify
  • Produces:
    • per-test logs (failures/skips retained, pass logs deleted to keep artifacts lean)
    • grouped failure report
    • witness count CSV
    • witness comparison markdown
  • Enforces required nargo version and Python tomllib support.

3) New vendoring helper

  • Adds scripts/vendor_noir_execution_success.sh
  • Sparse-vendors Noir execution_success + test_libraries
  • Writes source metadata file

4) Witness comparison generator

  • Adds scripts/generate_witness_comparison.py
  • Reads ProveKit witness CSV
  • Fetches live Mavros STATUS.md (reilabs/mavros/main) and parses only:
    • noir/test_programs/execution_success/*
  • Uses hardcoded fallback map if live fetch/parse fails.
  • Comparison includes full gap visibility:
  • witness_comparison.md reports the union of test names from both sides.
  • If ProveKit or Mavros fails or skips a test, the row is marked as:
    • missing_provekit
    • missing_mavros
  • Summary includes:
    • total union count
    • comparable count
    • missing counts on each side
    • equal / better breakdown among comparable rows.
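
The union-and-gap logic described above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/generate_witness_comparison.py: the function names, the row layout, and the status labels other than missing_provekit and missing_mavros are hypothetical, and it assumes that a lower witness count is the better one.

```python
from collections import Counter


def compare_witnesses(provekit, mavros):
    """Build one row per test name in the union of both inputs.

    provekit and mavros map test name -> witness/column count.
    Assumption: a smaller count is considered better.
    """
    rows = []
    for name in sorted(set(provekit) | set(mavros)):
        pk, mv = provekit.get(name), mavros.get(name)
        if pk is None:
            status = "missing_provekit"
        elif mv is None:
            status = "missing_mavros"
        elif pk == mv:
            status = "equal"  # hypothetical label
        elif pk < mv:
            status = "provekit_better"  # hypothetical label
        else:
            status = "mavros_better"  # hypothetical label
        rows.append((name, mv, pk, status))
    return rows


def summarize(rows):
    """Count the union, the comparable rows, and each status."""
    by_status = Counter(status for *_, status in rows)
    missing = by_status["missing_provekit"] + by_status["missing_mavros"]
    return {
        "union": len(rows),
        "comparable": len(rows) - missing,
        **by_status,
    }
```

A test present on only one side still gets a row, which is what keeps prepare failures visible in the report instead of silently dropping them.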

Artifact Contents (What gets uploaded)

Workflow uploads noir-execution-logs/ as artifact:

  • grouped_error_report.txt
  • provekit_witness_counts.csv
  • witness_comparison.md
  • per_test/*.log for failed/skipped tests

Retention: 7 days.


Note:

  • Prepare failures do not produce a ProveKit witness row, but those circuits appear in comparison as missing_provekit if present on Mavros side.

Copilot AI review requested due to automatic review settings April 15, 2026 18:27
Contributor

Copilot AI left a comment

Pull request overview

Adds an automated harness to run Noir’s upstream execution_success test corpus through provekit-cli, with both a vendoring script for local use and a GitHub Actions workflow for on-demand runs.

Changes:

  • Add a vendoring script to copy Noir’s test_programs/{execution_success,test_libraries} into test-programs/noir/ with source metadata.
  • Add a runner script that discovers runnable Noir test projects, compiles them with nargo, then runs provekit-cli prepare/prove/verify, producing logs and a grouped failure report.
  • Add a workflow_dispatch GitHub Actions workflow to fetch Noir test programs via sparse checkout and execute the suite on a self-hosted runner.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| scripts/vendor_noir_execution_success.sh | Vendors Noir test programs into the repo for local/offline runs. |
| scripts/run_noir_execution_success.sh | Discovers and executes Noir execution_success tests via nargo + provekit-cli, generating logs/reports. |
| .github/workflows/noir-execution-success.yml | Workflow to fetch Noir tests (sparse) and run the suite in CI with artifacts. |

Comment thread .github/workflows/noir-execution-success.yml
Comment thread scripts/run_noir_execution_success.sh Outdated
Comment thread scripts/run_noir_execution_success.sh
Comment thread scripts/vendor_noir_execution_success.sh
Comment thread .github/workflows/noir-execution-success.yml Outdated
Comment thread .github/workflows/noir-execution-success.yml Outdated
@x-senpai-x x-senpai-x requested a review from Bisht13 April 16, 2026 07:06
Base automatically changed from us/bump-noir to main April 20, 2026 11:50
@x-senpai-x x-senpai-x force-pushed the us/noir-integration-tests branch from ea55439 to dbe013c Compare April 20, 2026 11:53
Collaborator

Bisht13 commented Apr 21, 2026

/claude-review

@github-actions

Code Review

Summary

This PR adds CI/tooling infrastructure for running Noir's execution_success test suite against ProveKit: a GitHub Actions workflow, a bash test runner, a Python witness-count comparison generator, and a vendoring helper. No Rust proof-system code is touched, so ZK soundness and transcript invariants from CLAUDE.md are N/A.


Review

Correctness & Soundness

No issues with ZK proof system correctness — this is purely CI/test tooling.

One logic issue in run_noir_execution_success.sh: the (( total += 1 )) increment at line 779 happens before the blackbox skip check at line 790. When MAX_TESTS is set, blackbox-skipped tests consume cap slots. If the sorted test list begins with many blackbox tests, the effective number of prove/verify tests that run could be much less than MAX_TESTS. Moving the blackbox-skip check before the total increment would give more intuitive behaviour. Non-blocking since MAX_TESTS is not set in the workflow.
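
To make the effect of the ordering concrete, here is a small Python model of the loop. It is only a sketch: the real runner is a bash script, and the function name, the position of the cap check, and the sample test names are assumptions made for illustration.

```python
def executed_tests(tests, skip_set, max_tests, skip_check_first):
    """Return the tests that would actually run under a MAX_TESTS-style cap.

    skip_check_first=False models incrementing the counter before the
    blackbox skip check; True models checking the skip list first.
    """
    total = 0
    executed = []
    for name in tests:
        if total >= max_tests:
            break
        if skip_check_first and name in skip_set:
            continue  # skipped tests never touch the counter
        total += 1
        if name in skip_set:
            continue  # counted against the cap, but nothing ran
        executed.append(name)
    return executed


tests = ["bb1", "bb2", "t1", "t2", "t3"]
skip = {"bb1", "bb2"}
print(executed_tests(tests, skip, 3, skip_check_first=False))
print(executed_tests(tests, skip, 3, skip_check_first=True))
```

With two skipped tests at the front of the list and a cap of 3, the first ordering runs a single test while the second runs three.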

Also at line 817, grep -qr targets a specific file — the -r (recursive) flag is unnecessary and semantically incorrect. Should be grep -q 'test_libraries' "${test_dir}/Nargo.toml".

Code Quality

The bash script is well-structured: set -euo pipefail, sandbox isolation per test with proper trap EXIT INT TERM cleanup, mapfile -t for safe test discovery, and graceful degradation when TEST_LIB_ROOT is absent. Good.

The Python script handles live-fetch failures gracefully with a hardcoded fallback and a 15s timeout. The merge dict order ({**MAVROS_COLS, **live}) correctly lets live data take precedence over the hardcoded table. Good.
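
For readers unfamiliar with the idiom, the precedence rule works like this. The dictionary contents and the helper name below are made up for the example; only the `{**MAVROS_COLS, **live}` shape comes from the script.

```python
# Hypothetical fallback table; the real values live in the script.
MAVROS_COLS = {"test_a": 100, "test_b": 200}


def merged_cols(fallback, live):
    """Later unpacking wins, so live entries override fallback entries.

    A failed fetch is modelled here as live=None or an empty dict.
    """
    return {**fallback, **(live or {})}


print(merged_cols(MAVROS_COLS, {"test_b": 180, "test_c": 50}))
print(merged_cols(MAVROS_COLS, None))
```

Keys that exist only in the fallback survive the merge, so a partial live parse still yields a complete table.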

One maintenance concern: SKIP_TESTS is defined identically in both generate_witness_comparison.py (line 319) and run_noir_execution_success.sh (line 531). The # Keep in sync comment helps, but this is a divergence hazard. A single source of truth — e.g., a shared text file read by both scripts, or a CI assertion diffing the two lists — would be more robust. Non-blocking since they currently match exactly.
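
One possible shape for the shared-file option, on the Python side, is sketched below. The file format (one test name per line, `#` comments, blank lines ignored) and both function names are assumptions, not something this PR defines.

```python
from pathlib import Path


def parse_skip_list(text):
    """Parse skip-list text into a set of test names.

    Assumed format: one name per line; '#' starts a comment;
    blank lines are ignored.
    """
    names = set()
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if entry:
            names.add(entry)
    return names


def load_skip_tests(path):
    """Read the shared skip list from disk."""
    return parse_skip_list(Path(path).read_text(encoding="utf-8"))
```

The shell runner could read the same file line by line, so both scripts would pick up a new entry from a single edit.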

Security

Two best-practice issues with the workflow:

  1. No explicit permissions block (.github/workflows/noir-execution-success.yml). Without it, the job inherits the repository's default token permissions. Adding an explicit minimal block prevents accidental scope creep:

    permissions:
      contents: read
  2. Actions pinned to floating tags (actions/checkout@v4, moonrepo/setup-rust@v1, noir-lang/noirup@v0.1.2, actions/upload-artifact@v4) rather than SHA digests. SHA-pinning with a version comment is the hardened approach. Non-blocking for a manual-dispatch-only internal workflow, but worth noting.

Testing

This PR is itself the test infrastructure; validation happens by running it. The Python script is pure data transformation (CSV to markdown) and is simple enough for manual review to suffice.

One edge case handled correctly: if provekit_witness_counts.csv is empty (all tests failed/skipped), the comparison script produces all Mavros rows marked missing_provekit — valid and informative output.

Suggestions

  • (Non-blocking) Move the blackbox-skip check before (( total += 1 )) so MAX_TESTS counts only runnable tests.
  • (Non-blocking) Drop -r from grep -qr at run_noir_execution_success.sh line 817.
  • (Non-blocking) Add permissions: contents: read to the workflow job.
  • (Non-blocking) Consider SHA-pinning third-party actions.
  • (Non-blocking) Establish a single source of truth for SKIP_TESTS shared between the shell and Python scripts.
  • (Nitpick) _MAVROS_STATUS_URL is pinned to the main branch of reilabs/mavros. If that branch is renamed, the live fetch silently falls back to hardcoded data — acceptable since the fallback is handled correctly.

Verdict

⚠️ Approve with suggestions — the PR is functionally correct and well-engineered. All suggestions above are non-blocking; the most impactful to address before merge is adding the explicit permissions block to the workflow.

Contributor

Copilot AI left a comment

Pull request overview

Adds an integration workflow and supporting scripts to run Noir’s test_programs/execution_success suite through provekit-cli, collect logs/artifacts, and generate a ProveKit vs Mavros witness-count comparison report for CI visibility.

Changes:

  • Introduces a new GitHub Actions workflow to build ProveKit, install Noir, fetch Noir test programs via sparse checkout, run the suite, and upload artifacts (plus PR reporting).
  • Adds a suite runner + helpers (bash + python) to discover tests, run compile/prepare/prove/verify, and produce grouped failure reports + witness CSV.
  • Adds vendoring + skip-list + witness comparison tooling (including live STATUS.md fetch with fallback data).

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| .github/workflows/noir-execution-success.yml | New CI workflow to execute the Noir suite, upload artifacts, and post/update a sticky PR comment. |
| .github/scripts/build_noir_pr_comment.py | Builds a size-bounded sticky PR comment body from artifacts (grouped failures + witness comparison). |
| scripts/run_noir_execution_success.sh | Main runner to discover Noir execution_success tests and run ProveKit prepare/prove/verify with logging and CSV emission. |
| scripts/noir_execution_helpers.py | Python helpers for discovery, Prover.toml resolution, and grouped failure report generation. |
| scripts/generate_witness_comparison.py | Generates a markdown comparison table between ProveKit witness counts and Mavros STATUS.md “Cols”. |
| scripts/noir_skip_tests.txt | Shared skip list for unsupported-blackbox tests (runner + witness comparison). |
| scripts/vendor_noir_execution_success.sh | Sparse-vendors Noir execution_success and test_libraries into the repo with source metadata. |

Comment on lines +182 to +190
    # --- Unimplemented blackbox skip list: no log, no noise ---
    if [[ "${SKIP_SET["${leaf_name}"]:-}" == "1" ]]; then
      echo "SKIP (blackbox): ${test_name}"
      (( skipped += 1 ))
      continue
    fi

    test_log="${LOG_DIR}/per_test/${safe_test_name}.log"

Comment thread scripts/generate_witness_comparison.py Outdated
Comment on lines +302 to +311
    lines = [
        "# Mavros vs Provekit Witnesses Count",
        "",
        f"Union {len(all_names)} circuits: {len(comparable)} comparable, "
        f"{missing_in_provekit} missing in Provekit, {missing_in_mavros} missing in Mavros.",
        f"Among comparable: {equal} equal, {mavros_better} Mavros better, "
        f"{provekit_better} Provekit better.",
        "",
        "| Test | Mavros Cols | Provekit Post-GE | Delta | Better | Factor |",
        "|------|-------------|------------------|-------|--------|--------|",
        default: "v1.0.0-beta.19"

    permissions:
      contents: read

Comment on lines +4 to +7

    on:
      pull_request:
      workflow_dispatch:
        inputs:

      cancel-in-progress: true

    jobs:
      noir-execution-success:

github-actions Bot commented Apr 21, 2026

Noir execution_success report

| Metric | Value |
| --- | --- |
| Workflow status | [PASS] success |
| Noir ref | v1.0.0-beta.19 |
| Commit | de63eae7835d |
| Run | #25046805681 |
| PASS | 332 |
| FAIL | 15 |
| SKIP | 20 |

Failing circuits (15)
  • databus
  • databus_composite_calldata
  • databus_two_calldata
  • databus_two_calldata_simple
  • fold_2_to_17
  • fold_after_inlined_calls
  • fold_basic
  • fold_basic_nested_call
  • fold_call_witness_condition
  • fold_complex_outputs
  • fold_distinct_return
  • fold_fibonacci
  • fold_numeric_generic_poseidon
  • regression_7143
  • regression_7612

x-senpai-x and others added 11 commits April 28, 2026 13:46
…list

Drop the provekit_witness_report.md table from the sticky PR comment and
add a "Failing circuits (N)" details section parsed from the [stages]
block of grouped_error_report.txt, alongside the grouped report itself.
Adds a separate Noir-CSP benchmark suite that runs prepare/prove/verify
on every circuit under noir-examples/csp-benchmarks/ and posts a sticky
PR comment with prover time, peak RSS, peak heap, verifier time, proof
size, and PKP size — averaged over BENCH_RUNS (default 3) iterations.

- scripts/run_csp_benchmarks.sh: bench runner. Per circuit: nargo
  compile + provekit-cli prepare once, then prove and verify each three
  times under /usr/bin/time -f '%e %M', capturing prover stderr so the
  helper can extract the largest "peak memory" entry from the
  span_stats trace.
- scripts/csp_benchmark_helpers.py: parses /usr/bin/time output and the
  SI-formatted peak-memory entries (with U+202F separator), converts
  to bytes, and emits a single averaged CSV row per circuit.
- .github/scripts/build_csp_pr_comment.py: renders results.csv as a
  markdown table and wraps it with a <!-- csp-benchmarks-report -->
  marker so it lives in its own sticky comment (no race with the
  noir-execution-success workflow).
- .github/workflows/csp-benchmarks.yml: ubuntu-24.04-arm, runs on PRs
  from this repo and on workflow_dispatch (with a configurable
  bench_runs override).
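
The peak-memory extraction and averaging described for scripts/csp_benchmark_helpers.py could look roughly like the sketch below. It is not the helper's actual code: the function names and the exact layout of a trace line are assumptions, and the units are treated as decimal multiples because the commit message only says "SI-formatted".

```python
import re

# Decimal (SI) multipliers; assumed, since the source only says "SI-formatted".
_UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
# A number, optional whitespace (U+202F listed explicitly), then a unit.
_SIZE = re.compile(r"(\d+(?:\.\d+)?)[\s\u202f]*([KMGT]?B)\b")


def parse_size(text):
    """Return the first size found in text as bytes, or None if absent."""
    match = _SIZE.search(text)
    if match is None:
        return None
    return round(float(match.group(1)) * _UNITS[match.group(2)])


def largest_peak(trace_lines):
    """Largest size among the lines that mention 'peak memory'."""
    sizes = [parse_size(line) for line in trace_lines if "peak memory" in line]
    sizes = [size for size in sizes if size is not None]
    return max(sizes) if sizes else None


def mean(values):
    """Arithmetic mean across benchmark iterations."""
    return sum(values) / len(values)
```

Taking the maximum per run and then the mean across runs matches the "largest entry, averaged over BENCH_RUNS" description, although the real helper may order those steps differently.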
@Bisht13 Bisht13 force-pushed the us/noir-integration-tests branch from 2cce98f to 5f0a319 Compare April 28, 2026 08:22

github-actions Bot commented Apr 28, 2026

CSP benchmarks

| Metric | Value |
| --- | --- |
| Workflow status | [PASS] success |
| Commit | 89bf505994a2 |
| Run | #25048427741 |
| Circuits benchmarked | 21 |
| Iterations averaged per circuit | 3 |

Prover time, peak RSS, peak heap, and verifier time are arithmetic means across the iterations. Peak heap comes from the largest peak memory entry in provekit-cli prove's tracing output; peak RSS is reported by /usr/bin/time -v (max-resident-set-size).

No baseline available yet — deltas will appear once this workflow has produced at least one successful main run.

Results

| Circuit | Constraints | Witnesses | Prover time | Peak RSS | Peak heap | Verifier time | Proof size | PKP size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ecdsa_p256 | 143,282 | 258,158 | 2.98 s | 258 MB | 225 MB | 340 ms | 2.80 MB | 810 KB |
| keccak_1024 | 822,870 | 1,543,366 | 6.34 s | 985 MB | 953 MB | 843 ms | 3.07 MB | 6.07 MB |
| keccak_128 | 163,058 | 313,707 | 2.12 s | 273 MB | 242 MB | 370 ms | 2.80 MB | 1.22 MB |
| keccak_2048 | 1,575,606 | 2,945,822 | 11.82 s | 1.81 GB | 1.80 GB | 1.42 s | 3.30 MB | 12.36 MB |
| keccak_256 | 256,206 | 487,012 | 2.31 s | 327 MB | 290 MB | 407 ms | 2.83 MB | 1.97 MB |
| keccak_512 | 445,094 | 839,130 | 3.62 s | 594 MB | 509 MB | 553 ms | 3.01 MB | 3.40 MB |
| poseidon2_12 | 479 | 563 | 350 ms | 23.52 MB | 14.69 MB | 100 ms | 1.01 MB | 436 KB |
| poseidon2_16 | 556 | 719 | 353 ms | 23.88 MB | 14.88 MB | 100 ms | 1.05 MB | 530 KB |
| poseidon2_2 | 231 | 278 | 340 ms | 22.69 MB | 14.11 MB | 100 ms | 1.05 MB | 108 KB |
| poseidon2_4 | 529 | 535 | 340 ms | 22.98 MB | 14.31 MB | 100 ms | 1.04 MB | 31.67 KB |
| poseidon2_8 | 363 | 423 | 350 ms | 23.69 MB | 14.50 MB | 103 ms | 1.05 MB | 365 KB |
| poseidon_12 | 504 | 524 | 350 ms | 23.82 MB | 14.69 MB | 103 ms | 1.04 MB | 410 KB |
| poseidon_16 | 609 | 633 | 350 ms | 23.71 MB | 14.97 MB | 100 ms | 1.04 MB | 536 KB |
| poseidon_2 | 240 | 249 | 343 ms | 22.50 MB | 14.02 MB | 100 ms | 1.03 MB | 53.79 KB |
| poseidon_4 | 297 | 309 | 340 ms | 23.10 MB | 14.31 MB | 100 ms | 1.02 MB | 210 KB |
| poseidon_8 | 402 | 418 | 350 ms | 23.12 MB | 14.50 MB | 100 ms | 1.06 MB | 305 KB |
| sha256_1024 | 196,940 | 339,764 | 2.19 s | 302 MB | 273 MB | 420 ms | 2.80 MB | 1.87 MB |
| sha256_128 | 46,398 | 80,974 | 1.08 s | 99.90 MB | 83.67 MB | 260 ms | 2.53 MB | 513 KB |
| sha256_2048 | 345,399 | 612,724 | 3.57 s | 547 MB | 484 MB | 600 ms | 2.97 MB | 3.02 MB |
| sha256_256 | 67,904 | 117,944 | 1.37 s | 151 MB | 130 MB | 290 ms | 2.65 MB | 720 KB |
| sha256_512 | 110,916 | 191,884 | 1.50 s | 181 MB | 158 MB | 320 ms | 2.65 MB | 1.09 MB |

Bisht13 added 3 commits April 28, 2026 14:54
- scripts/csp_benchmark_helpers.py: spell "unparsable" the way crate-ci/typos
  expects, fixing the Spell check job that was failing on this PR.
- scripts/run_noir_execution_success.sh: don't exit 1 when only some
  circuits fail. The PR sticky comment surfaces failing circuits, so the
  workflow shouldn't block PRs on known provekit-cli limitations. Local
  callers that want the old behaviour can opt back in via STRICT_FAIL=1.
- build_csp_pr_comment.py: drop the "Runs" column (the iteration count
  is already shown in the metric metadata table) and accept an optional
  --baseline-csv. When present, each metric cell shows a percentage
  delta against the latest successful CSP-benchmarks run on main.
  Circuits absent from the baseline are flagged "(new)".
- csp-benchmarks.yml: also run on push to main (so each main commit
  publishes a new baseline artifact) and add a best-effort step that
  uses gh to look up the latest successful main run, download its
  results.csv into csp-bench-logs/baseline/, and pass it through to
  the renderer along with the baseline run id for the comment link.
- Add actions:read so the workflow can list runs and download
  artifacts from previous runs.
- build_noir_pr_comment.py: stop rendering the raw grouped_error_report.txt
  details block. The failing-circuits list and PASS/FAIL/SKIP counts are
  parsed from the report and shown directly, so the raw dump is redundant.
  Truncation infrastructure (clip_tail, build_with_truncation, the
  sanitize_code_fence helper) is removed since the comment is now small
  enough to fit comfortably; a single hard-cut guard remains.
- build_csp_pr_comment.py: remove the trailing "_This comment is
  automatically updated_" footer.
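
The baseline delta treatment described for build_csp_pr_comment.py can be illustrated with a short sketch. The function name, the one-decimal formatting, and the handling of a zero baseline are assumptions; only the percentage delta and the "(new)" flag come from the commit message.

```python
def format_delta(current, baseline):
    """Render the change of a metric relative to the baseline run.

    baseline is None when the circuit is absent from the baseline CSV.
    """
    if baseline is None:
        return "(new)"
    if baseline == 0:
        return "n/a"  # assumed: avoid dividing by a zero baseline
    pct = (current - baseline) / baseline * 100
    return f"{pct:+.1f}%"


print(format_delta(110, 100))
print(format_delta(90, 100))
print(format_delta(5, None))
```

Because an older baseline CSV may lack newer columns, a renderer along these lines would pass None for any metric it cannot find rather than fail.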
Comment thread scripts/generate_witness_comparison.py Outdated
provekit: dict[str, int] = {}
with csv_path.open() as f:
for row in csv.DictReader(f):
leaf = row["test_name"].split("/")[-1]
Collaborator

Key normalization is inconsistent here: Mavros keeps full relative paths (workspace/crates/a) while ProveKit reduces CSV names to the last path segment (a). Nested/workspace tests won’t match correctly in the comparison output.
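
One way to address this is to run both sides through the same key function, as in the sketch below. The function name is hypothetical, and it assumes that every name is either already relative to execution_success/ or contains that directory somewhere in its path.

```python
MARKER = "execution_success/"


def normalize_key(name):
    """Reduce a test name to its path relative to execution_success/.

    Names that do not contain the marker are assumed to be relative already.
    """
    path = name.strip().strip("/")
    idx = path.find(MARKER)
    if idx != -1:
        path = path[idx + len(MARKER):]
    return path


mavros_name = "noir/test_programs/execution_success/workspace/crates/a"
provekit_name = "workspace/crates/a"
print(normalize_key(mavros_name) == normalize_key(provekit_name))
```

Keeping the full relative path instead of the last segment also prevents two nested tests that share a leaf name from collapsing into one row.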


env:
CARGO_TERM_COLOR: always
NOIR_REF: ${{ github.event_name == 'workflow_dispatch' && (github.event.inputs.noir_ref != '' && github.event.inputs.noir_ref || 'v1.0.0-beta.19') || 'v1.0.0-beta.19' }}
Collaborator

Here and at line 31.
Deduplicate NOIR_REF: the default v1.0.0-beta.19 is hardcoded three times, across the env expression, the job name:, and the workflow_dispatch input default. Simplify to:

    env:
      NOIR_REF: ${{ github.event.inputs.noir_ref || 'v1.0.0-beta.19' }}

and reference ${{ env.NOIR_REF }} in the job name. That gives a single source of truth when the default is bumped.

Comment thread scripts/run_noir_execution_success.sh Outdated
rm -f "${test_log}"
done

attempted=$((passed + failed + skipped))
Collaborator

attempted count overstates by blackbox-skip count. total is deliberately not incremented for blackbox skips (line 183 comment confirms intent). But here attempted=$((passed + failed + skipped)) re-adds them. Observed run: PASS=332, FAIL=15, SKIP=20 (all 20 blackbox) → reports Attempted=367, actual=347. Fix: attempted=${total}.
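
The arithmetic behind the two numbers, as a small Python model of the bash counters. The function name is hypothetical, and treating non-blackbox skips as incrementing total is an assumption (the observed run had none).

```python
def attempted_counts(passed, failed, skipped_blackbox, skipped_other=0):
    """Return (total, overstated) for a run.

    total models the loop counter, which blackbox skips never increment.
    overstated models summing passed + failed + skipped.
    """
    total = passed + failed + skipped_other
    overstated = passed + failed + skipped_blackbox + skipped_other
    return total, overstated


print(attempted_counts(332, 15, 20))  # → (347, 367)
```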

Bisht13 added 2 commits April 28, 2026 16:16
Reads the counts from `provekit-cli prove`'s `Read Noir proof scheme`
log line (already captured per run); no CLI changes required. Renders
both as integer columns with thousands separators and the same
delta-vs-main treatment as the perf metrics. Backwards-compatible with
older baseline CSVs lacking the columns.
- Deduplicate NOIR_REF: collapse the nested ternary in env to a single
  `||` fallback and have the job name read `${{ env.NOIR_REF }}` so
  the literal version string only needs to bump in two places.
- Fix attempted overcount: blackbox skips bump `skipped` without
  bumping `total`, so summing passed+failed+skipped double-counts
  them. Track `attempted=${total}` instead. Observed 332 PASS / 15
  FAIL / 20 SKIP run now reports 347 (was 367).
@Bisht13 Bisht13 merged commit 1c18dc5 into main Apr 28, 2026
8 of 9 checks passed
@Bisht13 Bisht13 deleted the us/noir-integration-tests branch April 28, 2026 11:17