Does a Dafny lemma actually mean what a natural language requirement says? Dafny can verify proofs, but it can't verify meaning. Claimcheck fills that gap.
Someone else (Claude Code, a human, any agent) writes the lemmas and claims "requirement X is covered by lemma Y." Claimcheck verifies that claim via a round-trip: informalize the lemma back to English (without seeing the requirement), then compare.
Blog post: claimcheck: Narrowing the Gap between Proof and Intent.
npm install
npm linkStructural separation — different models for informalization and comparison:
1. Extract mapped lemmas from claims .dfy file
2. Batch informalize all lemmas (haiku) — does NOT see requirements
3. Batch compare back-translations against requirements (sonnet)
4. Report: confirmed / disputed
Prompt-level separation — one model does both passes sequentially:
1. Extract mapped lemmas from claims .dfy file
2. For each mapping: single LLM call with two-pass prompt
a. Pass 1: informalize the lemma (before seeing the NL requirement)
b. Pass 2: compare, check vacuity, flag surprising restrictions
3. Report with richer verdicts: JUSTIFIED / PARTIALLY_JUSTIFIED / NOT_JUSTIFIED / VACUOUS
Same single-prompt approach piped through claude -p — no structural or prompt-level separation (the model sees everything at once). Useful for comparing whether look-ahead matters.
# Two-pass audit (default)
claimcheck \
-m test/integration/mappings/counter.json \
--dfy test/integration/claims/counter.dfy \
-d counter
# With explicit module (needed for --verify with module-based .dfy files)
claimcheck \
-m test/integration/mappings/counter.json \
--dfy test/integration/claims/counter.dfy \
--module CounterDomain -d counter --verify
# Single-prompt audit
claimcheck \
-m test/integration/mappings/counter.json \
--dfy test/integration/claims/counter.dfy \
-d counter --single-prompt --jsonPre-extract your claims and pipe them in:
echo '{
"claims": [
{
"requirement": "The counter value is always non-negative",
"lemmaName": "CounterNonNegative",
"dafnyCode": "lemma CounterNonNegative(m: int)\n requires Inv(m)\n ensures m >= 0\n{}"
}
],
"domain": "counter"
}' | claimcheck --stdinimport { claimcheck } from 'claimcheck';
const { results, tokenUsage } = await claimcheck({
claims: [
{
requirement: 'The counter value is always non-negative',
lemmaName: 'CounterNonNegative',
dafnyCode: 'lemma CounterNonNegative(m: int)\n requires Inv(m)\n ensures m >= 0\n{}',
},
],
domain: 'counter',
});# All test projects
node test/integration/run-all.js
# Single test project
node test/integration/run-all.js counter
# Run benchmarks
node eval/bench.js --runs 3 --label two-pass
node eval/bench.js --runs 3 --label single-prompt --single-prompt
node eval/bench-cc.js --runs 1 --label cc-sonnet
# Run a single domain or lemma
node eval/bench-cc.js --runs 1 --label test --domain counter
node eval/bench-cc.js --runs 1 --label test --domain counter --lemma CounterNonNegative
# Compare results
node eval/compare.js two-pass single-prompt
node eval/compare.js two-pass cc-sonnet| Flag | Description |
|---|---|
-m, --mapping <path> |
Path to mapping file (JSON: [{requirement, lemmaName}, ...]) |
--dfy <path> |
Path to claims .dfy file (contains the lemmas) |
--module <name> |
Dafny module name (optional; needed for --verify with module-based files) |
-d, --domain <name> |
Human-readable domain name (default: from --module or .dfy filename) |
--json |
Output JSON instead of markdown |
--single-prompt |
Use single-prompt claimcheck mode (one call per pair) |
--model <id> |
Model for single-prompt mode (default: sonnet) |
--verify |
Also run dafny verify on each lemma |
--informalize-model <id> |
Model for back-translation in two-pass mode (default: haiku) |
--compare-model <id> |
Model for comparison in two-pass mode (default: sonnet) |
--stdin |
Read JSON from stdin (pure claimcheck, no file extraction) |
-v, --verbose |
Verbose API/verification logging |
For each mapping entry, one of:
| Status | Meaning |
|---|---|
| confirmed | Round-trip passed — lemma faithfully expresses the requirement |
| disputed | Round-trip failed — discrepancy between lemma meaning and requirement |
| verify-failed | Dafny verification failed (only with --verify flag) |
| error | Lemma not found in source |
In single-prompt mode, disputed results include richer detail: verdict category, vacuity analysis, and surprising restrictions.
| Project | Claims file | Domain module |
|---|---|---|
| counter | test/integration/claims/counter.dfy |
CounterDomain |
| kanban | test/integration/claims/kanban.dfy |
KanbanDomain |
| colorwheel | test/integration/claims/colorwheel.dfy |
ColorWheelDomain |
| canon | test/integration/claims/canon.dfy |
CanonDomain |
| delegation-auth | test/integration/claims/delegation-auth.dfy |
DelegationAuthDomain |
Each claims file includes its domain from ../dafny-replay/. Mappings in test/integration/mappings/.
Accuracy across 5 domains (counter, kanban, colorwheel, canon, delegation-auth) with 36 requirement-lemma pairs, including 8 deliberately bogus lemmas (tautologies, weakened postconditions, narrowed scope):
| Variant | Accuracy | Time/run | API calls/run |
|---|---|---|---|
| Two-pass (default) | 96.3% (104/108) | ~108s | 2 (batch informalize + batch compare) |
| Single-prompt | 86.1% (31/36) | ~353s | 36 (one per pair) |
Claude Code (bench-cc) |
69.4% (25/36) | ~693s | 36 (one claude -p per pair) |
Two-pass had 3 runs; single-prompt and Claude Code had 1 run each.
Key takeaways:
- Structural separation (two-pass) is both the most accurate and fastest
- The informalize-without-seeing-requirement step prevents anchoring bias
- Batching into 2 API calls vs 36 individual calls gives a ~3x speed advantage
- Claude Code's general-purpose system prompt and lack of structured output hurt both accuracy and speed
- Node.js 18+
ANTHROPIC_API_KEYenvironment variabledafnyin PATH (only for--verify)- dafny-replay cloned as a sibling directory (required for tests and benchmarks — the claims
.dfyfilesincludedomain files from../../dafny-replay/)