28 changes: 26 additions & 2 deletions README.md
@@ -74,15 +74,39 @@ Skip uploading with `--no-upload` if you just want local results.

| Flag | Description |
|------|-------------|
| `--model MODEL` | Model to test (e.g., `anthropic/claude-sonnet-4`) |
| `--model MODEL` | **Required provider-qualified model** to test (e.g., `anthropic/claude-sonnet-4`, `openai-codex/gpt-5.3-codex`) |
| `--judge-model MODEL` | Provider-qualified model for LLM-judge tasks (default: `anthropic/claude-opus-4.5`) |
| `--suite SUITE` | `all`, `automated-only`, or comma-separated task IDs |
| `--thinking LEVEL` | Thinking level for benchmark turns (`off|minimal|low|medium|high|xhigh|adaptive`) |
| `--runs N` | Number of runs per task for averaging |
| `--timeout-multiplier N` | Scale timeouts for slower models |
| `--output-dir DIR` | Where to save results (default: `results/`) |
| `--output-dir DIR` | Where to save results/checkpoints (default: `results/`) |
| `--judge-only FILE` | Re-run grading only (no task execution) from an existing results/checkpoint JSON |
| `--clear-sessions` | Clear stored OpenClaw session transcripts before each turn (default is keep/persistent) |
| `--no-upload` | Skip uploading to leaderboard |
| `--register` | Request an API token for submissions |
| `--upload FILE` | Upload a previous results JSON |

## Deterministic Model Routing (No Silent Fallbacks)

PinchBench now enforces strict model determinism for benchmark integrity:

- You must provide provider-qualified model refs (no bare model IDs)
- Requested models must exist and be available in the local OpenClaw model catalog
- Runtime provider/model is verified on every task via transcript model snapshots
- Any provider/model mismatch fails the run immediately
- Judge execution is skipped when task execution fails

If a requested model is unavailable, PinchBench errors out early and, when possible, suggests an alternative ref (for example `openrouter/<provider>/<model>`).
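The early validation described above can be sketched roughly as follows. This is an illustrative sketch, not PinchBench's actual internals: `validate_model_ref` and the set-of-strings catalog shape are hypothetical names assumed for the example.

```python
def validate_model_ref(ref: str, catalog: set[str]) -> str:
    """Reject bare model IDs and unknown models before any task runs.

    `catalog` stands in for the local OpenClaw model catalog, modeled here
    as a set of provider-qualified refs like "anthropic/claude-sonnet-4".
    """
    if "/" not in ref:
        # Bare model IDs are rejected outright; no provider guessing.
        raise ValueError(
            f"model ref {ref!r} must be provider-qualified, e.g. 'anthropic/{ref}'"
        )
    if ref not in catalog:
        # Suggest an openrouter-prefixed ref when one exists in the catalog.
        provider, model = ref.split("/", 1)
        hint = f"openrouter/{provider}/{model}"
        suggestion = f" (did you mean {hint!r}?)" if hint in catalog else ""
        raise ValueError(f"model {ref!r} not found in local catalog{suggestion}")
    return ref
```

The key property is that validation fails before any task executes, so a typo never produces a partially completed run against a silently substituted model.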

## Persistence, Checkpoints, and Re-Judging

- Sessions are now **persistent by default** (no auto cleanup between tasks)
- A rolling checkpoint is written during benchmark runs: `results/<run>_<model>.checkpoint.json`
- Final output includes full transcripts per task for audit + regrading workflows
- Use `--judge-only <results-or-checkpoint.json> --judge-model <provider/model>` to re-run grading with a different judge model without re-executing tasks
- Use `--clear-sessions` if you explicitly want old session transcripts deleted before each turn
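A minimal sketch of the rolling-checkpoint write, assuming slashes in the model ref are sanitized for the filename; `write_checkpoint` and the JSON layout are illustrative assumptions, not PinchBench's real API.

```python
import json
from pathlib import Path

def write_checkpoint(output_dir: str, run_id: str, model_ref: str,
                     results: list[dict]) -> Path:
    """Write a rolling checkpoint named <run>_<model>.checkpoint.json.

    Assumption: '/' in the provider-qualified ref is replaced with '-'
    so the ref can serve as part of a filename.
    """
    safe_model = model_ref.replace("/", "-")
    path = Path(output_dir) / f"{run_id}_{safe_model}.checkpoint.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Rewritten in full after every task, so a crash loses at most one task.
    path.write_text(json.dumps(
        {"run": run_id, "model": model_ref, "results": results}, indent=2))
    return path
```

Because the checkpoint is a superset of the final results format in this sketch, the same file can later feed `--judge-only` without conversion.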

## Contributing Tasks

We welcome new tasks! Check out [`tasks/TASK_TEMPLATE.md`](tasks/TASK_TEMPLATE.md) for the format. Good tasks are:
26 changes: 24 additions & 2 deletions SKILL.md
@@ -68,15 +68,37 @@ uv run benchmark.py --model anthropic/claude-sonnet-4 --no-upload

| Option | Description |
|--------|-------------|
| `--model` | Model identifier (e.g., `anthropic/claude-sonnet-4`) |
| `--model` | **Required provider-qualified model** (e.g., `anthropic/claude-sonnet-4`, `openai-codex/gpt-5.3-codex`) |
| `--judge-model` | Provider-qualified judge model (default: `anthropic/claude-opus-4.5`) |
| `--suite` | `all`, `automated-only`, or comma-separated task IDs |
| `--output-dir` | Results directory (default: `results/`) |
| `--thinking` | Thinking level (`off|minimal|low|medium|high|xhigh|adaptive`) for benchmark turns |
| `--output-dir` | Results/checkpoint directory (default: `results/`) |
| `--timeout-multiplier` | Scale task timeouts for slower models |
| `--runs` | Number of runs per task for averaging |
| `--judge-only FILE` | Re-run grading only from existing results/checkpoint JSON (no task execution) |
| `--clear-sessions` | Clear agent/judge transcript sessions before each turn (default is persistent) |
| `--no-upload` | Skip uploading to leaderboard |
| `--register` | Request a new API token for submissions |
| `--upload FILE` | Upload previous results JSON |

## Determinism Guarantees

PinchBench enforces strict model routing for benchmark validity:

- No silent provider/model fallbacks
- Requested benchmark model must match runtime model on every task
- Requested judge model must match runtime judge model
- Model mismatches fail the run immediately
- If execution fails, PinchBench does not continue to judge that task
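The per-task runtime check could look like the sketch below; `verify_turn_model` and the snapshot keys (`provider`, `model`) are assumptions for illustration, standing in for whatever shape the transcript model snapshot actually has.

```python
def verify_turn_model(requested: str, transcript_snapshot: dict) -> None:
    """Fail fast when the runtime provider/model differs from the request.

    `transcript_snapshot` stands in for the per-turn model snapshot
    recorded in the session transcript, assumed here to carry
    'provider' and 'model' keys.
    """
    actual = f"{transcript_snapshot['provider']}/{transcript_snapshot['model']}"
    if actual != requested:
        # Any mismatch invalidates the whole run; no fallback, no retry.
        raise RuntimeError(
            f"model mismatch: requested {requested!r}, ran {actual!r}"
        )
```

Running this check on every task (for both the benchmark model and the judge model) is what turns "no silent fallbacks" from a policy into an enforced invariant.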

## Persistence & Re-Judging

- Session transcripts are preserved by default (no auto cleanup between tasks)
- Benchmarks write rolling checkpoint files during execution: `results/<run>_<model>.checkpoint.json`
- Final results include per-task transcripts to support replay/regrading workflows
- You can rerun grading with another judge model using `--judge-only <json> --judge-model <provider/model>`
- Use `--clear-sessions` when you explicitly want old transcript state removed before each turn
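A rough sketch of how a `--judge-only` pass could select tasks from an existing results/checkpoint JSON. The JSON keys (`results`, `transcript`, `execution_failed`) and the function name are assumed for illustration; the one grounded rule is that tasks whose execution failed are never judged.

```python
import json
from pathlib import Path

def tasks_to_regrade(results_path: str) -> list[dict]:
    """Select tasks from a results/checkpoint JSON that can be re-judged.

    Tasks without a transcript, or whose execution failed, are skipped,
    mirroring the rule that judging never runs on failed executions.
    """
    data = json.loads(Path(results_path).read_text())
    return [
        t for t in data.get("results", [])
        if t.get("transcript") and not t.get("execution_failed")
    ]
```

Since the checkpoint and final results carry full per-task transcripts, a regrade with a different `--judge-model` needs nothing beyond this file.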

## Token Registration

To submit results to the leaderboard: