28 changes: 26 additions & 2 deletions README.md
@@ -74,15 +74,39 @@ Skip uploading with `--no-upload` if you just want local results.

| Flag | Description |
|------|-------------|
| `--model MODEL` | Model to test (e.g., `anthropic/claude-sonnet-4`) |
| `--model MODEL` | **Required provider-qualified model** to test (e.g., `anthropic/claude-sonnet-4`, `openai-codex/gpt-5.3-codex`) |
| `--judge-model MODEL` | Provider-qualified model for LLM-judge tasks (default: `anthropic/claude-opus-4.5`) |
| `--suite SUITE` | `all`, `automated-only`, or comma-separated task IDs |
| `--thinking LEVEL` | Thinking level for benchmark turns (`off|minimal|low|medium|high|xhigh|adaptive`) |
| `--runs N` | Number of runs per task for averaging |
| `--timeout-multiplier N` | Scale timeouts for slower models |
| `--output-dir DIR` | Where to save results (default: `results/`) |
| `--output-dir DIR` | Where to save results/checkpoints (default: `results/`) |
| `--judge-only FILE` | Re-run grading only (no task execution) from an existing results/checkpoint JSON |
| `--clear-sessions` | Clear stored OpenClaw session transcripts before each turn (default is keep/persistent) |
| `--no-upload` | Skip uploading to leaderboard |
| `--register` | Request an API token for submissions |
| `--upload FILE` | Upload a previous results JSON |

## Deterministic Model Routing (No Silent Fallbacks)

PinchBench now enforces strict model determinism for benchmark integrity:

- You must provide provider-qualified model refs (no bare model IDs)
- Requested models must exist and be available in the local OpenClaw model catalog
- Runtime provider/model is verified on every task via transcript model snapshots
- Any provider/model mismatch fails the run immediately
- Judge execution is skipped when task execution fails

If a requested model is unavailable, PinchBench errors out early and, when possible, suggests an alternative ref (for example `openrouter/<provider>/<model>`).
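The early validation described above can be sketched roughly as follows. This is an illustrative sketch, not PinchBench's actual internals: `validate_model_ref` and the set-of-strings catalog shape are hypothetical names assumed for the example.

```python
def validate_model_ref(ref: str, catalog: set[str]) -> str:
    """Reject bare model IDs and unknown models before any task runs.

    `catalog` stands in for the local OpenClaw model catalog, modeled here
    as a set of provider-qualified refs like "anthropic/claude-sonnet-4".
    """
    if "/" not in ref:
        # Bare model IDs are rejected outright; no provider guessing.
        raise ValueError(
            f"model ref {ref!r} must be provider-qualified, e.g. 'anthropic/{ref}'"
        )
    if ref not in catalog:
        # Suggest an openrouter-prefixed ref when one exists in the catalog.
        provider, model = ref.split("/", 1)
        hint = f"openrouter/{provider}/{model}"
        suggestion = f" (did you mean {hint!r}?)" if hint in catalog else ""
        raise ValueError(f"model {ref!r} not found in local catalog{suggestion}")
    return ref
```

The key property is that validation fails before any task executes, so a typo never produces a partially completed run against a silently substituted model.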

## Persistence, Checkpoints, and Re-Judging

- Sessions are now **persistent by default** (no auto cleanup between tasks)
- A rolling checkpoint is written during benchmark runs: `results/<run>_<model>.checkpoint.json`
- Final output includes full transcripts per task for audit + regrading workflows
- Use `--judge-only <results-or-checkpoint.json> --judge-model <provider/model>` to re-run grading with a different judge model without re-executing tasks
- Use `--clear-sessions` if you explicitly want old session transcripts deleted before each turn
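A minimal sketch of the rolling-checkpoint write, assuming slashes in the model ref are sanitized for the filename; `write_checkpoint` and the JSON layout are illustrative assumptions, not PinchBench's real API.

```python
import json
from pathlib import Path

def write_checkpoint(output_dir: str, run_id: str, model_ref: str,
                     results: list[dict]) -> Path:
    """Write a rolling checkpoint named <run>_<model>.checkpoint.json.

    Assumption: '/' in the provider-qualified ref is replaced with '-'
    so the ref can serve as part of a filename.
    """
    safe_model = model_ref.replace("/", "-")
    path = Path(output_dir) / f"{run_id}_{safe_model}.checkpoint.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Rewritten in full after every task, so a crash loses at most one task.
    path.write_text(json.dumps(
        {"run": run_id, "model": model_ref, "results": results}, indent=2))
    return path
```

Because the checkpoint is a superset of the final results format in this sketch, the same file can later feed `--judge-only` without conversion.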

## Contributing Tasks

We welcome new tasks! Check out [`tasks/TASK_TEMPLATE.md`](tasks/TASK_TEMPLATE.md) for the format. Good tasks are:
26 changes: 24 additions & 2 deletions SKILL.md
@@ -68,15 +68,37 @@ uv run benchmark.py --model anthropic/claude-sonnet-4 --no-upload

| Option | Description |
|--------|-------------|
| `--model` | Model identifier (e.g., `anthropic/claude-sonnet-4`) |
| `--model` | **Required provider-qualified model** (e.g., `anthropic/claude-sonnet-4`, `openai-codex/gpt-5.3-codex`) |
| `--judge-model` | Provider-qualified judge model (default: `anthropic/claude-opus-4.5`) |
| `--suite` | `all`, `automated-only`, or comma-separated task IDs |
| `--output-dir` | Results directory (default: `results/`) |
| `--thinking` | Thinking level (`off|minimal|low|medium|high|xhigh|adaptive`) for benchmark turns |
| `--output-dir` | Results/checkpoint directory (default: `results/`) |
| `--timeout-multiplier` | Scale task timeouts for slower models |
| `--runs` | Number of runs per task for averaging |
| `--judge-only FILE` | Re-run grading only from existing results/checkpoint JSON (no task execution) |
| `--clear-sessions` | Clear agent/judge transcript sessions before each turn (default is persistent) |
| `--no-upload` | Skip uploading to leaderboard |
| `--register` | Request a new API token for submissions |
| `--upload FILE` | Upload previous results JSON |

## Determinism Guarantees

PinchBench enforces strict model routing for benchmark validity:

- No silent provider/model fallbacks
- Requested benchmark model must match runtime model on every task
- Requested judge model must match runtime judge model
- Model mismatches fail the run immediately
- If execution fails, PinchBench does not continue to judge that task
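The per-task runtime check could look like the sketch below; `verify_turn_model` and the snapshot keys (`provider`, `model`) are assumptions for illustration, standing in for whatever shape the transcript model snapshot actually has.

```python
def verify_turn_model(requested: str, transcript_snapshot: dict) -> None:
    """Fail fast when the runtime provider/model differs from the request.

    `transcript_snapshot` stands in for the per-turn model snapshot
    recorded in the session transcript, assumed here to carry
    'provider' and 'model' keys.
    """
    actual = f"{transcript_snapshot['provider']}/{transcript_snapshot['model']}"
    if actual != requested:
        # Any mismatch invalidates the whole run; no fallback, no retry.
        raise RuntimeError(
            f"model mismatch: requested {requested!r}, ran {actual!r}"
        )
```

Running this check on every task (for both the benchmark model and the judge model) is what turns "no silent fallbacks" from a policy into an enforced invariant.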

## Persistence & Re-Judging

- Session transcripts are preserved by default (no auto cleanup between tasks)
- Benchmarks write rolling checkpoint files during execution: `results/<run>_<model>.checkpoint.json`
- Final results include per-task transcripts to support replay/regrading workflows
- You can rerun grading with another judge model using `--judge-only <json> --judge-model <provider/model>`
- Use `--clear-sessions` when you explicitly want old transcript state removed before each turn
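A rough sketch of how a `--judge-only` pass could select tasks from an existing results/checkpoint JSON. The JSON keys (`results`, `transcript`, `execution_failed`) and the function name are assumed for illustration; the one grounded rule is that tasks whose execution failed are never judged.

```python
import json
from pathlib import Path

def tasks_to_regrade(results_path: str) -> list[dict]:
    """Select tasks from a results/checkpoint JSON that can be re-judged.

    Tasks without a transcript, or whose execution failed, are skipped,
    mirroring the rule that judging never runs on failed executions.
    """
    data = json.loads(Path(results_path).read_text())
    return [
        t for t in data.get("results", [])
        if t.get("transcript") and not t.get("execution_failed")
    ]
```

Since the checkpoint and final results carry full per-task transcripts, a regrade with a different `--judge-model` needs nothing beyond this file.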

## Token Registration

To submit results to the leaderboard: