Open benchmark harness for comparing current major AI models on general work: reasoning, coding, structured output, long-context extraction, tool-use readiness, and lightweight data analysis.
This repository is a reproducibility-first benchmark starter. It publishes the model roster, task format, scoring code, and raw result schema before any leaderboard claims.
Roster snapshot: 2026-05-05.
The benchmark roster is in model_roster.json. It currently tracks OpenAI, Anthropic, Google, xAI, DeepSeek, Mistral, Cohere, Alibaba Qwen, Moonshot Kimi, MiniMax, Z.AI GLM, and Meta Llama.
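For illustration, a single roster entry could look like the sketch below. The field names are assumptions made for this example, not the actual schema of model_roster.json; the provider and model ID are reused from the response example shown later in this README.

```json
{"provider": "openai", "model_id": "gpt-5.5", "id_type": "stable-snapshot", "roster_snapshot": "2026-05-05"}
```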
Provider model names move quickly. Every run should record:
- exact model ID used by the provider
- whether the ID is a stable snapshot, alias, preview, or partner-hosted ID
- temperature, max output, reasoning mode, tool access, and timestamp
- raw prompt and raw output
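A minimal run record covering these fields might look like the following sketch. The key names and placeholder values are illustrative assumptions, not the harness's actual record schema.

```json
{
  "model_id": "gpt-5.5",
  "id_type": "stable-snapshot",
  "temperature": 0.0,
  "max_output_tokens": 1024,
  "reasoning_mode": "default",
  "tool_access": "none",
  "timestamp": "2026-05-05T12:00:00Z",
  "prompt": "<raw prompt text>",
  "output": "<raw model output>"
}
```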
See docs/research-notes.md for the source notes behind the initial roster.
The public seed suite in tasks/general_seed.jsonl is intentionally small. It validates the harness and scoring path. Larger benchmark suites should be versioned under tasks/vYYYY-MM-DD/ and should include public seed tasks plus private holdout tasks.
Task families:
- reasoning: compact logic and quantitative reasoning
- coding: bug diagnosis and patch generation
- structured-output: JSON extraction and schema discipline
- long-context: retrieval of the exact relevant clause from distractors
- data-analysis: table arithmetic and probabilistic scoring
- calibration: probability math and uncertainty discipline
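For orientation, a seed-task line might resemble the following. Apart from task_id, which mirrors the response format shown below, every field name and value here is a hypothetical placeholder rather than the schema actually used in tasks/general_seed.jsonl.

```json
{"task_id": "reasoning.logic.001", "family": "reasoning", "prompt": "If all A are B and no B are C, can an A be a C?", "expected": "no"}
```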
Validate roster/tasks and emit placeholder rows:

```bash
python src/sf_benchmark/runner.py \
  --models model_roster.json \
  --tasks tasks/general_seed.jsonl \
  --out results/dry-run.json
```

Score a JSONL file of model responses:

```bash
python src/sf_benchmark/runner.py \
  --models model_roster.json \
  --tasks tasks/general_seed.jsonl \
  --responses examples/responses.seed.jsonl \
  --out results/scored-seed.json
```

Run tests:

```bash
python -m unittest discover -s tests
```

--responses expects JSONL:

```json
{"model":"openai/gpt-5.5","task_id":"data.table.001","output":"0.715"}
```

Results follow schemas/result.schema.json.
- Use the same prompt and same tool policy across models in the same run.
- Prefer stable snapshot IDs; if using aliases, record the alias resolution date.
- Do not rank models until raw outputs, scoring code, and run metadata are public.
- Report accuracy next to latency, cost, refusal/error rate, and output length (see the reporting sketch after this list).
- Keep public seed tasks separate from private holdout tasks to reduce benchmark overfitting.
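The sketch below shows one way to keep those metrics side by side in a per-model report. It is not part of the harness: the field names (score, latency_ms, cost_usd, refused, output_chars) and the assumption that the scored output file contains a list of result rows are placeholders; the real fields are defined by schemas/result.schema.json.

```python
"""Sketch of a per-model report keeping accuracy next to latency,
cost, refusal/error rate, and output length.

All field names below are assumptions for illustration; consult
schemas/result.schema.json for the actual result schema.
"""
import json
from collections import defaultdict
from statistics import mean


def summarize(result_rows):
    """Group result rows by model and aggregate the headline metrics."""
    by_model = defaultdict(list)
    for row in result_rows:
        by_model[row["model"]].append(row)
    report = {}
    for model, rows in by_model.items():
        report[model] = {
            "accuracy": mean(r["score"] for r in rows),
            "mean_latency_ms": mean(r["latency_ms"] for r in rows),
            "total_cost_usd": sum(r["cost_usd"] for r in rows),
            "refusal_rate": mean(1.0 if r["refused"] else 0.0 for r in rows),
            "mean_output_chars": mean(r["output_chars"] for r in rows),
        }
    return report


if __name__ == "__main__":
    # Assumes the scored output file holds a JSON list of result rows.
    with open("results/scored-seed.json", encoding="utf-8") as fh:
        print(json.dumps(summarize(json.load(fh)), indent=2))
```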
The methodology is in docs/methodology.md.
The companion domain benchmark is spfunctions/prediction-market-model-benchmark.
MIT.