Add OpenAI benchmark sweep and analysis tooling by Muhtasham · Pull Request #99 · CodeClash-ai/CodeClash

Muhtasham · 2026-04-26T00:55:06Z

Summary

- add OpenAI benchmark runner scripts, progress watching, and eval pipeline helpers for sweep workflows
- add generated configs, reporting scripts, and feedback docs for GPT-5.4 / GPT-5.3-Codex analysis
- preserve model aliases in Elo and win-rate analysis, and fix RoboCode tie handling

Changes

- sweep tooling: add run_openai_sweep.sh, run_openai_model_benchmarks.sh, run helpers, watcher updates, and eval/report scripts
- analysis: update Elo/win-rate/heatmap display logic to keep per-alias model names and generate shareable comparison plots
- arenas: fix RoboCode round winner selection on tied scores and add a regression test
- configs/docs: add generated OpenAI matchup configs, ablation scaffold files, and OpenAI feedback notes
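
To illustrate what "keep per-alias model names" means for the analysis change, here is a minimal sketch, not the code in this PR. It assumes each match record carries the alias of both players and a winner field that is either an alias or the string "Tie". The function name `win_rates_by_alias`, the record shape, the example aliases, and the choice to count a tie as half a win are all hypothetical.

```python
from collections import defaultdict


def win_rates_by_alias(matches):
    """Compute a win rate per alias without collapsing aliases to a base model.

    matches: iterable of (alias_a, alias_b, winner) tuples, where winner is
    one of the two aliases or the string "Tie" (assumed record shape).
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for alias_a, alias_b, winner in matches:
        games[alias_a] += 1
        games[alias_b] += 1
        if winner == "Tie":
            # Assumption: a tie counts as half a win for each side.
            wins[alias_a] += 0.5
            wins[alias_b] += 0.5
        else:
            wins[winner] += 1.0
    # Keyed by alias, so two aliases of the same model stay separate rows.
    return {alias: wins[alias] / games[alias] for alias in games}


# Hypothetical aliases for two configurations of the same underlying model.
example = [
    ("model-x@config-a", "model-x@config-b", "model-x@config-a"),
    ("model-x@config-a", "model-x@config-b", "Tie"),
]
rates = win_rates_by_alias(example)
```

Keying every table by alias rather than by a normalized model name is what lets two configurations of one model appear as distinct entries in a leaderboard or heatmap.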

Test plan

- reran the eval pipeline on completed OpenAI sweep logs and generated leaderboard/plot artifacts
- verified RoboCode tie handling with a direct uv run python harness, which returns Tie for 0-0 scores
- pending: run the full automated pytest suite once pytest is installed in the project environment
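
The tie check in the test plan can be sketched roughly as follows. This is an illustration only: the real RoboCode arena code is not shown in this description, so the function name `round_winner` and the dict-of-scores input are hypothetical. The "Tie" return value is taken from the test plan above.

```python
def round_winner(scores):
    """Return the name of the unique top scorer, or "Tie" if the top is shared.

    scores: dict mapping player name -> numeric score (assumed shape).
    """
    if not scores:
        return "Tie"
    best = max(scores.values())
    leaders = [name for name, score in scores.items() if score == best]
    # Only a single leader wins; any shared top score is reported as a tie
    # instead of picking whichever player happens to come first.
    return leaders[0] if len(leaders) == 1 else "Tie"


# Mirrors the manual harness check: a 0-0 round must not produce a winner.
assert round_winner({"player_1": 0, "player_2": 0}) == "Tie"
assert round_winner({"player_1": 3, "player_2": 1}) == "player_1"
```

A regression test for this would pin the 0-0 case so that a later refactor cannot silently go back to picking the first player on equal scores.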

- add benchmark runners, watcher, and eval pipeline scripts for OpenAI sweeps
- add generated configs, reporting utilities, and OpenAI feedback notes
- preserve model aliases in analysis and fix RoboCode tie handling

Add OpenAI sweep tooling and analysis helpers

41e80a5


Muhtasham closed this Apr 26, 2026

Muhtasham deleted the codex/feat-openai-sweep-tooling branch April 26, 2026 07:03

Muhtasham wants to merge 1 commit into CodeClash-ai:main from Muhtasham:codex/feat-openai-sweep-tooling
