Feature hasn't been suggested before.
Describe the enhancement you want to request
Problem
Benchmarks for LLM providers and models are biased: existing evaluations (HumanEval, provider-reported metrics) don't reflect real-world coding performance. I want to build a blind-testing plugin that:
- Randomly assigns models from a pool at session start
- Hides model identity during use
- Prompts users to rate after the session
- Reveals the model only after rating
- Collects unbiased benchmark data
This enables organizations and open-source contributors to compare LLMs for coding without brand bias.
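As a sketch of the rate-then-reveal flow described above (all names here are illustrative, not actual OpenCode APIs): the real model identity is attached to a rating only after the score is recorded, so the score cannot be influenced by brand.

```typescript
// Hypothetical sketch of the rate-then-reveal flow; names are illustrative.
type Rating = { sessionID: string; label: string; score: number; revealedModel?: string }

// Blind assignments made at session start (session ID -> real model).
const assignments = new Map<string, string>([["session-1", "provider-x/model-y"]])

// The model identity is filled in only after the score is set.
function rateAndReveal(sessionID: string, score: number): Rating {
  const rating: Rating = { sessionID, label: "Model A", score }
  rating.revealedModel = assignments.get(sessionID) // reveal happens last
  return rating
}

const rating = rateAndReveal("session-1", 4)
```

The key property is ordering: the user sees only the blind label ("Model A") until the score is committed.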
Proposal
Add a `chat.model` plugin hook to override model selection:

```typescript
"chat.model"?: (
  input: { sessionID; agent; model; provider; message },
  output: {
    model?: { providerID: string; modelID: string }
    displayModel?: string
  }
) => Promise<void>
```
- `model`: override the actual model used
- `displayModel`: custom name shown in the UI (e.g., "Model A")

This follows existing patterns like `chat.params` and `chat.headers`.
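A blind-test plugin built on this hook might look like the following sketch. Only the hook signature comes from the proposal; the surrounding plugin object shape and the model pool are assumptions for illustration.

```typescript
// Sketch of a blind-test plugin using the proposed "chat.model" hook.
// The plugin object shape and pool contents are assumptions.
type Model = { providerID: string; modelID: string }

const pool: Model[] = [
  { providerID: "provider-a", modelID: "model-1" },
  { providerID: "provider-b", modelID: "model-2" },
]

// One random assignment per session, stable across messages in that session.
const assignments = new Map<string, Model>()

const blindTestPlugin = {
  "chat.model": async (
    input: { sessionID: string },
    output: { model?: Model; displayModel?: string },
  ): Promise<void> => {
    let assigned = assignments.get(input.sessionID)
    if (!assigned) {
      assigned = pool[Math.floor(Math.random() * pool.length)]
      assignments.set(input.sessionID, assigned)
    }
    output.model = assigned
    output.displayModel = "Model A" // identity stays hidden until after rating
  },
}
```

Keeping the assignment in a per-session map ensures every message in a session goes to the same model, which a blind comparison requires.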
Why OpenCode Benefits
- Zen's mission - OpenCode Zen aims to benchmark the best models. Real-world blind test data would improve recommendations.
- Public benchmark dashboard - OpenCode could host a leaderboard at opencode.ai/benchmark showing unbiased model rankings from community contributions. Differentiates from competitors.
- Thought leadership - "Most comprehensive real-world LLM benchmark for coding" drives blog posts, press, and adoption.
- Generic utility - Beyond benchmarking, the hook enables provider failover, A/B testing, and enterprise model governance.
Implementation
Small scope:
- `packages/plugin/src/index.ts`: add the hook type
- `packages/opencode/src/session/llm.ts`: trigger the hook before the LLM call
- `packages/opencode/src/session/message-v2.ts`: add a `displayModel` field
- TUI components: show `displayModel` when present
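The trigger in `session/llm.ts` could be as small as the following sketch. The `Plugin` and function shapes here are assumptions about the internals, not OpenCode's real types; the intent is just to show the hook running before model resolution, with a fallback to the session default.

```typescript
// Assumed sketch of triggering the hook before the LLM call.
type Model = { providerID: string; modelID: string }
type HookInput = { sessionID: string }
type HookOutput = { model?: Model; displayModel?: string }
type Plugin = { "chat.model"?: (input: HookInput, output: HookOutput) => Promise<void> }

async function resolveModel(
  sessionID: string,
  defaultModel: Model,
  plugins: Plugin[],
): Promise<{ model: Model; displayModel?: string }> {
  const output: HookOutput = {}
  // Later plugins may overwrite earlier ones, mirroring chat.params behavior.
  for (const plugin of plugins) {
    await plugin["chat.model"]?.({ sessionID }, output)
  }
  return { model: output.model ?? defaultModel, displayModel: output.displayModel }
}
```

If no plugin sets `output.model`, behavior is unchanged, so the hook is backward compatible.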
Discussion
- Is the hook signature appropriate?
- Is there interest in OpenCode hosting a public benchmark dashboard?
- I'm willing to implement the core changes and build the blind test plugin.