Add direct API judge backend via --judge flag by juppytt · Pull Request #87 · pinchbench/skill

juppytt · 2026-04-01T11:10:25Z

Summary

When --judge is specified with a model ID (e.g. openai/gpt-4o, anthropic/claude-sonnet-4-5-20250514, claude), the judge calls the model API directly instead of running an OpenClaw agent session
Fixes llm_judge tasks scoring 0 due to OpenClaw personality files (SOUL.md, IDENTITY.md) overriding the judge's JSON-only instructions
Without --judge, behavior is unchanged (OpenClaw agent session with default model)

Supported prefixes: openrouter/*, anthropic/*, openai/*, claude
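The prefix routing described above could be sketched roughly as follows. This is a minimal illustration, not the PR's code: `resolve_judge_backend`, the backend labels, and the choice to strip the prefix from the model name are all assumptions, since the diff itself is not shown in this thread.

```python
from typing import Tuple

# Hypothetical table: prefixes come from the PR summary, backend labels are made up here.
_PREFIX_BACKENDS = {
    "openrouter": "openrouter",
    "anthropic": "anthropic",
    "openai": "openai",
}


def resolve_judge_backend(model_id: str) -> Tuple[str, str]:
    """Map a --judge value to (backend, model_name).

    The bare value "claude" selects the headless CLI backend; every other
    supported value must look like "<prefix>/<model>".
    """
    if model_id == "claude":
        return ("claude_cli", "")
    prefix, sep, rest = model_id.partition("/")
    if sep and rest and prefix in _PREFIX_BACKENDS:
        # Assumption: the remainder after the first "/" is the model name.
        return (_PREFIX_BACKENDS[prefix], rest)
    raise ValueError(f"Unsupported judge model: {model_id!r}")
```

Splitting only on the first slash keeps nested IDs such as `openrouter/<vendor>/<model>` intact as the model name, and rejecting unknown prefixes early avoids silently falling back to a different judge.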

Test plan

Run benchmark without --judge — verify OpenClaw judge still works as default
Run with --judge openai/gpt-4o — verify direct API judge returns valid scores
Run with --judge anthropic/claude-sonnet-4-6 — verify Anthropic backend
Run with --judge claude — verify headless CLI backend
Verify llm_judge tasks no longer score 0 when using --judge
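The first two items of the test plan depend on `--judge` being optional. A rough sketch of how such a flag might be wired with `argparse` follows; `build_parser` and `uses_direct_judge` are hypothetical names, and the real benchmark entry point is not shown in this thread.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    # Illustrative parser only; the real CLI has other options not shown here.
    parser = argparse.ArgumentParser(description="benchmark runner (illustrative)")
    parser.add_argument(
        "--judge",
        default=None,
        metavar="MODEL_ID",
        help="model ID for the direct judge backend; omit to keep the OpenClaw judge",
    )
    return parser


def uses_direct_judge(argv: List[str]) -> bool:
    """Return True when --judge was given, i.e. the direct backend is requested."""
    args = build_parser().parse_args(argv)
    return args.judge is not None
```

With a default of `None`, an invocation that omits the flag takes the existing OpenClaw path, which is the backward-compatible behavior the plan sets out to verify.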

🤖 Generated with Claude Code

When --judge is specified with a model ID, the judge calls the model API directly instead of running an OpenClaw agent session. This avoids OpenClaw personality files (SOUL.md, IDENTITY.md) overriding the judge's JSON-only grading instructions, which caused all llm_judge tasks to score 0.

Supported model prefixes:
- openrouter/* -> OpenRouter API (OPENROUTER_API_KEY)
- anthropic/* -> Anthropic Messages API (ANTHROPIC_API_KEY)
- openai/* -> OpenAI chat completions (OPENAI_API_KEY)
- claude -> headless Claude CLI (claude -p)

Without --judge, behavior is unchanged (OpenClaw agent session).

Also fixes pre-existing duplicate _remove_readonly function definition in lib_agent.py that caused an IndentationError.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
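Since each HTTP backend in the commit message reads its key from a specific environment variable, a pre-flight check along these lines could report a missing key before any task runs. This is a sketch under assumptions: `missing_api_key` is a hypothetical helper, and whether the PR performs such a check is not stated. The `claude` CLI backend is treated as needing no key here, which is also an assumption.

```python
import os
from typing import Mapping, Optional

# Environment variable names are taken from the commit message above.
_API_KEY_ENV = {
    "openrouter": "OPENROUTER_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def missing_api_key(
    model_id: str, env: Optional[Mapping[str, str]] = None
) -> Optional[str]:
    """Return the name of the required variable if it is unset or empty, else None."""
    env = os.environ if env is None else env
    prefix = model_id.split("/", 1)[0]
    var = _API_KEY_ENV.get(prefix)
    if var is None:
        # "claude" (CLI) or an unknown prefix: this sketch checks no key.
        return None
    return None if env.get(var) else var
```

Passing the environment in as a mapping keeps the helper easy to test without touching the real process environment.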

ScuttleBot

ScuttleBot review 🦀

Solves a real problem. OpenClaw personality files (SOUL.md, IDENTITY.md) were corrupting judge outputs — the judge would write prose instead of JSON because it was roleplaying.

What's good:

Direct API calls bypass OpenClaw entirely for judging
Multi-backend support: OpenRouter, Anthropic, OpenAI, Claude CLI
Behavior unchanged when --judge is unset (backward compatible)
README documents the new modes clearly

Concerns:

Merge conflicts with #93 — Both touch lib_agent.py and lib_grading.py heavily. Coordinate merge order.
_JUDGE_SYSTEM_MSG is minimal. If judges still hallucinate prose, you may need to strengthen it ("Do not include explanations. Do not use markdown fences.")
Possibly incomplete: the _judge_via_anthropic() implementation is cut off in the diff. Please confirm it is complete.
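The second concern could be addressed from both sides: a stricter system message, plus a parser that tolerates a fenced reply anyway. The sketch below is illustrative only. The actual text of _JUDGE_SYSTEM_MSG is not shown in this thread, so `STRICT_JUDGE_SYSTEM_MSG` is hypothetical wording built around the reviewer's suggested sentences, and `parse_judge_reply` is a hypothetical helper.

```python
import json
import re
from typing import Any, Dict, Optional

# Hypothetical wording; only the last two sentences come from the review comment.
STRICT_JUDGE_SYSTEM_MSG = (
    "You are a grader. Respond with a single JSON object only. "
    "Do not include explanations. Do not use markdown fences."
)

_FENCE = "`" * 3  # built programmatically to avoid a literal fence in this example
_FENCE_RE = re.compile(
    r"^" + _FENCE + r"[a-zA-Z]*\s*\n(.*?)\n?" + _FENCE + r"\s*$", re.DOTALL
)


def parse_judge_reply(text: str) -> Optional[Dict[str, Any]]:
    """Parse a judge reply as a JSON object, unwrapping one markdown fence if present.

    Returns None when the reply is prose or is not a JSON object.
    """
    cleaned = text.strip()
    match = _FENCE_RE.match(cleaned)
    if match:
        cleaned = match.group(1).strip()
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

Returning `None` for prose lets the caller distinguish "the judge did not answer in JSON" from a genuine low score, rather than recording both as 0.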

This unblocks llm_judge tasks from scoring 0. High priority merge.

olearycrew · 2026-04-06T18:09:59Z

@juppytt thanks so much for this contribution!

juppytt force-pushed the fix/judge-api-backend branch from ba64cb9 to ef38b4a on April 1, 2026, 11:19

juppytt force-pushed the fix/judge-api-backend branch from ef38b4a to 19d3e7a on April 1, 2026, 11:25

ScuttleBot reviewed Apr 6, 2026


This was referenced Apr 6, 2026:

Clean up some recent changes #83 (Closed)

Write incremental results after each task completion #93 (Merged)

olearycrew merged commit 7fa42f4 into pinchbench:main Apr 6, 2026

ScuttleBot mentioned this pull request Apr 6, 2026:

Fix grader compatibility with OpenClaw transcripts #86 (Open)

olearycrew merged 1 commit into pinchbench:main from juppytt:fix/judge-api-backend
