Add direct API judge backend via --judge flag#87
Merged
olearycrew merged 1 commit intopinchbench:mainfrom Apr 6, 2026
Merged
Conversation
ba64cb9 to
ef38b4a
Compare
When --judge is specified with a model ID, the judge calls the model API directly instead of running an OpenClaw agent session. This avoids OpenClaw personality files (SOUL.md, IDENTITY.md) overriding the judge's JSON-only grading instructions, which caused all llm_judge tasks to score 0. Supported model prefixes: - openrouter/* -> OpenRouter API (OPENROUTER_API_KEY) - anthropic/* -> Anthropic Messages API (ANTHROPIC_API_KEY) - openai/* -> OpenAI chat completions (OPENAI_API_KEY) - claude -> headless Claude CLI (claude -p) Without --judge, behavior is unchanged (OpenClaw agent session). Also fixes pre-existing duplicate _remove_readonly function definition in lib_agent.py that caused an IndentationError. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ef38b4a to
19d3e7a
Compare
ScuttleBot
reviewed
Apr 6, 2026
ScuttleBot
left a comment
There was a problem hiding this comment.
ScuttleBot review 🦀
Solves a real problem. OpenClaw personality files (SOUL.md, IDENTITY.md) were corrupting judge outputs — the judge would write prose instead of JSON because it was roleplaying.
What's good:
- Direct API calls bypass OpenClaw entirely for judging
- Multi-backend support: OpenRouter, Anthropic, OpenAI, Claude CLI
- Behavior unchanged when
--judgeis unset (backward compatible) - README documents the new modes clearly
Concerns:
- Merge conflicts with #93 — Both touch
lib_agent.pyandlib_grading.pyheavily. Coordinate merge order. _JUDGE_SYSTEM_MSGis minimal. If judges still hallucinate prose, you may need to strengthen it ("Do not include explanations. Do not use markdown fences.")- Missing:
_judge_via_anthropic()implementation is cut off in the diff — confirm it's complete?
This unblocks llm_judge tasks from scoring 0. High priority merge.
This was referenced Apr 6, 2026
Member
|
@juppytt thanks so much for this contribution! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
--judgeis specified with a model ID (e.g.openai/gpt-4o,anthropic/claude-sonnet-4-5-20250514,claude), the judge calls the model API directly instead of running an OpenClaw agent session--judge, behavior is unchanged (OpenClaw agent session with default model)Supported prefixes:
openrouter/*,anthropic/*,openai/*,claudeTest plan
--judge— verify OpenClaw judge still works as default--judge openai/gpt-4o— verify direct API judge returns valid scores--judge anthropic/claude-sonnet-4-6— verify Anthropic backend--judge claude— verify headless CLI backend--judge🤖 Generated with Claude Code