Skip to content

Conversation

@geelen
Copy link
Contributor

@geelen geelen commented Dec 17, 2025

Have been chipping away at this new eval for a few weeks and starting to gear up to get it properly reviewed. I'd done a lot of development directly here in OpenBench but the most recent work I've done is moving as much as possible to a new repo: https://github.com/geelen/progressivemcpbench

Note: this PR includes #333 to test with Gemini 3.

This PR has two parts:

The first is the main ProgressiveMCPEval, which has 6 strategies for tool discovery to compare:

Strategy Description
copilot Semantic search via route() + execute-tool()
directory Filesystem-like exploration via ls() + read-tool-file()
minimal-servers Direct access to required servers only
minimal-tools Direct access to exact tools needed (upper bound)
distraction-64 Required tools + distractors (64 total)
distraction-128 Required tools + distractors (128 total)

These probably aren't the final set we will go with, but it's been working well enough for me to flesh things out. All of them require being able to hit a large number of reliable, realistic-looking MCP servers, which has been most of the work. That's moved on to the other project, and hosted here: https://progressive-mcp-bench.groq-dev.workers.dev/servers

I'll be tidying up this draft PR and submitting it for review once I'm happy with the eval's design.

Groq-Responses provider and server-side MCP support

There's actually a seventh strategy called minimal-servers-remote, which instead of running an agentic loop locally, passing local tools of type function up with each inference call, and dispatching tool calls to MCP servers locally, it instead sends a payload including the MCP servers themselves (as type=mcp), and expects the server to respond in a single call. This has been useful for testing changes to Groq's internal services as well as benchmarking them against a reference implementation, and it's the reason the MCP mock layer is deployed publicly at all.

However, it required use of the responses endpoint, and since InspectAI appeared to have no support for remote MCP, when I asked Amp to implement this it bypassed a lot of the inspect AI code and just used the OpenAI SDK directly with a provider called groq-responses. So that needs to be fixed.

Maybe #335 is a good starting point for adding server-side MCP in a generic way?

In any case, I'll back this server-side stuff out before submitting this PR, but thought I'd mention it.

@github-actions
Copy link
Contributor

✅ Benchmark documentation has been automatically updated.

@socket-security
Copy link

socket-security bot commented Dec 17, 2025

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff Package Supply Chain
Security
Vulnerability Quality Maintenance License
Updatedinspect-ai@​0.3.141 ⏵ 0.3.15174 -26100100100100
Updatedmcp@​1.13.1 ⏵ 1.22.09985100100100
Updatedopenai@​2.8.1 ⏵ 2.8.096100100100100
Addedpycparser@​2.2397100100100100
Updatedanthropic@​0.74.1 ⏵ 0.73.097100100100100
Addedfrozendict@​2.4.710010010010070
Addednest-asyncio2@​1.7.1100100100100100
Updatedinspect-swe@​0.2.26 ⏵ 0.2.27100 +1100100100100
Addedpyjwt@​2.10.1100100100100100

View full report

A new benchmark for evaluating LLM tool-calling capabilities using synthetic MCP servers.

Features:
- Synthetic MCP server with 15 domains and 50+ tools
- Multiple evaluation strategies: copilot, single-shot, distraction modes
- LLM-as-judge scoring for semantic answer matching
- Support for both local and remote MCP execution
- Groq Responses API provider for server-side MCP

Strategies:
- copilot: Multi-turn with local MCP server (default)
- minimal-tools: Single-shot, only required tools exposed
- distraction-16/64: Includes distractor tools
- minimal-servers-remote: Groq server-side MCP execution

New files:
- src/openbench/evals/progressivemcpbench.py
- src/openbench/datasets/progressivemcpbench.py
- src/openbench/scorers/progressivemcpbench.py
- src/openbench/model/_providers/groq_responses.py
- src/openbench/tools/progressivemcpbench/
- synthetic_mcp/ (data generation pipeline)
- docs/evals/progressivemcpbench.mdx

Amp-Thread-ID: https://ampcode.com/threads/T-019b2b8f-be9a-7070-b38b-d5ff16e8b2eb
Co-authored-by: Amp <amp@ampcode.com>
@github-actions
Copy link
Contributor

✅ Benchmark documentation has been automatically updated.

geelen and others added 14 commits December 18, 2025 22:20
Add strict=true to function calling in the Groq provider to fix
malformed JSON generation from models like gpt-oss-120b.

- Add _make_schema_strict() to transform schemas for strict mode
- Set additionalProperties: false on object schemas
- Add all properties to required array
- Enable strict: true flag on all tool definitions

This fixes 100% failure rate on progressivemcpbench directory strategy
with gpt-oss-120b, reducing from 191 messages/sample to ~8k tokens in 3s.
…te with tool_discovery parameter

- Refactor minimal-servers-remote to use remote_mcp handlers
- Support Groq (Responses API) and Anthropic (MCP connector)
- Add tool_discovery parameter: directory (Groq), regex/bm25 (Anthropic)
- Task name includes tool_discovery when set
…ame prefix

- Change supports_provider(model_name) to supports_api(api)
- Add provider_name() method for display
- Pass model.api to registry instead of model.name
- Add @solver decorator to parameterized solver function
- Use mcp_servers as top-level API parameter (not inside tools)
- Use type: 'mcp_toolset' with mcp_server_name in tools array
- Add required 'name' field for tool_search_tool types

Fixes 0% accuracy issue where API was rejecting invalid request format.

Amp-Thread-ID: https://ampcode.com/threads/T-019b446c-2451-76ca-a3e5-e35d23f16b30
Co-authored-by: Amp <amp@ampcode.com>
…ndler

Allow minimal-servers-remote to work with standard groq/ models,
not just groq-responses/. The handler uses the Responses API directly
regardless of which Groq provider was specified.
The minimal-servers-remote strategy now uses the Responses API directly
with the standard groq/ provider, so groq-responses is no longer needed.

- Remove groq-responses from _registry.py
- Remove GROQ_RESPONSES from provider_config.py
- Delete model/_providers/groq_responses.py
- Update GroqRemoteMCPHandler to only check for GroqAPI
Uses Groq-Beta: mcp-deferred-directory-lazy header instead of
mcp-deferred-directory for lazy loading behavior.
…tion

- Add server list directly to meta__ls tool description to avoid round trip
- Disallow listing /tools directly (returns error with guidance)
- Update system message to reflect new behavior
- Add ellipsis to truncated server descriptions

Amp-Thread-ID: https://ampcode.com/threads/T-019b48c5-791e-77aa-a274-17b769cc8968
Co-authored-by: Amp <amp@ampcode.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants