[DRAFT] ProgressiveMCPBench #340

geelen · 2025-12-17T11:55:37Z

I've been chipping away at this new eval for a few weeks and am starting to gear up to get it properly reviewed. I did a lot of the development directly here in OpenBench, but my most recent work has been moving as much as possible to a new repo: https://github.com/geelen/progressivemcpbench

Note: this PR includes #333 to test with Gemini 3.

This PR has two parts:

The first is the main ProgressiveMCPEval, which has 6 strategies for tool discovery to compare:

| Strategy | Description |
| --- | --- |
| `copilot` | Semantic search via `route()` + `execute-tool()` |
| `directory` | Filesystem-like exploration via `ls()` + `read-tool-file()` |
| `minimal-servers` | Direct access to required servers only |
| `minimal-tools` | Direct access to exact tools needed (upper bound) |
| `distraction-64` | Required tools + distractors (64 total) |
| `distraction-128` | Required tools + distractors (128 total) |

These probably aren't the final set we will go with, but they have worked well enough for me to flesh things out. All of them require being able to hit a large number of reliable, realistic-looking MCP servers, which has been most of the work. That work has moved to the other project and is hosted here: https://progressive-mcp-bench.groq-dev.workers.dev/servers
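To make the `distraction-64` and `distraction-128` strategies concrete, here is a minimal sketch of how such a tool set could be assembled. It is not the code in this PR: the function name, the seeded sampling, and the shuffle are all assumptions made for illustration.

```python
import random
from typing import Sequence


def build_distraction_toolset(
    required: Sequence[str],
    catalog: Sequence[str],
    total: int,
    seed: int = 0,
) -> list[str]:
    """Return the required tools plus random distractors, `total` names in all.

    Hypothetical helper: the real benchmark may pick distractors differently.
    """
    required_set = set(required)
    if len(required_set) > total:
        raise ValueError("more required tools than the requested total")
    # Candidate distractors are catalog tools the task does not need.
    candidates = sorted(set(catalog) - required_set)
    n_distractors = total - len(required_set)
    if n_distractors > len(candidates):
        raise ValueError("catalog too small to reach the requested total")
    rng = random.Random(seed)  # seeded so a sample sees the same tools on every run
    tools = sorted(required_set) + rng.sample(candidates, n_distractors)
    rng.shuffle(tools)  # assumption: avoid hinting at the answer through ordering
    return tools
```

With `total=64` this yields the required tools padded to exactly 64 entries, which is what the strategy description implies.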

I'll be tidying up this draft PR and submitting it for review once I'm happy with the eval's design.

Groq-Responses provider and server-side MCP support

There's actually a seventh strategy called `minimal-servers-remote`. Instead of running an agentic loop locally (passing local tools of type `function` up with each inference call and dispatching tool calls to MCP servers locally), it sends a payload that includes the MCP servers themselves (as `type=mcp`) and expects the server to respond in a single call. This has been useful for testing changes to Groq's internal services as well as benchmarking them against a reference implementation, and it's the reason the MCP mock layer is deployed publicly at all.
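The difference between the two modes is easiest to see in the request payloads. The sketch below only builds dictionaries and makes no network calls. The `server_label` and `server_url` field names follow the MCP tool shape of the OpenAI Responses API and are an assumption here, as are the helper names and the `input_schema` key on the local tool records.

```python
from typing import Any


def local_loop_request(
    model: str, prompt: str, tools: list[dict[str, Any]]
) -> dict[str, Any]:
    """One inference call of the local agentic loop.

    Tools are sent as type=function; the caller must execute any tool calls
    against the MCP servers itself and send the results back in a later call.
    """
    return {
        "model": model,
        "input": prompt,
        "tools": [
            {
                "type": "function",
                "name": t["name"],
                "description": t["description"],
                "parameters": t["input_schema"],
            }
            for t in tools
        ],
    }


def remote_mcp_request(
    model: str, prompt: str, servers: dict[str, str]
) -> dict[str, Any]:
    """The single call used by the remote strategy.

    The MCP servers themselves are sent as type=mcp, and the inference
    service is expected to discover and call their tools server-side.
    """
    return {
        "model": model,
        "input": prompt,
        "tools": [
            {"type": "mcp", "server_label": label, "server_url": url}
            for label, url in servers.items()
        ],
    }
```

In the first shape the number of requests grows with the number of tool calls; in the second, the client sends one request and the tool traffic happens between the inference service and the MCP servers.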

However, it requires the Responses endpoint, and InspectAI appeared to have no support for remote MCP. So when I asked Amp to implement this, it bypassed a lot of the InspectAI code and just used the OpenAI SDK directly with a provider called `groq-responses`. That needs to be fixed.

Maybe #335 is a good starting point for adding server-side MCP in a generic way?

In any case, I'll back this server-side stuff out before submitting this PR, but thought I'd mention it.


github-actions · 2025-12-17T11:56:02Z

✅ Benchmark documentation has been automatically updated.

socket-security · 2025-12-17T11:56:18Z

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

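| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| | inspect-ai@0.3.141 ⏵ 0.3.151 | ^-26 | | | | |
| | mcp@1.13.1 ⏵ 1.22.0 | | | | | |
| | openai@2.8.1 ⏵ 2.8.0 | | | | | |
| | pycparser@2.23 | | | | | |
| | anthropic@0.74.1 ⏵ 0.73.0 | | | | | |
| | frozendict@2.4.7 | | | | | |
| | nest-asyncio2@1.7.1 | | | | | |
| | inspect-swe@0.2.26 ⏵ 0.2.27 | ⁺¹ | | | | |
| | pyjwt@2.10.1 | | | | | |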

View full report

A new benchmark for evaluating LLM tool-calling capabilities using synthetic MCP servers.

Features:
- Synthetic MCP server with 15 domains and 50+ tools
- Multiple evaluation strategies: copilot, single-shot, distraction modes
- LLM-as-judge scoring for semantic answer matching
- Support for both local and remote MCP execution
- Groq Responses API provider for server-side MCP

Strategies:
- copilot: Multi-turn with local MCP server (default)
- minimal-tools: Single-shot, only required tools exposed
- distraction-16/64: Includes distractor tools
- minimal-servers-remote: Groq server-side MCP execution

New files:
- src/openbench/evals/progressivemcpbench.py
- src/openbench/datasets/progressivemcpbench.py
- src/openbench/scorers/progressivemcpbench.py
- src/openbench/model/_providers/groq_responses.py
- src/openbench/tools/progressivemcpbench/
- synthetic_mcp/ (data generation pipeline)
- docs/evals/progressivemcpbench.mdx

Amp-Thread-ID: https://ampcode.com/threads/T-019b2b8f-be9a-7070-b38b-d5ff16e8b2eb
Co-authored-by: Amp <amp@ampcode.com>

github-actions · 2025-12-17T11:57:51Z

✅ Benchmark documentation has been automatically updated.

Add strict=true to function calling in the Groq provider to fix malformed JSON generation from models like gpt-oss-120b.

- Add _make_schema_strict() to transform schemas for strict mode
- Set additionalProperties: false on object schemas
- Add all properties to required array
- Enable strict: true flag on all tool definitions

This fixes 100% failure rate on progressivemcpbench directory strategy with gpt-oss-120b, reducing from 191 messages/sample to ~8k tokens in 3s.
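For readers unfamiliar with strict mode, the schema transformation described in this commit message could look roughly like the sketch below. It is not the `_make_schema_strict()` from this PR: the names, the set of JSON Schema keywords it recurses into, and the OpenAI-style tool wrapper are assumptions.

```python
import copy
from typing import Any


def make_schema_strict(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a JSON Schema adjusted for strict function calling."""
    strict = copy.deepcopy(schema)  # never mutate the caller's tool definition
    _strictify(strict)
    return strict


def _strictify(node: Any) -> None:
    if not isinstance(node, dict):
        return
    if node.get("type") == "object":
        props = node.setdefault("properties", {})
        node["additionalProperties"] = False  # reject unknown keys
        node["required"] = list(props)  # every declared property becomes required
        for sub in props.values():
            _strictify(sub)
    # Recurse into array items and schema combinators (assumed keyword set).
    if isinstance(node.get("items"), dict):
        _strictify(node["items"])
    for key in ("anyOf", "oneOf", "allOf"):
        for sub in node.get(key, []):
            _strictify(sub)


def strict_tool(name: str, description: str, parameters: dict[str, Any]) -> dict[str, Any]:
    """Wrap a schema in an OpenAI-style tool definition with strict enabled."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": make_schema_strict(parameters),
            "strict": True,
        },
    }
```

Marking every property as required is what the commit message describes; how originally optional arguments should then be represented is outside the scope of this sketch.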


…te with tool_discovery parameter

- Refactor minimal-servers-remote to use remote_mcp handlers
- Support Groq (Responses API) and Anthropic (MCP connector)
- Add tool_discovery parameter: directory (Groq), regex/bm25 (Anthropic)
- Task name includes tool_discovery when set

…ame prefix

- Change supports_provider(model_name) to supports_api(api)
- Add provider_name() method for display
- Pass model.api to registry instead of model.name
- Add @solver decorator to parameterized solver function
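The dispatch change in this commit (matching on the API object instead of the model name) can be illustrated with a small sketch. Only `supports_api(api)`, `provider_name()`, and the `RemoteMCPHandler` ABC come from the commit messages; the registry functions, the `FakeGroqAPI` stand-in, and the use of instance methods are assumptions.

```python
from abc import ABC, abstractmethod


class RemoteMCPHandler(ABC):
    """Base class for provider-specific remote MCP handlers."""

    @abstractmethod
    def supports_api(self, api: object) -> bool:
        """Return True if this handler can drive the given model API object."""

    @abstractmethod
    def provider_name(self) -> str:
        """Short provider name used for display."""


class FakeGroqAPI:
    """Stand-in for the real Groq model API class (hypothetical)."""


class FakeGroqHandler(RemoteMCPHandler):
    def supports_api(self, api: object) -> bool:
        # Dispatch on the API object's type rather than parsing a model-name prefix.
        return isinstance(api, FakeGroqAPI)

    def provider_name(self) -> str:
        return "groq"


_HANDLERS: list[RemoteMCPHandler] = []


def register_handler(handler: RemoteMCPHandler) -> None:
    _HANDLERS.append(handler)


def get_handler(api: object) -> RemoteMCPHandler:
    """Return the first registered handler that supports `api`."""
    for handler in _HANDLERS:
        if handler.supports_api(api):
            return handler
    raise ValueError(f"no remote MCP handler for API type {type(api).__name__}")
```

A caller would pass `model.api` to `get_handler`, which is what "Pass model.api to registry instead of model.name" suggests.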

- Use mcp_servers as top-level API parameter (not inside tools)
- Use type: 'mcp_toolset' with mcp_server_name in tools array
- Add required 'name' field for tool_search_tool types

Fixes 0% accuracy issue where API was rejecting invalid request format.

Amp-Thread-ID: https://ampcode.com/threads/T-019b446c-2451-76ca-a3e5-e35d23f16b30
Co-authored-by: Amp <amp@ampcode.com>
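The request shape this fix describes can be sketched as a plain dictionary builder. The three structural points (top-level `mcp_servers`, `mcp_toolset` entries keyed by `mcp_server_name`, and a `name` on the tool search entry) come from the commit message. The `type: "url"` server entries, the helper name, and the idea of passing the tool search type string in as a parameter are assumptions, and the exact type string is deliberately left to the caller.

```python
from typing import Any, Optional


def anthropic_remote_mcp_params(
    servers: dict[str, str],
    tool_search_type: Optional[str] = None,
    tool_search_name: str = "tool_search",
) -> dict[str, Any]:
    """Build the MCP-related request parameters (hypothetical helper)."""
    params: dict[str, Any] = {
        # Server definitions live at the top level of the request, not inside tools.
        "mcp_servers": [
            {"type": "url", "name": name, "url": url}
            for name, url in servers.items()
        ],
        # Each server is then referenced by name from the tools array.
        "tools": [
            {"type": "mcp_toolset", "mcp_server_name": name} for name in servers
        ],
    }
    if tool_search_type is not None:
        # Tool search entries need an explicit name in addition to their type.
        params["tools"].append({"type": tool_search_type, "name": tool_search_name})
    return params
```

These parameters would be merged into the rest of the request (model, messages, and so on) before sending.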


The minimal-servers-remote strategy now uses the Responses API directly with the standard groq/ provider, so groq-responses is no longer needed.

- Remove groq-responses from _registry.py
- Remove GROQ_RESPONSES from provider_config.py
- Delete model/_providers/groq_responses.py
- Update GroqRemoteMCPHandler to only check for GroqAPI


…tion

- Add server list directly to meta__ls tool description to avoid round trip
- Disallow listing /tools directly (returns error with guidance)
- Update system message to reflect new behavior
- Add ellipsis to truncated server descriptions

Amp-Thread-ID: https://ampcode.com/threads/T-019b48c5-791e-77aa-a274-17b769cc8968
Co-authored-by: Amp <amp@ampcode.com>
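A rough sketch of the `meta__ls` behavior described in this commit message follows. The `/tools/<server>` path layout, the return shapes, the wording of the messages, and the truncation limit are assumptions; only the four behaviors listed in the commit are taken from the source.

```python
def truncate(text: str, limit: int) -> str:
    """Shorten `text` to at most `limit` characters, marking the cut with an ellipsis."""
    if len(text) <= limit:
        return text
    return text[: limit - 1].rstrip() + "…"


def build_ls_description(servers: dict[str, str], limit: int = 60) -> str:
    """Embed the server list in the tool description to save a discovery round trip."""
    lines = ["List the tools of one server at /tools/<server>. Available servers:"]
    lines += [
        f"- {name}: {truncate(desc, limit)}" for name, desc in sorted(servers.items())
    ]
    return "\n".join(lines)


def meta_ls(path: str, tools_by_server: dict[str, list[str]]) -> dict:
    """Handle an ls call (hypothetical return shape)."""
    parts = [p for p in path.split("/") if p]
    if parts == ["tools"]:
        # Listing every server is disallowed; point the model at the description.
        return {
            "error": "Listing /tools is not allowed. Pick a server from the "
            "tool description and list /tools/<server> instead."
        }
    if len(parts) == 2 and parts[0] == "tools" and parts[1] in tools_by_server:
        return {"entries": sorted(tools_by_server[parts[1]])}
    return {"error": f"No such path: {path}"}
```

Putting the server list in the description means the model can go straight to a specific server, and the error on `/tools` nudges it away from the now-redundant first call.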

feat: upgraded InspectAI, OpenAI and MCP versions to support Gemini 3…

2cd4e13

… Pro Preview

geelen force-pushed the progressive-mcp branch from 414174f to eff6da2 Compare December 17, 2025 11:57

chore: update benchmark docs [skip ci]

561d630

geelen and others added 14 commits December 18, 2025 22:20

feat(progressivemcpbench): add deferred_mode=directory to remote MCP …

a74bb64

…tools

feat(remote_mcp): add base module structure and RemoteMCPHandler ABC

3d2b72d

feat(remote_mcp): add GroqRemoteMCPHandler for Responses API with MCP

02fba24

feat(remote_mcp): add AnthropicRemoteMCPHandler with MCP connector an…

14d22e1

…d tool search

feat(remote_mcp): add registry for provider dispatch

3545894

fix(remote_mcp): support both GroqAPI and GroqResponsesAPI in Groq ha…

b38d607

…ndler Allow minimal-servers-remote to work with standard groq/ models, not just groq-responses/. The handler uses the Responses API directly regardless of which Groq provider was specified.

feat(remote_mcp): add Groq-Beta header for MCP deferred directory

af776b5

feat(remote_mcp): add directory-lazy tool discovery option for Groq

6b1b323

Uses Groq-Beta: mcp-deferred-directory-lazy header instead of mcp-deferred-directory for lazy loading behavior.
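The relationship between the tool discovery option and the beta header can be sketched as a small lookup. The two header values and the `Groq-Beta` header name come from the commit messages; the option names used as keys and the helper itself are assumptions.

```python
from typing import Optional

# Assumed mapping from the tool_discovery option to the Groq-Beta header value.
GROQ_BETA_BY_DISCOVERY = {
    "directory": "mcp-deferred-directory",
    "directory-lazy": "mcp-deferred-directory-lazy",
}


def groq_beta_headers(tool_discovery: Optional[str]) -> dict[str, str]:
    """Return the extra request headers for a tool discovery mode (hypothetical helper)."""
    if tool_discovery is None:
        return {}  # no deferred discovery requested, so no beta header
    try:
        value = GROQ_BETA_BY_DISCOVERY[tool_discovery]
    except KeyError:
        raise ValueError(f"unsupported tool_discovery: {tool_discovery!r}") from None
    return {"Groq-Beta": value}
```

Failing loudly on an unknown option avoids silently running a sample without the intended discovery mode.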
