[DRAFT] ProgressiveMCPBench #340
Draft · geelen wants to merge 17 commits into main from progressive-mcp
Conversation
✅ Benchmark documentation has been automatically updated.
A new benchmark for evaluating LLM tool-calling capabilities using synthetic MCP servers.

Features:
- Synthetic MCP server with 15 domains and 50+ tools
- Multiple evaluation strategies: copilot, single-shot, distraction modes
- LLM-as-judge scoring for semantic answer matching
- Support for both local and remote MCP execution
- Groq Responses API provider for server-side MCP

Strategies:
- copilot: multi-turn with local MCP server (default)
- minimal-tools: single-shot, only required tools exposed
- distraction-16/64: includes distractor tools
- minimal-servers-remote: Groq server-side MCP execution

New files:
- src/openbench/evals/progressivemcpbench.py
- src/openbench/datasets/progressivemcpbench.py
- src/openbench/scorers/progressivemcpbench.py
- src/openbench/model/_providers/groq_responses.py
- src/openbench/tools/progressivemcpbench/
- synthetic_mcp/ (data generation pipeline)
- docs/evals/progressivemcpbench.mdx

Amp-Thread-ID: https://ampcode.com/threads/T-019b2b8f-be9a-7070-b38b-d5ff16e8b2eb
Co-authored-by: Amp <amp@ampcode.com>
Add strict=true to function calling in the Groq provider to fix malformed JSON generation from models like gpt-oss-120b.

- Add _make_schema_strict() to transform schemas for strict mode
- Set additionalProperties: false on object schemas
- Add all properties to the required array
- Enable the strict: true flag on all tool definitions

This fixes a 100% failure rate on the progressivemcpbench directory strategy with gpt-oss-120b, reducing runs from 191 messages per sample to ~8k tokens in 3s.
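A helper like the `_make_schema_strict()` described above might look roughly like the following sketch. This is a hypothetical reconstruction from the commit message, not the actual implementation: it applies the two stated transformations (setting `additionalProperties: false` on object schemas and listing every property in `required`), recursing into nested schemas.

```python
from typing import Any


def make_schema_strict(schema: dict[str, Any]) -> dict[str, Any]:
    """Recursively transform a JSON Schema for strict function calling.

    Hypothetical sketch of the _make_schema_strict() helper from the commit
    message: object schemas get additionalProperties: false, and every
    property is listed in `required`, so the model cannot emit stray keys.
    """
    schema = dict(schema)  # shallow copy; don't mutate the caller's schema
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        schema["properties"] = {k: make_schema_strict(v) for k, v in props.items()}
        schema["additionalProperties"] = False
        schema["required"] = list(props)
    elif schema.get("type") == "array" and "items" in schema:
        schema["items"] = make_schema_strict(schema["items"])
    return schema
```

The transformed schema would then be attached to each tool definition alongside `"strict": True` before the request is sent to the provider.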
…te with tool_discovery parameter

- Refactor minimal-servers-remote to use remote_mcp handlers
- Support Groq (Responses API) and Anthropic (MCP connector)
- Add tool_discovery parameter: directory (Groq), regex/bm25 (Anthropic)
- Task name includes tool_discovery when set
…ame prefix

- Change supports_provider(model_name) to supports_api(api)
- Add provider_name() method for display
- Pass model.api to the registry instead of model.name
- Add @solver decorator to the parameterized solver function
- Use mcp_servers as a top-level API parameter (not inside tools)
- Use type: 'mcp_toolset' with mcp_server_name in the tools array
- Add the required 'name' field for tool_search_tool types

Fixes a 0% accuracy issue where the API was rejecting the invalid request format.

Amp-Thread-ID: https://ampcode.com/threads/T-019b446c-2451-76ca-a3e5-e35d23f16b30
Co-authored-by: Amp <amp@ampcode.com>
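Reading only the structural points in this commit message, the corrected request shape would look roughly like the sketch below. Every concrete value (server name, URL, the search-tool name) is an illustrative placeholder; only the three bullet points above are taken as given.

```python
# Hypothetical request body illustrating the structure described in the
# commit message. Field *values* are placeholders, not real endpoints.
request = {
    # mcp_servers is a top-level parameter, not nested inside tools
    "mcp_servers": [
        {"name": "weather", "url": "https://example.com/mcp"},
    ],
    "tools": [
        # An MCP toolset references a server by name via mcp_server_name
        {"type": "mcp_toolset", "mcp_server_name": "weather"},
        # tool_search_tool entries require an explicit 'name' field
        {"type": "tool_search_tool", "name": "tool_search_tool_regex"},
    ],
}
```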
…ndler

Allow minimal-servers-remote to work with standard groq/ models, not just groq-responses/. The handler uses the Responses API directly regardless of which Groq provider was specified.
The minimal-servers-remote strategy now uses the Responses API directly with the standard groq/ provider, so groq-responses is no longer needed.

- Remove groq-responses from _registry.py
- Remove GROQ_RESPONSES from provider_config.py
- Delete model/_providers/groq_responses.py
- Update GroqRemoteMCPHandler to only check for GroqAPI
Uses Groq-Beta: mcp-deferred-directory-lazy header instead of mcp-deferred-directory for lazy loading behavior.
…tion

- Add server list directly to the meta__ls tool description to avoid a round trip
- Disallow listing /tools directly (returns an error with guidance)
- Update the system message to reflect the new behavior
- Add an ellipsis to truncated server descriptions

Amp-Thread-ID: https://ampcode.com/threads/T-019b48c5-791e-77aa-a274-17b769cc8968
Co-authored-by: Amp <amp@ampcode.com>
Have been chipping away at this new eval for a few weeks and am starting to gear up to get it properly reviewed. I'd done a lot of development directly here in OpenBench, but most of the recent work has gone into moving as much as possible to a new repo: https://github.com/geelen/progressivemcpbench
Note: this PR includes #333 to test with Gemini 3.
This PR has two parts:
The first is the main ProgressiveMCPEval, which has 6 strategies for tool discovery to compare:
- copilot: route() + execute-tool()
- directory: ls() + read-tool-file()
- minimal-servers
- minimal-tools
- distraction-64
- distraction-128

These probably aren't the final set we'll go with, but they've been working well enough for me to flesh things out. All of them require being able to hit a large number of reliable, realistic-looking MCP servers, which has been most of the work. That's moved to the other project, and is hosted here: https://progressive-mcp-bench.groq-dev.workers.dev/servers
I'll be tidying up this draft PR and submitting it for review once I'm happy with the eval's design.
Groq-Responses provider and server-side MCP support
There's actually a seventh strategy called minimal-servers-remote. Instead of running an agentic loop locally, passing local tools of type `function` up with each inference call, and dispatching tool calls to MCP servers locally, it sends a payload that includes the MCP servers themselves (as `type=mcp`) and expects the provider to respond in a single call. This has been useful for testing changes to Groq's internal services, as well as for benchmarking them against a reference implementation, and it's the reason the MCP mock layer is deployed publicly at all.

However, it required the `responses` endpoint, and since Inspect AI appeared to have no support for remote MCP, when I asked Amp to implement this it bypassed a lot of the Inspect AI code and just used the OpenAI SDK directly with a provider called `groq-responses`. So that needs to be fixed. Maybe #335 is a good starting point for adding server-side MCP in a generic way?
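To make the local/remote distinction concrete, a remote-MCP request against a Responses-style endpoint looks roughly like the sketch below. The `type: "mcp"` / `server_label` / `server_url` fields follow the OpenAI Responses API's remote-MCP tool shape; the model name and server URL are placeholders, not the benchmark's actual configuration.

```python
import json

# Sketch of a server-side MCP request: the MCP server is declared in the
# request (type "mcp") instead of individual type "function" tools, so the
# provider runs the whole tool loop and answers in a single call.
# All values below are illustrative placeholders.
payload = {
    "model": "openai/gpt-oss-120b",
    "input": "What's the weather in Oslo?",
    "tools": [
        {
            "type": "mcp",
            "server_label": "progressive-mcp",
            "server_url": "https://example.com/mcp",
        }
    ],
}

print(json.dumps(payload, indent=2))
```

Compare this with the local strategies, where each inference call carries `type: "function"` tool definitions and the harness itself dispatches the resulting tool calls to the MCP servers.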
In any case, I'll back this server-side stuff out before submitting this PR, but thought I'd mention it.