docs/examples: public benchmark starter pack for major open benchmark repos #1076

@christso

Objective

Add a first-party docs/examples starter pack showing how to use AgentV with the main public benchmark repos people already recognize, instead of leaving each benchmark as a one-off research note.

Why this is needed

AgentV already has the core primitives for benchmark-facing workflows:

  • agentv import huggingface
  • workspace isolation / Docker-oriented execution paths
  • code, rubric, llm-judge, and tool-trajectory grading
  • multi-trial aggregation (pass_at_k, mean, confidence_interval)
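To make the aggregation names above concrete for example authors, here is a minimal Python sketch of what metrics called `pass_at_k`, `mean`, and `confidence_interval` commonly compute. It is not AgentV's implementation: the function names, signatures, and the normal-approximation interval are assumptions for illustration, and the pass@k formula is the widely used unbiased estimator over n trials with c passes.

```python
from math import comb, sqrt
from statistics import NormalDist, mean, stdev


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n trials of which c passed.

    Hypothetical helper, not AgentV's API.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failures means any k-sized sample contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_with_ci(scores: list[float], level: float = 0.95) -> tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal approximation.

    Assumes at least two scores; small samples may warrant a
    t-interval or bootstrap instead.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    m = mean(scores)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(scores) / sqrt(len(scores))
    return m, m - half_width, m + half_width


if __name__ == "__main__":
    # 10 trials of one task, 3 of them passed.
    print(pass_at_k(n=10, c=3, k=1))
    print(mean_with_ci([1.0, 0.0, 1.0, 1.0]))
```

For heavyweight benchmarks, the number of trials n drives runtime cost directly, so an example guide would likely want to state which n and k it assumes.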

What is still missing is the starter layer: copy-paste examples and short guides that map well-known public benchmark repos into AgentV.

Recent MiniMax M2.7 follow-up research: https://github.com/agentevals/agentevals-research/pull/58

Scope

Create a curated benchmark example pack covering the highest-signal public repos:

  1. SWE-bench/SWE-bench — canonical issue-resolution benchmark import example
  2. multi-swe-bench/multi-swe-bench — multilingual SWE-style example
  3. laude-institute/terminal-bench and harbor-framework/terminal-bench-2 — execution-based terminal task examples
  4. multimodal-art-projection/NL2RepoBench — greenfield repo-generation example
  5. openai/mle-bench — MLE-Bench Lite guide as a practical long-running benchmark example
  6. anomalyco/opencode-bench — advanced multi-judge pattern reference

Deliverables

  • docs page or example-pack index for public benchmark repos
  • at least one runnable example/import path for each benchmark shape (issue-resolution, terminal tasks, repo generation, long-running optimization)
  • clear caveats for heavyweight setups so examples stay honest about runtime costs
  • explicit note that internal-only vendor eval names are not canonical public benchmark targets
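One way to track the "at least one runnable example per benchmark shape" deliverable is a small index that maps each repo in scope to its shape and reports which shapes are still uncovered. The sketch below is hypothetical: the `BenchmarkExample` type, the shape labels, and the `runnable` flags are illustrative assumptions, not an existing AgentV schema, and every flag starts as `False` because no example is claimed to exist yet.

```python
from dataclasses import dataclass

# The four shapes named in the deliverables list (labels are illustrative).
REQUIRED_SHAPES = {
    "issue-resolution",
    "terminal-tasks",
    "repo-generation",
    "long-running-optimization",
}


@dataclass(frozen=True)
class BenchmarkExample:
    """Hypothetical index entry for one public benchmark repo."""
    repo: str
    shape: str
    runnable: bool = False  # placeholder status, not a report of real progress


EXAMPLES = [
    BenchmarkExample("SWE-bench/SWE-bench", "issue-resolution"),
    BenchmarkExample("multi-swe-bench/multi-swe-bench", "issue-resolution"),
    BenchmarkExample("laude-institute/terminal-bench", "terminal-tasks"),
    BenchmarkExample("harbor-framework/terminal-bench-2", "terminal-tasks"),
    BenchmarkExample("multimodal-art-projection/NL2RepoBench", "repo-generation"),
    BenchmarkExample("openai/mle-bench", "long-running-optimization"),
    # Pattern reference only; it is not one of the four required shapes.
    BenchmarkExample("anomalyco/opencode-bench", "multi-judge-reference"),
]


def uncovered_shapes(examples: list[BenchmarkExample]) -> set[str]:
    """Return required shapes that have no runnable example yet."""
    covered = {e.shape for e in examples if e.runnable}
    return REQUIRED_SHAPES - covered


if __name__ == "__main__":
    print(sorted(uncovered_shapes(EXAMPLES)))
```

A check like this could run in CI for the docs pack so that a shape never silently loses its only runnable example.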

Non-goals

  • not a new benchmark ingestion subsystem
  • not a public leaderboard product by itself
  • not bundling huge benchmark datasets into the repo

Related

Metadata

Assignees: No one assigned
Labels: docs (Improvements or additions to documentation)
Type: No type
Project status: Ready
Milestone: No milestone