examples/showcase: expand bug-fix-benchmark with rigorous multi-scenario workflow evals #1100

@christso

Context

The current bug-fix-benchmark (examples/showcase/bug-fix-benchmark) compares engineering workflow plugins (agent-skills, superpowers, compound) against a baseline on a real GitHub repo. It has one test case — a single-file fix with the root cause and file location described in the prompt.

The recent SkillsBench paper provides methodology grounding. Key finding: Software Engineering showed the smallest improvement (+4.5pp) when tasks were prescriptive — agents could navigate them without plugin help.

Related: addyosmani/agent-skills#51

What's needed

1. More complex task scenarios

The current task is too prescriptive: the prompt names the file, method, and fix pattern. Add at least 4 new tasks, each covering a distinct scenario type:

  • Multi-file bugs — root cause spans 2+ files, no location hints in the prompt
  • Regression bugs — "works on commit A, fails on commit B, find why"
  • Spec-driven implementation — given a spec, implement + add tests from scratch
  • Refactoring under test — restructure code without breaking the existing test suite

All tasks must use the same agentv repo (https://github.com/EntityProcess/agentv) as the workspace so no new repo setup is needed.

2. Self-generated skills as a control condition

Add a fourth variant claude-self-generated alongside the existing three. Its workspaces/self-generated/CLAUDE.md should instruct the agent to write its own procedural knowledge before starting the task — something like: "Before solving this task, write a SKILL.md describing your approach and the engineering process you will follow. Then follow it." No plugin is installed. This isolates whether curated plugin content outperforms an agent's own self-generated process notes.

3. Multi-trial runs with confidence intervals

Configure the eval with:

trials:
  count: 5
  strategy: confidence_interval

Running each test 5 times lets pass-rate deltas between variants be reported with a confidence interval, rather than read off a single noisy run.
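To get a feel for how wide an interval 5 trials can support, here is a minimal sketch of a Wilson score interval for a pass rate. This is an illustration only: `wilson_interval` is a hypothetical helper, and the `confidence_interval` strategy in agentv may use a different method.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate.

    z = 1.96 corresponds to roughly 95% confidence. Hypothetical helper,
    not part of agentv.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= passes <= trials:
        raise ValueError("passes must be between 0 and trials")

    p = passes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the edges.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Example: 3 passes out of 5 trials.
low, high = wilson_interval(3, 5)
print(f"pass rate 3/5, interval {low:.2f} to {high:.2f}")
```

With 3 passes out of 5 the interval is roughly 0.23 to 0.88, so small deltas between variants on a single test will not be distinguishable at this trial count; pooling across tests or raising the count narrows it.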

4. Token/cost/latency tracking

Add evaluators to measure the overhead of each plugin variant:

evaluators:
  - type: token-usage
  - type: cost
  - type: latency

These evaluators answer the question "is the plugin worth its cost?" For reference, SkillsBench found skills add ~13s and ~1,700 tokens per task on average.
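One way to read these measurements is as a per-variant delta against the baseline. The sketch below assumes each run yields a token count, a cost, and a latency; the `RunMetrics` shape, the variant labels, and all sample numbers are hypothetical placeholders, not agentv output.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunMetrics:
    """One run's overhead measurements (hypothetical shape, not agentv's schema)."""
    variant: str
    tokens: int
    cost_usd: float
    latency_s: float


def mean_overhead(runs: list[RunMetrics], baseline: str = "baseline") -> dict[str, tuple[float, float, float]]:
    """Return mean (tokens, cost, latency) delta of each variant versus the baseline."""
    by_variant: dict[str, list[RunMetrics]] = {}
    for run in runs:
        by_variant.setdefault(run.variant, []).append(run)
    if baseline not in by_variant:
        raise ValueError(f"no runs recorded for baseline variant {baseline!r}")

    def averages(rs: list[RunMetrics]) -> tuple[float, float, float]:
        return (
            mean(r.tokens for r in rs),
            mean(r.cost_usd for r in rs),
            mean(r.latency_s for r in rs),
        )

    base = averages(by_variant[baseline])
    return {
        variant: tuple(a - b for a, b in zip(averages(rs), base))
        for variant, rs in by_variant.items()
        if variant != baseline
    }


# Placeholder numbers for illustration only.
runs = [
    RunMetrics("baseline", 10_000, 0.10, 40.0),
    RunMetrics("baseline", 12_000, 0.12, 44.0),
    RunMetrics("superpowers", 12_500, 0.14, 55.0),
    RunMetrics("superpowers", 13_500, 0.16, 57.0),
]
overhead = mean_overhead(runs)
print(overhead["superpowers"])
```

Reporting the delta next to the pass-rate delta for the same variant makes the cost/benefit trade-off visible in one row.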

5. Difficulty stratification and domain tagging

Tag each test case with a difficulty tier (core / extended / extreme, based on estimated human completion time) and a scenario type. The tags enable stratified analysis in agentv compare.

tests:
  - id: fix-multi-file-auth
    tags: [extended, multi-file, bugfix]
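As an illustration of what stratified analysis over these tags could look like, here is a minimal sketch that groups pass rates by tag. `pass_rate_by_tag` and the sample results are hypothetical; agentv compare may aggregate differently.

```python
from collections import defaultdict
from typing import Iterable


def pass_rate_by_tag(results: Iterable[tuple[list[str], bool]]) -> dict[str, float]:
    """Compute the pass rate per tag from (tags, passed) pairs.

    A result with several tags counts once toward each of them.
    """
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # tag -> [passes, total]
    for tags, passed in results:
        for tag in tags:
            counts[tag][0] += int(passed)
            counts[tag][1] += 1
    return {tag: passes / total for tag, (passes, total) in counts.items()}


# Placeholder results for illustration only.
results = [
    (["extended", "multi-file", "bugfix"], True),
    (["core", "bugfix"], False),
    (["extended", "regression"], False),
]
rates = pass_rate_by_tag(results)
print(rates)
```

Keeping the difficulty tier and the scenario type as separate tags lets the same results be sliced either way without re-running anything.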

6. Multi-model comparison

Run each variant across at least 2 model tiers (e.g. Sonnet 4.5 + Opus 4.6) to test whether skills compensate for model scale. SkillsBench found Haiku + skills (27.7%) outperformed Opus without skills (22.0%).
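Combining variants, model tiers, test cases, and trials multiplies the run count quickly, so it helps to size the matrix up front. The sketch below enumerates it; the variant labels, model labels, and all test ids except `fix-multi-file-auth` are assumptions for illustration, and the variant count depends on whether the baseline is counted among the existing three.

```python
from itertools import product


def run_matrix(
    variants: list[str],
    models: list[str],
    test_ids: list[str],
    trials: int,
) -> list[tuple[str, str, str, int]]:
    """Enumerate every (variant, model, test, trial) combination to run."""
    return [
        (variant, model, test_id, trial)
        for variant, model, test_id in product(variants, models, test_ids)
        for trial in range(1, trials + 1)
    ]


# Hypothetical labels; adjust to the actual target and test names.
variants = ["baseline", "agent-skills", "superpowers", "compound", "self-generated"]
models = ["sonnet", "opus"]
test_ids = [
    "existing-single-file",
    "fix-multi-file-auth",
    "regression",
    "spec-driven",
    "refactor-under-test",
]
matrix = run_matrix(variants, models, test_ids, trials=5)
print(len(matrix))
```

Under these assumptions (5 variants, 2 models, 5 tests, 5 trials) the matrix has 250 runs, which is worth checking against the token and cost budget before enabling every dimension at once.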

Acceptance signals

  • At least 5 test cases total (existing + 4 new) covering distinct scenario types
  • claude-self-generated variant added with appropriate CLAUDE.md
  • trials: 5 with confidence_interval strategy configured
  • token-usage, cost, latency evaluators included
  • Each task tagged with difficulty tier and scenario type
  • Multi-model targets configured (at least 2 model tiers)
  • README updated with methodology notes and link to SkillsBench paper

Non-goals

  • Not reproducing SkillsBench (84 tasks, 7 model configs) — this is a focused workflow benchmark
  • Not adding Docker containerization — git workspace isolation is sufficient
  • Not covering domains outside software engineering
  • Not implementing leakage prevention CI — 5-10 tasks can be reviewed manually
  • Normalized gain and negative delta detection are already handled by agentv compare (see #1101, "feat(compare): add normalized gain metric"), so no changes are needed there

Labels: docs, enhancement

Status: Done
