refine with real tool use and better nuanced metrics by AustinKelsay · Pull Request #5 · AustinKelsay/plebdev-bench

AustinKelsay · 2026-01-30T01:50:52Z

Summary by CodeRabbit

New Features
- Added dashboard charts: Composite Score Chart, Blind vs Informed Chart, Tooling Breakdown, and Failure Breakdown for enhanced analytics.
- Added informational tooltips across dashboard components for improved user guidance.
- Added Dimension Detail Dialog for deeper data exploration.
- Enabled tool-calling support in harnesses for improved code generation.
- Added tool-smoke test for preflight verification.

Improvements
- Enhanced test prompts to support tool and file-based workflows.

coderabbitai · 2026-01-30T01:51:06Z

Caution

Review failed

The pull request is closed.

Walkthrough

This PR introduces tool-calling support to harnesses with file-based code generation, adds comprehensive dashboard charts and drill-down dialogs with tooltips, expands the data aggregation layer with composite metrics and failure statistics, implements a tool-smoke preflight test, and updates test prompts to permit tool usage.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Dependencies `apps/dashboard/package.json` | Added Radix UI components (dialog, select, slot, tabs, tooltip), utility libraries (class-variance-authority, clsx, lucide-react), and removed duplicates. |
| Dashboard Charts `apps/dashboard/src/components/charts/composite-score-chart.tsx`, `blind-vs-informed-chart.tsx`, `frontier-eval-scatter.tsx`, `timing-distribution.tsx` | Introduced CompositeScoreChart and BlindVsInformedChart for multi-metric visualization; added tooltips to frontier and timing distribution charts. |
| Dashboard Breakdowns & Dialogs `apps/dashboard/src/components/run-detail/failure-breakdown.tsx`, `tooling-breakdown.tsx`, `dimension-detail-dialog.tsx` | Added FailureBreakdown and ToolingBreakdown components for failure and tool statistics; introduced DimensionDetailDialog for drill-down exploration with sub-dimension tables. |
| Dashboard Utilities `apps/dashboard/src/components/ui/info-tooltip.tsx`, `apps/dashboard/src/lib/tooltip-content.ts` | Created reusable InfoTooltip and WithInfoTooltip components; centralized tooltip content as typed constants for all dashboard sections. |
| Dashboard Integration `apps/dashboard/src/components/run-detail/run-detail-page.tsx`, `compare-summary.tsx`, `matrix-table.tsx`, `item-detail-dialog.tsx`, `scoring-breakdown.tsx`, `timing-stats.tsx` | Integrated tooltips across UI headers and labels; replaced PassRateChart with BlindVsInformedChart; added CompositeScoreChart with dimension click handling; included new breakdowns and dialogs. |
| Data Aggregations `apps/dashboard/src/lib/aggregations.ts` | Added groupByModelHarness function; introduced CompositeMetrics, BlindInformedBreakdown, FailureStats, ToolUseStats, and related computation functions for composite scoring, tool statistics, and failure analysis. |
| Type Extensions `apps/dashboard/src/lib/types.ts`, `src/lib/types.ts` (via schemas) | Extended GenerationFailureType with "tool_missing"; added codeFilePath to GenerationResult; introduced ExtractionMethod union, TOOL_SMOKE_TEST_SLUG constant, and isToolSmokeItem utility. |
| Tool-Calling Harness Implementation `src/harnesses/tool-prompt.ts`, `goose-adapter.ts`, `opencode-adapter.ts` | Introduced ToolPromptConfig and buildToolPrompt for tool-first prompts; refactored Goose and OpenCode adapters to run in dedicated work directories, parse tool-call outputs, write code to solution.ts, and handle tool-missing errors. |
| Harness Infrastructure `src/harnesses/harness.ts`, `ollama-adapter.ts` | Added TOOL_CALLING_HARNESS_NAMES constant and ToolCallingHarnessName type; extended GenerateResult with optional codeFilePath; added OLLAMA_PROMPT_PREFIX for consistent markdown output formatting. |
| Code Extraction & Scoring `src/lib/code-extractor.ts`, `scorer.ts`, `failure-classifier.ts` | Extended extraction methods to include "file"; added XML tag stripping and code-end-pattern detection; updated scorer to read code from codeFilePath when available; added "tool_missing" classification logic. |
| Runner & Executor `src/runner/index.ts`, `item-executor.ts`, `plan-builder.ts` | Removed OpenCode server pre-warming; added tool-smoke preflight handling with status tracking; reordered tests to prioritize tool-smoke; updated item execution to propagate codeFilePath and support file-based scoring; implemented tool-smoke pass-type selection. |
| Tool-Smoke Test Utilities `src/lib/tool-smoke.ts` | Added TOOL_SMOKE_TEST_SLUG constant, isToolSmokeTest predicate, and selectToolSmokePassType utility for tool-smoke preflight orchestration. |
| Schemas `src/schemas/common.schema.ts`, `result.schema.ts`, `scoring.schema.ts` | Extended generationFailureTypes to include "tool_missing"; added codeFilePath field to GenerationResultSchema; added "file" extraction method to ScoringResultSchema. |
| Tool-Smoke Test Definition `src/tests/tool-smoke/README.md`, `prompt.blind.md`, `prompt.informed.md`, `rubric.md`, `scoring.spec.ts` | Added complete tool-smoke test specification with prompts, rubric, and scoring spec for preflight tool-calling validation. |
| Test Prompts `src/tests//prompt..md` (calculator-basic, calculator-stateful, smoke, todo-app) | Removed "No explanations or tool/file usage." restriction from all test prompts to permit tool-based code generation. |
| Documentation `llm/context/codebase-overview.md`, `llm/implementation/.md`, `llm/project/.md` | Updated project overview, design rules, tech stack, and implementation guides to reflect new tool-calling capabilities, dashboard components, aggregations, and tool-smoke test. |
| Test Results `results/index.json` | Updated run metadata and aggregated summary statistics with new execution results. |
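
The "Runner & Executor" and "Tool-Smoke Test Utilities" rows describe reordering tests so the tool-smoke preflight runs first. A minimal sketch of that idea follows. The constant and predicate names come from the summary above, but the slug value `"tool-smoke"` and the helper `prioritizeToolSmoke` are assumptions made for illustration and may not match the code in `src/lib/tool-smoke.ts` or `plan-builder.ts`.

```typescript
// Assumed value: the PR names the constant but does not show what it holds.
const TOOL_SMOKE_TEST_SLUG = "tool-smoke";

// Predicate named in the summary; this body is a guess at its behavior.
function isToolSmokeTest(slug: string): boolean {
  return slug === TOOL_SMOKE_TEST_SLUG;
}

// Hypothetical helper: move the tool-smoke test to the front while keeping
// the relative order of every other test unchanged.
function prioritizeToolSmoke(slugs: readonly string[]): string[] {
  const smoke = slugs.filter((slug) => isToolSmokeTest(slug));
  const rest = slugs.filter((slug) => !isToolSmokeTest(slug));
  return smoke.concat(rest);
}

const ordered = prioritizeToolSmoke([
  "calculator-basic",
  "todo-app",
  "tool-smoke",
  "smoke",
]);
console.log(ordered);
```

Running the preflight first lets a run detect a harness without working tool calls before spending time on the longer tests.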

Sequence Diagram(s)

sequenceDiagram
    actor User as User/Runner
    participant HarnessAdapter as Harness Adapter
    participant ToolCaller as Tool-Calling Process
    participant CodeExtractor as Code Extractor
    participant FileSystem as File System
    participant Scorer as Scorer
    
    User->>HarnessAdapter: generate(prompt, model)
    HarnessAdapter->>HarnessAdapter: buildToolPrompt(taskPrompt)
    HarnessAdapter->>ToolCaller: execute (Goose/OpenCode + stdin prompt)
    Note over ToolCaller: Runs with tool-calling enabled
    ToolCaller->>ToolCaller: parse tool-call JSON output
    ToolCaller->>FileSystem: write solution.ts via edit tool
    FileSystem-->>ToolCaller: file created
    ToolCaller-->>HarnessAdapter: return { output, codeFilePath }
    HarnessAdapter->>FileSystem: check solution.ts exists & valid
    alt file exists and valid
        HarnessAdapter-->>User: { codeFilePath: "path/to/solution.ts" }
    else file missing/empty
        HarnessAdapter-->>User: error (tool_missing)
    end
    
    User->>Scorer: scoreGeneration(output, codeFilePath)
    alt codeFilePath provided
        Scorer->>FileSystem: read code from codeFilePath
        FileSystem-->>CodeExtractor: code content
        CodeExtractor->>CodeExtractor: extract(code, method="file")
    else fallback to output
        CodeExtractor->>CodeExtractor: extract(output, method=heuristic)
    end
    CodeExtractor-->>Scorer: extracted code
    Scorer->>Scorer: load spec, execute tests
    Scorer-->>User: { passed, failed, ... }
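
The scoring branch in the diagram above (read from `codeFilePath` when present, otherwise extract from the raw output) can be sketched as below. This is an illustration under stated assumptions, not the repository's `scorer.ts` or `code-extractor.ts`: the function `resolveCode`, the injected `readFile` callback, the `"fenced"` and `"raw"` method names, and the fallback when the file is empty are all hypothetical. Only the `"file"` method name is taken from the PR summary.

```typescript
// Only "file" is confirmed by the summary; the other members are assumed.
type ExtractionMethod = "file" | "fenced" | "raw";

interface ResolvedCode {
  code: string;
  method: ExtractionMethod;
}

// Build the fence marker at runtime so this example does not contain a
// literal triple backtick.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "[a-zA-Z]*\\n([\\s\\S]*?)" + FENCE);

// Hypothetical resolver: prefer the file written by the tool call, then fall
// back to heuristics on the model's text output. File access is injected so
// the sketch stays independent of any particular file-system API.
function resolveCode(
  output: string,
  codeFilePath: string | undefined,
  readFile: (path: string) => string | undefined,
): ResolvedCode {
  if (codeFilePath !== undefined) {
    const fromFile = readFile(codeFilePath);
    if (fromFile !== undefined && fromFile.trim() !== "") {
      return { code: fromFile, method: "file" };
    }
  }
  const fenced = FENCED_BLOCK.exec(output)?.[1];
  if (fenced !== undefined) {
    return { code: fenced.trim(), method: "fenced" };
  }
  return { code: output.trim(), method: "raw" };
}

// In-memory stand-in for the work directory.
const files = new Map<string, string>([
  ["work/solution.ts", "export const add = (a: number, b: number) => a + b;"],
]);
const read = (path: string) => files.get(path);

const fromFile = resolveCode("done", "work/solution.ts", read);
const fromOutput = resolveCode(
  "Here you go:\n" + FENCE + "ts\nexport const x = 1;\n" + FENCE,
  undefined,
  read,
);
console.log(fromFile.method, fromOutput.method);
```

Recording the method alongside the code makes it possible to report later how often a harness actually used its file tool rather than printing code inline.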

sequenceDiagram
    actor User as User
    participant DashUI as Dashboard UI
    participant CompositeChart as CompositeScoreChart
    participant DataLayer as Aggregations
    participant DetailDialog as DimensionDetailDialog
    
    User->>DashUI: view run-detail page
    DashUI->>CompositeChart: render with items + onDimensionClick
    CompositeChart->>DataLayer: computeCompositeMetrics(items)
    DataLayer-->>CompositeChart: { metrics by model/harness/test }
    CompositeChart->>CompositeChart: render grouped bars (tabs)
    
    User->>CompositeChart: click bar (dimension)
    CompositeChart->>DashUI: onDimensionClick(dimension, name)
    DashUI->>DashUI: setSelectedDimension({dimension, name})
    DashUI->>DetailDialog: open with dimension, name, items
    DetailDialog->>DataLayer: filterItems(dimension, name)
    DataLayer-->>DetailDialog: filtered items
    DetailDialog->>DetailDialog: computeStats, aggregateBreakdowns
    DetailDialog->>DetailDialog: render summary, blind-vs-informed, sub-tables
    DetailDialog-->>User: detail view
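
The aggregation step behind these charts can be illustrated with a small grouping sketch. The name `groupByModelHarness` appears in the PR summary, but its real signature is not shown there, so the item shape, the `"model/harness"` key format, and the `passRate` and `blindVsInformed` helpers below are assumptions. The actual composite score formula is not described in the PR and is not reproduced here.

```typescript
type PassType = "blind" | "informed";

// Assumed minimal item shape; the real result items carry more fields.
interface BenchItem {
  model: string;
  harness: string;
  passType: PassType;
  passed: boolean;
}

// Group items under a "model/harness" key (key format is an assumption).
function groupByModelHarness(items: readonly BenchItem[]): Map<string, BenchItem[]> {
  const groups = new Map<string, BenchItem[]>();
  for (const item of items) {
    const key = item.model + "/" + item.harness;
    const bucket = groups.get(key);
    if (bucket) {
      bucket.push(item);
    } else {
      groups.set(key, [item]);
    }
  }
  return groups;
}

// Fraction of passing items, or null when there is nothing to measure.
function passRate(items: readonly BenchItem[]): number | null {
  if (items.length === 0) return null;
  return items.filter((item) => item.passed).length / items.length;
}

// Hypothetical blind-vs-informed breakdown for one group.
function blindVsInformed(items: readonly BenchItem[]) {
  return {
    blind: passRate(items.filter((item) => item.passType === "blind")),
    informed: passRate(items.filter((item) => item.passType === "informed")),
  };
}

// "model-a" is a placeholder model name.
const items: BenchItem[] = [
  { model: "model-a", harness: "goose", passType: "blind", passed: false },
  { model: "model-a", harness: "goose", passType: "blind", passed: true },
  { model: "model-a", harness: "goose", passType: "informed", passed: true },
  { model: "model-a", harness: "opencode", passType: "informed", passed: false },
];

const breakdown = new Map<string, { blind: number | null; informed: number | null }>();
groupByModelHarness(items).forEach((group, key) => {
  breakdown.set(key, blindVsInformed(group));
});
console.log(breakdown);
```

Returning `null` for an empty group, rather than 0, keeps "no data" distinguishable from "every item failed" when the chart is rendered.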

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

AustinKelsay/plebdev-bench#4: Introduces shared failure outcome types (generationFailure, scoringFailure, frontierEvalFailure) and matrix result structures that align with this PR's schema extensions and dashboard component usage.

Poem

🐰 Behold the tools we now employ,
Charts and dialogs bring such joy!
Composite scores dance with care,
Tooltips whisper, details share,
Tool-smoke tests make sure we're blessed—
Hop along, we pass each test!

refine with real tool use and better nuanced metrics

10d5133

AustinKelsay merged commit 920a8f7 into main Jan 30, 2026
1 check was pending

This was referenced Feb 9, 2026

Feature/multi runtime #6

Merged

tighten todo prompt and export exceptions #8

Merged

coderabbitai Bot mentioned this pull request Feb 17, 2026

dashboard: add coverage diagnostics and improve run alignment #9

Merged

This was referenced Mar 4, 2026

feat: add test categories and catalog metadata #12

Merged

Feature/result checkpointing and aggregation #13

Merged

bench: add computer-use workspace tests #14

Merged

This was referenced Mar 20, 2026

Staging #15

Merged

Implement canonical machine profiles #18

Merged

Add trusted signal assessment to benchmark results #20

Merged

Staging #19

Merged

coderabbitai Bot mentioned this pull request Mar 30, 2026

Add dashboard test type filters and charts #22

Merged

AustinKelsay merged 1 commit into `main` from `scale-and-polish-phase`
