
refine with real tool use and better nuanced metrics #5

Merged
AustinKelsay merged 1 commit into main from scale-and-polish-phase on Jan 30, 2026

@AustinKelsay (Owner) commented Jan 30, 2026

Summary by CodeRabbit

  • New Features

    • Added dashboard charts: Composite Score Chart, Blind vs Informed Chart, Tooling Breakdown, and Failure Breakdown for enhanced analytics.
    • Added informational tooltips across dashboard components for improved user guidance.
    • Added Dimension Detail Dialog for deeper data exploration.
    • Enabled tool-calling support in harnesses for improved code generation.
    • Added tool-smoke test for preflight verification.
  • Improvements

    • Enhanced test prompts to support tool and file-based workflows.


@coderabbitai (Bot) commented Jan 30, 2026

Caution

Review failed

The pull request is closed.

Walkthrough

This PR introduces tool-calling support to harnesses with file-based code generation, adds comprehensive dashboard charts and drill-down dialogs with tooltips, expands the data aggregation layer with composite metrics and failure statistics, implements a tool-smoke preflight test, and updates test prompts to permit tool usage.
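
The preflight ordering mentioned above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `TOOL_SMOKE_TEST_SLUG` and `isToolSmokeTest` are names taken from the change list, but the slug value (guessed from the `src/tests/tool-smoke/` directory) and the `prioritizeToolSmoke` helper are assumptions.

```typescript
// Assumption: the slug value is a guess based on the src/tests/tool-smoke/ directory name.
const TOOL_SMOKE_TEST_SLUG = "tool-smoke";

// Name taken from the PR; this body is an assumed implementation.
function isToolSmokeTest(slug: string): boolean {
  return slug === TOOL_SMOKE_TEST_SLUG;
}

// Hypothetical helper: moves the tool-smoke test to the front and keeps
// every other test in its original relative order.
function prioritizeToolSmoke(slugs: readonly string[]): string[] {
  const smoke = slugs.filter((slug) => isToolSmokeTest(slug));
  const rest = slugs.filter((slug) => !isToolSmokeTest(slug));
  return [...smoke, ...rest];
}

const ordered = prioritizeToolSmoke([
  "smoke",
  "todo-app",
  "tool-smoke",
  "calculator-basic",
]);
console.log(ordered);
```

Running the preflight first means a harness whose tool calls are not working can be flagged before the longer tests start.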

Changes

| Cohort / File(s) | Summary |
|---|---|
| Dependencies: `apps/dashboard/package.json` | Added Radix UI components (dialog, select, slot, tabs, tooltip), utility libraries (class-variance-authority, clsx, lucide-react), and removed duplicates. |
| Dashboard Charts: `apps/dashboard/src/components/charts/composite-score-chart.tsx`, `blind-vs-informed-chart.tsx`, `frontier-eval-scatter.tsx`, `timing-distribution.tsx` | Introduced CompositeScoreChart and BlindVsInformedChart for multi-metric visualization; added tooltips to frontier and timing distribution charts. |
| Dashboard Breakdowns & Dialogs: `apps/dashboard/src/components/run-detail/failure-breakdown.tsx`, `tooling-breakdown.tsx`, `dimension-detail-dialog.tsx` | Added FailureBreakdown and ToolingBreakdown components for failure and tool statistics; introduced DimensionDetailDialog for drill-down exploration with sub-dimension tables. |
| Dashboard Utilities: `apps/dashboard/src/components/ui/info-tooltip.tsx`, `apps/dashboard/src/lib/tooltip-content.ts` | Created reusable InfoTooltip and WithInfoTooltip components; centralized tooltip content as typed constants for all dashboard sections. |
| Dashboard Integration: `apps/dashboard/src/components/run-detail/run-detail-page.tsx`, `compare-summary.tsx`, `matrix-table.tsx`, `item-detail-dialog.tsx`, `scoring-breakdown.tsx`, `timing-stats.tsx` | Integrated tooltips across UI headers and labels; replaced PassRateChart with BlindVsInformedChart; added CompositeScoreChart with dimension click handling; included new breakdowns and dialogs. |
| Data Aggregations: `apps/dashboard/src/lib/aggregations.ts` | Added groupByModelHarness function; introduced CompositeMetrics, BlindInformedBreakdown, FailureStats, ToolUseStats, and related computation functions for composite scoring, tool statistics, and failure analysis. |
| Type Extensions: `apps/dashboard/src/lib/types.ts`, `src/lib/types.ts` (via schemas) | Extended GenerationFailureType with "tool_missing"; added codeFilePath to GenerationResult; introduced ExtractionMethod union, TOOL_SMOKE_TEST_SLUG constant, and isToolSmokeItem utility. |
| Tool-Calling Harness Implementation: `src/harnesses/tool-prompt.ts`, `goose-adapter.ts`, `opencode-adapter.ts` | Introduced ToolPromptConfig and buildToolPrompt for tool-first prompts; refactored Goose and OpenCode adapters to run in dedicated work directories, parse tool-call outputs, write code to solution.ts, and handle tool-missing errors. |
| Harness Infrastructure: `src/harnesses/harness.ts`, `ollama-adapter.ts` | Added TOOL_CALLING_HARNESS_NAMES constant and ToolCallingHarnessName type; extended GenerateResult with optional codeFilePath; added OLLAMA_PROMPT_PREFIX for consistent markdown output formatting. |
| Code Extraction & Scoring: `src/lib/code-extractor.ts`, `scorer.ts`, `failure-classifier.ts` | Extended extraction methods to include "file"; added XML tag stripping and code-end-pattern detection; updated scorer to read code from codeFilePath when available; added "tool_missing" classification logic. |
| Runner & Executor: `src/runner/index.ts`, `item-executor.ts`, `plan-builder.ts` | Removed OpenCode server pre-warming; added tool-smoke preflight handling with status tracking; reordered tests to prioritize tool-smoke; updated item execution to propagate codeFilePath and support file-based scoring; implemented tool-smoke pass-type selection. |
| Tool-Smoke Test Utilities: `src/lib/tool-smoke.ts` | Added TOOL_SMOKE_TEST_SLUG constant, isToolSmokeTest predicate, and selectToolSmokePassType utility for tool-smoke preflight orchestration. |
| Schemas: `src/schemas/common.schema.ts`, `result.schema.ts`, `scoring.schema.ts` | Extended generationFailureTypes to include "tool_missing"; added codeFilePath field to GenerationResultSchema; added "file" extraction method to ScoringResultSchema. |
| Tool-Smoke Test Definition: `src/tests/tool-smoke/README.md`, `prompt.blind.md`, `prompt.informed.md`, `rubric.md`, `scoring.spec.ts` | Added complete tool-smoke test specification with prompts, rubric, and scoring spec for preflight tool-calling validation. |
| Test Prompts: `src/tests/*/prompt.*.md` (calculator-basic, calculator-stateful, smoke, todo-app) | Removed "No explanations or tool/file usage." restriction from all test prompts to permit tool-based code generation. |
| Documentation: `llm/context/codebase-overview.md`, `llm/implementation/*.md`, `llm/project/*.md` | Updated project overview, design rules, tech stack, and implementation guides to reflect new tool-calling capabilities, dashboard components, aggregations, and tool-smoke test. |
| Test Results: `results/index.json` | Updated run metadata and aggregated summary statistics with new execution results. |
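
To make the Data Aggregations row more concrete, here is a rough sketch of grouping results by model and harness and computing a blind-versus-informed pass rate. The names `groupByModelHarness` and `BlindInformedBreakdown` appear in the PR; the `ItemResult` shape, the key format, the function bodies, and `computeBlindInformed` are assumptions made for illustration only.

```typescript
type PassType = "blind" | "informed";

// Hypothetical item shape; the real result type is defined in the PR's schemas.
interface ItemResult {
  model: string;
  harness: string;
  passType: PassType;
  passed: boolean;
}

// Assumed fields: a rate is null when no items of that pass type exist.
interface BlindInformedBreakdown {
  blindPassRate: number | null;
  informedPassRate: number | null;
}

// Buckets items under a "model/harness" key (key format is an assumption).
function groupByModelHarness(items: readonly ItemResult[]): Map<string, ItemResult[]> {
  const groups = new Map<string, ItemResult[]>();
  for (const item of items) {
    const key = `${item.model}/${item.harness}`;
    const bucket = groups.get(key) ?? [];
    bucket.push(item);
    groups.set(key, bucket);
  }
  return groups;
}

// Fraction of passing items, or null for an empty list.
function passRate(items: readonly ItemResult[]): number | null {
  if (items.length === 0) return null;
  return items.filter((item) => item.passed).length / items.length;
}

// Hypothetical helper that splits one group into blind and informed rates.
function computeBlindInformed(items: readonly ItemResult[]): BlindInformedBreakdown {
  return {
    blindPassRate: passRate(items.filter((item) => item.passType === "blind")),
    informedPassRate: passRate(items.filter((item) => item.passType === "informed")),
  };
}

const sampleItems: ItemResult[] = [
  { model: "model-a", harness: "goose", passType: "blind", passed: true },
  { model: "model-a", harness: "goose", passType: "blind", passed: false },
  { model: "model-a", harness: "goose", passType: "informed", passed: true },
  { model: "model-a", harness: "opencode", passType: "informed", passed: false },
];

for (const [key, group] of groupByModelHarness(sampleItems)) {
  console.log(key, computeBlindInformed(group));
}
```

Returning `null` rather than `0` for an empty pass type keeps "no data" distinguishable from "everything failed" when a chart renders the breakdown.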

Sequence Diagram(s)

sequenceDiagram
    actor User as User/Runner
    participant HarnessAdapter as Harness Adapter
    participant ToolCaller as Tool-Calling Process
    participant CodeExtractor as Code Extractor
    participant FileSystem as File System
    participant Scorer as Scorer
    
    User->>HarnessAdapter: generate(prompt, model)
    HarnessAdapter->>HarnessAdapter: buildToolPrompt(taskPrompt)
    HarnessAdapter->>ToolCaller: execute (Goose/OpenCode + stdin prompt)
    Note over ToolCaller: Runs with tool-calling enabled
    ToolCaller->>ToolCaller: parse tool-call JSON output
    ToolCaller->>FileSystem: write solution.ts via edit tool
    FileSystem-->>ToolCaller: file created
    ToolCaller-->>HarnessAdapter: return { output, codeFilePath }
    HarnessAdapter->>FileSystem: check solution.ts exists & valid
    alt file exists and valid
        HarnessAdapter-->>User: { codeFilePath: "path/to/solution.ts" }
    else file missing/empty
        HarnessAdapter-->>User: error (tool_missing)
    end
    
    User->>Scorer: scoreGeneration(output, codeFilePath)
    alt codeFilePath provided
        Scorer->>FileSystem: read code from codeFilePath
        FileSystem-->>CodeExtractor: code content
        CodeExtractor->>CodeExtractor: extract(code, method="file")
    else fallback to output
        CodeExtractor->>CodeExtractor: extract(output, method=heuristic)
    end
    CodeExtractor-->>Scorer: extracted code
    Scorer->>Scorer: load spec, execute tests
    Scorer-->>User: { passed, failed, ... }
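
The scoring branch in the diagram above (prefer the file at `codeFilePath`, otherwise fall back to the raw output) might look something like the sketch below. Only the `"file"` extraction method and the `codeFilePath` field come from the PR; `resolveCode`, `readIfExists`, the `"fenced"` and `"raw"` method names, and the fallback heuristic are illustrative assumptions.

```typescript
import { existsSync, readFileSync } from "node:fs";

// "file" comes from the PR; "fenced" and "raw" are placeholder method names.
type ExtractionMethod = "file" | "fenced" | "raw";

interface ExtractedCode {
  code: string;
  method: ExtractionMethod;
}

// Build the fence marker at runtime so this example can sit inside a fenced block.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "[\\w-]*\\n([\\s\\S]*?)" + FENCE);

// Returns the file's text, or null when nothing exists at that path.
function readIfExists(path: string): string | null {
  return existsSync(path) ? readFileSync(path, "utf8") : null;
}

// Assumed fallback heuristic: first fenced block if present, else the whole output.
function extractFromOutput(output: string): ExtractedCode {
  const match = FENCED_BLOCK.exec(output);
  return match
    ? { code: (match[1] ?? "").trim(), method: "fenced" }
    : { code: output.trim(), method: "raw" };
}

// Hypothetical resolver: a non-empty file wins, otherwise use the model output.
// The reader is injectable so the branch can be exercised without touching disk.
function resolveCode(
  output: string,
  codeFilePath?: string,
  read: (path: string) => string | null = readIfExists,
): ExtractedCode {
  if (codeFilePath) {
    const fromFile = read(codeFilePath);
    if (fromFile !== null && fromFile.trim() !== "") {
      return { code: fromFile.trim(), method: "file" };
    }
  }
  return extractFromOutput(output);
}

const demo = resolveCode("const answer = 42;");
console.log(demo.method, demo.code);
```

In this sketch an empty or missing file silently falls through to the output; the PR's adapters instead report a `tool_missing` failure at generation time, which this example does not model.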
sequenceDiagram
    actor User as User
    participant DashUI as Dashboard UI
    participant CompositeChart as CompositeScoreChart
    participant DataLayer as Aggregations
    participant DetailDialog as DimensionDetailDialog
    
    User->>DashUI: view run-detail page
    DashUI->>CompositeChart: render with items + onDimensionClick
    CompositeChart->>DataLayer: computeCompositeMetrics(items)
    DataLayer-->>CompositeChart: { metrics by model/harness/test }
    CompositeChart->>CompositeChart: render grouped bars (tabs)
    
    User->>CompositeChart: click bar (dimension)
    CompositeChart->>DashUI: onDimensionClick(dimension, name)
    DashUI->>DashUI: setSelectedDimension({dimension, name})
    DashUI->>DetailDialog: open with dimension, name, items
    DetailDialog->>DataLayer: filterItems(dimension, name)
    DataLayer-->>DetailDialog: filtered items
    DetailDialog->>DetailDialog: computeStats, aggregateBreakdowns
    DetailDialog->>DetailDialog: render summary, blind-vs-informed, sub-tables
    DetailDialog-->>User: detail view
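
The drill-down step in the diagram above (a bar click selects a dimension and name, then the dialog filters and summarizes items) can be sketched without any UI code. The three dimensions follow the "model/harness/test" grouping in the diagram; the `DetailItem` shape and the bodies of `filterItems` and `computeStats` are assumptions, not the dashboard's actual implementation.

```typescript
type Dimension = "model" | "harness" | "test";

// Hypothetical item shape used only for this sketch.
interface DetailItem {
  model: string;
  harness: string;
  test: string;
  passed: boolean;
}

// Mirrors the { dimension, name } pair passed to onDimensionClick in the diagram.
interface SelectedDimension {
  dimension: Dimension;
  name: string;
}

interface DetailStats {
  total: number;
  passed: number;
  passRate: number;
}

// Keeps only the items whose value on the selected dimension matches the name.
function filterItems(items: readonly DetailItem[], selected: SelectedDimension): DetailItem[] {
  return items.filter((item) => item[selected.dimension] === selected.name);
}

// Summarizes a filtered list; an empty list yields a pass rate of 0.
function computeStats(items: readonly DetailItem[]): DetailStats {
  const passed = items.filter((item) => item.passed).length;
  const passRate = items.length === 0 ? 0 : passed / items.length;
  return { total: items.length, passed, passRate };
}

const detailItems: DetailItem[] = [
  { model: "model-a", harness: "goose", test: "smoke", passed: true },
  { model: "model-a", harness: "opencode", test: "smoke", passed: false },
  { model: "model-b", harness: "goose", test: "todo-app", passed: true },
  { model: "model-b", harness: "goose", test: "smoke", passed: true },
];

const selection: SelectedDimension = { dimension: "harness", name: "goose" };
console.log(computeStats(filterItems(detailItems, selection)));
```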

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • AustinKelsay/plebdev-bench#4: Introduces shared failure outcome types (generationFailure, scoringFailure, frontierEvalFailure) and matrix result structures that align with this PR's schema extensions and dashboard component usage.

Poem

🐰 Behold the tools we now employ,
Charts and dialogs bring such joy!
Composite scores dance with care,
Tooltips whisper, details share,
Tool-smoke tests make sure we're blessed—
Hop along, we pass each test!
