refine with real tool use and better nuanced metrics #5
Merged
AustinKelsay merged 1 commit into main on Jan 30, 2026
Caution: Review failed. The pull request is closed.
📝 Walkthrough
This PR introduces tool-calling support to harnesses with file-based code generation, adds comprehensive dashboard charts and drill-down dialogs with tooltips, expands the data aggregation layer with composite metrics and failure statistics, implements a tool-smoke preflight test, and updates test prompts to permit tool usage.
Sequence Diagram(s)
sequenceDiagram
actor User as User/Runner
participant HarnessAdapter as Harness Adapter
participant ToolCaller as Tool-Calling Process
participant CodeExtractor as Code Extractor
participant FileSystem as File System
participant Scorer as Scorer
User->>HarnessAdapter: generate(prompt, model)
HarnessAdapter->>HarnessAdapter: buildToolPrompt(taskPrompt)
HarnessAdapter->>ToolCaller: execute (Goose/OpenCode + stdin prompt)
Note over ToolCaller: Runs with tool-calling enabled
ToolCaller->>ToolCaller: parse tool-call JSON output
ToolCaller->>FileSystem: write solution.ts via edit tool
FileSystem-->>ToolCaller: file created
ToolCaller-->>HarnessAdapter: return { output, codeFilePath }
HarnessAdapter->>FileSystem: check solution.ts exists & valid
alt file exists and valid
HarnessAdapter-->>User: { codeFilePath: "path/to/solution.ts" }
else file missing/empty
HarnessAdapter-->>User: error (tool_missing)
end
User->>Scorer: scoreGeneration(output, codeFilePath)
alt codeFilePath provided
Scorer->>FileSystem: read code from codeFilePath
FileSystem-->>CodeExtractor: code content
CodeExtractor->>CodeExtractor: extract(code, method="file")
else fallback to output
CodeExtractor->>CodeExtractor: extract(output, method=heuristic)
end
CodeExtractor-->>Scorer: extracted code
Scorer->>Scorer: load spec, execute tests
Scorer-->>User: { passed, failed, ... }
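The file check and the scorer's fallback in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `checkSolutionFile`, `extractFromOutput`, and `resolveCode` are hypothetical names, the result shapes are assumptions, and the "heuristic" path here (first fenced block, else the raw output) is only one plausible reading of the diagram. Only the `tool_missing` error label and the `"file"` versus heuristic distinction come from the diagram itself.

```typescript
import * as fs from "node:fs";

// Assumed result shapes; the PR's real types may differ.
type FileCheck =
  | { ok: true; codeFilePath: string }
  | { ok: false; error: "tool_missing" };

interface ExtractedCode {
  code: string;
  method: "file" | "heuristic";
}

// Built at runtime so this example does not embed a literal code fence.
const FENCE = "`".repeat(3);

// Adapter side: the tool run only counts if the file exists and is non-empty.
function checkSolutionFile(codeFilePath: string): FileCheck {
  const isFile = fs.existsSync(codeFilePath) && fs.statSync(codeFilePath).isFile();
  if (!isFile || fs.readFileSync(codeFilePath, "utf8").trim().length === 0) {
    return { ok: false, error: "tool_missing" };
  }
  return { ok: true, codeFilePath };
}

// Hypothetical heuristic: first fenced block in the output, else the whole output.
function extractFromOutput(output: string): string {
  const pattern = new RegExp(FENCE + "[a-zA-Z]*\\r?\\n([\\s\\S]*?)" + FENCE);
  const match = output.match(pattern);
  return (match?.[1] ?? output).trim();
}

// Scorer side: prefer the file written by the tool, fall back to the raw output.
function resolveCode(output: string, codeFilePath?: string): ExtractedCode {
  if (codeFilePath !== undefined && checkSolutionFile(codeFilePath).ok) {
    return { code: fs.readFileSync(codeFilePath, "utf8").trim(), method: "file" };
  }
  return { code: extractFromOutput(output), method: "heuristic" };
}
```

Reading from the file first means the score reflects what the tool call actually wrote to disk, while the fallback keeps harnesses without tool support scorable.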
sequenceDiagram
actor User as User
participant DashUI as Dashboard UI
participant CompositeChart as CompositeScoreChart
participant DataLayer as Aggregations
participant DetailDialog as DimensionDetailDialog
User->>DashUI: view run-detail page
DashUI->>CompositeChart: render with items + onDimensionClick
CompositeChart->>DataLayer: computeCompositeMetrics(items)
DataLayer-->>CompositeChart: { metrics by model/harness/test }
CompositeChart->>CompositeChart: render grouped bars (tabs)
User->>CompositeChart: click bar (dimension)
CompositeChart->>DashUI: onDimensionClick(dimension, name)
DashUI->>DashUI: setSelectedDimension({dimension, name})
DashUI->>DetailDialog: open with dimension, name, items
DetailDialog->>DataLayer: filterItems(dimension, name)
DataLayer-->>DetailDialog: filtered items
DetailDialog->>DetailDialog: computeStats, aggregateBreakdowns
DetailDialog->>DetailDialog: render summary, blind-vs-informed, sub-tables
DetailDialog-->>User: detail view
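To make the drill-down flow above concrete, here is a small sketch of grouping run items by a dimension and filtering them for the detail dialog. It is written under assumptions: the `RunItem` fields, the `passRateBy` helper, and the use of a plain pass rate as the metric are all hypothetical, since the page does not show how `computeCompositeMetrics` defines its composite score. `filterItems` borrows its name from the diagram, but the signature here is a guess.

```typescript
type Dimension = "model" | "harness" | "test";

// Assumed item shape; the real aggregation layer likely carries more fields.
interface RunItem {
  model: string;
  harness: string;
  test: string;
  passed: number;
  failed: number;
}

// Keep only the items belonging to the clicked bar (dimension + name).
function filterItems(items: RunItem[], dimension: Dimension, name: string): RunItem[] {
  return items.filter((item) => item[dimension] === name);
}

// Stand-in metric: passed / (passed + failed), grouped by the chosen dimension.
function passRateBy(items: RunItem[], dimension: Dimension): Record<string, number> {
  const totals: Record<string, { passed: number; total: number }> = {};
  for (const item of items) {
    const key = item[dimension];
    const entry = totals[key] ?? { passed: 0, total: 0 };
    entry.passed += item.passed;
    entry.total += item.passed + item.failed;
    totals[key] = entry;
  }
  const rates: Record<string, number> = {};
  for (const [key, { passed, total }] of Object.entries(totals)) {
    rates[key] = total === 0 ? 0 : passed / total;
  }
  return rates;
}
```

A chart tab for "model" would call `passRateBy(items, "model")` for its bars, and a click on one bar would pass the same dimension and name to `filterItems` to feed the dialog.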
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes