Add trusted signal assessment to benchmark results#20
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR introduces a comprehensive "signal assessment" system that classifies benchmark results as "tainted" (unreliable) or "trustworthy" (reliable) based on heuristics like confirmation-only outputs, tool-call payloads, and permission errors. Signal assessments flow through harness adapters, item execution, and stats aggregation to compute trusted metrics that exclude tainted rows, enabling clearer distinction between raw and high-confidence evaluation results. Changes
Sequence Diagram(s)sequenceDiagram
participant Harness as Harness Adapter<br/>(direct/goose/opencode)
participant Policy as Code-Only<br/>Output Policy
participant ItemExec as Item<br/>Executor
participant Scoring as Automated<br/>Scoring
participant Stats as Stats<br/>Aggregation
participant CLI as CLI<br/>Compare Output
Harness->>Policy: generate() → output
Policy->>Policy: evaluate output format<br/>detect taint reasons
Policy-->>Harness: CodeOnlyOutputDecision<br/>+ taintReasons
Harness->>Harness: finalize signal assessment<br/>from taintReasons
Harness-->>ItemExec: GenerateResult<br/>+ signalAssessment
ItemExec->>Scoring: run automated score
Scoring-->>ItemExec: AutomatedScore
ItemExec->>ItemExec: finalizeItemSignalAssessment()<br/>apply workspace taint logic
ItemExec-->>ItemExec: MatrixItemResult<br/>+ signalAssessment
ItemExec->>Stats: collect all items
Stats->>Stats: filter tainted items<br/>compute trustedResults
Stats->>Stats: calculate trustedScoring<br/>+ trustedFrontier
Stats-->>Stats: RunStats with signal,<br/>trustedScoring, trustedFrontier
Stats->>CLI: pass RunStats to comparison
CLI->>CLI: compute trustedScoringDelta<br/>on non-tainted pairs
CLI->>CLI: format output:<br/>"Raw" metrics + "Trusted" metrics
CLI-->>CLI: printSummary with<br/>Signal section
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@CodeRabbit full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
test/workspace-prompt-parity.test.ts (1)
57-97: Consider extracting a path helper to reduce repetition.The test logic and assertions are correct. The pattern of
path.join(process.cwd(), "src", "tests", "<name>", "prompt.blind.md")is repeated across both new tests and the existing test. A small helper could reduce duplication.♻️ Optional: Extract path helper
+const promptPath = (testName: string, variant: "blind" | "informed" = "blind") => + path.join(process.cwd(), "src", "tests", testName, `prompt.${variant}.md`); + describe("workspace prompt parity", () => { it("file-locator informed prompt includes every scored JSON field", () => { - const promptPath = path.join( - process.cwd(), - "src", - "tests", - "file-locator", - "prompt.informed.md", - ); - const prompt = fs.readFileSync(promptPath, "utf-8"); + const prompt = fs.readFileSync(promptPath("file-locator", "informed"), "utf-8"); expect(prompt).toContain("reports/found-values.json"); // ... }); it("blind workspace prompts expose required parent-directory creation", () => { - const fileDeletePrompt = fs.readFileSync( - path.join( - process.cwd(), - "src", - "tests", - "file-delete-smoke", - "prompt.blind.md", - ), - "utf-8", - ); + const fileDeletePrompt = fs.readFileSync(promptPath("file-delete-smoke"), "utf-8"); // ... });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/workspace-prompt-parity.test.ts` around lines 57 - 97, Extract a small helper function (e.g., getBlindPromptPath or buildTestPromptPath) to centralize the repeated path.join(process.cwd(), "src", "tests", <name>, "prompt.blind.md") logic and replace the three inline calls that set calculatorPrompt, emitterPrompt, and rateLimiterPrompt with calls to that helper; ensure the helper accepts the test folder name (e.g., "calculator-stateful", "event-emitter", "rate-limiter") and returns the full path string used by fs.readFileSync so the existing expect assertions remain unchanged.src/results/compare.ts (1)
240-271: Consider extracting a shared aggregation helper to reduce duplication.The trusted scoring/frontier calculations mirror the raw calculations (lines 195-230). While acceptable for clarity, a shared helper like
computeAggregateScoring(matchedItems, accessor)could reduce the ~50 lines of duplication.Also applies to: 274-298
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/results/compare.ts` around lines 240 - 271, The trusted scoring computation duplicates logic from the raw calculations; extract a reusable helper (e.g., computeAggregateScoring(matchedItems, accessor)) that accepts the matched array and an accessor to read automatedScore (for itemA/itemB) and returns { totalTests, passedTests, passRate }; replace the duplicated reduce/passRate logic for trustedItemsWithScoreA/trustedItemsWithScoreB and the corresponding raw/frontier blocks (the same pattern appears around trustedItemsWithScoreA/trustedItemsWithScoreB and the earlier raw calculations) to call this helper and compute passRateDelta/totalTestsDelta from returned values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/results/compare.ts`:
- Around line 240-271: The trusted scoring computation duplicates logic from the
raw calculations; extract a reusable helper (e.g.,
computeAggregateScoring(matchedItems, accessor)) that accepts the matched array
and an accessor to read automatedScore (for itemA/itemB) and returns {
totalTests, passedTests, passRate }; replace the duplicated reduce/passRate
logic for trustedItemsWithScoreA/trustedItemsWithScoreB and the corresponding
raw/frontier blocks (the same pattern appears around
trustedItemsWithScoreA/trustedItemsWithScoreB and the earlier raw calculations)
to call this helper and compute passRateDelta/totalTestsDelta from returned
values.
In `@test/workspace-prompt-parity.test.ts`:
- Around line 57-97: Extract a small helper function (e.g., getBlindPromptPath
or buildTestPromptPath) to centralize the repeated path.join(process.cwd(),
"src", "tests", <name>, "prompt.blind.md") logic and replace the three inline
calls that set calculatorPrompt, emitterPrompt, and rateLimiterPrompt with calls
to that helper; ensure the helper accepts the test folder name (e.g.,
"calculator-stateful", "event-emitter", "rate-limiter") and returns the full
path string used by fs.readFileSync so the existing expect assertions remain
unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 058a8150-a08b-4338-877f-713fabe13186
📒 Files selected for processing (28)
src/cli/compare-command.tssrc/harnesses/code-output-policy.tssrc/harnesses/direct-adapter.tssrc/harnesses/goose-adapter.tssrc/harnesses/harness.tssrc/harnesses/opencode-adapter.tssrc/lib/signal-assessment.tssrc/lib/stats-format.tssrc/lib/stats.tssrc/results/compare.tssrc/runner/generation-retry.tssrc/runner/item-executor.tssrc/runner/item-retry.tssrc/schemas/common.schema.tssrc/schemas/index.tssrc/schemas/result.schema.tssrc/tests/calculator-stateful/prompt.blind.mdsrc/tests/event-emitter/prompt.blind.mdsrc/tests/file-delete-smoke/prompt.blind.mdsrc/tests/rate-limiter/prompt.blind.mdsrc/tests/safe-cleanup/prompt.blind.mdtest/code-output-policy.test.tstest/compare.test.tstest/goose-adapter.test.tstest/opencode-adapter.test.tstest/signal-assessment.test.tstest/stats.test.tstest/workspace-prompt-parity.test.ts
Summary by CodeRabbit
New Features
Documentation