
Add trusted signal assessment to benchmark results #20

Merged
AustinKelsay merged 2 commits into staging from codex/benchmark-signal-cleanup on Mar 24, 2026

Owner

@AustinKelsay AustinKelsay commented Mar 24, 2026

Summary by CodeRabbit

  • New Features

    • Added signal assessment tracking to identify and flag potentially unreliable generated outputs.
    • Comparison reports now show "Raw" metrics alongside "Trusted" metrics; the trusted figures exclude flagged outputs.
    • New Signal section displays tainted row counts and breakdown by harness.
  • Documentation

    • Updated benchmark test specifications with explicit method signatures and behavioral constraints.


vercel Bot commented Mar 24, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| plebdev-bench-dashboard | Ready | Preview, Comment | Mar 24, 2026 10:10pm |



coderabbitai Bot commented Mar 24, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 53cf2bc3-301f-4f11-adcf-5dd61533d347

Walkthrough

This PR introduces a comprehensive "signal assessment" system that classifies benchmark results as "tainted" (unreliable) or "trustworthy" (reliable) based on heuristics like confirmation-only outputs, tool-call payloads, and permission errors. Signal assessments flow through harness adapters, item execution, and stats aggregation to compute trusted metrics that exclude tainted rows, enabling clearer distinction between raw and high-confidence evaluation results.
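To make the classification step concrete, here is a minimal sketch of how the three heuristics named above could feed a single assessment. The type names `SignalAssessment`, `SignalAssessmentClassification`, and `SignalAssessmentReason` come from the PR summary; everything else (the field names, the reason strings, the regexes, and the `assessSignal` helper) is assumed for illustration and is not taken from `src/lib/signal-assessment.ts`.

```typescript
// Hypothetical shapes: the PR names these types, but this page does not show their fields.
type SignalAssessmentClassification = "trustworthy" | "tainted";
type SignalAssessmentReason =
	| "confirmation-only-output"
	| "tool-call-payload"
	| "permission-error";

interface SignalAssessment {
	classification: SignalAssessmentClassification;
	reasons: SignalAssessmentReason[];
}

// Assumed heuristic: a short reply that claims completion but contains no code-like punctuation.
function looksLikeConfirmationOnly(output: string): boolean {
	const text = output.trim();
	if (text.length === 0 || text.length > 200) return false;
	const hasCodeSyntax = /[{};=]/.test(text);
	const claimsDone = /\b(done|completed|created|implemented|saved|written)\b/i.test(text);
	return claimsDone && !hasCodeSyntax;
}

// Assumed heuristic: the whole output parses as a JSON object with "name" and "arguments" keys.
function looksLikeToolCallPayload(output: string): boolean {
	try {
		const parsed: unknown = JSON.parse(output.trim());
		return (
			typeof parsed === "object" &&
			parsed !== null &&
			"name" in parsed &&
			"arguments" in parsed
		);
	} catch {
		return false; // not JSON at all, so not a tool-call payload
	}
}

// Collect every matching reason; any reason at all marks the result as tainted.
function assessSignal(output: string, stderr = ""): SignalAssessment {
	const reasons: SignalAssessmentReason[] = [];
	if (looksLikeConfirmationOnly(output)) reasons.push("confirmation-only-output");
	if (looksLikeToolCallPayload(output)) reasons.push("tool-call-payload");
	if (/permission denied|EACCES/i.test(stderr)) reasons.push("permission-error");
	return {
		classification: reasons.length > 0 ? "tainted" : "trustworthy",
		reasons,
	};
}
```

Keeping the reasons as a list rather than a single flag lets later stages (adapters, the item executor) append further reasons before the classification is finalized, which matches the construct/append/finalize flow the summary describes.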

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Signal Assessment Core**<br>`src/schemas/common.schema.ts`, `src/schemas/index.ts`, `src/lib/signal-assessment.ts` | Added new exported types `SignalAssessment`, `SignalAssessmentClassification`, `SignalAssessmentReason` with validation; implemented helpers to construct/append/finalize signal assessments and detect confirmation-only outputs, tool-call payloads, and taint eligibility. |
| **Harness Result Schema**<br>`src/schemas/result.schema.ts`, `src/harnesses/harness.ts` | Extended `MatrixItemResultSchema` and `GenerateResult` interface with optional `signalAssessment` field to propagate signal assessments from adapters through the harness boundary. |
| **Code-Only Output Policy**<br>`src/harnesses/code-output-policy.ts` | Extended `CodeOnlyOutputDecision` with `taintReasons` field; added regex-based logic to classify accepted outputs into taint categories (output-contract violations, mixed prose, etc.) based on extraction method and output format. |
| **Direct & Workspace Adapters**<br>`src/harnesses/direct-adapter.ts`, `src/harnesses/goose-adapter.ts`, `src/harnesses/opencode-adapter.ts` | All three adapters now compute signal assessments from code-output policy decisions, stderr patterns (permission errors), and tool-call detection, returning `signalAssessment` in `GenerateResult`; direct adapter implements controlled retry flow when output validation fails. |
| **Item Execution & Retry Pipeline**<br>`src/runner/item-executor.ts`, `src/runner/item-retry.ts`, `src/runner/generation-retry.ts` | Extended execution pipeline to propagate `signalAssessment` from generation/retry results through finalization boundaries, applying additional taint logic at completion and error catch points. |
| **Stats & Comparison**<br>`src/lib/stats.ts`, `src/lib/stats-format.ts`, `src/results/compare.ts` | Added `SignalStats` and `TaintBreakdown` types; extended `RunStats` with `signal`, `trustedScoring`, `trustedFrontier` computed by filtering non-tainted items; updated `compareRuns` to compute trusted deltas and signal availability flags; enhanced formatting to display taint breakdowns and trusted metrics alongside raw metrics. |
| **CLI Output**<br>`src/cli/compare-command.ts` | Updated `printSummary` to prefix existing metrics with "Raw" and conditionally report trusted metrics (pass rate, avg score) when available, plus a new "Signal" section showing tainted/trusted item counts by harness. |
| **Test Fixtures & Contracts**<br>`src/tests/calculator-stateful/prompt.blind.md`, `src/tests/event-emitter/prompt.blind.md`, `src/tests/file-delete-smoke/prompt.blind.md`, `src/tests/rate-limiter/prompt.blind.md`, `src/tests/safe-cleanup/prompt.blind.md` | Tightened output contracts with explicit method signatures, return types, and non-negativity constraints; added guidance allowing creation of missing `reports/` directories. |
| **New & Updated Tests**<br>`test/code-output-policy.test.ts`, `test/compare.test.ts`, `test/goose-adapter.test.ts`, `test/opencode-adapter.test.ts`, `test/signal-assessment.test.ts`, `test/stats.test.ts`, `test/workspace-prompt-parity.test.ts` | Added assertions for `taintReasons` in policy tests; added regression tests for tool-call detection and permission-denial scenarios; introduced comprehensive signal-assessment helper tests; updated stats tests with `createRunStats` helper and trusted-metrics formatting assertions; added parity tests verifying prompt fixture contracts. |
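The Stats & Comparison change is the part that produces the "Raw" versus "Trusted" numbers. A rough sketch of that aggregation is below. The field names on `ItemResult`, the `passed`/`total` shape of the automated score, and the `summarizeRun` helper are all hypothetical; the sketch also assumes that an item with no `signalAssessment` (for example, a result recorded before this PR) counts as trusted, which the real `src/lib/stats.ts` may handle differently.

```typescript
// Hypothetical, simplified stand-in for MatrixItemResult.
interface ItemResult {
	harness: string;
	automatedScore?: { passed: number; total: number };
	signalAssessment?: { classification: "trustworthy" | "tainted" };
}

interface Scoring {
	totalTests: number;
	passedTests: number;
	passRate: number;
}

// Sum test counts over every item that has an automated score.
function aggregateScoring(items: ItemResult[]): Scoring {
	let totalTests = 0;
	let passedTests = 0;
	for (const item of items) {
		if (!item.automatedScore) continue;
		totalTests += item.automatedScore.total;
		passedTests += item.automatedScore.passed;
	}
	return { totalTests, passedTests, passRate: totalTests === 0 ? 0 : passedTests / totalTests };
}

// Raw scoring uses all items; trusted scoring drops the tainted ones.
function summarizeRun(items: ItemResult[]) {
	const isTainted = (item: ItemResult) =>
		item.signalAssessment?.classification === "tainted";
	const trusted = items.filter((item) => !isTainted(item));

	// Per-harness count of tainted rows, for the "Signal" section.
	const taintedByHarness: Record<string, number> = {};
	for (const item of items.filter(isTainted)) {
		taintedByHarness[item.harness] = (taintedByHarness[item.harness] ?? 0) + 1;
	}

	return {
		scoring: aggregateScoring(items),
		trustedScoring: aggregateScoring(trusted),
		signal: {
			taintedCount: items.length - trusted.length,
			trustedCount: trusted.length,
			taintedByHarness,
		},
	};
}
```

With one clean 4/4 result, one tainted 0/4 result, and one clean 2/4 result, the raw pass rate is 6/12 while the trusted pass rate is 6/8, which is the kind of gap the separate "Trusted" line is meant to expose.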

Sequence Diagram(s)

sequenceDiagram
    participant Harness as Harness Adapter<br/>(direct/goose/opencode)
    participant Policy as Code-Only<br/>Output Policy
    participant ItemExec as Item<br/>Executor
    participant Scoring as Automated<br/>Scoring
    participant Stats as Stats<br/>Aggregation
    participant CLI as CLI<br/>Compare Output

    Harness->>Policy: generate() → output
    Policy->>Policy: evaluate output format<br/>detect taint reasons
    Policy-->>Harness: CodeOnlyOutputDecision<br/>+ taintReasons

    Harness->>Harness: finalize signal assessment<br/>from taintReasons
    Harness-->>ItemExec: GenerateResult<br/>+ signalAssessment

    ItemExec->>Scoring: run automated score
    Scoring-->>ItemExec: AutomatedScore
    
    ItemExec->>ItemExec: finalizeItemSignalAssessment()<br/>apply workspace taint logic
    ItemExec-->>ItemExec: MatrixItemResult<br/>+ signalAssessment

    ItemExec->>Stats: collect all items
    Stats->>Stats: filter tainted items<br/>compute trustedResults
    Stats->>Stats: calculate trustedScoring<br/>+ trustedFrontier
    Stats-->>Stats: RunStats with signal,<br/>trustedScoring, trustedFrontier

    Stats->>CLI: pass RunStats to comparison
    CLI->>CLI: compute trustedScoringDelta<br/>on non-tainted pairs
    CLI->>CLI: format output:<br/>"Raw" metrics + "Trusted" metrics
    CLI-->>CLI: printSummary with<br/>Signal section

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Poem

🐰 A signal whispers through the code,
Trustworthy paths and tainted roads,
We filter out the muddied noise,
And keep the metrics we rejoice! 📊✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add trusted signal assessment to benchmark results' accurately summarizes the primary change: introducing signal assessment tracking (taint/trustworthy classification) with trusted metrics computation throughout the codebase. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |


@AustinKelsay
Owner Author

@CodeRabbit full review


coderabbitai Bot commented Mar 24, 2026

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
test/workspace-prompt-parity.test.ts (1)

57-97: Consider extracting a path helper to reduce repetition.

The test logic and assertions are correct. The pattern of path.join(process.cwd(), "src", "tests", "<name>", "prompt.blind.md") is repeated across both new tests and the existing test. A small helper could reduce duplication.

♻️ Optional: Extract path helper
+const promptPath = (testName: string, variant: "blind" | "informed" = "blind") =>
+	path.join(process.cwd(), "src", "tests", testName, `prompt.${variant}.md`);
+
 describe("workspace prompt parity", () => {
 	it("file-locator informed prompt includes every scored JSON field", () => {
-		const promptPath = path.join(
-			process.cwd(),
-			"src",
-			"tests",
-			"file-locator",
-			"prompt.informed.md",
-		);
-		const prompt = fs.readFileSync(promptPath, "utf-8");
+		const prompt = fs.readFileSync(promptPath("file-locator", "informed"), "utf-8");
 
 		expect(prompt).toContain("reports/found-values.json");
 		// ...
 	});
 
 	it("blind workspace prompts expose required parent-directory creation", () => {
-		const fileDeletePrompt = fs.readFileSync(
-			path.join(
-				process.cwd(),
-				"src",
-				"tests",
-				"file-delete-smoke",
-				"prompt.blind.md",
-			),
-			"utf-8",
-		);
+		const fileDeletePrompt = fs.readFileSync(promptPath("file-delete-smoke"), "utf-8");
 		// ...
 	});
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/workspace-prompt-parity.test.ts` around lines 57 - 97, Extract a small
helper function (e.g., getBlindPromptPath or buildTestPromptPath) to centralize
the repeated path.join(process.cwd(), "src", "tests", <name>, "prompt.blind.md")
logic and replace the three inline calls that set calculatorPrompt,
emitterPrompt, and rateLimiterPrompt with calls to that helper; ensure the
helper accepts the test folder name (e.g., "calculator-stateful",
"event-emitter", "rate-limiter") and returns the full path string used by
fs.readFileSync so the existing expect assertions remain unchanged.
src/results/compare.ts (1)

240-271: Consider extracting a shared aggregation helper to reduce duplication.

The trusted scoring/frontier calculations mirror the raw calculations (lines 195-230). While acceptable for clarity, a shared helper like computeAggregateScoring(matchedItems, accessor) could reduce the ~50 lines of duplication.

Also applies to: 274-298
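The suggested helper could look roughly like the following. This is a sketch only: `computeAggregateScoring` is the name proposed in the review, while the `MatchedPair` and `AutomatedScore` shapes (including the `passed` and `total` fields) are assumptions, since the actual types in `src/results/compare.ts` are not shown on this page.

```typescript
// Hypothetical shapes standing in for the real matched-item types.
interface AutomatedScore {
	passed: number;
	total: number;
}
interface ComparedItem {
	automatedScore?: AutomatedScore;
}
interface MatchedPair {
	itemA: ComparedItem;
	itemB: ComparedItem;
}

interface AggregateScoring {
	totalTests: number;
	passedTests: number;
	passRate: number;
}

// One reducer for both sides and for both raw and trusted sets:
// the accessor decides whether itemA or itemB is read from each pair.
function computeAggregateScoring(
	matchedItems: MatchedPair[],
	accessor: (pair: MatchedPair) => AutomatedScore | undefined,
): AggregateScoring {
	let totalTests = 0;
	let passedTests = 0;
	for (const pair of matchedItems) {
		const score = accessor(pair);
		if (!score) continue; // skip pairs where this side has no automated score
		totalTests += score.total;
		passedTests += score.passed;
	}
	const passRate = totalTests === 0 ? 0 : passedTests / totalTests;
	return { totalTests, passedTests, passRate };
}

// Deltas are then derived from two calls instead of two copies of the reduce logic.
function scoringDelta(matchedItems: MatchedPair[]) {
	const a = computeAggregateScoring(matchedItems, (pair) => pair.itemA.automatedScore);
	const b = computeAggregateScoring(matchedItems, (pair) => pair.itemB.automatedScore);
	return {
		passRateDelta: b.passRate - a.passRate,
		totalTestsDelta: b.totalTests - a.totalTests,
	};
}
```

The trusted variant would call the same helper on a pre-filtered list of non-tainted pairs, so the raw and trusted blocks differ only in their input.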

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/results/compare.ts` around lines 240 - 271, The trusted scoring
computation duplicates logic from the raw calculations; extract a reusable
helper (e.g., computeAggregateScoring(matchedItems, accessor)) that accepts the
matched array and an accessor to read automatedScore (for itemA/itemB) and
returns { totalTests, passedTests, passRate }; replace the duplicated
reduce/passRate logic for trustedItemsWithScoreA/trustedItemsWithScoreB and the
corresponding raw/frontier blocks (the same pattern appears around
trustedItemsWithScoreA/trustedItemsWithScoreB and the earlier raw calculations)
to call this helper and compute passRateDelta/totalTestsDelta from returned
values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 058a8150-a08b-4338-877f-713fabe13186

📥 Commits

Reviewing files that changed from the base of the PR and between 741ac1a and a0aa67a.

📒 Files selected for processing (28)
  • src/cli/compare-command.ts
  • src/harnesses/code-output-policy.ts
  • src/harnesses/direct-adapter.ts
  • src/harnesses/goose-adapter.ts
  • src/harnesses/harness.ts
  • src/harnesses/opencode-adapter.ts
  • src/lib/signal-assessment.ts
  • src/lib/stats-format.ts
  • src/lib/stats.ts
  • src/results/compare.ts
  • src/runner/generation-retry.ts
  • src/runner/item-executor.ts
  • src/runner/item-retry.ts
  • src/schemas/common.schema.ts
  • src/schemas/index.ts
  • src/schemas/result.schema.ts
  • src/tests/calculator-stateful/prompt.blind.md
  • src/tests/event-emitter/prompt.blind.md
  • src/tests/file-delete-smoke/prompt.blind.md
  • src/tests/rate-limiter/prompt.blind.md
  • src/tests/safe-cleanup/prompt.blind.md
  • test/code-output-policy.test.ts
  • test/compare.test.ts
  • test/goose-adapter.test.ts
  • test/opencode-adapter.test.ts
  • test/signal-assessment.test.ts
  • test/stats.test.ts
  • test/workspace-prompt-parity.test.ts

@AustinKelsay AustinKelsay merged commit 0573863 into staging Mar 24, 2026
3 checks passed
@coderabbitai coderabbitai Bot mentioned this pull request Mar 24, 2026