Add trusted signal assessment to benchmark results by AustinKelsay · Pull Request #20 · AustinKelsay/plebdev-bench

AustinKelsay · 2026-03-24T20:37:59Z

Summary by CodeRabbit

New Features
- Added signal assessment tracking to identify and flag potentially unreliable generated outputs.
- Comparison reports now show "Raw" metrics alongside "Trusted" metrics, which exclude flagged outputs.
- New Signal section displays tainted row counts and breakdown by harness.

Documentation
- Updated benchmark test specifications with explicit method signatures and behavioral constraints.

vercel · 2026-03-24T20:38:05Z

Project	Deployment	Actions	Updated (UTC)
plebdev-bench-dashboard	Ready	Preview, Comment	Mar 24, 2026 10:10pm

coderabbitai · 2026-03-24T20:38:07Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 53cf2bc3-301f-4f11-adcf-5dd61533d347

📝 Walkthrough

This PR introduces a comprehensive "signal assessment" system that classifies benchmark results as "tainted" (unreliable) or "trustworthy" (reliable) based on heuristics like confirmation-only outputs, tool-call payloads, and permission errors. Signal assessments flow through harness adapters, item execution, and stats aggregation to compute trusted metrics that exclude tainted rows, enabling clearer distinction between raw and high-confidence evaluation results.
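As a rough illustration of the classification described above, the sketch below models an assessment as a classification plus a list of reasons. Only the type names `SignalAssessment`, `SignalAssessmentClassification`, and `SignalAssessmentReason` come from this PR; the field names, the reason strings, the helper names, and the confirmation-only heuristic are assumptions made for the example and may differ from `src/lib/signal-assessment.ts`.

```typescript
// Hypothetical shapes: the type names appear in the PR, the fields and values do not.
type SignalAssessmentClassification = "trustworthy" | "tainted";
type SignalAssessmentReason =
	| "confirmation-only-output"
	| "tool-call-payload"
	| "permission-error";

interface SignalAssessment {
	classification: SignalAssessmentClassification;
	reasons: SignalAssessmentReason[];
}

// Start from a clean assessment with no taint reasons (helper name is an assumption).
function createSignalAssessment(): SignalAssessment {
	return { classification: "trustworthy", reasons: [] };
}

// Append a reason without duplicating it; any recorded reason taints the row.
function appendReason(
	assessment: SignalAssessment,
	reason: SignalAssessmentReason,
): SignalAssessment {
	const reasons =
		assessment.reasons.indexOf(reason) !== -1
			? assessment.reasons
			: [...assessment.reasons, reason];
	return {
		classification: reasons.length > 0 ? "tainted" : "trustworthy",
		reasons,
	};
}

// Illustrative heuristic only: a short reply that opens with an acknowledgement
// and contains no fenced code is treated as confirmation-only.
function looksConfirmationOnly(output: string): boolean {
	const text = output.trim();
	if (text.length === 0 || text.length > 200) return false;
	if (text.indexOf("```") !== -1) return false;
	return /^(done|ok(ay)?|sure|i('ve| have) (created|written|updated|completed))\b/i.test(text);
}
```

With these assumed helpers, an output such as "Done. I have created the file." would be flagged, while a reply that consists of code would not.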

Changes

Cohort / File(s)	Summary
Signal Assessment Core `src/schemas/common.schema.ts`, `src/schemas/index.ts`, `src/lib/signal-assessment.ts`	Added new exported types `SignalAssessment`, `SignalAssessmentClassification`, `SignalAssessmentReason` with validation; implemented helpers to construct/append/finalize signal assessments and detect confirmation-only outputs, tool-call payloads, and taint eligibility.
Harness Result Schema `src/schemas/result.schema.ts`, `src/harnesses/harness.ts`	Extended `MatrixItemResultSchema` and `GenerateResult` interface with optional `signalAssessment` field to propagate signal assessments from adapters through the harness boundary.
Code-Only Output Policy `src/harnesses/code-output-policy.ts`	Extended `CodeOnlyOutputDecision` with `taintReasons` field; added regex-based logic to classify accepted outputs into taint categories (output-contract violations, mixed prose, etc.) based on extraction method and output format.
Direct & Workspace Adapters `src/harnesses/direct-adapter.ts`, `src/harnesses/goose-adapter.ts`, `src/harnesses/opencode-adapter.ts`	All three adapters now compute signal assessments from code-output policy decisions, stderr patterns (permission errors), and tool-call detection, returning `signalAssessment` in `GenerateResult`; direct adapter implements controlled retry flow when output validation fails.
Item Execution & Retry Pipeline `src/runner/item-executor.ts`, `src/runner/item-retry.ts`, `src/runner/generation-retry.ts`	Extended execution pipeline to propagate `signalAssessment` from generation/retry results through finalization boundaries, applying additional taint logic at completion and error catch points.
Stats & Comparison `src/lib/stats.ts`, `src/lib/stats-format.ts`, `src/results/compare.ts`	Added `SignalStats` and `TaintBreakdown` types; extended `RunStats` with `signal`, `trustedScoring`, `trustedFrontier` computed by filtering non-tainted items; updated `compareRuns` to compute trusted deltas and signal availability flags; enhanced formatting to display taint breakdowns and trusted metrics alongside raw metrics.
CLI Output `src/cli/compare-command.ts`	Updated `printSummary` to prefix existing metrics with "Raw" and conditionally report trusted metrics (pass rate, avg score) when available, plus a new "Signal" section showing tainted/trusted item counts by harness.
Test Fixtures & Contracts `src/tests/calculator-stateful/prompt.blind.md`, `src/tests/event-emitter/prompt.blind.md`, `src/tests/file-delete-smoke/prompt.blind.md`, `src/tests/rate-limiter/prompt.blind.md`, `src/tests/safe-cleanup/prompt.blind.md`	Tightened output contracts with explicit method signatures, return types, and non-negativity constraints; added guidance allowing creation of missing `reports/` directories.
New & Updated Tests `test/code-output-policy.test.ts`, `test/compare.test.ts`, `test/goose-adapter.test.ts`, `test/opencode-adapter.test.ts`, `test/signal-assessment.test.ts`, `test/stats.test.ts`, `test/workspace-prompt-parity.test.ts`	Added assertions for `taintReasons` in policy tests; added regression tests for tool-call detection and permission-denial scenarios; introduced comprehensive signal-assessment helper tests; updated stats tests with `createRunStats` helper and trusted-metrics formatting assertions; added parity tests verifying prompt fixture contracts.
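The Stats & Comparison row above can be pictured with a small aggregation sketch: raw metrics are computed over every item, trusted metrics over the non-tainted subset, and tainted rows are counted per harness. The names `trustedScoring` and `signal` mirror fields mentioned in the table; every other name and the item shape are assumptions, not the actual `src/lib/stats.ts` code.

```typescript
// Hypothetical item shape; `tainted` stands in for a tainted signalAssessment.
interface ItemResultSketch {
	harness: string;
	passedTests: number;
	totalTests: number;
	tainted: boolean;
}

interface ScoringSketch {
	totalTests: number;
	passedTests: number;
	passRate: number | null; // null when there is nothing to score
}

function scoreItems(items: ItemResultSketch[]): ScoringSketch {
	let totalTests = 0;
	let passedTests = 0;
	for (const item of items) {
		totalTests += item.totalTests;
		passedTests += item.passedTests;
	}
	return {
		totalTests,
		passedTests,
		passRate: totalTests > 0 ? passedTests / totalTests : null,
	};
}

function summarizeRun(items: ItemResultSketch[]) {
	// Trusted metrics exclude tainted rows; raw metrics keep everything.
	const trusted = items.filter((item) => !item.tainted);
	const taintedByHarness: Record<string, number> = {};
	for (const item of items) {
		if (item.tainted) {
			taintedByHarness[item.harness] = (taintedByHarness[item.harness] ?? 0) + 1;
		}
	}
	return {
		rawScoring: scoreItems(items),
		trustedScoring: scoreItems(trusted),
		signal: {
			taintedItems: items.length - trusted.length,
			trustedItems: trusted.length,
			taintedByHarness,
		},
	};
}
```

For example, three items scoring 4/4 (trusted), 0/4 (tainted), and 2/4 (trusted) give a raw pass rate of 6/12 and a trusted pass rate of 6/8, which is the kind of gap the "Raw" and "Trusted" lines are meant to expose.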

Sequence Diagram(s)

sequenceDiagram
    participant Harness as Harness Adapter<br/>(direct/goose/opencode)
    participant Policy as Code-Only<br/>Output Policy
    participant ItemExec as Item<br/>Executor
    participant Scoring as Automated<br/>Scoring
    participant Stats as Stats<br/>Aggregation
    participant CLI as CLI<br/>Compare Output

    Harness->>Policy: generate() → output
    Policy->>Policy: evaluate output format<br/>detect taint reasons
    Policy-->>Harness: CodeOnlyOutputDecision<br/>+ taintReasons

    Harness->>Harness: finalize signal assessment<br/>from taintReasons
    Harness-->>ItemExec: GenerateResult<br/>+ signalAssessment

    ItemExec->>Scoring: run automated score
    Scoring-->>ItemExec: AutomatedScore
    
    ItemExec->>ItemExec: finalizeItemSignalAssessment()<br/>apply workspace taint logic
    ItemExec-->>ItemExec: MatrixItemResult<br/>+ signalAssessment

    ItemExec->>Stats: collect all items
    Stats->>Stats: filter tainted items<br/>compute trustedResults
    Stats->>Stats: calculate trustedScoring<br/>+ trustedFrontier
    Stats-->>Stats: RunStats with signal,<br/>trustedScoring, trustedFrontier

    Stats->>CLI: pass RunStats to comparison
    CLI->>CLI: compute trustedScoringDelta<br/>on non-tainted pairs
    CLI->>CLI: format output:<br/>"Raw" metrics + "Trusted" metrics
    CLI-->>CLI: printSummary with<br/>Signal section
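
The first half of the diagram (policy decision, then adapter finalization) might look roughly like the sketch below. It merges taint reasons from the code-output policy with two adapter-side checks named in the walkthrough: permission errors on stderr and tool-call payloads in the output. The regex, the JSON key names, the reason strings, and the function names are illustrative assumptions; only the `signalAssessment` field name is taken from the PR.

```typescript
// Hypothetical result shape; the real GenerateResult has more fields.
interface GenerateResultSketch {
	output: string;
	signalAssessment?: {
		classification: "trustworthy" | "tainted";
		reasons: string[];
	};
}

// Assumed stderr patterns for permission failures.
const PERMISSION_ERROR = /permission denied|EACCES|operation not permitted/i;

// Assumed heuristic: the whole output parses as a JSON object that carries
// a tool-call style key instead of code.
function looksLikeToolCallPayload(output: string): boolean {
	try {
		const parsed: unknown = JSON.parse(output.trim());
		return (
			typeof parsed === "object" &&
			parsed !== null &&
			("tool_calls" in parsed || "tool" in parsed)
		);
	} catch {
		return false; // not JSON, so not a bare tool-call payload
	}
}

function finalizeAdapterResult(
	output: string,
	stderr: string,
	policyTaintReasons: string[],
): GenerateResultSketch {
	// Start from the policy's reasons, then add adapter-level findings.
	const reasons = [...policyTaintReasons];
	if (PERMISSION_ERROR.test(stderr)) reasons.push("permission-error");
	if (looksLikeToolCallPayload(output)) reasons.push("tool-call-payload");
	return {
		output,
		signalAssessment: {
			classification: reasons.length > 0 ? "tainted" : "trustworthy",
			reasons,
		},
	};
}
```

A clean code output with empty stderr and no policy reasons stays trustworthy under this sketch; any single finding is enough to mark the row as tainted.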

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

tighten todo prompt and export exceptions #8 — Related code-output policy and harness adapter changes implementing structured output decisions and retry flow.
refine with real tool use and better nuanced metrics #5 — Overlapping harness adapter and schema modifications for signal/taint propagation and tool-call handling.
Staging #15 — Shared harness adapter and signal assessment schema changes affecting GenerateResult and trusted-metrics computation.

Poem

🐰 A signal whispers through the code,
Trustworthy paths and tainted roads,
We filter out the muddied noise,
And keep the metrics we rejoice! 📊✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Add trusted signal assessment to benchmark results' accurately summarizes the primary change: introducing signal assessment tracking (taint/trustworthy classification) with trusted metrics computation throughout the codebase.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

AustinKelsay · 2026-03-24T20:38:12Z

@CodeRabbit full review

coderabbitai · 2026-03-24T20:38:18Z

✅ Actions performed

Full review triggered.

coderabbitai

🧹 Nitpick comments (2)

test/workspace-prompt-parity.test.ts (1)

57-97: Consider extracting a path helper to reduce repetition.

The test logic and assertions are correct. The pattern of `path.join(process.cwd(), "src", "tests", "<name>", "prompt.blind.md")` is repeated across both new tests and the existing test. A small helper could reduce duplication.

♻️ Optional: Extract path helper

+const promptPath = (testName: string, variant: "blind" | "informed" = "blind") =>
+	path.join(process.cwd(), "src", "tests", testName, `prompt.${variant}.md`);
+
 describe("workspace prompt parity", () => {
 	it("file-locator informed prompt includes every scored JSON field", () => {
-		const promptPath = path.join(
-			process.cwd(),
-			"src",
-			"tests",
-			"file-locator",
-			"prompt.informed.md",
-		);
-		const prompt = fs.readFileSync(promptPath, "utf-8");
+		const prompt = fs.readFileSync(promptPath("file-locator", "informed"), "utf-8");
 
 		expect(prompt).toContain("reports/found-values.json");
 		// ...
 	});
 
 	it("blind workspace prompts expose required parent-directory creation", () => {
-		const fileDeletePrompt = fs.readFileSync(
-			path.join(
-				process.cwd(),
-				"src",
-				"tests",
-				"file-delete-smoke",
-				"prompt.blind.md",
-			),
-			"utf-8",
-		);
+		const fileDeletePrompt = fs.readFileSync(promptPath("file-delete-smoke"), "utf-8");
 		// ...
 	});

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/workspace-prompt-parity.test.ts` around lines 57 - 97, Extract a small
helper function (e.g., getBlindPromptPath or buildTestPromptPath) to centralize
the repeated path.join(process.cwd(), "src", "tests", <name>, "prompt.blind.md")
logic and replace the three inline calls that set calculatorPrompt,
emitterPrompt, and rateLimiterPrompt with calls to that helper; ensure the
helper accepts the test folder name (e.g., "calculator-stateful",
"event-emitter", "rate-limiter") and returns the full path string used by
fs.readFileSync so the existing expect assertions remain unchanged.

src/results/compare.ts (1)

240-271: Consider extracting a shared aggregation helper to reduce duplication.

The trusted scoring/frontier calculations mirror the raw calculations (lines 195-230). While acceptable for clarity, a shared helper like `computeAggregateScoring(matchedItems, accessor)` could reduce the ~50 lines of duplication.

Also applies to: 274-298
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/results/compare.ts` around lines 240 - 271, The trusted scoring
computation duplicates logic from the raw calculations; extract a reusable
helper (e.g., computeAggregateScoring(matchedItems, accessor)) that accepts the
matched array and an accessor to read automatedScore (for itemA/itemB) and
returns { totalTests, passedTests, passRate }; replace the duplicated
reduce/passRate logic for trustedItemsWithScoreA/trustedItemsWithScoreB and the
corresponding raw/frontier blocks (the same pattern appears around
trustedItemsWithScoreA/trustedItemsWithScoreB and the earlier raw calculations)
to call this helper and compute passRateDelta/totalTestsDelta from returned
values.
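
One possible shape for the helper suggested above is sketched here. The name `computeAggregateScoring`, the accessor parameter, and the returned `{ totalTests, passedTests, passRate }` follow the review comment; the `automatedScore` fields (`passed`, `total`) and the pair shape are assumptions, since the actual types in `src/results/compare.ts` are not shown in this thread.

```typescript
// Hypothetical score and pair shapes for illustration.
interface AutomatedScoreSketch {
	passed: number;
	total: number;
}

interface ItemSketch {
	automatedScore?: AutomatedScoreSketch;
}

interface MatchedPair {
	itemA: ItemSketch;
	itemB: ItemSketch;
}

// Sum test counts for one side of each matched pair, skipping unscored items.
function computeAggregateScoring(
	matchedItems: MatchedPair[],
	accessor: (pair: MatchedPair) => AutomatedScoreSketch | undefined,
): { totalTests: number; passedTests: number; passRate: number } {
	let totalTests = 0;
	let passedTests = 0;
	for (const pair of matchedItems) {
		const score = accessor(pair);
		if (!score) continue;
		totalTests += score.total;
		passedTests += score.passed;
	}
	return {
		totalTests,
		passedTests,
		passRate: totalTests > 0 ? passedTests / totalTests : 0,
	};
}
```

The same function can then serve both the raw and the trusted blocks: call it once with `(pair) => pair.itemA.automatedScore` and once with `(pair) => pair.itemB.automatedScore`, then subtract the two results to get `passRateDelta` and `totalTestsDelta`.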

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 058a8150-a08b-4338-877f-713fabe13186

📥 Commits

Reviewing files that changed from the base of the PR and between 741ac1a and a0aa67a.

📒 Files selected for processing (28)

src/cli/compare-command.ts
src/harnesses/code-output-policy.ts
src/harnesses/direct-adapter.ts
src/harnesses/goose-adapter.ts
src/harnesses/harness.ts
src/harnesses/opencode-adapter.ts
src/lib/signal-assessment.ts
src/lib/stats-format.ts
src/lib/stats.ts
src/results/compare.ts
src/runner/generation-retry.ts
src/runner/item-executor.ts
src/runner/item-retry.ts
src/schemas/common.schema.ts
src/schemas/index.ts
src/schemas/result.schema.ts
src/tests/calculator-stateful/prompt.blind.md
src/tests/event-emitter/prompt.blind.md
src/tests/file-delete-smoke/prompt.blind.md
src/tests/rate-limiter/prompt.blind.md
src/tests/safe-cleanup/prompt.blind.md
test/code-output-policy.test.ts
test/compare.test.ts
test/goose-adapter.test.ts
test/opencode-adapter.test.ts
test/signal-assessment.test.ts
test/stats.test.ts
test/workspace-prompt-parity.test.ts

Add trusted signal assessment to benchmark results

a0aa67a

coderabbitai Bot reviewed Mar 24, 2026

Refactor compare aggregates and prompt path tests

4f0e104

vercel Bot deployed to Preview March 24, 2026 22:10 View deployment

AustinKelsay merged commit 0573863 into staging Mar 24, 2026
3 checks passed

coderabbitai Bot mentioned this pull request Mar 24, 2026

Staging #19

Merged

This was referenced Apr 5, 2026

[codex] bench: harden signal assessment and retry fairness #23

Merged

Staging #24

Open

coderabbitai Bot mentioned this pull request Apr 22, 2026

[codex] rebuild opencode harness #25

Merged

AustinKelsay merged 2 commits into staging from codex/benchmark-signal-cleanup