[codex] bench: harden signal assessment and retry fairness by AustinKelsay · Pull Request #23 · AustinKelsay/plebdev-bench

AustinKelsay · 2026-04-05T17:51:38Z

What changed

This hardens benchmark signal classification and harness normalization on staging without changing prompt contracts or workspace exact-match scoring.

- Added two new taint reasons: internal_tool_transcript and agent_requested_input
- Mirrored the new reason codes through the dashboard schemas/types
- Broadened post-scoring tainting so failed rows are tainted for transcript leakage, continuation prompts, confirmation-only output, and raw unevaluated tool payloads across scoring modes
- Changed OpenCode and Goose normalizers so parsed JSON/JSONL that contains no assistant artifact now normalizes to an empty payload instead of leaking raw protocol/status text downstream
- Updated Goose/OpenCode adapters to append transcript/input taint reasons from raw and normalized output
- Removed the harness gate on compile-feedback retry so direct code-module rows now use the same one-shot retry path as Goose/OpenCode
- Added unit coverage for the new normalizer behavior, signal-assessment behavior, and direct-harness compile retry
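
To make the normalizer change concrete, here is a minimal sketch of the parsed-but-empty rule. It is not the repository's implementation: the `ProtocolEvent` shape, the `assistant_text` event type, the `"raw"` method value, and both function names are assumptions for illustration. Only the `{ output: "", method: "json" }` result comes from the walkthrough in this PR.

```typescript
// Hypothetical event shape; the real Goose/OpenCode protocol events may differ.
interface ProtocolEvent {
  type: string;
  text?: string;
}

interface NormalizedOutput {
  output: string;
  method: "json" | "raw"; // "raw" is an assumed label for the plain-text path
}

// Parse each non-empty line as JSON and keep objects that carry a string `type`.
function parseJsonl(raw: string): ProtocolEvent[] {
  const events: ProtocolEvent[] = [];
  for (const line of raw.split("\n")) {
    const trimmed = line.trim();
    if (trimmed.length === 0) continue;
    try {
      const parsed = JSON.parse(trimmed) as Partial<ProtocolEvent> | null;
      if (parsed !== null && typeof parsed === "object" && typeof parsed.type === "string") {
        const event: ProtocolEvent = { type: parsed.type };
        if (typeof parsed.text === "string") event.text = parsed.text;
        events.push(event);
      }
    } catch {
      // Not JSON: skip the line; the caller falls back to raw only if nothing parsed.
    }
  }
  return events;
}

function normalizeOutput(raw: string): NormalizedOutput {
  const events = parseJsonl(raw);
  if (events.length === 0) {
    // Nothing parsed as protocol JSON, so treat the text as plain output.
    return { output: raw.trim(), method: "raw" };
  }
  const assistantText = events
    .filter((e) => e.type === "assistant_text")
    .map((e) => e.text ?? "")
    .join("");
  // Parsed but no assistant artifact: return "" rather than the raw transport text.
  return { output: assistantText, method: "json" };
}
```

With this shape, a stream that contains only status or tool events yields an empty payload, which downstream code can retry or taint instead of scoring protocol chatter.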

Why it changed

The previous run analysis showed that benchmark failures were not always clean model-quality signal.

Root causes addressed here:

- OpenCode/Goose could leak internal protocol transcripts or continuation prompts into scored output while still being treated as trustworthy failures
- Direct harness rows were excluded from compile-feedback retry, making cross-harness comparisons unfair on import/missing-export failures
- Parsed-but-empty JSON/JSONL payloads were falling back to raw transport text instead of being retried/tainted cleanly

Impact

- Failed rows caused by internal tool transcripts or agent continuation prompts should now be marked tainted instead of trustworthy
- OpenCode/Goose should stop surfacing raw protocol JSONL/status chatter as benchmark output
- Direct harness rows now participate in compile-feedback retry and record retryGenerationDurationMs when that path runs
- Dashboard validation continues to accept/render the new signal reason codes
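
The one-shot compile-feedback retry can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in src/runner/item-executor.ts: the callback types, the result shape, and the retry prompt wording are hypothetical, and generation is modelled as a synchronous callback to keep the sketch small (the real harness calls are presumably asynchronous). Only the `retryGenerationDurationMs` field name comes from this PR.

```typescript
interface GenerationResult {
  output: string;
  durationMs: number;
}

interface ItemResult {
  output: string;
  compiled: boolean;
  generationDurationMs: number;
  retryGenerationDurationMs?: number; // set only when the retry path runs
}

// Hypothetical callback shapes, identical for every harness.
type Generate = (prompt: string) => GenerationResult;
type Compile = (code: string) => { ok: boolean; error?: string };

function runWithCompileRetry(prompt: string, generate: Generate, compile: Compile): ItemResult {
  const first = generate(prompt);
  const firstCheck = compile(first.output);
  if (firstCheck.ok) {
    return { output: first.output, compiled: true, generationDurationMs: first.durationMs };
  }
  // Exactly one retry, fed with the compile error, with no harness-specific gate.
  const retryPrompt = `${prompt}\n\nThe previous attempt failed to compile:\n${firstCheck.error ?? "unknown error"}`;
  const second = generate(retryPrompt);
  return {
    output: second.output,
    compiled: compile(second.output).ok,
    generationDurationMs: first.durationMs,
    retryGenerationDurationMs: second.durationMs,
  };
}
```

Because the same function serves every harness, a direct row that fails on a missing export gets the same second chance as a Goose or OpenCode row, which is the fairness property this PR is after.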

Validation

Completed:

- bun test
- bun test test/hardware-profile.test.ts

Notes:

- bun test hit two unrelated failures in test/hardware-profile.test.ts, but that file passes when run in isolation. This appears to be a pre-existing order-dependent test issue outside the scope of this PR.
- I started the longer direct-harness benchmark validation matrix and confirmed that direct rows now enter compile-feedback retry and record retryGenerationDurationMs. I stopped the run early because the full targeted reruns would take hours end-to-end on this machine.
- The long benchmark reruns from the original plan are still pending for staging validation:
  - direct targeted matrix
  - goose targeted matrix
  - opencode targeted matrix
  - full post-fix ollama matrix

Reviewer focus

Please focus review on:

- signal-assessment reasoning and taint boundaries
- OpenCode/Goose normalization behavior for parsed-but-empty output
- direct compile-feedback retry fairness
- whether the remaining long benchmark reruns should be completed before merge from draft

Summary by CodeRabbit

New Features
- Added two signal assessment reasons: internal_tool_transcript and agent_requested_input to better classify transcript-only outputs and continuation prompts.
- Expanded detection to surface these conditions across adapters and normalization paths.
Bug Fixes
- More precise empty-output detection and improved error reporting when adapters produce no assistant content.
Tests
- Added/expanded tests covering output normalization, transcript-only cases, continuation prompts, and retry flows.

vercel · 2026-04-05T17:51:43Z

Project	Deployment	Actions	Updated (UTC)
plebdev-bench-dashboard	Ready	Preview, Comment	Apr 8, 2026 4:53pm

coderabbitai · 2026-04-05T17:51:45Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3b0fd435-3d7d-4177-b90d-a63bff25038f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Adds two new signal-assessment reasons ("internal_tool_transcript", "agent_requested_input"), detectors and helpers to surface them, updates output normalization to treat parsed-but-empty JSON as empty, and plumbs transcript/input taint reasons through harness adapters and the item finalization flow via an explicit rowFailed parameter.

Changes

Cohort / File(s)	Summary
Schema & Types `apps/dashboard/src/lib/schemas.ts`, `apps/dashboard/src/lib/types.ts`, `src/schemas/common.schema.ts`	Added `"internal_tool_transcript"` and `"agent_requested_input"` to the SignalAssessmentReason enum/constant and updated exported `SignalAssessmentReason` type.
Signal assessment utilities `src/lib/signal-assessment.ts`	Added `isInternalToolTranscriptOutput`, `isAgentRequestedInputOutput`, `getTranscriptOrInputTaintReasons`; refactored `finalizeItemSignalAssessment` to take optional `rowFailed?: boolean` and always include transcript/input taint reasons.
Goose adapter & output normalization `src/harnesses/goose-adapter.ts`, `src/harnesses/goose-output.ts`	Extract transcript/input taint reasons from raw and normalized outputs (deduplicated), compute/pass `signalAssessment` through results, tighten fast-empty-output guard, and return `{ output: "", method: "json" }` when parsed JSON yields no assistant text.
OpenCode adapter & output normalization `src/harnesses/opencode-adapter.ts`, `src/harnesses/opencode-output.ts`	Merge stderr permission-detection with transcript/input taint reasons (stdout/stderr/normalized), build/propagate `signalAssessment` including merged reasons, include `output` and `durationMs` in error details, and return `{ output: "", method: "json" }` for parsed-but-empty JSON/JSONL.
Runner integration & retries `src/runner/item-executor.ts`	Calls to `finalizeItemSignalAssessment` updated to pass explicit `rowFailed`; compile-retry support broadened for all `code-module` scoring modes; failure paths updated to set `rowFailed: true`.
Tests `test/dashboard-schema-compat.test.ts`, `test/goose-adapter.test.ts`, `test/goose-output.test.ts`, `test/opencode-adapter.test.ts`, `test/opencode-output.test.ts`, `test/signal-assessment.test.ts`, `test/item-retry.test.ts`	Added/updated tests covering new reason codes, detectors, normalization edge cases (parsed-but-empty JSON/JSONL), adapter taint propagation, finalize behavior with `rowFailed`, and compile-retry flows.
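
As a rough illustration of the "raw and normalized outputs (deduplicated)" step in the adapter rows above, the sketch below merges taint reasons from several text sources. The `Detector` type, the `collectTaintReasons` helper, and the placeholder detector are assumptions made for this example; the repository's `getTranscriptOrInputTaintReasons` may have a different signature and different matching rules.

```typescript
type TaintReason = "internal_tool_transcript" | "agent_requested_input";

// Hypothetical detector signature: text in, zero or more reasons out.
type Detector = (text: string) => TaintReason[];

// Run the detector over every available source and keep each reason once,
// in first-seen order. Missing or empty sources are skipped.
function collectTaintReasons(detect: Detector, ...sources: Array<string | undefined>): TaintReason[] {
  const seen = new Set<TaintReason>();
  for (const source of sources) {
    if (!source) continue;
    for (const reason of detect(source)) {
      seen.add(reason);
    }
  }
  return Array.from(seen);
}

// Placeholder detector for the example only; not the PR's real patterns.
const toyDetector: Detector = (text) => {
  const reasons: TaintReason[] = [];
  if (/"type"\s*:\s*"tool_/.test(text)) reasons.push("internal_tool_transcript");
  if (/please confirm to continue/i.test(text)) reasons.push("agent_requested_input");
  return reasons;
};

const rawStdout = '{"type":"tool_use","name":"write"}';
const normalized = "Please confirm to continue.";
const merged = collectTaintReasons(toyDetector, rawStdout, undefined, rawStdout, normalized);
```

Checking both the raw streams and the normalized text matters here because protocol-only output now normalizes to an empty string, so the transcript evidence exists only in the raw stream.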

Sequence Diagram(s)

sequenceDiagram
  participant Harness as Harness (goose/opencode)
  participant Normalizer as Output Normalizer
  participant Detector as SignalAssessment Detector
  participant Runner as Item Executor / Finalizer
  participant Schema as Schema/Types

  Harness->>Normalizer: produce raw stdout/stderr
  Normalizer->>Normalizer: parse JSON/JSONL or raw
  Normalizer-->>Detector: normalized output (may be "")
  Detector->>Detector: detect transcript/input taint reasons
  Detector-->>Runner: signalAssessment reasons (if any)
  Runner->>Schema: finalizeItemSignalAssessment(rowFailed?, reasons)
  Runner-->>Harness: return GenerateResult (output + signalAssessment)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Add trusted signal assessment to benchmark results #20: Extends the same signal-assessment feature and touches the same helper/adapter code paths.
Staging #19: Adds/updates SignalAssessment schema/types that this PR further expands with new reason codes.
dashboard layout improvements #21: Interacts with src/lib/signal-assessment.ts and related test/asset integration referenced by these changes.

Poem

🐰 I sniffed the transcript, found a trace,

"agent_requested_input" hopped into place.
Taint and prompts I gently flag,
Across adapters, guard, and test-tag.
Hooray — small reasons, big embrace! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 76.92% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title '[codex] bench: harden signal assessment and retry fairness' directly maps to the PR's core objectives: hardening signal assessment classification with new taint reasons and improving retry fairness across harness types.

AustinKelsay · 2026-04-05T18:59:36Z

@coderabbitai review

coderabbitai · 2026-04-05T18:59:43Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/harnesses/opencode-adapter.ts (1)

446-454: ⚠️ Potential issue | 🟠 Major

Use the pre-normalization boundary assessment in the fast-empty guard.

Protocol-only JSON/JSONL now normalizes to "", so this branch can still throw the "model ... may not be recognized" error for fast runs even when raw stdout/stderr already produced a tainted signalAssessment (for example internal_tool_transcript). That bypasses the retry/taint path this PR is trying to preserve for parsed-but-empty output.

💡 Suggested fix

 				if (!codeFilePath) {
 					if (
 						durationMs < 2000 &&
 						(!output || output.trim().length < MIN_OUTPUT_LENGTH) &&
-						normalizedReasons.length === 0
+						normalizedSignalAssessment?.classification !== "tainted"
 					) {
 						throw new Error(
 							`OpenCode returned empty output instantly (${durationMs}ms) - model "${model}" may not be recognized by OpenCode`,
 						);
 					}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/opencode-adapter.ts` around lines 446 - 454, The fast-empty
guard currently checks normalizedReasons.length === 0 (post-normalization) which
can miss pre-normalization taints; update the condition to also require that the
pre-normalization boundary assessment (the variable representing the raw signal
assessment, e.g., signalAssessment or preNormalizationBoundaryAssessment)
indicates no taint before throwing. In the if block that checks codeFilePath,
durationMs, output and normalizedReasons, add a check that the pre-normalization
assessment is absent/empty (for example signalAssessment == null ||
signalAssessment.reasons?.length === 0) so the error is only thrown when both
pre- and post-normalization show no taint and preserve the retry/taint path when
the raw stdout/stderr had a taint.
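
One way to read this suggestion is as a pure predicate that can be unit-tested apart from the adapter. The sketch below is an assumption-laden illustration: the `SignalAssessment` shape, the `MIN_OUTPUT_LENGTH` value, and the function and parameter names are hypothetical; only the 2000 ms threshold, the `normalizedReasons` check, and the `"tainted"` classification appear in the comment above.

```typescript
// Assumed shape; the repository's SignalAssessment type may carry more fields.
interface SignalAssessment {
  classification: "trustworthy" | "tainted";
  reasons: string[];
}

const FAST_RUN_MS = 2000; // threshold taken from the suggested diff
const MIN_OUTPUT_LENGTH = 10; // assumed value; the real constant is not shown in this PR

// Returns true only when the run was fast, produced effectively no output,
// and neither the normalized nor the raw (pre-normalization) assessment found a taint.
function shouldThrowFastEmpty(input: {
  durationMs: number;
  output: string | undefined;
  normalizedReasons: string[];
  rawAssessment: SignalAssessment | undefined;
}): boolean {
  const looksEmpty = !input.output || input.output.trim().length < MIN_OUTPUT_LENGTH;
  const rawTainted = input.rawAssessment?.classification === "tainted";
  return (
    input.durationMs < FAST_RUN_MS &&
    looksEmpty &&
    input.normalizedReasons.length === 0 &&
    !rawTainted
  );
}
```

With this predicate, a fast run whose raw stdout was flagged as internal_tool_transcript no longer raises the "may not be recognized" error and stays on the retry/taint path.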

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/signal-assessment.ts`:
- Around line 22-42: The current regex arrays INTERNAL_TOOL_TRANSCRIPT_PATTERNS
and AGENT_REQUESTED_INPUT_PATTERNS are too broad (notably the
/,\s*filePath\s*:/i entry and /\bplease confirm(?: how)?\b/i) and will match
ordinary code/UI copy; narrow them to harness-specific shapes (e.g., require
surrounding JSON-like keys or tool-type prefixes: match '\"filePath\"' or
'"parameter=filePath"' exactly or require adjacent tool/session markers such as
'"type":"tool_' or '"sessionID"' in the same string) and make confirmation
checks require conversational phrasing used by the harness (e.g., full sentences
like "Please confirm to continue" or anchored fragments like "(?:please confirm
(?:to|that|how) )" rather than any occurrence of "please confirm"); update the
patterns /,\s*filePath\s*:/i and /\bplease confirm(?: how)?\b/i accordingly in
INTERNAL_TOOL_TRANSCRIPT_PATTERNS and AGENT_REQUESTED_INPUT_PATTERNS, and add
negative regression tests asserting that benign object literals containing
filePath and ordinary confirmation UI copy are NOT flagged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7cd816ca-969c-4e04-b473-3767b770a3ed

📥 Commits

Reviewing files that changed from the base of the PR and between b3e1910 and 589033c.

📒 Files selected for processing (16)

apps/dashboard/src/lib/schemas.ts
apps/dashboard/src/lib/types.ts
src/harnesses/goose-adapter.ts
src/harnesses/goose-output.ts
src/harnesses/opencode-adapter.ts
src/harnesses/opencode-output.ts
src/lib/signal-assessment.ts
src/runner/item-executor.ts
src/schemas/common.schema.ts
test/dashboard-schema-compat.test.ts
test/goose-adapter.test.ts
test/goose-output.test.ts
test/item-retry.test.ts
test/opencode-adapter.test.ts
test/opencode-output.test.ts
test/signal-assessment.test.ts

AustinKelsay · 2026-04-05T19:51:47Z

@coderabbitai review

coderabbitai · 2026-04-05T19:51:53Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

♻️ Duplicate comments (1)

src/lib/signal-assessment.ts (1)

22-42: ⚠️ Potential issue | 🟠 Major

Tighten the remaining broad taint matchers.

Line 27 still matches any line-start tool-named call like `write(`, and Line 37 matches any sentence that merely contains the phrase "without user input". Both shapes occur in ordinary code or explanatory output, so failed semantic rows can still be reclassified as boundary-tainted. Please anchor these to harness-specific phrasing/shapes and add negative regressions for benign snippets/text.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/signal-assessment.ts` around lines 22 - 42, The two broad regexes
need to be tightened: in INTERNAL_TOOL_TRANSCRIPT_PATTERNS replace the generic
function-name matcher (the entry matching
/(?:^|\n)\s*(?:bash|edit|glob|grep|read|write)\s*(?:\{|\()/im) with a pattern
that only matches when the tool context is explicit (e.g., anchored to
transcript/tool metadata like a preceding "[Tool]" or "<function=...>" token or
a JSON "type":"tool_*" context), for example require a preceding transcript
marker (e.g., (?:^|\n)\s*(?:\[Tool\b|\<function=|\\"type\\":\s*\\"tool_)) before
the tool name; and in AGENT_REQUESTED_INPUT_PATTERNS replace the plain "without
user input" alternative with a context-anchored variant that requires an agent
subject (e.g.,
/\b(?:assistant|agent)\s+(?:is\s+)?operating\s+without\s+user\s+input\b/i or
phrases like "run without user input" tied to agent wording). Add corresponding
negative-regression testcases for benign code/comments to ensure these tightened
patterns do not false-positive; update the arrays
INTERNAL_TOOL_TRANSCRIPT_PATTERNS and AGENT_REQUESTED_INPUT_PATTERNS
accordingly.
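
A minimal sketch of what context-anchored matchers could look like, together with the kind of negative regression the comment asks for. The agent-subject pattern is the one quoted above; the transcript pattern (which requires a `[Tool]` or `<function=` marker before the tool name) and both helper names are assumptions for illustration, not the patterns that landed in src/lib/signal-assessment.ts.

```typescript
// Assumed transcript shape: a tool name only counts when a transcript marker precedes it.
const ANCHORED_TOOL_CALL =
  /(?:^|\n)\s*(?:\[Tool\]\s*|<function=)(?:bash|edit|glob|grep|read|write)\b/i;

// Pattern quoted in the review comment: requires an agent subject.
const AGENT_WITHOUT_INPUT =
  /\b(?:assistant|agent)\s+(?:is\s+)?operating\s+without\s+user\s+input\b/i;

function looksLikeToolTranscript(text: string): boolean {
  return ANCHORED_TOOL_CALL.test(text);
}

function looksLikeAgentInputRequest(text: string): boolean {
  return AGENT_WITHOUT_INPUT.test(text);
}

// Benign inputs that the broader matchers would have flagged.
const benignCode = "const data = read(path);\nwrite(path, data);";
const benignProse = "This script runs without user input once configured.";

// Harness-shaped inputs that should still be flagged.
const transcript = '[Tool] write {"filePath":"src/index.ts"}';
const agentNote = "The agent is operating without user input.";
```

Neither regex uses the `g` flag, so repeated `test` calls do not carry `lastIndex` state between inputs.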

🧹 Nitpick comments (1)

src/lib/signal-assessment.ts (1)

214-223: Make the failure-state contract explicit.

If a caller omits both rowFailed and automatedScore, Line 221 silently resolves the row to “not failed” and skips all post-scoring taint checks. Requiring one of those inputs, or throwing when neither is present, would prevent under-tainting from a missed call-site update.

♻️ Proposed guard

 export function finalizeItemSignalAssessment(input: {
 	existing: SignalAssessment | undefined;
 	automatedScore: AutomatedScore | undefined;
 	rowFailed?: boolean;
 	output: string | undefined;
 }): SignalAssessment {
 	let assessment = input.existing ?? createTrustworthySignalAssessment();
+	if (input.rowFailed === undefined && input.automatedScore === undefined) {
+		throw new Error(
+			"finalizeItemSignalAssessment requires rowFailed or automatedScore",
+		);
+	}
 	const rowFailed =
 		input.rowFailed ?? Boolean(input.automatedScore && input.automatedScore.failed > 0);

As per coding guidelines, "Throw on invalid config, invalid schemas, and programmer mistakes; avoid silent fallbacks."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/signal-assessment.ts` around lines 214 - 223,
finalizeItemSignalAssessment currently treats a missing rowFailed and
automatedScore as "not failed" which can silently under-taint; add an explicit
guard at the start of finalizeItemSignalAssessment to require that at least one
of input.rowFailed or input.automatedScore is provided and throw a clear Error
otherwise. Locate the function finalizeItemSignalAssessment and the variables
input.rowFailed and input.automatedScore, add the check before computing
rowFailed (before line that sets const rowFailed = ...) and throw a descriptive
error like "finalizeItemSignalAssessment requires rowFailed or automatedScore"
so callers must opt-in rather than relying on a silent default.
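
To show the contract in isolation, here is a small sketch of the failure-state resolution with the proposed guard. The `resolveRowFailed` name and the reduced `AutomatedScore` shape (only the `failed` count used in the diff above) are assumptions for illustration; the error message is the one proposed in the comment.

```typescript
// Reduced, assumed shape: only the field the diff above reads.
interface AutomatedScore {
  failed: number;
}

// Resolve whether a row failed. An explicit rowFailed always wins;
// otherwise the score decides; with neither input, this is a programmer error.
function resolveRowFailed(input: {
  rowFailed?: boolean;
  automatedScore?: AutomatedScore;
}): boolean {
  if (input.rowFailed === undefined && input.automatedScore === undefined) {
    throw new Error("finalizeItemSignalAssessment requires rowFailed or automatedScore");
  }
  return input.rowFailed ?? (input.automatedScore !== undefined && input.automatedScore.failed > 0);
}
```

Note the precedence: `resolveRowFailed({ rowFailed: false, automatedScore: { failed: 2 } })` returns `false`, because the explicit flag overrides the score. Whether that precedence is desirable is a design question for the call sites, not something this sketch settles.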

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3d999f2c-5aef-4ef7-9f38-5e04fa302951

📥 Commits

Reviewing files that changed from the base of the PR and between 589033c and 42bd9ea.

📒 Files selected for processing (4)

src/harnesses/opencode-adapter.ts
src/lib/signal-assessment.ts
test/opencode-adapter.test.ts
test/signal-assessment.test.ts

✅ Files skipped from review due to trivial changes (1)

test/opencode-adapter.test.ts

🚧 Files skipped from review as they are similar to previous changes (2)

test/signal-assessment.test.ts
src/harnesses/opencode-adapter.ts

bench: harden signal assessment and retry fairness

589033c

coderabbitai Bot reviewed Apr 5, 2026

Comment thread src/lib/signal-assessment.ts

bench: tighten taint detection regressions

42bd9ea

vercel Bot deployed to Preview April 5, 2026 19:51

coderabbitai Bot reviewed Apr 5, 2026

fix: tighten signal assessment taint guards

aef3606

vercel Bot deployed to Preview April 8, 2026 16:53

AustinKelsay marked this pull request as ready for review April 10, 2026 21:17

AustinKelsay merged commit 9c372ee into staging Apr 10, 2026
3 checks passed

coderabbitai Bot mentioned this pull request Apr 10, 2026

Staging #24

Open
