[codex] bench: harden signal assessment and retry fairness#23

Merged
AustinKelsay merged 3 commits into staging from fix/benchmark-signal-hardening on Apr 10, 2026

AustinKelsay (Owner) commented Apr 5, 2026

What changed

This hardens benchmark signal classification and harness normalization on staging without changing prompt contracts or workspace exact-match scoring.

  • Added two new taint reasons: internal_tool_transcript and agent_requested_input
  • Mirrored the new reason codes through the dashboard schemas/types
  • Broadened post-scoring tainting so failed rows are tainted for transcript leakage, continuation prompts, confirmation-only output, and raw unevaluated tool payloads across scoring modes
  • Changed OpenCode and Goose normalizers so parsed JSON/JSONL that contains no assistant artifact now normalizes to an empty payload instead of leaking raw protocol/status text downstream
  • Updated Goose/OpenCode adapters to append transcript/input taint reasons from raw and normalized output
  • Removed the harness gate on compile-feedback retry so direct code-module rows now use the same one-shot retry path as Goose/OpenCode
  • Added unit coverage for the new normalizer behavior, signal-assessment behavior, and direct-harness compile retry
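
The normalizer change in the list above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `normalizeJsonlOutput`, the event shape (`type: "assistant"` with a `text` field), and the `"raw"` method value are assumptions, since the actual Goose/OpenCode protocol shapes are not shown here. Only the `{ output: "", method: "json" }` result for parsed-but-empty input comes from the PR itself.

```typescript
type NormalizedOutput = { output: string; method: "json" | "raw" };

// Hypothetical event shape; the real harness protocol is not shown in this PR.
type ProtocolEvent = { type?: unknown; text?: unknown };

function normalizeJsonlOutput(raw: string): NormalizedOutput {
  const lines = raw
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  const events: ProtocolEvent[] = [];
  for (const line of lines) {
    try {
      const parsed: unknown = JSON.parse(line);
      if (typeof parsed !== "object" || parsed === null) {
        // A bare JSON scalar is not a protocol event: treat the input as raw text.
        return { output: raw.trim(), method: "raw" };
      }
      events.push(parsed as ProtocolEvent);
    } catch {
      // At least one line is not JSON, so this is not JSONL: keep the raw text.
      return { output: raw.trim(), method: "raw" };
    }
  }
  if (events.length === 0) {
    return { output: raw.trim(), method: "raw" };
  }

  const assistantText = events
    .filter((event) => event.type === "assistant" && typeof event.text === "string")
    .map((event) => event.text as string)
    .join("\n")
    .trim();

  // Parsed successfully but possibly no assistant artifact: return the (possibly
  // empty) assistant text and never fall back to the raw transport text.
  return { output: assistantText, method: "json" };
}
```

With this shape, a stream that contains only status or tool events yields an empty `output`, which downstream code can retry or taint instead of scoring protocol chatter.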

Why it changed

The previous run analysis showed that benchmark failures were not always clean model-quality signal.

Root causes addressed here:

  • OpenCode/Goose could leak internal protocol transcripts or continuation prompts into scored output while still being treated as trustworthy failures
  • Direct harness rows were excluded from compile-feedback retry, making cross-harness comparisons unfair on import/missing-export failures
  • Parsed-but-empty JSON/JSONL payloads were falling back to raw transport text instead of being retried/tainted cleanly
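
The retry-fairness point can be illustrated with a small sketch. Everything here is hypothetical (`runWithCompileRetry`, the injected `generate` and `compile` callbacks, and the result shape); the only names taken from this PR are the harness kinds and `retryGenerationDurationMs`. The sketch is synchronous for brevity, whereas real generation would almost certainly be asynchronous.

```typescript
type Harness = "direct" | "goose" | "opencode";
type Attempt = { output: string; durationMs: number };
type RowResult = {
  harness: Harness;
  output: string;
  retried: boolean;
  retryGenerationDurationMs?: number;
};

// Hypothetical helper: `generate` produces code (optionally given compile
// feedback) and `compile` returns an error message or null on success.
function runWithCompileRetry(
  harness: Harness,
  generate: (feedback?: string) => Attempt,
  compile: (code: string) => string | null,
): RowResult {
  const first = generate();
  const compileError = compile(first.output);
  if (compileError === null) {
    return { harness, output: first.output, retried: false };
  }
  // No harness gate here: direct, goose, and opencode rows all get exactly
  // one retry that includes the compile feedback.
  const second = generate(compileError);
  return {
    harness,
    output: second.output,
    retried: true,
    retryGenerationDurationMs: second.durationMs,
  };
}
```

Because the retry decision depends only on the compile result, an import or missing-export failure is handled the same way regardless of which harness produced the row.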

Impact

  • Failed rows caused by internal tool transcripts or agent continuation prompts should now be marked tainted instead of trustworthy
  • OpenCode/Goose should stop surfacing raw protocol JSONL/status chatter as benchmark output
  • Direct harness rows now participate in compile-feedback retry and record retryGenerationDurationMs when that path runs
  • Dashboard validation continues to accept/render the new signal reason codes

Validation

Completed:

  • bun test
  • bun test test/hardware-profile.test.ts

Notes:

  • bun test hit two unrelated failures in test/hardware-profile.test.ts, but that file passes when run in isolation. This appears to be a pre-existing order-dependent test issue outside the scope of this PR.
  • I started the longer direct-harness benchmark validation matrix and confirmed that direct rows now enter compile-feedback retry and record retryGenerationDurationMs. I stopped the run before it finished because the full targeted reruns take hours to complete end-to-end on this machine.
  • The long benchmark reruns from the original plan are still pending for staging validation:
    • direct targeted matrix
    • goose targeted matrix
    • opencode targeted matrix
    • full post-fix ollama matrix

Reviewer focus

Please focus review on:

  • signal-assessment reasoning and taint boundaries
  • OpenCode/Goose normalization behavior for parsed-but-empty output
  • direct compile-feedback retry fairness
  • whether the remaining long benchmark reruns should be completed before merge from draft

Summary by CodeRabbit

  • New Features

    • Added two signal assessment reasons: internal_tool_transcript and agent_requested_input to better classify transcript-only outputs and continuation prompts.
    • Expanded detection to surface these conditions across adapters and normalization paths.
  • Bug Fixes

    • More precise empty-output detection and improved error reporting when adapters produce no assistant content.
  • Tests

    • Added/expanded tests covering output normalization, transcript-only cases, continuation prompts, and retry flows.

vercel Bot commented Apr 5, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| plebdev-bench-dashboard | Ready | Preview, Comment | Apr 8, 2026 4:53pm |

coderabbitai Bot commented Apr 5, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3b0fd435-3d7d-4177-b90d-a63bff25038f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Adds two new signal-assessment reasons ("internal_tool_transcript", "agent_requested_input"), detectors and helpers to surface them, updates output normalization to treat parsed-but-empty JSON as empty, and plumbs transcript/input taint reasons through harness adapters and the item finalization flow via an explicit rowFailed parameter.
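
As a rough sketch of how taint reasons from several sources (raw stdout, stderr, normalized output) might be merged without duplicates: the helper name `collectTaintReasons` and the placeholder detector below are assumptions for illustration only. The PR's real detectors (`isInternalToolTranscriptOutput`, `isAgentRequestedInputOutput`, `getTranscriptOrInputTaintReasons`) are not reproduced here, and their matching rules may differ.

```typescript
type TaintReason = "internal_tool_transcript" | "agent_requested_input";
type Detector = (text: string) => TaintReason[];

// Hypothetical helper: runs a detector over every available source and
// returns each reason once, in first-seen order.
function collectTaintReasons(
  sources: Array<string | undefined>,
  detect: Detector,
): TaintReason[] {
  const seen = new Set<TaintReason>();
  for (const source of sources) {
    if (!source) continue; // skip undefined and empty strings
    for (const reason of detect(source)) {
      seen.add(reason);
    }
  }
  return Array.from(seen);
}

// Placeholder detector with deliberately simple rules (assumptions, not the
// patterns used in src/lib/signal-assessment.ts).
const placeholderDetect: Detector = (text) => {
  const reasons: TaintReason[] = [];
  if (text.includes('"type":"tool_')) reasons.push("internal_tool_transcript");
  if (/please confirm to continue/i.test(text)) reasons.push("agent_requested_input");
  return reasons;
};
```

Passing the raw and normalized strings through the same helper keeps a taint that is only visible before normalization from being lost once protocol-only output normalizes to an empty string.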

Changes

| Cohort / File(s) | Summary |
|---|---|
| Schema & Types: apps/dashboard/src/lib/schemas.ts, apps/dashboard/src/lib/types.ts, src/schemas/common.schema.ts | Added "internal_tool_transcript" and "agent_requested_input" to the SignalAssessmentReason enum/constant and updated exported SignalAssessmentReason type. |
| Signal assessment utilities: src/lib/signal-assessment.ts | Added isInternalToolTranscriptOutput, isAgentRequestedInputOutput, getTranscriptOrInputTaintReasons; refactored finalizeItemSignalAssessment to take optional rowFailed?: boolean and always include transcript/input taint reasons. |
| Goose adapter & output normalization: src/harnesses/goose-adapter.ts, src/harnesses/goose-output.ts | Extract transcript/input taint reasons from raw and normalized outputs (deduplicated), compute/pass signalAssessment through results, tighten fast-empty-output guard, and return { output: "", method: "json" } when parsed JSON yields no assistant text. |
| OpenCode adapter & output normalization: src/harnesses/opencode-adapter.ts, src/harnesses/opencode-output.ts | Merge stderr permission-detection with transcript/input taint reasons (stdout/stderr/normalized), build/propagate signalAssessment including merged reasons, include output and durationMs in error details, and return { output: "", method: "json" } for parsed-but-empty JSON/JSONL. |
| Runner integration & retries: src/runner/item-executor.ts | Calls to finalizeItemSignalAssessment updated to pass explicit rowFailed; compile-retry support broadened for all code-module scoring modes; failure paths updated to set rowFailed: true. |
| Tests: test/dashboard-schema-compat.test.ts, test/goose-adapter.test.ts, test/goose-output.test.ts, test/opencode-adapter.test.ts, test/opencode-output.test.ts, test/signal-assessment.test.ts, test/item-retry.test.ts | Added/updated tests covering new reason codes, detectors, normalization edge cases (parsed-but-empty JSON/JSONL), adapter taint propagation, finalize behavior with rowFailed, and compile-retry flows. |

Sequence Diagram(s)

sequenceDiagram
  participant Harness as Harness (goose/opencode)
  participant Normalizer as Output Normalizer
  participant Detector as SignalAssessment Detector
  participant Runner as Item Executor / Finalizer
  participant Schema as Schema/Types

  Harness->>Normalizer: produce raw stdout/stderr
  Normalizer->>Normalizer: parse JSON/JSONL or raw
  Normalizer-->>Detector: normalized output (may be "")
  Detector->>Detector: detect transcript/input taint reasons
  Detector-->>Runner: signalAssessment reasons (if any)
  Runner->>Schema: finalizeItemSignalAssessment(rowFailed?, reasons)
  Runner-->>Harness: return GenerateResult (output + signalAssessment)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Poem

🐰 I sniffed the transcript, found a trace,

"agent_requested_input" hopped into place.
Taint and prompts I gently flag,
Across adapters, guard, and test-tag.
Hooray — small reasons, big embrace! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 76.92% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title '[codex] bench: harden signal assessment and retry fairness' directly maps to the PR's core objectives: hardening signal assessment classification with new taint reasons and improving retry fairness across harness types. |


Owner Author

@coderabbitai review

coderabbitai Bot commented Apr 5, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/harnesses/opencode-adapter.ts (1)

446-454: ⚠️ Potential issue | 🟠 Major

Use the pre-normalization boundary assessment in the fast-empty guard.

Protocol-only JSON/JSONL now normalizes to "", so this branch can still throw the "model ... may not be recognized" error for fast runs even when raw stdout/stderr already produced a tainted signalAssessment (for example internal_tool_transcript). That bypasses the retry/taint path this PR is trying to preserve for parsed-but-empty output.

💡 Suggested fix
 				if (!codeFilePath) {
 					if (
 						durationMs < 2000 &&
 						(!output || output.trim().length < MIN_OUTPUT_LENGTH) &&
-						normalizedReasons.length === 0
+						normalizedSignalAssessment?.classification !== "tainted"
 					) {
 						throw new Error(
 							`OpenCode returned empty output instantly (${durationMs}ms) - model "${model}" may not be recognized by OpenCode`,
 						);
 					}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/opencode-adapter.ts` around lines 446 - 454, The fast-empty
guard currently checks normalizedReasons.length === 0 (post-normalization) which
can miss pre-normalization taints; update the condition to also require that the
pre-normalization boundary assessment (the variable representing the raw signal
assessment, e.g., signalAssessment or preNormalizationBoundaryAssessment)
indicates no taint before throwing. In the if block that checks codeFilePath,
durationMs, output and normalizedReasons, add a check that the pre-normalization
assessment is absent/empty (for example signalAssessment == null ||
signalAssessment.reasons?.length === 0) so the error is only thrown when both
pre- and post-normalization show no taint and preserve the retry/taint path when
the raw stdout/stderr had a taint.
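
One way to read the suggestion above is as a pure predicate that decides whether the fast-empty error should be thrown. The sketch below is illustrative only: `shouldThrowFastEmpty`, the `rawAssessment` parameter name, the `"trustworthy"` classification value, and the value chosen for `MIN_OUTPUT_LENGTH` are assumptions. The 2000 ms threshold and the `normalizedReasons.length === 0` check come from the snippet under review.

```typescript
type SignalAssessment = {
  classification: "trustworthy" | "tainted"; // "trustworthy" is an assumed value
  reasons: string[];
};

const MIN_OUTPUT_LENGTH = 10; // assumed; the real constant's value is not shown
const FAST_EMPTY_THRESHOLD_MS = 2000; // matches `durationMs < 2000` in the guard

// Hypothetical predicate: true only when the run was fast, the output is
// effectively empty, and neither the normalized nor the raw output was tainted.
function shouldThrowFastEmpty(input: {
  durationMs: number;
  output: string | undefined;
  normalizedReasons: string[];
  rawAssessment: SignalAssessment | undefined;
}): boolean {
  const looksEmpty =
    !input.output || input.output.trim().length < MIN_OUTPUT_LENGTH;
  const rawTainted =
    input.rawAssessment?.classification === "tainted" ||
    (input.rawAssessment?.reasons.length ?? 0) > 0;
  return (
    input.durationMs < FAST_EMPTY_THRESHOLD_MS &&
    looksEmpty &&
    input.normalizedReasons.length === 0 &&
    !rawTainted
  );
}
```

Keeping the decision in a predicate like this would also make it straightforward to unit-test the protocol-only case without spawning the harness.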
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/signal-assessment.ts`:
- Around line 22-42: The current regex arrays INTERNAL_TOOL_TRANSCRIPT_PATTERNS
and AGENT_REQUESTED_INPUT_PATTERNS are too broad (notably the
/,\s*filePath\s*:/i entry and /\bplease confirm(?: how)?\b/i) and will match
ordinary code/UI copy; narrow them to harness-specific shapes (e.g., require
surrounding JSON-like keys or tool-type prefixes: match '\"filePath\"' or
'"parameter=filePath"' exactly or require adjacent tool/session markers such as
'"type":"tool_' or '"sessionID"' in the same string) and make confirmation
checks require conversational phrasing used by the harness (e.g., full sentences
like "Please confirm to continue" or anchored fragments like "(?:please confirm
(?:to|that|how) )" rather than any occurrence of "please confirm"); update the
patterns /,\s*filePath\s*:/i and /\bplease confirm(?: how)?\b/i accordingly in
INTERNAL_TOOL_TRANSCRIPT_PATTERNS and AGENT_REQUESTED_INPUT_PATTERNS, and add
negative regression tests asserting that benign object literals containing
filePath and ordinary confirmation UI copy are NOT flagged.
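
A minimal sketch of what the narrowed matchers and their negative regressions could look like. The helper names and exact regexes below are assumptions written for illustration; they follow the shapes suggested in the comment above (quoted `"filePath"` key plus a tool/session marker, and `please confirm (to|that|how)`), and they are still heuristics rather than the patterns that ended up in src/lib/signal-assessment.ts.

```typescript
// Hypothetical: flag only JSON-style payloads that carry both a quoted
// "filePath" key and an explicit tool/session marker in the same string.
function looksLikeToolPayload(text: string): boolean {
  const hasQuotedFilePath = /"filePath"\s*:/.test(text);
  const hasToolMarker = /"type"\s*:\s*"tool_|"sessionID"\s*:/.test(text);
  return hasQuotedFilePath && hasToolMarker;
}

// Hypothetical: require conversational phrasing after "please confirm"
// rather than any occurrence of the two words.
const CONFIRMATION_PROMPT = /\bplease confirm (?:to|that|how)\b/i;

function looksLikeConfirmationPrompt(text: string): boolean {
  return CONFIRMATION_PROMPT.test(text);
}

// Benign inputs that should NOT be flagged (negative regressions).
const benignObjectLiteral = 'const config = { name: "a", filePath: "./x.ts" };';
const benignUiCopy = "<button>Please confirm your email</button>";

// Harness-shaped inputs that should be flagged.
const toolPayload = '{"type":"tool_use","input":{"filePath":"src/a.ts"}}';
const agentPrompt = "Please confirm how you would like me to proceed.";
```

The benign object literal uses an unquoted `filePath` key and has no tool marker, so the payload check ignores it; the UI copy lacks the `to|that|how` continuation, so the confirmation check ignores it as well.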


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7cd816ca-969c-4e04-b473-3767b770a3ed

📥 Commits

Reviewing files that changed from the base of the PR and between b3e1910 and 589033c.

📒 Files selected for processing (16)
  • apps/dashboard/src/lib/schemas.ts
  • apps/dashboard/src/lib/types.ts
  • src/harnesses/goose-adapter.ts
  • src/harnesses/goose-output.ts
  • src/harnesses/opencode-adapter.ts
  • src/harnesses/opencode-output.ts
  • src/lib/signal-assessment.ts
  • src/runner/item-executor.ts
  • src/schemas/common.schema.ts
  • test/dashboard-schema-compat.test.ts
  • test/goose-adapter.test.ts
  • test/goose-output.test.ts
  • test/item-retry.test.ts
  • test/opencode-adapter.test.ts
  • test/opencode-output.test.ts
  • test/signal-assessment.test.ts

Comment thread src/lib/signal-assessment.ts
Owner Author

@coderabbitai review

coderabbitai Bot commented Apr 5, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
src/lib/signal-assessment.ts (1)

22-42: ⚠️ Potential issue | 🟠 Major

Tighten the remaining broad taint matchers.

Line 27 still matches any line-start tool-named call such as "write(", and Line 37 matches any sentence that merely contains the phrase "without user input". Both shapes occur in ordinary code or explanatory output, so failed semantic rows can still be reclassified as boundary-tainted. Please anchor these to harness-specific phrasing/shapes and add negative regressions for benign snippets/text.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/signal-assessment.ts` around lines 22 - 42, The two broad regexes
need to be tightened: in INTERNAL_TOOL_TRANSCRIPT_PATTERNS replace the generic
function-name matcher (the entry matching
/(?:^|\n)\s*(?:bash|edit|glob|grep|read|write)\s*(?:\{|\()/im) with a pattern
that only matches when the tool context is explicit (e.g., anchored to
transcript/tool metadata like a preceding "[Tool]" or "<function=...>" token or
a JSON "type":"tool_*" context), for example require a preceding transcript
marker (e.g., (?:^|\n)\s*(?:\[Tool\b|\<function=|\\"type\\":\s*\\"tool_)) before
the tool name; and in AGENT_REQUESTED_INPUT_PATTERNS replace the plain "without
user input" alternative with a context-anchored variant that requires an agent
subject (e.g.,
/\b(?:assistant|agent)\s+(?:is\s+)?operating\s+without\s+user\s+input\b/i or
phrases like "run without user input" tied to agent wording). Add corresponding
negative-regression testcases for benign code/comments to ensure these tightened
patterns do not false-positive; update the arrays
INTERNAL_TOOL_TRANSCRIPT_PATTERNS and AGENT_REQUESTED_INPUT_PATTERNS
accordingly.
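
The anchoring described above might look something like the following sketch. The regexes and helper names are assumptions for illustration: the transcript markers (`[Tool]`, `<function=`) and the agent-subject phrasing are taken from the examples in the prompt, not from the harnesses' documented output, so real transcripts may need different anchors.

```typescript
const TOOL_NAMES = "bash|edit|glob|grep|read|write";

// Hypothetical: a tool name only counts when an explicit transcript marker
// precedes it at the start of a line.
const ANCHORED_TOOL_CALL = new RegExp(
  `(?:^|\\n)\\s*(?:\\[Tool\\]\\s*|<function=)(?:${TOOL_NAMES})\\b`,
  "i",
);

// Hypothetical: "without user input" only counts with an agent subject.
const AGENT_WITHOUT_INPUT =
  /\b(?:assistant|agent)\s+(?:is\s+)?operating\s+without\s+user\s+input\b/i;

function isAnchoredToolCall(text: string): boolean {
  return ANCHORED_TOOL_CALL.test(text);
}

function isAgentWithoutInput(text: string): boolean {
  return AGENT_WITHOUT_INPUT.test(text);
}
```

Under these patterns, a plain `write(data)` call in generated code or a sentence such as "This script runs without user input." is left alone, while `[Tool] write({...})` or "The agent is operating without user input" is still detected.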
🧹 Nitpick comments (1)
src/lib/signal-assessment.ts (1)

214-223: Make the failure-state contract explicit.

If a caller omits both rowFailed and automatedScore, Line 221 silently resolves the row to “not failed” and skips all post-scoring taint checks. Requiring one of those inputs, or throwing when neither is present, would prevent under-tainting from a missed call-site update.

♻️ Proposed guard
 export function finalizeItemSignalAssessment(input: {
 	existing: SignalAssessment | undefined;
 	automatedScore: AutomatedScore | undefined;
 	rowFailed?: boolean;
 	output: string | undefined;
 }): SignalAssessment {
 	let assessment = input.existing ?? createTrustworthySignalAssessment();
+	if (input.rowFailed === undefined && input.automatedScore === undefined) {
+		throw new Error(
+			"finalizeItemSignalAssessment requires rowFailed or automatedScore",
+		);
+	}
 	const rowFailed =
 		input.rowFailed ?? Boolean(input.automatedScore && input.automatedScore.failed > 0);

As per coding guidelines, "Throw on invalid config, invalid schemas, and programmer mistakes; avoid silent fallbacks."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/signal-assessment.ts` around lines 214 - 223,
finalizeItemSignalAssessment currently treats a missing rowFailed and
automatedScore as "not failed" which can silently under-taint; add an explicit
guard at the start of finalizeItemSignalAssessment to require that at least one
of input.rowFailed or input.automatedScore is provided and throw a clear Error
otherwise. Locate the function finalizeItemSignalAssessment and the variables
input.rowFailed and input.automatedScore, add the check before computing
rowFailed (before line that sets const rowFailed = ...) and throw a descriptive
error like "finalizeItemSignalAssessment requires rowFailed or automatedScore"
so callers must opt-in rather than relying on a silent default.
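
To make the contract concrete, the failure-state resolution can be isolated into a small helper and exercised on its own. `resolveRowFailed` is a hypothetical name used only for this sketch, and `AutomatedScore` is reduced to the single `failed` field that the proposed guard relies on.

```typescript
type AutomatedScore = { failed: number }; // reduced to the one field used here

// Hypothetical helper isolating the guard: an explicit rowFailed wins,
// otherwise the score decides, and providing neither is a programmer error.
function resolveRowFailed(input: {
  rowFailed?: boolean;
  automatedScore?: AutomatedScore;
}): boolean {
  if (input.rowFailed === undefined && input.automatedScore === undefined) {
    throw new Error(
      "finalizeItemSignalAssessment requires rowFailed or automatedScore",
    );
  }
  return (
    input.rowFailed ??
    Boolean(input.automatedScore && input.automatedScore.failed > 0)
  );
}
```

A call site that forgets to pass either input then fails loudly instead of silently skipping the post-scoring taint checks.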

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3d999f2c-5aef-4ef7-9f38-5e04fa302951

📥 Commits

Reviewing files that changed from the base of the PR and between 589033c and 42bd9ea.

📒 Files selected for processing (4)
  • src/harnesses/opencode-adapter.ts
  • src/lib/signal-assessment.ts
  • test/opencode-adapter.test.ts
  • test/signal-assessment.test.ts
✅ Files skipped from review due to trivial changes (1)
  • test/opencode-adapter.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/signal-assessment.test.ts
  • src/harnesses/opencode-adapter.ts

AustinKelsay marked this pull request as ready for review on April 10, 2026 at 21:17
AustinKelsay merged commit 9c372ee into staging on Apr 10, 2026
3 checks passed
coderabbitai Bot mentioned this pull request on Apr 10, 2026