[codex] bench: harden signal assessment and retry fairness #23
AustinKelsay merged 3 commits into `staging`
Conversation
📝 Walkthrough

Adds two new signal-assessment reasons ("internal_tool_transcript", "agent_requested_input"), detectors and helpers to surface them, updates output normalization to treat parsed-but-empty JSON as empty, and plumbs transcript/input taint reasons through harness adapters and the item finalization flow via an explicit `rowFailed` input.
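As a hedged illustration of the parsed-but-empty normalization rule mentioned above (the helper name and event shape here are assumptions, not the repo's actual API):

```typescript
// Sketch: JSON/JSONL that parses but carries no assistant-visible text
// normalizes to "", so empty-output guards and taint detection see it
// the same way as genuinely empty output. The event shape is hypothetical.
function normalizeOutput(raw: string): string {
  const textParts: string[] = [];
  for (const line of raw.split("\n")) {
    if (line.trim().length === 0) continue;
    try {
      const event = JSON.parse(line);
      // Protocol-only events (session markers, tool calls) contribute nothing.
      if (typeof event?.text === "string") textParts.push(event.text);
    } catch {
      // Not JSON: treat the line as raw model output.
      textParts.push(line);
    }
  }
  return textParts.join("\n").trim();
}
```

The point is that downstream code no longer needs to distinguish "parsed but empty" from "empty": both arrive as `""`.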
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Harness as Harness (goose/opencode)
    participant Normalizer as Output Normalizer
    participant Detector as SignalAssessment Detector
    participant Runner as Item Executor / Finalizer
    participant Schema as Schema/Types
    Harness->>Normalizer: produce raw stdout/stderr
    Normalizer->>Normalizer: parse JSON/JSONL or raw
    Normalizer-->>Detector: normalized output (may be "")
    Detector->>Detector: detect transcript/input taint reasons
    Detector-->>Runner: signalAssessment reasons (if any)
    Runner->>Schema: finalizeItemSignalAssessment(rowFailed?, reasons)
    Runner-->>Harness: return GenerateResult (output + signalAssessment)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
@coderabbitai review

✅ Actions performed: Review triggered.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/harnesses/opencode-adapter.ts (1)
446-454: ⚠️ Potential issue | 🟠 Major

Use the pre-normalization boundary assessment in the fast-empty guard.

Protocol-only JSON/JSONL now normalizes to `""`, so this branch can still throw the "model ... may not be recognized" error for fast runs even when raw stdout/stderr already produced a tainted `signalAssessment` (for example `internal_tool_transcript`). That bypasses the retry/taint path this PR is trying to preserve for parsed-but-empty output.

💡 Suggested fix

```diff
 if (!codeFilePath) {
   if (
     durationMs < 2000 &&
     (!output || output.trim().length < MIN_OUTPUT_LENGTH) &&
-    normalizedReasons.length === 0
+    normalizedSignalAssessment?.classification !== "tainted"
   ) {
     throw new Error(
       `OpenCode returned empty output instantly (${durationMs}ms) - model "${model}" may not be recognized by OpenCode`,
     );
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-adapter.ts` around lines 446 - 454, The fast-empty guard currently checks normalizedReasons.length === 0 (post-normalization) which can miss pre-normalization taints; update the condition to also require that the pre-normalization boundary assessment (the variable representing the raw signal assessment, e.g., signalAssessment or preNormalizationBoundaryAssessment) indicates no taint before throwing. In the if block that checks codeFilePath, durationMs, output and normalizedReasons, add a check that the pre-normalization assessment is absent/empty (for example signalAssessment == null || signalAssessment.reasons?.length === 0) so the error is only thrown when both pre- and post-normalization show no taint and preserve the retry/taint path when the raw stdout/stderr had a taint.
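Restated as a self-contained predicate, the suggested fix only throws when both the pre- and post-normalization boundaries look clean. This is a sketch using names taken from the review comment, not verified against the repository:

```typescript
interface BoundaryAssessment {
  classification: "trustworthy" | "tainted";
  reasons: string[];
}

// Returns true only when the run is fast, the output is empty/short, and
// neither the post-normalization reasons nor the raw (pre-normalization)
// assessment indicate taint. Thresholds mirror the diff above but are
// illustrative.
function shouldThrowFastEmpty(args: {
  durationMs: number;
  output: string;
  minOutputLength: number;
  normalizedReasons: string[];
  preNormalization?: BoundaryAssessment;
}): boolean {
  const fastAndEmpty =
    args.durationMs < 2000 &&
    (!args.output || args.output.trim().length < args.minOutputLength);
  const noPostTaint = args.normalizedReasons.length === 0;
  const noPreTaint =
    args.preNormalization == null ||
    args.preNormalization.classification !== "tainted";
  return fastAndEmpty && noPostTaint && noPreTaint;
}
```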
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/signal-assessment.ts`:
- Around line 22-42: The current regex arrays INTERNAL_TOOL_TRANSCRIPT_PATTERNS
and AGENT_REQUESTED_INPUT_PATTERNS are too broad (notably the
/,\s*filePath\s*:/i entry and /\bplease confirm(?: how)?\b/i) and will match
ordinary code/UI copy; narrow them to harness-specific shapes (e.g., require
surrounding JSON-like keys or tool-type prefixes: match '\"filePath\"' or
'"parameter=filePath"' exactly or require adjacent tool/session markers such as
'"type":"tool_' or '"sessionID"' in the same string) and make confirmation
checks require conversational phrasing used by the harness (e.g., full sentences
like "Please confirm to continue" or anchored fragments like "(?:please confirm
(?:to|that|how) )" rather than any occurrence of "please confirm"); update the
patterns /,\s*filePath\s*:/i and /\bplease confirm(?: how)?\b/i accordingly in
INTERNAL_TOOL_TRANSCRIPT_PATTERNS and AGENT_REQUESTED_INPUT_PATTERNS, and add
negative regression tests asserting that benign object literals containing
filePath and ordinary confirmation UI copy are NOT flagged.
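One way to read the suggested narrowing, sketched with an illustrative pattern and the negative case the prompt asks for (these are not the repository's actual regexes):

```typescript
// Hypothetical narrowed matcher: only flag filePath when it appears inside
// a tool-event JSON shape, not in ordinary object literals.
const NARROWED_FILE_PATH_PATTERN =
  /"type"\s*:\s*"tool_[a-z]+"[^\n]*"filePath"\s*:/i;

// Benign application code that mentions filePath should not match.
const benign = 'const opts = { filePath: "./src/index.ts", watch: true };';
// A harness tool-transcript line should match.
const transcript = '{"type":"tool_edit","sessionID":"abc","filePath":"/tmp/x"}';
```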
---
Outside diff comments:
In `@src/harnesses/opencode-adapter.ts`:
- Around line 446-454: The fast-empty guard currently checks
normalizedReasons.length === 0 (post-normalization) which can miss
pre-normalization taints; update the condition to also require that the
pre-normalization boundary assessment (the variable representing the raw signal
assessment, e.g., signalAssessment or preNormalizationBoundaryAssessment)
indicates no taint before throwing. In the if block that checks codeFilePath,
durationMs, output and normalizedReasons, add a check that the pre-normalization
assessment is absent/empty (for example signalAssessment == null ||
signalAssessment.reasons?.length === 0) so the error is only thrown when both
pre- and post-normalization show no taint and preserve the retry/taint path when
the raw stdout/stderr had a taint.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7cd816ca-969c-4e04-b473-3767b770a3ed
📒 Files selected for processing (16)
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/goose-output.ts
- src/harnesses/opencode-adapter.ts
- src/harnesses/opencode-output.ts
- src/lib/signal-assessment.ts
- src/runner/item-executor.ts
- src/schemas/common.schema.ts
- test/dashboard-schema-compat.test.ts
- test/goose-adapter.test.ts
- test/goose-output.test.ts
- test/item-retry.test.ts
- test/opencode-adapter.test.ts
- test/opencode-output.test.ts
- test/signal-assessment.test.ts
@coderabbitai review

✅ Actions performed: Review triggered.
♻️ Duplicate comments (1)
src/lib/signal-assessment.ts (1)
22-42: ⚠️ Potential issue | 🟠 Major

Tighten the remaining broad taint matchers.

Line 27 still matches any line-start tool-named call like `write(`, and Line 37 matches any sentence that merely contains "without user input". Both shapes occur in ordinary code or explanatory output, so failed semantic rows can still be reclassified as boundary-tainted. Please anchor these to harness-specific phrasing/shapes and add negative regressions for benign snippets/text.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/signal-assessment.ts` around lines 22 - 42, The two broad regexes need to be tightened: in INTERNAL_TOOL_TRANSCRIPT_PATTERNS replace the generic function-name matcher (the entry matching /(?:^|\n)\s*(?:bash|edit|glob|grep|read|write)\s*(?:\{|\()/im) with a pattern that only matches when the tool context is explicit (e.g., anchored to transcript/tool metadata like a preceding "[Tool]" or "<function=...>" token or a JSON "type":"tool_*" context), for example require a preceding transcript marker (e.g., (?:^|\n)\s*(?:\[Tool\b|\<function=|\\"type\\":\s*\\"tool_)) before the tool name; and in AGENT_REQUESTED_INPUT_PATTERNS replace the plain "without user input" alternative with a context-anchored variant that requires an agent subject (e.g., /\b(?:assistant|agent)\s+(?:is\s+)?operating\s+without\s+user\s+input\b/i or phrases like "run without user input" tied to agent wording). Add corresponding negative-regression testcases for benign code/comments to ensure these tightened patterns do not false-positive; update the arrays INTERNAL_TOOL_TRANSCRIPT_PATTERNS and AGENT_REQUESTED_INPUT_PATTERNS accordingly.
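A sketch of the agent-subject anchoring this comment asks for, with the benign negative case the reviewer wants covered (illustrative pattern, not the repository's):

```typescript
// Hypothetical anchored matcher: require an agent/assistant subject so that
// ordinary documentation mentioning "without user input" is not flagged.
const ANCHORED_INPUT_PATTERN =
  /\b(?:assistant|agent)\s+(?:is\s+)?operating\s+without\s+user\s+input\b/i;
```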
🧹 Nitpick comments (1)
src/lib/signal-assessment.ts (1)
214-223: Make the failure-state contract explicit.

If a caller omits both `rowFailed` and `automatedScore`, Line 221 silently resolves the row to "not failed" and skips all post-scoring taint checks. Requiring one of those inputs, or throwing when neither is present, would prevent under-tainting from a missed call-site update.

♻️ Proposed guard

```diff
 export function finalizeItemSignalAssessment(input: {
   existing: SignalAssessment | undefined;
   automatedScore: AutomatedScore | undefined;
   rowFailed?: boolean;
   output: string | undefined;
 }): SignalAssessment {
   let assessment = input.existing ?? createTrustworthySignalAssessment();
+  if (input.rowFailed === undefined && input.automatedScore === undefined) {
+    throw new Error(
+      "finalizeItemSignalAssessment requires rowFailed or automatedScore",
+    );
+  }
   const rowFailed =
     input.rowFailed ?? Boolean(input.automatedScore && input.automatedScore.failed > 0);
```

As per coding guidelines, "Throw on invalid config, invalid schemas, and programmer mistakes; avoid silent fallbacks."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/signal-assessment.ts` around lines 214 - 223, finalizeItemSignalAssessment currently treats a missing rowFailed and automatedScore as "not failed" which can silently under-taint; add an explicit guard at the start of finalizeItemSignalAssessment to require that at least one of input.rowFailed or input.automatedScore is provided and throw a clear Error otherwise. Locate the function finalizeItemSignalAssessment and the variables input.rowFailed and input.automatedScore, add the check before computing rowFailed (before line that sets const rowFailed = ...) and throw a descriptive error like "finalizeItemSignalAssessment requires rowFailed or automatedScore" so callers must opt-in rather than relying on a silent default.
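The proposed guard behaves like this small standalone sketch (types simplified; this is not the repo's real `finalizeItemSignalAssessment`):

```typescript
interface AutomatedScore {
  failed: number;
}

// Resolve the row's failure state, throwing when neither input is supplied
// so a missed call-site update fails loudly instead of under-tainting.
function resolveRowFailed(input: {
  rowFailed?: boolean;
  automatedScore?: AutomatedScore;
}): boolean {
  if (input.rowFailed === undefined && input.automatedScore === undefined) {
    throw new Error(
      "finalizeItemSignalAssessment requires rowFailed or automatedScore",
    );
  }
  return (
    input.rowFailed ??
    Boolean(input.automatedScore && input.automatedScore.failed > 0)
  );
}
```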
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/lib/signal-assessment.ts`:
- Around line 22-42: The two broad regexes need to be tightened: in
INTERNAL_TOOL_TRANSCRIPT_PATTERNS replace the generic function-name matcher (the
entry matching /(?:^|\n)\s*(?:bash|edit|glob|grep|read|write)\s*(?:\{|\()/im)
with a pattern that only matches when the tool context is explicit (e.g.,
anchored to transcript/tool metadata like a preceding "[Tool]" or
"<function=...>" token or a JSON "type":"tool_*" context), for example require a
preceding transcript marker (e.g.,
(?:^|\n)\s*(?:\[Tool\b|\<function=|\\"type\\":\s*\\"tool_)) before the tool
name; and in AGENT_REQUESTED_INPUT_PATTERNS replace the plain "without user
input" alternative with a context-anchored variant that requires an agent
subject (e.g.,
/\b(?:assistant|agent)\s+(?:is\s+)?operating\s+without\s+user\s+input\b/i or
phrases like "run without user input" tied to agent wording). Add corresponding
negative-regression testcases for benign code/comments to ensure these tightened
patterns do not false-positive; update the arrays
INTERNAL_TOOL_TRANSCRIPT_PATTERNS and AGENT_REQUESTED_INPUT_PATTERNS
accordingly.
---
Nitpick comments:
In `@src/lib/signal-assessment.ts`:
- Around line 214-223: finalizeItemSignalAssessment currently treats a missing
rowFailed and automatedScore as "not failed" which can silently under-taint; add
an explicit guard at the start of finalizeItemSignalAssessment to require that
at least one of input.rowFailed or input.automatedScore is provided and throw a
clear Error otherwise. Locate the function finalizeItemSignalAssessment and the
variables input.rowFailed and input.automatedScore, add the check before
computing rowFailed (before line that sets const rowFailed = ...) and throw a
descriptive error like "finalizeItemSignalAssessment requires rowFailed or
automatedScore" so callers must opt-in rather than relying on a silent default.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3d999f2c-5aef-4ef7-9f38-5e04fa302951
📒 Files selected for processing (4)
- src/harnesses/opencode-adapter.ts
- src/lib/signal-assessment.ts
- test/opencode-adapter.test.ts
- test/signal-assessment.test.ts
✅ Files skipped from review due to trivial changes (1)
- test/opencode-adapter.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- test/signal-assessment.test.ts
- src/harnesses/opencode-adapter.ts
What changed

This hardens benchmark signal classification and harness normalization on `staging` without changing prompt contracts or workspace exact-match scoring.

- Adds the `internal_tool_transcript` and `agent_requested_input` signal-assessment reasons
- `code-module` rows now use the same one-shot retry path as Goose/OpenCode

Why it changed
The previous run analysis showed that benchmark failures were not always clean model-quality signal.
Root causes addressed here:
Impact

- `retryGenerationDurationMs` is recorded when that path runs

Validation
Completed:

- `bun test`
- `bun test test/hardware-profile.test.ts`

Notes:

- `bun test` hit two unrelated failures in `test/hardware-profile.test.ts`, but that file passes when run in isolation. This appears to be a pre-existing order-dependent test issue outside the scope of this PR.
- Targeted reruns exercised `retryGenerationDurationMs`, but I stopped the full matrix because the full targeted reruns are materially long on this machine and would take hours to complete end-to-end.

Reviewer focus
Please focus review on: