tighten todo prompt and export exceptions by AustinKelsay · Pull Request #8 · AustinKelsay/plebdev-bench

AustinKelsay · 2026-02-12T18:51:31Z

Summary by CodeRabbit

New Features
- Two new scoring failure types added ("missing_export", "factory_init_failed"), richer per-case scoring details, and retry-aware generation/scoring for improved robustness.
Schema
- Schema version bumped to 0.2.2 to include the new failure types.
Tests
- Expanded tests to cover new failure categories, output-evaluation logic, and retry behaviors.
Documentation
- Updated READMEs, prompts, and guides clarifying output contracts, CLI commands, and spec conventions.

vercel · 2026-02-12T18:51:35Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
plebdev-bench-dashboard	Ready	Preview, Comment	Feb 15, 2026 3:15pm

coderabbitai · 2026-02-12T18:51:57Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.


Walkthrough

Adds two new scoring failure types ("missing_export", "factory_init_failed"); extends schemas/types and tests; updates error classification to return these types; extracts stdout suppression into src/lib/stdout-suppressor.ts; updates scorer to use suppression, aggregate expected totals, and return detailed failure results; introduces code-output policy, harness retry logic, and a compile-feedback retry flow in the runner.

Changes

Cohort / File(s)	Summary
Schema & Types `apps/dashboard/src/lib/schemas.ts`, `apps/dashboard/src/lib/types.ts`, `src/schemas/common.schema.ts`	Added `"missing_export"` and `"factory_init_failed"` to enums/unions and bumped `SCHEMA_VERSION` to `0.2.2`.
Failure Classification & Tests `src/lib/failure-classifier.ts`, `test/failure-classifier.test.ts`	Mapped new error phrases to `"missing_export"` and `"factory_init_failed"` (kept `export_validation`); updated tests to assert new classifications and ordering.
Stdout Suppression Utility `src/lib/stdout-suppressor.ts`, `src/lib/scorer.ts`	New suppressor module exported; scorer now imports `suppressStdout`/`suppressStdoutAsync` and removed in-file suppression implementations.
Scorer Logic `src/lib/scorer.ts`	Introduced `expectedTotal` for unified totals; added consolidated missing-export path returning per-export `details` with `failureType: "missing_export"`; added `factory_init_failed` branch and harmonized totals across error branches.
Code Output Policy & Harnesses `src/harnesses/code-output-policy.ts`, `src/harnesses/goose-adapter.ts`, `src/harnesses/opencode-adapter.ts`	New centralized code-only prompt/evaluation utilities; adapters now use decision-based evaluation, support a single retry with a retry marker, and handle off-task/suspicious outputs; the OpenCode adapter adds `ping()` and retry-aware prompt handling.
Runner Compile-Feedback Retry `src/runner/item-executor.ts`	Adds compile-feedback retry flow: builds retry prompts with compiler/import errors, re-runs generation/scoring when appropriate, and promotes better retry results.
Tests for New Policy `test/code-output-policy.test.ts`, `test/schemas.test.ts`	Added tests for code-output policy behavior and updated schema-version expectation.
Docs, Prompts & Bench Tests `README.md`, `llm/`, `src/tests/`	Numerous prompt/readme updates: added "Output contract" sections, renamed some test artifacts to `scoring.spec.ts`, refined rubrics and instructions across many tests.
Dashboard Results & Plan `apps/dashboard/public/results/*.json`	Updated plan/result metadata to `schemaVersion: 0.2.2`, adjusted timeouts/runtimes, and replaced/removed run entries.
Misc (new modules & test artifacts) `src/lib/stdout-suppressor.ts`, `src/harnesses/`, `test/`	Added new utility module and policy harness, expanded adapter logic and tests; various files added/modified to support retry and evaluation flows.
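
The consolidated missing-export path described in the Scorer Logic row can be pictured with a minimal sketch. The actual `src/lib/scorer.ts` code is not shown in this PR page, so the result shape, the field names (`passed`, `failed`, `total`, `details`), and the helper name `checkExpectedExports` are all assumptions made for illustration; only `expectedTotal`, per-export `details`, and `failureType: "missing_export"` come from the walkthrough.

```typescript
// Hypothetical result shapes; the real scorer types may differ.
interface ExportDetail {
	name: string;
	passed: boolean;
	error: string;
}

interface MissingExportFailure {
	passed: number;
	failed: number;
	total: number;
	failureType: "missing_export";
	error: string;
	details: ExportDetail[];
}

/**
 * Returns a failure result when any expected export is absent, or null
 * when every expected export is present on the imported module.
 */
function checkExpectedExports(
	moduleExports: Record<string, unknown>,
	expectedExports: string[],
	expectedTotal: number,
): MissingExportFailure | null {
	const missing = expectedExports.filter(
		(name) => typeof moduleExports[name] === "undefined",
	);
	if (missing.length === 0) {
		return null;
	}
	return {
		passed: 0,
		// Every expected case counts as failed, so totals stay comparable
		// with runs that got far enough to execute the test cases.
		failed: expectedTotal,
		total: expectedTotal,
		failureType: "missing_export",
		error: `Missing export: ${missing.join(", ")}`,
		details: missing.map((name) => ({
			name,
			passed: false,
			error: `Missing export: ${name}`,
		})),
	};
}
```

Reporting `expectedTotal` on the failure branch is the point of the sketch: a module that never exported its factory is scored against the same denominator as one that failed individual cases.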

Sequence Diagram(s)

sequenceDiagram
    participant Runner
    participant Harness as Harness (Goose/OpenCode)
    participant CodePolicy as CodeOutputPolicy
    participant Scorer
    participant StdSuppress as StdoutSuppressor
    participant Classifier

    Runner->>Harness: generateCode(prompt)
    Harness-->>CodePolicy: rawOutput
    CodePolicy-->>Harness: decision {shouldRetry, reason, code}
    alt decision.shouldRetry && retryAllowed
        Harness->>Harness: re-invoke with retry marker
        Harness-->>CodePolicy: rawOutput (retry)
    end
    CodePolicy-->>Runner: {code, method, shouldRetry=false}
    Runner->>Scorer: scoreGeneration(testSlug, codeFilePath?, rawOutput?)
    Scorer->>StdSuppress: suppressStdoutAsync(run import/execution)
    StdSuppress-->>Scorer: suppression active
    Scorer->>Scorer: import/execute candidate module
    alt missing exports detected
        Scorer->>Classifier: "Missing export ..."
        Classifier-->>Scorer: "missing_export"
        Scorer->>StdSuppress: restore
        Scorer-->>Runner: failure(failureType: "missing_export", totals, details)
    else factory init fails
        Scorer->>Classifier: "Failed to create instance ..."
        Classifier-->>Scorer: "factory_init_failed"
        Scorer->>StdSuppress: restore
        Scorer-->>Runner: failure(failureType: "factory_init_failed", totals, details)
    else imports & factory OK
        Scorer->>StdSuppress: restore
        Scorer-->>Runner: success(results)
    end
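
The classifier step in the diagram can be sketched as an ordered phrase table. This is an illustration under assumptions, not the code in `src/lib/failure-classifier.ts`: the two phrases ("Missing export", "Failed to create instance") are taken from the diagram, while the `export_validation` pattern, the `"unknown"` fallback, and the function name `classifyScoringFailure` are hypothetical.

```typescript
type ScoringFailureType =
	| "missing_export"
	| "factory_init_failed"
	| "export_validation"
	| "unknown"; // hypothetical fallback for this sketch

// Order matters: the specific phrases are checked before the broader
// export_validation pattern so they are not swallowed by it.
const RULES: ReadonlyArray<{ pattern: RegExp; type: ScoringFailureType }> = [
	{ pattern: /missing export/i, type: "missing_export" },
	{ pattern: /failed to create instance/i, type: "factory_init_failed" },
	{ pattern: /export/i, type: "export_validation" }, // assumed pattern
];

function classifyScoringFailure(message: string): ScoringFailureType {
	for (const rule of RULES) {
		if (rule.pattern.test(message)) {
			return rule.type;
		}
	}
	return "unknown";
}
```

The ordering concern is why the walkthrough mentions tests that assert classification order as well as the new types.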

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

refine with real tool use and better nuanced metrics #5: Overlaps modifications to scoring failure handling (schemas/types, classifier, scorer) and likely shares intent and code paths.
review and harden phase executed by codex #4: Related refactor/additions touching stdout suppression and scorer/classifier changes.
Feature/multi runtime #6: Makes similar edits to classifier, scorer, and related schema updates.

Poem

🐰 I hopped through code with tiny paws,
I named two faults with tidy claws,
I hushed the logs while imports tried,
Gave missing exports a place to hide,
Then nudged retries — the tests looked pleased.

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'tighten todo prompt and export exceptions' is directly related to the main changes, which involve tightening prompt specifications and handling two new export-related failure types (missing_export and factory_init_failed).
Docstring Coverage	✅ Passed	Docstring coverage is 94.44% which is sufficient. The required threshold is 80.00%.
Merge Conflict Detection	✅ Passed	✅ No merge conflicts detected when merging into `main`


coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@src/schemas/common.schema.ts`:
- Around line 65-66: Update the schema version and test expectation to reflect
the additive enum change: bump SCHEMA_VERSION from "0.2.1" to "0.2.2" next to
the modified scoringFailureTypes (which now includes "missing_export" and
"factory_init_failed"), and update the assertion in test/schemas.test.ts that
checks SCHEMA_VERSION so the test expects "0.2.2". Ensure only the
SCHEMA_VERSION value and the test expectation are changed; do not alter the
scoringFailureTypes entries themselves.

In `@src/tests/todo-app/prompt.blind.md`:
- Around line 9-13: Remove the duplicate plain-list entry that names the methods
`listTodos`, `listCompleted`, `listPending`, and `clearCompleted` (the short
list without signatures) so only the detailed signature lines for `listTodos():
Array<{ id: number, text: string, completed: boolean }>`, `listCompleted():
Array<{ id: number, text: string, completed: boolean }>`, `listPending():
Array<{ id: number, text: string, completed: boolean }>`, and `clearCompleted():
void` remain; simply delete the redundant line to avoid repeating those method
names without signatures.
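
For the schema comment above, a dependency-free sketch of an additive enum change paired with a version bump. The project's real file defines Zod schemas (`ScoringFailureTypeSchema`), which are not reproduced here; plain TypeScript is swapped in so the example runs on its own. The list below contains only the failure types named on this page and is probably incomplete, and `isScoringFailureType` is a hypothetical helper.

```typescript
// Version bumped alongside the additive enum change.
const SCHEMA_VERSION = "0.2.2";

// Partial list: only the values mentioned in this PR discussion.
const scoringFailureTypes = [
	"import",
	"export_validation",
	"missing_export",
	"factory_init_failed",
] as const;

type ScoringFailureType = (typeof scoringFailureTypes)[number];

/** Runtime guard that narrows an unknown value to ScoringFailureType. */
function isScoringFailureType(value: unknown): value is ScoringFailureType {
	return scoringFailureTypes.some((known) => known === value);
}
```

Because the change only adds values, older result files that never used the new types still validate; the version bump mainly lets consumers tell which enum they can expect.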

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/schemas/common.schema.ts (1)
61-72: ⚠️ Potential issue | 🟡 Minor

Update header export list to include scoring failure exports.

The header’s export list omits the newly exported scoring failure constants/types, so it no longer matches the file’s public surface. Please update it for accuracy.
✍️ Suggested header update

- * Exports: SCHEMA_VERSION, passTypes, PassTypeSchema, PassType,
- *          itemStatusTypes, ItemStatusSchema, ItemStatus,
- *          runtimeNames, RuntimeNameSchema, RuntimeName
+ * Exports: SCHEMA_VERSION, runtimeNames, RuntimeNameSchema, RuntimeName,
+ *          passTypes, PassTypeSchema, PassType,
+ *          itemStatusTypes, ItemStatusSchema, ItemStatus,
+ *          generationFailureTypes, GenerationFailureTypeSchema, GenerationFailureType,
+ *          scoringFailureTypes, ScoringFailureTypeSchema, ScoringFailureType,
+ *          frontierEvalFailureTypes, FrontierEvalFailureTypeSchema, FrontierEvalFailureType

As per coding guidelines, "Every file begins with a short header documenting purpose, exports, and invariants".

AustinKelsay · 2026-02-12T21:39:16Z

@CodeRabbit full review

coderabbitai · 2026-02-12T21:39:29Z

✅ Actions performed

Full review triggered.

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@src/lib/stdout-suppressor.ts`:
- Around line 37-48: The return from result.finally(restore) is a Promise when
fn() yields a Promise, but the code casts it incorrectly to T; update the cast
so the expression in the try branch returns a Promise<T> by changing the cast on
result.finally(restore) to as Promise<T>, keeping the surrounding logic (fn,
result, restore) intact so the async overload's return type is correct.
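
A sketch of the restore-on-settle pattern this comment is about. The real `src/lib/stdout-suppressor.ts` is not shown here, so this is an assumption-laden illustration: it silences `console.log` rather than writing to the process stream, and it splits sync and async into two functions (the names `suppressStdout` and `suppressStdoutAsync` come from the walkthrough; `silenceConsole` is hypothetical). With a dedicated async signature, `result.finally(restore)` is already typed as `Promise<T>` and no cast is needed.

```typescript
/** Replaces console.log with a no-op and returns a function that restores it. */
function silenceConsole(): () => void {
	const original = console.log;
	console.log = () => {};
	return () => {
		console.log = original;
	};
}

/** Runs a synchronous function with console.log silenced. */
function suppressStdout<T>(fn: () => T): T {
	const restore = silenceConsole();
	try {
		return fn();
	} finally {
		restore();
	}
}

/** Runs an async function with console.log silenced until it settles. */
function suppressStdoutAsync<T>(fn: () => Promise<T>): Promise<T> {
	const restore = silenceConsole();
	try {
		// finally() preserves the resolved value type, so this is Promise<T>.
		return fn().finally(restore);
	} catch (error) {
		// fn threw synchronously before returning a promise.
		restore();
		throw error;
	}
}
```

If a single function has to serve both cases, the async branch needs the `as Promise<T>` cast that the review asks for, because the compiler cannot relate the generic `T` to the promise returned by `finally`.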

🧹 Nitpick comments (1)

llm/project/project-rules.md (1)
41-41: Document the new *.spec.ts suffix convention.

The rename from scoring.test.ts to scoring.spec.ts introduces a new file naming pattern that distinguishes specification files from test files. However, the suffix conventions section (lines 70-71) only documents *.test.ts for vitest tests.

Consider adding an entry for *.spec.ts to clarify its purpose:

*.spec.ts — scoring specifications (define expected exports and test cases for benchmark tests)

*.test.ts — vitest tests
📝 Suggested addition to naming conventions

Add this to the "Suffix conventions" section around line 71:
 - Suffix conventions:
+  - `*.spec.ts` — scoring specifications (expected exports and test cases for benchmarks)
   - `*.test.ts` — vitest tests
   - `*.schema.ts` — Zod schemas
Also applies to: 159-159

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@apps/dashboard/public/results/20260213-075756-7dc316/plan.json`:
- Around line 4557-4559: The summary field "runtimes" is incorrect (shows 2)
while every item in the plan lists "runtime": "ollama", so update the runtimes
summary to the correct count (1) or compute it dynamically from unique runtime
values; specifically change the "runtimes" value to 1 (or derive it from the set
of runtime values) so it matches the items and prevents dashboard
inconsistency—look for the JSON keys "totalItems", "runtimes", "models" and the
per-item "runtime": "ollama" when making the fix.

In `@apps/dashboard/public/results/index.json`:
- Around line 3-6: The durationMs (27786998) does not equal the delta between
startedAt and completedAt for runId "20260213-075756-7dc316"; recompute
durationMs as the millisecond difference between completedAt and startedAt
(completedAt 2026-02-14T01:38:41.264Z minus startedAt 2026-02-13T15:57:56.054Z =
34845210) and update the durationMs field accordingly (or adjust the timestamps
if those are wrong).
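
Both dashboard fixes amount to deriving summary fields from the underlying data instead of storing them by hand. A small sketch, assuming hypothetical helper names (`countUnique`, `computeDurationMs`) and a simplified item shape; the arithmetic uses the timestamps quoted in the comment.

```typescript
interface PlanItem {
	runtime: string;
}

/** Counts distinct values, e.g. the number of runtimes used by a plan. */
function countUnique(values: string[]): number {
	return new Set(values).size;
}

/** Millisecond difference between two ISO 8601 timestamps. */
function computeDurationMs(startedAt: string, completedAt: string): number {
	return Date.parse(completedAt) - Date.parse(startedAt);
}

// Every item in the reviewed plan lists "ollama", so the summary should be 1.
const items: PlanItem[] = [{ runtime: "ollama" }, { runtime: "ollama" }];
const runtimes = countUnique(items.map((item) => item.runtime));

// Timestamps from the index.json comment; the result is 34845210.
const durationMs = computeDurationMs(
	"2026-02-13T15:57:56.054Z",
	"2026-02-14T01:38:41.264Z",
);
```

The expected duration breaks down as 9 h 40 min 45.210 s, which is 34,845.210 s, matching the 34845210 ms figure in the review.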

coderabbitai

Actionable comments posted: 4

🤖 Fix all issues with AI agents

In `@src/harnesses/code-output-policy.ts`:
- Around line 53-77: The TSDoc for the exported function buildCodeOnlyPrompt is
missing an `@throws` tag; update its docblock to include an `@throws` entry
describing any errors that can propagate (e.g., if getTestSpecificContract
throws or if invalid input is provided) or explicitly state that it does not
throw, so the docblock contains purpose, `@param`, `@returns` and `@throws`; apply the
same pattern to the other exported functions noted (lines 112-178 and 230-251)
using their respective function names (e.g., getTestSpecificContract) to
indicate possible thrown errors or "never throws" as appropriate.

In `@src/harnesses/opencode-adapter.ts`:
- Around line 46-47: Update the TSDoc for the exported function
createOpenCodeAdapter to include a brief purpose description plus explicit
`@returns` and `@throws` tags: state that it returns a Harness instance (type:
Harness) and document any runtime errors that callers should expect (e.g.,
initialization/config errors or exceptions thrown during adapter construction).
Attach these tags to the existing docblock above createOpenCodeAdapter so the
exported function has purpose, returns, and throws documented for consumers.

In `@src/runner/item-executor.ts`:
- Around line 319-416: The current promotion logic in the run that sets
generation = retryAttempt.generation and scoringResult =
retryAttempt.scoringResult loses the original attempt evidence; before replacing
these fields in the promotion branch of runCompileFeedbackRetry (the block
handling shouldPromoteRetry around retryAttempt and variables generation,
scoringResult, previousPassed), capture the original attempt (generation,
scoringResult, duration, and any failureType/error) and append it to a
persistent attempts/retryHistory on the item result (or include an
originalAttempt field on MatrixItemResult) so both the original and promoted
attempts are stored; then perform the promotion (replace
generation/scoringResult) but leave the new history entry referencing both
attempts and any metadata (timestamps, durations, pass counts) so
auditing/metrics can reconstruct the full decision.

In `@test/code-output-policy.test.ts`:
- Around line 1-4: Add an "Invariants" line to the file header in
test/code-output-policy.test.ts so it follows the project header standard;
update the existing comment block (the header with Purpose and Exports) to
include a third line beginning with "Invariants:" and describe the file-level
guarantees (e.g., test-only helpers, no side effects, or the specific invariant
relevant to these unit tests) to make the header include Purpose, Exports, and
Invariants.

🧹 Nitpick comments (1)

src/runner/item-executor.ts (1)

70-179: Consider extracting the compile-retry helpers to keep the file under 500 lines.
With the new retry helpers, this file now ends around Line 538. Moving COMPILE_RETRY_ERROR_MAX_LENGTH, buildCompileRetryPrompt, CompileRetryContext, and runCompileFeedbackRetry into a small helper module (e.g., src/runner/compile-retry.ts) would keep responsibilities split and meet the size guideline.

As per coding guidelines: "Keep files under 500 lines; split aggressively by responsibility".

coderabbitai · 2026-02-15T15:22:54Z

+/**
+ * Builds the strict code-only prompt contract.
+ *
+ * @param prompt - Base task prompt
+ * @param isRetry - Whether this is the retry attempt
+ * @returns Prompt text with strict output contract appended
+ */
+export function buildCodeOnlyPrompt(prompt: string, isRetry: boolean): string {
+	const retryLine = isRetry
+		? "Previous output was unusable. Retry now and output only final TypeScript source."
+		: "";
+	const testSpecificLines = getTestSpecificContract(prompt);
+	return [
+		prompt.trim(),
+		"",
+		"Output contract:",
+		"- Return only final TypeScript source code.",
+		"- Do not include markdown fences, analysis, status messages, or repository exploration text.",
+		"- If uncertain, still return your best complete TypeScript implementation.",
+		...testSpecificLines,
+		retryLine,
+	]
+		.filter((line) => line.length > 0)
+		.join("\n");
+}


⚠️ Potential issue | 🟡 Minor

Add @throws tags to exported function TSDoc.
The guideline requires purpose/params/returns/throws on exported functions; these blocks omit @throws.

✍️ Suggested doc updates

@@
  * `@param` prompt - Base task prompt
  * `@param` isRetry - Whether this is the retry attempt
  * `@returns` Prompt text with strict output contract appended
+ * `@throws` {Error} Never.
  */
 export function buildCodeOnlyPrompt(prompt: string, isRetry: boolean): string {
@@
  * `@param` output - Harness text output
  * `@param` minOutputLength - Minimum usable length
  * `@returns` Decision describing retry behavior and extracted code candidate
+ * `@throws` {Error} Never.
  */
 export function evaluateCodeOnlyOutput(
@@
  * `@param` prompt - Prompt text
  * `@returns` True when the prompt includes the internal retry marker
+ * `@throws` {Error} Never.
  */
 export function hasRetryMarker(prompt: string): boolean {
@@
  * `@param` prompt - Prompt text
  * `@returns` Prompt text with marker appended
+ * `@throws` {Error} Never.
  */
 export function appendRetryMarker(prompt: string): string {
@@
  * `@param` prompt - Prompt text
  * `@returns` Prompt without internal retry marker
+ * `@throws` {Error} Never.
  */
 export function stripRetryMarker(prompt: string): string {

As per coding guidelines: "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".

Also applies to: 112-178, 230-251

🤖 Prompt for AI Agents

In `@src/harnesses/code-output-policy.ts` around lines 53 - 77, The TSDoc for the exported function buildCodeOnlyPrompt is missing an `@throws` tag; update its docblock to include an `@throws` entry describing any errors that can propagate (e.g., if getTestSpecificContract throws or if invalid input is provided) or explicitly state that it does not throw, so the docblock contains purpose, `@param`, `@returns` and `@throws`; apply the same pattern to the other exported functions noted (lines 112-178 and 230-251) using their respective function names (e.g., getTestSpecificContract) to indicate possible thrown errors or "never throws" as appropriate.
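
The three retry-marker helpers named in the suggested doc updates are only visible here through their signatures and `@returns` text. A minimal sketch of what such helpers could look like follows; the marker string is invented for illustration, and the idempotent append is a design assumption rather than confirmed behavior.

```typescript
// Hypothetical marker value; the real constant is not shown in this PR page.
const RETRY_MARKER = "[[retry-attempt]]";

/** True when the prompt includes the internal retry marker. */
function hasRetryMarker(prompt: string): boolean {
	return prompt.indexOf(RETRY_MARKER) !== -1;
}

/** Appends the marker once; calling it again leaves the prompt unchanged. */
function appendRetryMarker(prompt: string): string {
	return hasRetryMarker(prompt) ? prompt : `${prompt}\n${RETRY_MARKER}`;
}

/** Removes every marker occurrence and any trailing whitespace left behind. */
function stripRetryMarker(prompt: string): string {
	return prompt.split(RETRY_MARKER).join("").replace(/\s+$/, "");
}
```

An in-band marker lets an adapter tell a first attempt from a retry without widening the harness interface, at the cost of having to strip it before the prompt reaches the model.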

coderabbitai · 2026-02-15T15:22:54Z

+/** Creates an OpenCode harness adapter. */
 export function createOpenCodeAdapter(): Harness {


⚠️ Potential issue | 🟡 Minor

Expand createOpenCodeAdapter TSDoc to include returns/throws.
As an exported function, it should include the full TSDoc tag set.

✍️ Suggested doc update

-/** Creates an OpenCode harness adapter. */
+/**
+ * Creates an OpenCode harness adapter.
+ *
+ * `@returns` Harness instance for OpenCode
+ * `@throws` {Error} Never.
+ */

As per coding guidelines: "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".

🤖 Prompt for AI Agents

In `@src/harnesses/opencode-adapter.ts` around lines 46 - 47, Update the TSDoc for the exported function createOpenCodeAdapter to include a brief purpose description plus explicit `@returns` and `@throws` tags: state that it returns a Harness instance (type: Harness) and document any runtime errors that callers should expect (e.g., initialization/config errors or exceptions thrown during adapter construction). Attach these tags to the existing docblock above createOpenCodeAdapter so the exported function has purpose, returns, and throws documented for consumers.

coderabbitai · 2026-02-15T15:22:54Z

+			try {
+				scoringResult = await scoreGeneration(
+					item.test,
+					generation.output ?? "", // empty string OK when codeFilePath is set
+					undefined, // use default timeout
+					generation.codeFilePath, // pass file path from tool-calling harness
+				);
+			} catch (scoringError) {
+				const scoringErrorMessage =
+					scoringError instanceof Error
+						? scoringError.message
+						: String(scoringError);
+				if (
+					supportsCompileRetry &&
+					harnessForRetry &&
+					runtimeForRetry &&
+					promptForRetry.length > 0
+				) {
+					const retryFromException = await runCompileFeedbackRetry({
+						item,
+						harness: harnessForRetry,
+						runtime: runtimeForRetry,
+						promptForRetry,
+						timeoutMs,
+						unloadAfter,
+						log,
+						currentGenerationDurationMs: generation.durationMs,
+						compileError: scoringErrorMessage,
+					});
+					if (retryFromException) {
+						compileRetryUsed = true;
+						generation = retryFromException.generation;
+						scoringResult = retryFromException.scoringResult;
+					} else {
+						throw scoringError;
+					}
+				} else {
+					throw scoringError;
+				}
+			}
+
+			const compileError =
+				scoringResult.failureType === "import" ||
+				scoringResult.failureType === "missing_export"
+					? scoringResult.error
+					: undefined;
+			if (
+				!compileRetryUsed &&
+				supportsCompileRetry &&
+				typeof compileError === "string" &&
+				harnessForRetry &&
+				runtimeForRetry &&
+				promptForRetry.length > 0
+			) {
+				const retryAttempt = await runCompileFeedbackRetry({
+					item,
+					harness: harnessForRetry,
+					runtime: runtimeForRetry,
+					promptForRetry,
+					timeoutMs,
+					unloadAfter,
+					log,
+					currentGenerationDurationMs: generation.durationMs,
+					compileError,
+				});
+				if (retryAttempt) {
+					const previousPassed = scoringResult.passed;
+					const shouldPromoteRetry =
+						retryAttempt.scoringResult.passed > previousPassed ||
+						(retryAttempt.scoringResult.passed === previousPassed &&
+							scoringResult.failureType === "import" &&
+							retryAttempt.scoringResult.failureType !== "import");
+					if (shouldPromoteRetry) {
+						generation = retryAttempt.generation;
+						scoringResult = retryAttempt.scoringResult;
+						log.info(
+							{
+								harness: item.harness,
+								test: item.test,
+								passType: item.passType,
+								beforePassed: previousPassed,
+								afterPassed: retryAttempt.scoringResult.passed,
+							},
+							"Compile-feedback retry promoted as best attempt",
+						);
+					} else {
+						log.warn(
+							{
+								harness: item.harness,
+								test: item.test,
+								passType: item.passType,
+								beforePassed: previousPassed,
+								retryPassed: retryAttempt.scoringResult.passed,
+							},
+							"Compile-feedback retry did not improve score; keeping original attempt",
+						);
+					}
+				}


⚠️ Potential issue | 🟠 Major

Retry promotion overwrites the original attempt without preserving evidence.
When generation and scoringResult are replaced (Line 392), the initial failure context is lost in MatrixItemResult, which makes runs harder to explain or audit. Please persist both attempts (e.g., store an attempts/retryHistory array or include the original scoring failure alongside the promoted result) and keep the promoted attempt as best.

Based on learnings: "Never implicitly fix up results after a run; capture enough evidence to explain outcomes (test failures, eval reasoning, durations, best-effort metrics)".

🤖 Prompt for AI Agents

In `@src/runner/item-executor.ts` around lines 319 - 416, The current promotion logic in the run that sets generation = retryAttempt.generation and scoringResult = retryAttempt.scoringResult loses the original attempt evidence; before replacing these fields in the promotion branch of runCompileFeedbackRetry (the block handling shouldPromoteRetry around retryAttempt and variables generation, scoringResult, previousPassed), capture the original attempt (generation, scoringResult, duration, and any failureType/error) and append it to a persistent attempts/retryHistory on the item result (or include an originalAttempt field on MatrixItemResult) so both the original and promoted attempts are stored; then perform the promotion (replace generation/scoringResult) but leave the new history entry referencing both attempts and any metadata (timestamps, durations, pass counts) so auditing/metrics can reconstruct the full decision.
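
One way to keep the promotion rule while preserving evidence, as the comment requests, is to return both attempts next to the chosen one. The promotion condition below mirrors the diff above; everything else (`Attempt`, `ItemOutcome`, `resolveAttempts`, and the `attempts` field) is a hypothetical shape, not the project's `MatrixItemResult`.

```typescript
interface ScoringSummary {
	passed: number;
	failureType?: string;
	error?: string;
}

interface Attempt {
	generationDurationMs: number;
	scoring: ScoringSummary;
}

interface ItemOutcome {
	best: Attempt;
	attempts: Attempt[]; // full history, original attempt first
	promotedRetry: boolean;
}

/** Same rule as the diff: more passes, or equal passes and no longer an import failure. */
function shouldPromoteRetry(original: ScoringSummary, retry: ScoringSummary): boolean {
	return (
		retry.passed > original.passed ||
		(retry.passed === original.passed &&
			original.failureType === "import" &&
			retry.failureType !== "import")
	);
}

/** Picks the best attempt without discarding the other one. */
function resolveAttempts(original: Attempt, retry: Attempt): ItemOutcome {
	const promotedRetry = shouldPromoteRetry(original.scoring, retry.scoring);
	return {
		best: promotedRetry ? retry : original,
		attempts: [original, retry],
		promotedRetry,
	};
}
```

With this shape, the original import failure and its error message remain in `attempts[0]` even when the retry is promoted, so a later audit can reconstruct why the score changed.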

coderabbitai · 2026-02-15T15:22:54Z

+/**
+ * Purpose: Unit tests for harness code-output policy helpers.
+ * Exports: (none)
+ */


⚠️ Potential issue | 🟡 Minor

Add an invariants line to the file header.
The header is missing invariants, which is required for TS/JS files.

✍️ Suggested header update

 /**
  * Purpose: Unit tests for harness code-output policy helpers.
  * Exports: (none)
+ * Invariants: Tests are deterministic and network-free.
  */

As per coding guidelines: "Every file begins with a short header documenting purpose, exports, and invariants".

🤖 Prompt for AI Agents

In `@test/code-output-policy.test.ts` around lines 1 - 4, Add an "Invariants" line to the file header in test/code-output-policy.test.ts so it follows the project header standard; update the existing comment block (the header with Purpose and Exports) to include a third line beginning with "Invariants:" and describe the file-level guarantees (e.g., test-only helpers, no side effects, or the specific invariant relevant to these unit tests) to make the header include Purpose, Exports, and Invariants.

tighten todo prompt and export exceptions

a955a14

coderabbitai Bot reviewed Feb 12, 2026

Comment thread src/schemas/common.schema.ts

Comment thread src/tests/todo-app/prompt.blind.md Outdated

chore: dedupe todo blind prompt and bump schema version to 0.2.2

3467af7

vercel Bot deployed to Preview February 12, 2026 19:54 View deployment

coderabbitai Bot reviewed Feb 12, 2026

Tighten benchmark prompts and align docs for fair scoring

e2a5ffe

vercel Bot deployed to Preview February 12, 2026 21:04 View deployment

coderabbitai Bot reviewed Feb 12, 2026

Comment thread src/lib/stdout-suppressor.ts

dashboard: keep only latest benchmark run in frontend data

997c6e1

vercel Bot deployed to Preview February 15, 2026 15:14 View deployment

harnesses: add code output policy and wire execution updates

bdfddb5

vercel Bot deployed to Preview February 15, 2026 15:15 View deployment

coderabbitai Bot reviewed Feb 15, 2026

Comment thread apps/dashboard/public/results/20260213-075756-7dc316/plan.json

Comment thread apps/dashboard/public/results/index.json

AustinKelsay merged commit 267a50c into main Feb 15, 2026
2 of 3 checks passed

coderabbitai Bot reviewed Feb 15, 2026

coderabbitai Bot mentioned this pull request Feb 17, 2026

dashboard: add coverage diagnostics and improve run alignment #9

Merged

This was referenced Mar 4, 2026

feat: add test categories and catalog metadata #12

Merged

Feature/result checkpointing and aggregation #13

Merged

This was referenced Mar 24, 2026

Add trusted signal assessment to benchmark results #20

Merged

Staging #19

Merged
