Staging by AustinKelsay · Pull Request #24 · AustinKelsay/plebdev-bench

AustinKelsay · 2026-04-10T21:29:05Z

Summary by CodeRabbit

New Features
- Ollama-only default execution; per-run isolated OpenCode workspaces; residency guard and optional model-exclusions in run plans; progress checkpointing.
Bug Fixes
- Preserve process output/duration on failures; improved timeout/stale-output detection; clearer error reporting and retry behavior.
Enhancements
- Two new taint reason codes and deterministic taint merging; richer retry/scoring metadata; schema/plan parsing now preserves additive/future fields and legacy compatibility.
Tests
- Large expansion of regression/unit tests across adapters, OpenCode/Goose, residency, retries, and schemas.
Documentation
- README, CLI help and docs updated to reflect Ollama-first execution and OpenCode isolation.

[codex] bench: harden signal assessment and retry fairness

vercel · 2026-04-10T21:29:12Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
plebdev-bench-dashboard	Ready	Preview, Comment	Apr 25, 2026 7:31pm

coderabbitai · 2026-04-10T21:29:21Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Converges execution to Ollama-only, adds Ollama residency enforcement and model-eligibility exclusions, extends signal-assessment with new taint reasons and merging, refactors OpenCode into modular CLI/runner/artifacts/events/permissions, removes vLLM/OpenAI-compat code, tightens schemas/types for artifact vs supported runtimes, and adds broad test coverage and runner checkpointing helpers.

Changes

Cohort / File(s)	Summary
Schemas & public types `src/schemas/common.schema.ts`, `src/schemas/index.ts`, `src/schemas/*.schema.ts`, `apps/dashboard/src/lib/schemas.ts`, `apps/dashboard/src/lib/types.ts`	Bump schema version; split supported vs artifact runtime names; export artifact-compatible model-profile/registry types; add `ModelExclusionSchema`/`ModelExclusion`; export `MatrixItemResultSchema`; make `RunPlanSchema` a preprocess that preserves unknown/additive fields and accepts `modelExclusions`.
Runner / plan / eligibility / progress `src/runner/plan-builder.ts`, `src/runner/index.ts`, `src/runner/item-executor.ts`, `src/runner/model-eligibility.ts`, `src/runner/model-availability.ts`, `src/runner/run-progress.ts`	Enforce Ollama residency per model group; add `filterGenerativeModels` producing `ModelExclusion` entries; change checkpointing to progress checkpoints; thread/merge `signalAssessment` across generation/scoring; remove `vllmBaseUrl` from plan config.
OpenCode modularization `src/harnesses/opencode-cli.ts`, `src/harnesses/opencode-runner.ts`, `src/harnesses/opencode-artifacts.ts`, `src/harnesses/opencode-events.ts`, `src/harnesses/opencode-provider.ts`, `src/harnesses/opencode-permissions.ts`, `src/harnesses/opencode-*`	Break OpenCode into: CLI feature detection & args builder, robust execa runner with timeout/stale-output handling, per-run artifact management, JSONL event parser with tool-call extraction & permission detection, provider spec builder and permission policy. Removed old opencode process helpers and monolithic output normalizer.
Adapters & outputs (Goose/Direct/OpenCode) `src/harnesses/goose-adapter.ts`, `src/harnesses/goose-output.ts`, `src/harnesses/direct-adapter.ts`, `src/harnesses/opencode-adapter.ts`	Constrain adapters to Ollama, centralize hygiene utils (goose-utils), extract/deduplicate taint reasons from stdout/stderr/normalized outputs, attach/merge `signalAssessment` across attempts, and overhaul retry/timeout/error handling and output selection.
Runtimes & residency `src/runtimes/index.ts`, `src/runtimes/runtime.ts`, `src/runtimes/ollama-residency.ts`, `src/runtimes/ollama-runtime.ts`, deleted `vllm`/discovery files	Remove vLLM and runtime discovery; make active runtime Ollama-only; add Ollama residency APIs (list/unload/ensureOnlyOllamaModelLoaded) and enrich `ModelInfo` with `modelKind`, `capabilities`, and `metadata`.
Signal assessment core `src/lib/signal-assessment.ts`, `test/signal-assessment.test.ts`	Add transcript/input taint detectors (`internal_tool_transcript`, `agent_requested_input`), `getTranscriptOrInputTaintReasons`, `mergeSignalAssessments`; require `rowFailed` or `automatedScore` in `finalizeItemSignalAssessment` and append taint reasons deterministically.
Result schema & metrics `src/schemas/result.schema.ts`, `test/schema-regressions.test.ts`	`ScoringMetricsSchema` gains optional retry metadata (kind/reason/attempted/promoted) with a `.refine(...)` enforcing consistent presence/absence; `MatrixItemResultSchema.runtime` uses artifact runtime names.
Removed OpenAI-compat / vLLM modules `src/lib/openai-compat-client.ts`, `src/runtimes/vllm-runtime.ts`, `src/runtimes/discovery.ts`, `src/harnesses/goose-openai.ts`	Deleted OpenAI-compatible client, vLLM runtime and runtime discovery, and OpenAI base-path normalization helper.
OpenCode signal & tooling helpers `src/harnesses/opencode-signal.ts`, `src/harnesses/opencode-cli.ts`, `src/harnesses/opencode-artifacts.ts`, `src/harnesses/opencode-events.ts`, `src/harnesses/opencode-provider.ts`, `src/harnesses/opencode-permissions.ts`, `src/harnesses/opencode-runner.ts`	Add selection/taint builders, CLI feature detection with caching, artifact prep/cleanup, JSONL event parsing and tool-call extraction, provider spec normalization, permission policy and detection, and robust runner with stale-output/timeouts.
Tests & fixtures `test/**`, `test/utils/fixtures.ts`	Large expansion and refactors of unit/regression tests covering OpenCode, Ollama residency, adapter behaviors, schema regressions, retry metadata, model availability, runner residency orchestration, and new deterministic fixtures/helpers.
CLI, compare, docs, dashboard `src/cli/run-command.ts`, `src/cli/compare-command.ts`, `src/cli/compare-utils.ts`, `src/cli/compare-formatters.ts`, `README.md`, `llm/`, `apps/dashboard/`, `package.json`	Make CLI/docs Ollama-first (deprecate vllm flags/scripts), refactor `bench compare` into utils + formatters, document OpenCode isolation, update dashboard UI/text and color helper, remove vLLM examples.
Misc utilities & refactors `src/lib/`, `src/harnesses/`, logger, benchmark-checkpoint, model-profile registry compat	Add goose-utils, logger pretty-stream/env validation, export core benchmark assets, add model-profile registry-compat normalizer, many small API/signature/schema adjustments and formatting cleanups.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Runner as Runner
    participant Plan as PlanBuilder
    participant Executor as ItemExecutor
    participant Runtime as OllamaRuntime
    participant Harness as HarnessAdapter
    participant OpenCode as OpenCodeRunner
    participant Events as EventParser
    participant Assessor as SignalAssessor
    participant Writer as ResultWriter

    Runner->>Plan: buildRunPlan(config)
    Runner->>Executor: executeItem(item)
    Executor->>Runtime: ensureOnlyOllamaModelLoaded(allowedModel?)
    Executor->>Harness: generate(prompt, opts)
    alt harness is OpenCode
        Harness->>OpenCode: prepareArtifacts + buildOpenCodeRunArgs
        Harness->>OpenCode: runOpenCodeCommand(args, env, cwd, timeout)
        OpenCode-->>Harness: stdout/stderr/exitCode
        Harness->>Events: parseOpenCodeEvents(stdout+stderr)
        Events-->>Harness: { output, method, permissionDenied }
    else other harness
        Harness-->>Executor: stdout/stderr/exitCode
    end
    Harness->>Assessor: buildSignalAssessment(parsed?, stdout, stderr)
    Assessor-->>Executor: signalAssessment
    Executor->>Assessor: mergeSignalAssessments(generationAssessment, scoringAssessment)
    Assessor-->>Executor: finalSignalAssessment
    Executor->>Writer: writeResult(result)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

[codex] rebuild opencode harness #25 — Major OpenCode infra refactor (provider/runner/artifacts/events/permissions) with strong overlap.
[codex] bench: harden signal assessment and retry fairness #23 — Overlapping signal-assessment changes (new taint reasons and merging across adapters/runner).
Add trusted signal assessment to benchmark results #20 — Related propagation of signal-assessment through harnesses/results and matching schema updates.

Poem

🐇 I sniffed the stdout trail and hopped to find,

New taints and guards to keep the models kind.
Ollama watches doors while OpenCode builds its lair,
Artifacts tidy, tests clap tiny paws in the air.
Hooray — the bench runs cleaner, cautious, and fair.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch staging

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/schemas/common.schema.ts (1)
174-182: ⚠️ Potential issue | 🟠 Major

Version this additive on-disk schema change.

These new signalAssessment.reasons literals will be emitted into artifacts, but SCHEMA_VERSION stays 0.5.0. That leaves two different serialized shapes under the same version and prevents explicit compatibility or migration handling.

As per coding guidelines "Maintain versioned schemas with schemaVersion field and use additive changes; migrations must be explicit".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/schemas/common.schema.ts` around lines 174 - 182, This change adds new
literals to signalAssessmentReasonTypes which alters on-disk artifacts, so bump
the package schema version constant (SCHEMA_VERSION) to a new semver (e.g.,
0.5.1) and ensure any serialized objects include the updated schemaVersion field
where artifacts are created/serialized (search for usages that stamp
schemaVersion in export/save code and update them to emit the new value); update
tests/fixtures that assert schemaVersion or serialized shapes to the new version
as well.
src/runner/item-executor.ts (1)
138-157: ⚠️ Potential issue | 🟠 Major

Carry harness-supplied failure metadata through these catch blocks.

Both paths rebuild the failure from error.message only, so the output, signalAssessment, and durationMs now attached by src/harnesses/goose-adapter.ts and src/harnesses/opencode-adapter.ts are dropped. A failed row can therefore lose the very taint reason this PR is adding and end up recorded as a generic trustworthy failure.

Based on learnings "For per-item runtime failures (timeouts, model errors, eval failures), record structured failure in run.json and continue the matrix".

Also applies to: 334-368
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/runner/item-executor.ts` around lines 138 - 157, The catch blocks
reconstruct failure metadata from error.message and drop harness-provided
fields; update the catch handlers in item-executor.ts (the blocks that set
generation, generationFailure, and call finalizeItemSignalAssessment) to
merge/retain any harness-supplied metadata present on the thrown value (e.g.
(error as any).output, .durationMs, .failureType, and any signalAssessment info)
instead of overwriting them—use the existing
classifyGenerationError(errorMessage) only if a failureType isn't present, set
generation.durationMs to the harness-provided duration if available, preserve
generation.output when provided, and build generationFailure.type/message from
the harness fields when present; apply the same merging logic to both catch
sites that currently call finalizeItemSignalAssessment so the harness-supplied
taint reason and duration are carried through.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/harnesses/goose-adapter.ts`:
- Around line 515-529: The error handler currently only inspects
execaError.stderr, losing context printed to stdout; update the catch branch in
goose-adapter.ts (the block using execaError, getTranscriptOrInputTaintReasons,
and appendSignalAssessmentReasons) to aggregate stdout and stderr (e.g., const
combined = `${execaError.stdout || ''}\n${execaError.stderr || ''}`), run
getTranscriptOrInputTaintReasons(combined) and use combined for the Error
message and output property, and adjust the execaError type to include stdout so
both streams are preserved in the thrown Error's signalAssessment and output.

In `@src/lib/signal-assessment.ts`:
- Around line 22-31: The current INTERNAL_TOOL_TRANSCRIPT_PATTERNS is too
permissive and matches isolated tokens; tighten detection by replacing the
single-pattern test with a composite matcher that requires stronger transcript
shapes—e.g., check for JSON-like pairs appearing together (require both
"sessionID" and a nearby "type":"tool_|step_" within ~200 chars or same line),
or require tool-specific markers plus filePath/function signatures together
(e.g., "<function=bash>" with "<parameter=filePath>" or "filePath:" in the same
block); update the code that inspects failed rows to use this new composite
matcher (instead of matching any entry in INTERNAL_TOOL_TRANSCRIPT_PATTERNS
directly) and add a regression test that feeds a benign artifact containing only
"sessionID" or "step_start"/"tool_call" tokens and asserts it is NOT classified
as internal_tool_transcript while another test includes the stronger multi-token
transcript and asserts it IS classified.

---

Outside diff comments:
In `@src/runner/item-executor.ts`:
- Around line 138-157: The catch blocks reconstruct failure metadata from
error.message and drop harness-provided fields; update the catch handlers in
item-executor.ts (the blocks that set generation, generationFailure, and call
finalizeItemSignalAssessment) to merge/retain any harness-supplied metadata
present on the thrown value (e.g. (error as any).output, .durationMs,
.failureType, and any signalAssessment info) instead of overwriting them—use the
existing classifyGenerationError(errorMessage) only if a failureType isn't
present, set generation.durationMs to the harness-provided duration if
available, preserve generation.output when provided, and build
generationFailure.type/message from the harness fields when present; apply the
same merging logic to both catch sites that currently call
finalizeItemSignalAssessment so the harness-supplied taint reason and duration
are carried through.

In `@src/schemas/common.schema.ts`:
- Around line 174-182: This change adds new literals to
signalAssessmentReasonTypes which alters on-disk artifacts, so bump the package
schema version constant (SCHEMA_VERSION) to a new semver (e.g., 0.5.1) and
ensure any serialized objects include the updated schemaVersion field where
artifacts are created/serialized (search for usages that stamp schemaVersion in
export/save code and update them to emit the new value); update tests/fixtures
that assert schemaVersion or serialized shapes to the new version as well.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b8ce16e5-0258-4842-bdc5-f5d7e80b9ec9

📥 Commits

Reviewing files that changed from the base of the PR and between 56b63e7 and 9c372ee.

📒 Files selected for processing (16)

apps/dashboard/src/lib/schemas.ts
apps/dashboard/src/lib/types.ts
src/harnesses/goose-adapter.ts
src/harnesses/goose-output.ts
src/harnesses/opencode-adapter.ts
src/harnesses/opencode-output.ts
src/lib/signal-assessment.ts
src/runner/item-executor.ts
src/schemas/common.schema.ts
test/dashboard-schema-compat.test.ts
test/goose-adapter.test.ts
test/goose-output.test.ts
test/item-retry.test.ts
test/opencode-adapter.test.ts
test/opencode-output.test.ts
test/signal-assessment.test.ts

- keep stdout/stderr and signal assessments from Goose failures - carry structured generation failure details through executeItem - bump schema version and accept prior current-version artifacts

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/harnesses/goose-adapter.ts (1)

323-367: ⚠️ Potential issue | 🟠 Major

Scan stderr for taint even when stdout is present.

transcriptOrInputReasons is currently derived from rawOutput/output, but rawOutput is assigned after the stdout→stderr fallback. That means a successful Goose run can print code on stdout and leak agent_requested_input or internal_tool_transcript only on stderr, and this adapter will still return no signalAssessment.

💡 Suggested fix

 				let output = result.stdout;
 				const stderr = result.stderr?.trim() || "";
+				const stdout = result.stdout?.trim() || "";
@@
-				const rawOutput = output;
+				const rawOutput = output;
 				const normalized = normalizeGooseOutput(output);
@@
 				const transcriptOrInputReasons = Array.from(
 					new Set([
+						...getTranscriptOrInputTaintReasons(stdout),
+						...getTranscriptOrInputTaintReasons(stderr),
 						...getTranscriptOrInputTaintReasons(rawOutput),
 						...getTranscriptOrInputTaintReasons(output),
 					]),
 				);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/goose-adapter.ts` around lines 323 - 367, The code only checks
getTranscriptOrInputTaintReasons against rawOutput/output but rawOutput is set
after the stdout→stderr fallback, so taint on stderr is missed when stdout
exists; update the logic in the goose adapter (around variables rawOutput,
output and calls to
normalizeGooseOutput/getTranscriptOrInputTaintReasons/appendSignalAssessmentReasons)
to always include stderr in the taint scan (for example compute rawOutputParts =
[result.stdout, result.stderr] or call getTranscriptOrInputTaintReasons on
stderr as well) before building transcriptOrInputReasons, so that signals like
agent_requested_input or internal_tool_transcript found only on stderr are
detected and passed into appendSignalAssessmentReasons.

🧹 Nitpick comments (1)

test/machine-profile-migration.test.ts (1)

131-159: Expand this schema-bump regression to cover run artifacts too.

Lines 131-159 only validate the "0.5.0" compatibility path for parseKnownPlanPayload. Add a sibling parseKnownRunPayload assertion so plan/run migration guarantees stay symmetric.

♻️ Suggested test extension

 it("accepts prior current-version artifacts after a schema bump", () => {
 	const parsedPlan = parseKnownPlanPayload({
 		schemaVersion: "0.5.0",
 		runId: "run-current-minus-one",
 		createdAt: "2026-03-05T21:51:18.583Z",
 		runtimeEnvironment: {
 			platform: "darwin",
 			bunVersion: "1.3.3",
 		},
 		machine: LEGACY_MACHINE,
 		config: {
 			ollamaBaseUrl: "http://localhost:11434",
 			vllmBaseUrl: "http://localhost:8000",
 			generateTimeoutMs: 120_000,
 			passTypes: ["blind"],
 		},
 		items: [],
 		summary: {
 			totalItems: 0,
 			runtimes: 0,
 			models: 0,
 			harnesses: 0,
 			tests: 0,
 		},
 	});
+	const parsedRun = parseKnownRunPayload({
+		schemaVersion: "0.5.0",
+		runId: "run-current-minus-one",
+		machine: LEGACY_MACHINE,
+		startedAt: "2026-03-05T21:51:18.583Z",
+		completedAt: "2026-03-05T21:52:18.583Z",
+		durationMs: 60_000,
+		summary: {
+			total: 0,
+			completed: 0,
+			failed: 0,
+			pending: 0,
+		},
+		items: [],
+	});
 
 	expect(parsedPlan.schemaVersion).toBe(SCHEMA_VERSION);
 	expect(parsedPlan.machine?.profileKey).toBe(LEGACY_PROFILE_KEY);
+	expect(parsedRun.schemaVersion).toBe(SCHEMA_VERSION);
+	expect(parsedRun.machine?.profileKey).toBe(LEGACY_PROFILE_KEY);
 });

Based on learnings: “Maintain versioned schemas with schemaVersion field and use additive changes; migrations must be explicit.”

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/machine-profile-migration.test.ts` around lines 131 - 159, Add a
mirrored assertion for run artifacts: alongside the existing
parseKnownPlanPayload test that feeds a "0.5.0" payload using LEGACY_MACHINE,
call parseKnownRunPayload with the same payload shape (schemaVersion "0.5.0",
runId, createdAt, runtimeEnvironment, machine: LEGACY_MACHINE, config, items,
summary) and assert the returned object's schemaVersion equals SCHEMA_VERSION
and machine?.profileKey equals LEGACY_PROFILE_KEY so plan/run migration behavior
stays symmetric (look for parseKnownPlanPayload and add the sibling
parseKnownRunPayload assertion near it).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/harnesses/goose-adapter.ts`:
- Around line 515-530: The error message and output are being built from
combined which becomes "\n" when both stdout and stderr are empty, causing the
real execaError.message to be lost; update the logic around the
execaError/combined variables in the Goose failure handling block so that you
first compute a trimmedCombined (or similar) from execaError.stdout and
execaError.stderr and if that result is empty use execaError.message as the
fallback for both the thrown Error message and the output property; keep the
existing calls to getTranscriptOrInputTaintReasons and
appendSignalAssessmentReasons unchanged but feed them the effective output
(trimmedCombined or execaError.message) so downstream signalAssessment and
output reflect the real failure.

---

Outside diff comments:
In `@src/harnesses/goose-adapter.ts`:
- Around line 323-367: The code only checks getTranscriptOrInputTaintReasons
against rawOutput/output but rawOutput is set after the stdout→stderr fallback,
so taint on stderr is missed when stdout exists; update the logic in the goose
adapter (around variables rawOutput, output and calls to
normalizeGooseOutput/getTranscriptOrInputTaintReasons/appendSignalAssessmentReasons)
to always include stderr in the taint scan (for example compute rawOutputParts =
[result.stdout, result.stderr] or call getTranscriptOrInputTaintReasons on
stderr as well) before building transcriptOrInputReasons, so that signals like
agent_requested_input or internal_tool_transcript found only on stderr are
detected and passed into appendSignalAssessmentReasons.

---

Nitpick comments:
In `@test/machine-profile-migration.test.ts`:
- Around line 131-159: Add a mirrored assertion for run artifacts: alongside the
existing parseKnownPlanPayload test that feeds a "0.5.0" payload using
LEGACY_MACHINE, call parseKnownRunPayload with the same payload shape
(schemaVersion "0.5.0", runId, createdAt, runtimeEnvironment, machine:
LEGACY_MACHINE, config, items, summary) and assert the returned object's
schemaVersion equals SCHEMA_VERSION and machine?.profileKey equals
LEGACY_PROFILE_KEY so plan/run migration behavior stays symmetric (look for
parseKnownPlanPayload and add the sibling parseKnownRunPayload assertion near
it).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 44b63a0b-5e91-4db2-bf85-a11925c28932

📥 Commits

Reviewing files that changed from the base of the PR and between 9c372ee and b6f41a8.

📒 Files selected for processing (11)

src/harnesses/goose-adapter.ts
src/lib/machine-profile/legacy.ts
src/lib/signal-assessment.ts
src/runner/item-executor.ts
src/schemas/common.schema.ts
test/dashboard-schema-compat.test.ts
test/goose-adapter.test.ts
test/item-executor.test.ts
test/machine-profile-migration.test.ts
test/schemas.test.ts
test/signal-assessment.test.ts

✅ Files skipped from review due to trivial changes (3)

test/schemas.test.ts
src/lib/machine-profile/legacy.ts
test/dashboard-schema-compat.test.ts

🚧 Files skipped from review as they are similar to previous changes (3)

test/goose-adapter.test.ts
test/signal-assessment.test.ts
src/schemas/common.schema.ts

- Scan stderr alongside stdout for taint reasons - Use execa message when Goose emits no stream output - Cover legacy run migration for machine profiles

- Remove vLLM and OpenAI-compatible runtime paths - Simplify harness, config, and docs around Ollama-only runs - Keep legacy vLLM artifacts readable for compare and history

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

README.md (1)
182-205: ⚠️ Potential issue | 🟡 Minor

Update the example model-profile schemaVersion.

This example still shows "0.5.0", but the current schema version in this PR is 0.5.1. As written, a copy-pasted config will be rejected by the parser.
Proposed fix
-  "schemaVersion": "0.5.0",
+  "schemaVersion": "0.5.1",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 182 - 205, Update the example model-profile JSON to
use the current schema version so pasted configs pass validation: change the
"schemaVersion" value in the README example from "0.5.0" to "0.5.1" (the
canonical model-profile example block that users load with --model-config should
reflect the PR's schemaVersion).

♻️ Duplicate comments (1)

src/runner/item-executor.ts (1)
88-145: ⚠️ Potential issue | 🟡 Minor

Validate signalAssessment against the real schema before persisting it.

isSignalAssessment() is still only a loose shape check, so malformed payloads from thrown harness/runtime errors can flow straight into run.json as long as they look like { classification, reasons: string[] }. This helper is normalizing unknown input at a boundary; it should parse against the actual schema/invariants, not just the outer shape.
#!/bin/bash
# Verify whether the persisted signal-assessment schema is stricter than the local guard.
rg -n -C3 'SignalAssessmentSchema|signalAssessmentReasons|SignalAssessmentReason|isSignalAssessment' src
Based on learnings, "Validate at boundaries with Zod and pass typed data inward" and "Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as unknown and parse with Zod immediately".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/runner/item-executor.ts` around lines 88 - 145, The current
isSignalAssessment loosenly checks shape and lets malformed payloads pass into
persisted run.json; replace that loose guard in extractGenerationFailureDetails
with a strict parse against the canonical schema (use
SignalAssessmentSchema.parse or safeParse) so only fully valid SignalAssessment
objects are accepted; specifically, in extractGenerationFailureDetails use
SignalAssessmentSchema.safeParse(errorRecord?.signalAssessment) and set
signalAssessment to the parsed value on success or undefined on failure (remove
or keep isSignalAssessment only for pre-checks but do not persist its result) so
all boundary inputs are validated by the real schema before being written.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/schemas/model-profile.schema.ts`:
- Around line 109-112: The schema change narrowed
ConfiguredModelProfileSchema.variants to SupportedRuntimeNameSchema which breaks
loading legacy persisted model-profile files that contain legacy runtime keys
like "vllm"; revert or relax the variants constraint so loadModelProfiles(...)
accepts unknown/legacy runtime keys (e.g., accept string keys or union with
legacy enum) or add an explicit migration step: update
ConfiguredModelProfileSchema.variants to allow legacy names (including "vllm")
alongside SupportedRuntimeNameSchema OR implement a migration function invoked
by loadModelProfiles that maps legacy keys (like "vllm") to current runtime
names (e.g., Ollama) before validation, ensuring backward compatibility without
changing runtime behavior.

---

Outside diff comments:
In `@README.md`:
- Around line 182-205: Update the example model-profile JSON to use the current
schema version so pasted configs pass validation: change the "schemaVersion"
value in the README example from "0.5.0" to "0.5.1" (the canonical model-profile
example block that users load with --model-config should reflect the PR's
schemaVersion).

---

Duplicate comments:
In `@src/runner/item-executor.ts`:
- Around line 88-145: The current isSignalAssessment loosenly checks shape and
lets malformed payloads pass into persisted run.json; replace that loose guard
in extractGenerationFailureDetails with a strict parse against the canonical
schema (use SignalAssessmentSchema.parse or safeParse) so only fully valid
SignalAssessment objects are accepted; specifically, in
extractGenerationFailureDetails use
SignalAssessmentSchema.safeParse(errorRecord?.signalAssessment) and set
signalAssessment to the parsed value on success or undefined on failure (remove
or keep isSignalAssessment only for pre-checks but do not persist its result) so
all boundary inputs are validated by the real schema before being written.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c0c41d59-1088-4aac-b6a9-34ef8ea71fed

📥 Commits

Reviewing files that changed from the base of the PR and between b2dd0fa and 8ea88b5.

📒 Files selected for processing (42)

README.md
apps/dashboard/src/lib/types.ts
llm/context/codebase-overview.md
llm/implementation/multi-runtime-mvp-implementation.md
llm/project/project-overview.md
llm/project/project-rules.md
llm/project/user-flow.md
package.json
src/cli/run-command.ts
src/harnesses/direct-adapter.ts
src/harnesses/goose-adapter.ts
src/harnesses/goose-openai.ts
src/harnesses/harness.ts
src/harnesses/opencode-adapter.ts
src/harnesses/opencode-config.ts
src/lib/benchmark-checkpoint.ts
src/lib/model-aliases.ts
src/lib/model-profile/normalization.ts
src/lib/model-profile/registry.ts
src/lib/openai-compat-client.ts
src/runner/index.ts
src/runner/item-executor.ts
src/runner/plan-builder.ts
src/runtimes/discovery.ts
src/runtimes/index.ts
src/runtimes/runtime.ts
src/runtimes/vllm-runtime.ts
src/schemas/common.schema.ts
src/schemas/config.schema.ts
src/schemas/index.ts
src/schemas/model-alias.schema.ts
src/schemas/model-profile.schema.ts
src/schemas/plan.schema.ts
src/schemas/result.schema.ts
test/benchmark-checkpoint.test.ts
test/build-index.test.ts
test/harness-compatibility.test.ts
test/item-executor.test.ts
test/opencode-config.test.ts
test/plan-builder.test.ts
test/schemas.test.ts
test/vllm-runtime.test.ts

💤 Files with no reviewable changes (11)

src/runner/index.ts
test/opencode-config.test.ts
test/benchmark-checkpoint.test.ts
package.json
src/lib/benchmark-checkpoint.ts
src/harnesses/goose-openai.ts
test/build-index.test.ts
test/vllm-runtime.test.ts
src/runtimes/discovery.ts
src/runtimes/vllm-runtime.ts
src/lib/openai-compat-client.ts

✅ Files skipped from review due to trivial changes (4)

src/lib/model-aliases.ts
llm/implementation/multi-runtime-mvp-implementation.md
llm/project/project-overview.md
src/schemas/model-alias.schema.ts

🚧 Files skipped from review as they are similar to previous changes (3)

test/schemas.test.ts
test/item-executor.test.ts
src/schemas/common.schema.ts

👮 Files not reviewed due to content moderation or server errors (8)

src/runtimes/runtime.ts
src/harnesses/opencode-config.ts
src/harnesses/goose-adapter.ts
src/harnesses/opencode-adapter.ts
src/cli/run-command.ts
llm/project/project-rules.md
src/schemas/result.schema.ts
llm/project/user-flow.md

- Remove deprecated `vllmBaseUrl` from plan config - Add optional Goose turn limits and test categories

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/dashboard/src/lib/schemas.ts`:
- Around line 426-430: Update the four goose turn-count fields (gooseMaxTurns,
gooseRetryMaxTurns, gooseWorkspaceMaxTurns, gooseWorkspaceRetryMaxTurns) to
enforce positive integers instead of any number: replace their
z.number().optional() validators with z.number().int().positive().optional() so
validation matches the core plan.schema.ts rules.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c706018b-a626-4ccd-b42b-6afb48e87e8a

📥 Commits

Reviewing files that changed from the base of the PR and between 8ea88b5 and 5c5bc91.

📒 Files selected for processing (2)

apps/dashboard/src/lib/schemas.ts
apps/dashboard/src/lib/types.ts

- Accept legacy model profile files by dropping unsupported runtime variants - Validate harness signal assessments with Zod before persisting failures - Tighten dashboard schema bounds for Goose turn counts

AustinKelsay · 2026-04-20T03:08:34Z

@CodeRabbit full review

coderabbitai · 2026-04-20T03:08:40Z

✅ Actions performed

Full review triggered.

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (10)

apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx (1)

270-313: Consider adding TSDoc to the exported function.

The exported BlindVsInformedChart function lacks TSDoc documentation. Adding it would improve discoverability and align with the project's documentation standards.

📝 Example TSDoc addition

+/**
+ * Renders a chart comparing pass rates between blind and informed prompts.
+ * Provides model-level and harness-level breakdowns with delta summaries.
+ *
+ * `@param` items - Array of matrix item results to analyze
+ * `@returns` A card component containing tabbed charts and summaries
+ */
 export function BlindVsInformedChart({ items }: BlindVsInformedChartProps) {

As per coding guidelines: "All exported functions must have TSDoc/JSDoc documentation including purpose, params, returns, and throws"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx` around
lines 270 - 313, Add a TSDoc block above the exported BlindVsInformedChart
function describing its purpose, parameters, and return value; document the
items parameter (type BlindVsInformedChartProps) and what the component renders
(chart tabs for model/harness breakdown), include `@param` for items and `@returns`
React.ReactElement (or JSX.Element), and add any `@throws` if it can throw—place
the doc comment immediately above the export function BlindVsInformedChart
declaration to satisfy the project guideline that all exported functions have
TSDoc.

apps/dashboard/src/lib/api.ts (1)

1-4: Move the module header back to the top of the file.

The new import block now appears before the required file header. Put the purpose/exports/invariants header first, then imports.

♻️ Proposed reorder

-import {
-	parseKnownPlanPayload,
-	parseKnownRunPayload,
-} from "../../../../src/lib/machine-profile/legacy.js";
 /**
  * Purpose: Fetch run data from static JSON files.
  * Exports: fetchRuns, fetchRun, fetchPlan, fetchRunWithPlan, fetchDashboardIndex, fetchLatestAggregate
@@
  * Data is loaded from the results directory via Vite's dev server.
  * All fetched JSON is validated with Zod schemas at this boundary.
  */
+import {
+	parseKnownPlanPayload,
+	parseKnownRunPayload,
+} from "../../../../src/lib/machine-profile/legacy.js";

As per coding guidelines **/*.{ts,tsx,js,jsx}: Every file begins with a short header documenting purpose, exports, and invariants.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/lib/api.ts` around lines 1 - 4, The file header
documenting purpose/exports/invariants must be the very first thing in the
module; move the existing module header comment so it appears before the import
block that currently brings in parseKnownPlanPayload and parseKnownRunPayload
from "../../../../src/lib/machine-profile/legacy.js", and ensure the header
briefly states the file's purpose, its exported items, and any invariants as per
the coding guideline so the header precedes all imports and code.

llm/project/user-flow.md (1)

209-209: Clarify Line 209 to avoid mixed messaging about runtime selection.

Line 209 says runtime is fixed, but “unless explicitly passed as --runtimes ollama” still reads like a selectable runtime path. Consider simplifying to state that runtime is fixed to ollama in MVP.

Suggested doc tweak

-- **Discovery**: models/harnesses are auto-discovered by default; runtime is fixed to `ollama` unless explicitly passed as `--runtimes ollama`.
+- **Discovery**: models/harnesses are auto-discovered by default; runtime is fixed to `ollama` in MVP.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@llm/project/user-flow.md` at line 209, Edit the "Discovery" bullet so it
clearly states the runtime decision for MVP: remove the conditional phrase
"unless explicitly passed as `--runtimes ollama`" and reword to say "Discovery:
models/harnesses are auto-discovered by default; runtime is fixed to `ollama`
for the MVP." Reference the "Discovery" bullet and the string "`--runtimes
ollama`" when making the change to ensure the mixed message is eliminated.

llm/implementation/multi-runtime-mvp-implementation.md (1)

1-5: Consider making the historical framing even harder to miss.

Line 1 and Line 5 are clear, but a quick scan can still misread later sections as current behavior. A small follow-up (e.g., “Historical Behavior (Feb 8, 2026)” heading) would reduce ambiguity.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@llm/implementation/multi-runtime-mvp-implementation.md` around lines 1 - 5,
Add a very prominent historical warning header and timestamp to the top of the
document so readers cannot mistake it for current behavior: change the existing
opening lines in Multi-Runtime MVP Implementation to include a bold/ALL-CAPS
heading like "HISTORICAL BEHAVIOR (Feb 8, 2026) — DO NOT USE FOR CURRENT
RUNTIME" and add a one-line subheading such as "Archived for artifact/debugging
context; active benchmarks are Ollama-only" (replace the current first
paragraph). Ensure the new header appears before any other text so the
historical status is immediately obvious on quick scans.

src/cli/run-command.ts (1)

83-107: Consider a temporary deprecated --vllm-url shim for smoother script migration.

With Ollama-only execution, this is directionally correct; however, existing CI/scripts using --vllm-url will fail hard. A one-release no-op flag that warns would reduce breakage.

Possible transition patch

 export const runCommand = new Command("run")
@@
 	.option("--ollama-url <url>", "Ollama API base URL", "http://localhost:11434")
+	.option(
+		"--vllm-url <url>",
+		"Deprecated: ignored (runtime execution is Ollama-only)",
+	)
@@
 	.action(async (options) => {
 		try {
+			if (typeof options.vllmUrl === "string" && options.vllmUrl.trim().length > 0) {
+				logger.warn(
+					"Warning: --vllm-url is deprecated and ignored (Ollama-only execution)",
+				);
+			}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/cli/run-command.ts` around lines 83 - 107, Add a temporary deprecated
shim flag --vllm-url that accepts a URL and maps it to the existing --ollama-url
behavior: update the CLI option definitions near the existing
.option("--ollama-url <url>") to also declare .option("--vllm-url <url>") as a
deprecated alias, when parsing prefer the explicit --ollama-url but if
--vllm-url is present set the same internal variable (e.g., ollamaUrl) and emit
a single-time warning message informing users the flag is deprecated and will be
removed next release; ensure the shim does not change runtime semantics and that
the warning uses the same logger/console path used elsewhere in this module.

test/model-profile-registry.test.ts (1)

96-97: Consider avoiding schema-version churn in this fixture.

Line [96] hardcodes "0.5.1" even though this test targets unsupported runtime-key filtering, not version migration behavior. Using SCHEMA_VERSION would keep intent stable across future bumps.

♻️ Suggested small cleanup

 import {
 	ConfiguredModelProfileSchema,
 	ModelProfileFileSchema,
 } from "../src/schemas/model-profile.schema.js";
+import { SCHEMA_VERSION } from "../src/schemas/index.js";

 ...
-						schemaVersion: "0.5.1",
+						schemaVersion: SCHEMA_VERSION,

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/model-profile-registry.test.ts` around lines 96 - 97, Replace the
hardcoded schemaVersion string "0.5.1" in the test fixture with the canonical
SCHEMA_VERSION constant: import SCHEMA_VERSION into the test and set
schemaVersion: SCHEMA_VERSION in the object used by the model-profile registry
test (the fixture that currently contains schemaVersion and models). This keeps
the test focused on unsupported runtime-key filtering and avoids future churn
when the schema version is bumped.

src/harnesses/opencode-config.ts (1)

128-162: Consider removing unused runtimeName parameter.

Since runtimeName can only be "ollama" and isn't used to branch logic in the function body (no runtime-specific env tuning remains), consider removing it from the interface to simplify the API.

♻️ Optional simplification

 export function buildOpenCodeEnv(opts: {
 	configPath: string;
 	configJson: string;
-	runtimeName: "ollama";
 }): Record<string, string> {
-	const { configPath, configJson, runtimeName } = z
+	const { configPath, configJson } = z
 		.object({
 			configPath: z.string().min(1),
 			configJson: z.string().min(1),
-			runtimeName: RuntimeNameSchema,
 		})
 		.parse(opts);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/opencode-config.ts` around lines 128 - 162, The function
buildOpenCodeEnv currently accepts an opts parameter with a runtimeName property
that is never used; remove runtimeName from the function signature and from the
zod validation to simplify the API: update the opts type to only include
configPath and configJson, remove runtimeName from the destructuring and from
the z.object() schema (RuntimeNameSchema reference), and adjust any call sites
that pass runtimeName to stop providing it or to only pass the two remaining
fields so callers compile.

test/plan-builder.test.ts (1)

16-102: Consider extracting fallback helper to a shared test utility.

The fallbackCollectMachineProfile function is substantial (~85 lines) and may be useful in other test files. Consider extracting it to a shared test fixtures module.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/plan-builder.test.ts` around lines 16 - 102, The
fallbackCollectMachineProfile helper is large and should be extracted to a
shared test fixtures module: create a new test utility file (e.g.,
test/utils/fixtures.ts) exporting fallbackCollectMachineProfile as a named
export, move the function body there (preserving its signature and any helper
readNonEmpty logic and types), update test/plan-builder.test.ts to import {
fallbackCollectMachineProfile } from the new module, and ensure any TypeScript
types referenced by the function are exported or inlined so existing tests
compile unchanged.

src/harnesses/opencode-adapter.ts (1)

546-564: Preserve output/durationMs in the execa-error path too.

This branch still only inspects stderr and omits output plus durationMs. When executeItem catches this, it falls back to durationMs: 0 and loses any failure text emitted on stdout, even though the non-zero-exit path already preserves that evidence.

Possible shape

-				if (error && typeof error === "object" && "stderr" in error) {
-					const execaError = error as { stderr: string; message: string };
-					const errorReasons = [
-						...(hasPermissionDeniedStderr(execaError.stderr)
+				if (error && typeof error === "object" && "stderr" in error) {
+					const execaError = error as {
+						stdout?: string;
+						stderr?: string;
+						message: string;
+					};
+					const effectiveOutput =
+						[execaError.stdout, execaError.stderr]
+							.map((part) => part?.trim() ?? "")
+							.filter((part) => part.length > 0)
+							.join("\n") || execaError.message;
+					const durationMs = Math.round(performance.now() - startTime);
+					const errorReasons = [
+						...(hasPermissionDeniedStderr(execaError.stderr ?? "")
 							? (["tool_permission_denied"] as const)
 							: []),
-						...getTranscriptOrInputTaintReasons(execaError.stderr),
+						...getTranscriptOrInputTaintReasons(effectiveOutput),
 					];
 					throw Object.assign(
-						new Error(
-							`OpenCode failed: ${execaError.stderr || execaError.message}`,
-						),
+						new Error(`OpenCode failed: ${effectiveOutput}`),
 						{
 							signalAssessment:
 								errorReasons.length > 0
 									? appendSignalAssessmentReasons(undefined, errorReasons)
 									: undefined,
+							durationMs,
+							output: effectiveOutput,
 						},
 					);
 				}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/opencode-adapter.ts` around lines 546 - 564, The execa-error
branch in the error handler for executeItem only uses execaError.stderr and
message, dropping execa output and timing; update the throw to preserve
execaError.output (or output/stdout) and durationMs so executeItem can see the
real failure text and timing. Specifically, in the block that builds execaError
and computes errorReasons (symbols: execaError, hasPermissionDeniedStderr,
getTranscriptOrInputTaintReasons, appendSignalAssessmentReasons,
signalAssessment), include the original output and durationMs properties when
calling Object.assign on the new Error so the thrown object contains output and
durationMs for downstream handling by executeItem.

src/schemas/common.schema.ts (1)

38-54: Encode the runtime-list relationship from a single source of truth.

artifactRuntimeNames repeats "ollama" separately from supportedRuntimeNames, so the active-runtime and artifact-runtime lists can drift on a future edit. Building the artifact list from supportedRuntimeNames makes that invariant explicit.

♻️ Proposed refactor

 /** Valid runtime names for active benchmark execution. */
 export const supportedRuntimeNames = ["ollama"] as const;

 /** Zod schema for runtimes accepted by current config and execution flows. */
 export const SupportedRuntimeNameSchema = z.literal("ollama");

 /** Runtime name type accepted by current config and execution flows. */
 export type SupportedRuntimeName = z.infer<typeof SupportedRuntimeNameSchema>;

 /** Valid runtime names allowed when reading historical plan/result artifacts. */
-export const artifactRuntimeNames = ["ollama", "vllm"] as const;
+export const artifactRuntimeNames = [...supportedRuntimeNames, "vllm"] as const;

As per coding guidelines, "Favor iteration and modularization over code duplication".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/schemas/common.schema.ts` around lines 38 - 54, supportedRuntimeNames and
artifactRuntimeNames duplicate "ollama", risking drift; change
artifactRuntimeNames to derive from supportedRuntimeNames (e.g., create
artifactRuntimeNames by spreading supportedRuntimeNames and adding "vllm" with a
const assertion) so the runtime-list relationship is encoded in one source of
truth, then keep ArtifactRuntimeNameSchema = z.enum(artifactRuntimeNames) and
ArtifactRuntimeName type as z.infer<typeof ArtifactRuntimeNameSchema>.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/dashboard/src/components/ui/section-heading.tsx`:
- Around line 12-15: The exported function SectionHeading lacks complete
JSDoc/TSDoc; update the comment block above the SectionHeading declaration to
include a short purpose description, `@param` tags for title and description, an
`@returns` tag describing the React element returned, and an `@throws` tag
(explicitly stating none if it never throws). Ensure the JSDoc sits immediately
above the SectionHeading function and uses the repo's standard TSDoc style.

In `@src/harnesses/goose-adapter.ts`:
- Around line 216-223: The code currently hard-codes provider = "ollama" and
builds env without using runtime.baseUrl; update the Goose configuration to
include the runtime base URL when present by adding an environment variable
(e.g., GOOSE_BASE_URL) to the env object derived from runtime.baseUrl, keeping
provider and GOOSE_MODEL as-is; in other words, when runtime.baseUrl is defined,
set env.GOOSE_BASE_URL = runtime.baseUrl so Ollama/custom endpoints are passed
through (reference the provider variable, the env object, and runtime.baseUrl),
or alternatively remove the file header invariant if you intentionally do not
support custom endpoints.

In `@src/harnesses/opencode-output.ts`:
- Around line 260-262: The early return that sets method "json" when parsedLines
> 0 is too aggressive and drops non-JSON content; change the condition so we
only short-circuit to the JSON-only return when the entire input was
successfully parsed as JSON (e.g., parsedLines === totalLineCount or a flag like
allLinesParsed is true). If mixed content exists, do not return an empty
output/method "json" — instead let the function fall through to the raw/fallback
handling or merge the non-JSON lines into the returned output. Update the logic
around parsedLines and the JSON-short-circuit (the block that currently returns
{ output: "", method: "json" }) to perform this full-stream check or preserve
non-JSON text.

In `@src/lib/model-profile/registry.ts`:
- Around line 148-178: normalizeLoadedModelProfileRegistry currently removes
unsupported runtime variants but leaves profiles whose variants become empty,
causing ModelProfileRegistrySchema.parse to fail; update the function to detect
and drop profiles whose filtered variants object is empty (e.g., track
droppedProfileKeys alongside droppedRuntimeNames), log a warning with the
dropped profile keys and their original runtimes, and build normalized only from
profiles that still have at least one supported variant before calling
ModelProfileRegistrySchema.parse so a single legacy profile doesn’t break the
whole load.

In `@src/lib/signal-assessment.ts`:
- Around line 32-40: The matcher list AGENT_REQUESTED_INPUT_PATTERNS is too
generic and causes UI/artifact text to be classified as agent_requested_input;
restrict its application by either (1) narrowing the regexes to harness-specific
phrasing/markers (e.g., include anchors or known harness tokens) or (2) guarding
the matching logic so AGENT_REQUESTED_INPUT_PATTERNS is only evaluated for
harness boundary/failure rows (e.g., check the row context like isHarnessPrompt
/ source === 'harness' / isBoundaryFailure before running matches in the
function that classifies agent_requested_input). Update the matching code where
AGENT_REQUESTED_INPUT_PATTERNS is used (and the same change at the other
occurrences) so legitimate UI/artifact strings aren’t misclassified.

In `@src/runner/plan-builder.ts`:
- Around line 97-101: Remove the unnecessary type assertion on config.runtimes
in the runtimes assignment: config.runtimes is already correctly typed by the
BenchConfig/Zod schema as SupportedRuntimeName[] (equivalent to RuntimeName[]),
so change the ternary to use config.runtimes directly instead of
(config.runtimes as RuntimeName[]); update the expression that sets runtimes
(which references config.runtimes, RuntimeName and RUNTIME_NAMES) to the simpler
form that returns config.runtimes when non-empty and [...RUNTIME_NAMES]
otherwise.

---

Nitpick comments:
In `@apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx`:
- Around line 270-313: Add a TSDoc block above the exported BlindVsInformedChart
function describing its purpose, parameters, and return value; document the
items parameter (type BlindVsInformedChartProps) and what the component renders
(chart tabs for model/harness breakdown), include `@param` for items and `@returns`
React.ReactElement (or JSX.Element), and add any `@throws` if it can throw—place
the doc comment immediately above the export function BlindVsInformedChart
declaration to satisfy the project guideline that all exported functions have
TSDoc.

In `@apps/dashboard/src/lib/api.ts`:
- Around line 1-4: The file header documenting purpose/exports/invariants must
be the very first thing in the module; move the existing module header comment
so it appears before the import block that currently brings in
parseKnownPlanPayload and parseKnownRunPayload from
"../../../../src/lib/machine-profile/legacy.js", and ensure the header briefly
states the file's purpose, its exported items, and any invariants as per the
coding guideline so the header precedes all imports and code.

In `@llm/implementation/multi-runtime-mvp-implementation.md`:
- Around line 1-5: Add a very prominent historical warning header and timestamp
to the top of the document so readers cannot mistake it for current behavior:
change the existing opening lines in Multi-Runtime MVP Implementation to include
a bold/ALL-CAPS heading like "HISTORICAL BEHAVIOR (Feb 8, 2026) — DO NOT USE FOR
CURRENT RUNTIME" and add a one-line subheading such as "Archived for
artifact/debugging context; active benchmarks are Ollama-only" (replace the
current first paragraph). Ensure the new header appears before any other text so
the historical status is immediately obvious on quick scans.

In `@llm/project/user-flow.md`:
- Line 209: Edit the "Discovery" bullet so it clearly states the runtime
decision for MVP: remove the conditional phrase "unless explicitly passed as
`--runtimes ollama`" and reword to say "Discovery: models/harnesses are
auto-discovered by default; runtime is fixed to `ollama` for the MVP." Reference
the "Discovery" bullet and the string "`--runtimes ollama`" when making the
change to ensure the mixed message is eliminated.

In `@src/cli/run-command.ts`:
- Around line 83-107: Add a temporary deprecated shim flag --vllm-url that
accepts a URL and maps it to the existing --ollama-url behavior: update the CLI
option definitions near the existing .option("--ollama-url <url>") to also
declare .option("--vllm-url <url>") as a deprecated alias, when parsing prefer
the explicit --ollama-url but if --vllm-url is present set the same internal
variable (e.g., ollamaUrl) and emit a single-time warning message informing
users the flag is deprecated and will be removed next release; ensure the shim
does not change runtime semantics and that the warning uses the same
logger/console path used elsewhere in this module.

In `@src/harnesses/opencode-adapter.ts`:
- Around line 546-564: The execa-error branch in the error handler for
executeItem only uses execaError.stderr and message, dropping execa output and
timing; update the throw to preserve execaError.output (or output/stdout) and
durationMs so executeItem can see the real failure text and timing.
Specifically, in the block that builds execaError and computes errorReasons
(symbols: execaError, hasPermissionDeniedStderr,
getTranscriptOrInputTaintReasons, appendSignalAssessmentReasons,
signalAssessment), include the original output and durationMs properties when
calling Object.assign on the new Error so the thrown object contains output and
durationMs for downstream handling by executeItem.

In `@src/harnesses/opencode-config.ts`:
- Around line 128-162: The function buildOpenCodeEnv currently accepts an opts
parameter with a runtimeName property that is never used; remove runtimeName
from the function signature and from the zod validation to simplify the API:
update the opts type to only include configPath and configJson, remove
runtimeName from the destructuring and from the z.object() schema
(RuntimeNameSchema reference), and adjust any call sites that pass runtimeName
to stop providing it or to only pass the two remaining fields so callers
compile.

In `@src/schemas/common.schema.ts`:
- Around line 38-54: supportedRuntimeNames and artifactRuntimeNames duplicate
"ollama", risking drift; change artifactRuntimeNames to derive from
supportedRuntimeNames (e.g., create artifactRuntimeNames by spreading
supportedRuntimeNames and adding "vllm" with a const assertion) so the
runtime-list relationship is encoded in one source of truth, then keep
ArtifactRuntimeNameSchema = z.enum(artifactRuntimeNames) and ArtifactRuntimeName
type as z.infer<typeof ArtifactRuntimeNameSchema>.

In `@test/model-profile-registry.test.ts`:
- Around line 96-97: Replace the hardcoded schemaVersion string "0.5.1" in the
test fixture with the canonical SCHEMA_VERSION constant: import SCHEMA_VERSION
into the test and set schemaVersion: SCHEMA_VERSION in the object used by the
model-profile registry test (the fixture that currently contains schemaVersion
and models). This keeps the test focused on unsupported runtime-key filtering
and avoids future churn when the schema version is bumped.

In `@test/plan-builder.test.ts`:
- Around line 16-102: The fallbackCollectMachineProfile helper is large and
should be extracted to a shared test fixtures module: create a new test utility
file (e.g., test/utils/fixtures.ts) exporting fallbackCollectMachineProfile as a
named export, move the function body there (preserving its signature and any
helper readNonEmpty logic and types), update test/plan-builder.test.ts to import
{ fallbackCollectMachineProfile } from the new module, and ensure any TypeScript
types referenced by the function are exported or inlined so existing tests
compile unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d585f805-fd92-49c5-ac58-25092ba73ba5

📥 Commits

Reviewing files that changed from the base of the PR and between 56b63e7 and fc34144.

📒 Files selected for processing (116)

README.md
apps/dashboard/scripts/build-index.ts
apps/dashboard/src/components/about/about-content.ts
apps/dashboard/src/components/about/about-page.tsx
apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx
apps/dashboard/src/components/charts/chart-primitives.tsx
apps/dashboard/src/components/charts/composite-score-chart.tsx
apps/dashboard/src/components/charts/frontier-eval-scatter.tsx
apps/dashboard/src/components/charts/model-comparison-chart.tsx
apps/dashboard/src/components/charts/model-radar-chart.tsx
apps/dashboard/src/components/charts/model-test-heatmap.tsx
apps/dashboard/src/components/charts/test-difficulty-chart.tsx
apps/dashboard/src/components/charts/test-type-split-chart.tsx
apps/dashboard/src/components/charts/token-efficiency-chart.tsx
apps/dashboard/src/components/layout/header.tsx
apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx
apps/dashboard/src/components/run-list/run-list-page.tsx
apps/dashboard/src/components/ui/section-heading.tsx
apps/dashboard/src/index.css
apps/dashboard/src/lib/aggregations-charts.ts
apps/dashboard/src/lib/aggregations-compare.ts
apps/dashboard/src/lib/aggregations-diagnostics.ts
apps/dashboard/src/lib/aggregations-test-types.ts
apps/dashboard/src/lib/aggregations-tooling.ts
apps/dashboard/src/lib/aggregations.ts
apps/dashboard/src/lib/api.ts
apps/dashboard/src/lib/schemas.ts
apps/dashboard/src/lib/types.ts
llm/context/codebase-overview.md
llm/implementation/multi-runtime-mvp-implementation.md
llm/project/project-overview.md
llm/project/project-rules.md
llm/project/user-flow.md
package.json
src/cli/compare-command.ts
src/cli/migrate-machine-command.ts
src/cli/run-command.ts
src/harnesses/direct-adapter.ts
src/harnesses/goose-adapter.ts
src/harnesses/goose-openai.ts
src/harnesses/goose-output.ts
src/harnesses/harness.ts
src/harnesses/opencode-adapter.ts
src/harnesses/opencode-config.ts
src/harnesses/opencode-output.ts
src/harnesses/tool-prompt.ts
src/lib/benchmark-checkpoint.ts
src/lib/hardware-profile.ts
src/lib/machine-profile/instance-id.ts
src/lib/machine-profile/legacy.ts
src/lib/machine-profile/normalization.ts
src/lib/machine-profile/probe-linux.ts
src/lib/machine-profile/probe-macos.ts
src/lib/machine-profile/probe-utils.ts
src/lib/machine-profile/probe-windows.ts
src/lib/model-aliases.ts
src/lib/model-profile/normalization.ts
src/lib/model-profile/registry.ts
src/lib/openai-compat-client.ts
src/lib/signal-assessment.ts
src/lib/stats-format.ts
src/lib/test-catalog.ts
src/runner/index.ts
src/runner/item-executor.ts
src/runner/plan-builder.ts
src/runtimes/discovery.ts
src/runtimes/index.ts
src/runtimes/runtime.ts
src/runtimes/vllm-runtime.ts
src/schemas/common.schema.ts
src/schemas/config.schema.ts
src/schemas/index.ts
src/schemas/model-alias.schema.ts
src/schemas/model-profile.schema.ts
src/schemas/plan.schema.ts
src/schemas/result.schema.ts
src/schemas/test-catalog.schema.ts
src/tests/calculator-basic/test.meta.json
src/tests/calculator-stateful/test.meta.json
src/tests/event-emitter/test.meta.json
src/tests/rate-limiter/test.meta.json
src/tests/smoke/test.meta.json
src/tests/todo-app/test.meta.json
src/tests/ttl-cache/test.meta.json
src/tests/workspace-smoke/test.meta.json
test/aggregate.test.ts
test/benchmark-checkpoint.test.ts
test/build-index.test.ts
test/dashboard-schema-compat.test.ts
test/dashboard-test-type.test.ts
test/generation-retry.test.ts
test/goose-adapter.test.ts
test/goose-output.test.ts
test/hardware-profile.test.ts
test/harness-compatibility.test.ts
test/item-executor.test.ts
test/item-retry.test.ts
test/machine-profile-migration.test.ts
test/model-profile-registry.test.ts
test/opencode-adapter.test.ts
test/opencode-config.test.ts
test/opencode-output.test.ts
test/plan-builder.test.ts
test/schemas.test.ts
test/signal-assessment.test.ts
test/stats-semantics.test.ts
test/test-catalog.test.ts
test/tool-prompt.test.ts
test/vllm-runtime.test.ts
test/workspace-capability-parity.test.ts
test/workspace-prompt-parity.test.ts

💤 Files with no reviewable changes (9)

src/runner/index.ts
test/benchmark-checkpoint.test.ts
test/opencode-config.test.ts
package.json
src/harnesses/goose-openai.ts
src/runtimes/discovery.ts
test/vllm-runtime.test.ts
src/lib/openai-compat-client.ts
src/runtimes/vllm-runtime.ts

[codex] rebuild opencode harness

AustinKelsay · 2026-04-24T23:39:35Z

@CodeRabbit full review

coderabbitai · 2026-04-24T23:39:40Z

✅ Actions performed

Full review triggered.

coderabbitai

Actionable comments posted: 12

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/cli/migrate-machine-command.ts (1)

334-337: ⚠️ Potential issue | 🟠 Major

Do not force non-zero exit for user validation errors.

Line 336 can throw for invalid user options, and Line 363 currently exits with status 1 for all caught errors. That treats non-crash usage errors as crash exits.

💡 Suggested adjustment

 	.action(async (options) => {
 		try {
 			const parsedOptions = MigrateMachineCommandOptionsSchema.parse(options);
 			const resultsDir = path.resolve(parsedOptions.dir);
 			const migrated = await migrateResultsDirectory(resultsDir);
 			logger.info(
 				{ resultsDir, ...migrated },
 				"Migrated machine-profile artifacts",
 			);
@@
-		} catch (error) {
-			logger.error({ error }, "Machine-profile migration failed");
-			process.exit(1);
+		} catch (error) {
+			if (error instanceof z.ZodError) {
+				logger.warn(
+					{ issues: error.issues },
+					"Invalid options for migrate-machine-profiles; skipping migration",
+				);
+				return;
+			}
+			logger.error({ error }, "Machine-profile migration failed");
+			process.exit(1);
 		}
 	});

As per coding guidelines "CLI should be non-interactive by default; exit non-zero only on crashes (MVP)".

Also applies to: 361-364

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/cli/migrate-machine-command.ts` around lines 334 - 337, The action
handler currently calls MigrateMachineCommandOptionsSchema.parse(options) and
then unconditionally calls process.exit(1) for any caught error; change this so
Zod/validation errors from MigrateMachineCommandOptionsSchema.parse (or other
user-input validation) are handled separately: detect the validation error
(e.g., instanceof ZodError or error.name === 'ZodError'), print a helpful
validation message to the user and return/exit with success (no non-zero exit),
while only calling process.exit(1) for unexpected runtime/crash errors; update
the .action(async (options) => { ... }) catch block to branch on validation vs
unexpected errors and avoid forcing a non-zero exit for user validation issues.

apps/dashboard/src/lib/schemas.ts (1)

363-384: ⚠️ Potential issue | 🟠 Major

Mirror the new scoringMetrics fields here before exporting this result schema.

MatrixItemResultSchema now validates live run.json payloads for the dashboard, but the nested ScoringMetricsSchema still only keeps durationMs. Current results can include retryGenerationDurationMs, retryKind, retryReason, retryAttempted, and retryPromoted, and Zod will silently strip all of them at this boundary.

📦 Suggested schema sync

-const ScoringMetricsSchema = z.object({
-	durationMs: z.number(),
-});
+const ScoringMetricsSchema = z
+	.object({
+		durationMs: z.number(),
+		scoringDurationMs: z.number().optional(),
+		retryGenerationDurationMs: z.number().optional(),
+		retryKind: z.enum(["compile-feedback", "opencode-workspace"]).optional(),
+		retryReason: z.string().optional(),
+		retryAttempted: z.boolean().optional(),
+		retryPromoted: z.boolean().optional(),
+	})
+	.refine(
+		(metrics) => {
+			const hasAnyRetryField =
+				metrics.retryKind !== undefined ||
+				metrics.retryReason !== undefined ||
+				metrics.retryAttempted !== undefined ||
+				metrics.retryPromoted !== undefined ||
+				metrics.retryGenerationDurationMs !== undefined;
+			if (!hasAnyRetryField) {
+				return true;
+			}
+			if (metrics.retryAttempted === true) {
+				return (
+					metrics.retryKind !== undefined &&
+					typeof metrics.retryReason === "string" &&
+					metrics.retryReason.trim().length > 0 &&
+					typeof metrics.retryPromoted === "boolean" &&
+					typeof metrics.retryGenerationDurationMs === "number" &&
+					metrics.retryGenerationDurationMs >= 0
+				);
+			}
+			if (metrics.retryAttempted === false) {
+				return (
+					metrics.retryKind === undefined &&
+					metrics.retryReason === undefined &&
+					metrics.retryPromoted === undefined &&
+					metrics.retryGenerationDurationMs === undefined
+				);
+			}
+			return false;
+		},
+		{
+			message:
+				"retry metrics must be fully absent, or when retryAttempted is true include retryKind, non-empty retryReason, retryPromoted, and non-negative retryGenerationDurationMs; when retryAttempted is false the other retry fields must be absent",
+			path: ["retryKind"],
+		},
+	);

Based on learnings, "Maintain versioned schemas with schemaVersion field and use additive changes; migrations must be explicit" and "Use Zod for schema validation at all boundaries (CLI args, env, JSON files, HTTP responses)".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/lib/schemas.ts` around lines 363 - 384,
MatrixItemResultSchema currently references ScoringMetricsSchema but that schema
only includes durationMs, causing fields like retryGenerationDurationMs,
retryKind, retryReason, retryAttempted, and retryPromoted to be stripped; update
ScoringMetricsSchema to include these additional optional fields and then ensure
MatrixItemResultSchema continues to reference ScoringMetricsSchema (and export
remains unchanged) so live run.json payloads retain those properties; locate
ScoringMetricsSchema in the same file (or its import) and add the five new
optional fields with appropriate types (numbers/strings/booleans) to match
existing usages.

♻️ Duplicate comments (3)

README.md (1)
142-143: ⚠️ Potential issue | 🟡 Minor

Don’t describe bun pb as auto-discovering runtimes anymore.

The README now says new runs are Ollama-only, so this example overstates the default behavior. Keep the discovery note to models, harnesses, and tests.
📝 Suggested doc fix
-# Run benchmarks (auto-discovers available runtimes, models, harnesses, and tests)
+# Run benchmarks (Ollama runtime; auto-discovers models, harnesses, and tests)
 bun pb
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 142 - 143, Update the README line that currently
claims "auto-discovers available runtimes, models, harnesses, and tests" for the
`bun pb` example so it no longer states runtimes are auto-discovered; change the
wording to indicate `bun pb` auto-discovers only models, harnesses, and tests
(and that new runs are Ollama-only) so the example matches the current behavior.
llm/implementation/harnesses-implementation.md (1)
340-345: ⚠️ Potential issue | 🟡 Minor

This doc still points to two different Goose tool-enablement mechanisms.

The table now says Goose tool access comes from generated config/permission rules, but the earlier “Tool-Calling Mode (Goose)” section still tells readers to use --with-builtin developer. Please reconcile both sections in the same pass or the troubleshooting guidance stays ambiguous.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@llm/implementation/harnesses-implementation.md` around lines 340 - 345, The
doc is inconsistent about how Goose enables tools: reconcile the "Tool-Calling
Mode (Goose)" section with the table entry "Goose | Per-item provider config +
permission rules via `goose run`" by choosing the intended mechanism and
updating the other place to match; either remove the reference to the CLI flag
`--with-builtin developer` from "Tool-Calling Mode (Goose)" and replace it with
a clear description of per-item provider config + permission rules (including
mention of using `goose run`), or conversely, change the table to reflect the
CLI flag approach—ensure all references to `--with-builtin developer`,
"Tool-Calling Mode (Goose)", and the table entry are consistent and use the same
terminology and steps.
src/harnesses/opencode-events.ts (1)
240-249: ⚠️ Potential issue | 🟠 Major

Preserve the separator after raw fragments.

joinOrderedOutputParts() still collapses a raw line into the next parsed text chunk. A sequence like raw -> text becomes rawtext instead of preserving the newline boundary, which can corrupt mixed-output salvage and scorer-facing transcripts.
♻️ Suggested fix
 function joinOrderedOutputParts(parts: OrderedOutputPart[]): string {
 	let output = "";
 	for (const part of parts) {
 		if (part.type === "text") {
 			output += part.content;
 			continue;
 		}
-		output += output.length > 0 ? `\n${part.content}` : part.content;
+		if (output.length > 0 && !output.endsWith("\n")) {
+			output += "\n";
+		}
+		output += `${part.content}\n`;
 	}
-	return output;
+	return output.trimEnd();
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/opencode-events.ts` around lines 240 - 249,
joinOrderedOutputParts currently collapses a raw (non-"text") part into the
following text part (e.g., raw -> text becomes rawtext); update the function
(joinOrderedOutputParts and references to OrderedOutputPart) so that when
appending a non-"text" part you always preserve its boundary by ensuring a
newline separates it from subsequent text—append the raw part.content and then,
if the output does not already end with '\n', append a single '\n' (avoid
duplicating newlines), rather than conditionally prefixing the raw piece with a
newline based on output length.

🧹 Nitpick comments (6)

test/item-retry.test.ts (1)

69-186: Optional: extract shared retry-case setup to reduce duplication.

Both tests are solid, but they repeat a large common arrangement block. A tiny helper would improve maintainability without changing behavior.

♻️ Refactor sketch

+function createRetryArgs(params: {
+  testName: string;
+  initialOutput: string;
+  prompt: string;
+}) {
+  return {
+    item: createMatrixItem(params.testName),
+    generation: { success: true, output: params.initialOutput, durationMs: 300 },
+    harnessForRetry: createHarness(generateMock),
+    runtimeForRetry: createRuntime(),
+    promptForRetry: params.prompt,
+    timeoutMs: 5_000,
+    unloadAfter: true,
+    log: { info: vi.fn(), warn: vi.fn() },
+    supportsCompileRetry: true,
+  };
+}
...
- const outcome = await runScoringWithCompileRetry({
-   item: createMatrixItem("todo-app"),
-   generation: { ... },
-   ...
- });
+ const outcome = await runScoringWithCompileRetry(
+   createRetryArgs({
+     testName: "todo-app",
+     initialOutput: "export const notTheRightThing = true;",
+     prompt: "Implement createTodoApp().",
+   }),
+ );

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/item-retry.test.ts` around lines 69 - 186, Extract the repeated
compile-retry test setup into a small helper to remove duplication: create a
helper (e.g., runCompileRetryScenario or setupCompileRetry) in
test/item-retry.test.ts that accepts the varying bits (item name, initial
generation.output/durationMs, generateMock resolved output/durationMs, first
scoring failure object, promptForRetry) and internally imports
runScoringWithCompileRetry, configures generateMock and scoreGenerationMock,
constructs harnessForRetry/runtimeForRetry, calls runScoringWithCompileRetry
with the shared options (timeoutMs, unloadAfter, log, supportsCompileRetry) and
returns the outcome; then update both tests to call this helper and assert on
the returned outcome and on generateMock call expectations.

test/build-index.test.ts (1)

28-48: Consider centralizing required benchmark lib asset fixture data.

This list is manually mirrored from src/lib/benchmark-checkpoint.ts, so asset changes require edits in multiple places. A shared export/helper would reduce drift risk.

As per coding guidelines, "Favor iteration and modularization over code duplication."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/build-index.test.ts` around lines 28 - 48, The test's local
REQUIRED_LIB_ASSETS duplicates the asset list from
src/lib/benchmark-checkpoint.ts; remove the hardcoded array and import a single
exported constant from the benchmark-checkpoint module (or a new shared helper
module) to centralize the fixture data. Update test/build-index.test.ts to
import the exported symbol (e.g., BENCHMARK_REQUIRED_ASSETS or REQUIRES_ASSETS)
and use it in place of the local REQUIRED_LIB_ASSETS, and add the export to
src/lib/benchmark-checkpoint.ts (or a new shared file) so both production code
and tests reference the same source of truth.

apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx (1)

199-203: Reuse dedupeAndSortModels in “Select all” for consistency.

This path currently reimplements sorting and skips dedupe. Reusing the helper keeps one canonical normalization path.

Proposed patch

 							onClick={() =>
-								onSelectionChange(
-									[...models].sort((a, b) => MODEL_SORT_COLLATOR.compare(a, b)),
-								)
+								onSelectionChange(dedupeAndSortModels(models))
 							}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx` around
lines 199 - 203, The "Select all" onClick currently builds a new array and
manually sorts via MODEL_SORT_COLLATOR without deduplicating; replace that logic
to call the existing helper dedupeAndSortModels(models) and pass its result to
onSelectionChange to ensure canonical dedupe+sort behavior (update the onClick
that currently calls onSelectionChange([...models].sort(...)) to use
dedupeAndSortModels(models) instead).

src/cli/compare-formatters.ts (1)

26-36: Pin the display-row collation instead of inheriting the process locale.

localeCompare() without an explicit locale uses the runtime’s default collation, so these sort orders can still drift across machines when model, harness, or test names contain locale-sensitive characters. If the goal is deterministic CLI output, make the comparator locale-independent or at least pin it to a fixed locale.

♻️ Suggested fix

+function compareText(left: string, right: string): number {
+	if (left < right) return -1;
+	if (left > right) return 1;
+	return 0;
+}
+
 function compareDisplayRows(
 	left: ComparableDisplayRow,
 	right: ComparableDisplayRow,
 ): number {
-	const modelCompare = left.model.localeCompare(right.model);
+	const modelCompare = compareText(left.model, right.model);
 	if (modelCompare !== 0) return modelCompare;
-	const harnessCompare = left.harness.localeCompare(right.harness);
+	const harnessCompare = compareText(left.harness, right.harness);
 	if (harnessCompare !== 0) return harnessCompare;
-	const testCompare = left.test.localeCompare(right.test);
+	const testCompare = compareText(left.test, right.test);
 	if (testCompare !== 0) return testCompare;
-	return left.passType.localeCompare(right.passType);
+	return compareText(left.passType, right.passType);
 }

Based on learnings, "Output should be table/diff oriented, concise, and deterministic; never rely on color alone".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/cli/compare-formatters.ts` around lines 26 - 36, The comparator
compareDisplayRows currently uses localeCompare() with the environment default;
make sorting deterministic by pinning the locale and options, e.g. call
localeCompare on model, harness, test, and passType with a fixed locale (like
"en" or "en-US") and explicit options (for example { numeric: true, sensitivity:
"variant" }) so comparisons do not vary across machines; update the calls in
compareDisplayRows (referencing left.model, left.harness, left.test,
left.passType and their right.* counterparts) to include the chosen locale and
options.

test/goose-adapter.test.ts (1)

238-245: Don’t make these regressions depend on taint-reason ordering.

signalAssessment.reasons is effectively set-like metadata. Matching the exact array order makes these tests fail on harmless merge/refactor changes even when the same reasons are present. Prefer membership-based assertions or compare sorted arrays.

♻️ Suggested assertion pattern

-				signalAssessment: {
-					classification: "tainted",
-					reasons: ["internal_tool_transcript", "agent_requested_input"],
-				},
+				signalAssessment: expect.objectContaining({
+					classification: "tainted",
+					reasons: expect.arrayContaining([
+						"internal_tool_transcript",
+						"agent_requested_input",
+					]),
+				}),

Also applies to: 279-285

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/goose-adapter.test.ts` around lines 238 - 245, The test currently
asserts exact ordering on signalAssessment.reasons which is brittle; update the
assertion in the failing expectation (the block containing signalAssessment: {
classification: "tainted", reasons: [...] }) to assert membership rather than
order — e.g., use expect.arrayContaining([...]) for signalAssessment.reasons or
compare a sorted copy (sort both actual and expected) so the test verifies the
presence of "internal_tool_transcript" and "agent_requested_input" regardless of
order; apply the same change to the other occurrence around lines 279-285 that
also asserts signalAssessment.reasons.

src/schemas/common.schema.ts (1)

25-32: Keep artifact runtimes independent from the active runtime list.

artifactRuntimeNames = supportedRuntimeNames means the next active-runtime removal will also narrow artifact parsing unless someone remembers to split the arrays in the same edit. Since this file is introducing separate “active” vs “persisted” runtime surfaces, make the artifact list its own literal set now.
♻️ Suggested refactor
 /** Valid runtime names allowed when reading historical plan/result artifacts. */
-export const artifactRuntimeNames = supportedRuntimeNames;
+export const artifactRuntimeNames = ["ollama", "vllm"] as const;
As per coding guidelines, "Maintain versioned schemas with schemaVersion field and use additive changes; migrations must be explicit".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/schemas/common.schema.ts` around lines 25 - 32, artifactRuntimeNames
currently aliases supportedRuntimeNames which couples persisted artifact parsing
to the active runtime list; replace that alias with an explicit, independent
literal array of allowed persisted runtimes (assign a new array to
artifactRuntimeNames rather than referencing supportedRuntimeNames), then keep
ArtifactRuntimeNameSchema = z.enum(artifactRuntimeNames) and ArtifactRuntimeName
= z.infer<typeof ArtifactRuntimeNameSchema> so the schema/type use the new
independent list; ensure you update only artifactRuntimeNames (not
supportedRuntimeNames) so future active-runtime removals won't change artifact
parsing and remember to increment schemaVersion/migration if this is a breaking
change.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Around line 69-76: The ModelFilterDropdown component accepts props without
runtime checking; add a Zod schema for the expected shape (matching
ModelFilterDropdownProps) and validate props at the top of the exported
ModelFilterDropdown function, throwing a TypeError when validation fails; after
validation use the parsed/validated values (models, selectedModels,
onSelectionChange) for the rest of the component (e.g., for
normalizedSelectedIds and containerRef) so internal logic relies on
runtime-validated data and the documented contract is enforced.

In `@src/harnesses/opencode-adapter.ts`:
- Around line 149-152: The protocolOnlyReasons computation uses stdout directly,
causing taint from the actual parsed stream (the result of
selectProcessOutput(stdout, stderr) passed into parseOpenCodeEvents()) to be
ignored; update the protocolOnlyReasons assignment in opencode-adapter.ts so it
calls getTranscriptOrInputTaintReasons(...) on the exact stream/string that was
passed into parseOpenCodeEvents/selectProcessOutput (the parsed stream variable)
instead of always using stdout—i.e., locate the protocolOnlyReasons declaration
and replace getTranscriptOrInputTaintReasons(stdout) with
getTranscriptOrInputTaintReasons(parsedStream) or the actual variable holding
selectProcessOutput(...) so stderr-only protocol JSONL preserves its
internal_tool_transcript taint.

In `@src/harnesses/opencode-events.ts`:
- Around line 311-317: When extractFromToolCallObject(event) returns a non-empty
string and you assign it to toolCallOutput, also set hasToolUse = true so the
parsed output reflects that tool content was recovered; update both locations
where this pattern appears (the block using extractedToolCallOutput /
toolCallOutput around extractFromToolCallObject and the similar block at lines
~347-348) to flip hasToolUse when assigning the recovered tool-call string. This
change should reference extractedToolCallOutput, toolCallOutput, hasToolUse, and
extractFromToolCallObject in the affected function(s).

In `@src/runner/index.ts`:
- Around line 280-310: The residency-guard failure branch currently continues
without running post-item teardown, which can leave a previous model resident
and poison subsequent blocks; before the early continue in the catch block,
invoke runPostItemResidencyTeardown with the same context used for successful
items (pass item, plan, config, log, and any residency state) to ensure the
model is unloaded/cleaned up even on failure, then proceed to record the failure
via buildResidencyGuardFailureResult and write the progress checkpoint as
before.

In `@src/runner/model-eligibility.ts`:
- Around line 97-99: The thrown error message always says "(embedding-only)"
even when the real cause is that the model's capabilities.generateText is false;
update the throw in model-eligibility.ts (the branch referencing model, runtime,
detectedModelKind, and capabilities.generateText) to compute an accurate reason
string (e.g., if detectedModelKind === 'embedding' use "embedding-only", else if
!capabilities.generateText use "does not support text generation" or include the
capability flag) and include that reason in the Error message so operators see
the true exclusion cause.

In `@src/runner/plan-builder.ts`:
- Around line 102-110: Currently non-executable entries in config.runtimes are
silently filtered out (configuredRuntimes → runtimes using
isExecutableRuntimeName); instead, validate configuredRuntimes up-front: compute
the set of invalid names = configuredRuntimes.filter(r =>
!isExecutableRuntimeName(r)) and if invalids.length > 0 throw an Error listing
the invalids and the allowed RUNTIME_NAMES; only after that set runtimes =
configuredRuntimes as RuntimeName[] (or keep the existing filter as a sanity
check) so we fail loudly on bad config rather than masking it.

In `@src/runtimes/ollama-residency.ts`:
- Around line 261-266: The residency check in ensureOnlyOllamaModelLoaded uses
isSameOllamaModel but does not trim whitespace from config.allowedModel, so
values like " qwen3.6 " are treated as different; update
ensureOnlyOllamaModelLoaded (or its config validation) to trim() the
allowedModel string from EnsureOnlyOllamaModelLoadedConfig before any calls to
isSameOllamaModel or residency comparisons, and throw if allowedModel is empty
after trimming to avoid silent fallbacks.
- Around line 203-245: Both listRunningOllamaModels and unloadOllamaModel must
validate and normalize incoming config strings: trim config.baseUrl and
config.model (where present) and throw a clear Error if baseUrl or model is
empty/whitespace-only after trimming; also normalize baseUrl by removing any
trailing slash before constructing endpoint (the code that builds endpoint uses
`${config.baseUrl}/api/...`). Update
OllamaResidencyBaseConfig/UnloadOllamaModelConfig consumers in
listRunningOllamaModels (endpoint) and unloadOllamaModel (endpoint and request
body.model) to perform this validation/normalization at the top of each function
and reject invalid input immediately.

In `@src/runtimes/ollama-runtime.ts`:
- Around line 37-42: Normalize and validate the 'family' and 'families' fields
at the HTTP boundary by updating the zod 'details' schema: trim whitespace and
reject empty/blank strings for the single 'family' field and for each element of
the 'families' array so downstream functions like inferModelKind() and returned
metadata only see normalized values; specifically, change the 'family' schema to
apply a trim transform and a non-empty check and change 'families' to be an
array of trimmed/non-empty strings (keep both optional semantics), so invalid or
blank entries are rejected during parsing.

In `@test/opencode-output.test.ts`:
- Around line 88-106: The test currently uses a non-write second entry (command:
"edit") so it doesn't actually verify that a later write beats an earlier one;
update the fixture in the "extracts the latest batched write from alternate
tool-call shapes" test that calls parseOpenCodeEvents so the second toolCalls
element is a write-shaped entry (e.g., { tool: "write", raw: { content: "export
const second = 2;" } }) instead of command: "edit", leaving the expectations
(parsed.output and parsed.method) unchanged so the test truly exercises the
"latest write wins" behavior.

In `@test/schemas.test.ts`:
- Around line 223-244: The test currently allows runtimes: ["vllm"] but
BenchConfigSchema should reject that; update the second assertion so parsing {
runtimes: ["vllm"] } throws instead of returning a value. Specifically, change
the expectation around BenchConfigSchema.parse({ runtimes: ["vllm"] }) in the
test to use expect(() => BenchConfigSchema.parse(...)).toThrow(...) (matching
"runtimes" or "vllm") while keeping the other assertions (e.g., the
modelProfiles variant check for vllm) unchanged.
- Around line 69-74: The test is incorrectly asserting that
SupportedRuntimeNameSchema accepts "vllm", collapsing the supported/artifact
split; update the test so SupportedRuntimeNameSchema.parse("vllm") throws (or
remove that assertion) while keeping ArtifactRuntimeNameSchema.parse("vllm")
asserting it returns "vllm", i.e., ensure SupportedRuntimeNameSchema only
accepts supported names (e.g., "ollama") and legacy "vllm" remains covered by
ArtifactRuntimeNameSchema.

---

Outside diff comments:
In `@apps/dashboard/src/lib/schemas.ts`:
- Around line 363-384: MatrixItemResultSchema currently references
ScoringMetricsSchema but that schema only includes durationMs, causing fields
like retryGenerationDurationMs, retryKind, retryReason, retryAttempted, and
retryPromoted to be stripped; update ScoringMetricsSchema to include these
additional optional fields and then ensure MatrixItemResultSchema continues to
reference ScoringMetricsSchema (and export remains unchanged) so live run.json
payloads retain those properties; locate ScoringMetricsSchema in the same file
(or its import) and add the five new optional fields with appropriate types
(numbers/strings/booleans) to match existing usages.

In `@src/cli/migrate-machine-command.ts`:
- Around line 334-337: The action handler currently calls
MigrateMachineCommandOptionsSchema.parse(options) and then unconditionally calls
process.exit(1) for any caught error; change this so Zod/validation errors from
MigrateMachineCommandOptionsSchema.parse (or other user-input validation) are
handled separately: detect the validation error (e.g., instanceof ZodError or
error.name === 'ZodError'), print a helpful validation message to the user and
return/exit with success (no non-zero exit), while only calling process.exit(1)
for unexpected runtime/crash errors; update the .action(async (options) => { ...
}) catch block to branch on validation vs unexpected errors and avoid forcing a
non-zero exit for user validation issues.

---

Duplicate comments:
In `@llm/implementation/harnesses-implementation.md`:
- Around line 340-345: The doc is inconsistent about how Goose enables tools:
reconcile the "Tool-Calling Mode (Goose)" section with the table entry "Goose |
Per-item provider config + permission rules via `goose run`" by choosing the
intended mechanism and updating the other place to match; either remove the
reference to the CLI flag `--with-builtin developer` from "Tool-Calling Mode
(Goose)" and replace it with a clear description of per-item provider config +
permission rules (including mention of using `goose run`), or conversely, change
the table to reflect the CLI flag approach—ensure all references to
`--with-builtin developer`, "Tool-Calling Mode (Goose)", and the table entry are
consistent and use the same terminology and steps.

In `@README.md`:
- Around line 142-143: Update the README line that currently claims
"auto-discovers available runtimes, models, harnesses, and tests" for the `bun
pb` example so it no longer states runtimes are auto-discovered; change the
wording to indicate `bun pb` auto-discovers only models, harnesses, and tests
(and that new runs are Ollama-only) so the example matches the current behavior.

In `@src/harnesses/opencode-events.ts`:
- Around line 240-249: joinOrderedOutputParts currently collapses a raw
(non-"text") part into the following text part (e.g., raw -> text becomes
rawtext); update the function (joinOrderedOutputParts and references to
OrderedOutputPart) so that when appending a non-"text" part you always preserve
its boundary by ensuring a newline separates it from subsequent text—append the
raw part.content and then, if the output does not already end with '\n', append
a single '\n' (avoid duplicating newlines), rather than conditionally prefixing
the raw piece with a newline based on output length.

---

Nitpick comments:
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Around line 199-203: The "Select all" onClick currently builds a new array and
manually sorts via MODEL_SORT_COLLATOR without deduplicating; replace that logic
to call the existing helper dedupeAndSortModels(models) and pass its result to
onSelectionChange to ensure canonical dedupe+sort behavior (update the onClick
that currently calls onSelectionChange([...models].sort(...)) to use
dedupeAndSortModels(models) instead).

In `@src/cli/compare-formatters.ts`:
- Around line 26-36: The comparator compareDisplayRows currently uses
localeCompare() with the environment default; make sorting deterministic by
pinning the locale and options, e.g. call localeCompare on model, harness, test,
and passType with a fixed locale (like "en" or "en-US") and explicit options
(for example { numeric: true, sensitivity: "variant" }) so comparisons do not
vary across machines; update the calls in compareDisplayRows (referencing
left.model, left.harness, left.test, left.passType and their right.*
counterparts) to include the chosen locale and options.

In `@src/schemas/common.schema.ts`:
- Around line 25-32: artifactRuntimeNames currently aliases
supportedRuntimeNames which couples persisted artifact parsing to the active
runtime list; replace that alias with an explicit, independent literal array of
allowed persisted runtimes (assign a new array to artifactRuntimeNames rather
than referencing supportedRuntimeNames), then keep ArtifactRuntimeNameSchema =
z.enum(artifactRuntimeNames) and ArtifactRuntimeName = z.infer<typeof
ArtifactRuntimeNameSchema> so the schema/type use the new independent list;
ensure you update only artifactRuntimeNames (not supportedRuntimeNames) so
future active-runtime removals won't change artifact parsing and remember to
increment schemaVersion/migration if this is a breaking change.

In `@test/build-index.test.ts`:
- Around line 28-48: The test's local REQUIRED_LIB_ASSETS duplicates the asset
list from src/lib/benchmark-checkpoint.ts; remove the hardcoded array and import
a single exported constant from the benchmark-checkpoint module (or a new shared
helper module) to centralize the fixture data. Update test/build-index.test.ts
to import the exported symbol (e.g., BENCHMARK_REQUIRED_ASSETS or
REQUIRES_ASSETS) and use it in place of the local REQUIRED_LIB_ASSETS, and add
the export to src/lib/benchmark-checkpoint.ts (or a new shared file) so both
production code and tests reference the same source of truth.

In `@test/goose-adapter.test.ts`:
- Around line 238-245: The test currently asserts exact ordering on
signalAssessment.reasons which is brittle; update the assertion in the failing
expectation (the block containing signalAssessment: { classification: "tainted",
reasons: [...] }) to assert membership rather than order — e.g., use
expect.arrayContaining([...]) for signalAssessment.reasons or compare a sorted
copy (sort both actual and expected) so the test verifies the presence of
"internal_tool_transcript" and "agent_requested_input" regardless of order;
apply the same change to the other occurrence around lines 279-285 that also
asserts signalAssessment.reasons.

In `@test/item-retry.test.ts`:
- Around line 69-186: Extract the repeated compile-retry test setup into a small
helper to remove duplication: create a helper (e.g., runCompileRetryScenario or
setupCompileRetry) in test/item-retry.test.ts that accepts the varying bits
(item name, initial generation.output/durationMs, generateMock resolved
output/durationMs, first scoring failure object, promptForRetry) and internally
imports runScoringWithCompileRetry, configures generateMock and
scoreGenerationMock, constructs harnessForRetry/runtimeForRetry, calls
runScoringWithCompileRetry with the shared options (timeoutMs, unloadAfter, log,
supportsCompileRetry) and returns the outcome; then update both tests to call
this helper and assert on the returned outcome and on generateMock call
expectations.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1e139aef-83e9-4aad-98b3-76391c93f0ab

📥 Commits

Reviewing files that changed from the base of the PR and between 56b63e7 and 5f87caf.

📒 Files selected for processing (159)

README.md
apps/dashboard/scripts/build-index.ts
apps/dashboard/src/components/about/about-content.ts
apps/dashboard/src/components/about/about-page.tsx
apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx
apps/dashboard/src/components/charts/chart-primitives.tsx
apps/dashboard/src/components/charts/composite-score-chart.tsx
apps/dashboard/src/components/charts/frontier-eval-scatter.tsx
apps/dashboard/src/components/charts/model-comparison-chart.tsx
apps/dashboard/src/components/charts/model-radar-chart.tsx
apps/dashboard/src/components/charts/model-test-heatmap.tsx
apps/dashboard/src/components/charts/test-difficulty-chart.tsx
apps/dashboard/src/components/charts/test-type-split-chart.tsx
apps/dashboard/src/components/charts/token-efficiency-chart.tsx
apps/dashboard/src/components/layout/header.tsx
apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx
apps/dashboard/src/components/run-list/run-list-page.tsx
apps/dashboard/src/components/ui/section-heading.tsx
apps/dashboard/src/index.css
apps/dashboard/src/lib/aggregations-charts.ts
apps/dashboard/src/lib/aggregations-compare.ts
apps/dashboard/src/lib/aggregations-diagnostics.ts
apps/dashboard/src/lib/aggregations-test-types.ts
apps/dashboard/src/lib/aggregations-tooling.ts
apps/dashboard/src/lib/aggregations.ts
apps/dashboard/src/lib/api.ts
apps/dashboard/src/lib/chart-colors.ts
apps/dashboard/src/lib/schemas.ts
apps/dashboard/src/lib/types.ts
llm/context/codebase-overview.md
llm/implementation/computer-use-hardening.md
llm/implementation/harnesses-implementation.md
llm/implementation/multi-runtime-mvp-implementation.md
llm/project/project-overview.md
llm/project/project-rules.md
llm/project/user-flow.md
package.json
src/cli/compare-command.ts
src/cli/compare-formatters.ts
src/cli/compare-utils.ts
src/cli/migrate-machine-command.ts
src/cli/run-command.ts
src/harnesses/direct-adapter.ts
src/harnesses/discovery.ts
src/harnesses/goose-adapter.ts
src/harnesses/goose-openai.ts
src/harnesses/goose-output.ts
src/harnesses/harness.ts
src/harnesses/opencode-adapter.ts
src/harnesses/opencode-artifacts.ts
src/harnesses/opencode-cli.ts
src/harnesses/opencode-config.ts
src/harnesses/opencode-events.ts
src/harnesses/opencode-model.ts
src/harnesses/opencode-output.ts
src/harnesses/opencode-permissions.ts
src/harnesses/opencode-process.ts
src/harnesses/opencode-provider.ts
src/harnesses/opencode-runner.ts
src/harnesses/tool-prompt.ts
src/lib/benchmark-checkpoint.ts
src/lib/failure-classifier.ts
src/lib/hardware-profile.ts
src/lib/logger.ts
src/lib/machine-profile/instance-id.ts
src/lib/machine-profile/legacy.ts
src/lib/machine-profile/normalization.ts
src/lib/machine-profile/probe-linux.ts
src/lib/machine-profile/probe-macos.ts
src/lib/machine-profile/probe-utils.ts
src/lib/machine-profile/probe-windows.ts
src/lib/model-aliases.ts
src/lib/model-profile/normalization.ts
src/lib/model-profile/registry-compat.ts
src/lib/model-profile/registry.ts
src/lib/openai-compat-client.ts
src/lib/signal-assessment.ts
src/lib/stats-format.ts
src/lib/test-catalog.ts
src/runner/index.ts
src/runner/item-executor.ts
src/runner/model-availability.ts
src/runner/model-eligibility.ts
src/runner/plan-builder.ts
src/runner/run-progress.ts
src/runtimes/discovery.ts
src/runtimes/index.ts
src/runtimes/ollama-residency.ts
src/runtimes/ollama-runtime.ts
src/runtimes/runtime.ts
src/runtimes/vllm-runtime.ts
src/schemas/common.schema.ts
src/schemas/config.schema.ts
src/schemas/index.ts
src/schemas/model-alias.schema.ts
src/schemas/model-profile.schema.ts
src/schemas/plan.schema.ts
src/schemas/result.schema.ts
src/schemas/test-catalog.schema.ts
src/tests/calculator-basic/test.meta.json
src/tests/calculator-stateful/test.meta.json
src/tests/event-emitter/test.meta.json
src/tests/rate-limiter/test.meta.json
src/tests/smoke/test.meta.json
src/tests/todo-app/test.meta.json
src/tests/ttl-cache/test.meta.json
src/tests/workspace-smoke/fixtures/logs/.gitkeep
src/tests/workspace-smoke/prompt.blind.md
src/tests/workspace-smoke/prompt.informed.md
src/tests/workspace-smoke/test.meta.json
test/aggregate.test.ts
test/benchmark-checkpoint.test.ts
test/build-index.test.ts
test/compare-checkpoint-guard.test.ts
test/compare-formatters.test.ts
test/dashboard-schema-compat.test.ts
test/dashboard-test-type.test.ts
test/failure-classifier.test.ts
test/generation-retry.test.ts
test/goose-adapter.test.ts
test/goose-output.test.ts
test/hardware-profile.test.ts
test/harness-compatibility.test.ts
test/item-executor.test.ts
test/item-retry.test.ts
test/machine-profile-migration.test.ts
test/model-alias-schema.test.ts
test/model-availability.test.ts
test/model-profile-registry.test.ts
test/ollama-adapter.test.ts
test/ollama-residency.test.ts
test/opencode-adapter.test.ts
test/opencode-artifacts.test.ts
test/opencode-cli-cache.test.ts
test/opencode-cli.test.ts
test/opencode-config.test.ts
test/opencode-model.test.ts
test/opencode-output.test.ts
test/opencode-runner-race.test.ts
test/opencode-runner.test.ts
test/plan-builder.test.ts
test/run-command.test.ts
test/runner-residency.test.ts
test/schema-regressions.test.ts
test/schemas.test.ts
test/signal-assessment.test.ts
test/stats-semantics.test.ts
test/stats.test.ts
test/test-catalog.test.ts
test/tool-prompt.test.ts
test/utils/fixtures.ts
test/vllm-runtime.test.ts
test/workspace-capability-parity.test.ts
test/workspace-prompt-parity.test.ts

💤 Files with no reviewable changes (10)

package.json
test/benchmark-checkpoint.test.ts
src/harnesses/opencode-process.ts
src/harnesses/goose-openai.ts
test/vllm-runtime.test.ts
src/runtimes/discovery.ts
src/harnesses/opencode-model.ts
src/lib/openai-compat-client.ts
src/runtimes/vllm-runtime.ts
src/harnesses/opencode-output.ts

AustinKelsay · 2026-04-25T13:58:26Z

@CodeRabbit full review

coderabbitai · 2026-04-25T13:58:32Z

✅ Actions performed

Full review triggered.

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

apps/dashboard/src/components/layout/header.tsx (1)
1-4: ⚠️ Potential issue | 🟡 Minor

Add missing "Exports:" line to file header.

The file header is missing the Exports: line required by the coding guidelines. As per coding guidelines, every file should document its exports in the header comment.
📝 Proposed fix
 /**
  * Purpose: Application header with navigation, gradient accent bar, and brand styling.
+ * Exports: Header
  * Provides links to the main dashboard views.
  */
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/components/layout/header.tsx` around lines 1 - 4, Add a
file header "Exports:" line to the top comment documenting what this module
exports (e.g., the Header component and any other named exports) — update the
header block above the purpose comment to include an "Exports:" section that
lists the exported symbols (such as Header or default export) exactly as they
appear in the file (adjust to "default" vs named if applicable).

🧹 Nitpick comments (10)

test/item-retry.test.ts (1)

87-93: Add explicit scoring invocation-count assertions for retry-path robustness.

Consider asserting scoreGenerationMock is called exactly twice in each scenario (initial score + retry score). It makes regressions in retry orchestration easier to catch.

Proposed test hardening

@@
 		expect(generateMock).toHaveBeenCalledTimes(1);
+		expect(scoreGenerationMock).toHaveBeenCalledTimes(2);
@@
 		expect(generateMock).toHaveBeenCalledTimes(1);
+		expect(scoreGenerationMock).toHaveBeenCalledTimes(2);

Also applies to: 121-188

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/item-retry.test.ts` around lines 87 - 93, Add explicit assertions that
scoreGenerationMock was invoked exactly twice to validate initial scoring +
retry; after each scenario's expectations (for the block using
scoreGenerationMock.mockResolvedValueOnce(...) and the later scenario around
lines 121-188), add expect(scoreGenerationMock).toHaveBeenCalledTimes(2) (and
reset/clear the mock between scenarios if not already done) so the retry
orchestration regressions are caught.

apps/dashboard/src/components/layout/header.tsx (1)

54-54: Use CSS variable instead of hardcoded color.

The brand color is hardcoded as "#34c759" instead of referencing the design system's --success or --brand CSS variable. This duplicates the same maintenance issue flagged in the leaderboard cards.

♻️ Refactor to use CSS variable

 									<span
 										className="absolute bottom-0 left-0 right-0 h-0.5 rounded-full"
 										style={{
-											background: "#34c759",
+											background: "hsl(var(--success))",
 											boxShadow: "0 0 8px hsla(142, 60%, 49%, 0.4)",
 										}}
 									/>

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/components/layout/header.tsx` at line 54, The header
component currently hardcodes the brand color as "#34c759" in the inline style
(background: "#34c759") — update that to use the design system CSS variable
(e.g., background: "var(--success)" or "var(--brand)") in the Header component
(header.tsx) so the style references the theme variable instead of a literal
color; locate the inline style object where background is set and replace the
literal with the appropriate var(...) token.

apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx (1)

63-99: Use CSS variables instead of hardcoded colors.

The card colors are hardcoded with inconsistent formats (hex vs HSL) instead of referencing the design system's CSS variables defined in index.css. This creates maintenance overhead and risks drift from the design tokens.

Several values already map to existing variables:

Lines 63, 74: "#34c759" → should use "hsl(var(--success))" or "hsl(var(--brand))"
Line 80: "hsl(212, 100%, 67%)" → should use "hsl(var(--primary))"

♻️ Refactor to use CSS variables

 	const cards = [
 		{
 			title: "Matched Runs",
 			value: String(aggregate?.summary.runsMatched ?? 0),
 			sub: null,
-			color: "#34c759",
+			color: "hsl(var(--success))",
 		},
 		{
 			title: "Profiles",
 			value: String(aggregate?.summary.machines ?? 0),
 			sub:
 				aggregate?.summary.instances !== undefined
 					? `${aggregate.summary.instances} ${
 							aggregate.summary.instances === 1 ? "instance" : "instances"
 						}`
 					: null,
-			color: "#34c759",
+			color: "hsl(var(--success))",
 		},
 		{
 			title: "Deduped Items",
 			value: String(filteredItemCount),
 			sub: `of ${aggregate?.summary.dedupedItems ?? 0} total`,
-			color: "hsl(212, 100%, 67%)",
+			color: "hsl(var(--primary))",
 		},
 		{
 			title: "Pass Rate",
 			value: formatPercent(passRate.passRate),
 			sub: `${passRate.passed}/${passRate.total} tests`,
-			color: "hsl(156, 67%, 55%)",
+			color: "hsl(var(--success))", // or define a new CSS variable if this specific shade is intentional
 		},
 		{
 			title: "Frontier Coverage",
 			value: formatPercent(frontierCoverage),
 			sub: `${frontierCount} of ${filteredItemCount} items`,
-			color: "hsl(270, 60%, 60%)",
+			color: "hsl(270, 60%, 60%)", // Consider adding --accent-purple to CSS variables
 		},
 		{
 			title: "Median Duration",
 			value: medianDuration !== null ? formatDuration(medianDuration) : "—",
 			sub: durations.length > 0 ? `${durations.length} items` : "no data",
-			color: "hsl(210, 85%, 60%)",
+			color: "hsl(var(--info))", // closest match: --info: 215 70% 62%
 		},
 	];

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx`
around lines 63 - 99, Replace hardcoded color strings in the leaderboard summary
card objects with the design system CSS variables: change occurrences of
"#34c759" (used in the card entries for "Profiles" and the preceding card) to
"hsl(var(--success))" or "hsl(var(--brand))" as appropriate, replace "hsl(212,
100%, 67%)" (the "Deduped Items" card) with "hsl(var(--primary))", and ensure
other color fields (e.g., "hsl(156, 67%, 55%)", "hsl(270, 60%, 60%)", "hsl(210,
85%, 60%)") are audited and updated to the corresponding CSS variables if design
tokens exist; update the color property values in the card object literals in
leaderboard-summary-cards.tsx (the objects that reference title/value/sub/color
for Profiles, Deduped Items, Pass Rate, Frontier Coverage, Median Duration) to
use the CSS variable strings.

src/lib/machine-profile/probe-utils.ts (1)

47-69: Add missing @throws tags on exported parser helpers.

Both exported functions in this segment are missing the required @throws documentation tag.

📝 Suggested doc-only patch

 /**
  * Parses a numeric probe result into a positive integer.
  *
  * `@param` value - Raw probe stdout
  * `@returns` Positive integer when valid
+ * `@throws` Never throws. Returns `undefined` for invalid input.
  */
 export function parsePositiveInt(
 	value: string | undefined,
 ): number | undefined {

 /**
  * Parses a memory string such as `8 GB` or `8192 MB`.
  *
  * `@param` value - Raw memory string
  * `@returns` Memory in bytes when parsable
+ * `@throws` Never throws. Returns `undefined` for unparsable input.
  */
 export function parseMemoryBytes(
 	value: string | undefined,
 ): number | undefined {

As per coding guidelines: “All exported functions must have TSDoc/JSDoc documentation including purpose, params, returns, and throws”.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/machine-profile/probe-utils.ts` around lines 47 - 69, Update the
TSDoc for both exported functions parsePositiveInt and parseMemoryBytes to
include an `@throws` tag: explicitly state that these helpers do not throw
exceptions (invalid or unparsable input results in undefined) so the
documentation meets the “params, returns, and throws” requirement; keep the rest
of the existing description and tags unchanged.

src/runtimes/runtime.ts (1)

1-12: Add an explicit Invariants section to the file header.

The header documents purpose/exports, but it should explicitly include invariants to match the repository header contract.

Proposed header patch

 /**
  * Purpose: Runtime interface and types for inference backends.
  * Exports: Runtime, RuntimeName, RUNTIME_NAMES, ModelInfo
  *
- * A Runtime represents an inference backend for active benchmark execution.
+ * Invariants:
+ * - Runtime names and API formats reflect supported execution backends.
+ * - Runtime contracts expose ping/listModels/getModelInfo consistently.
+ *
+ * A Runtime represents an inference backend for active benchmark execution.
  * Runtimes are responsible for:
  * - Health checks (ping)
  * - Model discovery (listModels)

As per coding guidelines "src/**/*.{ts,tsx}: Every file must start with a descriptive header describing purpose, exports, and invariants".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/runtimes/runtime.ts` around lines 1 - 12, Add an explicit "Invariants"
section to the top-of-file header in src/runtimes/runtime.ts alongside the
existing Purpose and Exports lines; describe any assumptions that must always
hold (e.g., Runtime implementations must implement ping/listModels/getModelInfo,
RUNTIME_NAMES must be stable identifiers, and ModelInfo must be
serializable/immutable) and ensure the header references the exported symbols
Runtime, RuntimeName, RUNTIME_NAMES, and ModelInfo so the file meets the repo's
header contract.

test/build-index.test.ts (1)

31-33: Preserve src/lib-relative fixture paths instead of flattening to basenames.

Using path.basename here can silently collapse distinct assets (or nested paths) into one filename, which weakens checkpoint-fixture fidelity as CORE_BENCHMARK_LIB_ASSETS evolves.

Suggested refactor

-const REQUIRED_LIB_ASSETS = CORE_BENCHMARK_LIB_ASSETS.map((assetPath) =>
-	path.basename(assetPath),
-);
+const REQUIRED_LIB_ASSETS = CORE_BENCHMARK_LIB_ASSETS.map((assetPath) => {
+	const normalized = assetPath.replaceAll("\\", "/");
+	return normalized.startsWith("src/lib/")
+		? normalized.slice("src/lib/".length)
+		: path.basename(normalized);
+});

 for (const filename of REQUIRED_LIB_ASSETS) {
 	const exportName = filename.replaceAll(/[^a-zA-Z0-9]+/g, "_");
+	const assetFilePath = path.join(libRoot, filename);
+	fs.mkdirSync(path.dirname(assetFilePath), { recursive: true });
 	fs.writeFileSync(
-		path.join(libRoot, filename),
+		assetFilePath,
 		`export const ${exportName} = ${JSON.stringify(filename)};\n`,
 	);
 }

Also applies to: 105-111

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/build-index.test.ts` around lines 31 - 33, The test flattens asset paths
using path.basename which can collapse distinct files; update the mapping that
assigns REQUIRED_LIB_ASSETS from CORE_BENCHMARK_LIB_ASSETS to preserve the
src/lib-relative path (e.g., keep the original relative path or use
path.relative(path.join(repoRoot, "src","lib"), assetPath)) instead of calling
path.basename, and make the same change for the second mapping block that
currently uses path.basename around lines where REQUIRED_LIB_ASSETS is created
again so fixtures remain uniquely identified; adjust any downstream expectations
that assumed basenames to use the preserved relative paths.

apps/dashboard/src/components/charts/model-comparison-chart.tsx (1)

97-101: Avoid render-phase logging side effects.

console.error inside render can fire repeatedly for the same bad input. Consider moving it to an effect (or removing it) to avoid noisy repeated logs.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx` around lines
97 - 101, The render currently calls
ModelComparisonChartPropsSchema.safeParse(props) and logs errors with
console.error during render (parsedProps variable), which causes noisy repeated
logs; move the logging into a useEffect that watches the props (or
parsedProps.success) so the error is logged only when props change, and keep the
items assignment using parsedProps.success ? parsedProps.data.items : [];
specifically, call safeParse in render to derive items but perform any
console.error reporting inside a useEffect tied to props or parsedProps.success
to avoid render-phase side effects.

test/schemas.test.ts (1)

223-244: Rename this test to match what it actually asserts.

The body now rejects runtimes: ["vllm"] and vllm profile variants, so “while accepting legacy runtime values” is misleading when this shows up in Vitest output.

📝 Suggested rename

-it("should reject removed vllm config fields while accepting legacy runtime values", () => {
+it("should reject removed vllm config fields and legacy vllm values in live config", () => {

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/schemas.test.ts` around lines 223 - 244, The test description string
currently says "should reject removed vllm config fields while accepting legacy
runtime values" but the assertions actually verify that vllm-related fields and
legacy runtime values (runtimes: ["vllm"]) and vllm profile variants are
rejected; update the it(...) test title to accurately reflect this behavior (for
example: "should reject removed vllm config fields including legacy runtime
values and profile variants") so it matches the assertions around
BenchConfigSchema.parse and the vllm-related cases.

src/harnesses/tool-prompt.ts (1)

91-94: Document the unsupported pathMode throw in TSDoc.

buildWorkspaceToolPrompt also throws for invalid pathMode values (Line 109-113), but that throw condition is not listed in the function’s @throws contract.

📝 Suggested doc patch

  * `@throws` {Error} If toolNames is empty
+ * `@throws` {Error} If pathMode is not "absolute-anchor" or "relative-only"
  * `@throws` {Error} If pathMode is "absolute-anchor" and workspaceRootPath is
  * missing or blank; message is "absolute-anchor requires workspaceRootPath"

As per coding guidelines, "All exported functions require TSDoc/JSDoc documentation including purpose, params, returns, and throws".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/tool-prompt.ts` around lines 91 - 94, Update the TSDoc for
buildWorkspaceToolPrompt to document the additional throw: add an `@throws` entry
stating that the function throws an Error for unsupported/invalid pathMode
values (the error thrown in buildWorkspaceToolPrompt around the pathMode
validation at lines 109-113), and include the expected message or shape (e.g.,
indicate it reports the invalid pathMode) so callers know this contract in
addition to the existing throws for empty toolNames and missing
workspaceRootPath for "absolute-anchor".

apps/dashboard/src/lib/types.ts (1)

524-527: Add full TSDoc tags for the exported helper.

isToolSmokeItem is exported, but its doc block is missing the required @param, @returns, and @throws tags.

📝 Suggested doc update

-/** Checks if an item is a tool-smoke test */
+/**
+ * Checks whether an item targets the tool-smoke preflight.
+ *
+ * `@param` item - Object carrying the benchmark test slug.
+ * `@returns` `true` when `item.test` matches the tool-smoke slug.
+ * `@throws` {never} This helper performs a pure string comparison.
+ */
 export function isToolSmokeItem(item: { test: string }): boolean {
 	return item.test === TOOL_SMOKE_TEST_SLUG;
 }

As per coding guidelines, "All exported functions must have TSDoc/JSDoc documentation including purpose, params, returns, and throws".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/lib/types.ts` around lines 524 - 527, Add a full TSDoc
block for the exported helper isToolSmokeItem describing its purpose, add an
`@param` tag for the item parameter (type { { test: string } }), an `@returns` tag
describing the boolean result, and an `@throws` tag (use `@throws` {never} or state
that it does not throw) so the doc meets the exported-function guideline;
reference the TOOL_SMOKE_TEST_SLUG constant in the description to explain the
comparison.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cli/migrate-machine-command.ts`:
- Around line 362-368: The current catch block treats every ZodError as an
options-validation failure; move the options validation out of the try so schema
errors from downstream artifacts are not misclassified. Specifically, call
MigrateMachineCommandOptionsSchema.parse(options) before entering the try/catch
that calls migrateResultsDirectory(), leaving the try to handle runtime failures
only; ensure that ZodErrors thrown by RunResultSchema.parse() or
RunPlanSchema.parse() inside migrateResultsDirectory() are allowed to propagate
or be handled separately (not swallowed as "Machine-profile migration options
failed validation"). Update the error handling message that references ZodError
accordingly so it only applies to the options parse step.

In `@src/harnesses/discovery.ts`:
- Around line 45-49: The catch in isOpenCodeAvailable currently always logs a
warn for any probe failure (logger.warn with probe: "opencode"), which spams on
expected "not installed" cases; update the handler to detect the normal "not
installed" condition (e.g., error.code === 'ENOENT' or error.message includes
"not found"/"command not found" or the specific sentinel your spawn/exec throws)
and log that as debug/trace or skip logging, while preserving logger.warn (or
error) for unexpected failures; modify the catch in isOpenCodeAvailable
accordingly so only real errors keep the warn log and expected absence uses a
lower-severity log.

In `@src/harnesses/opencode-adapter.ts`:
- Around line 106-123: selectProcessOutput currently returns stdout as soon as
it has any non-whitespace, which can mask a valid JSONL event stream on stderr;
update the logic to prefer the stream that actually parses into OpenCode events
by attempting to parse both stdout and stderr with parseOpenCodeEvents (or the
same parsing helper used downstream) and returning the stream that yields
valid/non-empty events, falling back to the existing MIN_OUTPUT_LENGTH check
only if neither parses successfully; reference selectProcessOutput,
parseOpenCodeEvents, and MIN_OUTPUT_LENGTH when locating and changing the code
and be sure to trim inputs before parsing to avoid false positives from
whitespace.

In `@src/harnesses/opencode-runner.ts`:
- Around line 220-228: The Execa call that creates proc (execa("opencode",
config.args, { env: config.env, cwd: config.cwd, stdin: "ignore", stdout:
"pipe", stderr: "pipe", cancelSignal: controller.signal, reject: false }))
leaves Execa's internal buffering enabled and duplicates memory with our
stdoutChunks/stderrChunks; disable Execa buffering by adding the buffer: false
option to the options object passed to execa so only our manual chunking is
retained and memory spikes are avoided.

In `@src/runtimes/ollama-residency.ts`:
- Around line 331-338: The loop currently calls sleep(Math.max(0,
pollIntervalMs)) unconditionally which can oversleep past the soft settle
window; change the sleep call in the residency loop (where deadline is computed
via performance.now()) to clamp the delay to the remaining time budget when
deadline is set — compute remaining = deadline - performance.now(), and call
sleep(Math.max(0, Math.min(pollIntervalMs, remaining))) (fall back to
pollIntervalMs if deadline is undefined) so the next sleep never exceeds the
remaining settleTimeoutMs before re-polling listRunningOllamaModels.

In `@test/opencode-runner-race.test.ts`:
- Around line 57-61: The tests hard-code cwd to "/tmp" which is OS-specific;
update the test to create and use a real temporary directory (e.g., via
fs.mkdtempSync(path.join(os.tmpdir(), 'opencode-')) or fs.promises.mkdtemp) and
pass that path into runOpenCodeCommand's cwd (same change for the other instance
at lines 87-91), and ensure you clean up the temp dir after the test; locate the
calls to runOpenCodeCommand in test/opencode-runner-race.test.ts to make the
change.

---

Outside diff comments:
In `@apps/dashboard/src/components/layout/header.tsx`:
- Around line 1-4: Add a file header "Exports:" line to the top comment
documenting what this module exports (e.g., the Header component and any other
named exports) — update the header block above the purpose comment to include an
"Exports:" section that lists the exported symbols (such as Header or default
export) exactly as they appear in the file (adjust to "default" vs named if
applicable).

---

Nitpick comments:
In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx`:
- Around line 97-101: The render currently calls
ModelComparisonChartPropsSchema.safeParse(props) and logs errors with
console.error during render (parsedProps variable), which causes noisy repeated
logs; move the logging into a useEffect that watches the props (or
parsedProps.success) so the error is logged only when props change, and keep the
items assignment using parsedProps.success ? parsedProps.data.items : [];
specifically, call safeParse in render to derive items but perform any
console.error reporting inside a useEffect tied to props or parsedProps.success
to avoid render-phase side effects.

In `@apps/dashboard/src/components/layout/header.tsx`:
- Line 54: The header component currently hardcodes the brand color as "#34c759"
in the inline style (background: "#34c759") — update that to use the design
system CSS variable (e.g., background: "var(--success)" or "var(--brand)") in
the Header component (header.tsx) so the style references the theme variable
instead of a literal color; locate the inline style object where background is
set and replace the literal with the appropriate var(...) token.

In `@apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx`:
- Around line 63-99: Replace hardcoded color strings in the leaderboard summary
card objects with the design system CSS variables: change occurrences of
"#34c759" (used in the card entries for "Profiles" and the preceding card) to
"hsl(var(--success))" or "hsl(var(--brand))" as appropriate, replace "hsl(212,
100%, 67%)" (the "Deduped Items" card) with "hsl(var(--primary))", and ensure
other color fields (e.g., "hsl(156, 67%, 55%)", "hsl(270, 60%, 60%)", "hsl(210,
85%, 60%)") are audited and updated to the corresponding CSS variables if design
tokens exist; update the color property values in the card object literals in
leaderboard-summary-cards.tsx (the objects that reference title/value/sub/color
for Profiles, Deduped Items, Pass Rate, Frontier Coverage, Median Duration) to
use the CSS variable strings.

In `@apps/dashboard/src/lib/types.ts`:
- Around line 524-527: Add a full TSDoc block for the exported helper
isToolSmokeItem describing its purpose, add an `@param` tag for the item parameter
(type { { test: string } }), an `@returns` tag describing the boolean result, and
an `@throws` tag (use `@throws` {never} or state that it does not throw) so the doc
meets the exported-function guideline; reference the TOOL_SMOKE_TEST_SLUG
constant in the description to explain the comparison.

In `@src/harnesses/tool-prompt.ts`:
- Around line 91-94: Update the TSDoc for buildWorkspaceToolPrompt to document
the additional throw: add an `@throws` entry stating that the function throws an
Error for unsupported/invalid pathMode values (the error thrown in
buildWorkspaceToolPrompt around the pathMode validation at lines 109-113), and
include the expected message or shape (e.g., indicate it reports the invalid
pathMode) so callers know this contract in addition to the existing throws for
empty toolNames and missing workspaceRootPath for "absolute-anchor".

In `@src/lib/machine-profile/probe-utils.ts`:
- Around line 47-69: Update the TSDoc for both exported functions
parsePositiveInt and parseMemoryBytes to include an `@throws` tag: explicitly
state that these helpers do not throw exceptions (invalid or unparsable input
results in undefined) so the documentation meets the “params, returns, and
throws” requirement; keep the rest of the existing description and tags
unchanged.

In `@src/runtimes/runtime.ts`:
- Around line 1-12: Add an explicit "Invariants" section to the top-of-file
header in src/runtimes/runtime.ts alongside the existing Purpose and Exports
lines; describe any assumptions that must always hold (e.g., Runtime
implementations must implement ping/listModels/getModelInfo, RUNTIME_NAMES must
be stable identifiers, and ModelInfo must be serializable/immutable) and ensure
the header references the exported symbols Runtime, RuntimeName, RUNTIME_NAMES,
and ModelInfo so the file meets the repo's header contract.

In `@test/build-index.test.ts`:
- Around line 31-33: The test flattens asset paths using path.basename which can
collapse distinct files; update the mapping that assigns REQUIRED_LIB_ASSETS
from CORE_BENCHMARK_LIB_ASSETS to preserve the src/lib-relative path (e.g., keep
the original relative path or use path.relative(path.join(repoRoot,
"src","lib"), assetPath)) instead of calling path.basename, and make the same
change for the second mapping block that currently uses path.basename around
lines where REQUIRED_LIB_ASSETS is created again so fixtures remain uniquely
identified; adjust any downstream expectations that assumed basenames to use the
preserved relative paths.

In `@test/item-retry.test.ts`:
- Around line 87-93: Add explicit assertions that scoreGenerationMock was
invoked exactly twice to validate initial scoring + retry; after each scenario's
expectations (for the block using scoreGenerationMock.mockResolvedValueOnce(...)
and the later scenario around lines 121-188), add
expect(scoreGenerationMock).toHaveBeenCalledTimes(2) (and reset/clear the mock
between scenarios if not already done) so the retry orchestration regressions
are caught.

In `@test/schemas.test.ts`:
- Around line 223-244: The test description string currently says "should reject
removed vllm config fields while accepting legacy runtime values" but the
assertions actually verify that vllm-related fields and legacy runtime values
(runtimes: ["vllm"]) and vllm profile variants are rejected; update the it(...)
test title to accurately reflect this behavior (for example: "should reject
removed vllm config fields including legacy runtime values and profile
variants") so it matches the assertions around BenchConfigSchema.parse and the
vllm-related cases.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b6ccf139-d50e-4313-9190-74ab3206d3ed

📥 Commits

Reviewing files that changed from the base of the PR and between 56b63e7 and 67de245.

📒 Files selected for processing (159)

README.md
apps/dashboard/scripts/build-index.ts
apps/dashboard/src/components/about/about-content.ts
apps/dashboard/src/components/about/about-page.tsx
apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx
apps/dashboard/src/components/charts/chart-primitives.tsx
apps/dashboard/src/components/charts/composite-score-chart.tsx
apps/dashboard/src/components/charts/frontier-eval-scatter.tsx
apps/dashboard/src/components/charts/model-comparison-chart.tsx
apps/dashboard/src/components/charts/model-radar-chart.tsx
apps/dashboard/src/components/charts/model-test-heatmap.tsx
apps/dashboard/src/components/charts/test-difficulty-chart.tsx
apps/dashboard/src/components/charts/test-type-split-chart.tsx
apps/dashboard/src/components/charts/token-efficiency-chart.tsx
apps/dashboard/src/components/layout/header.tsx
apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx
apps/dashboard/src/components/run-list/run-list-page.tsx
apps/dashboard/src/components/ui/section-heading.tsx
apps/dashboard/src/index.css
apps/dashboard/src/lib/aggregations-charts.ts
apps/dashboard/src/lib/aggregations-compare.ts
apps/dashboard/src/lib/aggregations-diagnostics.ts
apps/dashboard/src/lib/aggregations-test-types.ts
apps/dashboard/src/lib/aggregations-tooling.ts
apps/dashboard/src/lib/aggregations.ts
apps/dashboard/src/lib/api.ts
apps/dashboard/src/lib/chart-colors.ts
apps/dashboard/src/lib/schemas.ts
apps/dashboard/src/lib/types.ts
llm/context/codebase-overview.md
llm/implementation/computer-use-hardening.md
llm/implementation/harnesses-implementation.md
llm/implementation/multi-runtime-mvp-implementation.md
llm/project/project-overview.md
llm/project/project-rules.md
llm/project/user-flow.md
package.json
src/cli/compare-command.ts
src/cli/compare-formatters.ts
src/cli/compare-utils.ts
src/cli/migrate-machine-command.ts
src/cli/run-command.ts
src/harnesses/direct-adapter.ts
src/harnesses/discovery.ts
src/harnesses/goose-adapter.ts
src/harnesses/goose-openai.ts
src/harnesses/goose-output.ts
src/harnesses/harness.ts
src/harnesses/opencode-adapter.ts
src/harnesses/opencode-artifacts.ts
src/harnesses/opencode-cli.ts
src/harnesses/opencode-config.ts
src/harnesses/opencode-events.ts
src/harnesses/opencode-model.ts
src/harnesses/opencode-output.ts
src/harnesses/opencode-permissions.ts
src/harnesses/opencode-process.ts
src/harnesses/opencode-provider.ts
src/harnesses/opencode-runner.ts
src/harnesses/tool-prompt.ts
src/lib/benchmark-checkpoint.ts
src/lib/failure-classifier.ts
src/lib/hardware-profile.ts
src/lib/logger.ts
src/lib/machine-profile/instance-id.ts
src/lib/machine-profile/legacy.ts
src/lib/machine-profile/normalization.ts
src/lib/machine-profile/probe-linux.ts
src/lib/machine-profile/probe-macos.ts
src/lib/machine-profile/probe-utils.ts
src/lib/machine-profile/probe-windows.ts
src/lib/model-aliases.ts
src/lib/model-profile/normalization.ts
src/lib/model-profile/registry-compat.ts
src/lib/model-profile/registry.ts
src/lib/openai-compat-client.ts
src/lib/signal-assessment.ts
src/lib/stats-format.ts
src/lib/test-catalog.ts
src/runner/index.ts
src/runner/item-executor.ts
src/runner/model-availability.ts
src/runner/model-eligibility.ts
src/runner/plan-builder.ts
src/runner/run-progress.ts
src/runtimes/discovery.ts
src/runtimes/index.ts
src/runtimes/ollama-residency.ts
src/runtimes/ollama-runtime.ts
src/runtimes/runtime.ts
src/runtimes/vllm-runtime.ts
src/schemas/common.schema.ts
src/schemas/config.schema.ts
src/schemas/index.ts
src/schemas/model-alias.schema.ts
src/schemas/model-profile.schema.ts
src/schemas/plan.schema.ts
src/schemas/result.schema.ts
src/schemas/test-catalog.schema.ts
src/tests/calculator-basic/test.meta.json
src/tests/calculator-stateful/test.meta.json
src/tests/event-emitter/test.meta.json
src/tests/rate-limiter/test.meta.json
src/tests/smoke/test.meta.json
src/tests/todo-app/test.meta.json
src/tests/ttl-cache/test.meta.json
src/tests/workspace-smoke/fixtures/logs/.gitkeep
src/tests/workspace-smoke/prompt.blind.md
src/tests/workspace-smoke/prompt.informed.md
src/tests/workspace-smoke/test.meta.json
test/aggregate.test.ts
test/benchmark-checkpoint.test.ts
test/build-index.test.ts
test/compare-checkpoint-guard.test.ts
test/compare-formatters.test.ts
test/dashboard-schema-compat.test.ts
test/dashboard-test-type.test.ts
test/failure-classifier.test.ts
test/generation-retry.test.ts
test/goose-adapter.test.ts
test/goose-output.test.ts
test/hardware-profile.test.ts
test/harness-compatibility.test.ts
test/item-executor.test.ts
test/item-retry.test.ts
test/machine-profile-migration.test.ts
test/model-alias-schema.test.ts
test/model-availability.test.ts
test/model-profile-registry.test.ts
test/ollama-adapter.test.ts
test/ollama-residency.test.ts
test/opencode-adapter.test.ts
test/opencode-artifacts.test.ts
test/opencode-cli-cache.test.ts
test/opencode-cli.test.ts
test/opencode-config.test.ts
test/opencode-model.test.ts
test/opencode-output.test.ts
test/opencode-runner-race.test.ts
test/opencode-runner.test.ts
test/plan-builder.test.ts
test/run-command.test.ts
test/runner-residency.test.ts
test/schema-regressions.test.ts
test/schemas.test.ts
test/signal-assessment.test.ts
test/stats-semantics.test.ts
test/stats.test.ts
test/test-catalog.test.ts
test/tool-prompt.test.ts
test/utils/fixtures.ts
test/vllm-runtime.test.ts
test/workspace-capability-parity.test.ts
test/workspace-prompt-parity.test.ts

💤 Files with no reviewable changes (10)

src/harnesses/goose-openai.ts
test/benchmark-checkpoint.test.ts
package.json
test/vllm-runtime.test.ts
src/harnesses/opencode-process.ts
src/lib/openai-compat-client.ts
src/harnesses/opencode-output.ts
src/runtimes/vllm-runtime.ts
src/runtimes/discovery.ts
src/harnesses/opencode-model.ts

AustinKelsay · 2026-04-25T15:23:41Z

@CodeRabbit full review

coderabbitai · 2026-04-25T15:23:46Z

✅ Actions performed

Full review triggered.

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/harnesses/goose-adapter.ts (1)
401-434: ⚠️ Potential issue | 🟠 Major

Merge the first-attempt signal assessment into the retry result.

This branch returns retryResult as-is, so any taint already detected on the first attempt is lost if the retry succeeds. That can make a tainted Goose run look clean even though the earlier attempt already emitted transcript/input leakage.
🔧 Suggested fix
 							const retryResult = await createGooseAdapter({
 								maxTurns,
 								retryMaxTurns,
 							}).generate({
 								...opts,
 								prompt: appendRetryMarker(promptWithoutMarker),
 								timeoutMs: remainingMs,
 							});
 							return {
 								...retryResult,
 								durationMs: Math.round(performance.now() - startTime),
+								signalAssessment: mergeSignalAssessments(
+									signalAssessment,
+									retryResult.signalAssessment,
+								),
 							};
Add the helper to the import list as well:
import {
	appendSignalAssessmentReasons,
	getTranscriptOrInputTaintReasons,
	mergeSignalAssessments,
} from "../lib/signal-assessment.js";
Based on learnings, "Never implicitly fix up results after a run; capture enough evidence to explain outcomes (test failures, eval reasoning, durations, best-effort metrics)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/goose-adapter.ts` around lines 401 - 434, The retry branch
currently returns retryResult as-is, losing any signal/taint evidence from the
first attempt; update the branch inside the decision.shouldRetry handler to
merge the first-attempt assessment into the retry result by (1) importing
appendSignalAssessmentReasons, getTranscriptOrInputTaintReasons, and
mergeSignalAssessments, (2) computing the first attempt taint reasons via
getTranscriptOrInputTaintReasons(output) and/or appendSignalAssessmentReasons
for the initial result, and (3) calling
mergeSignalAssessments(firstAttemptAssessment, retryResult.signalAssessment) to
produce a merged signalAssessment attached to the returned object (keeping
durationMs calculation as you already do). Ensure this change is applied where
createGooseAdapter(...).generate(...) is called and before returning the
combined result so any initial taint is preserved in the final result.
src/lib/model-profile/registry.ts (1)
300-320: ⚠️ Potential issue | 🟡 Minor

Add explicit @throws tags to the remaining exported helpers.

parseInlineModelProfiles() can throw through parseInlineModelProfile(), and mergeModelProfiles() still needs an explicit @throws {never} to satisfy the exported-function contract. As per coding guidelines, "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/model-profile/registry.ts` around lines 300 - 320, Update the TSDoc
for the two exported helpers: for parseInlineModelProfiles add an explicit
`@throws` tag (e.g. `@throws` {Error} when parseInlineModelProfile fails) to
indicate it can propagate errors from parseInlineModelProfile, and for
mergeModelProfiles add an explicit `@throws` {never} tag to satisfy the
exported-function contract; edit the comment blocks immediately above the
parseInlineModelProfiles and mergeModelProfiles function declarations to include
these `@throws` annotations.

♻️ Duplicate comments (2)

src/schemas/common.schema.ts (1)
13-31: ⚠️ Potential issue | 🟠 Major

Runtime schema narrowing is still a breaking boundary change without explicit migration.

Line 16 now rejects vllm in active config/execution parsing. That breaks existing configs at schema-boundary parse time, but this change is shipped as 0.5.2 without an explicit migration path in the touched schema layer.

As per coding guidelines, "Maintain versioned schemas with schemaVersion field and use additive changes; migrations must be explicit".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/schemas/common.schema.ts` around lines 13 - 31, The change narrows active
runtime parsing by removing "vllm" from
supportedRuntimeNames/SupportedRuntimeNameSchema, which is a breaking schema
change; either revert that narrowing (restore "vllm" in supportedRuntimeNames
and keep SupportedRuntimeNameSchema accepting it) or implement an explicit
migration: bump SCHEMA_VERSION, add a migration function (exported from this
module) that converts old configs using "vllm" into the new allowed form (or
maps them to the artifact-runtime enum), and wire the migration into load/parse
flows; update SupportedRuntimeNameSchema and export the migration helper
alongside SCHEMA_VERSION so consumers can upgrade safely.
src/harnesses/opencode-permissions.ts (1)
31-32: ⚠️ Potential issue | 🟠 Major

Don't classify bare auto-reject* text as a permission denial.

The pattern still matches standalone auto-rejecting / auto-rejected, so unrelated rejection messages can be mislabeled as tool_permission_denied. That branch should stay tied to permission or external_directory context only.
🔧 Tighten the regex
 const PERMISSION_DENIAL_PATTERN =
-	/(?:\bpermission\b[^\r\n.;]{0,40}\b(?:denied|rejected)\b|\b(?:denied|rejected)\b[^\r\n.;]{0,40}\bpermission\b|\baccess\s+denied\b|\bauto-?reject(?:ing|ed)?\b|\bexternal_directory\b[^\r\n.;]{0,80}\b(?:denied|rejected|auto-?reject(?:ing|ed)?)\b)/i;
+	/(?:\bpermission\b[^\r\n.;]{0,40}\b(?:denied|rejected|auto-?reject(?:ing|ed)?)\b|\b(?:denied|rejected|auto-?reject(?:ing|ed)?)\b[^\r\n.;]{0,40}\bpermission\b|\baccess\s+denied\b|\bexternal_directory\b[^\r\n.;]{0,80}\b(?:denied|rejected|auto-?reject(?:ing|ed)?)\b)/i;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/opencode-permissions.ts` around lines 31 - 32, The current
PERMISSION_DENIAL_PATTERN allows standalone "auto-rejecting/auto-rejected" to
match; restrict those tokens so they're only considered when appearing in
permission-related context by removing the standalone auto-?reject alternative
and instead include auto-?reject(?:ing|ed)? as part of the
permission/external_directory alternatives (i.e., require it to be
adjacent/within the same branch that already contains \bpermission\b or
\bexternal_directory\b), keeping the existing word-boundary and length
constraints used in the other branches so unrelated rejection messages won't be
classified as tool_permission_denied.

🧹 Nitpick comments (4)

src/tests/calculator-stateful/test.meta.json (1)
1-8: Formatting changes look consistent.

The changes convert indentation from spaces to tabs and collapse the tags array to a single line. All semantic content remains unchanged, and the JSON is valid.

Since the summary indicates this is part of a broader reformatting pattern across multiple test.meta.json files, this appears intentional.

If not already in place, consider adding a formatter configuration file (e.g., .editorconfig, .prettierrc) to enforce consistent formatting automatically and prevent future manual reformatting.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tests/calculator-stateful/test.meta.json` around lines 1 - 8, The JSON
formatting changes are intentional but inconsistent across similar
test.meta.json files; add a repository-wide formatter config (for example
.editorconfig or .prettierrc) that enforces indentation (tabs vs spaces) and
array line-wrapping rules, apply the formatter to all test.meta.json files, and
commit the updated config and formatted files so future edits follow the same
style automatically.
apps/dashboard/src/components/charts/model-comparison-chart.tsx (1)
97-105: Avoid re-logging the same validation failure on every render.

props gets a fresh object identity each render, so a persistently invalid payload will keep retriggering this effect and spam the client console. Consider logging only when the issue set changes, or gating the log to development builds.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx` around lines
97 - 105, The current effect logs validation failures every render because props
identity changes; change the effect to only run/log when the validation result
actually changes (e.g., depend on parsedProps.success or parsedProps.error
instead of parsedProps, or compare/store the previous error via a ref) and/or
guard the log behind a development-only check (process.env.NODE_ENV ===
'development'). Update the useEffect that references parsedProps to use
parsedProps.success and parsedProps.error in its dependency array and only call
console.error when parsedProps.success is false and the error differs from the
last logged error (or when in dev), referencing the parsedProps variable and
ModelComparisonChartPropsSchema for locating the validation logic.
src/harnesses/harness.ts (1)
78-84: Tighten this helper instead of casting unknown runtime strings.

runtime as SupportedRuntimeName defeats the narrowing you added to HARNESS_RUNTIME_COMPATIBILITY. A stale "vllm" or typo now looks like an ordinary compatibility miss, which makes unsupported live runtimes harder to catch. Prefer runtime: SupportedRuntimeName here, or parse and throw before the lookup.

As per coding guidelines, "Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as unknown and parse with Zod immediately" and "Throw on invalid config, invalid schemas, and programmer mistakes; avoid silent fallbacks".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/harness.ts` around lines 78 - 84, The helper is currently
casting an arbitrary runtime string into SupportedRuntimeName which hides
invalid runtimes; update isHarnessCompatibleWithRuntime to validate the runtime
before lookup (either change the parameter to runtime: SupportedRuntimeName or
accept unknown/string and parse/validate it with your runtime Zod schema), and
if validation fails throw an error instead of casting; then use the validated
value to index HARNESS_RUNTIME_COMPATIBILITY so only genuine
SupportedRuntimeName keys are used in isHarnessCompatibleWithRuntime.
src/runner/run-progress.ts (1)
258-258: Consider using path.join for checkpoint path construction.

The checkpoint path is constructed via string interpolation, which may produce inconsistent path separators on Windows.
♻️ Suggested fix
+import * as path from "node:path";
+
 // ... at line 258
-		checkpointPath: `${input.config.outputDir}/${input.plan.runId}/run.partial.json`,
+		checkpointPath: path.join(input.config.outputDir, input.plan.runId, "run.partial.json"),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/runner/run-progress.ts` at line 258, The checkpoint path is built with
string interpolation which breaks cross-platform path separators; replace the
template string at the checkpointPath assignment (currently using
`${input.config.outputDir}/${input.plan.runId}/run.partial.json`) with
path.join(input.config.outputDir, input.plan.runId, 'run.partial.json') and add
the appropriate import for the Node path module (e.g., import path from 'path'
or const path = require('path')) at the top of run-progress.ts so paths are
constructed portably.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Around line 28-32: The props schema ModelFilterDropdownPropsSchema currently
uses z.object() which silently strips unknown keys; change it to
z.strictObject() so validation fails on unexpected props. Update the schema
creation (ModelFilterDropdownPropsSchema) to call z.strictObject({...}) while
keeping the same property definitions (models, selectedModels,
onSelectionChange) and return/arg types, and run/update any spots that import or
validate against this schema to handle the stricter validation behavior.

In `@src/harnesses/goose-adapter.ts`:
- Around line 459-499: Compute durationMs = Date.now() - startTime at the top of
the catch and include that value on any Error you rethrow so downstream code
(e.g., extractGenerationFailureDetails) can read the real duration;
specifically, in the execa-style branch where you Object.assign a new Error (the
block using execaError, effectiveOutput, appendSignalAssessmentReasons) add
durationMs to the assigned properties, and in the subsequent timeout branch (the
plain throw new Error(...) when isTimeoutError) attach durationMs to the thrown
Error as well (e.g., via Object.assign) so both failure paths preserve the run
duration.

In `@src/lib/machine-profile/probe-utils.ts`:
- Line 67: The JSDoc in src/lib/machine-profile/probe-utils.ts uses the
non-standard "@throws {never}" syntax (same issue as parsePositiveInt); locate
the JSDoc block containing "@throws {never}" and remove that tag, or replace it
with a valid TSDoc comment such as a `@returns` description or a `@remarks` line
that states the function returns undefined for invalid/unparsable input; ensure
the exported function's TSDoc still includes purpose, `@param`(s), and `@returns` so
it conforms to the project's documentation guidelines (reference the
parsePositiveInt JSDoc pattern for consistency).
- Line 52: Remove the non-standard JSDoc tag "@throws {never}" in the JSDoc
block above the exported function in probe-utils.ts and either delete the
`@throws` line entirely or replace it with a clarifying comment (e.g., "This
function does not throw exceptions; returns undefined on invalid input") so the
TSDoc is valid; ensure the exported function's docblock still contains purpose,
`@param` and `@returns` as required.

In `@test/build-index.test.ts`:
- Around line 31-33: The mapped relative asset paths in REQUIRED_LIB_ASSETS can
be nested, so before writing each mapped core asset ensure its parent
directories exist: when iterating over REQUIRED_LIB_ASSETS to create fixture
files (the write operation that uses each relative path), call mkdir with
recursion on path.dirname(targetPath) (e.g. using fs.mkdirSync or
fs.promises.mkdir with { recursive: true }) for the destination built from the
relative path, then perform the write; update the test/build-index.test.ts write
loop to create parent dirs first for each REQUIRED_LIB_ASSETS entry.

---

Outside diff comments:
In `@src/harnesses/goose-adapter.ts`:
- Around line 401-434: The retry branch currently returns retryResult as-is,
losing any signal/taint evidence from the first attempt; update the branch
inside the decision.shouldRetry handler to merge the first-attempt assessment
into the retry result by (1) importing appendSignalAssessmentReasons,
getTranscriptOrInputTaintReasons, and mergeSignalAssessments, (2) computing the
first attempt taint reasons via getTranscriptOrInputTaintReasons(output) and/or
appendSignalAssessmentReasons for the initial result, and (3) calling
mergeSignalAssessments(firstAttemptAssessment, retryResult.signalAssessment) to
produce a merged signalAssessment attached to the returned object (keeping
durationMs calculation as you already do). Ensure this change is applied where
createGooseAdapter(...).generate(...) is called and before returning the
combined result so any initial taint is preserved in the final result.

In `@src/lib/model-profile/registry.ts`:
- Around line 300-320: Update the TSDoc for the two exported helpers: for
parseInlineModelProfiles add an explicit `@throws` tag (e.g. `@throws` {Error} when
parseInlineModelProfile fails) to indicate it can propagate errors from
parseInlineModelProfile, and for mergeModelProfiles add an explicit `@throws`
{never} tag to satisfy the exported-function contract; edit the comment blocks
immediately above the parseInlineModelProfiles and mergeModelProfiles function
declarations to include these `@throws` annotations.

---

Duplicate comments:
In `@src/harnesses/opencode-permissions.ts`:
- Around line 31-32: The current PERMISSION_DENIAL_PATTERN allows standalone
"auto-rejecting/auto-rejected" to match; restrict those tokens so they're only
considered when appearing in permission-related context by removing the
standalone auto-?reject alternative and instead include auto-?reject(?:ing|ed)?
as part of the permission/external_directory alternatives (i.e., require it to
be adjacent/within the same branch that already contains \bpermission\b or
\bexternal_directory\b), keeping the existing word-boundary and length
constraints used in the other branches so unrelated rejection messages won't be
classified as tool_permission_denied.

In `@src/schemas/common.schema.ts`:
- Around line 13-31: The change narrows active runtime parsing by removing
"vllm" from supportedRuntimeNames/SupportedRuntimeNameSchema, which is a
breaking schema change; either revert that narrowing (restore "vllm" in
supportedRuntimeNames and keep SupportedRuntimeNameSchema accepting it) or
implement an explicit migration: bump SCHEMA_VERSION, add a migration function
(exported from this module) that converts old configs using "vllm" into the new
allowed form (or maps them to the artifact-runtime enum), and wire the migration
into load/parse flows; update SupportedRuntimeNameSchema and export the
migration helper alongside SCHEMA_VERSION so consumers can upgrade safely.

---

Nitpick comments:
In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx`:
- Around line 97-105: The current effect logs validation failures every render
because props identity changes; change the effect to only run/log when the
validation result actually changes (e.g., depend on parsedProps.success or
parsedProps.error instead of parsedProps, or compare/store the previous error
via a ref) and/or guard the log behind a development-only check
(process.env.NODE_ENV === 'development'). Update the useEffect that references
parsedProps to use parsedProps.success and parsedProps.error in its dependency
array and only call console.error when parsedProps.success is false and the
error differs from the last logged error (or when in dev), referencing the
parsedProps variable and ModelComparisonChartPropsSchema for locating the
validation logic.

In `@src/harnesses/harness.ts`:
- Around line 78-84: The helper is currently casting an arbitrary runtime string
into SupportedRuntimeName which hides invalid runtimes; update
isHarnessCompatibleWithRuntime to validate the runtime before lookup (either
change the parameter to runtime: SupportedRuntimeName or accept unknown/string
and parse/validate it with your runtime Zod schema), and if validation fails
throw an error instead of casting; then use the validated value to index
HARNESS_RUNTIME_COMPATIBILITY so only genuine SupportedRuntimeName keys are used
in isHarnessCompatibleWithRuntime.

In `@src/runner/run-progress.ts`:
- Line 258: The checkpoint path is built with string interpolation which breaks
cross-platform path separators; replace the template string at the
checkpointPath assignment (currently using
`${input.config.outputDir}/${input.plan.runId}/run.partial.json`) with
path.join(input.config.outputDir, input.plan.runId, 'run.partial.json') and add
the appropriate import for the Node path module (e.g., import path from 'path'
or const path = require('path')) at the top of run-progress.ts so paths are
constructed portably.

In `@src/tests/calculator-stateful/test.meta.json`:
- Around line 1-8: The JSON formatting changes are intentional but inconsistent
across similar test.meta.json files; add a repository-wide formatter config (for
example .editorconfig or .prettierrc) that enforces indentation (tabs vs spaces)
and array line-wrapping rules, apply the formatter to all test.meta.json files,
and commit the updated config and formatted files so future edits follow the
same style automatically.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8aaf6303-61ff-4918-b15b-29de7e291bfb

📥 Commits

Reviewing files that changed from the base of the PR and between 56b63e7 and 6c76840.

📒 Files selected for processing (159)

README.md
apps/dashboard/scripts/build-index.ts
apps/dashboard/src/components/about/about-content.ts
apps/dashboard/src/components/about/about-page.tsx
apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx
apps/dashboard/src/components/charts/chart-primitives.tsx
apps/dashboard/src/components/charts/composite-score-chart.tsx
apps/dashboard/src/components/charts/frontier-eval-scatter.tsx
apps/dashboard/src/components/charts/model-comparison-chart.tsx
apps/dashboard/src/components/charts/model-radar-chart.tsx
apps/dashboard/src/components/charts/model-test-heatmap.tsx
apps/dashboard/src/components/charts/test-difficulty-chart.tsx
apps/dashboard/src/components/charts/test-type-split-chart.tsx
apps/dashboard/src/components/charts/token-efficiency-chart.tsx
apps/dashboard/src/components/layout/header.tsx
apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx
apps/dashboard/src/components/run-list/run-list-page.tsx
apps/dashboard/src/components/ui/section-heading.tsx
apps/dashboard/src/index.css
apps/dashboard/src/lib/aggregations-charts.ts
apps/dashboard/src/lib/aggregations-compare.ts
apps/dashboard/src/lib/aggregations-diagnostics.ts
apps/dashboard/src/lib/aggregations-test-types.ts
apps/dashboard/src/lib/aggregations-tooling.ts
apps/dashboard/src/lib/aggregations.ts
apps/dashboard/src/lib/api.ts
apps/dashboard/src/lib/chart-colors.ts
apps/dashboard/src/lib/schemas.ts
apps/dashboard/src/lib/types.ts
llm/context/codebase-overview.md
llm/implementation/computer-use-hardening.md
llm/implementation/harnesses-implementation.md
llm/implementation/multi-runtime-mvp-implementation.md
llm/project/project-overview.md
llm/project/project-rules.md
llm/project/user-flow.md
package.json
src/cli/compare-command.ts
src/cli/compare-formatters.ts
src/cli/compare-utils.ts
src/cli/migrate-machine-command.ts
src/cli/run-command.ts
src/harnesses/direct-adapter.ts
src/harnesses/discovery.ts
src/harnesses/goose-adapter.ts
src/harnesses/goose-openai.ts
src/harnesses/goose-output.ts
src/harnesses/harness.ts
src/harnesses/opencode-adapter.ts
src/harnesses/opencode-artifacts.ts
src/harnesses/opencode-cli.ts
src/harnesses/opencode-config.ts
src/harnesses/opencode-events.ts
src/harnesses/opencode-model.ts
src/harnesses/opencode-output.ts
src/harnesses/opencode-permissions.ts
src/harnesses/opencode-process.ts
src/harnesses/opencode-provider.ts
src/harnesses/opencode-runner.ts
src/harnesses/tool-prompt.ts
src/lib/benchmark-checkpoint.ts
src/lib/failure-classifier.ts
src/lib/hardware-profile.ts
src/lib/logger.ts
src/lib/machine-profile/instance-id.ts
src/lib/machine-profile/legacy.ts
src/lib/machine-profile/normalization.ts
src/lib/machine-profile/probe-linux.ts
src/lib/machine-profile/probe-macos.ts
src/lib/machine-profile/probe-utils.ts
src/lib/machine-profile/probe-windows.ts
src/lib/model-aliases.ts
src/lib/model-profile/normalization.ts
src/lib/model-profile/registry-compat.ts
src/lib/model-profile/registry.ts
src/lib/openai-compat-client.ts
src/lib/signal-assessment.ts
src/lib/stats-format.ts
src/lib/test-catalog.ts
src/runner/index.ts
src/runner/item-executor.ts
src/runner/model-availability.ts
src/runner/model-eligibility.ts
src/runner/plan-builder.ts
src/runner/run-progress.ts
src/runtimes/discovery.ts
src/runtimes/index.ts
src/runtimes/ollama-residency.ts
src/runtimes/ollama-runtime.ts
src/runtimes/runtime.ts
src/runtimes/vllm-runtime.ts
src/schemas/common.schema.ts
src/schemas/config.schema.ts
src/schemas/index.ts
src/schemas/model-alias.schema.ts
src/schemas/model-profile.schema.ts
src/schemas/plan.schema.ts
src/schemas/result.schema.ts
src/schemas/test-catalog.schema.ts
src/tests/calculator-basic/test.meta.json
src/tests/calculator-stateful/test.meta.json
src/tests/event-emitter/test.meta.json
src/tests/rate-limiter/test.meta.json
src/tests/smoke/test.meta.json
src/tests/todo-app/test.meta.json
src/tests/ttl-cache/test.meta.json
src/tests/workspace-smoke/fixtures/logs/.gitkeep
src/tests/workspace-smoke/prompt.blind.md
src/tests/workspace-smoke/prompt.informed.md
src/tests/workspace-smoke/test.meta.json
test/aggregate.test.ts
test/benchmark-checkpoint.test.ts
test/build-index.test.ts
test/compare-checkpoint-guard.test.ts
test/compare-formatters.test.ts
test/dashboard-schema-compat.test.ts
test/dashboard-test-type.test.ts
test/failure-classifier.test.ts
test/generation-retry.test.ts
test/goose-adapter.test.ts
test/goose-output.test.ts
test/hardware-profile.test.ts
test/harness-compatibility.test.ts
test/item-executor.test.ts
test/item-retry.test.ts
test/machine-profile-migration.test.ts
test/model-alias-schema.test.ts
test/model-availability.test.ts
test/model-profile-registry.test.ts
test/ollama-adapter.test.ts
test/ollama-residency.test.ts
test/opencode-adapter.test.ts
test/opencode-artifacts.test.ts
test/opencode-cli-cache.test.ts
test/opencode-cli.test.ts
test/opencode-config.test.ts
test/opencode-model.test.ts
test/opencode-output.test.ts
test/opencode-runner-race.test.ts
test/opencode-runner.test.ts
test/plan-builder.test.ts
test/run-command.test.ts
test/runner-residency.test.ts
test/schema-regressions.test.ts
test/schemas.test.ts
test/signal-assessment.test.ts
test/stats-semantics.test.ts
test/stats.test.ts
test/test-catalog.test.ts
test/tool-prompt.test.ts
test/utils/fixtures.ts
test/vllm-runtime.test.ts
test/workspace-capability-parity.test.ts
test/workspace-prompt-parity.test.ts

💤 Files with no reviewable changes (10)

src/harnesses/goose-openai.ts
package.json
test/benchmark-checkpoint.test.ts
src/harnesses/opencode-model.ts
src/harnesses/opencode-process.ts
src/runtimes/discovery.ts
src/lib/openai-compat-client.ts
src/harnesses/opencode-output.ts
test/vllm-runtime.test.ts
src/runtimes/vllm-runtime.ts

AustinKelsay · 2026-04-25T16:10:56Z

@CodeRabbit full review

coderabbitai · 2026-04-25T16:11:01Z

✅ Actions performed

Full review triggered.

coderabbitai

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

apps/dashboard/src/components/ui/section-heading.tsx (1)
1-4: ⚠️ Potential issue | 🟡 Minor

Add Invariants to the file header.

This header is missing the Invariants section required by the repo convention.
Suggested update
 /**
  * Purpose: Shared section heading with left-border accent.
  * Exports: SectionHeading
+ * Invariants:
+ * - Renders title and description without side effects.
+ * - Applies consistent heading/border styling for section titles.
  */
As per coding guidelines "Every file begins with a short header documenting purpose, exports, and invariants".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/components/ui/section-heading.tsx` around lines 1 - 4,
Update the file header comment to include an "Invariants" section beneath
Purpose and Exports: add one or more short, repo-convention-style invariant
statements about the SectionHeading component (e.g., that SectionHeading is a
pure/presentational component with no side effects, its public API/props
expectations, and any stability guarantees like fixed class names or layout
behavior). Ensure the header still documents Purpose and Exports and references
the exported symbol SectionHeading so readers can quickly see the file-level
guarantees.
src/runner/item-executor.ts (1)
298-307: ⚠️ Potential issue | 🟠 Major

Persist the full retry block here, or omit retry metadata entirely.

ScoringMetricsSchema now treats retryGenerationDurationMs as retry-only metadata, but this builder still emits that field by itself. Any compile-retry path with retryGenerationDurationMs > 0 will now produce a scoringMetrics payload that violates the schema contract for run.json. Thread retryAttempted, retryKind, retryReason, and retryPromoted through alongside the duration, or stop emitting the duration until the full block is available.

Based on learnings, "Use Zod for schema validation at all boundaries (CLI args, env, JSON files, HTTP responses)" and "Never implicitly fix up results after a run; capture enough evidence to explain outcomes (test failures, eval reasoning, durations, best-effort metrics)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/runner/item-executor.ts` around lines 298 - 307, The builder emits
retryGenerationDurationMs into scoringMetrics without the full retry metadata,
violating ScoringMetricsSchema; update the scoringMetrics construction (where
scoringMetrics is built from scoringOutcome) so that if
scoringOutcome.retryGenerationDurationMs > 0 you also thread and include
scoringOutcome.retryAttempted, scoringOutcome.retryKind,
scoringOutcome.retryReason, and scoringOutcome.retryPromoted into the same retry
block, otherwise omit retryGenerationDurationMs entirely; ensure the produced
object matches ScoringMetricsSchema exactly (either full retry block present or
no retry fields).

♻️ Duplicate comments (2)

src/lib/machine-profile/probe-utils.ts (1)
47-55: ⚠️ Potential issue | 🟡 Minor

Re-add @throws tags for the exported parser helpers.

The updated JSDoc blocks are clear, but exported functions in this repo still require an explicit @throws tag.
Suggested update
 /**
  * Parses a numeric probe result into a positive integer.
  *
  * `@param` value - Raw probe stdout
  * `@returns` Positive integer when valid; undefined for invalid or unparsable input
+ * `@throws` Never throws; returns undefined for invalid or unparsable input.
  */
 export function parsePositiveInt(
 	value: string | undefined,
 ): number | undefined {

 /**
  * Parses a memory string such as `8 GB` or `8192 MB`.
  *
  * `@param` value - Raw memory string
  * `@returns` Memory in bytes when parsable; undefined for invalid or unparsable input
+ * `@throws` Never throws; returns undefined for invalid or unparsable input.
  */
 export function parseMemoryBytes(
 	value: string | undefined,
 ): number | undefined {
As per coding guidelines "All exported functions must have TSDoc/JSDoc documentation including purpose, params, returns, and throws".

Also applies to: 61-69
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/machine-profile/probe-utils.ts` around lines 47 - 55, Add the missing
`@throws` tags to the exported parser helpers' JSDoc (e.g., parsePositiveInt and
the other exported parsers in this file) describing what exceptions can be
raised; update the JSDoc block above function parsePositiveInt to include an
`@throws` line stating the conditions under which an error is thrown (e.g.,
invalid format or out-of-range) and do the same for the other exported parser
functions so all exported functions include purpose, `@param`, `@returns` and
`@throws` per repo guidelines.
test/schemas.test.ts (1)
231-239: ⚠️ Potential issue | 🟠 Major

Don’t silently coerce runtimes: ["vllm"] inside BenchConfigSchema.

This expectation makes the live config boundary accept an explicit deprecated runtime again. Keep the legacy mapping in migrateLegacySupportedRuntimeNames() and make BenchConfigSchema.parse({ runtimes: ["vllm"] }) throw instead.

As per coding guidelines, "Throw on invalid config, invalid schemas, and programmer mistakes; avoid silent fallbacks".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/schemas.test.ts` around lines 231 - 239, The test is asserting that
BenchConfigSchema currently silently coerces deprecated runtime "vllm" to
"ollama"; instead update the schema so BenchConfigSchema.parse({ runtimes:
["vllm"] }) throws; do not remove the existing
migrateLegacySupportedRuntimeNames() helper (keep legacy mapping there for
explicit migration flows) but stop invoking or applying that migration inside
the BenchConfigSchema parsing/validation path (or change the schema's runtime
validator to treat "vllm" as invalid), and update any tests to expect an
exception from BenchConfigSchema.parse rather than accepting a coerced value.

🧹 Nitpick comments (7)

test/build-index.test.ts (1)

31-33: Harden REQUIRED_LIB_ASSETS mapping with an explicit src/lib/ invariant.

path.relative("src/lib", assetPath) will also return ../... for unexpected inputs, which can make fixtures drift silently if CORE_BENCHMARK_LIB_ASSETS is edited later. Prefer a prefix check + slice.

Suggested refactor

-const REQUIRED_LIB_ASSETS = CORE_BENCHMARK_LIB_ASSETS.map((assetPath) =>
-	path.relative("src/lib", assetPath),
-);
+const REQUIRED_LIB_ASSETS = CORE_BENCHMARK_LIB_ASSETS.map((assetPath) => {
+	if (!assetPath.startsWith("src/lib/")) {
+		throw new Error(
+			`CORE_BENCHMARK_LIB_ASSETS must contain only src/lib assets. Received: ${assetPath}`,
+		);
+	}
+	return assetPath.slice("src/lib/".length);
+});

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/build-index.test.ts` around lines 31 - 33, The mapping for
REQUIRED_LIB_ASSETS currently uses path.relative("src/lib", assetPath) which can
yield ../... for unexpected inputs; update the mapping over
CORE_BENCHMARK_LIB_ASSETS so you first assert or normalize that each assetPath
starts with the invariant "src/lib/" (e.g. assetPath.startsWith("src/lib/")),
and then compute the relative path by slicing off that prefix
(assetPath.slice("src/lib/".length)); for assets that don't match, either throw
an explicit error or filter/log them so fixtures cannot silently drift. Ensure
you update the mapping where REQUIRED_LIB_ASSETS is declared and remove the use
of path.relative for this transformation.

src/lib/machine-profile/instance-id.ts (1)

194-228: Consider adding Zod validation for boundary inputs.

While the current readNonEmpty validation provides basic checks, the coding guidelines recommend validating all boundary inputs with Zod immediately. This function accepts options from callers and reads environment variables—both are boundaries that could benefit from schema validation.

Example: Add Zod schema for options validation

import { z } from "zod";

const ResolveMachineInstanceIdOptionsSchema = z.object({
  configuredInstanceId: z.string().optional(),
  legacyConfiguredInstanceId: z.string().optional(),
  env: z.record(z.string()).optional(),
  instanceIdFilePath: z.string().optional(),
});

export function resolveMachineInstanceId(
  options: ResolveMachineInstanceIdOptions = {},
): ResolvedMachineInstanceId {
  const validated = ResolveMachineInstanceIdOptionsSchema.parse(options);
  const env = validated.env ?? process.env;
  // ... rest of function
}

As per coding guidelines: "Validate at boundaries with Zod and pass typed data inward" and "Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as unknown and parse with Zod immediately."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/machine-profile/instance-id.ts` around lines 194 - 228, The function
resolveMachineInstanceId accepts external options and reads env variables but
doesn't validate those boundary inputs with Zod; add a Zod schema (e.g.
ResolveMachineInstanceIdOptionsSchema) that describes configuredInstanceId,
legacyConfiguredInstanceId, env (record of strings) and instanceIdFilePath,
parse/validate the incoming options at the top of resolveMachineInstanceId, then
use the validated values (validated.env ?? process.env) throughout (including
where readNonEmpty is called and when resolving instanceIdFilePath via
resolveDefaultInstanceIdPath) so only typed/validated data is used inside the
function.

src/cli/compare-formatters.ts (1)

223-239: Extract the repeated fixed-width table renderer.

The regressions, improvements, and exclusive-items paths all duplicate the same header and row-formatting logic. That makes future width/truncation tweaks easy to miss in one branch even though they need to stay in lockstep.

As per coding guidelines, "Favor iteration and modularization over code duplication".

Also applies to: 258-275, 339-377
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/compare-formatters.ts` around lines 223 - 239, Duplicate fixed-width
table printing logic (header, pad/truncate, row loop) used for regressions,
improvements, and exclusive-items should be consolidated into a single helper;
create a function (e.g., renderFixedWidthTable or formatStatusChangeTable) that
accepts the column width spec (STATUS_CHANGE_COLUMN_WIDTHS), a header array
["MODEL","HARNESS","TEST","PASS"], and an items array (like
regressions/improvements/exclusiveItems) and uses existing utilities pad and
truncate to produce the header and rows; replace the three duplicated blocks
(current regressions block and the blocks at the ranges you noted: 258-275 and
339-377) with calls to this helper, ensuring you pass item field accessors for
model, harness, test, and passType so truncation/widths remain in sync.

test/utils/fixtures.ts (1)

117-199: Prefer a shared schema-version constant over repeated schemaVersion: 1 literals.

Hardcoded repeated literals here can drift silently when schema version increments.

♻️ Suggested refactor

 import {
 	buildMachineProfileKey,
 	buildMachineProfileLabel,
 	normalizeMachineProfile,
 } from "../../src/lib/machine-profile/normalization.js";
+import { SCHEMA_VERSION } from "../../src/schemas/index.js";
@@
-			schemaVersion: 1,
+			schemaVersion: SCHEMA_VERSION,
@@
-			schemaVersion: 1,
+			schemaVersion: SCHEMA_VERSION,
@@
-			schemaVersion: 1,
+			schemaVersion: SCHEMA_VERSION,
@@
-			schemaVersion: 1,
+			schemaVersion: SCHEMA_VERSION,
@@
-			schemaVersion: 1,
+			schemaVersion: SCHEMA_VERSION,
@@
-			schemaVersion: 1,
+			schemaVersion: SCHEMA_VERSION,

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/utils/fixtures.ts` around lines 117 - 199, Replace the repeated numeric
literal schemaVersion: 1 with a single shared constant (e.g., SCHEMA_VERSION or
PREFLIGHT_SCHEMA_VERSION) declared near the top of the fixtures module and used
in each fixture object; update the fixtures array entries (objects with slug
values like "tool-smoke", "workspace-tool-smoke", "file-search-smoke", etc.) to
reference that constant instead of the literal so future schema bumps require
changing only the constant.

src/harnesses/goose-adapter.ts (1)

179-517: Split generate() before this grows any further.

This path now handles prompt construction, env setup, process launch, output normalization, taint extraction, retry orchestration, and error translation in one function. Breaking those into a few focused helpers would get the file back under the repo limit and make the retry/failure paths much easier to change safely.

As per coding guidelines, "Keep files under 500 lines; split aggressively by responsibility".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/goose-adapter.ts` around lines 179 - 517, The generate() method
has grown too large; split it into focused helper functions and call them from
generate to reduce complexity: extract prompt/env construction into a
buildGooseInvocation(opts) that returns {fullPrompt, env, args, executionCwd,
maxTurnsForAttempt}, move process execution and raw parsing into
runGooseProcess(fullPrompt, args, env, executionCwd, timeoutMs) which returns
{result, rawOutputParts, stderr, durationMs}, isolate normalization/taint logic
into normalizeAndAssess(rawOutputParts, rawOutput, output) that returns {output,
normalized, toolCallDetected, signalAssessment, transcriptOrInputReasons}, and
pull code-extraction/retry decision logic into decideAndPersistCode(output,
startTime, timeoutMs, promptWithoutMarker, isRetryAttempt) which uses
evaluateCodeOnlyOutput, write solution via solutionPath, and invokes
createGooseAdapter.generate() for retries; keep generate() as an orchestration
function that calls these helpers, handles workspace-mode early return, and
centralizes the try/catch translation logic (preserve existing symbols:
generate, normalizeGooseOutput, evaluateCodeOnlyOutput, createGooseAdapter).

src/harnesses/opencode-adapter.ts (1)

18-52: Split this module again before it grows further.

This file is back over the repo’s 500-line limit and now owns prompt construction, stream scoring/parsing selection, taint assembly, failure shaping, artifact IO, and retry orchestration. A couple of focused modules here would make the failure paths much easier to change safely.

As per coding guidelines, "Keep files under 500 lines; split aggressively by responsibility".

Also applies to: 56-251, 306-556

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/opencode-adapter.ts` around lines 18 - 52, This file has grown
past the 500-line limit and mixes responsibilities (prompt construction, stream
scoring/parsing selection, taint assembly, failure shaping, artifact IO, retry
orchestration); split it into focused modules by responsibility: extract prompt
builders (functions like buildToolPrompt, buildWorkspaceToolPrompt,
buildOpenCodeConfig/buildOpenCodeEnv) into a prompt module, move event parsing
and scoring (parseOpenCodeEvents, getOpenCodeRunFeatures, evaluateCodeOnlyOutput
decision points) into a streaming/parse module, move artifact I/O
(prepareOpenCodeArtifacts, cleanupOpenCodeArtifacts, readUsableOpenCodeSolution,
OpenCodeArtifacts types) into an artifacts module, and move retry/marker logic
(appendRetryMarker, hasRetryMarker, stripRetryMarker,
appendSignalAssessmentReasons, getOpenCodePermissionTaintReasons) and
orchestration (runOpenCodeCommand, buildOpenCodeRunArgs,
isOpenCodeRunCompatible) into a runner/orchestration module; update
imports/exports so harnesses/opencode-adapter.ts composes these smaller modules
(re-export or import by name) to keep it under 500 lines and make failure paths
easier to change.

src/runner/run-progress.ts (1)

61-71: Preserve plan.provenance via spread before overriding fields.

Rebuilding provenance key-by-key makes this helper brittle against additive RunProvenance changes and can let related fields drift if only one is intentionally reset. Prefer spreading the existing object first, then override the exact keys you want to normalize.

♻️ Suggested refactor

 		provenance: {
-			verificationStatus: "self_reported",
-			source: plan.provenance?.source ?? "local_cli",
-			...(plan.provenance?.submittedBy
-				? { submittedBy: plan.provenance.submittedBy }
-				: {}),
-			...(plan.provenance?.submittedAt
-				? { submittedAt: plan.provenance.submittedAt }
-				: {}),
-			...(plan.provenance?.notes ? { notes: plan.provenance.notes } : {}),
+			...(plan.provenance ?? {}),
+			verificationStatus: "self_reported",
+			source: plan.provenance?.source ?? "local_cli",
 		},

As per coding guidelines, "Maintain versioned schemas with schemaVersion field and use additive changes; migrations must be explicit".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/runner/run-progress.ts` around lines 61 - 71, The provenance object is
being rebuilt field-by-field which can drop future/additive fields; instead
spread the existing plan.provenance first and then override/normalize only the
fields you need (e.g., set verificationStatus to "self_reported" and source to
plan.provenance?.source ?? "local_cli"), so modify the code that constructs
provenance (the provenance object in run-progress.ts that references
plan.provenance) to start with ...plan.provenance and then set the explicit
overrides, removing the manual conditional re-adds for
submittedBy/submittedAt/notes.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/dashboard/src/components/layout/header.tsx`:
- Around line 1-6: The file header block at the top currently lists Purpose and
Exports but is missing the required Invariants section; update that top comment
block in apps/dashboard/src/components/layout/header.tsx to add an "Invariants:"
line or short bullet list describing the file-level guarantees (e.g., no
side-effects on import, Header component is pure/controlled by props, CSS
classes stable) so it follows the repo standard of documenting purpose, exports,
and invariants; ensure the new Invariants header is concise and placed with
Purpose and Exports in the existing comment block.

In `@llm/implementation/harnesses-implementation.md`:
- Around line 139-143: The Goose "Tool-Calling Mode (Goose)" guidance is
inconsistent: it claims the adapter uses generated per-item provider config (via
"goose run") but earlier CLI examples still show forcing provider with flags
like "--provider ollama --model <model>"; reconcile by choosing one
authoritative control path and updating both places accordingly — either remove
or mark as deprecated the CLI flags in the earlier command block/flag list and
state that "goose run" uses per-item provider config (and describe its
precedence over CLI flags), or explicitly document that CLI flags override
per-item config and show the correct precedence; update the "Tool-Calling Mode
(Goose)" section, the earlier command block, and any flag list to consistently
reference the chosen behavior and mention "goose run", "per-item provider
config", and the flags "--provider" / "--model" so readers can find and align
the text.

In `@src/cli/compare-command.ts`:
- Around line 34-40: RunCompareSchema currently uses z.string().min(1) which
accepts whitespace-only values; update the schema to trim and reject blank-only
strings by replacing z.string().min(1) with z.string().trim().min(1) for the
runA and runB fields and for options.output (and any other CLI string fields in
the same schema), so validation fails at the CLI boundary rather than later
during filesystem lookups.

In `@src/harnesses/harness.ts`:
- Around line 80-91: Update the TSDoc for the exported wrapper function
getCompatibleHarnesses to document the new exception contract introduced by
isHarnessCompatibleWithRuntime: add an `@throws` entry stating that it will throw
an Error for unsupported runtime strings (propagated from
isHarnessCompatibleWithRuntime) and ensure the docblock still documents purpose,
params, and returns per coding guidelines.

In `@src/harnesses/opencode-runner.ts`:
- Around line 156-195: forceKillProcess currently only signals the direct child
(via killOpenCodeProcess) so descendant processes spawned by OpenCode can
survive; update forceKillProcess (and/or killOpenCodeProcess) to enumerate and
terminate the entire process tree for the target pid before and after sending
SIGTERM/SIGKILL. Implement a cross-platform descendant-kill step: on Unix use
ps/pgrep or killpg (or a library like tree-kill / ps-tree) to gather child PIDs
and send signals to each (or the process group), and on Windows use taskkill
/PID /T /F or a dedicated library; ensure the descendant-kill runs both after
the initial SIGTERM wait and again after SIGKILL escalation, and log pid lists
and failures via the existing log (pino) so forceKillProcess reliably cleans up
all descendants.

In `@src/lib/model-profile/registry-compat.ts`:
- Around line 39-45: Replace the top-level wrapper from z.object(...) to
z.strictObject(...) so unknown top-level keys are rejected; update the
definition of LegacyCompatibleModelProfileFileSchema (which currently references
ModelProfileFileSchema.shape.schemaVersion and
LegacyCompatibleModelProfileRegistrySchema) to use z.strictObject({
schemaVersion: z.union([ModelProfileFileSchema.shape.schemaVersion,
z.literal("0.5.1")]), models: LegacyCompatibleModelProfileRegistrySchema }) so
the parser throws on misspelled/extra wrapper fields instead of silently
stripping them.

In `@src/runtimes/ollama-residency.ts`:
- Around line 2-5: The file header is missing the exported config interface
names; update the header exports comment to include OllamaResidencyBaseConfig,
UnloadOllamaModelConfig, and EnsureOnlyOllamaModelLoadedConfig and add a
one-line description for each (e.g., residency base options, unload options, and
ensure-only-loaded options) so the header documents these public interfaces
alongside LoadedOllamaModel, OllamaResidencyReport, listRunningOllamaModels,
unloadOllamaModel, and ensureOnlyOllamaModelLoaded; reference the exact symbols
OllamaResidencyBaseConfig, UnloadOllamaModelConfig, and
EnsureOnlyOllamaModelLoadedConfig when editing the header.

In `@src/schemas/index.ts`:
- Around line 84-103: The deprecated RuntimeName* aliases should continue to
represent supported execution runtimes, not artifact-only values; update the
three exports so they re-export the supported runtime symbols instead of the
artifact ones: export supportedRuntimeNames as runtimeNames, export
SupportedRuntimeNameSchema as RuntimeNameSchema, and export the
SupportedRuntimeName type as RuntimeName (or a clear union that still includes
supported names) so downstream config validation continues to fail-fast on
legacy artifact-only names like "vllm".

---

Outside diff comments:
In `@apps/dashboard/src/components/ui/section-heading.tsx`:
- Around line 1-4: Update the file header comment to include an "Invariants"
section beneath Purpose and Exports: add one or more short,
repo-convention-style invariant statements about the SectionHeading component
(e.g., that SectionHeading is a pure/presentational component with no side
effects, its public API/props expectations, and any stability guarantees like
fixed class names or layout behavior). Ensure the header still documents Purpose
and Exports and references the exported symbol SectionHeading so readers can
quickly see the file-level guarantees.

In `@src/runner/item-executor.ts`:
- Around line 298-307: The builder emits retryGenerationDurationMs into
scoringMetrics without the full retry metadata, violating ScoringMetricsSchema;
update the scoringMetrics construction (where scoringMetrics is built from
scoringOutcome) so that if scoringOutcome.retryGenerationDurationMs > 0 you also
thread and include scoringOutcome.retryAttempted, scoringOutcome.retryKind,
scoringOutcome.retryReason, and scoringOutcome.retryPromoted into the same retry
block, otherwise omit retryGenerationDurationMs entirely; ensure the produced
object matches ScoringMetricsSchema exactly (either full retry block present or
no retry fields).

---

Duplicate comments:
In `@src/lib/machine-profile/probe-utils.ts`:
- Around line 47-55: Add the missing `@throws` tags to the exported parser
helpers' JSDoc (e.g., parsePositiveInt and the other exported parsers in this
file) describing what exceptions can be raised; update the JSDoc block above
function parsePositiveInt to include an `@throws` line stating the conditions
under which an error is thrown (e.g., invalid format or out-of-range) and do the
same for the other exported parser functions so all exported functions include
purpose, `@param`, `@returns` and `@throws` per repo guidelines.

In `@test/schemas.test.ts`:
- Around line 231-239: The test is asserting that BenchConfigSchema currently
silently coerces deprecated runtime "vllm" to "ollama"; instead update the
schema so BenchConfigSchema.parse({ runtimes: ["vllm"] }) throws; do not remove
the existing migrateLegacySupportedRuntimeNames() helper (keep legacy mapping
there for explicit migration flows) but stop invoking or applying that migration
inside the BenchConfigSchema parsing/validation path (or change the schema's
runtime validator to treat "vllm" as invalid), and update any tests to expect an
exception from BenchConfigSchema.parse rather than accepting a coerced value.

---

Nitpick comments:
In `@src/cli/compare-formatters.ts`:
- Around line 223-239: Duplicate fixed-width table printing logic (header,
pad/truncate, row loop) used for regressions, improvements, and exclusive-items
should be consolidated into a single helper; create a function (e.g.,
renderFixedWidthTable or formatStatusChangeTable) that accepts the column width
spec (STATUS_CHANGE_COLUMN_WIDTHS), a header array
["MODEL","HARNESS","TEST","PASS"], and an items array (like
regressions/improvements/exclusiveItems) and uses existing utilities pad and
truncate to produce the header and rows; replace the three duplicated blocks
(current regressions block and the blocks at the ranges you noted: 258-275 and
339-377) with calls to this helper, ensuring you pass item field accessors for
model, harness, test, and passType so truncation/widths remain in sync.

In `@src/harnesses/goose-adapter.ts`:
- Around line 179-517: The generate() method has grown too large; split it into
focused helper functions and call them from generate to reduce complexity:
extract prompt/env construction into a buildGooseInvocation(opts) that returns
{fullPrompt, env, args, executionCwd, maxTurnsForAttempt}, move process
execution and raw parsing into runGooseProcess(fullPrompt, args, env,
executionCwd, timeoutMs) which returns {result, rawOutputParts, stderr,
durationMs}, isolate normalization/taint logic into
normalizeAndAssess(rawOutputParts, rawOutput, output) that returns {output,
normalized, toolCallDetected, signalAssessment, transcriptOrInputReasons}, and
pull code-extraction/retry decision logic into decideAndPersistCode(output,
startTime, timeoutMs, promptWithoutMarker, isRetryAttempt) which uses
evaluateCodeOnlyOutput, write solution via solutionPath, and invokes
createGooseAdapter.generate() for retries; keep generate() as an orchestration
function that calls these helpers, handles workspace-mode early return, and
centralizes the try/catch translation logic (preserve existing symbols:
generate, normalizeGooseOutput, evaluateCodeOnlyOutput, createGooseAdapter).

In `@src/harnesses/opencode-adapter.ts`:
- Around line 18-52: This file has grown past the 500-line limit and mixes
responsibilities (prompt construction, stream scoring/parsing selection, taint
assembly, failure shaping, artifact IO, retry orchestration); split it into
focused modules by responsibility: extract prompt builders (functions like
buildToolPrompt, buildWorkspaceToolPrompt, buildOpenCodeConfig/buildOpenCodeEnv)
into a prompt module, move event parsing and scoring (parseOpenCodeEvents,
getOpenCodeRunFeatures, evaluateCodeOnlyOutput decision points) into a
streaming/parse module, move artifact I/O (prepareOpenCodeArtifacts,
cleanupOpenCodeArtifacts, readUsableOpenCodeSolution, OpenCodeArtifacts types)
into an artifacts module, and move retry/marker logic (appendRetryMarker,
hasRetryMarker, stripRetryMarker, appendSignalAssessmentReasons,
getOpenCodePermissionTaintReasons) and orchestration (runOpenCodeCommand,
buildOpenCodeRunArgs, isOpenCodeRunCompatible) into a runner/orchestration
module; update imports/exports so harnesses/opencode-adapter.ts composes these
smaller modules (re-export or import by name) to keep it under 500 lines and
make failure paths easier to change.

In `@src/lib/machine-profile/instance-id.ts`:
- Around line 194-228: The function resolveMachineInstanceId accepts external
options and reads env variables but doesn't validate those boundary inputs with
Zod; add a Zod schema (e.g. ResolveMachineInstanceIdOptionsSchema) that
describes configuredInstanceId, legacyConfiguredInstanceId, env (record of
strings) and instanceIdFilePath, parse/validate the incoming options at the top
of resolveMachineInstanceId, then use the validated values (validated.env ??
process.env) throughout (including where readNonEmpty is called and when
resolving instanceIdFilePath via resolveDefaultInstanceIdPath) so only
typed/validated data is used inside the function.

In `@src/runner/run-progress.ts`:
- Around line 61-71: The provenance object is being rebuilt field-by-field which
can drop future/additive fields; instead spread the existing plan.provenance
first and then override/normalize only the fields you need (e.g., set
verificationStatus to "self_reported" and source to plan.provenance?.source ??
"local_cli"), so modify the code that constructs provenance (the provenance
object in run-progress.ts that references plan.provenance) to start with
...plan.provenance and then set the explicit overrides, removing the manual
conditional re-adds for submittedBy/submittedAt/notes.

In `@test/build-index.test.ts`:
- Around line 31-33: The mapping for REQUIRED_LIB_ASSETS currently uses
path.relative("src/lib", assetPath) which can yield ../... for unexpected
inputs; update the mapping over CORE_BENCHMARK_LIB_ASSETS so you first assert or
normalize that each assetPath starts with the invariant "src/lib/" (e.g.
assetPath.startsWith("src/lib/")), and then compute the relative path by slicing
off that prefix (assetPath.slice("src/lib/".length)); for assets that don't
match, either throw an explicit error or filter/log them so fixtures cannot
silently drift. Ensure you update the mapping where REQUIRED_LIB_ASSETS is
declared and remove the use of path.relative for this transformation.

In `@test/utils/fixtures.ts`:
- Around line 117-199: Replace the repeated numeric literal schemaVersion: 1
with a single shared constant (e.g., SCHEMA_VERSION or PREFLIGHT_SCHEMA_VERSION)
declared near the top of the fixtures module and used in each fixture object;
update the fixtures array entries (objects with slug values like "tool-smoke",
"workspace-tool-smoke", "file-search-smoke", etc.) to reference that constant
instead of the literal so future schema bumps require changing only the
constant.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: facc69af-a94e-44ff-a77c-abe3395f600c

📥 Commits

Reviewing files that changed from the base of the PR and between 56b63e7 and 7d83da4.

📒 Files selected for processing (159)

README.md
apps/dashboard/scripts/build-index.ts
apps/dashboard/src/components/about/about-content.ts
apps/dashboard/src/components/about/about-page.tsx
apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx
apps/dashboard/src/components/charts/chart-primitives.tsx
apps/dashboard/src/components/charts/composite-score-chart.tsx
apps/dashboard/src/components/charts/frontier-eval-scatter.tsx
apps/dashboard/src/components/charts/model-comparison-chart.tsx
apps/dashboard/src/components/charts/model-radar-chart.tsx
apps/dashboard/src/components/charts/model-test-heatmap.tsx
apps/dashboard/src/components/charts/test-difficulty-chart.tsx
apps/dashboard/src/components/charts/test-type-split-chart.tsx
apps/dashboard/src/components/charts/token-efficiency-chart.tsx
apps/dashboard/src/components/layout/header.tsx
apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx
apps/dashboard/src/components/run-list/run-list-page.tsx
apps/dashboard/src/components/ui/section-heading.tsx
apps/dashboard/src/index.css
apps/dashboard/src/lib/aggregations-charts.ts
apps/dashboard/src/lib/aggregations-compare.ts
apps/dashboard/src/lib/aggregations-diagnostics.ts
apps/dashboard/src/lib/aggregations-test-types.ts
apps/dashboard/src/lib/aggregations-tooling.ts
apps/dashboard/src/lib/aggregations.ts
apps/dashboard/src/lib/api.ts
apps/dashboard/src/lib/chart-colors.ts
apps/dashboard/src/lib/schemas.ts
apps/dashboard/src/lib/types.ts
llm/context/codebase-overview.md
llm/implementation/computer-use-hardening.md
llm/implementation/harnesses-implementation.md
llm/implementation/multi-runtime-mvp-implementation.md
llm/project/project-overview.md
llm/project/project-rules.md
llm/project/user-flow.md
package.json
src/cli/compare-command.ts
src/cli/compare-formatters.ts
src/cli/compare-utils.ts
src/cli/migrate-machine-command.ts
src/cli/run-command.ts
src/harnesses/direct-adapter.ts
src/harnesses/discovery.ts
src/harnesses/goose-adapter.ts
src/harnesses/goose-openai.ts
src/harnesses/goose-output.ts
src/harnesses/harness.ts
src/harnesses/opencode-adapter.ts
src/harnesses/opencode-artifacts.ts
src/harnesses/opencode-cli.ts
src/harnesses/opencode-config.ts
src/harnesses/opencode-events.ts
src/harnesses/opencode-model.ts
src/harnesses/opencode-output.ts
src/harnesses/opencode-permissions.ts
src/harnesses/opencode-process.ts
src/harnesses/opencode-provider.ts
src/harnesses/opencode-runner.ts
src/harnesses/tool-prompt.ts
src/lib/benchmark-checkpoint.ts
src/lib/failure-classifier.ts
src/lib/hardware-profile.ts
src/lib/logger.ts
src/lib/machine-profile/instance-id.ts
src/lib/machine-profile/legacy.ts
src/lib/machine-profile/normalization.ts
src/lib/machine-profile/probe-linux.ts
src/lib/machine-profile/probe-macos.ts
src/lib/machine-profile/probe-utils.ts
src/lib/machine-profile/probe-windows.ts
src/lib/model-aliases.ts
src/lib/model-profile/normalization.ts
src/lib/model-profile/registry-compat.ts
src/lib/model-profile/registry.ts
src/lib/openai-compat-client.ts
src/lib/signal-assessment.ts
src/lib/stats-format.ts
src/lib/test-catalog.ts
src/runner/index.ts
src/runner/item-executor.ts
src/runner/model-availability.ts
src/runner/model-eligibility.ts
src/runner/plan-builder.ts
src/runner/run-progress.ts
src/runtimes/discovery.ts
src/runtimes/index.ts
src/runtimes/ollama-residency.ts
src/runtimes/ollama-runtime.ts
src/runtimes/runtime.ts
src/runtimes/vllm-runtime.ts
src/schemas/common.schema.ts
src/schemas/config.schema.ts
src/schemas/index.ts
src/schemas/model-alias.schema.ts
src/schemas/model-profile.schema.ts
src/schemas/plan.schema.ts
src/schemas/result.schema.ts
src/schemas/test-catalog.schema.ts
src/tests/calculator-basic/test.meta.json
src/tests/calculator-stateful/test.meta.json
src/tests/event-emitter/test.meta.json
src/tests/rate-limiter/test.meta.json
src/tests/smoke/test.meta.json
src/tests/todo-app/test.meta.json
src/tests/ttl-cache/test.meta.json
src/tests/workspace-smoke/fixtures/logs/.gitkeep
src/tests/workspace-smoke/prompt.blind.md
src/tests/workspace-smoke/prompt.informed.md
src/tests/workspace-smoke/test.meta.json
test/aggregate.test.ts
test/benchmark-checkpoint.test.ts
test/build-index.test.ts
test/compare-checkpoint-guard.test.ts
test/compare-formatters.test.ts
test/dashboard-schema-compat.test.ts
test/dashboard-test-type.test.ts
test/failure-classifier.test.ts
test/generation-retry.test.ts
test/goose-adapter.test.ts
test/goose-output.test.ts
test/hardware-profile.test.ts
test/harness-compatibility.test.ts
test/item-executor.test.ts
test/item-retry.test.ts
test/machine-profile-migration.test.ts
test/model-alias-schema.test.ts
test/model-availability.test.ts
test/model-profile-registry.test.ts
test/ollama-adapter.test.ts
test/ollama-residency.test.ts
test/opencode-adapter.test.ts
test/opencode-artifacts.test.ts
test/opencode-cli-cache.test.ts
test/opencode-cli.test.ts
test/opencode-config.test.ts
test/opencode-model.test.ts
test/opencode-output.test.ts
test/opencode-runner-race.test.ts
test/opencode-runner.test.ts
test/plan-builder.test.ts
test/run-command.test.ts
test/runner-residency.test.ts
test/schema-regressions.test.ts
test/schemas.test.ts
test/signal-assessment.test.ts
test/stats-semantics.test.ts
test/stats.test.ts
test/test-catalog.test.ts
test/tool-prompt.test.ts
test/utils/fixtures.ts
test/vllm-runtime.test.ts
test/workspace-capability-parity.test.ts
test/workspace-prompt-parity.test.ts

💤 Files with no reviewable changes (10)

src/harnesses/opencode-model.ts
package.json
src/harnesses/goose-openai.ts
src/lib/openai-compat-client.ts
src/runtimes/discovery.ts
test/benchmark-checkpoint.test.ts
test/vllm-runtime.test.ts
src/harnesses/opencode-process.ts
src/harnesses/opencode-output.ts
src/runtimes/vllm-runtime.ts

AustinKelsay · 2026-04-25T18:32:14Z

@CodeRabbit full review

coderabbitai · 2026-04-25T18:32:18Z

✅ Actions performed

Full review triggered.

coderabbitai

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

apps/dashboard/src/components/layout/header.tsx (1)

12-16: ⚠️ Potential issue | 🟡 Minor

Add an explicit @throws contract for useLocation().

At Line 12-16, the exported Header docs still omit throws behavior. Since useLocation() requires router context, document that failure mode in the JSDoc/TSDoc block.
✏️ Proposed doc patch
 /**
  * Renders the persistent dashboard header navigation.
  *
  * `@returns` React element containing brand and primary navigation
+ * `@throws` {Error} If rendered outside a React Router provider context.
  */
 export function Header() {
As per coding guidelines: All exported functions must have TSDoc/JSDoc documentation including purpose, params, returns, and throws.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/components/layout/header.tsx` around lines 12 - 16, The
JSDoc for the exported Header component is missing an explicit `@throws` clause
for the useLocation() router dependency; update the documentation block above
the exported Header function to include a `@throws` (or `@throws` {Error}) entry
stating that useLocation() will throw when Header is rendered outside a Router
context (i.e., when no routing context is available), and give a brief
description of the failure mode so callers know to wrap Header in a Router
provider.

src/lib/machine-profile/probe-utils.ts (2)

47-59: ⚠️ Potential issue | 🟠 Major

parsePositiveInt still accepts malformed probe output.

The new contract says invalid or unparsable input returns undefined, but Number.parseInt() will happily accept prefixes like "16 cores" and return 16. That can silently turn noisy probe output into valid-looking hardware data.

Suggested fix

 export function parsePositiveInt(
 	value: string | undefined,
 ): number | undefined {
 	if (!value) return undefined;
-	const parsed = Number.parseInt(value.trim(), 10);
+	const trimmed = value.trim();
+	if (!/^\d+$/.test(trimmed)) return undefined;
+	const parsed = Number.parseInt(trimmed, 10);
 	return Number.isInteger(parsed) && parsed > 0 ? parsed : undefined;
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/machine-profile/probe-utils.ts` around lines 47 - 59, The
parsePositiveInt function currently uses Number.parseInt which accepts strings
like "16 cores"; update parsePositiveInt to validate the trimmed input matches a
strict positive-integer pattern (e.g. /^\d+$/) before parsing so only purely
numeric probe output is accepted; if the validation fails return undefined,
otherwise convert to Number (or parseInt) and return the positive integer as
before, keeping the function non-throwing.

62-80: ⚠️ Potential issue | 🟠 Major

Anchor the memory parser to the whole string.

normalized.match(...) currently succeeds on substrings, so inputs like "speed 6400 MT/s, size 16 GB" can parse as a partial match instead of failing. That silently corrupts machine-profile data and contradicts the updated docstring.

Suggested fix

 	const normalized = value.trim().replace(/,/g, "");
 	const match = normalized.match(
-		/(\d+(?:\.\d+)?)\s*(tib|gib|mib|kib|tb|gb|mb|kb|b)?/i,
+		/^(\d+(?:\.\d+)?)\s*(tib|gib|mib|kib|tb|gb|mb|kb|b)?$/i,
 	);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/machine-profile/probe-utils.ts` around lines 62 - 80, The regex in
parseMemoryBytes currently allows partial matches (so strings like "speed 6400
MT/s, size 16 GB" parse) — update the match to require the entire normalized
string to match by anchoring the pattern (use ^ and $ around the existing token
pattern) so only full-string memory formats are accepted; keep the existing
normalization (trim/commas), capture amount and unit the same way, and preserve
the subsequent logic that converts units and validates amount.

src/runner/item-retry.ts (1)

307-313: ⚠️ Potential issue | 🟠 Major

Promote missing_export fixes the same way as import fixes.

The tie-breaker only treats import as a promotable compile failure. If the original result is missing_export and the retry clears that compile issue without increasing the pass count, this keeps the worse original attempt and drops the improved retry output.

Proposed fix

 			const shouldPromoteRetry =
 				retryAttempt.scoringResult.passed > previousPassed ||
 				(retryAttempt.scoringResult.passed === previousPassed &&
-					scoringResult.failureType === "import" &&
-					retryAttempt.scoringResult.failureType !== "import");
+					(scoringResult.failureType === "import" ||
+						scoringResult.failureType === "missing_export") &&
+					retryAttempt.scoringResult.failureType !== "import" &&
+					retryAttempt.scoringResult.failureType !== "missing_export");

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/runner/item-retry.ts` around lines 307 - 313, The tie-breaker only treats
failureType "import" as promotable; update the shouldPromoteRetry logic in
item-retry.ts (where scoringResult, retryAttempt, and shouldPromoteRetry are
used) so that a retry which changes failureType from "missing_export" to any
non-"missing_export" (or clears it) is promoted the same as "import" fixes—i.e.,
include "missing_export" alongside "import" in the equality branch check that
currently tests scoringResult.failureType === "import" and
retryAttempt.scoringResult.failureType !== "import".

♻️ Duplicate comments (2)

apps/dashboard/src/components/about/about-page.tsx (1)
91-94: ⚠️ Potential issue | 🟡 Minor

Unify axis terminology in this sentence with the rest of the About content.

Line 91-Line 94 uses passType; if other About sections still present this axis as “prompt mode,” readers get two names for one dimension. Consider passType (prompt mode) to keep terms aligned.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/components/about/about-page.tsx` around lines 91 - 94, The
sentence currently uses the term "passType" which conflicts with other About
copy that calls this axis "prompt mode"; update the JSX string in about-page.tsx
(the paragraph starting "Each run expands a matrix of runtime × harness × model
× test × passType...") to use "passType (prompt mode)" so readers see both names
consistently and avoid ambiguity.
llm/context/codebase-overview.md (1)
170-175: ⚠️ Potential issue | 🟡 Minor

The overview’s schema version may already be stale again.

This section says 0.5.2, but the rest of the PR context points to a newer shared SCHEMA_VERSION. Please verify against the source-of-truth constant and update this block in the same pass so readers do not validate artifacts against the wrong version.
#!/bin/bash
# Verify the documented schema version against the source-of-truth constant.
fd -a 'common\.schema\.ts|index\.ts' src/schemas | while read -r file; do
  printf '\n== %s ==\n' "$file"
  rg -n 'SCHEMA_VERSION' "$file"
done

printf '\n== llm/context/codebase-overview.md ==\n'
rg -n 'Schema version:|SupportedRuntimeNameSchema|ArtifactRuntimeNameSchema' llm/context/codebase-overview.md
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@llm/context/codebase-overview.md` around lines 170 - 175, The documented
"Schema version: 0.5.2" in llm/context/codebase-overview.md is stale—open the
source-of-truth constant SCHEMA_VERSION in common.schema.ts (or index.ts if
defined there), copy its current value, and replace the hardcoded version in the
overview; also scan for and update any other occurrences referencing
SupportedRuntimeNameSchema or ArtifactRuntimeNameSchema to ensure they align
with the same SCHEMA_VERSION, then run the provided verification commands (or a
quick ripgrep) to confirm all references match before committing.

🧹 Nitpick comments (6)

apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx (1)

63-99: Consider varying the accent colors for better visual distinction.

Three cards (Matched Runs, Profiles, Pass Rate) currently use the same --success color for their left border accent. While this may be intentional for semantic grouping, it reduces visual distinction across the six-card grid. Consider using different theme colors to help users quickly scan and differentiate the KPI cards.

🎨 Example: Vary accent colors

 {
   title: "Matched Runs",
   value: String(aggregate?.summary.runsMatched ?? 0),
   sub: null,
-  color: "hsl(var(--success))",
+  color: "hsl(var(--chart-1))",
 },
 {
   title: "Profiles",
   value: String(aggregate?.summary.machines ?? 0),
   sub:
     aggregate?.summary.instances !== undefined
       ? `${aggregate.summary.instances} ${
           aggregate.summary.instances === 1 ? "instance" : "instances"
         }`
       : null,
-  color: "hsl(var(--success))",
+  color: "hsl(var(--chart-2))",
 },

Or keep semantic colors but introduce more variety from your theme palette.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx`
around lines 63 - 99, The three KPI card objects with title "Matched Runs",
"Profiles", and "Pass Rate" all set color: "hsl(var(--success))" which reduces
visual distinction; update the color property on those specific card objects in
the cards array so each uses a different theme token (e.g.,
"hsl(var(--accent))", "hsl(var(--primary))", "hsl(var(--warning))" or similar
design-system variables) while keeping other cards unchanged, or alternatively
retain --success for one and pick complementary tokens for the other two to
preserve semantic grouping but increase scanability.

apps/dashboard/src/components/run-list/run-list-page.tsx (1)

138-165: Decouple stat accent color from index to prevent silent mismatches.

This currently relies on positional coupling between the stat array and STAT_ACCENTS. Attaching accent directly to each stat object is safer and easier to maintain.

♻️ Optional refactor

-								{[
+								{[
 									{
 										label: "Live Runs",
 										value: formatNumber(liveGroup?.runCount ?? 0),
 										large: true,
+										accent: STAT_ACCENTS[0],
 									},
 									{
 										label: "Profiles",
 										value: formatNumber(liveGroup?.machineCount ?? 0),
 										large: true,
+										accent: STAT_ACCENTS[1],
 									},
 									{
 										label: "Instances",
 										value: formatNumber(liveGroup?.instanceCount ?? 0),
 										large: true,
+										accent: STAT_ACCENTS[2],
 									},
 									{
 										label: "Last Published",
 										value: liveGroup
 											? formatDate(liveGroup.latestRunAt)
 											: "n/a",
 										large: false,
+										accent: STAT_ACCENTS[3],
 									},
 								].map((stat, i) => (
 									<div
 										key={stat.label}
 										className={`rounded border border-border border-l-2 bg-background p-4 animate-fade-slide-up animate-stagger-${i + 1}`}
-										style={{ borderLeftColor: STAT_ACCENTS[i] }}
+										style={{ borderLeftColor: stat.accent }}
 									>

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/components/run-list/run-list-page.tsx` around lines 138 -
165, The stat cards are brittle because they assume positional coupling with
STAT_ACCENTS; update the stat objects created in the array (the ones containing
label/value/large) to include an explicit accent property (e.g., accent:
someColor or accent: STAT_ACCENTS[0] as an initial value) and then change the
render to use stat.accent for the inline style (style={{ borderLeftColor:
stat.accent ?? STAT_ACCENTS[i] }}). Modify the objects built from
liveGroup/formatNumber/formatDate to include accent for each item so future
reordering or additions won’t silently mismatch with STAT_ACCENTS; keep the
existing fallback to STAT_ACCENTS[i] to preserve current visuals during the
transition.

src/cli/compare-formatters.ts (1)

68-72: Optionally truncate passType for strict fixed-width guarantees.

For consistency with other columns, truncating passType before padding would prevent future overflow if longer pass labels are introduced.

♻️ Optional consistency patch

-				`${pad(item.passType, columnWidths.pass)}`,
+				`${pad(truncate(item.passType, columnWidths.pass), columnWidths.pass)}`,

-				`${pad(item.passType, passW)} ` +
+				`${pad(truncate(item.passType, passW), passW)} ` +

Also applies to: 307-311

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/cli/compare-formatters.ts` around lines 68 - 72, The passType field is
padded without being truncated which can break fixed-width layout if pass labels
grow; update the formatter to truncate item.passType using
truncate(item.passType, columnWidths.pass) before calling pad (same pattern used
for model/harness/test), and apply the same change in the other occurrence
around lines 307-311 to ensure consistent fixed-width behavior; look for usages
of pad(..., columnWidths.pass) and replace the value argument with
truncate(item.passType, columnWidths.pass).

apps/dashboard/src/lib/aggregations-charts.ts (1)

164-170: Avoid positional coupling between axes and values.

These arrays have to stay in lockstep, so inserting or reordering an axis can silently relabel the radar metrics. A keyed mapping would make the relationship explicit and safer to extend.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/lib/aggregations-charts.ts` around lines 164 - 170, The
current positional coupling between the axes array and the values array
(variables named axes and values in aggregations-charts.ts) is fragile—replace
the parallel arrays with a single keyed mapping (e.g., an array of objects or a
Record) that pairs each axis name/key with its numeric value (for example
entries like { key: 'passRate', label: 'Pass Rate', value: ... }), then update
any downstream code that iterates axes or values to read value by that
key/property (instead of relying on matching indices) so adding/reordering axes
won't mislabel metrics.

test/workspace-prompt-parity.test.ts (1)

50-60: Split the step-2 and step-3 contract assertions.

Bundling both blocks into one toContain makes failures less actionable and also requires the two clauses to stay adjacent. Separate expectations would keep the same coverage with better failure output.

♻️ Suggested cleanup

-		expect(prompt).toContain(
-			[
-				"2. Overwrite `checklist/steps.txt` so its exact contents are the three lines below, ending with a trailing newline:",
-				"   ```text",
-				"   bootstrap",
-				"   verify-inputs",
-				"   archive-results",
-				"   ```",
-				"3. Create `artifacts/summary.json` so its exact contents are the JSON below, with no additional whitespace or fields:",
-				'   `{"status":"ready","createdBy":"workspace-smoke","steps":3}`',
-			].join("\n"),
-		);
+		expect(prompt).toContain(
+			[
+				"2. Overwrite `checklist/steps.txt` so its exact contents are the three lines below, ending with a trailing newline:",
+				"   ```text",
+				"   bootstrap",
+				"   verify-inputs",
+				"   archive-results",
+				"   ```",
+			].join("\n"),
+		);
+		expect(prompt).toContain(
+			[
+				"3. Create `artifacts/summary.json` so its exact contents are the JSON below, with no additional whitespace or fields:",
+				'   `{"status":"ready","createdBy":"workspace-smoke","steps":3}`',
+			].join("\n"),
+		);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/workspace-prompt-parity.test.ts` around lines 50 - 60, The assertion
currently bundles step 2 and step 3 into one expect(prompt).toContain(...) which
makes failures hard to diagnose and requires the blocks to remain adjacent;
split this into two expectations: keep the first expect(prompt).toContain(...)
checking the multiline block for step 2 (the checklist/steps.txt lines and the
```text fence) and add a separate expect(prompt).toContain(...) for step 3
checking the artifacts/summary.json JSON line; locate the test that references
the prompt variable in workspace-prompt-parity.test.ts and replace the single
combined toContain call with these two separate toContain calls referencing the
same string fragments as in the diff.

test/model-availability.test.ts (1)

29-34: Prefer a typed fixture over the as Parameters<...>[1] cast.

That cast suppresses config-shape drift, so this test can keep compiling after listAvailableModelsByRuntime() starts requiring new fields. A shared fixture typed with satisfies keeps the mock-based test honest.

Also applies to: 42-47
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/model-availability.test.ts` around lines 29 - 34, Replace the ad-hoc
cast with a properly typed fixture: create a const (e.g., named configFixture)
for the options object and declare it using TypeScript's satisfies operator with
Parameters<typeof listAvailableModelsByRuntime>[1], then pass that fixture into
listAvailableModelsByRuntime in the test (and the other occurrence at lines
42-47); this ensures the test will fail to compile if the function's config
shape changes while avoiding the unsafe as ... cast.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/dashboard/src/components/charts/model-comparison-chart.tsx`:
- Around line 102-110: The effect currently references parsedProps.error in the
dependency array without narrowing the Zod safeParse discriminated union;
extract the error message with proper narrowing before the useEffect (e.g.
compute const validationErrorMessage = parsedProps.success ? undefined :
parsedProps.error.message) and then update the useEffect to depend on
parsedProps.success and validationErrorMessage, using validationErrorMessage
inside the effect to compare/mutate lastLoggedValidationError.current and log.
This keeps the discriminated union safe and satisfies strict type checking for
the useEffect dependencies referencing parsedProps.error.

In `@apps/dashboard/src/components/run-list/run-list-page.tsx`:
- Around line 302-305: The dynamic template literal animate-stagger-${(i % 8) +
1} won’t be picked up by Tailwind’s scanner, so either add a safelist for
animate-stagger-1..animate-stagger-8 and the corresponding animation keyframes
in the Tailwind config, or replace the dynamic expression in the RunListPage
component (the div using className={`animate-fade-slide-up animate-stagger-${(i
% 8) + 1}`}) with a deterministic mapping to static class names (e.g., const
staggerClass = ['animate-stagger-1',…,'animate-stagger-8'][i % 8];
className={`animate-fade-slide-up ${staggerClass}`}) so Tailwind can generate
the CSS; update the same pattern in other components (about-page,
leaderboard-summary-cards) to use the safelist or static class mapping.

In `@src/harnesses/goose-utils.ts`:
- Around line 21-35: normalizeTurnLimit currently returns the fallback without
validation when value is undefined, which can allow invalid defaults (0, NaN,
non-integers) through; update normalizeTurnLimit to validate the fallback the
same way it validates value: when value === undefined, verify fallback is a
number, an integer, and >= 1 (and not NaN) and throw a TypeError including
paramName if the fallback is invalid, otherwise return the validated fallback;
keep the existing error message format and function signature.

In `@src/lib/machine-profile/instance-id.ts`:
- Around line 183-188: The schema ResolveMachineInstanceIdOptionsSchema
currently uses z.record(z.string().optional()) — update this to the explicit
two-argument form z.record(z.string(), z.string().optional()) for consistency
with the codebase (see pattern used in workspace-manifest.ts) and, while
editing, evaluate whether env values should be required (z.string()) instead of
optional to reflect typical Record<string, string> environment shapes; adjust
the second argument accordingly if you decide env values must be present.

In `@src/lib/stats-format.ts`:
- Around line 88-96: The code currently computes scored_row_failure as failed -
generationFailureCount and labels the residual bucket `scored_row_failure`,
which is incorrect; change the logic in the block referencing
generationFailureCount, failed and stats.generationFailures so that you either
(A) read and display an explicit scored-row failure counter (e.g., use
stats.scoredRowFailures or the canonical field that tracks semantic/scored
failures) and print that value, or (B) if no explicit counter exists, rename the
residual label to something like `other_failures`/`residual_failures` and
document it accordingly; update the variable scoredRowFailures and the pushed
line to use the explicit field or the new residual label so the output is not
misrepresenting non-scored failure types.

In `@src/schemas/common.schema.ts`:
- Around line 15-51: The config migration function
migrateLegacySupportedRuntimeNames is never applied, so legacy runtime names
like "vllm" aren't converted before validation; update migrateBenchConfigAliases
to detect when config.runtimes is an array and assign config.runtimes =
migrateLegacySupportedRuntimeNames(config.runtimes) prior to schema validation
(this ensures values validated against SupportedRuntimeNameSchema are migrated);
modify the function migrateBenchConfigAliases to shallow-clone the raw object,
perform existing migrations, then add this runtimes migration step and return
the updated config.

In `@src/schemas/config.schema.ts`:
- Around line 125-129: The current zod schema for vllmBaseUrl uses
z.preprocess(..., z.string().optional()) so non-string inputs (null, number,
object) fail parsing and never reach the config-level superRefine check that
should emit the deprecation; change the schema to accept any type (e.g.,
z.preprocess(..., z.unknown().optional())) so all input shapes pass through
parsing and the superRefine() deprecation logic that checks config.vllmBaseUrl
will reliably run for non-undefined values; update the vllmBaseUrl preprocess
accordingly and verify superRefine still uses the same config.vllmBaseUrl check.

In `@test/run-command.test.ts`:
- Around line 45-63: The test currently only asserts process.exit and generic
error logging; update the assertions in the test for runCommand to verify the
conflict-specific diagnostic is reported: after invoking runCommand.parseAsync
(the failing path in runCommand) assert that mocks.loggerError (or exitSpy if
used) was called with an Error whose message or logged context explicitly
mentions the '--ollama-url' / '--vllm-url' conflict (e.g., expect.any(Error)
whose message includes the conflict text) and/or that the logged message string
includes the conflict phrase, while still asserting exitSpy called with 1 and
runBenchmark not called; locate references to runCommand, runBenchmark, exitSpy,
and mocks.loggerError in the test to implement this more specific check.

---

Outside diff comments:
In `@apps/dashboard/src/components/layout/header.tsx`:
- Around line 12-16: The JSDoc for the exported Header component is missing an
explicit `@throws` clause for the useLocation() router dependency; update the
documentation block above the exported Header function to include a `@throws` (or
`@throws` {Error}) entry stating that useLocation() will throw when Header is
rendered outside a Router context (i.e., when no routing context is available),
and give a brief description of the failure mode so callers know to wrap Header
in a Router provider.

In `@src/lib/machine-profile/probe-utils.ts`:
- Around line 47-59: The parsePositiveInt function currently uses
Number.parseInt which accepts strings like "16 cores"; update parsePositiveInt
to validate the trimmed input matches a strict positive-integer pattern (e.g.
/^\d+$/) before parsing so only purely numeric probe output is accepted; if the
validation fails return undefined, otherwise convert to Number (or parseInt) and
return the positive integer as before, keeping the function non-throwing.
- Around line 62-80: The regex in parseMemoryBytes currently allows partial
matches (so strings like "speed 6400 MT/s, size 16 GB" parse) — update the match
to require the entire normalized string to match by anchoring the pattern (use ^
and $ around the existing token pattern) so only full-string memory formats are
accepted; keep the existing normalization (trim/commas), capture amount and unit
the same way, and preserve the subsequent logic that converts units and
validates amount.

In `@src/runner/item-retry.ts`:
- Around line 307-313: The tie-breaker only treats failureType "import" as
promotable; update the shouldPromoteRetry logic in item-retry.ts (where
scoringResult, retryAttempt, and shouldPromoteRetry are used) so that a retry
which changes failureType from "missing_export" to any non-"missing_export" (or
clears it) is promoted the same as "import" fixes—i.e., include "missing_export"
alongside "import" in the equality branch check that currently tests
scoringResult.failureType === "import" and
retryAttempt.scoringResult.failureType !== "import".

---

Duplicate comments:
In `@apps/dashboard/src/components/about/about-page.tsx`:
- Around line 91-94: The sentence currently uses the term "passType" which
conflicts with other About copy that calls this axis "prompt mode"; update the
JSX string in about-page.tsx (the paragraph starting "Each run expands a matrix
of runtime × harness × model × test × passType...") to use "passType (prompt
mode)" so readers see both names consistently and avoid ambiguity.

In `@llm/context/codebase-overview.md`:
- Around line 170-175: The documented "Schema version: 0.5.2" in
llm/context/codebase-overview.md is stale—open the source-of-truth constant
SCHEMA_VERSION in common.schema.ts (or index.ts if defined there), copy its
current value, and replace the hardcoded version in the overview; also scan for
and update any other occurrences referencing SupportedRuntimeNameSchema or
ArtifactRuntimeNameSchema to ensure they align with the same SCHEMA_VERSION,
then run the provided verification commands (or a quick ripgrep) to confirm all
references match before committing.

---

Nitpick comments:
In `@apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx`:
- Around line 63-99: The three KPI card objects with title "Matched Runs",
"Profiles", and "Pass Rate" all set color: "hsl(var(--success))" which reduces
visual distinction; update the color property on those specific card objects in
the cards array so each uses a different theme token (e.g.,
"hsl(var(--accent))", "hsl(var(--primary))", "hsl(var(--warning))" or similar
design-system variables) while keeping other cards unchanged, or alternatively
retain --success for one and pick complementary tokens for the other two to
preserve semantic grouping but increase scanability.

In `@apps/dashboard/src/components/run-list/run-list-page.tsx`:
- Around line 138-165: The stat cards are brittle because they assume positional
coupling with STAT_ACCENTS; update the stat objects created in the array (the
ones containing label/value/large) to include an explicit accent property (e.g.,
accent: someColor or accent: STAT_ACCENTS[0] as an initial value) and then
change the render to use stat.accent for the inline style (style={{
borderLeftColor: stat.accent ?? STAT_ACCENTS[i] }}). Modify the objects built
from liveGroup/formatNumber/formatDate to include accent for each item so future
reordering or additions won’t silently mismatch with STAT_ACCENTS; keep the
existing fallback to STAT_ACCENTS[i] to preserve current visuals during the
transition.

In `@apps/dashboard/src/lib/aggregations-charts.ts`:
- Around line 164-170: The current positional coupling between the axes array
and the values array (variables named axes and values in aggregations-charts.ts)
is fragile—replace the parallel arrays with a single keyed mapping (e.g., an
array of objects or a Record) that pairs each axis name/key with its numeric
value (for example entries like { key: 'passRate', label: 'Pass Rate', value:
... }), then update any downstream code that iterates axes or values to read
value by that key/property (instead of relying on matching indices) so
adding/reordering axes won't mislabel metrics.

In `@src/cli/compare-formatters.ts`:
- Around line 68-72: The passType field is padded without being truncated which
can break fixed-width layout if pass labels grow; update the formatter to
truncate item.passType using truncate(item.passType, columnWidths.pass) before
calling pad (same pattern used for model/harness/test), and apply the same
change in the other occurrence around lines 307-311 to ensure consistent
fixed-width behavior; look for usages of pad(..., columnWidths.pass) and replace
the value argument with truncate(item.passType, columnWidths.pass).

In `@test/model-availability.test.ts`:
- Around line 29-34: Replace the ad-hoc cast with a properly typed fixture:
create a const (e.g., named configFixture) for the options object and declare it
using TypeScript's satisfies operator with Parameters<typeof
listAvailableModelsByRuntime>[1], then pass that fixture into
listAvailableModelsByRuntime in the test (and the other occurrence at lines
42-47); this ensures the test will fail to compile if the function's config
shape changes while avoiding the unsafe as ... cast.

In `@test/workspace-prompt-parity.test.ts`:
- Around line 50-60: The assertion currently bundles step 2 and step 3 into one
expect(prompt).toContain(...) which makes failures hard to diagnose and requires
the blocks to remain adjacent; split this into two expectations: keep the first
expect(prompt).toContain(...) checking the multiline block for step 2 (the
checklist/steps.txt lines and the ```text fence) and add a separate
expect(prompt).toContain(...) for step 3 checking the artifacts/summary.json
JSON line; locate the test that references the prompt variable in
workspace-prompt-parity.test.ts and replace the single combined toContain call
with these two separate toContain calls referencing the same string fragments as
in the diff.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 72c2cf1a-d56b-4a5c-9609-66f1b850fec7

📥 Commits

Reviewing files that changed from the base of the PR and between 56b63e7 and 0cb6fdb.

📒 Files selected for processing (162)

README.md
apps/dashboard/scripts/build-index.ts
apps/dashboard/src/components/about/about-content.ts
apps/dashboard/src/components/about/about-page.tsx
apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx
apps/dashboard/src/components/charts/chart-primitives.tsx
apps/dashboard/src/components/charts/composite-score-chart.tsx
apps/dashboard/src/components/charts/frontier-eval-scatter.tsx
apps/dashboard/src/components/charts/model-comparison-chart.tsx
apps/dashboard/src/components/charts/model-radar-chart.tsx
apps/dashboard/src/components/charts/model-test-heatmap.tsx
apps/dashboard/src/components/charts/test-difficulty-chart.tsx
apps/dashboard/src/components/charts/test-type-split-chart.tsx
apps/dashboard/src/components/charts/token-efficiency-chart.tsx
apps/dashboard/src/components/layout/header.tsx
apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx
apps/dashboard/src/components/run-list/run-list-page.tsx
apps/dashboard/src/components/ui/section-heading.tsx
apps/dashboard/src/index.css
apps/dashboard/src/lib/aggregations-charts.ts
apps/dashboard/src/lib/aggregations-compare.ts
apps/dashboard/src/lib/aggregations-diagnostics.ts
apps/dashboard/src/lib/aggregations-test-types.ts
apps/dashboard/src/lib/aggregations-tooling.ts
apps/dashboard/src/lib/aggregations.ts
apps/dashboard/src/lib/api.ts
apps/dashboard/src/lib/chart-colors.ts
apps/dashboard/src/lib/schemas.ts
apps/dashboard/src/lib/types.ts
llm/context/codebase-overview.md
llm/implementation/computer-use-hardening.md
llm/implementation/harnesses-implementation.md
llm/implementation/multi-runtime-mvp-implementation.md
llm/project/project-overview.md
llm/project/project-rules.md
llm/project/user-flow.md
package.json
src/cli/compare-command.ts
src/cli/compare-formatters.ts
src/cli/compare-utils.ts
src/cli/migrate-machine-command.ts
src/cli/run-command.ts
src/harnesses/direct-adapter.ts
src/harnesses/discovery.ts
src/harnesses/goose-adapter.ts
src/harnesses/goose-openai.ts
src/harnesses/goose-output.ts
src/harnesses/goose-utils.ts
src/harnesses/harness.ts
src/harnesses/opencode-adapter.ts
src/harnesses/opencode-artifacts.ts
src/harnesses/opencode-cli.ts
src/harnesses/opencode-config.ts
src/harnesses/opencode-events.ts
src/harnesses/opencode-model.ts
src/harnesses/opencode-output.ts
src/harnesses/opencode-permissions.ts
src/harnesses/opencode-process.ts
src/harnesses/opencode-provider.ts
src/harnesses/opencode-runner.ts
src/harnesses/opencode-signal.ts
src/harnesses/tool-prompt.ts
src/lib/benchmark-checkpoint.ts
src/lib/failure-classifier.ts
src/lib/hardware-profile.ts
src/lib/logger.ts
src/lib/machine-profile/instance-id.ts
src/lib/machine-profile/legacy.ts
src/lib/machine-profile/normalization.ts
src/lib/machine-profile/probe-linux.ts
src/lib/machine-profile/probe-macos.ts
src/lib/machine-profile/probe-utils.ts
src/lib/machine-profile/probe-windows.ts
src/lib/model-aliases.ts
src/lib/model-profile/normalization.ts
src/lib/model-profile/registry-compat.ts
src/lib/model-profile/registry.ts
src/lib/openai-compat-client.ts
src/lib/signal-assessment.ts
src/lib/stats-format.ts
src/lib/test-catalog.ts
src/runner/index.ts
src/runner/item-executor.ts
src/runner/item-retry.ts
src/runner/model-availability.ts
src/runner/model-eligibility.ts
src/runner/plan-builder.ts
src/runner/run-progress.ts
src/runtimes/discovery.ts
src/runtimes/index.ts
src/runtimes/ollama-residency.ts
src/runtimes/ollama-runtime.ts
src/runtimes/runtime.ts
src/runtimes/vllm-runtime.ts
src/schemas/common.schema.ts
src/schemas/config.schema.ts
src/schemas/index.ts
src/schemas/model-alias.schema.ts
src/schemas/model-profile.schema.ts
src/schemas/plan.schema.ts
src/schemas/result.schema.ts
src/schemas/test-catalog.schema.ts
src/tests/calculator-basic/test.meta.json
src/tests/calculator-stateful/test.meta.json
src/tests/event-emitter/test.meta.json
src/tests/rate-limiter/test.meta.json
src/tests/smoke/test.meta.json
src/tests/todo-app/test.meta.json
src/tests/ttl-cache/test.meta.json
src/tests/workspace-smoke/fixtures/logs/.gitkeep
src/tests/workspace-smoke/prompt.blind.md
src/tests/workspace-smoke/prompt.informed.md
src/tests/workspace-smoke/test.meta.json
test/aggregate.test.ts
test/benchmark-checkpoint.test.ts
test/build-index.test.ts
test/compare-checkpoint-guard.test.ts
test/compare-formatters.test.ts
test/dashboard-schema-compat.test.ts
test/dashboard-test-type.test.ts
test/failure-classifier.test.ts
test/generation-retry.test.ts
test/goose-adapter.test.ts
test/goose-output.test.ts
test/hardware-profile.test.ts
test/harness-compatibility.test.ts
test/item-executor.test.ts
test/item-retry.test.ts
test/machine-profile-migration.test.ts
test/model-alias-schema.test.ts
test/model-availability.test.ts
test/model-profile-registry.test.ts
test/ollama-adapter.test.ts
test/ollama-residency.test.ts
test/opencode-adapter.test.ts
test/opencode-artifacts.test.ts
test/opencode-cli-cache.test.ts
test/opencode-cli.test.ts
test/opencode-config.test.ts
test/opencode-model.test.ts
test/opencode-output.test.ts
test/opencode-runner-race.test.ts
test/opencode-runner.test.ts
test/plan-builder.test.ts
test/run-command.test.ts
test/runner-residency.test.ts
test/schema-regressions.test.ts
test/schemas.test.ts
test/signal-assessment.test.ts
test/stats-semantics.test.ts
test/stats.test.ts
test/test-catalog.test.ts
test/tool-prompt.test.ts
test/utils/fixtures.ts
test/vllm-runtime.test.ts
test/workspace-capability-parity.test.ts
test/workspace-prompt-parity.test.ts

💤 Files with no reviewable changes (10)

src/harnesses/goose-openai.ts
src/harnesses/opencode-model.ts
test/benchmark-checkpoint.test.ts
src/lib/openai-compat-client.ts
src/harnesses/opencode-output.ts
src/runtimes/discovery.ts
src/harnesses/opencode-process.ts
src/runtimes/vllm-runtime.ts
test/vllm-runtime.test.ts
package.json

AustinKelsay · 2026-04-25T19:06:53Z

@CodeRabbit full review

coderabbitai · 2026-04-25T19:06:58Z

✅ Actions performed

Full review triggered.

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (3)

src/schemas/plan.schema.ts (1)

40-64: ⚠️ Potential issue | 🟠 Major

Use active runtime schema for executable run plans.

RunPlanSchema is an execution boundary, but these fields currently accept ArtifactRuntimeNameSchema, which can admit legacy artifact-only runtime names and defer failures downstream instead of failing at parse time.
♻️ Proposed fix
 import {
-	ArtifactRuntimeNameSchema,
 	BenchmarkCheckpointSchema,
 	HarnessCapabilitySchema,
 	MachineProfileSchema,
 	PassTypeSchema,
 	RunProvenanceSchema,
+	RuntimeNameSchema,
 	RuntimeEnvironmentSchema,
 	SCHEMA_VERSION,
 	TestCategorySchema,
 	TestScoringModeSchema,
 } from "./common.schema.js";
@@
 export const ModelExclusionSchema = z.object({
@@
-	runtime: ArtifactRuntimeNameSchema,
+	runtime: RuntimeNameSchema,
@@
 export const MatrixItemSchema = z.object({
@@
-	runtime: ArtifactRuntimeNameSchema,
+	runtime: RuntimeNameSchema,
As per coding guidelines, "Use Zod for schema validation at all boundaries (CLI args, env, JSON files, HTTP responses)", "Throw on invalid config, invalid schemas, and programmer mistakes; avoid silent fallbacks", and "Maintain versioned schemas with schemaVersion field and use additive changes; migrations must be explicit".

Also applies to: 143-144
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/schemas/plan.schema.ts` around lines 40 - 64, The runtime fields
currently typed with ArtifactRuntimeNameSchema should use the stricter
ExecutableArtifactRuntimeNameSchema so invalid legacy artifact-only names fail
at parse time; update the runtime property in ModelExclusionSchema, the runtime
property in MatrixItemSchema, and any runtime fields on RunPlanSchema (and the
other mentioned occurrence) to use ExecutableArtifactRuntimeNameSchema, and
add/import that schema if missing so the module fails fast on non-executable
runtime names.

src/harnesses/opencode-events.ts (1)

356-367: ⚠️ Potential issue | 🟡 Minor

Set hasToolUse when mixed-output salvage returns tool_call.

This branch can return method: "tool_call" while preserving hasToolUse: false from earlier event inspection. Downstream diagnostics then disagree about whether tool use happened.

♻️ Suggested fix

 	return buildParsedEvents(
 		toolCallCode ?? mixedOutput,
 		toolCallCode ? "tool_call" : hasRawParts ? "raw" : "json",
 		{
 			...diagnostics,
+			hasToolUse: diagnostics.hasToolUse || Boolean(toolCallCode),
 			permissionDenied: diagnostics.permissionDenied || mixedPermissionDenied,
 		},
 	);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/opencode-events.ts` around lines 356 - 367, The branch that
handles mixed text output (in the if using parsedLines, hasTextParts) can return
a tool_call method without updating hasToolUse; when joinOrderedOutputParts ->
extractFromToolCallText yields a toolCallCode, set hasToolUse to true before
calling buildParsedEvents so downstream diagnostics reflect actual tool usage;
update the local diagnostics/flags (the diagnostics object or the hasToolUse
variable referenced earlier) so that buildParsedEvents receives diagnostics with
permissionDenied as before and hasToolUse: true when toolCallCode is non-null.

src/harnesses/opencode-signal.ts (1)

84-90: ⚠️ Potential issue | 🟠 Major

Pass harness source when extracting raw-stream taint reasons.

getTranscriptOrInputTaintReasons() only classifies agent_requested_input when source === "harness". These calls omit that context, so confirmation/input prompts in raw stdout/stderr or parsedStream never produce the new taint reason.

♻️ Suggested fix

-	const stderrReasons = getTranscriptOrInputTaintReasons(stderr).filter(
+	const stderrReasons = getTranscriptOrInputTaintReasons(stderr, {
+		source: "harness",
+	}).filter(
 		(reason) => reason !== "internal_tool_transcript",
 	);
 	const protocolOnlyReasons =
 		parsed.method === "json" && parsed.output.trim().length === 0
-			? getTranscriptOrInputTaintReasons(parsedStream)
+			? getTranscriptOrInputTaintReasons(parsedStream, {
+					source: "harness",
+				})
 			: [];
@@
 	const reasons = Array.from(
 		new Set([
-			...getTranscriptOrInputTaintReasons(stdout),
-			...getTranscriptOrInputTaintReasons(stderr),
+			...getTranscriptOrInputTaintReasons(stdout, { source: "harness" }),
+			...getTranscriptOrInputTaintReasons(stderr, { source: "harness" }),
 			...getOpenCodePermissionTaintReasons(stdout, stderr, ""),
 		]),
 	);

Also applies to: 127-132

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/opencode-signal.ts` around lines 84 - 90, Calls to
getTranscriptOrInputTaintReasons for stderr, parsedStream (used in variables
stderrReasons and protocolOnlyReasons) are missing the source context so they
never emit agent_requested_input; update those calls to pass the harness source
(e.g., source: "harness") when extracting raw-stream taint reasons. Locate
usages of getTranscriptOrInputTaintReasons in this file (including the similar
block around the later occurrence at the 127-132 region) and add the source
argument so confirmation/input prompts in raw stdout/stderr and parsedStream are
classified correctly.

🧹 Nitpick comments (5)

apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx (1)
22-32: Consider deriving the TypeScript interface from the Zod schema.

Both ModelFilterDropdownProps interface and ModelFilterDropdownPropsSchema define the same shape. Using z.infer would eliminate duplication and ensure they stay in sync.
♻️ Proposed refactor
-interface ModelFilterDropdownProps {
-	models: string[];
-	selectedModels: string[];
-	onSelectionChange: (selectedModels: string[]) => void;
-}
-
 const ModelFilterDropdownPropsSchema = z.strictObject({
 	models: z.array(z.string()),
 	selectedModels: z.array(z.string()),
 	onSelectionChange: z.function().args(z.array(z.string())).returns(z.void()),
 });
+
+type ModelFilterDropdownProps = z.infer<typeof ModelFilterDropdownPropsSchema>;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx` around
lines 22 - 32, Duplicate prop typings exist: keep only the Zod schema and derive
the TypeScript type from it to avoid drift. Replace the manual
ModelFilterDropdownProps interface with a type alias like `type
ModelFilterDropdownProps = z.infer<typeof ModelFilterDropdownPropsSchema>`
(keeping the existing ModelFilterDropdownPropsSchema definition), then update
any imports/usages to reference this derived type; ensure you import `z` from
Zod where needed and remove the redundant interface declaration.
src/lib/machine-profile/probe-macos.ts (1)
35-41: Consider using Zod for boundary validation.

Since this code segment is being reformatted, consider refactoring to use Zod for schema validation instead of manual typeof and Array.isArray() checks. The parseMacosAccelerators function receives JSON from system_profiler (an external boundary), and the guidelines specify using Zod for all boundary validation.

As per coding guidelines: "Use Zod for schema validation at all boundaries (CLI args, env, JSON files, HTTP responses)" and "Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as unknown and parse with Zod immediately."
Example Zod-based validation approach
import { z } from "zod";

const MacosDisplaySchema = z.object({
  SPDisplaysDataType: z.array(z.unknown()),
});

export function parseMacosAccelerators(rawJson: string): ObservedAccelerator[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch (error) {
    throw new Error(
      `malformed macOS accelerator probe JSON: ${(error as Error).message}`,
    );
  }
  
  const result = MacosDisplaySchema.safeParse(parsed);
  if (!result.success) {
    throw new Error(
      "malformed macOS accelerator probe JSON: missing SPDisplaysDataType array",
    );
  }
  
  const displays = result.data.SPDisplaysDataType;
  // ... rest of the function
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/machine-profile/probe-macos.ts` around lines 35 - 41, Replace the
manual typeof/Array.isArray checks in parseMacosAccelerators with a Zod schema:
define a MacosDisplaySchema that requires SPDisplaysDataType as an array (e.g.,
z.object({ SPDisplaysDataType: z.array(z.unknown()) })), parse the raw JSON into
unknown, safeParse it with Zod, and on failure throw the existing "malformed
macOS accelerator probe JSON: missing SPDisplaysDataType array" error (or
include JSON parse errors when JSON.parse fails); then use
result.data.SPDisplaysDataType instead of the parsed cast.
src/cli/compare-command.ts (1)
88-91: Consider returning a non-zero exit code on checkpoint guard failure.

When checkpoint metadata is missing or mismatched, the command prints ✗ FAIL: but returns normally (exit 0). This may cause CI pipelines to treat the comparison as successful despite the guard failure. If this is intentional per the "exit non-zero only on crashes" guideline, consider documenting this behavior.
♻️ Optional fix for explicit failure exit
 			if (checkpointGuardMessage) {
 				console.error(`✗ FAIL: ${checkpointGuardMessage}`);
+				process.exit(1);
-				return;
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/compare-command.ts` around lines 88 - 91, The checkpoint guard
currently logs via console.error when checkpointGuardMessage is set but then
returns normally; change this to fail the process explicitly by calling
process.exit(1) (or otherwise throwing an error) immediately after logging the
`✗ FAIL: ${checkpointGuardMessage}` so CI sees a non-zero exit; update the spot
referencing checkpointGuardMessage in compare-command.ts (the block that
currently does console.error(...) then return) to perform the non-zero exit, or
alternatively add documentation if keeping the current zero-exit behavior is
intentional.
apps/dashboard/src/lib/schemas.ts (1)
189-198: Consider adding retry field coherence validation.

The dashboard ScoringMetricsSchema adds the retry fields but omits the .refine() validation present in the core schema (src/schemas/result.schema.ts). This allows partial retry payloads like { durationMs: 10, retryPromoted: true } to pass.

If dashboard schemas need to be permissive for legacy data, this is acceptable. However, if strict validation is preferred, consider adding the same refinement logic.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/lib/schemas.ts` around lines 189 - 198,
ScoringMetricsSchema allows inconsistent retry field combinations (e.g.,
retryPromoted true without retryAttempted) because it lacks the `.refine()` used
in the core result.schema.ts; update ScoringMetricsSchema to add a refinement
that validates coherence between retry fields (testing that if any retry* fields
are present they follow the same rules as in core: e.g., retryAttempted must be
true if retryPromoted or
retryReason/retryKind/retryGenerationDurationMs/retryDuration fields exist, and
required pairings are enforced), using ScoringMetricsSchema.refine(...) to run
the same predicate and error message as the core schema to keep behavior
consistent.
src/harnesses/opencode-runner.ts (1)
12-21: Route the process-tree helpers through the shared exec abstraction.

This file reintroduces direct execFile calls inside src/harnesses, which breaks the repo’s process-execution conventions right in the new kill-path helpers. Please move the pgrep / taskkill calls behind src/lib/exec.ts and keep harness-level process spawning on Execa so timeout/cleanup behavior stays consistent across harnesses.

Based on learnings, "Applies to src/lib/exec.ts : Put process execution code in src/lib/exec.ts" and "Applies to src/harnesses/**/*.{ts,tsx} : Use Execa for process execution and shell command invocation".

Also applies to: 42-123
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/opencode-runner.ts` around lines 12 - 21, This file is
reintroducing direct execFile calls (execFileAsync) for the pgrep/taskkill
helpers which violates the repo convention; replace any use of execFileAsync
(and direct node:child_process usage) with the shared execution abstraction in
src/lib/exec.ts and ensure harness-level process spawning continues to use execa
(the existing execa import). Concretely: remove/promisify execFile usage, move
the pgrep and taskkill invocations to new or existing helper functions in
src/lib/exec.ts (call those helpers from this file), and keep all process-spawn
logic in this module using execa so STALE_CHECK_INTERVAL_MS /
MIN_STALE_OUTPUT_TIMEOUT_MS / MAX_STALE_OUTPUT_TIMEOUT_MS / FORCE_KILL_DELAY_MS
timeout/cleanup behavior remains consistent; update any references to
execFileAsync to the shared exec API.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/machine-profile/instance-id.ts`:
- Around line 183-188: The ResolveMachineInstanceIdOptionsSchema currently uses
z.object({...}) which silently strips unknown keys; change it to a strict schema
by applying z.object({...}).strict() (or the equivalent strict mode in your zod
version) so that parsing of ResolveMachineInstanceIdOptionsSchema throws on any
extra/typoed option keys (e.g., configuredInstanceId,
legacyConfiguredInstanceId, env, instanceIdFilePath), enforcing fail-fast
behavior for invalid config inputs.

In `@src/schemas/config.schema.ts`:
- Around line 70-76: The migration currently only rewrites legacy runtime names
and misses the case where an older config has runtimes: [], which should be
normalized to the default ["ollama"]. Update the migration branch that checks
config.schemaVersion !== undefined && config.schemaVersion !== SCHEMA_VERSION &&
Array.isArray(config.runtimes) so that if config.runtimes.length === 0 you set
config.runtimes = ["ollama"], otherwise call
migrateLegacySupportedRuntimeNames(config.runtimes); ensure you reference the
existing symbols config.schemaVersion, SCHEMA_VERSION, config.runtimes, and
migrateLegacySupportedRuntimeNames when making the change so the post-migration
.nonempty() validation will pass.

In `@src/schemas/model-profile.schema.ts`:
- Around line 100-125: The persisted model-profile schemas must reject unknown
keys instead of silently stripping them; update the schema definitions so they
are strict by applying Zod's strict object behavior: either change
ConfiguredModelProfileBaseSchema to be created with z.strictObject({...}) or
call .strict() on the extended schemas (ConfiguredModelProfileSchema and
ArtifactConfiguredModelProfileSchema) so that
ArtifactConfiguredModelProfileSchema (and the persisted variant) throws on extra
fields; use the existing LegacyCompatibleModelProfileFileSchema pattern as
reference.

---

Duplicate comments:
In `@src/harnesses/opencode-events.ts`:
- Around line 356-367: The branch that handles mixed text output (in the if
using parsedLines, hasTextParts) can return a tool_call method without updating
hasToolUse; when joinOrderedOutputParts -> extractFromToolCallText yields a
toolCallCode, set hasToolUse to true before calling buildParsedEvents so
downstream diagnostics reflect actual tool usage; update the local
diagnostics/flags (the diagnostics object or the hasToolUse variable referenced
earlier) so that buildParsedEvents receives diagnostics with permissionDenied as
before and hasToolUse: true when toolCallCode is non-null.

In `@src/harnesses/opencode-signal.ts`:
- Around line 84-90: Calls to getTranscriptOrInputTaintReasons for stderr,
parsedStream (used in variables stderrReasons and protocolOnlyReasons) are
missing the source context so they never emit agent_requested_input; update
those calls to pass the harness source (e.g., source: "harness") when extracting
raw-stream taint reasons. Locate usages of getTranscriptOrInputTaintReasons in
this file (including the similar block around the later occurrence at the
127-132 region) and add the source argument so confirmation/input prompts in raw
stdout/stderr and parsedStream are classified correctly.

In `@src/schemas/plan.schema.ts`:
- Around line 40-64: The runtime fields currently typed with
ArtifactRuntimeNameSchema should use the stricter
ExecutableArtifactRuntimeNameSchema so invalid legacy artifact-only names fail
at parse time; update the runtime property in ModelExclusionSchema, the runtime
property in MatrixItemSchema, and any runtime fields on RunPlanSchema (and the
other mentioned occurrence) to use ExecutableArtifactRuntimeNameSchema, and
add/import that schema if missing so the module fails fast on non-executable
runtime names.

---

Nitpick comments:
In `@apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx`:
- Around line 22-32: Duplicate prop typings exist: keep only the Zod schema and
derive the TypeScript type from it to avoid drift. Replace the manual
ModelFilterDropdownProps interface with a type alias like `type
ModelFilterDropdownProps = z.infer<typeof ModelFilterDropdownPropsSchema>`
(keeping the existing ModelFilterDropdownPropsSchema definition), then update
any imports/usages to reference this derived type; ensure you import `z` from
Zod where needed and remove the redundant interface declaration.

In `@apps/dashboard/src/lib/schemas.ts`:
- Around line 189-198: ScoringMetricsSchema allows inconsistent retry field
combinations (e.g., retryPromoted true without retryAttempted) because it lacks
the `.refine()` used in the core result.schema.ts; update ScoringMetricsSchema
to add a refinement that validates coherence between retry fields (testing that
if any retry* fields are present they follow the same rules as in core: e.g.,
retryAttempted must be true if retryPromoted or
retryReason/retryKind/retryGenerationDurationMs/retryDuration fields exist, and
required pairings are enforced), using ScoringMetricsSchema.refine(...) to run
the same predicate and error message as the core schema to keep behavior
consistent.

In `@src/cli/compare-command.ts`:
- Around line 88-91: The checkpoint guard currently logs via console.error when
checkpointGuardMessage is set but then returns normally; change this to fail the
process explicitly by calling process.exit(1) (or otherwise throwing an error)
immediately after logging the `✗ FAIL: ${checkpointGuardMessage}` so CI sees a
non-zero exit; update the spot referencing checkpointGuardMessage in
compare-command.ts (the block that currently does console.error(...) then
return) to perform the non-zero exit, or alternatively add documentation if
keeping the current zero-exit behavior is intentional.

In `@src/harnesses/opencode-runner.ts`:
- Around line 12-21: This file is reintroducing direct execFile calls
(execFileAsync) for the pgrep/taskkill helpers which violates the repo
convention; replace any use of execFileAsync (and direct node:child_process
usage) with the shared execution abstraction in src/lib/exec.ts and ensure
harness-level process spawning continues to use execa (the existing execa
import). Concretely: remove/promisify execFile usage, move the pgrep and
taskkill invocations to new or existing helper functions in src/lib/exec.ts
(call those helpers from this file), and keep all process-spawn logic in this
module using execa so STALE_CHECK_INTERVAL_MS / MIN_STALE_OUTPUT_TIMEOUT_MS /
MAX_STALE_OUTPUT_TIMEOUT_MS / FORCE_KILL_DELAY_MS timeout/cleanup behavior
remains consistent; update any references to execFileAsync to the shared exec
API.

In `@src/lib/machine-profile/probe-macos.ts`:
- Around line 35-41: Replace the manual typeof/Array.isArray checks in
parseMacosAccelerators with a Zod schema: define a MacosDisplaySchema that
requires SPDisplaysDataType as an array (e.g., z.object({ SPDisplaysDataType:
z.array(z.unknown()) })), parse the raw JSON into unknown, safeParse it with
Zod, and on failure throw the existing "malformed macOS accelerator probe JSON:
missing SPDisplaysDataType array" error (or include JSON parse errors when
JSON.parse fails); then use result.data.SPDisplaysDataType instead of the parsed
cast.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: daf5bf35-1f5e-4938-9a10-916719b9d0d7

📥 Commits

Reviewing files that changed from the base of the PR and between 56b63e7 and f7a7f85.

📒 Files selected for processing (162)

README.md
apps/dashboard/scripts/build-index.ts
apps/dashboard/src/components/about/about-content.ts
apps/dashboard/src/components/about/about-page.tsx
apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx
apps/dashboard/src/components/charts/chart-primitives.tsx
apps/dashboard/src/components/charts/composite-score-chart.tsx
apps/dashboard/src/components/charts/frontier-eval-scatter.tsx
apps/dashboard/src/components/charts/model-comparison-chart.tsx
apps/dashboard/src/components/charts/model-radar-chart.tsx
apps/dashboard/src/components/charts/model-test-heatmap.tsx
apps/dashboard/src/components/charts/test-difficulty-chart.tsx
apps/dashboard/src/components/charts/test-type-split-chart.tsx
apps/dashboard/src/components/charts/token-efficiency-chart.tsx
apps/dashboard/src/components/layout/header.tsx
apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
apps/dashboard/src/components/leaderboard/model-filter-dropdown.tsx
apps/dashboard/src/components/run-list/run-list-page.tsx
apps/dashboard/src/components/ui/section-heading.tsx
apps/dashboard/src/index.css
apps/dashboard/src/lib/aggregations-charts.ts
apps/dashboard/src/lib/aggregations-compare.ts
apps/dashboard/src/lib/aggregations-diagnostics.ts
apps/dashboard/src/lib/aggregations-test-types.ts
apps/dashboard/src/lib/aggregations-tooling.ts
apps/dashboard/src/lib/aggregations.ts
apps/dashboard/src/lib/api.ts
apps/dashboard/src/lib/chart-colors.ts
apps/dashboard/src/lib/schemas.ts
apps/dashboard/src/lib/types.ts
llm/context/codebase-overview.md
llm/implementation/computer-use-hardening.md
llm/implementation/harnesses-implementation.md
llm/implementation/multi-runtime-mvp-implementation.md
llm/project/project-overview.md
llm/project/project-rules.md
llm/project/user-flow.md
package.json
src/cli/compare-command.ts
src/cli/compare-formatters.ts
src/cli/compare-utils.ts
src/cli/migrate-machine-command.ts
src/cli/run-command.ts
src/harnesses/direct-adapter.ts
src/harnesses/discovery.ts
src/harnesses/goose-adapter.ts
src/harnesses/goose-openai.ts
src/harnesses/goose-output.ts
src/harnesses/goose-utils.ts
src/harnesses/harness.ts
src/harnesses/opencode-adapter.ts
src/harnesses/opencode-artifacts.ts
src/harnesses/opencode-cli.ts
src/harnesses/opencode-config.ts
src/harnesses/opencode-events.ts
src/harnesses/opencode-model.ts
src/harnesses/opencode-output.ts
src/harnesses/opencode-permissions.ts
src/harnesses/opencode-process.ts
src/harnesses/opencode-provider.ts
src/harnesses/opencode-runner.ts
src/harnesses/opencode-signal.ts
src/harnesses/tool-prompt.ts
src/lib/benchmark-checkpoint.ts
src/lib/failure-classifier.ts
src/lib/hardware-profile.ts
src/lib/logger.ts
src/lib/machine-profile/instance-id.ts
src/lib/machine-profile/legacy.ts
src/lib/machine-profile/normalization.ts
src/lib/machine-profile/probe-linux.ts
src/lib/machine-profile/probe-macos.ts
src/lib/machine-profile/probe-utils.ts
src/lib/machine-profile/probe-windows.ts
src/lib/model-aliases.ts
src/lib/model-profile/normalization.ts
src/lib/model-profile/registry-compat.ts
src/lib/model-profile/registry.ts
src/lib/openai-compat-client.ts
src/lib/signal-assessment.ts
src/lib/stats-format.ts
src/lib/test-catalog.ts
src/runner/index.ts
src/runner/item-executor.ts
src/runner/item-retry.ts
src/runner/model-availability.ts
src/runner/model-eligibility.ts
src/runner/plan-builder.ts
src/runner/run-progress.ts
src/runtimes/discovery.ts
src/runtimes/index.ts
src/runtimes/ollama-residency.ts
src/runtimes/ollama-runtime.ts
src/runtimes/runtime.ts
src/runtimes/vllm-runtime.ts
src/schemas/common.schema.ts
src/schemas/config.schema.ts
src/schemas/index.ts
src/schemas/model-alias.schema.ts
src/schemas/model-profile.schema.ts
src/schemas/plan.schema.ts
src/schemas/result.schema.ts
src/schemas/test-catalog.schema.ts
src/tests/calculator-basic/test.meta.json
src/tests/calculator-stateful/test.meta.json
src/tests/event-emitter/test.meta.json
src/tests/rate-limiter/test.meta.json
src/tests/smoke/test.meta.json
src/tests/todo-app/test.meta.json
src/tests/ttl-cache/test.meta.json
src/tests/workspace-smoke/fixtures/logs/.gitkeep
src/tests/workspace-smoke/prompt.blind.md
src/tests/workspace-smoke/prompt.informed.md
src/tests/workspace-smoke/test.meta.json
test/aggregate.test.ts
test/benchmark-checkpoint.test.ts
test/build-index.test.ts
test/compare-checkpoint-guard.test.ts
test/compare-formatters.test.ts
test/dashboard-schema-compat.test.ts
test/dashboard-test-type.test.ts
test/failure-classifier.test.ts
test/generation-retry.test.ts
test/goose-adapter.test.ts
test/goose-output.test.ts
test/hardware-profile.test.ts
test/harness-compatibility.test.ts
test/item-executor.test.ts
test/item-retry.test.ts
test/machine-profile-migration.test.ts
test/model-alias-schema.test.ts
test/model-availability.test.ts
test/model-profile-registry.test.ts
test/ollama-adapter.test.ts
test/ollama-residency.test.ts
test/opencode-adapter.test.ts
test/opencode-artifacts.test.ts
test/opencode-cli-cache.test.ts
test/opencode-cli.test.ts
test/opencode-config.test.ts
test/opencode-model.test.ts
test/opencode-output.test.ts
test/opencode-runner-race.test.ts
test/opencode-runner.test.ts
test/plan-builder.test.ts
test/run-command.test.ts
test/runner-residency.test.ts
test/schema-regressions.test.ts
test/schemas.test.ts
test/signal-assessment.test.ts
test/stats-semantics.test.ts
test/stats.test.ts
test/test-catalog.test.ts
test/tool-prompt.test.ts
test/utils/fixtures.ts
test/vllm-runtime.test.ts
test/workspace-capability-parity.test.ts
test/workspace-prompt-parity.test.ts

💤 Files with no reviewable changes (10)

src/harnesses/goose-openai.ts
src/harnesses/opencode-model.ts
src/runtimes/discovery.ts
package.json
src/runtimes/vllm-runtime.ts
src/harnesses/opencode-process.ts
test/vllm-runtime.test.ts
src/lib/openai-compat-client.ts
src/harnesses/opencode-output.ts
test/benchmark-checkpoint.test.ts

AustinKelsay and others added 4 commits April 5, 2026 10:51

bench: harden signal assessment and retry fairness

589033c

bench: tighten taint detection regressions

42bd9ea

fix: tighten signal assessment taint guards

aef3606

Merge pull request #23 from AustinKelsay/fix/benchmark-signal-hardening

9c372ee

[codex] bench: harden signal assessment and retry fairness

vercel Bot deployed to Preview April 10, 2026 21:29 View deployment

coderabbitai Bot reviewed Apr 10, 2026

View reviewed changes

Comment thread src/harnesses/goose-adapter.ts Outdated

Comment thread src/lib/signal-assessment.ts Outdated

Preserve failure metadata in item execution

b6f41a8

- keep stdout/stderr and signal assessments from Goose failures - carry structured generation failure details through executeItem - bump schema version and accept prior current-version artifacts

vercel Bot deployed to Preview April 10, 2026 22:11 View deployment

coderabbitai Bot reviewed Apr 10, 2026

View reviewed changes

Comment thread src/harnesses/goose-adapter.ts Outdated

Comment thread src/runner/item-executor.ts Outdated

Improve Goose taint detection and error fallback

b2dd0fa

- Scan stderr alongside stdout for taint reasons - Use execa message when Goose emits no stream output - Cover legacy run migration for machine profiles

vercel Bot deployed to Preview April 10, 2026 22:35 View deployment

Restrict benchmark execution to Ollama

8ea88b5

- Remove vLLM and OpenAI-compatible runtime paths - Simplify harness, config, and docs around Ollama-only runs - Keep legacy vLLM artifacts readable for compare and history

vercel Bot had a problem deploying to Preview April 13, 2026 18:39 Failure

coderabbitai Bot reviewed Apr 13, 2026

View reviewed changes

Comment thread src/schemas/model-profile.schema.ts Outdated

Align dashboard plan config with new Goose fields

5c5bc91

- Remove deprecated `vllmBaseUrl` from plan config - Add optional Goose turn limits and test categories

vercel Bot deployed to Preview April 13, 2026 18:51 View deployment

coderabbitai Bot reviewed Apr 13, 2026

View reviewed changes

Comment thread apps/dashboard/src/lib/schemas.ts Outdated

Harden legacy profile and signal-assessment loading

427c6b8

- Accept legacy model profile files by dropping unsupported runtime variants - Validate harness signal assessments with Zod before persisting failures - Tighten dashboard schema bounds for Goose turn counts

vercel Bot deployed to Preview April 13, 2026 19:09 View deployment

AustinKelsay added 2 commits April 17, 2026 11:07

Merge remote-tracking branch 'origin/main' into staging

3be41ed

chore: restore staging verification

fc34144

vercel Bot deployed to Preview April 17, 2026 18:10 View deployment

coderabbitai Bot reviewed Apr 20, 2026

View reviewed changes

AustinKelsay and others added 3 commits April 21, 2026 16:27

rebuild opencode harness

7977818

Harden OpenCode harness review fixes

ae1ab2e

Merge pull request #25 from AustinKelsay/codex/opencode-harness-rebuild

6f0daf9

[codex] rebuild opencode harness

coderabbitai Bot reviewed Apr 24, 2026

View reviewed changes

Harden Ollama-only benchmark execution and validation

67de245

vercel Bot deployed to Preview April 25, 2026 01:40 View deployment

coderabbitai Bot reviewed Apr 25, 2026

View reviewed changes

Refine dashboard and harness validation handling

6c76840

vercel Bot deployed to Preview April 25, 2026 15:18 View deployment

coderabbitai Bot reviewed Apr 25, 2026

View reviewed changes

Tighten validation and preserve adapter taint metadata

7d83da4

vercel Bot deployed to Preview April 25, 2026 16:10 View deployment

coderabbitai Bot reviewed Apr 25, 2026

View reviewed changes

Tighten CLI validation and adapter cleanup

0cb6fdb

vercel Bot deployed to Preview April 25, 2026 18:32 View deployment

coderabbitai Bot reviewed Apr 25, 2026

View reviewed changes

Refine dashboard and runtime validation cleanup

f7a7f85

vercel Bot deployed to Preview April 25, 2026 19:07 View deployment

coderabbitai Bot reviewed Apr 25, 2026

View reviewed changes

Comment thread src/lib/machine-profile/instance-id.ts Outdated

Comment thread src/schemas/config.schema.ts

Comment thread src/schemas/model-profile.schema.ts Outdated

Tighten schema validation and runtime handling

16cb5cf

vercel Bot deployed to Preview April 25, 2026 19:31 View deployment

Conversation

AustinKelsay commented Apr 10, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

vercel Bot commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

AustinKelsay commented Apr 20, 2026

Uh oh!

coderabbitai Bot commented Apr 20, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

AustinKelsay commented Apr 24, 2026

Uh oh!

coderabbitai Bot commented Apr 24, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

AustinKelsay commented Apr 25, 2026

Uh oh!

coderabbitai Bot commented Apr 25, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

AustinKelsay commented Apr 25, 2026

AustinKelsay commented Apr 10, 2026 •

edited by coderabbitai Bot

Loading

vercel Bot commented Apr 10, 2026 •

edited

Loading

coderabbitai Bot commented Apr 10, 2026 •

edited

Loading