Feature/result checkpointing and aggregation by AustinKelsay · Pull Request #13 · AustinKelsay/plebdev-bench

AustinKelsay · 2026-03-05T17:53:59Z

Summary by CodeRabbit

New Features
- Default Leaderboard route showing latest-checkpoint aggregated leaderboard and downloadable checkpoint aggregates (latest.json); per-checkpoint aggregate files added.
- Periodic crash‑safe run snapshots (run.partial.json) during long runs.
Enhancements
- Richer run/artifact metadata: checkpoints, machine profiles, provenance; schema bumped.
- Explicit CLI machine identity and compare guardrail (--allow-cross-checkpoint).
- Goose max/retry turn controls with validation; scorer runs in isolated worker by default (in‑process debug mode).
- Improved timing metrics and interactive leaderboard filters.
Tests
- New tests for checkpointing, aggregation, hardware identity, scorer isolation, configs, and schemas.

vercel · 2026-03-05T17:54:04Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
plebdev-bench-dashboard	Ready	Preview, Comment	Mar 6, 2026 8:45pm

coderabbitai · 2026-03-05T17:54:24Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
✅ Review completed - (🔄 Check again to review again)

📝 Walkthrough

Walkthrough

Adds deterministic benchmark checkpoints, sanitized machine profiles, crash‑safe per‑run partial snapshots, checkpointed cross‑run aggregation with latest‑leaderboard artifacts and UI, scorer isolation via a worker (with in‑process fallback), goose turn‑limit controls and CLI machine identity options, and a dashboard/index v2 with schema bump to 0.3.0.

Changes

Cohort / File(s)	Summary
Checkpointing & Hardware `src/lib/benchmark-checkpoint.ts`, `src/lib/hardware-profile.ts`, `src/runner/plan-builder.ts`	Deterministic benchmark manifest/checkpoint computation; sanitized hardware profile collection; deterministic or explicit machine identity resolution; embed `benchmarkCheckpoint`, `machine`, and `provenance` into generated run plans.
Per-run Partial Snapshots / Writer `src/results/writer.ts`, `src/runner/index.ts`, `src/results/reader.ts`	Periodic crash‑safe `run.partial.json` checkpoint writes (atomic temp→rename), deletion on success; exports `writePartialResult`/`deletePartialResult`; reader migrates legacy `environment`→`runtimeEnvironment`.
Checkpointed Aggregation & Indexing `src/results/aggregate.ts`, `apps/dashboard/scripts/build-index.ts`, `apps/dashboard/public/results/aggregates/*`, `apps/dashboard/public/results/index.json`	Deterministic latest‑wins per‑checkpoint aggregation, checkpoint summaries, per‑checkpoint aggregate artifacts, and a v2 dashboard index plus `aggregates/latest.json`.
Dashboard UI & API `apps/dashboard/src/components/leaderboard/`, `apps/dashboard/src/pages/leaderboard.tsx`, `apps/dashboard/src/App.tsx`, `apps/dashboard/src/lib/api.ts`, `apps/dashboard/src/lib/schemas.ts`, `apps/dashboard/src/lib/types.ts`, `apps/dashboard/src/components/`	New Leaderboard route (default), filtering controls, aggregate‑first rendering; client APIs `fetchDashboardIndex`/`fetchLatestAggregate` with legacy‑to‑v2 normalization; new aggregate schemas/types.
Scorer Isolation & Worker `src/lib/scorer.ts`, `src/lib/scorer-worker.ts`, `src/lib/scorer-core.ts`	Introduce scorer worker subprocess protocol and CLI/env selectable mode (worker vs in‑process); retain `scoreGenerationInProcess` and add structured worker I/O handling.
Goose / Harness Hardening `src/harnesses/goose-adapter.ts`, `src/harnesses/code-output-policy.ts`, `src/harnesses/index.ts`, `src/runner/item-executor.ts`	Add goose turn‑limit detection and retry semantics, config‑driven `maxTurns`/`retryMaxTurns`, validate `retryMaxTurns >= maxTurns`, propagate harness options and timing through executor.
CLI & Compare Guardrail `src/cli/run-command.ts`, `src/cli/compare-command.ts`	Run CLI gains `--goose-max-turns`, `--goose-retry-max-turns`, `--machine-id`, `--machine-label`; compare enforces checkpoint equality guard (opt‑out via `--allow-cross-checkpoint`) and reads plan best‑effort.
Schemas, Types & Config `src/schemas/...`, `src/schemas/index.ts`, `src/schemas/common.schema.ts`, `src/schemas/config.schema.ts`, `apps/dashboard/src/lib/schemas.ts`, `apps/dashboard/src/lib/types.ts`	Schema version bumped to 0.3.0; add VerificationStatus, BenchmarkCheckpoint, RuntimeEnvironment, HardwareProfile, MachineProfile, RunProvenance; RunPlan/RunResult include `machine`/`benchmarkCheckpoint`/`provenance`; BenchConfig adds machine id/label and goose turn limits with cross‑field validation.
Timing / Stats `src/lib/stats.ts`, `src/runner/item-executor.ts`	TimingStats extended for pure scoring vs retry generation (`avgScoringOnlyMs`, `avgRetryGenerationMs`, `scoringItemsWithRetry`); scoring timing propagated from retry paths.
Index/Build Tests & Docs `test/`, `llm/implementation/.md`, `README.md`, `apps/dashboard/scripts/build-index.ts`	New/updated tests for checkpointing, hardware profile, aggregation, build‑index, compare guard, code‑output turn‑limit, scorer isolation, and stats; docs for checkpointed aggregation, scorer isolation, and goose hardening; build‑index implementation added.
Results Public Artifacts `apps/dashboard/public/results/aggregates/*`, `apps/dashboard/public/results/index.json`	Add per‑checkpoint aggregate JSON files and `latest.json`; root `index.json` migrated to v2 shape with `runs` and `checkpoints` arrays and `latestCheckpointId`.

Sequence Diagram(s)

sequenceDiagram
    participant UI as Dashboard UI
    participant API as Client API
    participant Agg as aggregates/latest.json
    participant Index as index.json

    UI->>API: fetchLatestAggregate()
    API->>Agg: GET /results/aggregates/latest.json
    alt latest exists
        Agg-->>API: LeaderboardAggregate
        API->>API: validate aggregate schema
    else 404 or missing
        API->>Index: GET /results/index.json
        Index-->>API: DashboardIndex (v2 or legacy)
        API->>API: normalize to v2
        API->>API: synthesize empty aggregate if needed
    end
    API-->>UI: LeaderboardAggregate (resolved)

sequenceDiagram
    participant Runner as Runner
    participant Harness as Goose Harness
    participant Mode as Mode Resolver (PLEBDEV_BENCH_SCORER_MODE)
    participant Worker as Scorer Worker (subprocess)

    Runner->>Harness: createHarness({maxTurns,retryMaxTurns})
    Runner->>Harness: generate(item)
    Harness-->>Runner: output
    alt detected turn-limit or off-task
        Runner->>Mode: resolveScoringMode()
        Mode-->>Runner: "process" or "in-process"
        alt "process"
            Runner->>Worker: spawn scorer-worker (stdin JSON)
            Worker->>Worker: scoreGenerationInProcess(...)
            Worker-->>Runner: {ok,result}
        else "in-process"
            Runner->>Runner: scoreGenerationInProcess(...)
            Runner-->>Runner: ScoringResult
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

review and harden phase executed by codex #4: Overlapping hardening and dashboard features (scorer isolation, schema bump, checkpointing, machine profiles).
Feature/multi runtime #6: Related dashboard indexing/aggregation and build‑index logic touching artifact generation.
tighten todo prompt and export exceptions #8: Related changes to scoring/harness stack and output policies (code‑output‑policy and scorer behavior).

Poem

🐰 I hop through hashes and tidy each test,

Partial snaps save the runs from the rest.
Machines get names, manifests fall in line,
Leaderboards bloom from the latest design.
Workers hum softly — scoring, neat and blessed.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Feature/result checkpointing and aggregation' clearly and accurately summarizes the main changes in this large pull request, which introduces per-run partial result checkpointing, cross-run checkpoint-based aggregation, and leaderboard/dashboard features.
Docstring Coverage	✅ Passed	Docstring coverage is 94.25% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feature/result-checkpointing-and-aggregation

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 9

🧹 Nitpick comments (15)

src/harnesses/code-output-policy.ts (1)

27-32: Pattern overlap between TURN_LIMIT_PATTERNS and OFF_TASK_PATTERNS.

The pattern /\bwould you like me to continue\b/i appears in both TURN_LIMIT_PATTERNS (line 31) and OFF_TASK_PATTERNS (line 19). Since the turn-limit check executes first (lines 159-169), this phrase will always classify as turn_limit rather than off_task.

If intentional (turn-limit should take precedence), consider removing the duplicate from OFF_TASK_PATTERNS to avoid confusion. If not intentional, remove it from TURN_LIMIT_PATTERNS or adjust the pattern to be more specific (e.g., requiring the "maximum number of actions" phrase to co-occur).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/code-output-policy.ts` around lines 27 - 32,
TURN_LIMIT_PATTERNS and OFF_TASK_PATTERNS both include the phrase-matching
pattern /\bwould you like me to continue\b/i, causing the turn-limit check (in
the turn-limit classification flow that runs before off-task) to always win;
remove the duplicate from one of the arrays or make the TURN_LIMIT_PATTERNS
entry more specific (e.g., require co-occurrence with "maximum number of
actions" or tighten the regex) so the intended category wins—update the
TURN_LIMIT_PATTERNS constant (or OFF_TASK_PATTERNS) accordingly and run tests to
ensure classification precedence matches expectations.

src/harnesses/goose-adapter.ts (1)

81-84: Silent override when retryMaxTurns < maxTurns.

The Math.max(maxTurns, ...) ensures retryMaxTurns >= maxTurns, but if a caller explicitly sets retryMaxTurns to a value lower than maxTurns, it gets silently overridden. Consider logging a debug message when this adjustment occurs to aid troubleshooting.

🔧 Optional: Add debug logging for adjustment

 	const retryMaxTurns = Math.max(
 		maxTurns,
 		normalizeTurnLimit(options?.retryMaxTurns, DEFAULT_GOOSE_RETRY_MAX_TURNS),
 	);
+	if (options?.retryMaxTurns !== undefined && retryMaxTurns !== options.retryMaxTurns) {
+		logger.debug(
+			{ requested: options.retryMaxTurns, adjusted: retryMaxTurns },
+			"retryMaxTurns adjusted to be >= maxTurns",
+		);
+	}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/goose-adapter.ts` around lines 81 - 84, The current calculation
for retryMaxTurns using Math.max(maxTurns,
normalizeTurnLimit(options?.retryMaxTurns, DEFAULT_GOOSE_RETRY_MAX_TURNS))
silently forces retryMaxTurns to be at least maxTurns; detect when the
normalized options?.retryMaxTurns is less than maxTurns and emit a debug log
noting the requested value and the enforced adjusted value to aid
troubleshooting. Locate the logic around retryMaxTurns, maxTurns,
normalizeTurnLimit and DEFAULT_GOOSE_RETRY_MAX_TURNS in goose-adapter.ts and add
a conditional that compares the normalized requestedRetry =
normalizeTurnLimit(options?.retryMaxTurns, DEFAULT_GOOSE_RETRY_MAX_TURNS) to
maxTurns, log a debug message (including requestedRetry and final retryMaxTurns)
before assigning retryMaxTurns, and then keep using Math.max to set the final
value.

src/lib/scorer-core.ts (2)

46-74: JSON.stringify deep equality has known edge cases.

JSON.stringify comparison can produce false negatives for objects with:

Keys in different insertion order

undefined values vs missing keys (both stringify to the same result, but semantically different)

Special values like Date objects, Map, Set

For the scoring use case (comparing test outputs), this is likely acceptable since test cases typically use primitive values or simple objects. Consider documenting this limitation in the function's JSDoc if edge cases become a concern.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/scorer-core.ts` around lines 46 - 74, valuesMatch uses JSON.stringify
for deep equality which has edge cases (key order, undefined vs missing keys,
Date/Map/Set); update the valuesMatch function's JSDoc to document these
limitations and when tolerance applies, referencing the valuesMatch function so
future maintainers know that this comparison is intended for primitives/simple
objects and may not handle complex types or ordering differences.
168-176: Class detection heuristic may misidentify some factory functions.

The check factory.prototype && factory.prototype.constructor === factory identifies classes, but some factory functions also have prototype chains (e.g., functions created with Object.create). This could lead to incorrect new calls on non-constructor functions.

However, since the scoring specs are controlled by the project (not arbitrary user input), this is a low-risk edge case.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/scorer-core.ts` around lines 168 - 176, The current heuristic using
factory.prototype.constructor can misclassify functions; update the detection
around the factory variable so ES6 classes are detected reliably by checking
Function.prototype.toString.call(factory).startsWith("class") and fall back to a
safe attempt: try instantiating with new (wrap new (factory as any)() in a
try/catch) and if it throws a TypeError, call the factory as a function instead.
Apply this change where the factory variable is handled so class constructors
are constructed and non-constructors are invoked safely.

src/lib/scorer.ts (1)

62-85: Consider aligning path resolution with existing codebase patterns.

The process.cwd() pattern is already used consistently across multiple files in the codebase (benchmark-checkpoint.ts, test-catalog.ts, scoring-spec.ts, etc.), indicating the tool is designed to be run from the repository root. While import.meta.url would be viable for this ESM codebase, adopting it here would create inconsistency.

If improving path resolution is a priority, consider standardizing the approach across the codebase rather than in this isolated function.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/scorer.ts` around lines 62 - 85, The path resolution in scoreInWorker
should follow the repo-wide convention using process.cwd() rather than
import.meta.url; ensure the workerPath in the scoreInWorker function is built
with path.join(process.cwd(), "src", "lib", "scorer-worker.ts") (the same
pattern used elsewhere like benchmark-checkpoint.ts), and avoid switching to
import.meta.url here unless you standardize the approach across the codebase;
update any divergent resolution logic in scoreInWorker or related helpers to
match this pattern.

apps/dashboard/src/components/run-detail/run-detail-page.tsx (1)

194-196: Expose full checkpoint ID on hover for truncated text.

Because Line 194 truncates the checkpoint string, adding title={checkpointId} makes the full value easy to inspect/copy.

Suggested patch

-						{checkpointId && (
-							<p className="text-xs text-foreground-faint mt-1 truncate">
-								{checkpointId}
-							</p>
-						)}
+						{checkpointId && (
+							<p
+								className="text-xs text-foreground-faint mt-1 truncate"
+								title={checkpointId}
+							>
+								{checkpointId}
+							</p>
+						)}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/components/run-detail/run-detail-page.tsx` around lines
194 - 196, The truncated checkpoint string displayed in the RunDetailPage
component (the <p className="text-xs text-foreground-faint mt-1 truncate">
element rendering checkpointId) should expose the full value on hover; update
that element (where checkpointId is rendered) to add a title attribute set to
checkpointId so users can view/copy the entire ID on hover.

apps/dashboard/src/components/leaderboard/leaderboard-page.tsx (1)

44-47: Add full TSDoc contract for exported LeaderboardPage.

Line 47 exports a public function, but the block at Line 44-Line 46 doesn’t provide the full required function contract fields.

📝 Suggested doc update

 /**
  * Renders the latest-checkpoint leaderboard page.
+ *
+ * `@returns` Leaderboard route content with summary cards, filters, and aggregated results.
+ * `@throws` {Error} No direct throws; load failures are rendered via component error state.
  */
 export function LeaderboardPage() {

As per coding guidelines, "All exported functions must have TSDoc/JSDoc documentation including purpose, params, returns, and throws".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/components/leaderboard/leaderboard-page.tsx` around lines
44 - 47, The exported React component LeaderboardPage is missing a full TSDoc
contract; add a complete TSDoc block above the export that documents the
component's purpose, any props (or note none if it accepts none), the return
type (JSX.Element or ReactElement), and any errors/throws behavior (if it can
throw or rely on thrown services), e.g., include `@remarks/`@example as needed to
clarify usage; ensure the TSDoc sits immediately above the function declaration
for LeaderboardPage.

apps/dashboard/src/lib/api.ts (1)

83-171: Complete TSDoc contracts for the exported API functions.

The new/updated exports in this range should include full contract fields (@param, @returns, @throws) to match repository API-doc standards.

As per coding guidelines, "All exported functions must have TSDoc/JSDoc documentation including purpose, params, returns, and throws".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/lib/api.ts` around lines 83 - 171, Add complete TSDoc
blocks for all exported functions in this file: fetchDashboardIndex, fetchRuns,
fetchRun, fetchPlan, fetchRunWithPlan, and fetchLatestAggregate. For each
function include a one-line description, `@param` entries for parameters (e.g.,
runId: string) where applicable, an `@returns` describing the resolved Promise
type (e.g., Promise<DashboardIndex>, Promise<RunResult>, Promise<{run:
RunResult; plan: RunPlan}>, Promise<LeaderboardAggregate>), and `@throws`
describing possible errors (network failures, 404 fallback behavior for
fetchDashboardIndex/fetchLatestAggregate, and validation failures from
schema.parse). Ensure wording matches existing repo style (purpose, params,
returns, throws) and place the comment immediately above each function
declaration.

test/aggregate.test.ts (1)

96-101: Align duplicate-key test data with intent by keeping id constant.

Line 96 and Line 99 use different IDs; using the same ID makes the “duplicate machine+matrix key” intent explicit and less coupled to key-implementation details.

🧪 Suggested tweak

-const olderItem = createItem("01", "2026-03-04T12:00:00.000Z", {
+const olderItem = createItem("01", "2026-03-04T12:00:00.000Z", {
 	automatedScore: { passed: 3, failed: 3, total: 6 },
 });
-const newerItem = createItem("02", "2026-03-04T12:10:00.000Z", {
+const newerItem = createItem("01", "2026-03-04T12:10:00.000Z", {
 	automatedScore: { passed: 6, failed: 0, total: 6 },
 });

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/aggregate.test.ts` around lines 96 - 101, The duplicate-key test uses
createItem to build olderItem and newerItem but gives them different ids, which
weakens the intent of testing duplicate (machine+matrix) keys; update the
createItem calls for olderItem and newerItem so they share the same id value
(e.g., use "01" for both) in the aggregate.test.ts test, leaving timestamps and
payloads unchanged so the test explicitly models a newer record replacing an
older record; adjust any expectations/assertions that depend on the id if
needed.

src/schemas/common.schema.ts (1)

196-212: Default values in RunProvenanceSchema may have unintended effects.

Using .default() on verificationStatus and source means incomplete provenance objects will be auto-populated. Verify this is intentional for partial input scenarios; otherwise, consider making them required or using .optional() without defaults to let consumers handle missing values explicitly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/schemas/common.schema.ts` around lines 196 - 212, RunProvenanceSchema
currently uses defaults for verificationStatus and source which will silently
populate missing fields; change the schema to either make these fields required
(remove .default(...) so verificationStatus and source are validated as present)
or change them to optional (use .optional() without defaults) depending on
desired behavior. Locate RunProvenanceSchema and update
VerificationStatusSchema.default("self_reported") and source:
z.string().min(1).default("local_cli") to the chosen alternative (required
without .default or optional via .optional()) and adjust any upstream callers
that relied on the implicit defaults (e.g., validation/creation flows that use
RunProvenanceSchema).

src/cli/compare-command.ts (1)

54-63: Consider using logger instead of console.warn.

Per coding guidelines, the project uses Pino for structured logging. The console.warn in readPlanBestEffort should use the logger for consistency with the rest of the codebase.

♻️ Proposed change to use logger

+import { logger } from "../lib/logger.js";
+
 export function readPlanBestEffort(runDir: string): RunPlan | undefined {
 	try {
 		return readPlan(runDir);
 	} catch (error) {
-		console.warn(
-			`Warning: Unable to read plan metadata from ${runDir}: ${error instanceof Error ? error.message : String(error)}. Continuing with run.json metadata.`,
+		logger.warn(
+			{ runDir, error: error instanceof Error ? error.message : String(error) },
+			"Unable to read plan metadata; continuing with run.json metadata",
 		);
 		return undefined;
 	}
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/cli/compare-command.ts` around lines 54 - 63, The function
readPlanBestEffort currently logs errors with console.warn; replace that with
the project's Pino logger to keep structured logging consistent. Update
readPlanBestEffort to call the shared logger (the same logger used elsewhere in
this module) instead of console.warn, and include the same warning message and
error details (use error instanceof Error ? error.message : String(error)) when
logging; keep behavior of returning undefined after logging and leave readPlan
invocation and try/catch logic intact.

src/results/writer.ts (2)

78-90: Consider atomic write for crash-safety.

Since writePartialResult is designed for crash-safe checkpoints, a non-atomic writeFileSync could still leave a corrupted partial file if the process crashes mid-write. For stronger guarantees, consider writing to a temp file and renaming atomically.

♻️ Proposed atomic write pattern

 export async function writePartialResult(
 	outputDir: string,
 	result: RunResult,
 ): Promise<void> {
 	RunResultSchema.parse(result);

 	const runDir = path.join(outputDir, result.runId);
 	ensureDir(runDir);

 	const partialPath = path.join(runDir, "run.partial.json");
+	const tempPath = path.join(runDir, ".run.partial.json.tmp");
 	const content = JSON.stringify(result, null, 2);
-	fs.writeFileSync(partialPath, content, "utf-8");
+	fs.writeFileSync(tempPath, content, "utf-8");
+	fs.renameSync(tempPath, partialPath);
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/results/writer.ts` around lines 78 - 90, writePartialResult currently
writes run.partial.json with fs.writeFileSync which can leave a corrupted file
if the process crashes mid-write; change it to perform an atomic write by
writing to a temp file in the same directory (e.g., partialPath + ".tmp") and
then atomically renaming it to partialPath using fs.renameSync (or
fs.promises.rename). Ensure you still call ensureDir(runDir) first, write the
JSON content to the temp file and flush/close it before renaming, and keep
RunResultSchema.parse(result) validation unchanged.

39-48: Async function signatures with synchronous fs operations.

All three write functions are declared async but use synchronous writeFileSync. This blocks the event loop during writes. Consider using fs.promises.writeFile for consistency with the async signature, or remove async if sync behavior is intended.

Also applies to: 57-69, 78-90

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/results/writer.ts` around lines 39 - 48, The three async functions
(writePlan, plus the other two write functions) currently use synchronous
fs.writeFileSync which blocks the event loop; update each to use the
asynchronous promise API instead (e.g., replace writeFileSync with await
fs.promises.writeFile or import from "fs/promises" and await writeFile) so the
functions truly operate asynchronously, keep the JSON.stringify and path logic
unchanged, and ensure the functions still return Promise<void> (add awaits where
needed and remove any unused async if you decide to keep sync behavior instead).

src/runner/index.ts (1)

300-323: Consider extracting checkpoint write logic.

The checkpoint write logic (check interval, build snapshot, write, log, update counter) is duplicated in two places. Extract to a helper function to reduce duplication and ensure consistent behavior.

♻️ Proposed helper extraction

+async function maybeWriteCheckpoint(
+	plan: Awaited<ReturnType<typeof buildRunPlan>>,
+	startedAt: string,
+	startTime: number,
+	total: number,
+	results: MatrixItemResult[],
+	lastCheckpointItemCount: number,
+	config: BenchConfig,
+	log: ReturnType<typeof logger.child>,
+): Promise<number> {
+	const itemCount = results.length;
+	const shouldCheckpoint =
+		itemCount === total ||
+		itemCount - lastCheckpointItemCount >= PARTIAL_RESULT_CHECKPOINT_INTERVAL;
+	if (shouldCheckpoint) {
+		const partialSnapshot = buildRunResultSnapshot(plan, startedAt, startTime, total, results);
+		await writePartialResult(config.outputDir, partialSnapshot);
+		log.info(
+			{
+				completedItems: itemCount,
+				totalItems: total,
+				checkpointPath: `${config.outputDir}/${plan.runId}/run.partial.json`,
+			},
+			"Wrote run checkpoint",
+		);
+		return itemCount;
+	}
+	return lastCheckpointItemCount;
+}

Also applies to: 347-370

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/runner/index.ts` around lines 300 - 323, Extract the duplicated
checkpoint logic into a single helper (e.g., checkpointIfNeeded or
writeRunCheckpoint) that accepts the inputs used here: results (or itemCount),
total, lastCheckpointItemCount (by ref/return updated value),
PARTIAL_RESULT_CHECKPOINT_INTERVAL, plan, startedAt, startTime,
config.outputDir, writePartialResult, buildRunResultSnapshot and log; inside the
helper compute shouldCheckpoint (itemCount === total || itemCount -
lastCheckpointItemCount >= PARTIAL_RESULT_CHECKPOINT_INTERVAL), and when true
build the partial snapshot via buildRunResultSnapshot, call writePartialResult,
log the same metadata (completedItems, totalItems, checkpointPath using
config.outputDir and plan.runId), and return the new lastCheckpointItemCount so
callers (the blocks around lines using results/total) update their counter;
replace both duplicated blocks (the current block and the one around lines
347–370) with calls to this helper.

apps/dashboard/src/lib/schemas.ts (1)

1-7: Consider updating the file header exports list.

The header lists RunResultSchema, RunPlanSchema, RunListItemSchema but the file now also exports DashboardCheckpointSummarySchema, DashboardIndexSchema, DashboardIndexLegacyOrV2Schema, and LeaderboardAggregateSchema.

📝 Suggested header update

 /**
  * Purpose: Zod schemas for dashboard API boundary validation.
- * Exports: RunResultSchema, RunPlanSchema, RunListItemSchema
+ * Exports: RunResultSchema, RunPlanSchema, RunListItemSchema, RunListSchema,
+ *          DashboardCheckpointSummarySchema, DashboardIndexSchema,
+ *          DashboardIndexLegacyOrV2Schema, LeaderboardAggregateSchema
  *
  * These schemas validate JSON fetched from the results directory.
  * They mirror the CLI schemas but are kept local to avoid cross-package imports.
  */

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/lib/schemas.ts` around lines 1 - 7, Update the file header
comment to reflect the actual exported schemas: add
DashboardCheckpointSummarySchema, DashboardIndexSchema,
DashboardIndexLegacyOrV2Schema, and LeaderboardAggregateSchema to the exports
list alongside RunResultSchema, RunPlanSchema, and RunListItemSchema; ensure the
header’s "Exports:" line lists these symbol names exactly and adjust any brief
description if needed so the header accurately documents the module’s public
API.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/dashboard/scripts/build-index.ts`:
- Around line 245-247: The log over-counts aggregate files because it
unconditionally adds +1 for latest.json even though latestCheckpointId may
already have been counted into result.aggregatesWritten (see variables
result.aggregatesWritten, latestCheckpointId and checkpoints and the branch that
increments aggregatesWritten when latestCheckpointId is missing); update the
final console.log to report the correct count by using result.aggregatesWritten
as-is (or compute a conditional +1 only when latestCheckpointId was not present)
so the message reflects the true number of written aggregate files.

In `@apps/dashboard/src/lib/api.ts`:
- Around line 162-165: When handling a 404 for aggregates/latest.json, don't
always synthesize an empty aggregate; instead call fetchDashboardIndex() and
inspect the index (e.g., index.runs or index.latestCheckpointId) — if the index
shows existing runs (index.runs.length > 0 or similar), throw an error (or
return a failure) explaining that aggregates are missing for an existing run so
the pipeline/artefact is broken; only call
createEmptyAggregate(index.latestCheckpointId) when the index has no runs.
Update the 404 branch around response.status to implement this guard and include
the index reference (from fetchDashboardIndex()) in the error message for
debugging.

In `@apps/dashboard/src/pages/leaderboard.tsx`:
- Around line 1-4: Add a file-level header comment at the top of
apps/dashboard/src/pages/leaderboard.tsx that includes Purpose (already
present), Exports (list exported symbols such as the default exported
Leaderboard component and any named exports), and Invariants (any assumptions
the module relies on), and then add a TSDoc block above the exported function
(the default exported Leaderboard React component or any other exported function
on lines ~7–9) describing its purpose, param types (props), return value
(JSX.Element), and any thrown errors; update the header and the TSDoc to follow
the project's TSDoc/JSDoc conventions so every exported function and the file
header include purpose, params, returns, and throws where applicable.

In `@llm/implementation/goose-headless-hardening.md`:
- Line 32: Replace the unhyphenated phrase in the sentence "If a model still
frequently emits turn-limit chatter on Goose informed passes, increase retry
turns first." by changing "Goose informed passes" to the hyphenated compound
adjective "Goose-informed passes" so the sentence reads with improved
readability.

In `@src/cli/compare-command.ts`:
- Around line 361-367: The two readResult calls are synchronous but are awaited;
remove the unnecessary await operators on the calls that assign resultA and
resultB so they become: resultA = readResult(dirA) and resultB =
readResult(dirB); keep readPlanBestEffort(dirA) and readPlanBestEffort(dirB)
as-is to maintain consistency with the synchronous API of readResult and avoid
awaiting non-Promise values.

In `@src/cli/run-command.ts`:
- Around line 138-139: Replace the lenient Number.parseInt usage when mapping
options.gooseMaxTurns and options.gooseRetryMaxTurns to
gooseMaxTurns/gooseRetryMaxTurns with strict integer validation: first verify
the raw option strings match an integer pattern (e.g., /^\d+$/ or optional sign
if allowed) and only then convert with parseInt/Number; if the input fails the
check, throw a clear error so malformed inputs like "3x" are rejected rather
than silently coerced. Ensure you update the assignment sites for gooseMaxTurns
and gooseRetryMaxTurns (where options.gooseMaxTurns/options.gooseRetryMaxTurns
are read) to perform this validation and throw on invalid input.

In `@src/lib/scorer-worker.ts`:
- Around line 63-79: The catch block in main can rethrow if writeResponse itself
throws (e.g., validation or stdout errors); update main to guard the error-path
write by wrapping writeResponse({ ok: false, error }) in its own try/catch so
the process cannot crash while reporting an error. Specifically, inside async
function main() (which parses with ScorerWorkerRequestSchema and calls
scoreGenerationInProcess and returns a ScoringResultSchema), after computing
errorMessage do a nested try around writeResponse({ ok: false, error:
errorMessage }) and on failure fallback to a synchronous fallback (e.g.,
console.error or process.stderr.write) and then set process.exitCode = 1; ensure
the success-path still validates result against ScoringResultSchema before
calling writeResponse to avoid triggering the catch unnecessarily.

In `@src/schemas/plan.schema.ts`:
- Around line 62-72: The schema rename from environment → runtimeEnvironment
breaks older plan.json files; update the runner to handle legacy plans by adding
explicit migration or fallback: modify readPlan() to detect an incoming plan
object that contains environment and if runtimeEnvironment is missing, copy or
transform plan.environment into plan.runtimeEnvironment before calling
RunPlanSchema.parse(), or alternatively make RunPlanSchema accept an optional
environment field alongside runtimeEnvironment (mirroring the dashboard) so
legacy files validate; reference the RunPlanSchema and readPlan() symbols and
ensure the chosen migration runs prior to validation and persists the new
runtimeEnvironment field for downstream code.

In `@test/scorer-isolation.test.ts`:
- Around line 15-25: The test teardown/setup incorrectly assigns
process.env.PLEBDEV_BENCH_SCORER_MODE = undefined in beforeEach and afterEach
which yields the string "undefined" and breaks the ORIGINAL_ENV === undefined
check; update the beforeEach and afterEach to remove the variable using delete
process.env.PLEBDEV_BENCH_SCORER_MODE instead, keeping the ORIGINAL_ENV
restoration logic unchanged and referencing the same symbols (beforeEach,
afterEach, ORIGINAL_ENV, process.env.PLEBDEV_BENCH_SCORER_MODE).

---

Nitpick comments:
In `@apps/dashboard/src/components/leaderboard/leaderboard-page.tsx`:
- Around line 44-47: The exported React component LeaderboardPage is missing a
full TSDoc contract; add a complete TSDoc block above the export that documents
the component's purpose, any props (or note none if it accepts none), the return
type (JSX.Element or ReactElement), and any errors/throws behavior (if it can
throw or rely on thrown services), e.g., include `@remarks/`@example as needed to
clarify usage; ensure the TSDoc sits immediately above the function declaration
for LeaderboardPage.

In `@apps/dashboard/src/components/run-detail/run-detail-page.tsx`:
- Around line 194-196: The truncated checkpoint string displayed in the
RunDetailPage component (the <p className="text-xs text-foreground-faint mt-1
truncate"> element rendering checkpointId) should expose the full value on
hover; update that element (where checkpointId is rendered) to add a title
attribute set to checkpointId so users can view/copy the entire ID on hover.

In `@apps/dashboard/src/lib/api.ts`:
- Around line 83-171: Add complete TSDoc blocks for all exported functions in
this file: fetchDashboardIndex, fetchRuns, fetchRun, fetchPlan,
fetchRunWithPlan, and fetchLatestAggregate. For each function include a one-line
description, `@param` entries for parameters (e.g., runId: string) where
applicable, an `@returns` describing the resolved Promise type (e.g.,
Promise<DashboardIndex>, Promise<RunResult>, Promise<{run: RunResult; plan:
RunPlan}>, Promise<LeaderboardAggregate>), and `@throws` describing possible
errors (network failures, 404 fallback behavior for
fetchDashboardIndex/fetchLatestAggregate, and validation failures from
schema.parse). Ensure wording matches existing repo style (purpose, params,
returns, throws) and place the comment immediately above each function
declaration.

In `@apps/dashboard/src/lib/schemas.ts`:
- Around line 1-7: Update the file header comment to reflect the actual exported
schemas: add DashboardCheckpointSummarySchema, DashboardIndexSchema,
DashboardIndexLegacyOrV2Schema, and LeaderboardAggregateSchema to the exports
list alongside RunResultSchema, RunPlanSchema, and RunListItemSchema; ensure the
header’s "Exports:" line lists these symbol names exactly and adjust any brief
description if needed so the header accurately documents the module’s public
API.

In `@src/cli/compare-command.ts`:
- Around line 54-63: The function readPlanBestEffort currently logs errors with
console.warn; replace that with the project's Pino logger to keep structured
logging consistent. Update readPlanBestEffort to call the shared logger (the
same logger used elsewhere in this module) instead of console.warn, and include
the same warning message and error details (use error instanceof Error ?
error.message : String(error)) when logging; keep behavior of returning
undefined after logging and leave readPlan invocation and try/catch logic
intact.

In `@src/harnesses/code-output-policy.ts`:
- Around line 27-32: TURN_LIMIT_PATTERNS and OFF_TASK_PATTERNS both include the
phrase-matching pattern /\bwould you like me to continue\b/i, causing the
turn-limit check (in the turn-limit classification flow that runs before
off-task) to always win; remove the duplicate from one of the arrays or make the
TURN_LIMIT_PATTERNS entry more specific (e.g., require co-occurrence with
"maximum number of actions" or tighten the regex) so the intended category
wins—update the TURN_LIMIT_PATTERNS constant (or OFF_TASK_PATTERNS) accordingly
and run tests to ensure classification precedence matches expectations.

In `@src/harnesses/goose-adapter.ts`:
- Around line 81-84: The current calculation for retryMaxTurns using
Math.max(maxTurns, normalizeTurnLimit(options?.retryMaxTurns,
DEFAULT_GOOSE_RETRY_MAX_TURNS)) silently forces retryMaxTurns to be at least
maxTurns; detect when the normalized options?.retryMaxTurns is less than
maxTurns and emit a debug log noting the requested value and the enforced
adjusted value to aid troubleshooting. Locate the logic around retryMaxTurns,
maxTurns, normalizeTurnLimit and DEFAULT_GOOSE_RETRY_MAX_TURNS in
goose-adapter.ts and add a conditional that compares the normalized
requestedRetry = normalizeTurnLimit(options?.retryMaxTurns,
DEFAULT_GOOSE_RETRY_MAX_TURNS) to maxTurns, log a debug message (including
requestedRetry and final retryMaxTurns) before assigning retryMaxTurns, and then
keep using Math.max to set the final value.

In `@src/lib/scorer-core.ts`:
- Around line 46-74: valuesMatch uses JSON.stringify for deep equality which has
edge cases (key order, undefined vs missing keys, Date/Map/Set); update the
valuesMatch function's JSDoc to document these limitations and when tolerance
applies, referencing the valuesMatch function so future maintainers know that
this comparison is intended for primitives/simple objects and may not handle
complex types or ordering differences.
- Around line 168-176: The current heuristic using factory.prototype.constructor
can misclassify functions; update the detection around the factory variable so
ES6 classes are detected reliably by checking
Function.prototype.toString.call(factory).startsWith("class") and fall back to a
safe attempt: try instantiating with new (wrap new (factory as any)() in a
try/catch) and if it throws a TypeError, call the factory as a function instead.
Apply this change where the factory variable is handled so class constructors
are constructed and non-constructors are invoked safely.

In `@src/lib/scorer.ts`:
- Around line 62-85: The path resolution in scoreInWorker should follow the
repo-wide convention using process.cwd() rather than import.meta.url; ensure the
workerPath in the scoreInWorker function is built with path.join(process.cwd(),
"src", "lib", "scorer-worker.ts") (the same pattern used elsewhere like
benchmark-checkpoint.ts), and avoid switching to import.meta.url here unless you
standardize the approach across the codebase; update any divergent resolution
logic in scoreInWorker or related helpers to match this pattern.

In `@src/results/writer.ts`:
- Around line 78-90: writePartialResult currently writes run.partial.json with
fs.writeFileSync which can leave a corrupted file if the process crashes
mid-write; change it to perform an atomic write by writing to a temp file in the
same directory (e.g., partialPath + ".tmp") and then atomically renaming it to
partialPath using fs.renameSync (or fs.promises.rename). Ensure you still call
ensureDir(runDir) first, write the JSON content to the temp file and flush/close
it before renaming, and keep RunResultSchema.parse(result) validation unchanged.
- Around line 39-48: The three async functions (writePlan, plus the other two
write functions) currently use synchronous fs.writeFileSync which blocks the
event loop; update each to use the asynchronous promise API instead (e.g.,
replace writeFileSync with await fs.promises.writeFile or import from
"fs/promises" and await writeFile) so the functions truly operate
asynchronously, keep the JSON.stringify and path logic unchanged, and ensure the
functions still return Promise<void> (add awaits where needed and remove any
unused async if you decide to keep sync behavior instead).

In `@src/runner/index.ts`:
- Around line 300-323: Extract the duplicated checkpoint logic into a single
helper (e.g., checkpointIfNeeded or writeRunCheckpoint) that accepts the inputs
used here: results (or itemCount), total, lastCheckpointItemCount (by ref/return
updated value), PARTIAL_RESULT_CHECKPOINT_INTERVAL, plan, startedAt, startTime,
config.outputDir, writePartialResult, buildRunResultSnapshot and log; inside the
helper compute shouldCheckpoint (itemCount === total || itemCount -
lastCheckpointItemCount >= PARTIAL_RESULT_CHECKPOINT_INTERVAL), and when true
build the partial snapshot via buildRunResultSnapshot, call writePartialResult,
log the same metadata (completedItems, totalItems, checkpointPath using
config.outputDir and plan.runId), and return the new lastCheckpointItemCount so
callers (the blocks around lines using results/total) update their counter;
replace both duplicated blocks (the current block and the one around lines
347–370) with calls to this helper.

In `@src/schemas/common.schema.ts`:
- Around line 196-212: RunProvenanceSchema currently uses defaults for
verificationStatus and source which will silently populate missing fields;
change the schema to either make these fields required (remove .default(...) so
verificationStatus and source are validated as present) or change them to
optional (use .optional() without defaults) depending on desired behavior.
Locate RunProvenanceSchema and update
VerificationStatusSchema.default("self_reported") and source:
z.string().min(1).default("local_cli") to the chosen alternative (required
without .default or optional via .optional()) and adjust any upstream callers
that relied on the implicit defaults (e.g., validation/creation flows that use
RunProvenanceSchema).

In `@test/aggregate.test.ts`:
- Around line 96-101: The duplicate-key test uses createItem to build olderItem
and newerItem but gives them different ids, which weakens the intent of testing
duplicate (machine+matrix) keys; update the createItem calls for olderItem and
newerItem so they share the same id value (e.g., use "01" for both) in the
aggregate.test.ts test, leaving timestamps and payloads unchanged so the test
explicitly models a newer record replacing an older record; adjust any
expectations/assertions that depend on the id if needed.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b0765413-7512-497b-8967-1158ee27c31e

📥 Commits

Reviewing files that changed from the base of the PR and between e40414c and 5ef106e.

📒 Files selected for processing (49)

README.md
apps/dashboard/public/results/aggregates/chk_sha256v1_517a25afd25b.json
apps/dashboard/public/results/aggregates/latest.json
apps/dashboard/public/results/index.json
apps/dashboard/scripts/build-index.ts
apps/dashboard/src/App.tsx
apps/dashboard/src/components/layout/header.tsx
apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
apps/dashboard/src/components/run-detail/run-detail-page.tsx
apps/dashboard/src/components/run-list/run-card.tsx
apps/dashboard/src/lib/api.ts
apps/dashboard/src/lib/schemas.ts
apps/dashboard/src/lib/tooltip-content.ts
apps/dashboard/src/lib/types.ts
apps/dashboard/src/pages/leaderboard.tsx
llm/implementation/checkpointed-aggregation-phase1-implementation.md
llm/implementation/goose-headless-hardening.md
llm/implementation/review-and-hardening-implementation.md
src/cli/compare-command.ts
src/cli/run-command.ts
src/harnesses/code-output-policy.ts
src/harnesses/goose-adapter.ts
src/harnesses/index.ts
src/lib/benchmark-checkpoint.ts
src/lib/hardware-profile.ts
src/lib/scorer-core.ts
src/lib/scorer-worker.ts
src/lib/scorer.ts
src/lib/stats.ts
src/results/aggregate.ts
src/results/writer.ts
src/runner/index.ts
src/runner/item-executor.ts
src/runner/plan-builder.ts
src/schemas/common.schema.ts
src/schemas/config.schema.ts
src/schemas/index.ts
src/schemas/plan.schema.ts
src/schemas/result.schema.ts
test/aggregate.test.ts
test/benchmark-checkpoint.test.ts
test/build-index.test.ts
test/code-output-policy.test.ts
test/compare-checkpoint-guard.test.ts
test/hardware-profile.test.ts
test/schemas.test.ts
test/scorer-isolation.test.ts
test/stats.test.ts

coderabbitai

Actionable comments posted: 9

🧹 Nitpick comments (3)

apps/dashboard/src/components/run-detail/run-detail-page.tsx (1)

182-186: Avoid rendering partial runtimeEnvironment as undefined text.

This line currently assumes both fields exist; if one is missing, users may see awkward output. Build the display string from available parts only.

Suggested refactor

-						{runtimeEnvironment && (
-							<p className="text-sm text-foreground-faint">
-								{runtimeEnvironment.platform} · Bun{" "}
-								{runtimeEnvironment.bunVersion}
-							</p>
-						)}
+						{runtimeEnvironment &&
+							(runtimeEnvironment.platform || runtimeEnvironment.bunVersion) && (
+								<p className="text-sm text-foreground-faint">
+									{[
+										runtimeEnvironment.platform,
+										runtimeEnvironment.bunVersion
+											? `Bun ${runtimeEnvironment.bunVersion}`
+											: null,
+									]
+										.filter(Boolean)
+										.join(" · ")}
+								</p>
+							)}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/components/run-detail/run-detail-page.tsx` around lines
182 - 186, The current JSX renders runtimeEnvironment.platform and
runtimeEnvironment.bunVersion directly, producing "undefined" if one is missing;
update the render logic in run-detail-page.tsx (the runtimeEnvironment usage in
the paragraph) to build a display string from available parts: create an array
like [runtimeEnvironment?.platform, runtimeEnvironment?.bunVersion], filter out
falsy values, join with " · ", and render the paragraph only when the resulting
string is non-empty so partial data doesn't show "undefined".

apps/dashboard/src/components/leaderboard/leaderboard-page.tsx (1)

59-84: Consider adding fetch cancellation on unmount.

If the component unmounts before the fetch completes, the state setters will be called on an unmounted component. While React 18 handles this gracefully, an AbortController would be cleaner for long-running fetches.

♻️ Optional improvement with AbortController

 useEffect(() => {
+  const controller = new AbortController();
   const fetchData = async () => {
     setLoading(true);
     setError(null);
     try {
       const [dashboardIndex, latestAggregate] = await Promise.all([
         fetchDashboardIndex(),
         fetchLatestAggregate(),
       ]);
+      if (controller.signal.aborted) return;
       setIndex(dashboardIndex);
       setAggregate(latestAggregate);
     } catch (fetchError) {
+      if (controller.signal.aborted) return;
       setError(
         fetchError instanceof Error
           ? fetchError.message
           : "Failed to load leaderboard",
       );
       setIndex(null);
       setAggregate(null);
     } finally {
+      if (!controller.signal.aborted) {
         setLoading(false);
+      }
     }
   };

   void fetchData();
+  return () => controller.abort();
 }, []);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/components/leaderboard/leaderboard-page.tsx` around lines
59 - 84, The useEffect's fetchData should support cancellation: create an
AbortController, pass its signal into fetchDashboardIndex and
fetchLatestAggregate (or wrap their fetch calls to accept a signal), and in the
cleanup call controller.abort(); inside fetchData avoid calling
setLoading/setError/setIndex/setAggregate if signal.aborted (or catch an
AbortError and skip state updates) so state setters aren't invoked after
unmount; ensure references to
useEffect/fetchData/fetchDashboardIndex/fetchLatestAggregate/setLoading/setError/setIndex/setAggregate
are updated accordingly.

src/lib/scorer-core.ts (1)

234-243: Add @throws to exported function TSDoc.

The exported scoreGenerationInProcess docs are missing an explicit throws contract.

📝 Proposed doc patch

 /**
  * Scores generated code against a test's scoring spec in the current process.
  *
  * `@param` testSlug - Test directory name
  * `@param` rawOutput - Raw output from LLM generation
  * `@param` timeoutMs - Timeout for scoring (default: 5s)
  * `@param` codeFilePath - Optional path to code file written by tool-calling harness
  * `@returns` Scoring result with pass/fail counts
+ * `@throws` {Error} Programmer/setup errors outside normal scoring failure paths
  */

As per coding guidelines "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/scorer-core.ts` around lines 234 - 243, The exported function
scoreGenerationInProcess is missing a `@throws` tag in its TSDoc; update its
comment block to include an explicit `@throws` describing the error conditions
(e.g., throws on timeout when scoring exceeds timeoutMs, throws on
invalid/missing scoring spec, and throws on file IO or parsing errors such as
when reading codeFilePath), and ensure the description mentions the error
type/conditions so callers know when to catch or propagate exceptions; keep the
tag adjacent to `@param` and `@returns` in the existing doc block for
scoreGenerationInProcess.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cli/compare-command.ts`:
- Around line 55-67: readPlanBestEffort is currently catching all errors from
readPlan and downgrading parse/validation failures to warnings; change it to
only swallow benign IO/not-found errors and rethrow errors that indicate invalid
JSON/schema/programmer mistakes. Specifically, call readPlan(runDir) inside the
try, but if the caught error is a file-not-found or other IO error (e.g., errno
ENOENT) log the warn and return undefined, whereas if the error is a
SyntaxError, a schema/validation error (the ValidationError type your validator
uses), or any Error whose name/message indicates invalid JSON/schema, rethrow it
instead of returning undefined; reference the readPlanBestEffort and readPlan
functions when making this change so you only suppress IO/missing-file cases and
preserve strict failure for invalid configs.
- Around line 37-45: Replace thrown Error usage in the checkpoint guard inside
compare-command.ts (the block checking checkpointA and checkpointB) with normal
CLI flow: detect missing or mismatched checkpoints in the compare command
handler (e.g., the function that contains the checkpointA/checkpointB checks),
print a clear user-facing message (console.error or logger.error) explaining the
condition and the --allow-cross-checkpoint option, and return/exit the command
handler normally (non-throw) so the process does not treat it as a crash; apply
the same change to the similar guard at the other occurrence (the block around
lines 377-384) so both cases use non-exception control flow for expected user
validation failures.

In `@src/harnesses/goose-adapter.ts`:
- Around line 318-323: The retry warning currently includes raw output via
outputPreview (constructed in retryContext) which can leak model/user content;
instead remove outputPreview and add redaction-safe metadata such as
outputLength (output.length) and an outputFingerprint (e.g., a short SHA256 or
other hash of output) and keep decision.reason and remainingMs/retryMaxTurns;
compute the fingerprint where output is available before creating retryContext
and store those fields in retryContext rather than any raw substring, and do the
same replacement for the other retry warning construction that references
outputPreview (the block that builds retryContext around decision.reason).
- Around line 59-67: normalizeTurnLimit currently silently returns the fallback
for any non-valid numeric input; change it to return the fallback only when
value === undefined and throw a TypeError for invalid values (non-number,
non-integer, or < 1) so misconfiguration fails fast—update normalizeTurnLimit to
perform: if value === undefined => return fallback; else if typeof value !==
"number" || !Number.isInteger(value) || value < 1 => throw new TypeError with a
message including the parameter name and invalid value; otherwise return value.
Apply the same fix to the analogous validator function in this module (the other
normalize/limit function with the same logic) so both validators enforce
fail-fast behavior and include clear error messages.
- Around line 203-209: The code currently includes runtime.baseUrl directly in
the object (runtimeBaseUrl: runtime.baseUrl), which may leak credentials if
embedded in the URL; change the value to a sanitized host-only/origin form
before including it (e.g., derive a safeBase = try { new
URL(runtime.baseUrl).origin } catch { 'REDACTED' } or strip
username/password/search/hash) and assign runtimeBaseUrl to that sanitized value
(keep the property name runtimeBaseUrl to minimize downstream changes, but
ensure the value is origin/host-only and falls back to a redacted string on
parse errors).

In `@src/lib/scorer-core.ts`:
- Around line 266-279: Caller-supplied codeFilePath should not silently fall
back to extractCode; change the logic in this block so that if codeFilePath is
provided but the file does not exist or cannot be read, you throw a clear error
(include the codeFilePath in the message) instead of calling extractCode;
implement this by checking fs.existsSync(codeFilePath) and throwing when false,
and wrapping the fs.promises.readFile in a try/catch to rethrow a descriptive
error on read failures, otherwise set extracted = { code, method: "file" } as
before (ensure extractCode is only used when codeFilePath is not provided).

In `@src/lib/scorer-worker.ts`:
- Around line 75-78: The failure path may write an empty error string which
violates ScorerWorkerResponseSchema; before calling writeResponse in the catch
block ensure the error text is non-empty (use the computed errorMessage from the
catch, fall back to error.constructor?.name, String(error), or a fixed
placeholder like "Unknown error" if message is empty) and then call
writeResponse({ ok: false, error: finalErrorMessage }); update the code around
the errorMessage local and the writeResponse call so finalErrorMessage is
guaranteed to be a non-empty string.

In `@src/results/writer.ts`:
- Around line 101-107: The deletePartialResult function currently uses
fs.existsSync followed by fs.unlinkSync which can race and throw ENOENT; remove
the existsSync check and call fs.unlinkSync(partialPath) inside a try/catch, and
if an error is caught ignore it only when err.code === 'ENOENT' otherwise
rethrow or log — reference the partialPath variable and the deletePartialResult
function to locate where to change the logic.
- Around line 87-92: The current implementation uses a fixed tempPath (`tempPath
= \`\${partialPath}.tmp\``) which can race when concurrent writes occur; change
the temp filename to be unique per invocation (e.g., include process.pid,
Date.now(), and a short random suffix or use a secure random/token) before
writing and renaming so concurrent calls to the writer won't clobber each
other's temp file; update the code paths that reference `tempPath` (the write
and subsequent `fs.promises.rename`) to use the new uniqueTempPath and ensure
any transient error handling/cleanup still applies to the unique temp file.

---

Nitpick comments:
In `@apps/dashboard/src/components/leaderboard/leaderboard-page.tsx`:
- Around line 59-84: The useEffect's fetchData should support cancellation:
create an AbortController, pass its signal into fetchDashboardIndex and
fetchLatestAggregate (or wrap their fetch calls to accept a signal), and in the
cleanup call controller.abort(); inside fetchData avoid calling
setLoading/setError/setIndex/setAggregate if signal.aborted (or catch an
AbortError and skip state updates) so state setters aren't invoked after
unmount; ensure references to
useEffect/fetchData/fetchDashboardIndex/fetchLatestAggregate/setLoading/setError/setIndex/setAggregate
are updated accordingly.

In `@apps/dashboard/src/components/run-detail/run-detail-page.tsx`:
- Around line 182-186: The current JSX renders runtimeEnvironment.platform and
runtimeEnvironment.bunVersion directly, producing "undefined" if one is missing;
update the render logic in run-detail-page.tsx (the runtimeEnvironment usage in
the paragraph) to build a display string from available parts: create an array
like [runtimeEnvironment?.platform, runtimeEnvironment?.bunVersion], filter out
falsy values, join with " · ", and render the paragraph only when the resulting
string is non-empty so partial data doesn't show "undefined".

In `@src/lib/scorer-core.ts`:
- Around line 234-243: The exported function scoreGenerationInProcess is missing
a `@throws` tag in its TSDoc; update its comment block to include an explicit
`@throws` describing the error conditions (e.g., throws on timeout when scoring
exceeds timeoutMs, throws on invalid/missing scoring spec, and throws on file IO
or parsing errors such as when reading codeFilePath), and ensure the description
mentions the error type/conditions so callers know when to catch or propagate
exceptions; keep the tag adjacent to `@param` and `@returns` in the existing doc
block for scoreGenerationInProcess.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 41216622-1181-4ab1-9c23-7dcf6c221fa9

📥 Commits

Reviewing files that changed from the base of the PR and between 5ef106e and edb1bed.

📒 Files selected for processing (17)

apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
apps/dashboard/src/components/run-detail/run-detail-page.tsx
apps/dashboard/src/lib/api.ts
apps/dashboard/src/lib/schemas.ts
apps/dashboard/src/pages/leaderboard.tsx
llm/implementation/goose-headless-hardening.md
src/cli/compare-command.ts
src/cli/run-command.ts
src/harnesses/code-output-policy.ts
src/harnesses/goose-adapter.ts
src/lib/scorer-core.ts
src/lib/scorer-worker.ts
src/results/reader.ts
src/results/writer.ts
test/aggregate.test.ts
test/compare-checkpoint-guard.test.ts
test/scorer-isolation.test.ts

🚧 Files skipped from review as they are similar to previous changes (5)

llm/implementation/goose-headless-hardening.md
apps/dashboard/src/pages/leaderboard.tsx
test/scorer-isolation.test.ts
src/cli/run-command.ts
apps/dashboard/src/lib/schemas.ts

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

apps/dashboard/src/lib/api.ts (1)

140-167: ⚠️ Potential issue | 🟠 Major

Sanitize runId before constructing artifact URLs.

runId is interpolated directly into fetch paths. If it includes /, .., or reserved URL characters, requests can resolve to unintended artifacts. Encode once and reuse in both functions.

Suggested patch

+function buildRunArtifactUrl(
+	runId: string,
+	artifact: "run.json" | "plan.json",
+): string {
+	const safeRunId = encodeURIComponent(runId);
+	return `${RESULTS_BASE}/${safeRunId}/${artifact}`;
+}
+
 export async function fetchRun(
 	runId: string,
 	options?: FetchRequestOptions,
 ): Promise<RunResult> {
-	const response = await fetch(`${RESULTS_BASE}/${runId}/run.json`, {
+	const response = await fetch(buildRunArtifactUrl(runId, "run.json"), {
 		signal: options?.signal,
 	});
@@
 export async function fetchPlan(
 	runId: string,
 	options?: FetchRequestOptions,
 ): Promise<RunPlan> {
-	const response = await fetch(`${RESULTS_BASE}/${runId}/plan.json`, {
+	const response = await fetch(buildRunArtifactUrl(runId, "plan.json"), {
 		signal: options?.signal,
 	});

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/lib/api.ts` around lines 140 - 167, The runId is
interpolated directly into artifact URLs in fetchRun and fetchPlan; to prevent
path traversal or reserved-character issues, create a single sanitized value
(e.g., const safeRunId = encodeURIComponent(runId)) and use safeRunId in the
fetch URL construction in both fetchRun and fetchPlan (replace
`${RESULTS_BASE}/${runId}/...` with `${RESULTS_BASE}/${safeRunId}/...`),
ensuring encoding happens exactly once and is reused.

♻️ Duplicate comments (1)

src/cli/compare-command.ts (1)
451-455: ⚠️ Potential issue | 🟠 Major

Don’t mark checkpoint guard failures as process errors in MVP flow.

This path is expected CLI validation, not a crash, but it sets a non-zero exit code.
Suggested change
 if (checkpointGuardMessage) {
-	console.error(`Error: ${checkpointGuardMessage}`);
-	process.exitCode = 1;
+	console.error(`✗ FAIL ${checkpointGuardMessage}`);
 	return;
 }
As per coding guidelines "CLI should be non-interactive by default; exit non-zero only on crashes (MVP)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/cli/compare-command.ts` around lines 451 - 455, The code treats
checkpoint guard validation (checkpointGuardMessage) as a crash by setting
process.exitCode = 1; instead it should treat it as a user validation message
for the MVP CLI: remove the assignment to process.exitCode (or explicitly set it
to 0) and simply print the error string and return in the handler in
compare-command.ts (the block referencing checkpointGuardMessage) so the CLI
remains non-interactive and only exits non-zero on real crashes.

🧹 Nitpick comments (3)

src/results/writer.ts (1)

33-38: Add missing @returns tags to exported function docs.

All exported functions document params/throws, but @returns is missing. Please add explicit return docs (Promise<void> / void) for guideline compliance.

As per coding guidelines "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".

Also applies to: 50-56, 71-77, 112-118
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/results/writer.ts` around lines 33 - 38, Update the TSDoc blocks for the
exported functions whose docs start with "Writes the run plan to `plan.json`."
and the other exported function blocks at the ranges noted (the blocks whose
first lines are at the other comment ranges) to include explicit `@returns` tags;
for async functions add "@returns {Promise<void>} Resolves when the file has
been written." and for synchronous functions add "@returns {void}." Ensure you
edit the docblock immediately above each exported function (the block describing
the run plan/write behavior and the other two exported write functions
referenced) to add the appropriate `@returns` line so each exported function
documents its return type.

src/cli/compare-command.ts (1)

23-72: Unify checkpoint guard logic to avoid drift.

assertComparableCheckpoints and getCheckpointGuardMessage duplicate the same branching/messages; this can diverge over time.

Refactor direction

 export function assertComparableCheckpoints(
 	checkpointA: string | undefined,
 	checkpointB: string | undefined,
 	allowCrossCheckpoint: boolean,
 ): void {
-	if (allowCrossCheckpoint) return;
-	if (!checkpointA || !checkpointB) {
-		throw new Error(
-			"Checkpoint metadata missing in one or both run artifacts. Re-run with --allow-cross-checkpoint to force compare.",
-		);
-	}
-	if (checkpointA !== checkpointB) {
-		throw new Error(
-			`Checkpoint mismatch: ${checkpointA} vs ${checkpointB}. Re-run with --allow-cross-checkpoint to force compare.`,
-		);
-	}
+	const guardMessage = getCheckpointGuardMessage(
+		checkpointA,
+		checkpointB,
+		allowCrossCheckpoint,
+	);
+	if (guardMessage) throw new Error(guardMessage);
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/cli/compare-command.ts` around lines 23 - 72, The two functions
assertComparableCheckpoints and getCheckpointGuardMessage duplicate the same
branching and error text; consolidate by making one authoritative implementation
(e.g., keep getCheckpointGuardMessage as the single source of truth for the
guard logic and messages) and have assertComparableCheckpoints call it: call
getCheckpointGuardMessage(checkpointA, checkpointB, allowCrossCheckpoint), and
if it returns a string throw new Error(message). Optionally extract the repeated
message strings into constants used by getCheckpointGuardMessage so both throw
and CLI flow use the exact same text.

src/harnesses/goose-adapter.ts (1)

108-118: Consider validating adapter options with Zod at this boundary.

createGooseAdapter(options?) currently validates with normalizeTurnLimit() guards. Per repo guidelines ("Use Zod for schema validation at all boundaries"), create a small Zod schema for GooseAdapterOptions and parse once at entry. This ensures consistency with the rest of the codebase and aligns with the schema validation pattern used throughout harnesses and runtimes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/goose-adapter.ts` around lines 108 - 118, Add Zod-based
validation at the entry of createGooseAdapter: define a small Zod schema for
GooseAdapterOptions (including maxTurns and retryMaxTurns with the same optional
types/defaults), parse the incoming options once (e.g., parsedOptions =
GooseAdapterOptionsSchema.parse(options || {})), and then use parsedOptions when
calling normalizeTurnLimit to compute maxTurns and requestedRetryMaxTurns
instead of using the raw options; update references from options?.maxTurns and
options?.retryMaxTurns to parsedOptions.maxTurns and parsedOptions.retryMaxTurns
and ensure you import Zod and the schema follows existing repo patterns.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/dashboard/src/components/leaderboard/leaderboard-page.tsx`:
- Around line 229-324: The Select components currently used for filtering (the
Select instances bound to filters.machine, filters.runtime, filters.harness,
filters.passType, and filters.test) lack explicit accessible labels; add either
visible <label> elements associated with each Select or add descriptive
aria-label attributes (e.g., aria-label="Machine filter", "Runtime filter",
etc.) on the SelectTrigger or outer Select wrapper so screen readers and
keyboard users retain a clear name after a value is chosen; ensure each label
uses the same identifying symbol (match the Select with value={filters.machine},
value={filters.runtime}, value={filters.harness}, value={filters.passType},
value={filters.test}) and that any visible labels use htmlFor to point to the
corresponding control ID if you add IDs.

In `@src/cli/compare-command.ts`:
- Around line 112-117: The JSDoc above the plan.json reader in
src/cli/compare-command.ts is inaccurate: it claims invalid plans are tolerated
but the implementation throws on invalid JSON/schema; update the JSDoc to
reflect the actual behavior by stating that the function will throw an Error
when plan.json exists but contains invalid JSON or fails schema validation (and
that it returns parsed plan when valid or undefined only for benign
IO/missing-file cases). Ensure the updated doc text references the same function
that reads plan.json so future readers see the correct contract.

In `@src/harnesses/goose-adapter.ts`:
- Around line 119-125: Currently the code silently overrides inconsistent config
by computing retryMaxTurns = Math.max(maxTurns, requestedRetryMaxTurns) and only
logging when requestedRetryMaxTurns < maxTurns; change this to fail fast: if
requestedRetryMaxTurns < maxTurns throw a TypeError (including values of
maxTurns and requestedRetryMaxTurns in the message) instead of computing a
fallback and relying on logger.debug; update any calling code/tests expectations
for retryMaxTurns to use the provided requestedRetryMaxTurns only when it is >=
maxTurns.

In `@src/lib/scorer-core.ts`:
- Around line 390-411: The calls to createInstance wrapped by suppressStdout can
throw and currently escape scoreGenerationInProcess; wrap both calls (the
initial instance creation guarded by if (spec.factoryFn) and the per-test fresh
instance creation inside the loop) in try/catch blocks around suppressStdout(()
=> createInstance(module, spec.factoryFn)), and on catch return a
ScoringResult-like structured failure (same shape currently returned when
!instance) including passed, failed, total, details (exportResults), error
containing the caught error message/stack, extractionMethod: extracted.method,
and failureType: "factory_init_failed"; ensure you reference spec.factoryFn,
createInstance, and suppressStdout when making the change so all exceptions from
user factory/constructor produce the same failure result instead of throwing.

In `@src/results/writer.ts`:
- Around line 93-108: The catch block currently throws cleanupError and drops
the original write/rename failure; instead, ensure the original error (the
variable error in the catch) is always rethrown. Inside the catch, call
fs.unlinkSync(tempPath) in a nested try, and if cleanupError is non-ENOENT do
not throw it standalone—instead attach it to the original error (e.g.,
error.cleanupError or set error.cause = cleanupError or wrap both in an
AggregateError preserving the original as the primary) and then rethrow the
original error variable; update the catch block in writer.ts where tempPath,
cleanupError and error are used so cleanup failures don’t mask the original
failure.

In `@test/compare-checkpoint-guard.test.ts`:
- Around line 1-3: Add a complete file header at the top of
compare-checkpoint-guard.test.ts that includes: the Purpose (already present),
Exports (document that this test file exports nothing or list any exported
helpers), and Invariants (e.g., tests are deterministic, rely only on
mocked/stubbed state, do not modify global state, and must be runnable in CI).
Place the header comment block above the existing JSDoc, clearly naming the
file's exports/invariants so future readers and linters can verify expectations.

---

Outside diff comments:
In `@apps/dashboard/src/lib/api.ts`:
- Around line 140-167: The runId is interpolated directly into artifact URLs in
fetchRun and fetchPlan; to prevent path traversal or reserved-character issues,
create a single sanitized value (e.g., const safeRunId =
encodeURIComponent(runId)) and use safeRunId in the fetch URL construction in
both fetchRun and fetchPlan (replace `${RESULTS_BASE}/${runId}/...` with
`${RESULTS_BASE}/${safeRunId}/...`), ensuring encoding happens exactly once and
is reused.

---

Duplicate comments:
In `@src/cli/compare-command.ts`:
- Around line 451-455: The code treats checkpoint guard validation
(checkpointGuardMessage) as a crash by setting process.exitCode = 1; instead it
should treat it as a user validation message for the MVP CLI: remove the
assignment to process.exitCode (or explicitly set it to 0) and simply print the
error string and return in the handler in compare-command.ts (the block
referencing checkpointGuardMessage) so the CLI remains non-interactive and only
exits non-zero on real crashes.

---

Nitpick comments:
In `@src/cli/compare-command.ts`:
- Around line 23-72: The two functions assertComparableCheckpoints and
getCheckpointGuardMessage duplicate the same branching and error text;
consolidate by making one authoritative implementation (e.g., keep
getCheckpointGuardMessage as the single source of truth for the guard logic and
messages) and have assertComparableCheckpoints call it: call
getCheckpointGuardMessage(checkpointA, checkpointB, allowCrossCheckpoint), and
if it returns a string throw new Error(message). Optionally extract the repeated
message strings into constants used by getCheckpointGuardMessage so both throw
and CLI flow use the exact same text.

In `@src/harnesses/goose-adapter.ts`:
- Around line 108-118: Add Zod-based validation at the entry of
createGooseAdapter: define a small Zod schema for GooseAdapterOptions (including
maxTurns and retryMaxTurns with the same optional types/defaults), parse the
incoming options once (e.g., parsedOptions =
GooseAdapterOptionsSchema.parse(options || {})), and then use parsedOptions when
calling normalizeTurnLimit to compute maxTurns and requestedRetryMaxTurns
instead of using the raw options; update references from options?.maxTurns and
options?.retryMaxTurns to parsedOptions.maxTurns and parsedOptions.retryMaxTurns
and ensure you import Zod and the schema follows existing repo patterns.

In `@src/results/writer.ts`:
- Around line 33-38: Update the TSDoc blocks for the exported functions whose
docs start with "Writes the run plan to `plan.json`." and the other exported
function blocks at the ranges noted (the blocks whose first lines are at the
other comment ranges) to include explicit `@returns` tags; for async functions add
"@returns {Promise<void>} Resolves when the file has been written." and for
synchronous functions add "@returns {void}." Ensure you edit the docblock
immediately above each exported function (the block describing the run
plan/write behavior and the other two exported write functions referenced) to
add the appropriate `@returns` line so each exported function documents its return
type.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7679f67e-2235-4f6b-b36a-f4f1619c18f1

📥 Commits

Reviewing files that changed from the base of the PR and between edb1bed and 34d318b.

📒 Files selected for processing (9)

apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
apps/dashboard/src/components/run-detail/run-detail-page.tsx
apps/dashboard/src/lib/api.ts
src/cli/compare-command.ts
src/harnesses/goose-adapter.ts
src/lib/scorer-core.ts
src/lib/scorer-worker.ts
src/results/writer.ts
test/compare-checkpoint-guard.test.ts

🚧 Files skipped from review as they are similar to previous changes (1)

apps/dashboard/src/components/run-detail/run-detail-page.tsx

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/harnesses/goose-adapter.ts (1)
2-4: ⚠️ Potential issue | 🟡 Minor

Update the file header export list.

Line 3 says Exports: createGooseAdapter, but this file now also exports GooseAdapterOptions (Line 46). Please keep the header accurate.

As per coding guidelines "**/*.{ts,tsx,js,jsx}: Every file begins with a short header documenting purpose, exports, and invariants".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/goose-adapter.ts` around lines 2 - 4, Update the file header
comment to list all exported symbols: modify the Exports line to include both
createGooseAdapter and GooseAdapterOptions so the header accurately reflects the
module API; locate the top-of-file header containing "Purpose: Goose CLI adapter
implementing the Harness interface." and the current "Exports:
createGooseAdapter" text and change it to "Exports: createGooseAdapter,
GooseAdapterOptions".

♻️ Duplicate comments (1)

src/lib/scorer-core.ts (1)

429-436: ⚠️ Potential issue | 🟠 Major

Per-test factory init failure currently drops accumulated test evidence.

At Line 435, returning immediately from inside the loop discards prior testResults and stops collecting evidence for remaining cases. Record this case as failed and continue, or include accumulated test results in the returned failure payload.

Proposed fix (continue matrix with structured per-test failure)

 			if (spec.freshInstancePerTest && spec.factoryFn) {
 				try {
 					instance = suppressStdout(() =>
 						createInstance(module, spec.factoryFn),
 					);
 				} catch (error) {
-					return buildFactoryInitFailure(error);
+					testResults.push({
+						name: testCase.description ?? testCase.fn,
+						passed: false,
+						expected: testCase.expected,
+						error: `Factory init failed: ${
+							error instanceof Error
+								? (error.stack?.trim() ?? error.message)
+								: String(error)
+						}`,
+					});
+					continue;
 				}
 			}

Based on learnings "Never implicitly fix up results after a run; capture enough evidence to explain outcomes (test failures, eval reasoning, durations, best-effort metrics)".

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/lib/scorer-core.ts` around lines 429 - 436, The early return inside the
per-case loop when createInstance fails (inside the spec.freshInstancePerTest &&
spec.factoryFn branch using suppressStdout and createInstance) discards
accumulated testResults; change this to record the failure into the existing
testResults collection (e.g., convert buildFactoryInitFailure(error) into a
per-test failure record and push it onto testResults with relevant
metadata/duration/evidence) and then continue the loop so remaining cases run;
ensure the final payload construction uses the full testResults array instead of
an immediate return value so accumulated evidence is preserved.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/dashboard/src/lib/api.ts`:
- Around line 2-4: The module header's export list is incomplete—add the missing
exported symbol fetchRunWithPlan to the header comment so the top-of-file
documentation matches the actual public API; locate the header block at the top
of apps/dashboard/src/lib/api.ts and append "fetchRunWithPlan" to the Exports
line alongside fetchRuns, fetchRun, fetchPlan, fetchDashboardIndex,
fetchLatestAggregate.

In `@src/cli/compare-command.ts`:
- Around line 31-148: The file header comment at the top is stale and must be
updated to list the newly exported helpers; update the header's "exports" line
to include assertComparableCheckpoints, readPlanBestEffort, and
resolveCheckpointId (and any other newly exported symbols), and confirm the
header still documents purpose and invariants for the file; locate the header
block at the top of the file and amend its exports list to exactly match the
current public exports.

In `@src/harnesses/goose-adapter.ts`:
- Around line 54-57: GooseAdapterOptionsSchema currently uses z.object which
strips unknown keys by default; update the schema for GooseAdapterOptionsSchema
to reject unknown/typo'd config keys by making the schema strict (e.g., call
.strict() on the z.object) so invalid keys like retryMaxTurn will cause
validation errors instead of being silently ignored.

In `@src/lib/scorer-core.ts`:
- Around line 180-186: The catch is too broad: narrow the TypeError check in the
factory invocation so you only retry with `new` when the error message matches
the JS class-constructor-without-new pattern (e.g. "Class constructor <Name>
cannot be invoked without 'new'"). Update the catch in the block that calls
`factory` (the code that currently does `(factory as () => unknown)()` and then
`new (factory as new () => unknown)()`) to test both `error instanceof
TypeError` and that `(error as Error).message` matches the
class-constructor-without-new text, otherwise rethrow the original error
unchanged.
- Around line 200-212: The importWithTimeout function currently calls import()
with a raw filesystem path which fails on Windows; convert the filesystem path
to a file URL using pathToFileURL from 'node:url' (e.g., const fileUrl =
pathToFileURL(filePath)), add the cache-busting param via
fileUrl.searchParams.set('t', String(Date.now())), and then call
import(fileUrl.toString()) (or import(fileUrl)) instead of string-concatenating
the path and query; update the import call inside importWithTimeout (and any use
of cacheBuster) accordingly so Windows paths are handled correctly.

---

Outside diff comments:
In `@src/harnesses/goose-adapter.ts`:
- Around line 2-4: Update the file header comment to list all exported symbols:
modify the Exports line to include both createGooseAdapter and
GooseAdapterOptions so the header accurately reflects the module API; locate the
top-of-file header containing "Purpose: Goose CLI adapter implementing the
Harness interface." and the current "Exports: createGooseAdapter" text and
change it to "Exports: createGooseAdapter, GooseAdapterOptions".

---

Duplicate comments:
In `@src/lib/scorer-core.ts`:
- Around line 429-436: The early return inside the per-case loop when
createInstance fails (inside the spec.freshInstancePerTest && spec.factoryFn
branch using suppressStdout and createInstance) discards accumulated
testResults; change this to record the failure into the existing testResults
collection (e.g., convert buildFactoryInitFailure(error) into a per-test failure
record and push it onto testResults with relevant metadata/duration/evidence)
and then continue the loop so remaining cases run; ensure the final payload
construction uses the full testResults array instead of an immediate return
value so accumulated evidence is preserved.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eef2ce14-5f7b-4597-ba0d-511874fad691

📥 Commits

Reviewing files that changed from the base of the PR and between 34d318b and cc5555b.

📒 Files selected for processing (7)

apps/dashboard/src/components/leaderboard/leaderboard-page.tsx
apps/dashboard/src/lib/api.ts
src/cli/compare-command.ts
src/harnesses/goose-adapter.ts
src/lib/scorer-core.ts
src/results/writer.ts
test/compare-checkpoint-guard.test.ts

🚧 Files skipped from review as they are similar to previous changes (1)

test/compare-checkpoint-guard.test.ts

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

apps/dashboard/src/lib/api.ts (1)
99-109: Validate synthesized fallback payloads with Zod before returning.

Both 404 fallback paths return hand-built objects directly. Parsing these with the corresponding schemas keeps runtime invariants aligned with schema evolution (including refinements/default expectations).
Suggested refactor
 if (response.status === 404) {
   console.warn(
     "No runs index found. Run `bun dashboard:index` to generate apps/dashboard/public/results/index.json.",
   );
-  return {
+  const emptyIndex = {
     schemaVersion: 2,
     generatedAt: new Date(0).toISOString(),
     latestCheckpointId: null,
     runs: [],
     checkpoints: [],
   };
+  return normalizeDashboardIndex(DashboardIndexLegacyOrV2Schema.parse(emptyIndex));
 }
- return createEmptyAggregate(index.latestCheckpointId);
+ return LeaderboardAggregateSchema.parse(
+   createEmptyAggregate(index.latestCheckpointId),
+ );
As per coding guidelines, "Validate at boundaries with Zod and pass typed data inward".

Also applies to: 217-217
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/lib/api.ts` around lines 99 - 109, Replace the raw
hand-built 404 fallback objects with values validated by the corresponding Zod
schema: build the fallback payload and run it through the RunsIndexSchema (or
the actual schema variable used for the endpoint) using parse() or safeParse()
before returning; if safeParse fails, log/throw a clear error so you don't
return invalid data. Apply the same change to the other 404 fallback (the one
referenced around the second occurrence) so both paths validate with Zod before
returning.
src/harnesses/goose-adapter.ts (1)
110-116: Document config-validation throws in createGooseAdapter TSDoc.

createGooseAdapter can now throw during option parsing/normalization (before execution). Please include that throw path in @throws for API accuracy.

As per coding guidelines "All exported functions require TSDoc/JSDoc documentation including purpose, params, returns, and throws".

Also applies to: 117-132
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/harnesses/goose-adapter.ts` around lines 110 - 116, Update the TSDoc for
createGooseAdapter to document that it can throw during options
parsing/validation/normalization (in addition to runtime Goose execution,
timeouts, and output setup); specifically add a `@throws` entry referencing
validation/config errors thrown while normalizing the provided options (e.g.,
during the internal parse/normalize step used by createGooseAdapter) so callers
know the function may throw before execution begins.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/cli/compare-command.ts`:
- Around line 452-454: The checkpoint guard branch that currently does
console.error(`Error: ${checkpointGuardMessage}`) should be changed to emit an
explicit fail label and symbol (e.g., `✗ FAIL:`) so output follows CLI PASS/FAIL
convention; update the block where checkpointGuardMessage is checked in
compare-command.ts (the code handling checkpointGuardMessage) to print something
like `✗ FAIL: ${checkpointGuardMessage}` instead of the plain `Error:` string,
preserving use of console.error and the original message text.

In `@src/harnesses/goose-adapter.ts`:
- Around line 117-119: The function createGooseAdapter currently uses
GooseAdapterOptionsSchema.parse(options ?? {}) which silently treats null as an
empty/default config; instead, pass the original options through validation so
invalid values (including null) throw: update the call in createGooseAdapter to
validate the raw options (or add an explicit check like if (options === null)
throw) and remove the `?? {}` fallback so GooseAdapterOptionsSchema.parse will
fail on null and force callers to provide a valid config.

---

Nitpick comments:
In `@apps/dashboard/src/lib/api.ts`:
- Around line 99-109: Replace the raw hand-built 404 fallback objects with
values validated by the corresponding Zod schema: build the fallback payload and
run it through the RunsIndexSchema (or the actual schema variable used for the
endpoint) using parse() or safeParse() before returning; if safeParse fails,
log/throw a clear error so you don't return invalid data. Apply the same change
to the other 404 fallback (the one referenced around the second occurrence) so
both paths validate with Zod before returning.

In `@src/harnesses/goose-adapter.ts`:
- Around line 110-116: Update the TSDoc for createGooseAdapter to document that
it can throw during options parsing/validation/normalization (in addition to
runtime Goose execution, timeouts, and output setup); specifically add a `@throws`
entry referencing validation/config errors thrown while normalizing the provided
options (e.g., during the internal parse/normalize step used by
createGooseAdapter) so callers know the function may throw before execution
begins.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5f469759-66f7-4c82-a66c-9f9f80c8e586

📥 Commits

Reviewing files that changed from the base of the PR and between cc5555b and 97de640.

📒 Files selected for processing (4)

apps/dashboard/src/lib/api.ts
src/cli/compare-command.ts
src/harnesses/goose-adapter.ts
src/lib/scorer-core.ts

🚧 Files skipped from review as they are similar to previous changes (1)

src/lib/scorer-core.ts

AustinKelsay added 4 commits March 4, 2026 17:02

Harden long-run stability for long benchmark runs

4c4b111

Add checkpoint write logs and split scoring timing metrics

a21f990

Harden Goose headless retries and turn-limit handling

a761292

Add checkpointed aggregation and leaderboard support

5ef106e

vercel Bot deployed to Preview March 5, 2026 17:54 View deployment