Staging #19 · Conversation
Implement canonical machine profiles
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings, and manage or resume reviews with the usual CodeRabbit review commands or the comment checkboxes.
📝 Walkthrough

Standardizes machine identity into per-instance identifiers and normalized hardware profiles.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant CLI as CLI
participant Instance as InstanceID Resolver
participant Probe as Observed Hardware Probe
participant Normalizer as Profile Normalizer
participant Writer as Publisher / Indexer
CLI->>Instance: resolveMachineInstanceId(opts)
Instance-->>CLI: {instanceId, instanceIdSource}
CLI->>Probe: collectObservedHardwareProfile()
Probe-->>Normalizer: observedHardware
Normalizer-->>CLI: {normalizedProfile, profileKey, profileLabel}
CLI->>Writer: write sanitized run/plan (include instance/profile)
Writer-->>CLI: published run.json / plan.json / index.json
```
```mermaid
sequenceDiagram
participant Migrate as migrate-machine-profiles
participant FS as File system reader
participant Legacy as Legacy normalizer
participant Schema as Zod validation
participant Atomic as Atomic swap writer
participant Indexer as Dashboard indexer
Migrate->>FS: discover run dirs
loop per-run
FS->>FS: read run.json / plan.json
FS->>Legacy: normalizeKnownRunPayload / Plan
Legacy->>Schema: validate against RunResult/RunPlan schemas
Schema-->>Legacy: validated payload
Legacy-->>Atomic: stage changed files (temp + backup)
Atomic->>Atomic: rename temp -> orig (with rollback on failure)
end
alt --rebuild-dashboard-index
Migrate->>Indexer: buildDashboardIndexArtifacts(source, outDir)
Indexer-->>Migrate: index.json + aggregates
end
Migrate-->>Migrate: exit(0)
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 7
🧹 Nitpick comments (8)
apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx (1)
23-39: Extensive fallback chain for machine badge label.

The fallback order ensures backwards compatibility across schema versions. However, including `machineInstanceId` (line 32) in a display label fallback may show technical identifiers like `legacy_profile:anon_6fe55c56c31d` to users. Consider whether this is intentional or if `machineInstanceId` should be excluded from the display fallback chain (keeping it only as internal tracking).

🔧 Optional: Remove machineInstanceId from display fallback

```diff
 function getRunMachineBadgeLabel(run: RunListItem): string | undefined {
   return (
     run.machineDisplayLabel?.trim() ||
-    run.machineInstanceId?.trim() ||
     run.machineLabel?.trim() ||
     run.machineProfileLabel?.trim() ||
     run.machineProfileKey?.trim() ||
     run.machineProfileId?.trim() ||
     undefined
   );
 }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx` around lines 23 - 39, The current getRunMachineBadgeLabel function includes machineInstanceId in the display fallback chain which can expose technical IDs; remove machineInstanceId from the returned fallback sequence so the function only uses human-friendly fields (machineDisplayLabel, machineLabel, machineProfileLabel, machineProfileKey, machineProfileId) and leave machineInstanceId for internal tracking elsewhere.

src/results/aggregate.ts (1)
299-301: Resolve run metadata once and reuse it.
`resolveRunMetadata()` now runs in the checkpoint filter, again inside the main loop, and again for `summary.instances`. Since it can migrate machine payloads, materializing `{ input, metadata }` once after the filter would avoid repeated normalization and make the new instance-count path easier to follow.

Also applies to: 307-317, 412-416
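A minimal sketch of the suggested shape, with the resolver passed in for illustration (the names are placeholders, not the actual aggregate.ts signatures):

```typescript
interface RunInput {
  checkpointId: string;
}

interface RunMetadata {
  instanceId: string;
}

// Resolve metadata exactly once per run, then reuse the { input, metadata }
// pair in the checkpoint filter, the main loop, and summary.instances.
function materializeRuns(
  runs: RunInput[],
  resolveRunMetadata: (input: RunInput) => RunMetadata,
): Array<{ input: RunInput; metadata: RunMetadata }> {
  return runs.map((input) => ({ input, metadata: resolveRunMetadata(input) }));
}

// Distinct-instance counting then reads the already-resolved metadata.
function countDistinctInstances(
  resolved: Array<{ input: RunInput; metadata: RunMetadata }>,
  checkpointId: string,
): number {
  const matched = resolved.filter(({ input }) => input.checkpointId === checkpointId);
  return new Set(matched.map(({ metadata }) => metadata.instanceId)).size;
}
```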
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/results/aggregate.ts` around lines 299 - 301, The code repeatedly calls resolveRunMetadata(input) in the checkpoint filter, later in the main loop and when building summary.instances; instead, map the original runs array to a materialized array of { input, metadata } by calling resolveRunMetadata once per run (use this mapped array when computing matchedRuns by checkpointId, iterating the main aggregation loop, and constructing summary.instances), then update all references that expect the original input to use the stored metadata (e.g., places currently invoking resolveRunMetadata(input) inside the main loop and in summary.instances) so normalization/migration occurs a single time per run and the instance-count path is simplified.

src/schemas/config.schema.ts (1)
96-103: Strip deprecated aliases from the parsed config shape.

Because `machineProfileId` and `machineLabel` remain in `BenchConfigObjectSchema`, the parsed `BenchConfig` type still exposes deprecated fields to internal code. Normalize them here, then drop them so only `machineInstanceId` and `machineDisplayLabel` flow inward.

As per coding guidelines: Validate at boundaries with Zod and pass typed data inward.

Also applies to: 153-163
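A sketch of the alias-stripping step, assuming the migration works on a plain object; the helper name is illustrative and only the four machine fields from this comment are shown:

```typescript
// Copy deprecated keys onto the canonical fields, then delete them so the
// parsed BenchConfig only exposes machineInstanceId / machineDisplayLabel.
function migrateMachineAliases(raw: Record<string, unknown>): Record<string, unknown> {
  const config = { ...raw };
  if (config.machineInstanceId === undefined && typeof config.machineProfileId === "string") {
    config.machineInstanceId = config.machineProfileId;
  }
  if (config.machineDisplayLabel === undefined && typeof config.machineLabel === "string") {
    config.machineDisplayLabel = config.machineLabel;
  }
  delete config.machineProfileId;
  delete config.machineLabel;
  return config;
}
```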
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/schemas/config.schema.ts` around lines 96 - 103, Normalize deprecated aliases before returning the parsed config: after mapping machineProfileId -> machineInstanceId and machineLabel -> machineDisplayLabel, delete the deprecated keys (machineProfileId and machineLabel) from the config object so the returned BenchConfig only exposes machineInstanceId and machineDisplayLabel; apply the same removal logic in the other parsing branch that handles the alternate mapping (the block mirroring the code around machineInstanceId/machineDisplayLabel) and ensure this aligns with BenchConfigObjectSchema validation boundaries.

test/aggregate.test.ts (1)
197-198: Add a same-instance regression case for the new counters.

These assertions only prove accumulation across two different IDs. A regression that counts runs instead of distinct `instanceId`s would still pass. Please add a sibling case where both runs share the same `instanceId` and assert `summary.instances === 1` / `machine.instanceCount === 1`.

Also applies to: 222-223
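A shape sketch of the missing case, assuming `bun:test` and fixture helpers equivalent to what this suite already uses (the helper names here are hypothetical placeholders):

```typescript
import { describe, expect, it } from "bun:test";

// Hypothetical stand-ins for this suite's existing run fixture and aggregator.
declare function makeRun(overrides: { instanceId: string }): unknown;
declare function aggregateRuns(runs: unknown[]): {
  summary: { instances: number };
  machines: Array<{ instanceCount: number }>;
};

describe("instance counters", () => {
  it("collapses two runs sharing an instanceId into one instance", () => {
    const aggregate = aggregateRuns([
      makeRun({ instanceId: "inst-shared" }),
      makeRun({ instanceId: "inst-shared" }),
    ]);
    expect(aggregate.summary.instances).toBe(1);
    expect(aggregate.machines[0]?.instanceCount).toBe(1);
  });
});
```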
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/aggregate.test.ts` around lines 197 - 198, Add a sibling regression test where two runs share the same instanceId and then assert distinct-instance counters collapse to one: create an aggregate from two records with identical instanceId and verify aggregate.summary.instances === 1 and aggregate.machines[0]?.instanceCount === 1 (and mirror the same change for the assertions around the other case mentioned at the second location). This ensures the counters count unique instanceId values rather than run records.

src/cli/migrate-machine-command.ts (2)
114-145: Rollback may leave partial state if backup creation fails mid-loop.

If `fs.copyFile(rewrite.artifactPath, rewrite.backupPath)` fails for the second file after succeeding for the first, the rollback loop attempts to restore from backups. However, the second backup doesn't exist, so only the first file gets "restored" (from an identical backup). This is safe but the error messaging could be clearer.

Additionally, there's a subtle issue: if the rename loop (lines 119-121) fails on the second file after the first rename succeeds, the rollback attempts `copyFile(backupPath, artifactPath)`, but the first file's original content is in `backupPath` while the artifact already has `nextContent`. The rollback correctly restores this.

Consider adding tracking for which operations completed
The current implementation is functionally correct for rollback, but tracking which backup/rename operations completed would allow more precise error messages and cleanup.
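For illustration, a minimal flag-tracking sketch, assuming each staged rewrite records artifactPath/backupPath plus a temp path; the flag names and `tempPath` field are placeholders rather than the actual migrate-machine-command.ts shape:

```typescript
import { promises as fs } from "node:fs";

interface StagedRewrite {
  artifactPath: string;
  backupPath: string;
  tempPath: string;
  backupCreated: boolean; // set right after fs.copyFile to backupPath succeeds
  renameCompleted: boolean; // set right after fs.rename of the temp file succeeds
}

// Undo only the steps that actually completed, so a mid-loop failure never
// "restores" from a backup that was never written.
async function rollbackStagedRewrites(staged: StagedRewrite[]): Promise<void> {
  for (const rewrite of staged) {
    if (rewrite.backupCreated) {
      await fs.copyFile(rewrite.backupPath, rewrite.artifactPath);
    }
    // Temp files are safe to clear whether or not the rename ever ran.
    await fs.rm(rewrite.tempPath, { force: true });
  }
}
```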
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli/migrate-machine-command.ts` around lines 114 - 145, The rollback can attempt operations for files that never had backups or were never renamed; update the logic around stagedRewrites to track which steps completed (e.g., mark per-rewrite flags like backupCreated and renameCompleted while executing fs.writeFile, fs.copyFile, and fs.rename) and use those flags in the catch and finally blocks so you only attempt fs.copyFile(backupPath, artifactPath) for entries with backupCreated and only attempt fs.rename or fs.unlink for entries for which the corresponding temp/backup were actually created; also improve thrown errors and processLogger/error messages to include the rewrite identifier and which operation failed to make rollback outcomes clearer (referencing stagedRewrites, fs.copyFile, fs.writeFile, fs.rename, fs.unlink).
222-228: Redundant validation after Zod parse.

The `dashboardOutputDir === undefined` check at line 224 is redundant since `MigrateMachineCommandOptionsSchema.parse()` at line 214 already validates this via `superRefine`. After successful parsing, `dashboardOutputDir` is guaranteed to be defined when `rebuildDashboardIndex` is true.

Remove redundant check

```diff
 if (parsedOptions.rebuildDashboardIndex) {
   const dashboardOutputDirInput = parsedOptions.dashboardOutputDir;
-  if (dashboardOutputDirInput === undefined) {
-    throw new Error(
-      "--rebuild-dashboard-index requires --dashboard-output-dir to avoid mutating the source results directory",
-    );
-  }
-  const dashboardOutputDir = path.resolve(dashboardOutputDirInput);
+  // dashboardOutputDir is guaranteed defined by schema validation
+  const dashboardOutputDir = path.resolve(dashboardOutputDirInput!);
```

Alternatively, you could use a discriminated union or refine the schema to make TypeScript aware that `dashboardOutputDir` is defined when `rebuildDashboardIndex` is true.
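A hedged sketch of that alternative, assuming Zod and only the two options named here; it also assumes a Zod version that accepts boolean literal discriminators (otherwise a plain `z.union` of the same two objects infers the same type):

```typescript
import { z } from "zod";

// dashboardOutputDir becomes required at the type level whenever
// rebuildDashboardIndex is true, so no runtime re-check is needed.
const MigrateMachineOptionsSketch = z.discriminatedUnion("rebuildDashboardIndex", [
  z.object({
    rebuildDashboardIndex: z.literal(true),
    dashboardOutputDir: z.string().min(1),
  }),
  z.object({
    rebuildDashboardIndex: z.literal(false),
    dashboardOutputDir: z.string().min(1).optional(),
  }),
]);

// After parse, narrowing on rebuildDashboardIndex === true makes
// dashboardOutputDir a plain string with no non-null assertion.
type MigrateMachineOptionsSketchType = z.infer<typeof MigrateMachineOptionsSketch>;
```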
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/cli/migrate-machine-command.ts` around lines 222 - 228, The runtime check for dashboardOutputDir being undefined is redundant because MigrateMachineCommandOptionsSchema.parse() already enforces this when parsedOptions.rebuildDashboardIndex is true; remove the if (dashboardOutputDirInput === undefined) { throw ... } block and directly use parsedOptions.dashboardOutputDir (or add a non-null assertion where used) so TypeScript/Zod guarantees are relied upon instead of duplicating validation in migrate-machine-command.ts around parsedOptions.rebuildDashboardIndex and parsedOptions.dashboardOutputDir.

src/cli/run-command.ts (1)
211-226: Redundant reassignment of options properties.

The `options.machineInstanceId` and `options.machineDisplayLabel` reassignments appear unnecessary since `resolvedMachineId` and `resolvedMachineLabel` are already used directly in `configInput` at lines 250-251. The `options` object mutations aren't consumed elsewhere.

Consider removing the redundant reassignments

```diff
 if (legacyMachineId) {
   logger.warn(
     "Warning: --machine-id is deprecated; use --machine-instance-id",
   );
-  options.machineInstanceId = resolvedMachineId;
-} else if (canonicalMachineId) {
-  options.machineInstanceId = canonicalMachineId;
 }
 if (legacyMachineLabel) {
   logger.warn(
     "Warning: --machine-label is deprecated; use --machine-display-label",
   );
-  options.machineDisplayLabel = resolvedMachineLabel;
-} else if (canonicalMachineLabel) {
-  options.machineDisplayLabel = canonicalMachineLabel;
 }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/cli/run-command.ts` around lines 211 - 226, In run-command.ts the branches that set options.machineInstanceId and options.machineDisplayLabel (inside the legacy/canonical checks for legacyMachineId, canonicalMachineId, legacyMachineLabel, canonicalMachineLabel) are redundant because resolvedMachineId and resolvedMachineLabel are used directly when building configInput; remove the options mutations while keeping the deprecation logger.warn calls, and verify nothing else reads options.machineInstanceId or options.machineDisplayLabel elsewhere (search for usages of options, configInput, resolvedMachineId, resolvedMachineLabel) so behavior is unchanged.

src/lib/machine-profile/probe.ts (1)
1-539: Split the platform-specific probing/parsing out of this module.

This file is 539 lines and currently owns command execution, parsers, dedupe, and per-platform collection. Extracting the OS-specific collectors/parsers into sibling modules will keep this `src/**/*` file under the repo limit and make regressions easier to test in isolation.
src/**/*file under the repo limit and make regressions easier to test in isolation.As per coding guidelines, "Keep files under 500 lines; split aggressively by responsibility."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe.ts` around lines 1 - 539, The file is too large and mixes OS-specific probes/parsers with shared helpers; split responsibilities by moving platform-specific collectors and parsers into sibling modules. Extract collectMacosAccelerators and parseMacosAccelerators to a macos module, collectLinuxAccelerators plus parseNvidiaAccelerators/parseRocmAccelerators/parseLspciAccelerators to a linux module, and collectWindowsAccelerators plus parseWindowsAccelerators to a windows module; keep shared utilities like runProbe, parsePositiveInt, parseMemoryBytes, dedupeAccelerators, BYTES_PER_* constants and collectObservedHardwareProfile (or have it import the collectors) in the current probe module, update imports/exports to use the new named exports (e.g., parseMacosAccelerators, collectLinuxAccelerators), and adjust tests to import the new modules so the top-level file falls under 500 lines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/scripts/build-index.ts`:
- Around line 431-436: Add a guard to reject overlapping source/output trees:
resolve both outputResultsDir and sourceResultsDir to absolute paths and compute
path.relative(sourceResultsDir, outputResultsDir) and
path.relative(outputResultsDir, sourceResultsDir); if either relative result is
'' (same dir) or does not start with '..' (meaning one is inside the other),
throw an Error and abort before calling rm() in writePublishedRunBundles and
before performing the scan of sourceResultsDir (the code that iterates
subdirectories under sourceResultsDir). This prevents outputResultsDir from
equaling or being a child of sourceResultsDir (and vice versa).
In `@apps/dashboard/src/lib/schemas.ts`:
- Around line 135-147: HardwareProfileSchema is missing the invariant checks
linking acceleratorDetection.status and accelerators; add a superRefine on
HardwareProfileSchema that reads acceleratorDetection.status and enforces: if
status === "detected" then accelerators.length > 0 (add an error on
"accelerators" when empty), and if status === "none_detected" then
accelerators.length === 0 (add an error on "accelerators" when not empty); use
HardwareProfileSchema.superRefine to perform these validations so the dashboard
schema matches the CLI's src/schemas/common.schema.ts invariants.
In `@README.md`:
- Around line 149-150: The example for --machine-instance-id uses a
hardware-like identifier ("mac-mini-m4-pro") which can be reused across multiple
hosts; update the README example command to use a clearly instance-scoped ID
(e.g., "austin-office-mac-mini") and adjust the --machine-display-label if
desired so the sample demonstrates a unique per-machine instance usage (edit the
example line that contains the bun pb invocation with --machine-instance-id and
--machine-display-label).
In `@src/lib/machine-profile/legacy.ts`:
- Around line 142-143: The code currently calls
MachineProfileSchema.safeParse(rawMachine) but ignores the parsed result and
passes rawMachine into normalizeCurrentMachineProfile; change this to capture
the safeParse result (e.g., const parsed =
MachineProfileSchema.safeParse(rawMachine)) and, when parsed.success is true,
pass parsed.data into normalizeCurrentMachineProfile so Zod-applied defaults
(like accelerators defaults) and validated types are used instead of the raw
input.
In `@src/lib/machine-profile/probe.ts`:
- Around line 222-244: The current dedupeAccelerators function collapses
identical vendor|modelRaw keys into a single ObservedAccelerator, which loses
per-device counts and causes acceleratorCount to be wrong downstream; change
dedupeAccelerators to aggregate duplicates instead of dropping them by tracking
and summing an occurrence count (e.g., add or increment a count/acceleratorCount
property on the ObservedAccelerator when merging), while still merging missing
fields (vendor, memoryBytes, backend, kind) from the new entry into the existing
one; ensure the merged object preserves accumulated count so
normalizeMachineProfile() / buildMachineProfileKey() see correct counts.
In `@src/schemas/config.schema.ts`:
- Around line 19-55: The preprocessor (migrateBenchConfigAliases /
readNonEmptyString / validateCanonicalMachineFieldType used in z.preprocess)
must not throw because exceptions escape safeParse; instead remove throws from
readNonEmptyString and validateCanonicalMachineFieldType so they only
normalize/return values, then move all validation into
BenchConfigSchema.superRefine (or, if you must keep checks in the preprocessor,
convert failures into Zod issues by returning an object describing problems
rather than throwing). Implement superRefine to detect blank strings, non-string
canonical fields, and alias conflicts and call ctx.addIssue with appropriate
ZodIssueKinds so BenchConfigSchema.safeParse returns structured ZodError
results.
In `@test/schemas.test.ts`:
- Around line 255-263: The test's expected error regex is incorrect; update the
assertion in the it block that parses HardwareProfileSchema to match the actual
Zod error string by including the "accelerators " prefix and using plain double
quotes (not escaped backslashes). Replace the current toThrow regex with one
that matches: accelerators must be empty when acceleratorDetection\.status is
"none_detected" (e.g. /accelerators must be empty when
acceleratorDetection\.status is "none_detected"/) so the test for
HardwareProfileSchema.parse(...) correctly validates the error message.
---
Nitpick comments:
In `@apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx`:
- Around line 23-39: The current getRunMachineBadgeLabel function includes
machineInstanceId in the display fallback chain which can expose technical IDs;
remove machineInstanceId from the returned fallback sequence so the function
only uses human-friendly fields (machineDisplayLabel, machineLabel,
machineProfileLabel, machineProfileKey, machineProfileId) and leave
machineInstanceId for internal tracking elsewhere.
In `@src/cli/migrate-machine-command.ts`:
- Around line 114-145: The rollback can attempt operations for files that never
had backups or were never renamed; update the logic around stagedRewrites to
track which steps completed (e.g., mark per-rewrite flags like backupCreated and
renameCompleted while executing fs.writeFile, fs.copyFile, and fs.rename) and
use those flags in the catch and finally blocks so you only attempt
fs.copyFile(backupPath, artifactPath) for entries with backupCreated and only
attempt fs.rename or fs.unlink for entries for which the corresponding
temp/backup were actually created; also improve thrown errors and
processLogger/error messages to include the rewrite identifier and which
operation failed to make rollback outcomes clearer (referencing stagedRewrites,
fs.copyFile, fs.writeFile, fs.rename, fs.unlink).
- Around line 222-228: The runtime check for dashboardOutputDir being undefined
is redundant because MigrateMachineCommandOptionsSchema.parse() already enforces
this when parsedOptions.rebuildDashboardIndex is true; remove the if
(dashboardOutputDirInput === undefined) { throw ... } block and directly use
parsedOptions.dashboardOutputDir (or add a non-null assertion where used) so
TypeScript/Zod guarantees are relied upon instead of duplicating validation in
migrate-machine-command.ts around parsedOptions.rebuildDashboardIndex and
parsedOptions.dashboardOutputDir.
In `@src/cli/run-command.ts`:
- Around line 211-226: In run-command.ts the branches that set
options.machineInstanceId and options.machineDisplayLabel (inside the
legacy/canonical checks for legacyMachineId, canonicalMachineId,
legacyMachineLabel, canonicalMachineLabel) are redundant because
resolvedMachineId and resolvedMachineLabel are used directly when building
configInput; remove the options mutations while keeping the deprecation
logger.warn calls, and verify nothing else reads options.machineInstanceId or
options.machineDisplayLabel elsewhere (search for usages of options,
configInput, resolvedMachineId, resolvedMachineLabel) so behavior is unchanged.
In `@src/lib/machine-profile/probe.ts`:
- Around line 1-539: The file is too large and mixes OS-specific probes/parsers
with shared helpers; split responsibilities by moving platform-specific
collectors and parsers into sibling modules. Extract collectMacosAccelerators
and parseMacosAccelerators to a macos module, collectLinuxAccelerators plus
parseNvidiaAccelerators/parseRocmAccelerators/parseLspciAccelerators to a linux
module, and collectWindowsAccelerators plus parseWindowsAccelerators to a
windows module; keep shared utilities like runProbe, parsePositiveInt,
parseMemoryBytes, dedupeAccelerators, BYTES_PER_* constants and
collectObservedHardwareProfile (or have it import the collectors) in the current
probe module, update imports/exports to use the new named exports (e.g.,
parseMacosAccelerators, collectLinuxAccelerators), and adjust tests to import
the new modules so the top-level file falls under 500 lines.
In `@src/results/aggregate.ts`:
- Around line 299-301: The code repeatedly calls resolveRunMetadata(input) in
the checkpoint filter, later in the main loop and when building
summary.instances; instead, map the original runs array to a materialized array
of { input, metadata } by calling resolveRunMetadata once per run (use this
mapped array when computing matchedRuns by checkpointId, iterating the main
aggregation loop, and constructing summary.instances), then update all
references that expect the original input to use the stored metadata (e.g.,
places currently invoking resolveRunMetadata(input) inside the main loop and in
summary.instances) so normalization/migration occurs a single time per run and
the instance-count path is simplified.
In `@src/schemas/config.schema.ts`:
- Around line 96-103: Normalize deprecated aliases before returning the parsed
config: after mapping machineProfileId -> machineInstanceId and machineLabel ->
machineDisplayLabel, delete the deprecated keys (machineProfileId and
machineLabel) from the config object so the returned BenchConfig only exposes
machineInstanceId and machineDisplayLabel; apply the same removal logic in the
other parsing branch that handles the alternate mapping (the block mirroring the
code around machineInstanceId/machineDisplayLabel) and ensure this aligns with
BenchConfigObjectSchema validation boundaries.
In `@test/aggregate.test.ts`:
- Around line 197-198: Add a sibling regression test where two runs share the
same instanceId and then assert distinct-instance counters collapse to one:
create an aggregate from two records with identical instanceId and verify
aggregate.summary.instances === 1 and aggregate.machines[0]?.instanceCount === 1
(and mirror the same change for the assertions around the other case mentioned
at the second location). This ensures the counters count unique instanceId
values rather than run records.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f0c69fb0-cd54-444a-96ec-39416ad8c42c
📒 Files selected for processing (58)
- README.md
- apps/dashboard/public/results/20260305-135118-b11d08/plan.json
- apps/dashboard/public/results/20260305-135118-b11d08/run.json
- apps/dashboard/public/results/20260305-144041-4d5950/plan.json
- apps/dashboard/public/results/20260305-144041-4d5950/run.json
- apps/dashboard/public/results/20260306-135715-91b0ee/plan.json
- apps/dashboard/public/results/20260306-135715-91b0ee/run.json
- apps/dashboard/public/results/20260307-161553-c25c6a/plan.json
- apps/dashboard/public/results/20260307-161553-c25c6a/run.json
- apps/dashboard/public/results/20260316-141600-6e18e2/plan.json
- apps/dashboard/public/results/20260316-141600-6e18e2/run.json
- apps/dashboard/public/results/20260317-082000-7b3bf3/plan.json
- apps/dashboard/public/results/20260317-082000-7b3bf3/run.json
- apps/dashboard/public/results/20260317-170728-697a9a/plan.json
- apps/dashboard/public/results/20260317-170728-697a9a/run.json
- apps/dashboard/public/results/20260318-091214-a0dbda/plan.json
- apps/dashboard/public/results/20260318-091214-a0dbda/run.json
- apps/dashboard/public/results/20260319-090246-e5f8be/plan.json
- apps/dashboard/public/results/20260319-090246-e5f8be/run.json
- apps/dashboard/public/results/aggregates/chk_sha256v1_2c13ad3e58b3.json
- apps/dashboard/public/results/aggregates/chk_sha256v1_61a037f87105.json
- apps/dashboard/public/results/aggregates/latest.json
- apps/dashboard/public/results/index.json
- apps/dashboard/scripts/build-index.ts
- apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
- apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
- apps/dashboard/src/components/run-detail/run-detail-page.tsx
- apps/dashboard/src/components/run-list/checkpoint-groups.ts
- apps/dashboard/src/components/run-list/run-card.tsx
- apps/dashboard/src/components/run-list/run-list-page.tsx
- apps/dashboard/src/lib/api.ts
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- src/cli/index.ts
- src/cli/migrate-machine-command.ts
- src/cli/run-command.ts
- src/lib/hardware-profile.ts
- src/lib/machine-profile/instance-id.ts
- src/lib/machine-profile/legacy.ts
- src/lib/machine-profile/normalization.ts
- src/lib/machine-profile/probe.ts
- src/results/aggregate.ts
- src/results/reader.ts
- src/runner/plan-builder.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- src/schemas/index.ts
- src/schemas/result.schema.ts
- src/tests/workspace-smoke/test.meta.json
- test/aggregate.test.ts
- test/build-index.test.ts
- test/compare-checkpoint-guard.test.ts
- test/hardware-profile.test.ts
- test/machine-profile-migration.test.ts
- test/plan-builder.test.ts
- test/schemas.test.ts
Add trusted signal assessment to benchmark results
Actionable comments posted: 4
🧹 Nitpick comments (1)
src/harnesses/code-output-policy.ts (1)
186-193: Consider the "ok" path with raw code signals at lines 223-231.

When `hasRawCodeSignals` is true, the return at line 229 sets `taintReasons: []`, but this path uses `code: trimmed` and `method: extracted.method`. If extraction was attempted but the code was accepted via raw signal patterns instead, the `taintReasons` should potentially reflect that the output contained mixed content that happened to pass signal checks.

However, since this is a fallback path where structured extraction failed but code-like patterns exist, returning `[]` may be intentional (accepting raw output as "clean enough").
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/harnesses/code-output-policy.ts` around lines 186 - 193, Change the fallback "raw signal" acceptance branch (the hasRawCodeSignals path) so it doesn't always return an empty taintReasons array; instead compute and return appropriate taint reasons similar to the structured-extraction path. Specifically, in the branch that currently returns taintReasons: [], call getAcceptedOutputTaintReasons(trimmed, extracted) (or merge its results with any raw-signal-specific taint markers) and return that value while keeping code: trimmed and method: extracted.method so the response reflects that extraction was attempted but raw-code signals were used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/signal-assessment.ts`:
- Around line 40-50: The createTaintedSignalAssessment function incorrectly
falls back to createTrustworthySignalAssessment when the deduplicated reasons
array is empty; instead, validate and fail fast: after deduping reasons (using
the Set), if uniqueReasons.length === 0 throw a descriptive Error (e.g.
"createTaintedSignalAssessment called with empty reasons") rather than returning
a trustworthy assessment, and otherwise return the tainted assessment with the
uniqueReasons; do not call createTrustworthySignalAssessment in this function.
In `@src/schemas/common.schema.ts`:
- Around line 355-359: The schema currently allows status: "unavailable" with no
explanation; update AcceleratorDetectionSchema so that when status equals the
"unavailable" variant of AcceleratorDetectionStatusSchema the detail string is
required and non-empty (use a conditional validation via z.preprocess/z.refine
or .superRefine on AcceleratorDetectionSchema checking obj.status ===
"unavailable" => Boolean(obj.detail && obj.detail.trim().length)). Reference
AcceleratorDetectionSchema, AcceleratorDetectionStatusSchema, and the
status/detail fields when making the change.
In `@test/goose-adapter.test.ts`:
- Around line 78-118: The test for createGooseAdapter should create and own a
temporary workspace directory and ensure it's removed in a finally block instead
of relying on adapter-managed temp dirs; update the test around
createGooseAdapter() and adapter.generate(...) to (1) create a temp folder
(e.g., with fs.mkdtempSync or a test helper), (2) pass that workspace/path into
the adapter (or set adapter.workspacePath) before calling adapter.generate, and
(3) wrap the test run in try/finally that cleans up the temp workspace
(fs.rmSync or equivalent) to guarantee no artifacts remain after the test
completes.
In `@test/workspace-prompt-parity.test.ts`:
- Around line 59-81: The test "blind coding prompts declare the required public
method contracts" currently asserts only a subset of updated blind-prompt
contracts; add assertions to cover the newly required clauses by updating the
expectations on calculatorPrompt, emitterPrompt, and rateLimiterPrompt: for
calculatorPrompt add expectations for the new calculator method signatures
(e.g., "`reset()`" and any new signatures such as "`subtract(n: number)`" or
"`memoryClear()`" that your prompts require), for emitterPrompt add expectations
for "`once(event, listener)`" and "`off(event, listener)`" (in addition to the
existing `on`, `emit`, `listenerCount`), and for rateLimiterPrompt ensure any
new textual requirements (e.g., the "never go below 0" rule) are asserted;
locate these in the same test block using the existing variables
calculatorPrompt, emitterPrompt, and rateLimiterPrompt and add
expect(...).toContain(...) checks for each missing clause.
---
Nitpick comments:
In `@src/harnesses/code-output-policy.ts`:
- Around line 186-193: Change the fallback "raw signal" acceptance branch (the
hasRawCodeSignals path) so it doesn't always return an empty taintReasons array;
instead compute and return appropriate taint reasons similar to the
structured-extraction path. Specifically, in the branch that currently returns
taintReasons: [], call getAcceptedOutputTaintReasons(trimmed, extracted) (or
merge its results with any raw-signal-specific taint markers) and return that
value while keeping code: trimmed and method: extracted.method so the response
reflects that extraction was attempted but raw-code signals were used.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 40f6afa2-b307-4609-8167-80f5932558ef
📒 Files selected for processing (28)
- src/cli/compare-command.ts
- src/harnesses/code-output-policy.ts
- src/harnesses/direct-adapter.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/harness.ts
- src/harnesses/opencode-adapter.ts
- src/lib/signal-assessment.ts
- src/lib/stats-format.ts
- src/lib/stats.ts
- src/results/compare.ts
- src/runner/generation-retry.ts
- src/runner/item-executor.ts
- src/runner/item-retry.ts
- src/schemas/common.schema.ts
- src/schemas/index.ts
- src/schemas/result.schema.ts
- src/tests/calculator-stateful/prompt.blind.md
- src/tests/event-emitter/prompt.blind.md
- src/tests/file-delete-smoke/prompt.blind.md
- src/tests/rate-limiter/prompt.blind.md
- src/tests/safe-cleanup/prompt.blind.md
- test/code-output-policy.test.ts
- test/compare.test.ts
- test/goose-adapter.test.ts
- test/opencode-adapter.test.ts
- test/signal-assessment.test.ts
- test/stats.test.ts
- test/workspace-prompt-parity.test.ts
✅ Files skipped from review due to trivial changes (3)
- src/tests/file-delete-smoke/prompt.blind.md
- src/tests/safe-cleanup/prompt.blind.md
- test/signal-assessment.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/schemas/result.schema.ts
Actionable comments posted: 4
♻️ Duplicate comments (4)
README.md (1)
154-155: ⚠️ Potential issue | 🟡 Minor

Use an instance-scoped sample ID here. `mac-mini-m4-pro` still reads like a hardware/profile identifier, so the example encourages reusing one ID across identical hosts and collapsing the new instance counts.

Doc tweak

```diff
- bun pb --machine-instance-id mac-mini-m4-pro --machine-display-label "Austin Mac Mini"
+ bun pb --machine-instance-id austin-office-mac-mini --machine-display-label "Austin Office Mac Mini"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 154 - 155, The README example uses a hardware-like sample ID ("mac-mini-m4-pro") that encourages reuse across distinct instances; update the example for the command shown (the bun pb invocation using flags --machine-instance-id and --machine-display-label) to use an instance-scoped sample ID (e.g., a UUID or short unique token like "instance-<unique-id>" or "inst-abc123") and keep the --machine-display-label as a human-friendly name; replace only the machine-instance-id value in the example so readers are guided to provide per-instance unique IDs rather than profile/hardware names.

src/schemas/config.schema.ts (1)
20-83: ⚠️ Potential issue | 🔴 Critical

Keep the `z.preprocess()` path normalization-only.

This migration chain still throws plain `Error`s for blank strings, type mismatches, alias conflicts, and malformed `modelAliases`. Under `z.preprocess()`, those exceptions bypass Zod issue collection, so `BenchConfigSchema.safeParse()` callers get a thrown exception instead of a structured validation failure. Move the validation into `BenchConfigObjectSchema.superRefine()` (or other schema issues) and let the preprocessor only rewrite deprecated keys.

As per coding guidelines, "Use Zod for schema validation at all boundaries (CLI args, env, JSON files, HTTP responses)".

Read-only verification

```bash
#!/bin/bash
# Inspect whether downstream callers rely on BenchConfigSchema.parse/safeParse
# for user-facing config validation paths.
rg -nP -C3 '\bBenchConfigSchema\.(safeParse|parse)\s*\('
rg -nP -C3 '\bz\.preprocess\(\s*migrateBenchConfigAliases\b' src/schemas/config.schema.ts
```

Also applies to: 92-147, 235-239
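A minimal sketch of the intended split between preprocessing and validation, assuming Zod; only two of the machine fields are shown and the issue messages are illustrative:

```typescript
import { z } from "zod";

// The preprocessor only rewrites deprecated keys (no throwing); superRefine
// reports problems as issues so safeParse() returns a structured ZodError.
const BenchConfigSketch = z.preprocess(
  (raw) => {
    if (typeof raw !== "object" || raw === null) return raw;
    const config = { ...(raw as Record<string, unknown>) };
    if (config.machineInstanceId === undefined && config.machineProfileId !== undefined) {
      config.machineInstanceId = config.machineProfileId;
      delete config.machineProfileId;
    }
    return config;
  },
  z
    .object({
      machineInstanceId: z.string().optional(),
      machineProfileId: z.string().optional(), // deprecated alias
    })
    .superRefine((config, ctx) => {
      if (config.machineInstanceId !== undefined && config.machineInstanceId.trim() === "") {
        ctx.addIssue({
          code: z.ZodIssueCode.custom,
          path: ["machineInstanceId"],
          message: "machineInstanceId must not be blank",
        });
      }
      // Both keys survive preprocessing only when the caller supplied both.
      if (config.machineInstanceId !== undefined && config.machineProfileId !== undefined) {
        ctx.addIssue({
          code: z.ZodIssueCode.custom,
          path: ["machineProfileId"],
          message: "machineProfileId conflicts with machineInstanceId; keep only the canonical field",
        });
      }
    }),
);
```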
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/schemas/config.schema.ts` around lines 20 - 83, The preprocessor migrateBenchConfigAliases (and helpers readNonEmptyString, validateCanonicalMachineFieldType, migrateLegacyModelAliases/ModelAliasMapSchema) must only normalize keys and rewrite deprecated shapes without throwing; move all runtime checks into BenchConfigObjectSchema.superRefine so zod collects issues rather than allowing exceptions to escape safeParse/parse. Concretely: make the z.preprocess path return normalized data or undefined and remove throws from readNonEmptyString/validateCanonicalMachineFieldType/migrateLegacyModelAliases (have them return safe values or parse results), then implement validations inside BenchConfigObjectSchema.superRefine to call those helpers and add ctx.addIssue for blank strings, type mismatches, alias conflicts, and malformed modelAliases (convert ModelAliasMapSchema.safeParse errors into zod issues there) so callers receive structured zod validation failures.

test/schemas.test.ts (1)
309-317: ⚠️ Potential issue | 🟡 Minor

Fix the `none_detected` error regex.

`HardwareProfileSchema` emits `accelerators must be empty when acceleratorDetection.status is "none_detected"`, so this matcher will not hit the actual Zod message.

🧪 Suggested fix

```diff
- ).toThrow(/must be empty when acceleratorDetection\.status is \\\"none_detected\\\"/);
+ ).toThrow(
+   /accelerators must be empty when acceleratorDetection\.status is "none_detected"/,
+ );
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/schemas.test.ts` around lines 309 - 317, The test's expected error regex doesn't match the Zod message; update the assertion in the "should reject none_detected accelerators with a non-empty accelerator list" test so the toThrow matcher expects the actual message emitted by HardwareProfileSchema.parse — e.g., change the regex to match 'accelerators must be empty when acceleratorDetection.status is "none_detected"' (refer to the failing test block and the HardwareProfileSchema.parse call to locate the assertion).

apps/dashboard/scripts/build-index.ts (1)
490-507: ⚠️ Potential issue | 🔴 Critical

Reject overlapping source and output trees before reading or deleting.

`writePublishedRunBundles()` removes `outputResultsDir` unconditionally. If the output directory equals the source directory, lives under it, or contains it, this can either delete source runs or make the scan treat published artifacts as inputs.

🛡️ Suggested fix

```diff
-import { basename, join, resolve } from "node:path";
+import { basename, join, relative, resolve } from "node:path";
 ...
 export async function buildDashboardIndexArtifacts(
   options: BuildDashboardIndexArtifactsOptions,
 ): Promise<BuildDashboardIndexArtifactsResult> {
   const sourceResultsDir = resolve(options.sourceResultsDir);
   const outputResultsDir = resolve(options.outputResultsDir);
   const projectRoot = resolve(options.projectRoot ?? DEFAULT_PROJECT_ROOT);
+  const isSameOrNested = (from: string, to: string): boolean => {
+    const rel = relative(from, to);
+    return rel === "" || (rel !== "." && !rel.startsWith(".."));
+  };
+  if (
+    isSameOrNested(sourceResultsDir, outputResultsDir) ||
+    isSameOrNested(outputResultsDir, sourceResultsDir)
+  ) {
+    throw new Error(
+      "sourceResultsDir and outputResultsDir must not overlap",
+    );
+  }
   const indexPath = join(outputResultsDir, "index.json");
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/scripts/build-index.ts` around lines 490 - 507, Detect and reject overlapping source and output directories before any IO by comparing resolved sourceResultsDir and outputResultsDir (and aggregatesDir) for equality or containment; add a pre-check (e.g., isSubpath or canonical resolved path + separator prefix check) at the top of the routine before the entries loop and before calling writePublishedRunBundles to throw/exit with a clear error if outputResultsDir === sourceResultsDir or one path is inside the other, so you never read source files when output will be deleted or write into the scanned tree.
🧹 Nitpick comments (3)
test/machine-profile-migration.test.ts (1)
117-118: Cover the happy-path `parseKnown*Payload()` flow too.

The supported-case assertions stop at the raw schemas, and `parseKnownPlanPayload()` / `parseKnownRunPayload()` are only exercised on the unsupported-version branch. Adding one supported plan/run through those helpers would catch parser-wiring regressions that this suite currently misses.

Also applies to: 321-335
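A shape sketch of the missing supported-version coverage, assuming `bun:test`; the fixture names and return shapes are placeholders for what this suite already builds:

```typescript
import { expect, it } from "bun:test";

// Hypothetical stand-ins for the suite's supported-version fixtures/helpers.
declare function parseKnownPlanPayload(payload: unknown): unknown;
declare function parseKnownRunPayload(payload: unknown): unknown;
declare const supportedPlanFixture: unknown;
declare const supportedRunFixture: unknown;

it("round-trips supported payloads through the known-payload helpers", () => {
  expect(parseKnownPlanPayload(supportedPlanFixture)).toBeDefined();
  expect(parseKnownRunPayload(supportedRunFixture)).toBeDefined();
});
```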
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/machine-profile-migration.test.ts` around lines 117 - 118, The test only asserts raw schema parsing but never exercises the happy-path helpers; add assertions that call parseKnownPlanPayload(plan) and parseKnownRunPayload(run) (or their equivalents) with a supported-version plan/run (use the same sample plan and run used to produce parsedPlan/parsedRun) and assert they return the expected known payload objects (not the unsupported branch). This will exercise the parser wiring for the supported-case: locate where RunPlanSchema.parse(plan) and RunResultSchema.parse(run) are used (parsedPlan/parsedRun) and after those lines call parseKnownPlanPayload and parseKnownRunPayload and add assertions that the results are defined and match the expected shape/fields for the supported version.

src/lib/model-profile/normalization.ts (2)
105-106: Deduplicate tuning token definitions

The same tuning token set is hardcoded in two places. Extract one shared `const` to avoid future drift between detection and family derivation.

♻️ Proposed refactor

```diff
+const TUNING_TOKENS = [
+  "instruct",
+  "chat",
+  "coder",
+  "vision",
+  "reasoning",
+] as const;
+
 /** Detects common tuning suffixes that should stay in the canonical profile. */
 function detectTuning(value: string): string | undefined {
   const normalized = value.toLowerCase();
-  const tuningPatterns = ["instruct", "chat", "coder", "vision", "reasoning"];
-  return tuningPatterns.find((pattern) => normalized.includes(pattern));
+  return TUNING_TOKENS.find((pattern) => normalized.includes(pattern));
 }
@@
-  if (["instruct", "chat", "coder", "vision", "reasoning"].includes(token)) {
+  if ((TUNING_TOKENS as readonly string[]).includes(token)) {
     break;
   }
```

As per coding guidelines, "Favor iteration and modularization over code duplication".
Also applies to: 151-153
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/model-profile/normalization.ts` around lines 105 - 106, Extract the hardcoded array into a single shared top-level constant (e.g. TUNING_PATTERNS = ["instruct","chat","coder","vision","reasoning"]) and replace the local tuningPatterns declarations with references to TUNING_PATTERNS; update both the detection site that uses return tuningPatterns.find((pattern) => normalized.includes(pattern)) and the other occurrence around lines 151-153 (the family-derivation logic) to use the shared constant so the token set is defined only once.
94-99: Normalize fallback `format` casing consistently

`detectFormat` returns mixed casing (`GGUF`, `MLX`, `safetensors`). Consider one canonical style to avoid downstream grouping mismatches.
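One way to pin the casing (lowercase chosen arbitrarily for illustration; the token list mirrors the formats named above):

```typescript
// Canonical, all-lowercase format tokens so grouping keys never diverge.
type ModelFormat = "gguf" | "mlx" | "safetensors";

function detectFormat(value: string): ModelFormat | undefined {
  const normalized = value.toLowerCase();
  if (normalized.includes("gguf")) return "gguf";
  if (normalized.includes("mlx")) return "mlx";
  if (normalized.includes("safetensors")) return "safetensors";
  return undefined;
}
```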
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/lib/model-profile/normalization.ts` around lines 94 - 99, The detectFormat function returns inconsistent casing ("GGUF", "MLX", "safetensors") which can break downstream grouping; update detectFormat to return a single canonical casing (e.g., all-lowercase like "gguf", "mlx", "safetensors" or all-uppercase) for every branch, and adjust any consumers if they expect the old mixed values; specifically change the return values inside detectFormat to the chosen canonical strings so format comparisons and grouping are consistent across the codebase.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/model-profile/normalization.ts`:
- Around line 313-323: The current conditional silently falls back to
buildFallbackModelProfile when only one of args.profileKey or args.profile is
provided, hiding configuration/programmer errors; instead validate presence of
both in the normalization logic and throw a clear error when one is missing
(e.g., check args.profileKey && args.profile, else if only one is present throw
a descriptive Error referencing profileKey/profile and the calling context), so
only call buildConfiguredModelProfile when both are present and otherwise fail
fast rather than calling buildFallbackModelProfile.
- Around line 37-47: Add full TSDoc blocks to every exported function in this
module—specifically humanizeSlug, normalizeConfiguredVariant,
buildConfiguredModelProfile, and buildFallbackModelProfile—replacing the short
comments with standard TSDoc that describes the function purpose, documents each
parameter (`@param`), the return value (`@returns`), and any errors that can be
thrown (`@throws`). Ensure the TSDoc sits immediately above each exported function
declaration and uses clear, concise descriptions for input types and expected
outputs so IDEs and generated docs pick up the information.
In `@src/results/compare.ts`:
- Around line 27-40: The compare key in buildCompareKey currently omits the
runtime, causing different runtimes (e.g., ollama vs vllm) for the same model
profile to collapse; update the function signature and key construction in
buildCompareKey to include item.runtime (add runtime to the item shape if
missing) and append it to the returned string (e.g., include |${item.runtime})
so keys are unique per runtime while still using getModelIdentityKey for the
model identity portion.
In `@src/runner/plan-builder.ts`:
- Around line 169-177: If resolveModelSelection(modelSpec, runtimeName,
modelProfiles) returns undefined for a modelSpec that is a configured profile
key, do not silently skip it; instead detect that modelSpec exists in
modelProfiles (or otherwise was explicitly requested) and throw a clear error
indicating the configured profile has no variant for the given runtimeName.
Update the loop around resolveModelSelection / discovered checks to: (1) check
whether modelSpec refers to a configured profile (e.g., lookup in modelProfiles
or the config source), (2) if so and resolvedSelection is undefined throw a
descriptive exception mentioning modelSpec and runtimeName, and (3) otherwise
continue the existing logic that checks
discovered.includes(resolvedSelection.runtimeModelName). This ensures
resolveModelSelection, modelSpec, runtimeName, modelProfiles, discovered, and
runtimeModelName are the focal symbols for the fix.
---
Duplicate comments:
In `@apps/dashboard/scripts/build-index.ts`:
- Around line 490-507: Detect and reject overlapping source and output
directories before any IO by comparing resolved sourceResultsDir and
outputResultsDir (and aggregatesDir) for equality or containment; add a
pre-check (e.g., isSubpath or canonical resolved path + separator prefix check)
at the top of the routine before the entries loop and before calling
writePublishedRunBundles to throw/exit with a clear error if outputResultsDir
=== sourceResultsDir or one path is inside the other, so you never read source
files when output will be deleted or write into the scanned tree.
In `@README.md`:
- Around line 154-155: The README example uses a hardware-like sample ID
("mac-mini-m4-pro") that encourages reuse across distinct instances; update the
example for the command shown (the bun pb invocation using flags
--machine-instance-id and --machine-display-label) to use an instance-scoped
sample ID (e.g., a UUID or short unique token like "instance-<unique-id>" or
"inst-abc123") and keep the --machine-display-label as a human-friendly name;
replace only the machine-instance-id value in the example so readers are guided
to provide per-instance unique IDs rather than profile/hardware names.
In `@src/schemas/config.schema.ts`:
- Around line 20-83: The preprocessor migrateBenchConfigAliases (and helpers
readNonEmptyString, validateCanonicalMachineFieldType,
migrateLegacyModelAliases/ModelAliasMapSchema) must only normalize keys and
rewrite deprecated shapes without throwing; move all runtime checks into
BenchConfigObjectSchema.superRefine so zod collects issues rather than allowing
exceptions to escape safeParse/parse. Concretely: make the z.preprocess path
return normalized data or undefined and remove throws from
readNonEmptyString/validateCanonicalMachineFieldType/migrateLegacyModelAliases
(have them return safe values or parse results), then implement validations
inside BenchConfigObjectSchema.superRefine to call those helpers and add
ctx.addIssue for blank strings, type mismatches, alias conflicts, and malformed
modelAliases (convert ModelAliasMapSchema.safeParse errors into zod issues
there) so callers receive structured zod validation failures.
In `@test/schemas.test.ts`:
- Around line 309-317: The test's expected error regex doesn't match the Zod
message; update the assertion in the "should reject none_detected accelerators
with a non-empty accelerator list" test so the toThrow matcher expects the
actual message emitted by HardwareProfileSchema.parse — e.g., change the regex
to match 'accelerators must be empty when acceleratorDetection.status is
"none_detected"' (refer to the failing test block and the
HardwareProfileSchema.parse call to locate the assertion).
---
Nitpick comments:
In `@src/lib/model-profile/normalization.ts`:
- Around line 105-106: Extract the hardcoded array into a single shared
top-level constant (e.g. TUNING_PATTERNS =
["instruct","chat","coder","vision","reasoning"]) and replace the local
tuningPatterns declarations with references to TUNING_PATTERNS; update both the
detection site that uses return tuningPatterns.find((pattern) =>
normalized.includes(pattern)) and the other occurrence around lines 151-153 (the
family-derivation logic) to use the shared constant so the token set is defined
only once.
- Around line 94-99: The detectFormat function returns inconsistent casing
("GGUF", "MLX", "safetensors") which can break downstream grouping; update
detectFormat to return a single canonical casing (e.g., all-lowercase like
"gguf", "mlx", "safetensors" or all-uppercase) for every branch, and adjust any
consumers if they expect the old mixed values; specifically change the return
values inside detectFormat to the chosen canonical strings so format comparisons
and grouping are consistent across the codebase.
In `@test/machine-profile-migration.test.ts`:
- Around line 117-118: The test only asserts raw schema parsing but never
exercises the happy-path helpers; add assertions that call
parseKnownPlanPayload(plan) and parseKnownRunPayload(run) (or their equivalents)
with a supported-version plan/run (use the same sample plan and run used to
produce parsedPlan/parsedRun) and assert they return the expected known payload
objects (not the unsupported branch). This will exercise the parser wiring for
the supported-case: locate where RunPlanSchema.parse(plan) and
RunResultSchema.parse(run) are used (parsedPlan/parsedRun) and after those lines
call parseKnownPlanPayload and parseKnownRunPayload and add assertions that the
results are defined and match the expected shape/fields for the supported
version.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 117bc571-762b-4ff5-9000-aa5dac10fa91
📒 Files selected for processing (23)
- README.md
- apps/dashboard/scripts/build-index.ts
- llm/implementation/multi-runtime-mvp-implementation.md
- models.example.json
- models.json
- src/cli/run-command.ts
- src/lib/model-profile/normalization.ts
- src/lib/model-profile/registry.ts
- src/lib/model-profiles.ts
- src/results/compare.ts
- src/runner/index.ts
- src/runner/item-executor.ts
- src/runner/plan-builder.ts
- src/schemas/config.schema.ts
- src/schemas/index.ts
- src/schemas/model-profile.schema.ts
- src/schemas/plan.schema.ts
- src/schemas/result.schema.ts
- test/build-index.test.ts
- test/compare.test.ts
- test/machine-profile-migration.test.ts
- test/plan-builder.test.ts
- test/schemas.test.ts
✅ Files skipped from review due to trivial changes (5)
- llm/implementation/multi-runtime-mvp-implementation.md
- src/lib/model-profiles.ts
- test/compare.test.ts
- src/schemas/model-profile.schema.ts
- src/cli/run-command.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/runner/item-executor.ts
Actionable comments posted: 3
🧹 Nitpick comments (3)
src/lib/machine-profile/probe-utils.ts (1)
103-108: Redundant conditional spread on initial entry.

The `...accelerator` spread already copies `count` when present. The conditional spread on line 106 has no effect.

♻️ Simplified initial entry

```diff
 if (!current) {
-  deduped.set(key, {
-    ...accelerator,
-    ...(accelerator.count === undefined ? {} : { count: accelerator.count }),
-  });
+  deduped.set(key, { ...accelerator });
   continue;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-utils.ts` around lines 103 - 108, The initial-entry branch inside the dedup loop redundantly re-spreads accelerator.count; in the block that checks "if (!current)" simplify the object passed to deduped.set by removing the conditional spread and just use the accelerator (or a shallow copy like {...accelerator}) since the leading "...accelerator" already includes count; update the deduped.set(key, ...) call accordingly (refer to deduped.set, accelerator, current, key in probe-utils.ts).

src/schemas/config.schema.ts (1)
243-257: Unreachable type validation in superRefine.

The `validateCanonicalMachineFieldType` check runs inside `superRefine`, but by that point Zod's field-level validators (`z.string().trim().min(1).optional()`) have already rejected non-string values. This code block will never execute because non-string inputs fail schema validation before `superRefine` runs.

Either remove this dead code or move the type validation into the preprocessor where it can intercept invalid types before the schema validates them.
♻️ Option 1: Remove unreachable code
```diff
   if (config.gooseWorkspaceRetryMaxTurns < config.gooseWorkspaceMaxTurns) {
     context.addIssue({
       code: z.ZodIssueCode.custom,
       path: ["gooseWorkspaceRetryMaxTurns"],
       message:
         "gooseWorkspaceRetryMaxTurns must be greater than or equal to gooseWorkspaceMaxTurns",
     });
   }
-
-  const rawConfig = config as unknown as Record<string, unknown>;
-  for (const fieldName of [
-    "machineInstanceId",
-    "machineDisplayLabel",
-    "machineProfileId",
-    "machineLabel",
-  ] as const) {
-    if (validateCanonicalMachineFieldType(rawConfig, fieldName)) {
-      context.addIssue({
-        code: z.ZodIssueCode.custom,
-        path: [fieldName],
-        message: `Bench config ${fieldName} must be a string when provided`,
-      });
-    }
-  }
 });
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/schemas/config.schema.ts` around lines 243 - 257, The type-check inside superRefine using validateCanonicalMachineFieldType is unreachable because Zod's field-level validators (e.g., z.string().trim().min(1).optional()) already reject non-string values before superRefine runs; remove the entire loop that iterates over ["machineInstanceId","machineDisplayLabel","machineProfileId","machineLabel"] (the rawConfig/validateCanonicalMachineFieldType/context.addIssue block) from superRefine, or alternatively move that type-check into the preprocessor phase where rawConfig is available so it can catch non-string types before schema validation runs.

src/lib/machine-profile/probe-linux.ts (1)
30-33: Remove redundant string parsing.

The `parsePositiveInt(memoryMiB)` call already parses and validates the string, but its result is discarded. The string is then parsed again with `Number.parseInt(memoryMiB, 10)`. Store and reuse the validated result instead:

♻️ Proposed fix

```diff
-  const memoryBytes = parsePositiveInt(memoryMiB)
-    ? Number.parseInt(memoryMiB, 10) * BYTES_PER_MIB
-    : undefined;
+  const parsedMiB = parsePositiveInt(memoryMiB);
+  const memoryBytes = parsedMiB !== undefined
+    ? parsedMiB * BYTES_PER_MIB
+    : undefined;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-linux.ts` around lines 30 - 33, The code currently calls parsePositiveInt(memoryMiB) to validate/parse the memory string but discards its return value and re-parses using Number.parseInt; instead call parsePositiveInt(memoryMiB) once, store its numeric result (e.g., const memoryValue = parsePositiveInt(memoryMiB)), then compute memoryBytes = memoryValue ? memoryValue * BYTES_PER_MIB : undefined; update the block around modelRaw/memoryMiB parsing in probe-linux.ts to reuse this parsed result and remove the redundant Number.parseInt call.

🤖 Prompt for all review comments with AI agents
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/machine-profile/probe-macos.ts`:
- Around line 20-33: The function parseMacosAccelerators assumes parsed is a
non-null object before checking SPDisplaysDataType, which allows
JSON.parse("null") or scalars to cause a TypeError; update the validation to
first assert that parsed is an object and not null (e.g. typeof parsed ===
"object" && parsed !== null) and only then check that (parsed as {
SPDisplaysDataType?: unknown }).SPDisplaysDataType is an Array, so that reading
SPDisplaysDataType and assigning to displays is safe and yields the intended
deterministic probe error when the shape is wrong; reference
parseMacosAccelerators, the parsed variable, SPDisplaysDataType and displays
when making the change.
In `@src/results/compare.ts`:
- Around line 317-323: trustedMetricsAvailable currently uses
hasCompleteSignalAssessments(runAItems) &&
hasCompleteSignalAssessments(runBItems) which lets any unmatched row without
signalAssessment disable trusted deltas for matched rows; instead compute two
booleans: one (e.g., matchedMetricsComplete) that runs
hasCompleteSignalAssessments over only the items present in matched (extract
itemA/itemB arrays from matched) and use that to gate trustedScoringDelta and
trustedFrontierEvalDelta, and keep a separate wholeRun completeness/taint count
(using runAItems/runBItems) only where needed for whole-run taint reporting;
update the code paths that set trustedMatched and the trusted delta calculations
to rely on matchedMetricsComplete rather than the current
trustedMetricsAvailable.
In `@src/schemas/common.schema.ts`:
- Around line 393-417: The .superRefine on the hardware schema currently
enforces rules for acceleratorDetection.status "detected" and "none_detected"
but misses the "unavailable" case; update the same .superRefine callback (the
hardware parameter with hardware.acceleratorDetection and hardware.accelerators)
to add a branch that if hardware.acceleratorDetection.status === "unavailable"
and hardware.accelerators.length > 0 then call context.addIssue (use
z.ZodIssueCode.custom, path ["accelerators"]) with a clear message like
"accelerators must be empty when acceleratorDetection.status is \"unavailable\""
so the schema rejects any non-empty accelerators array when detection is
unavailable.
---
Nitpick comments:
In `@src/lib/machine-profile/probe-linux.ts`:
- Around line 30-33: The code currently calls parsePositiveInt(memoryMiB) to
validate/parse the memory string but discards its return value and re-parses
using Number.parseInt; instead call parsePositiveInt(memoryMiB) once, store its
numeric result (e.g., const memoryValue = parsePositiveInt(memoryMiB)), then
compute memoryBytes = memoryValue ? memoryValue * BYTES_PER_MIB : undefined;
update the block around modelRaw/memoryMiB parsing in probe-linux.ts to reuse
this parsed result and remove the redundant Number.parseInt call.
In `@src/lib/machine-profile/probe-utils.ts`:
- Around line 103-108: The initial-entry branch inside the dedup loop
redundantly re-spreads accelerator.count; in the block that checks "if
(!current)" simplify the object passed to deduped.set by removing the
conditional spread and just use the accelerator (or a shallow copy like
{...accelerator}) since the leading "...accelerator" already includes count;
update the deduped.set(key, ...) call accordingly (refer to deduped.set,
accelerator, current, key in probe-utils.ts).
In `@src/schemas/config.schema.ts`:
- Around line 243-257: The type-check inside superRefine using
validateCanonicalMachineFieldType is unreachable because Zod's field-level
validators (e.g., z.string().trim().min(1).optional()) already reject non-string
values before superRefine runs; remove the entire loop that iterates over
["machineInstanceId","machineDisplayLabel","machineProfileId","machineLabel"]
(the rawConfig/validateCanonicalMachineFieldType/context.addIssue block) from
superRefine, or alternatively move that type-check into the preprocessor phase
where rawConfig is available so it can catch non-string types before schema
validation runs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f43d3a0f-40d4-4924-863a-09a86fdf5b28
📒 Files selected for processing (33)
- README.md
- apps/dashboard/scripts/build-index.ts
- apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- src/cli/migrate-machine-command.ts
- src/cli/run-command.ts
- src/harnesses/code-output-policy.ts
- src/lib/machine-profile/legacy.ts
- src/lib/machine-profile/normalization.ts
- src/lib/machine-profile/probe-linux.ts
- src/lib/machine-profile/probe-macos.ts
- src/lib/machine-profile/probe-utils.ts
- src/lib/machine-profile/probe-windows.ts
- src/lib/machine-profile/probe.ts
- src/lib/model-profile/normalization.ts
- src/lib/signal-assessment.ts
- src/results/aggregate.ts
- src/results/compare.ts
- src/runner/plan-builder.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- test/aggregate.test.ts
- test/build-index.test.ts
- test/code-output-policy.test.ts
- test/compare.test.ts
- test/goose-adapter.test.ts
- test/machine-profile-migration.test.ts
- test/machine-profile-probe.test.ts
- test/plan-builder.test.ts
- test/schemas.test.ts
- test/signal-assessment.test.ts
- test/workspace-prompt-parity.test.ts
✅ Files skipped from review due to trivial changes (4)
- test/goose-adapter.test.ts
- test/signal-assessment.test.ts
- README.md
- test/machine-profile-migration.test.ts
🚧 Files skipped from review as they are similar to previous changes (8)
- apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
- test/compare.test.ts
- test/aggregate.test.ts
- test/code-output-policy.test.ts
- src/cli/migrate-machine-command.ts
- src/runner/plan-builder.ts
- test/plan-builder.test.ts
- apps/dashboard/src/lib/schemas.ts
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
apps/dashboard/src/lib/schemas.ts (1)
436-443: ⚠️ Potential issue | 🟠 Major
Don't hard-reject persisted v1 aggregates at parse time.
`apps/dashboard/src/lib/api.ts:241` still parses `aggregates/latest.json` with `LeaderboardAggregateSchema.parse(data)`. Tightening this schema to `z.literal(2)` means any existing `schemaVersion: 1` aggregate will now fail the leaderboard until artifacts are rebuilt or migrated. Please add an explicit v1→v2 parse path here, or handle the parse failure with the same rebuild guidance you already give for a missing aggregate.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/schemas.ts` around lines 436 - 443, The current LeaderboardAggregateSchema hard-fails on schemaVersion: 1; update the parsing so aggregates with schemaVersion: 1 are accepted and transformed to the v2 shape instead of rejecting at parse time: modify LeaderboardAggregateSchema (or create a wrapper parse function used by parse(data) in api code) to accept both v1 and v2 (e.g., a union or a preprocess step) and implement a conversion path that maps v1 fields into the v2 structure (keep symbol names: LeaderboardAggregateSchema, LeaderboardAggregateSummarySchema, LeaderboardMachineSummarySchema, LeaderboardAggregatedItemSchema) so existing persisted aggregates parse successfully, and if parsing still fails fall back to the same "missing aggregate / rebuild" guidance currently used by the API.
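If backward parsing is the route taken, one option is a thin compatibility wrapper around the strict v2 schema. This is only a sketch: the stand-in `LeaderboardAggregateV2Schema` below is not the dashboard's real schema, and the actual v1→v2 field mapping belongs wherever the real shapes are defined.

```ts
import { z } from "zod";

// Stand-in for the real v2 schema; only the version literal matters for this sketch.
const LeaderboardAggregateV2Schema = z
  .object({ schemaVersion: z.literal(2) })
  .passthrough();

// Accept persisted v1 payloads by upgrading them before strict v2 validation runs.
const LeaderboardAggregateCompatSchema = z.preprocess((data) => {
  if (
    typeof data === "object" &&
    data !== null &&
    (data as { schemaVersion?: unknown }).schemaVersion === 1
  ) {
    // The real migration should map whichever v1 fields actually changed;
    // this sketch only bumps the version so the v2 schema can validate the rest.
    return { ...(data as Record<string, unknown>), schemaVersion: 2 };
  }
  return data;
}, LeaderboardAggregateV2Schema);

// Usage at the boundary would mirror the existing parse call:
// const aggregate = LeaderboardAggregateCompatSchema.parse(data);
```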
🧹 Nitpick comments (1)
src/lib/machine-profile/probe-utils.ts (1)
107-115: Consider simplifying the field backfill logic for readability.
The nested spread-ternary pattern is correct but dense. A helper or explicit assignments could improve clarity.
♻️ Optional refactor for readability
```diff
-      deduped.set(key, {
-        ...current,
-        ...(current.vendor ? {} : accelerator.vendor ? { vendor: accelerator.vendor } : {}),
-        ...(current.memoryBytes ? {} : accelerator.memoryBytes ? { memoryBytes: accelerator.memoryBytes } : {}),
-        ...(current.backend ? {} : accelerator.backend ? { backend: accelerator.backend } : {}),
-        ...(current.kind !== "unknown" ? {} : accelerator.kind !== "unknown" ? { kind: accelerator.kind } : {}),
-        count: (current.count ?? 1) + (accelerator.count ?? 1),
-      });
+      const merged: ObservedAccelerator = {
+        ...current,
+        vendor: current.vendor ?? accelerator.vendor,
+        memoryBytes: current.memoryBytes ?? accelerator.memoryBytes,
+        backend: current.backend ?? accelerator.backend,
+        kind: current.kind !== "unknown" ? current.kind : accelerator.kind,
+        count: (current.count ?? 1) + (accelerator.count ?? 1),
+      };
+      deduped.set(key, merged);
```
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-utils.ts` around lines 107 - 115, The nested spread-ternary in the deduped.set call (using variables key, current, accelerator) is correct but hard to read; replace it with clearer backfill logic by building an explicit object before calling deduped.set — e.g., create a newRecord from current, then conditionally assign vendor, memoryBytes, backend, and kind only when missing on current and present on accelerator, and finally set count = (current.count ?? 1) + (accelerator.count ?? 1), then call deduped.set(key, newRecord); this keeps the same behavior while making the intent in the deduped.set block much easier to read and maintain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/results/compare.ts`:
- Around line 281-358: The deltas (passRateDelta / avgScoreDelta) are computed
from aggregates that drop missing automatedScore/frontierEval independently,
causing comparisons across disjoint row sets; fix by filtering matched (and
trustedMatched) to only include rows where both itemA and itemB have the metric
present before calling computeAggregateScoring or computeAggregateFrontier (use
predicates on item.itemA.automatedScore and item.itemB.automatedScore for
scoring, and item.itemA.frontierEval and item.itemB.frontierEval for frontier)
so passRateDelta and avgScoreDelta compare the same rows, and separately compute
and expose metric-availability loss (counts or coverage delta) rather than
folding it into the score deltas (update code paths that set scoringDelta,
trustedScoringDelta, frontierEvalDelta, and trustedFrontierEvalDelta
accordingly).
In `@src/schemas/config.schema.ts`:
- Around line 68-97: The backfill checks use hasOwnProperty, so explicit
properties set to undefined (e.g. {machineInstanceId: undefined}) are treated as
present and skip migration; change those conditions to treat undefined as
"absent" by relying on the parsed/normalized values and direct undefined checks:
use machineInstanceId === undefined (instead of hasMachineInstanceId) when
deciding to copy machineProfileId into config.machineInstanceId, use
machineDisplayLabel === undefined (instead of hasMachineDisplayLabel) when
copying machineLabel into config.machineDisplayLabel, and replace the
hasOwnProperty check for config.modelProfiles with a direct check for
config.modelProfiles === undefined before migrating modelAliases via
migrateLegacyModelAliases; reference symbols: machineInstanceId,
machineProfileId, machineDisplayLabel, machineLabel, modelProfiles,
modelAliases, migrateLegacyModelAliases.
---
Outside diff comments:
In `@apps/dashboard/src/lib/schemas.ts`:
- Around line 436-443: The current LeaderboardAggregateSchema hard-fails on
schemaVersion: 1; update the parsing so aggregates with schemaVersion: 1 are
accepted and transformed to the v2 shape instead of rejecting at parse time:
modify LeaderboardAggregateSchema (or create a wrapper parse function used by
parse(data) in api code) to accept both v1 and v2 (e.g., a union or a preprocess
step) and implement a conversion path that maps v1 fields into the v2 structure
(keep symbol names: LeaderboardAggregateSchema,
LeaderboardAggregateSummarySchema, LeaderboardMachineSummarySchema,
LeaderboardAggregatedItemSchema) so existing persisted aggregates parse
successfully, and if parsing still fails fall back to the same "missing
aggregate / rebuild" guidance currently used by the API.
---
Nitpick comments:
In `@src/lib/machine-profile/probe-utils.ts`:
- Around line 107-115: The nested spread-ternary in the deduped.set call (using
variables key, current, accelerator) is correct but hard to read; replace it
with clearer backfill logic by building an explicit object before calling
deduped.set — e.g., create a newRecord from current, then conditionally assign
vendor, memoryBytes, backend, and kind only when missing on current and present
on accelerator, and finally set count = (current.count ?? 1) +
(accelerator.count ?? 1), then call deduped.set(key, newRecord); this keeps the
same behavior while making the intent in the deduped.set block much easier to
read and maintain.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a036b871-56d8-4b93-a257-5190c8851628
📒 Files selected for processing (10)
- apps/dashboard/src/lib/schemas.ts
- src/lib/machine-profile/probe-linux.ts
- src/lib/machine-profile/probe-macos.ts
- src/lib/machine-profile/probe-utils.ts
- src/results/compare.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- test/compare.test.ts
- test/machine-profile-probe.test.ts
- test/schemas.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- test/machine-profile-probe.test.ts
- src/lib/machine-profile/probe-macos.ts
- src/lib/machine-profile/probe-linux.ts
@CodeRabbit full review
✅ Actions performed
Full review triggered.
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
apps/dashboard/src/lib/schemas.ts (1)
264-282: ⚠️ Potential issue | 🟠 Major
Mirror the new item-level metadata in the dashboard boundary schemas.
These local item schemas still omit `modelProfile`, and result items also omit `signalAssessment`. That means the dashboard boundary cannot carry forward the canonical model identity and trusted/tainted metadata now present in run artifacts. Add local `ModelProfileSchema` and `SignalAssessmentSchema` mirrors and attach them here.
As per coding guidelines, "Validate at boundaries with Zod and pass typed data inward".
🔧 Proposed fix
```diff
 const MatrixItemResultSchema = z.object({
   id: z.string(),
   runtime: z.string(),
   model: z.string(),
   modelAlias: z.string().optional(),
+  modelProfile: ModelProfileSchema.optional(),
   harness: z.string(),
   test: z.string(),
   passType: PassTypeSchema,
   status: ItemStatusSchema,
   startedAt: z.string().optional(),
   completedAt: z.string().optional(),
+  signalAssessment: SignalAssessmentSchema.optional(),
   generation: GenerationResultSchema.optional(),
   generationFailure: GenerationFailureSchema.optional(),
   automatedScore: AutomatedScoreSchema.optional(),
   // ...
 });

 const MatrixItemSchema = z.object({
   id: z.string(),
   runtime: z.string(),
   model: z.string(),
   modelAlias: z.string().optional(),
+  modelProfile: ModelProfileSchema.optional(),
   harness: z.string(),
   test: z.string(),
   passType: PassTypeSchema,
 });
```
Also applies to: 307-315
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/schemas.ts` around lines 264 - 282, The MatrixItemResultSchema (and the similar item-level schema around lines 307-315) must include local mirrors of the new item-level metadata: add a ModelProfileSchema and a SignalAssessmentSchema definitions in this file and attach them to the appropriate item schemas as optional fields (e.g., add modelProfile: ModelProfileSchema.optional() and signalAssessment: SignalAssessmentSchema.optional() to MatrixItemResultSchema and the parallel item schema referenced in the comment). Ensure the local ModelProfileSchema and SignalAssessmentSchema mirror the canonical fields used elsewhere so the dashboard boundary can validate and pass typed model identity and trusted/tainted metadata inward.apps/dashboard/src/lib/types.ts (1)
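For orientation only, a rough shape of what the local mirrors could look like. The field lists below are illustrative guesses, not the CLI's actual `ModelProfile` / `SignalAssessment` definitions; the real fields should be copied from the source schemas rather than re-invented.

```ts
import { z } from "zod";

// Hypothetical mirror of the CLI-side model identity; the authoritative shape
// lives in src/schemas/model-profile.schema.ts and must be kept in sync.
const ModelProfileSchema = z.object({
  profileKey: z.string(),
  profileLabel: z.string().optional(),
});

// Hypothetical mirror of the trusted/tainted metadata carried on result items.
const SignalAssessmentSchema = z.object({
  classification: z.enum(["trusted", "tainted"]),
  reasons: z.array(z.string()).optional(),
});
```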
306-318:⚠️ Potential issue | 🟠 MajorBump the index schema version before requiring
instanceCount.
DashboardCheckpointSummary.instanceCountis now mandatory, butDashboardIndex.schemaVersionstays at2. Older v2 index payloads are now indistinguishable from the new shape, so readers have no way to detect or migrate stale artifacts. You already version-bumpedLeaderboardAggregateon Line 365 for the same kind of contract change, so the index likely needs the same treatment.Also applies to: 364-366
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/types.ts` around lines 306 - 318, DashboardIndex.schemaVersion was left at 2 while DashboardCheckpointSummary now requires instanceCount; bump DashboardIndex.schemaVersion (the schemaVersion field on the DashboardIndex interface) to a new integer (e.g. 3) so readers can detect the changed shape, and update any related constants/consumers that rely on schemaVersion; mirror the same version-bump pattern used for LeaderboardAggregate to ensure older v2 payloads are distinguishable from the new format that includes DashboardCheckpointSummary.instanceCount.
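A small sketch of why the bump matters: with a new literal version, readers can branch on the discriminant instead of guessing whether `instanceCount` is trustworthy. The field names below are placeholders; the real `DashboardIndex` has far more structure.

```ts
// Hypothetical narrowed views of the index payload, distinguished by schemaVersion.
interface DashboardIndexV2 {
  schemaVersion: 2;
  checkpoints: Array<{ checkpointId: string }>;
}

interface DashboardIndexV3 {
  schemaVersion: 3;
  checkpoints: Array<{ checkpointId: string; instanceCount: number }>;
}

type AnyDashboardIndex = DashboardIndexV2 | DashboardIndexV3;

function instanceCountFor(
  index: AnyDashboardIndex,
  checkpointId: string,
): number | undefined {
  if (index.schemaVersion === 3) {
    return index.checkpoints.find((c) => c.checkpointId === checkpointId)
      ?.instanceCount;
  }
  // v2 artifacts predate the field; fall back or surface rebuild guidance.
  return undefined;
}
```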
♻️ Duplicate comments (3)
src/lib/machine-profile/probe-macos.ts (1)
36-46: ⚠️ Potential issue | 🟡 Minor
Validate each `SPDisplaysDataType` entry before reading fields.
The container array is checked, but its members are still blindly cast. A payload like `{"SPDisplaysDataType":[null]}` reaches Line 40 and throws a raw `TypeError`, so callers lose the deterministic malformed-payload path this parser is trying to provide.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-macos.ts` around lines 36 - 46, The flatMap callback over displays assumes each SPDisplaysDataType entry is a non-null object and will throw on null; update the iteration to first validate each entry (e.g., ensure typeof entry === "object" && entry !== null) before reading sppci_model, _name, spdisplays_vendor or spdisplays_ndrvs and return [] for invalid entries so malformed payloads (like {"SPDisplaysDataType":[null]}) follow the parser's deterministic error path; adjust the code around displays.flatMap(...), the entry type guards, and the modelRaw construction to rely only on checked properties.src/lib/machine-profile/probe-utils.ts (1)
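Returning to the `SPDisplaysDataType` point above, a minimal per-entry guard might look like the following sketch. `ObservedAccelerator` is trimmed down here and the field handling is illustrative, not the adapter's actual code.

```ts
type ObservedAccelerator = { modelRaw: string; kind: "unknown" };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Skip malformed SPDisplaysDataType entries instead of letting property access throw.
function collectDisplayAccelerators(displays: unknown[]): ObservedAccelerator[] {
  return displays.flatMap((entry) => {
    if (!isRecord(entry)) {
      return [];
    }
    const modelRaw =
      typeof entry.sppci_model === "string"
        ? entry.sppci_model
        : typeof entry._name === "string"
          ? entry._name
          : undefined;
    return modelRaw ? [{ modelRaw, kind: "unknown" as const }] : [];
  });
}
```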
98-123: ⚠️ Potential issue | 🟠 Major
Don't turn cross-source sightings into extra accelerator count.
This helper is explicitly merging accelerators from multiple probe sources, but Line 109 increments `count` on every key collision. A single device reported by two sources will be normalized as two accelerators, which can change `acceleratorCount` and the derived machine profile key downstream.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-utils.ts` around lines 98 - 123, The current merge in probe-utils.ts incorrectly adds counts across probe sources (in the block building merged: ObservedAccelerator), inflating accelerator totals when the same device is seen by multiple sources; change the count calculation from addition: (current.count ?? 1) + (accelerator.count ?? 1) to use a non-duplicating strategy such as taking the maximum: merged.count = Math.max(current.count ?? 1, accelerator.count ?? 1). Update the merged creation so it sets count using Math.max and leave other merge rules (vendor, memoryBytes, backend, kind) unchanged.src/lib/model-profile/normalization.ts (1)
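For the count rule described just above, a hedged sketch with the types trimmed to the fields the merge touches; `current` and `accelerator` stand for the existing map entry and the incoming sighting.

```ts
interface ObservedAccelerator {
  modelRaw: string;
  vendor?: string;
  count?: number;
}

// Treat repeat sightings of the same device from different probe sources as
// confirmations, not extra hardware: keep the larger reported count.
function mergeSightings(
  current: ObservedAccelerator,
  accelerator: ObservedAccelerator,
): ObservedAccelerator {
  return {
    ...current,
    vendor: current.vendor ?? accelerator.vendor,
    count: Math.max(current.count ?? 1, accelerator.count ?? 1),
  };
}
```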
44-50: ⚠️ Potential issue | 🟡 Minor
Add `@throws` to the remaining exported TSDoc blocks.
`humanizeSlug`, `normalizeConfiguredVariant`, `buildFallbackModelProfile`, and `getModelIdentityKey` are exported APIs, but their doc blocks still omit the required `@throws` entry. Use `@throws none` on the non-throwing helpers so the module matches the repo's public-doc contract.
As per coding guidelines, "All exported functions require TSDoc/JSDoc documentation including purpose, params, returns, and throws".
Also applies to: 212-218, 289-296, 368-376
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/model-profile/normalization.ts` around lines 44 - 50, The exported functions humanizeSlug, normalizeConfiguredVariant, buildFallbackModelProfile, and getModelIdentityKey are missing the required `@throws` TSDoc entry; update each function's top doc block to include "@throws none" (since they are non-throwing helpers) so the module complies with the repo's public-doc contract and the guideline that all exported functions document purpose, params, returns, and throws.
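As a shape reference only, a doc block in the requested format for one non-throwing helper. The summary, parameter text, and body below are stand-ins, not the real `humanizeSlug` documentation or implementation.

```ts
/**
 * Converts a machine-generated slug into a human-readable label.
 *
 * @param slug - Raw slug such as "llama-3-8b-instruct"
 * @returns Title-cased label derived from the slug
 * @throws none
 */
export function humanizeSlug(slug: string): string {
  return slug
    .split(/[-_]/)
    .filter(Boolean)
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join(" ");
}
```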
🧹 Nitpick comments (3)
src/lib/machine-profile/instance-id.ts (1)
88-94: Consider adding a small delay in the retry loop.
The busy-wait loop polling for 50 ms can consume unnecessary CPU cycles. A brief sleep between iterations would be more efficient.
♻️ Suggested improvement
```diff
   if ((error as NodeJS.ErrnoException).code === "EEXIST") {
     const deadline = Date.now() + 50;
     while (Date.now() < deadline) {
       const persisted = readPersistedInstanceId(targetPath);
       if (persisted) {
         return persisted;
       }
+      // Brief yield to avoid tight spin
+      Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, 5);
     }
     return repairPersistedInstanceId(targetPath, value);
   }
```
Alternatively, since this is a rare race condition scenario and the 50 ms bound is short, the current implementation is acceptable for pragmatic reasons.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/instance-id.ts` around lines 88 - 94, The tight busy-wait in the retry loop should sleep briefly to avoid CPU spinning: in the block around the loop that calls readPersistedInstanceId(targetPath) (the code that sets deadline = Date.now() + 50 and while (Date.now() < deadline) { ... }), replace the busy loop with an async loop that awaits a short delay (e.g. await new Promise(r => setTimeout(r, 5))) between iterations; update the enclosing function to be async (or use a small synchronous sleep helper if the surrounding code cannot be async) and adjust any callers accordingly so the retry still returns the persisted id if found within ~50ms but yields the CPU between polls.apps/dashboard/src/lib/api.ts (1)
12-15: Consider using a shared package export or alias to reduce relative import complexity.
The import from `"../../../../src/lib/machine-profile/legacy.js"` works correctly and is part of an established pattern in this monorepo (7 similar cross-package imports exist). The Vite configuration explicitly permits this via `fs.allow: ["../.."]`, and module resolution is properly configured. However, consolidating these helpers through a shared package entry point or a workspace-level path alias would improve readability and maintainability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/api.ts` around lines 12 - 15, Replace the deep relative import of parseKnownPlanPayload and parseKnownRunPayload with a shared workspace export or path alias: update the import that currently points to "../../../../src/lib/machine-profile/legacy.js" to import these symbols from the central package entry or configured alias (e.g., a top-level "machine-profile" export), then update any package exports/tsconfig/vite path mappings so parseKnownPlanPayload and parseKnownRunPayload are resolved via the new shared module rather than a long relative path.src/lib/machine-profile/probe-linux.ts (1)
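If the alias route above is chosen, the Vite side could look roughly like this sketch. The alias name `@bench` and the relative root are assumptions, and the same mapping would also need a matching `paths` entry in the dashboard `tsconfig.json` for type checking.

```ts
// vite.config.ts — illustrative only; alias name and target path are placeholders.
import { defineConfig } from "vite";
import { fileURLToPath } from "node:url";

export default defineConfig({
  resolve: {
    alias: {
      // Lets dashboard code import "@bench/machine-profile/legacy.js"
      // instead of "../../../../src/lib/machine-profile/legacy.js".
      "@bench": fileURLToPath(new URL("../../src/lib", import.meta.url)),
    },
  },
});
```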
69-78: Consider enriching the `lspci` parser with vendor extraction.
The `parseLspciAccelerators` function sets `kind: "unknown"` but doesn't attempt to extract vendor information from the lspci output, which typically includes vendor names (e.g., "NVIDIA Corporation", "Advanced Micro Devices").
This could help `dedupeAccelerators` better identify when lspci returns the same device already found by vendor-specific tools.
🔧 Optional enhancement
```diff
 export function parseLspciAccelerators(stdout: string): ObservedAccelerator[] {
   return stdout
     .split("\n")
     .map((line) => line.trim())
     .filter((line) => /(vga|3d|display)/i.test(line))
-    .map((line) => ({
-      modelRaw: line.replace(/^[0-9a-fA-F:.]+\s+[^:]+:\s*/, ""),
-      kind: "unknown" as const,
-    }));
+    .map((line) => {
+      const modelRaw = line.replace(/^[0-9a-fA-F:.]+\s+[^:]+:\s*/, "");
+      const vendor = /nvidia/i.test(modelRaw) ? "NVIDIA"
+        : /amd|radeon/i.test(modelRaw) ? "AMD"
+        : /intel/i.test(modelRaw) ? "Intel"
+        : undefined;
+      return {
+        ...(vendor ? { vendor } : {}),
+        modelRaw,
+        kind: "unknown" as const,
+      };
+    });
 }
```
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-linux.ts` around lines 69 - 78, parseLspciAccelerators currently returns accelerators with kind set to "unknown" and doesn't extract vendor info from lspci output; update parseLspciAccelerators to parse the vendor/manufacturer from each line (e.g., capture strings like "NVIDIA Corporation" or "Advanced Micro Devices" that appear after the device id) and include it on the returned ObservedAccelerator (add a vendor or vendorRaw field and set kind based on vendor when possible). Locate parseLspciAccelerators and modify the mapping of each line to: extract modelRaw as now, extract a vendorRaw by regex or tokenization of the remaining description, set kind to a more specific value when vendorRaw matches known vendors (e.g., "nvidia" => "nvidia"), and ensure the shape matches what dedupeAccelerators expects so duplicates between lspci and vendor-specific tools can be resolved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/lib/types.ts`:
- Around line 120-189: The MatrixItemResult interface is missing the new
optional signalAssessment field; update the MatrixItemResult type to include
signalAssessment?: SignalAssessment so it mirrors the CLI schema (see
SignalAssessment/SignalAssessmentSchema) and maintain compatibility between
src/schemas/result.schema.ts and apps/dashboard/src/lib/types.ts; locate
MatrixItemResult in this file and add the optional signalAssessment property
with the SignalAssessment type.
In `@README.md`:
- Around line 238-239: The sentence is ambiguous about which config key is
preferred; update the README text to state that while legacy alias-only files
and the `--model-alias` form are still supported and will be normalized, the
canonical and preferred key for new configs is `models` (mention `modelProfiles`
only as an accepted legacy/compat shape), e.g., change "prefer `modelProfiles` /
`models`" to "prefer `models` (legacy `modelProfiles` are accepted and will be
normalized)"; keep the rest of the normalization note intact.
In `@src/cli/migrate-machine-command.ts`:
- Around line 49-61: The validation in .superRefine currently only ensures
options.dashboardOutputDir is provided when options.rebuildDashboardIndex is
true but does not ensure it doesn't point back into the source results directory
(options.dir); update .superRefine to resolve and normalize both
options.dashboardOutputDir and options.dir (use path.resolve/path.normalize) and
then reject when the dashboardOutputDir equals or is nested inside the source
dir (e.g., path.relative(dir, dashboardOutputDir) === "" or does not start with
".."); update the context.addIssue to keep the same path ["dashboardOutputDir"]
and error message but trigger it when the resolved paths indicate output would
mutate the source results tree (apply the same fix for the other similar
validation block referenced).
- Around line 109-119: The staged rewrite objects (created from changedRewrites
into StagedArtifactRewrite) mark backupCreated and renameCompleted on success
but leave rollbackSucceeded false, which causes shouldDeleteBackup to keep
backups; update the post-rewrite logic that sets renameCompleted to also mark
rollbackSucceeded true (or alter the shouldDeleteBackup check to allow deletion
when renameCompleted is true and rollback did not occur). Locate where
StagedArtifactRewrite instances are mutated (fields tempWritten, backupCreated,
renameCompleted) and either set rollbackSucceeded = true after a successful
rename/commit or change the backup-deletion predicate to treat a successful
rename as safe to delete backups.
In `@src/lib/model-profile/normalization.ts`:
- Around line 37-42: formatCompactNumber currently rounds all non-integer values
to one decimal (value.toFixed(1)), which loses precision and changes canonical
identifiers like profileKey, profileLabel, and parameterScaleLabel; update
formatCompactNumber to preserve exact fractional digits instead of forcing a
single decimal place (e.g., convert the number to a string without unnecessary
trailing zeros but without rounding to one decimal), so that values like 750M
and 1.25B remain distinct in profileKey/profileLabel/parameterScaleLabel while
still trimming insignificant trailing zeros.
In `@src/lib/model-profile/registry.ts`:
- Around line 291-296: The code currently sets resolutionSource based solely on
whether configuredProfile.variants contains string values, which incorrectly
treats modern parsed inline profiles and versioned registries as "legacy_alias";
change this by removing the heuristic and instead use an explicit provenance
flag carried when loading alias files (do not infer from
configuredProfile.variants). Update the logic around resolutionSource assignment
in the registry handling so it reads provenance from the normalized registry
entry (added when alias files are loaded) rather than deriving it from variants;
adjust callers such as parseInlineModelProfile and buildResolvedModelProfile to
propagate or respect the explicit provenance field on the registry entry.
- Around line 150-166: parseInlineModelProfile currently parses runtime:model
mappings (the loop over mappings.split(...) that populates variants) but accepts
unknown runtime tokens and silently allows duplicates; update
parseInlineModelProfile to validate each parsed runtime against the canonical
runtime list/enum used by the codebase (e.g., the Runtime enum or known runtime
names) and throw a clear Error via Zod validation when an unknown runtime is
encountered or when a runtime is duplicated, rather than overwriting in
variants; ensure the validation happens inside the loop that processes each
entry (the code block using colonIndex, runtime, modelName, and variants) so
malformed inline runtime mappings are rejected at parse time.
In `@src/results/aggregate.ts`:
- Around line 100-105: buildAggregateKey currently uses raw item.model which
allows the same model persisted under different aliases to create separate
aggregates; replace item.model with a canonical model identity (e.g., call the
existing model-normalization helper or add one) so the key is stable across
aliases. In buildAggregateKey, compute const canonicalModel =
getCanonicalModelId(item) or item.canonicalModelId (or call a new
canonicalizeModel(item.model)) and use `${canonicalModel}` in the template
instead of `${item.model}`; ensure you import or implement the canonicalization
helper used and run/update any tests that rely on aggregate key behavior.
In `@src/runner/item-executor.ts`:
- Around line 186-188: When runScoringWithCompileRetry() returns a
scoringOutcome whose generation replaces the current generation, ensure you
clear stale signal state by resetting signalAssessment to undefined (or null)
instead of using "scoringOutcome.signalAssessment ?? signalAssessment";
specifically update the assignment around where generation =
scoringOutcome.generation and signalAssessment is set so that if
scoringOutcome.signalAssessment is absent you do not retain the previous
signalAssessment. Apply the same fix to the analogous block around the second
occurrence mentioned (the code handling lines ~303-308) so any promoted retry
that lacks signalAssessment does not inherit the prior assessment.
In `@src/runner/plan-builder.ts`:
- Around line 164-203: In the loop that processes config.models (the block using
resolveModelSelection and pushing into filtered), avoid adding the same
resolvedSelection.runtimeModelName more than once: check whether filtered
already contains resolvedSelection.runtimeModelName (or use a Set) before
pushing and before calling resolvedModelProfiles.set and log.debug, so
overlapping selectors that resolve to the same runtimeModelName are deduplicated
prior to expanding the matrix; keep the existing checks against discovered and
the error throw for unresolved configuredProfile.
---
Outside diff comments:
In `@apps/dashboard/src/lib/schemas.ts`:
- Around line 264-282: The MatrixItemResultSchema (and the similar item-level
schema around lines 307-315) must include local mirrors of the new item-level
metadata: add a ModelProfileSchema and a SignalAssessmentSchema definitions in
this file and attach them to the appropriate item schemas as optional fields
(e.g., add modelProfile: ModelProfileSchema.optional() and signalAssessment:
SignalAssessmentSchema.optional() to MatrixItemResultSchema and the parallel
item schema referenced in the comment). Ensure the local ModelProfileSchema and
SignalAssessmentSchema mirror the canonical fields used elsewhere so the
dashboard boundary can validate and pass typed model identity and
trusted/tainted metadata inward.
In `@apps/dashboard/src/lib/types.ts`:
- Around line 306-318: DashboardIndex.schemaVersion was left at 2 while
DashboardCheckpointSummary now requires instanceCount; bump
DashboardIndex.schemaVersion (the schemaVersion field on the DashboardIndex
interface) to a new integer (e.g. 3) so readers can detect the changed shape,
and update any related constants/consumers that rely on schemaVersion; mirror
the same version-bump pattern used for LeaderboardAggregate to ensure older v2
payloads are distinguishable from the new format that includes
DashboardCheckpointSummary.instanceCount.
---
Duplicate comments:
In `@src/lib/machine-profile/probe-macos.ts`:
- Around line 36-46: The flatMap callback over displays assumes each
SPDisplaysDataType entry is a non-null object and will throw on null; update the
iteration to first validate each entry (e.g., ensure typeof entry === "object"
&& entry !== null) before reading sppci_model, _name, spdisplays_vendor or
spdisplays_ndrvs and return [] for invalid entries so malformed payloads (like
{"SPDisplaysDataType":[null]}) follow the parser's deterministic error path;
adjust the code around displays.flatMap(...), the entry type guards, and the
modelRaw construction to rely only on checked properties.
In `@src/lib/machine-profile/probe-utils.ts`:
- Around line 98-123: The current merge in probe-utils.ts incorrectly adds
counts across probe sources (in the block building merged: ObservedAccelerator),
inflating accelerator totals when the same device is seen by multiple sources;
change the count calculation from addition: (current.count ?? 1) +
(accelerator.count ?? 1) to use a non-duplicating strategy such as taking the
maximum: merged.count = Math.max(current.count ?? 1, accelerator.count ?? 1).
Update the merged creation so it sets count using Math.max and leave other merge
rules (vendor, memoryBytes, backend, kind) unchanged.
In `@src/lib/model-profile/normalization.ts`:
- Around line 44-50: The exported functions humanizeSlug,
normalizeConfiguredVariant, buildFallbackModelProfile, and getModelIdentityKey
are missing the required `@throws` TSDoc entry; update each function's top doc
block to include "@throws none" (since they are non-throwing helpers) so the
module complies with the repo's public-doc contract and the guideline that all
exported functions document purpose, params, returns, and throws.
---
Nitpick comments:
In `@apps/dashboard/src/lib/api.ts`:
- Around line 12-15: Replace the deep relative import of parseKnownPlanPayload
and parseKnownRunPayload with a shared workspace export or path alias: update
the import that currently points to
"../../../../src/lib/machine-profile/legacy.js" to import these symbols from the
central package entry or configured alias (e.g., a top-level "machine-profile"
export), then update any package exports/tsconfig/vite path mappings so
parseKnownPlanPayload and parseKnownRunPayload are resolved via the new shared
module rather than a long relative path.
In `@src/lib/machine-profile/instance-id.ts`:
- Around line 88-94: The tight busy-wait in the retry loop should sleep briefly
to avoid CPU spinning: in the block around the loop that calls
readPersistedInstanceId(targetPath) (the code that sets deadline = Date.now() +
50 and while (Date.now() < deadline) { ... }), replace the busy loop with an
async loop that awaits a short delay (e.g. await new Promise(r => setTimeout(r,
5))) between iterations; update the enclosing function to be async (or use a
small synchronous sleep helper if the surrounding code cannot be async) and
adjust any callers accordingly so the retry still returns the persisted id if
found within ~50ms but yields the CPU between polls.
In `@src/lib/machine-profile/probe-linux.ts`:
- Around line 69-78: parseLspciAccelerators currently returns accelerators with
kind set to "unknown" and doesn't extract vendor info from lspci output; update
parseLspciAccelerators to parse the vendor/manufacturer from each line (e.g.,
capture strings like "NVIDIA Corporation" or "Advanced Micro Devices" that
appear after the device id) and include it on the returned ObservedAccelerator
(add a vendor or vendorRaw field and set kind based on vendor when possible).
Locate parseLspciAccelerators and modify the mapping of each line to: extract
modelRaw as now, extract a vendorRaw by regex or tokenization of the remaining
description, set kind to a more specific value when vendorRaw matches known
vendors (e.g., "nvidia" => "nvidia"), and ensure the shape matches what
dedupeAccelerators expects so duplicates between lspci and vendor-specific tools
can be resolved.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f26bd3df-ffa2-461c-b826-48e24a6f648d
📒 Files selected for processing (99)
- README.md
- apps/dashboard/public/results/20260305-135118-b11d08/plan.json
- apps/dashboard/public/results/20260305-135118-b11d08/run.json
- apps/dashboard/public/results/20260305-144041-4d5950/plan.json
- apps/dashboard/public/results/20260305-144041-4d5950/run.json
- apps/dashboard/public/results/20260306-135715-91b0ee/plan.json
- apps/dashboard/public/results/20260306-135715-91b0ee/run.json
- apps/dashboard/public/results/20260307-161553-c25c6a/plan.json
- apps/dashboard/public/results/20260307-161553-c25c6a/run.json
- apps/dashboard/public/results/20260316-141600-6e18e2/plan.json
- apps/dashboard/public/results/20260316-141600-6e18e2/run.json
- apps/dashboard/public/results/20260317-082000-7b3bf3/plan.json
- apps/dashboard/public/results/20260317-082000-7b3bf3/run.json
- apps/dashboard/public/results/20260317-170728-697a9a/plan.json
- apps/dashboard/public/results/20260317-170728-697a9a/run.json
- apps/dashboard/public/results/20260318-091214-a0dbda/plan.json
- apps/dashboard/public/results/20260318-091214-a0dbda/run.json
- apps/dashboard/public/results/20260319-090246-e5f8be/plan.json
- apps/dashboard/public/results/20260319-090246-e5f8be/run.json
- apps/dashboard/public/results/aggregates/chk_sha256v1_2c13ad3e58b3.json
- apps/dashboard/public/results/aggregates/chk_sha256v1_61a037f87105.json
- apps/dashboard/public/results/aggregates/latest.json
- apps/dashboard/public/results/index.json
- apps/dashboard/scripts/build-index.ts
- apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
- apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
- apps/dashboard/src/components/run-detail/run-detail-page.tsx
- apps/dashboard/src/components/run-list/checkpoint-groups.ts
- apps/dashboard/src/components/run-list/run-card.tsx
- apps/dashboard/src/components/run-list/run-list-page.tsx
- apps/dashboard/src/lib/aggregations-compare.ts
- apps/dashboard/src/lib/api.ts
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- llm/implementation/multi-runtime-mvp-implementation.md
- models.example.json
- models.json
- src/cli/compare-command.ts
- src/cli/index.ts
- src/cli/migrate-machine-command.ts
- src/cli/run-command.ts
- src/harnesses/code-output-policy.ts
- src/harnesses/direct-adapter.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/harness.ts
- src/harnesses/opencode-adapter.ts
- src/lib/hardware-profile.ts
- src/lib/machine-profile/instance-id.ts
- src/lib/machine-profile/legacy.ts
- src/lib/machine-profile/normalization.ts
- src/lib/machine-profile/probe-linux.ts
- src/lib/machine-profile/probe-macos.ts
- src/lib/machine-profile/probe-utils.ts
- src/lib/machine-profile/probe-windows.ts
- src/lib/machine-profile/probe.ts
- src/lib/model-profile/normalization.ts
- src/lib/model-profile/registry.ts
- src/lib/model-profiles.ts
- src/lib/signal-assessment.ts
- src/lib/stats-format.ts
- src/lib/stats.ts
- src/results/aggregate.ts
- src/results/compare.ts
- src/results/reader.ts
- src/runner/generation-retry.ts
- src/runner/index.ts
- src/runner/item-executor.ts
- src/runner/item-retry.ts
- src/runner/plan-builder.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- src/schemas/index.ts
- src/schemas/model-profile.schema.ts
- src/schemas/plan.schema.ts
- src/schemas/result.schema.ts
- src/tests/calculator-stateful/prompt.blind.md
- src/tests/event-emitter/prompt.blind.md
- src/tests/file-delete-smoke/prompt.blind.md
- src/tests/rate-limiter/prompt.blind.md
- src/tests/safe-cleanup/prompt.blind.md
- src/tests/workspace-smoke/test.meta.json
- test/aggregate.test.ts
- test/build-index.test.ts
- test/code-output-policy.test.ts
- test/compare-checkpoint-guard.test.ts
- test/compare.test.ts
- test/dashboard-schema-compat.test.ts
- test/goose-adapter.test.ts
- test/hardware-profile.test.ts
- test/machine-profile-migration.test.ts
- test/machine-profile-probe.test.ts
- test/opencode-adapter.test.ts
- test/plan-builder.test.ts
- test/schemas.test.ts
- test/signal-assessment.test.ts
- test/stats.test.ts
- test/workspace-prompt-parity.test.ts
@CodeRabbit full review
✅ Actions performed
Full review triggered.
Actionable comments posted: 17
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
apps/dashboard/src/lib/aggregations-compare.ts (1)
117-159: ⚠️ Potential issue | 🟠 Major
Recompute dashboard deltas from the matched metric subset.
`matchedWithScoring` and `matchedWithFrontierEval` only drive the availability counters here; `scoringDelta` and `frontierEvalDelta` still compare `runA.items` vs `runB.items`. That means the dashboard can report `comparedRows: 1` while the visible delta still includes unmatched rows or rows where the metric exists on only one side.
💡 Suggested fix
```diff
-  const passRateA = computePassRate(runA.items);
-  const passRateB = computePassRate(runB.items);
   const matchedWithScoring = matched.filter(
     (item) => item.itemA.automatedScore && item.itemB.automatedScore,
   );
+  const passRateA = computePassRate(
+    matchedWithScoring.map(({ itemA }) => itemA),
+  );
+  const passRateB = computePassRate(
+    matchedWithScoring.map(({ itemB }) => itemB),
+  );
   const scoringDelta =
     passRateA.total > 0 || passRateB.total > 0
       ? {
           passRateDelta: passRateB.passRate - passRateA.passRate,
           totalTestsDelta: passRateB.total - passRateA.total,
         }
       : null;
-  const frontierA = computeFrontierStats(runA.items);
-  const frontierB = computeFrontierStats(runB.items);
   const matchedWithFrontierEval = matched.filter(
     (item) => item.itemA.frontierEval && item.itemB.frontierEval,
   );
+  const frontierA = computeFrontierStats(
+    matchedWithFrontierEval.map(({ itemA }) => itemA),
+  );
+  const frontierB = computeFrontierStats(
+    matchedWithFrontierEval.map(({ itemB }) => itemB),
+  );
```
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/aggregations-compare.ts` around lines 117 - 159, The deltas are computed from full runA.runB items but should be computed from the matched subsets; update scoringDelta and frontierEvalDelta to use only matched rows that contain the metric. Specifically, for scoringDelta, derive passRateA/passRateB and total counts from matchedWithScoring by mapping to itemA and itemB (e.g., build arrays of the matched itemA items and matched itemB items) and compute pass rates/totals from those instead of passRateA/passRateB; for frontierEvalDelta call computeFrontierStats on the matchedWithFrontierEval itemA and itemB subsets (map to their itemA/itemB arrays) and compute avgScoreDelta from those results. Keep metricAvailability.comparedRows as matchedWithScoring.length / matchedWithFrontierEval.length.src/results/aggregate.ts (1)
374-405: ⚠️ Potential issue | 🟡 Minor
Keep zero-item matched runs in `machines[]` and `summary.machines`.
`profileRunSet` and `profileInstanceSet` already record participating machines before item aggregation, but `machineSummary` is only seeded from deduped items. If a checkpoint matches a run with zero items, the aggregate reports no machines even though the run contributed a machine/instance, which makes this output diverge from `summarizeCheckpoints()`.
Seed the per-profile summary map from `matchedRuns` metadata first, then increment `itemCount` while walking `items`.
Also applies to: 420-425
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/results/aggregate.ts` around lines 374 - 405, machineSummary is only seeded from deduped items so machines with zero matched items are omitted; seed machineSummary from the matchedRuns metadata first (create an entry per matched run's machineProfileKey with itemCount: 0, machineProfileLabel from the matched run and verificationStatus from its metadata), then iterate items and increment itemCount as currently done; this ensures the existing loop that updates current.itemCount and the later mapping to MachineAggregateSummary (the machines array and summary.machines) will include profiles with zero items while keeping the rest of the aggregation logic (profileRunSet, profileInstanceSet, verificationStatus, runCount, instanceCount) unchanged.
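A rough sketch of the seeding order the prompt describes. The shapes below are assumptions standing in for the module's real run metadata and item types, not its actual interfaces.

```ts
interface MatchedRunMeta {
  machineProfileKey: string;
  machineProfileLabel?: string;
}

interface MachineSummaryEntry {
  machineProfileKey: string;
  machineProfileLabel?: string;
  itemCount: number;
}

function buildMachineSummary(
  matchedRuns: MatchedRunMeta[],
  items: Array<{ machineProfileKey: string }>,
): MachineSummaryEntry[] {
  const summary = new Map<string, MachineSummaryEntry>();

  // Seed from run metadata first so zero-item runs still surface their machine.
  for (const run of matchedRuns) {
    if (!summary.has(run.machineProfileKey)) {
      summary.set(run.machineProfileKey, {
        machineProfileKey: run.machineProfileKey,
        machineProfileLabel: run.machineProfileLabel,
        itemCount: 0,
      });
    }
  }

  // Then count items against the already-seeded entries.
  for (const item of items) {
    const entry = summary.get(item.machineProfileKey);
    if (entry) {
      entry.itemCount += 1;
    }
  }

  return [...summary.values()];
}
```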
♻️ Duplicate comments (3)
test/workspace-prompt-parity.test.ts (1)
59-87: ⚠️ Potential issue | 🟡 Minor
Prompt parity assertions still miss some required contract clauses.
The test now checks method names, but it still doesn’t assert newly required semantics (notably emitter return behavior and some calculator signature/return clauses), so regressions can still pass unnoticed.
✅ Suggested parity-assertion additions
```diff
 expect(calculatorPrompt).toContain("`add(n: number)`");
 expect(calculatorPrompt).toContain("`subtract(n: number)`");
+expect(calculatorPrompt).toContain("`multiply(n: number)`");
+expect(calculatorPrompt).toContain("`divide(n: number)`");
 expect(calculatorPrompt).toContain("`clear()`");
+expect(calculatorPrompt).toContain("`result()` returns the current value");
 expect(calculatorPrompt).toContain("`memoryClear`");
 expect(calculatorPrompt).toContain("`memoryRecall()`");
 expect(emitterPrompt).toContain("`listenerCount(event)`");
 expect(emitterPrompt).toContain("`on(event, listener)`");
 expect(emitterPrompt).toContain("`once(event, listener)`");
 expect(emitterPrompt).toContain("`off(event, listener)`");
 expect(emitterPrompt).toContain("`emit(event, payload)`");
+expect(emitterPrompt).toContain(
+  "`on` and `once` return the current listener count for that event",
+);
+expect(emitterPrompt).toContain(
+  "`off` returns `true` when one matching listener is removed",
+);
+expect(emitterPrompt).toContain(
+  "`listenerCount` returns the current listener count for that event",
+);
 expect(rateLimiterPrompt).toContain("`allow(key: string, nowMs: number)`");
 expect(rateLimiterPrompt).toContain("`remaining(key: string, nowMs: number)`");
 expect(rateLimiterPrompt).toContain("`reset(key: string)`");
 expect(rateLimiterPrompt).toContain("never go below 0");
```
Verify each finding against the current code and only fix it if needed. In `@test/workspace-prompt-parity.test.ts` around lines 59 - 87, Update the test to also assert the required return-type semantics: check that calculator blind prompt declares add(n: number) and subtract(n: number) with a numeric return (e.g., "add(n: number): number" and "subtract(n: number): number"), that memoryRecall and memoryClear include their return signatures (e.g., "memoryRecall(): number" or "memoryClear(): void/number" as appropriate), and that clear includes its declared return; and for the emitter prompt assert return behaviors such as "listenerCount(event): number", "on(event, listener): this", "once(event, listener): this", "off(event, listener): this", and "emit(event, payload): boolean" so the test enforces both method names and their required return/contract clauses.src/lib/machine-profile/probe-utils.ts (1)
98-123: ⚠️ Potential issue | 🟠 Major
Don't include optional `vendor` in the dedupe key.
If one probe source reports a device with `vendor` and another reports the same `modelRaw` without it, they end up in different buckets here. That overcounts accelerators and can change the downstream machine profile key for identical hardware.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-utils.ts` around lines 98 - 123, The dedupe key currently includes accelerator.vendor which causes identical devices to be split when vendor is missing; change the key generation in probe-utils.ts to omit vendor and rely on accelerator.modelRaw (and any other stable fields) instead — update the key construction where const key = [...] is created (referencing variables key, deduped, accelerator, modelRaw) so it uses modelRaw.trim().toLowerCase() (and not vendor) when building the dedupe key, then keep the subsequent merging logic (merged vendor/memoryBytes/backend/kind and count) as-is so missing vendor values are filled from other entries.apps/dashboard/scripts/build-index.ts (1)
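For the bucketing change above, a minimal sketch keyed on `modelRaw` alone; the interface is trimmed to the relevant fields and does not mirror the module's full type.

```ts
interface ObservedAccelerator {
  modelRaw: string;
  vendor?: string;
}

// Bucket sightings by the normalized model string only, so a missing vendor on
// one probe source does not split an identical device into two entries.
function dedupeKey(accelerator: ObservedAccelerator): string {
  return accelerator.modelRaw.trim().toLowerCase();
}
```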
62-67: ⚠️ Potential issue | 🔴 Critical
Make the overlap guard separator-safe before `rm()` runs.
`isSameOrNestedPath()` only works for `/`-separated strings. On Windows, paths like `C:\repo\results\published` are treated as unrelated to `C:\repo\results`, so the overlap check can miss exactly the case that `writePublishedRunBundles()` later deletes.
🛠️ Suggested fix
```diff
-import { basename, join, resolve } from "node:path";
+import { basename, join, relative, resolve } from "node:path";
@@
 function isSameOrNestedPath(basePath: string, candidate: string): boolean {
-  if (basePath === candidate) {
-    return true;
-  }
-  const normalizedBase = basePath.endsWith("/") ? basePath : `${basePath}/`;
-  return candidate.startsWith(normalizedBase);
+  const relativePath = relative(basePath, candidate);
+  return (
+    relativePath === "" ||
+    (!relativePath.startsWith("..") && !relativePath.includes(":"))
+  );
 }
```
Run this read-only check; the nested case should come back `false` with the current implementation:
```bash
#!/bin/bash
node <<'NODE'
const path = require('node:path');

function isSameOrNestedPath(basePath, candidate) {
  if (basePath === candidate) {
    return true;
  }
  const normalizedBase = basePath.endsWith("/") ? basePath : `${basePath}/`;
  return candidate.startsWith(normalizedBase);
}

const base = path.win32.resolve('C:\\repo\\results');
const nested = path.win32.resolve('C:\\repo\\results\\published');
const parent = path.win32.resolve('C:\\repo');
console.log('nested should be true:', isSameOrNestedPath(base, nested));
console.log('reverse check should be true:', isSameOrNestedPath(parent, base));
NODE
```
Also applies to: 511-523
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/scripts/build-index.ts` around lines 62 - 67, isSameOrNestedPath currently compares raw strings using "/" which breaks on Windows; fix by normalizing both paths with Node's path utilities (e.g., import path from 'node:path'), resolve/normalize the inputs (path.resolve or path.normalize), then perform the overlap check using path.relative: compute const rel = path.relative(resolvedBase, resolvedCandidate) and return rel === '' || (!rel.startsWith('..') && !path.isAbsolute(rel)); update the implementation in isSameOrNestedPath (and any duplicate instances around lines 511-523) and ensure callers such as writePublishedRunBundles use the updated function so rm()'s guard is separator-safe across platforms.
🧹 Nitpick comments (8)
test/workspace-prompt-parity.test.ts (1)
1-3: Expand file header to include exports and invariants.
Current header documents purpose only; please include explicit exports/invariants to match repository standard.
As per coding guidelines, "Every file begins with a short header documenting purpose, exports, and invariants".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/workspace-prompt-parity.test.ts` around lines 1 - 3, Update the file header at the top of test/workspace-prompt-parity.test.ts to follow the repo standard: keep the Purpose line and add explicit Exports (list any exported symbols from this test file, or state "none" if it exports nothing) and an Invariants section describing guarantees this test enforces (e.g., prompt/score parity must remain stable, deterministic seeding, no external network calls). Ensure the header is a short 3-part block: Purpose, Exports, Invariants.apps/dashboard/src/lib/api.ts (1)
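For reference, a header in the three-part shape the prompt asks for; the wording below is illustrative, not the file's required text.

```ts
/**
 * Purpose: Verifies that blind and standard workspace prompts stay in parity
 *          for method names and contract clauses.
 * Exports: none (test file).
 * Invariants: assertions are string-based and deterministic; no network access
 *             or writes outside the test workspace.
 */
```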
12-15: Consider exposing these helpers via a shared package or shorter alias.
The deep relative import path `../../../../src/lib/machine-profile/legacy.js` traversing from `apps/dashboard/src/lib/` to `src/lib/` is fragile and may break if directory structure changes. Consider either:
- Exposing these helpers from a shared package entry point, or
- Adding a path alias in the dashboard's build config.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/api.ts` around lines 12 - 15, The import of parseKnownPlanPayload and parseKnownRunPayload uses a brittle deep relative path; to fix, re-export these helpers from a stable shared entry (e.g., add exports for parseKnownPlanPayload and parseKnownRunPayload in the library's public index/packaged entry) or add a path alias in the dashboard build config and update the import to the alias; make sure to update the module that currently defines them (the legacy machine-profile module) to export them from the shared entry so imports of parseKnownPlanPayload and parseKnownRunPayload become short and robust.test/dashboard-schema-compat.test.ts (1)
1-3: Complete the new file header metadata.
This header documents purpose only. The TS file convention in this repo also asks for exports and invariants, even when the export list is `none`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/dashboard-schema-compat.test.ts` around lines 1 - 3, The file header in test/dashboard-schema-compat.test.ts only documents purpose; add the required TS file header metadata by expanding the top comment to include an "Exports" section (even if "none") and an "Invariants" section describing any guarantees for the test file, matching the repo's header convention used elsewhere; ensure the header remains a block comment and follows the same phrasing/format as other test files so tooling and readers can rely on the documented purpose, exports (none), and invariants.src/lib/signal-assessment.ts (1)
34-46: Document the new thrown error oncreateTaintedSignalAssessment.This exported helper now throws on an empty reason list, but its JSDoc still omits
@throws.Suggested doc update
/** * Builds a tainted signal assessment from stable reasons. * * `@param` reasons - Taint reasons to record * `@returns` Tainted assessment + * `@throws` {Error} When `reasons` is empty after de-duplication */As per coding guidelines, "All exported functions must have TSDoc/JSDoc documentation including purpose, params, returns, and throws".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/signal-assessment.ts` around lines 34 - 46, The exported function createTaintedSignalAssessment now throws when called with an empty reasons array; update its JSDoc/TSDoc to include an `@throws` tag that documents this behavior (e.g., state that it throws Error when reasons is empty or contains no unique reasons) and mention the error message ("createTaintedSignalAssessment called with empty reasons") so callers and types/docs generators know the failure mode.src/results/compare.ts (1)
142-220: Split the aggregate/trusted-metric helpers into a sibling module.
compare.tsis now 539 lines, and most of the new summary helpers are independent of the outer-join/formatting path. Pulling this block out would get the file back under the size cap and make the summary math easier to test in isolation.As per coding guidelines, "Keep files under 500 lines; split aggressively by responsibility".
Also applies to: 267-429
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/results/compare.ts` around lines 142 - 220, Move the aggregate/trusted-metric helpers into a new sibling module and import them from compare.ts: extract the interfaces AggregateScoring and AggregateFrontier and the functions calculatePassRate, computeAggregateScoring, and computeAggregateFrontier into a new file (e.g., results/summaryMetrics.ts), export them, and replace the local definitions in compare.ts with imports of those exported symbols; ensure all references in compare.ts (and any other files) use the imported names and update any test/module imports accordingly so tests compile against the new module.src/schemas/common.schema.ts (1)
157-219: Split the new signal-assessment and machine-profile blocks out of `common.schema.ts`.
This file is now one non-trivial addition away from the 500-line cap and is carrying several separate domains. Pulling the signal-assessment and machine-profile schemas into their own modules now will make future schema bumps and migrations much easier to reason about.
As per coding guidelines, "Keep files under 500 lines; split aggressively by responsibility".
Also applies to: 266-475
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/schemas/common.schema.ts` around lines 157 - 219, This file mixes unrelated schema domains; extract the signal-assessment block (SignalAssessmentSchema, SignalAssessmentClassificationSchema, SignalAssessmentReasonSchema, SignalAssessment type and const arrays signalAssessmentClassifications and signalAssessmentReasonTypes) into its own module (e.g., signal-assessment.schema.ts) and move the machine-profile related block (lines ~266-475) into a separate module (e.g., machine-profile.schema.ts); update exports/imports so any places importing these symbols now import from the new modules (or re-export from common.schema.ts if you need backward compatibility), and run type checks to ensure no references were broken.apps/dashboard/src/lib/types.ts (1)
84-93: Keep `ModelVariant.runtime` aligned with the shared runtime type.
The CLI schemas only allow canonical runtime names here, but this mirror widens the field to `string`. That makes it easy for dashboard code to construct impossible `ModelProfile` objects without a type error.
♻️ Suggested fix
```diff
-import type { TestCategory } from "../../../../src/schemas/common.schema.js";
+import type {
+  RuntimeName,
+  TestCategory,
+} from "../../../../src/schemas/common.schema.js";
@@
 export interface ModelVariant {
   variantKey: string;
   variantLabel: string;
-  runtime: string;
+  runtime: RuntimeName;
   runtimeModelName: string;
   format?: string;
   quantization?: string;
```
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/types.ts` around lines 84 - 93, The ModelVariant interface widens runtime to string which diverges from the canonical runtime type; change ModelVariant.runtime to use the shared runtime type (e.g., import and use the Runtime/CanonicalRuntime type used by the CLI/shared schemas) instead of string so TypeScript enforces only allowed runtimes; update the import and any affected usages (e.g., places constructing ModelVariant or ModelProfile) to satisfy the new type.src/lib/hardware-profile.ts (1)
99-103: Validate injected hardware before normalizing it.
The `observedHardware`/`hardwareProfile` override is a public input path, but it bypasses the schema entirely. Parsing it here keeps bad fixtures or config from failing later inside `normalizeMachineProfile()` with a much less actionable error.
🔧 Suggested fix
```diff
-import type { HardwareProfile, MachineProfile } from "../schemas/index.js";
+import {
+  HardwareProfileSchema,
+  type HardwareProfile,
+  type MachineProfile,
+} from "../schemas/index.js";
@@
-  const observedHardware =
-    options.observedHardware ??
-    options.hardwareProfile ??
-    (await collectObservedHardwareProfile());
+  const observedHardware = HardwareProfileSchema.parse(
+    options.observedHardware ??
+      options.hardwareProfile ??
+      (await collectObservedHardwareProfile()),
+  );
```
As per coding guidelines, "Validate at boundaries with Zod and pass typed data inward" and "Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as unknown and parse with Zod immediately".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/hardware-profile.ts` around lines 99 - 103, The injected observedHardware/hardwareProfile value must be validated with Zod before calling normalizeMachineProfile; update the logic around observedHardware (and where options.observedHardware / options.hardwareProfile are read) to treat the incoming value as unknown and run it through the appropriate Zod schema (e.g., MachineProfileSchema or ObservedHardwareSchema) using parse or safeParse, throw or log a clear validation error on failure, and only pass the successfully parsed/typed object into normalizeMachineProfile (keeping collectObservedHardwareProfile() results already typed if it returns a validated shape). Ensure you reference and use the existing schema symbol (or create one) and apply it at the boundary where observedHardware is assigned so normalizeMachineProfile always receives a validated type.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/lib/schemas.ts`:
- Around line 73-77: SignalAssessmentSchema currently allows invalid combos
(e.g. classification "trustworthy" with reasons or "tainted" with empty
reasons); update SignalAssessmentSchema (which composes
SignalAssessmentClassificationSchema and SignalAssessmentReasonSchema) to add a
refinement (use .refine or .superRefine) that enforces the invariants: if
classification === "tainted" then reasons.length > 0, if classification ===
"trustworthy" then reasons must be empty (or must not contain specific reasons
like "tool_call_not_executed"), and otherwise validate any other allowed
classification/reason relationships; on violation call ctx.addIssue (for
superRefine) with a clear message so malformed JSON fails at the dashboard
boundary.
In `@llm/implementation/multi-runtime-mvp-implementation.md`:
- Around line 118-120: The example command references a model key
(`qwen2.5-14b-instruct`) that no longer exists in the checked-in
models.example.json; update the example in multi-runtime-mvp-implementation.md
so the command uses one of the current profile keys from models.example.json (or
explicitly mark the command as a historical MVP example). Locate the command
line with `bun pb --runtimes ollama vllm --harnesses direct goose opencode
--models qwen2.5-14b-instruct --model-config models.example.json` and replace
`qwen2.5-14b-instruct` with a valid profile key from models.example.json (or add
a brief note that it’s a historical example).
In `@models.example.json`:
- Around line 14-18: The vllm example variant uses an MLX-formatted model
("Qwen/Qwen3-27B-Instruct-MLX-4bit") which vLLM doesn't support; update the
"vllm" object so modelName is a vLLM-compatible model (e.g.,
"Qwen/Qwen3-27B-Instruct") or point to a vLLM-supported quantized artifact
(GPTQ/AWQ/GGUF) and adjust the "format"/"quantization" fields accordingly to
match that supported format; locate the "vllm" variant in the JSON and replace
the modelName and format/quantization values to a vLLM-supported option.
In `@src/cli/migrate-machine-command.ts`:
- Around line 102-124: prepareArtifactRewrite currently lets JSON.parse and
validate errors surface without context; catch JSON parse errors around
JSON.parse(original) and rethrow a new error that includes artifactPath and the
original error message (preferably using the Error constructor's cause field if
available), and similarly wrap the validate(normalized) call in a try/catch to
rethrow a new error that includes artifactPath and the validation error details;
keep the original error as the cause and preserve message formatting so callers
can identify which run/plan artifact failed.
In `@src/harnesses/goose-adapter.ts`:
- Around line 482-485: The signalAssessment currently adds
"tool_call_not_executed" whenever toolCallDetected is true even if execution
evidence exists; change the appendSignalAssessmentReasons call to only include
"tool_call_not_executed" when toolCallDetected is true AND there is no execution
evidence (e.g., codeFilePath is falsy). Update the argument expression so the
array is passed conditionally (use toolCallDetected && !codeFilePath) to avoid
misclassifying runs where a tool-produced file exists.
In `@src/harnesses/opencode-adapter.ts`:
- Around line 127-140: The current code runs git commands directly against the
caller-supplied workDir (see the execa("git", ...) calls and workDir variable),
which mutates user repos; change this so we never init or reconfigure Git inside
caller workspaces: detect when workDir is user-provided and instead create a
bench-owned temporary directory (use fs.mkdtemp / fs.cp or similar to copy the
workspace or use a temp worktree), run the execa("git", ["init"...]) / git
config calls against that temp dir, and use log.warn({ error: gitErr }, ...)
unchanged for errors; remove any git init/config calls directly targeting the
original workDir.
In `@src/lib/machine-profile/instance-id.ts`:
- Around line 87-95: The EEXIST branch and the similar block at 228-243 must not
call repairPersistedInstanceId on a file another process may be creating; change
the logic in the EEXIST handling (around
readPersistedInstanceId/repairPersistedInstanceId with targetPath/value) to only
retry/re-read until a short deadline and then give up (return undefined or
throw) instead of calling repairPersistedInstanceId; ensure the same change is
applied to the other occurrence so only the process that actually created the
file performs repairPersistedInstanceId, while contenders only wait and re-read
via readPersistedInstanceId.
In `@src/lib/machine-profile/legacy.ts`:
- Around line 147-166: The code casts rawMachine to LegacyMachineProfile
(legacyMachine) after only validating hardware fields, then unconditionally
copies legacyMachine.label into displayLabel when truthy, which allows
non-string labels through; update the legacy branch to validate that
legacyMachine.label is a string before spreading it into displayLabel (e.g.,
check typeof legacyMachine.label === "string"), or parse the entire legacy
object with a strict validator before using
migrateLegacyObservedHardware/normalizeMachineProfile and buildLegacyInstanceId
so only a string label is accepted and propagated.
In `@src/lib/machine-profile/normalization.ts`:
- Around line 244-245: The cpuCoreCount should be derived in a stable way to
avoid re-keying when physical-core probing is absent: change the assignment in
normalization.ts so cpuCoreCount uses normalizedProfile.logicalCores as the
primary source and falls back to normalizedProfile.physicalCores (i.e.,
cpuCoreCount = normalizedProfile.logicalCores ??
normalizedProfile.physicalCores) so machineProfileKey remains stable across runs
even when physicalCores is missing.
- Around line 197-230: The current normalization collapses mixed-accelerator
hosts by emitting a single acceleratorKey plus a total acceleratorCount (using
selectPrimaryAccelerator, normalizeAcceleratorKey and
getObservedAcceleratorCount), which makes different mixes (e.g. 4090+4060 vs
2x4090) collide; fix by replacing the single acceleratorKey/acceleratorCount
pair with a deterministic per-accelerator summary derived from
observedHardware.accelerators: build a map of normalizeAcceleratorKey(acc) ->
count, sort keys deterministically and expose either an acceleratorCounts object
or a compact sorted summary string/array (e.g. ["4090:2","4060:1"]) instead of
acceleratorKey, and keep primary-based acceleratorMemoryGiB as-is; update any
code that reads acceleratorKey/acceleratorCount to use the new
acceleratorCounts/summary and ensure selection logic still uses
selectPrimaryAccelerator for memoryBytes.
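A sketch of the per-model summary under those assumptions (the `ObservedAcceleratorLike` shape is a stand-in for the real probe type):

```ts
interface ObservedAcceleratorLike {
  modelRaw?: string;
  count?: number;
}

/** Builds a deterministic per-model summary such as ["nvidia rtx 4090:2", "nvidia rtx 4060:1"]. */
function buildAcceleratorSummary(accelerators: ObservedAcceleratorLike[]): string[] {
  const counts = new Map<string, number>();
  for (const accelerator of accelerators) {
    const key = accelerator.modelRaw?.trim().toLowerCase() ?? "unknown";
    counts.set(key, (counts.get(key) ?? 0) + (accelerator.count ?? 1));
  }
  // Sort keys so the summary (and anything keyed off it) stays stable across probes.
  return [...counts.entries()]
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([model, count]) => `${model}:${count}`);
}
```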
In `@src/lib/machine-profile/probe-linux.ts`:
- Around line 69-77: parseLspciAccelerators() (and the consumer
collectLinuxAccelerators()) currently treats any VGA/3D/display lspci line as a
detected accelerator, which creates false positives on BMC/virtual-display
hosts; change the behavior so lspci output is treated only as enrichment: modify
parseLspciAccelerators to include a marker (e.g., kind: "enrichment" or a
boolean like inferredFromLspci) instead of producing a definitive accelerators
status, and update collectLinuxAccelerators to ignore lspci-only entries when
deciding status:"detected" (require corroborating evidence such as /dev/dri,
vendor-specific drivers, or other probes) so lspci lines only augment
modelRaw/name but do not flip a host from none_detected to detected.
In `@src/lib/machine-profile/probe-utils.ts`:
- Around line 68-84: The regex in parseMemoryBytes (the normalized.match(...)
line) must be widened to accept binary units (tib|gib|mib|kib) so inputs like "2
GiB" don't fall through as bytes; update the pattern to include those variants
and then normalize the captured unit (from match[2]) to the existing switch
cases (e.g., strip the optional "i" so "gib" becomes "gb") before the switch so
the current case labels ("tb","gb","mb","kb","b") continue to work unchanged;
ensure the default/fallback behavior still returns Math.round(amount) for
byte-only values.
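A sketch of the widened parser; the regex and multipliers here are illustrative rather than the file's exact values:

```ts
/** Parses values like "2 GiB", "16GB", or "512 mb" into bytes, or undefined if unparseable. */
function parseMemoryBytes(input: string): number | undefined {
  const normalized = input.trim().toLowerCase();
  const match = normalized.match(/^([\d.]+)\s*(tib|gib|mib|kib|tb|gb|mb|kb|b)?$/);
  if (!match) return undefined;
  const amount = Number.parseFloat(match[1] ?? "");
  if (!Number.isFinite(amount)) return undefined;
  // Fold binary units onto the existing decimal cases, e.g. "gib" -> "gb".
  const unit = (match[2] ?? "b").replace("ib", "b");
  const multipliers: Record<string, number> = {
    tb: 1024 ** 4,
    gb: 1024 ** 3,
    mb: 1024 ** 2,
    kb: 1024,
    b: 1,
  };
  return Math.round(amount * (multipliers[unit] ?? 1));
}
```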
In `@src/lib/model-profile/normalization.ts`:
- Around line 115-117: The detectTuning function is matching tuning names by
substring which produces false positives (e.g., "starcoder2" -> "coder"); update
detectTuning to match whole tokens instead of substrings by splitting the input
string into tokens (e.g., on non-alphanumeric characters) or using word-boundary
regex checks, then test each token against TUNING_PATTERNS (or test each pattern
against token equality) so only complete token matches return a tuning; keep the
function name detectTuning and the TUNING_PATTERNS reference so the change is
localized.
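A token-based sketch of that check; the pattern list shown is an example, not the repo's real `TUNING_PATTERNS`:

```ts
// Example tuning tokens; the repo's real TUNING_PATTERNS list is longer.
const TUNING_PATTERNS = ["instruct", "chat", "coder"] as const;

/** Matches tuning names against whole tokens so "starcoder2" no longer hits "coder". */
function detectTuning(modelName: string): string | undefined {
  const tokens = modelName.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  return TUNING_PATTERNS.find((pattern) => tokens.includes(pattern));
}
```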
In `@src/lib/model-profile/registry.ts`:
- Around line 127-132: The current combined try/catch around
JSON.parse(fs.readFileSync(...)) conflates filesystem read errors with JSON
parse errors; split the operation into two steps: first call
fs.readFileSync(filePath, "utf-8") inside its own try/catch and rethrow a
descriptive error that preserves the original error (include the caught
error/message and filePath), then JSON.parse the returned string in a separate
try/catch that throws the existing "Invalid JSON in model profile file" error;
update references around the parsed variable and keep filePath in the error
messages to aid debugging.
- Around line 287-301: The function findConfiguredVariantByRuntimeModel
currently returns the first matching profile and silently prefers registry
order; change it to detect multiple matches and throw an error instead of
returning the first hit. Iterate the registry using the existing symbols
(profileKey, profile), call normalizeConfiguredVariant(configuredVariant) as
before and collect all matches where normalized.modelName === runtimeModelName;
if zero matches return undefined, if exactly one return that match, and if more
than one throw a clear Error indicating an ambiguous reverse match listing the
conflicting profile keys and the { runtime, runtimeModelName } pair to aid
debugging.
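A sketch of the stricter lookup, with a simplified registry shape standing in for the real configured-profile types:

```ts
interface ConfiguredVariantLike {
  modelName: string;
}

/** Resolves a runtime model name back to exactly one profile key, or fails loudly. */
function findConfiguredVariantByRuntimeModel(
  registry: ReadonlyMap<string, ConfiguredVariantLike>,
  runtime: string,
  runtimeModelName: string,
): { profileKey: string; variant: ConfiguredVariantLike } | undefined {
  const matches = [...registry.entries()].filter(
    ([, variant]) => variant.modelName === runtimeModelName,
  );
  if (matches.length === 0) return undefined;
  if (matches.length > 1) {
    const keys = matches.map(([profileKey]) => profileKey).join(", ");
    throw new Error(
      `Ambiguous reverse match for ${runtime}/${runtimeModelName}: profiles ${keys}`,
    );
  }
  const [profileKey, variant] = matches[0];
  return { profileKey, variant };
}
```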
In `@src/schemas/model-profile.schema.ts`:
- Around line 123-125: The ModelProfileFileSchema currently uses schemaVersion:
z.string().default(SCHEMA_VERSION) which accepts any string and silently allows
incorrect versions; change it to require the exact current version (e.g.,
replace the defaulting string with a literal/strict validator referencing
SCHEMA_VERSION) so parsing fails for mismatched versions and forces explicit
migrations; update ModelProfileFileSchema to validate schemaVersion equals
SCHEMA_VERSION (using z.literal(SCHEMA_VERSION) or equivalent) to reject unknown
file versions.
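The strict variant is a one-line change; sketched here with a placeholder constant:

```ts
import { z } from "zod";

// Placeholder value; the real constant is exported by the schema module.
const SCHEMA_VERSION = "1.0.0";

export const ModelProfileFileSchema = z.object({
  schemaVersion: z.literal(SCHEMA_VERSION),
  // ...the remaining fields stay unchanged
});
```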
In `@test/schemas.test.ts`:
- Around line 272-303: The tests currently assert validation failures by
checking issue.message substrings; update the assertions to check issue.code ===
"invalid_type" combined with the field path instead (for the BenchConfigSchema
safeParse results for machineInstanceId and machineDisplayLabel) so the checks
use the semantic error code instead of brittle message text—locate the two
places referencing idResult.error.issues and labelResult.error.issues and
replace the .message.includes(...) checks with checks that issue.code ===
"invalid_type" and issue.path[0] === "machineInstanceId" (or
"machineDisplayLabel") respectively.
---
Outside diff comments:
In `@apps/dashboard/src/lib/aggregations-compare.ts`:
- Around line 117-159: The deltas are computed from full runA.runB items but
should be computed from the matched subsets; update scoringDelta and
frontierEvalDelta to use only matched rows that contain the metric.
Specifically, for scoringDelta, derive passRateA/passRateB and total counts from
matchedWithScoring by mapping to itemA and itemB (e.g., build arrays of the
matched itemA items and matched itemB items) and compute pass rates/totals from
those instead of passRateA/passRateB; for frontierEvalDelta call
computeFrontierStats on the matchedWithFrontierEval itemA and itemB subsets (map
to their itemA/itemB arrays) and compute avgScoreDelta from those results. Keep
metricAvailability.comparedRows as matchedWithScoring.length /
matchedWithFrontierEval.length.
In `@src/results/aggregate.ts`:
- Around line 374-405: machineSummary is only seeded from deduped items so
machines with zero matched items are omitted; seed machineSummary from the
matchedRuns metadata first (create an entry per matched run's machineProfileKey
with itemCount: 0, machineProfileLabel from the matched run and
verificationStatus from its metadata), then iterate items and increment
itemCount as currently done; this ensures the existing loop that updates
current.itemCount and the later mapping to MachineAggregateSummary (the machines
array and summary.machines) will include profiles with zero items while keeping
the rest of the aggregation logic (profileRunSet, profileInstanceSet,
verificationStatus, runCount, instanceCount) unchanged.
---
Duplicate comments:
In `@apps/dashboard/scripts/build-index.ts`:
- Around line 62-67: isSameOrNestedPath currently compares raw strings using "/"
which breaks on Windows; fix by normalizing both paths with Node's path
utilities (e.g., import path from 'node:path'), resolve/normalize the inputs
(path.resolve or path.normalize), then perform the overlap check using
path.relative: compute const rel = path.relative(resolvedBase,
resolvedCandidate) and return rel === '' || (!rel.startsWith('..') &&
!path.isAbsolute(rel)); update the implementation in isSameOrNestedPath (and any
duplicate instances around lines 511-523) and ensure callers such as
writePublishedRunBundles use the updated function so rm()'s guard is
separator-safe across platforms.
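A sketch of the separator-safe containment check described above:

```ts
import path from "node:path";

/** True when candidate equals base or lives inside it, independent of path separators. */
function isSameOrNestedPath(base: string, candidate: string): boolean {
  const rel = path.relative(path.resolve(base), path.resolve(candidate));
  return rel === "" || (!rel.startsWith("..") && !path.isAbsolute(rel));
}
```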
In `@src/lib/machine-profile/probe-utils.ts`:
- Around line 98-123: The dedupe key currently includes accelerator.vendor which
causes identical devices to be split when vendor is missing; change the key
generation in probe-utils.ts to omit vendor and rely on accelerator.modelRaw
(and any other stable fields) instead — update the key construction where const
key = [...] is created (referencing variables key, deduped, accelerator,
modelRaw) so it uses modelRaw.trim().toLowerCase() (and not vendor) when
building the dedupe key, then keep the subsequent merging logic (merged
vendor/memoryBytes/backend/kind and count) as-is so missing vendor values are
filled from other entries.
In `@test/workspace-prompt-parity.test.ts`:
- Around line 59-87: Update the test to also assert the required return-type
semantics: check that calculator blind prompt declares add(n: number) and
subtract(n: number) with a numeric return (e.g., "add(n: number): number" and
"subtract(n: number): number"), that memoryRecall and memoryClear include their
return signatures (e.g., "memoryRecall(): number" or "memoryClear():
void/number" as appropriate), and that clear includes its declared return; and
for the emitter prompt assert return behaviors such as "listenerCount(event):
number", "on(event, listener): this", "once(event, listener): this", "off(event,
listener): this", and "emit(event, payload): boolean" so the test enforces both
method names and their required return/contract clauses.
---
Nitpick comments:
In `@apps/dashboard/src/lib/api.ts`:
- Around line 12-15: The import of parseKnownPlanPayload and
parseKnownRunPayload uses a brittle deep relative path; to fix, re-export these
helpers from a stable shared entry (e.g., add exports for parseKnownPlanPayload
and parseKnownRunPayload in the library's public index/packaged entry) or add a
path alias in the dashboard build config and update the import to the alias;
make sure to update the module that currently defines them (the legacy
machine-profile module) to export them from the shared entry so imports of
parseKnownPlanPayload and parseKnownRunPayload become short and robust.
In `@apps/dashboard/src/lib/types.ts`:
- Around line 84-93: The ModelVariant interface widens runtime to string which
diverges from the canonical runtime type; change ModelVariant.runtime to use the
shared runtime type (e.g., import and use the Runtime/CanonicalRuntime type used
by the CLI/shared schemas) instead of string so TypeScript enforces only allowed
runtimes; update the import and any affected usages (e.g., places constructing
ModelVariant or ModelProfile) to satisfy the new type.
In `@src/lib/hardware-profile.ts`:
- Around line 99-103: The injected observedHardware/hardwareProfile value must
be validated with Zod before calling normalizeMachineProfile; update the logic
around observedHardware (and where options.observedHardware /
options.hardwareProfile are read) to treat the incoming value as unknown and run
it through the appropriate Zod schema (e.g., MachineProfileSchema or
ObservedHardwareSchema) using parse or safeParse, throw or log a clear
validation error on failure, and only pass the successfully parsed/typed object
into normalizeMachineProfile (keeping collectObservedHardwareProfile() results
already typed if it returns a validated shape). Ensure you reference and use the
existing schema symbol (or create one) and apply it at the boundary where
observedHardware is assigned so normalizeMachineProfile always receives a
validated type.
In `@src/lib/signal-assessment.ts`:
- Around line 34-46: The exported function createTaintedSignalAssessment now
throws when called with an empty reasons array; update its JSDoc/TSDoc to
include an `@throws` tag that documents this behavior (e.g., state that it throws
Error when reasons is empty or contains no unique reasons) and mention the error
message ("createTaintedSignalAssessment called with empty reasons") so callers
and types/docs generators know the failure mode.
In `@src/results/compare.ts`:
- Around line 142-220: Move the aggregate/trusted-metric helpers into a new
sibling module and import them from compare.ts: extract the interfaces
AggregateScoring and AggregateFrontier and the functions calculatePassRate,
computeAggregateScoring, and computeAggregateFrontier into a new file (e.g.,
results/summaryMetrics.ts), export them, and replace the local definitions in
compare.ts with imports of those exported symbols; ensure all references in
compare.ts (and any other files) use the imported names and update any
test/module imports accordingly so tests compile against the new module.
In `@src/schemas/common.schema.ts`:
- Around line 157-219: This file mixes unrelated schema domains; extract the
signal-assessment block (SignalAssessmentSchema,
SignalAssessmentClassificationSchema, SignalAssessmentReasonSchema,
SignalAssessment type and const arrays signalAssessmentClassifications and
signalAssessmentReasonTypes) into its own module (e.g.,
signal-assessment.schema.ts) and move the machine-profile related block (lines
~266-475) into a separate module (e.g., machine-profile.schema.ts); update
exports/imports so any places importing these symbols now import from the new
modules (or re-export from common.schema.ts if you need backward compatibility),
and run type checks to ensure no references were broken.
In `@test/dashboard-schema-compat.test.ts`:
- Around line 1-3: The file header in test/dashboard-schema-compat.test.ts only
documents purpose; add the required TS file header metadata by expanding the top
comment to include an "Exports" section (even if "none") and an "Invariants"
section describing any guarantees for the test file, matching the repo's header
convention used elsewhere; ensure the header remains a block comment and follows
the same phrasing/format as other test files so tooling and readers can rely on
the documented purpose, exports (none), and invariants.
In `@test/workspace-prompt-parity.test.ts`:
- Around line 1-3: Update the file header at the top of
test/workspace-prompt-parity.test.ts to follow the repo standard: keep the
Purpose line and add explicit Exports (list any exported symbols from this test
file, or state "none" if it exports nothing) and an Invariants section
describing guarantees this test enforces (e.g., prompt/score parity must remain
stable, deterministic seeding, no external network calls). Ensure the header is
a short 3-part block: Purpose, Exports, Invariants.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 35865241-278e-41d0-ab2d-f2eb15a14635
📒 Files selected for processing (100)
README.md
apps/dashboard/public/results/20260305-135118-b11d08/plan.json
apps/dashboard/public/results/20260305-135118-b11d08/run.json
apps/dashboard/public/results/20260305-144041-4d5950/plan.json
apps/dashboard/public/results/20260305-144041-4d5950/run.json
apps/dashboard/public/results/20260306-135715-91b0ee/plan.json
apps/dashboard/public/results/20260306-135715-91b0ee/run.json
apps/dashboard/public/results/20260307-161553-c25c6a/plan.json
apps/dashboard/public/results/20260307-161553-c25c6a/run.json
apps/dashboard/public/results/20260316-141600-6e18e2/plan.json
apps/dashboard/public/results/20260316-141600-6e18e2/run.json
apps/dashboard/public/results/20260317-082000-7b3bf3/plan.json
apps/dashboard/public/results/20260317-082000-7b3bf3/run.json
apps/dashboard/public/results/20260317-170728-697a9a/plan.json
apps/dashboard/public/results/20260317-170728-697a9a/run.json
apps/dashboard/public/results/20260318-091214-a0dbda/plan.json
apps/dashboard/public/results/20260318-091214-a0dbda/run.json
apps/dashboard/public/results/20260319-090246-e5f8be/plan.json
apps/dashboard/public/results/20260319-090246-e5f8be/run.json
apps/dashboard/public/results/aggregates/chk_sha256v1_2c13ad3e58b3.json
apps/dashboard/public/results/aggregates/chk_sha256v1_61a037f87105.json
apps/dashboard/public/results/aggregates/latest.json
apps/dashboard/public/results/index.json
apps/dashboard/scripts/build-index.ts
apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
apps/dashboard/src/components/run-detail/run-detail-page.tsx
apps/dashboard/src/components/run-list/checkpoint-groups.ts
apps/dashboard/src/components/run-list/run-card.tsx
apps/dashboard/src/components/run-list/run-list-page.tsx
apps/dashboard/src/lib/aggregations-compare.ts
apps/dashboard/src/lib/api.ts
apps/dashboard/src/lib/schemas.ts
apps/dashboard/src/lib/types.ts
llm/implementation/multi-runtime-mvp-implementation.md
models.example.json
models.json
src/cli/compare-command.ts
src/cli/index.ts
src/cli/migrate-machine-command.ts
src/cli/run-command.ts
src/harnesses/code-output-policy.ts
src/harnesses/direct-adapter.ts
src/harnesses/goose-adapter.ts
src/harnesses/harness.ts
src/harnesses/opencode-adapter.ts
src/lib/hardware-profile.ts
src/lib/machine-profile/instance-id.ts
src/lib/machine-profile/legacy.ts
src/lib/machine-profile/normalization.ts
src/lib/machine-profile/probe-linux.ts
src/lib/machine-profile/probe-macos.ts
src/lib/machine-profile/probe-utils.ts
src/lib/machine-profile/probe-windows.ts
src/lib/machine-profile/probe.ts
src/lib/model-profile/normalization.ts
src/lib/model-profile/registry.ts
src/lib/model-profiles.ts
src/lib/signal-assessment.ts
src/lib/stats-format.ts
src/lib/stats.ts
src/results/aggregate.ts
src/results/compare.ts
src/results/reader.ts
src/runner/generation-retry.ts
src/runner/index.ts
src/runner/item-executor.ts
src/runner/item-retry.ts
src/runner/plan-builder.ts
src/schemas/common.schema.ts
src/schemas/config.schema.ts
src/schemas/index.ts
src/schemas/model-profile.schema.ts
src/schemas/plan.schema.ts
src/schemas/result.schema.ts
src/tests/calculator-stateful/prompt.blind.md
src/tests/event-emitter/prompt.blind.md
src/tests/file-delete-smoke/prompt.blind.md
src/tests/rate-limiter/prompt.blind.md
src/tests/safe-cleanup/prompt.blind.md
src/tests/workspace-smoke/test.meta.json
test/aggregate.test.ts
test/build-index.test.ts
test/code-output-policy.test.ts
test/compare-checkpoint-guard.test.ts
test/compare.test.ts
test/dashboard-schema-compat.test.ts
test/goose-adapter.test.ts
test/hardware-profile.test.ts
test/machine-profile-migration.test.ts
test/machine-profile-probe.test.ts
test/model-profile-registry.test.ts
test/opencode-adapter.test.ts
test/plan-builder.test.ts
test/schemas.test.ts
test/signal-assessment.test.ts
test/stats.test.ts
test/workspace-prompt-parity.test.ts
@CodeRabbit full review
✅ Actions performed
Full review triggered.
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
llm/implementation/multi-runtime-mvp-implementation.md (1)
189-200: ⚠️ Potential issue | 🟡 Minor
The validation-evidence block now contradicts itself. Line 189 says this run used canonical profile `qwen3-27b-instruct`, but Lines 199-200 still list `qwen2.5` runtime model names. Either the canonical profile name or the evidence block is stale, so readers cannot tell which model family this section is actually documenting.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@llm/implementation/multi-runtime-mvp-implementation.md` around lines 189 - 200, The validation-evidence block is inconsistent: the header lists canonical model profile `qwen3-27b-instruct` but the evidence lines reference `qwen2.5` models (e.g., ollama `qwen2.5:14b`, vLLM `Qwen/Qwen2.5-14B-Instruct`); pick the correct model family and make the section consistent by either changing the canonical profile to `qwen2.5` or updating the evidence entries to the corresponding `qwen3-27b-instruct` runtime names (update all occurrences of `qwen2.5` in the evidence block and any by-model entries to match `qwen3-27b-instruct` naming).apps/dashboard/scripts/build-index.ts (1)
174-231: ⚠️ Potential issue | 🟠 Major
Don't publish raw machine instance IDs. `sanitizePublishedRun()` keeps `run.machine.instanceId`, and this builder now also copies `machineInstanceId` into `index.json`. If those IDs come from hostnames or other per-host labels, the public dashboard leaks stable machine identifiers even though the published data only needs profile grouping and counts. Hash or omit the raw ID before publication.
Also applies to: 422-423
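One possible scrubbing step, assuming Node's `crypto` is acceptable at publish time (the salt handling and `anon_` prefix are illustrative):

```ts
import { createHash } from "node:crypto";

/** Replaces a raw machine instance id with a short, non-reversible token. */
function scrubInstanceId(instanceId: string, salt: string): string {
  const digest = createHash("sha256").update(`${salt}:${instanceId}`).digest("hex");
  return `anon_${digest.slice(0, 12)}`;
}
```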
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/scripts/build-index.ts` around lines 174 - 231, sanitizePublishedRun currently preserves run.machine.instanceId (and the build also copies machineInstanceId into index.json), leaking stable per-host identifiers; update sanitizePublishedRun to either remove/undefine run.machine.instanceId or replace it with a non-reversible scrubbed value (e.g. a salted hash or normalized token) before returning the RunResult, and ensure the same transformation is applied wherever machineInstanceId is copied into index.json (the code that writes machineInstanceId should use the same sanitizer/tokenizer); reference the sanitizePublishedRun function and the run.machine.instanceId / machineInstanceId fields when making the change.src/results/compare.ts (1)
231-254: ⚠️ Potential issue | 🟠 Major
Keep one-sided metric loss out of matched row deltas. These new summary filters correctly exclude rows where only one side has `automatedScore` or `frontierEval`, but `computeDeltas()` still substitutes missing metrics with zeroes. That makes matched-row output show fake regressions/improvements for coverage drift. Return `null` unless both sides carry the metric, and expose availability changes separately if you still want that signal.
💡 Suggested alignment with the new summary semantics
```diff
-  if (a.automatedScore || b.automatedScore) {
-    const scoreA = a.automatedScore || { passed: 0, failed: 0, total: 0 };
-    const scoreB = b.automatedScore || { passed: 0, failed: 0, total: 0 };
-    const passRateA = calculatePassRate(a.automatedScore) ?? 0;
-    const passRateB = calculatePassRate(b.automatedScore) ?? 0;
+  if (a.automatedScore && b.automatedScore) {
+    const scoreA = a.automatedScore;
+    const scoreB = b.automatedScore;
+    const passRateA = calculatePassRate(scoreA) ?? 0;
+    const passRateB = calculatePassRate(scoreB) ?? 0;
@@
-  if (a.frontierEval || b.frontierEval) {
-    const scoreA = a.frontierEval?.score ?? 0;
-    const scoreB = b.frontierEval?.score ?? 0;
+  if (a.frontierEval && b.frontierEval) {
+    const scoreA = a.frontierEval.score;
+    const scoreB = b.frontierEval.score;
```
Also applies to: 293-330
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/results/compare.ts` around lines 231 - 254, computeDeltas currently substitutes missing automatedScore/frontierEval with zeroes causing spurious deltas; change the logic in computeDeltas (the automatedScore and frontierEval blocks) to only compute and return a ScoreDelta or EvalDelta when both a and b actually have the metric (i.e., a.automatedScore && b.automatedScore for automatedScore, and a.frontierEval && b.frontierEval for frontierEval); otherwise set automatedScore/frontierEval to null and handle availability changes separately if you need that signal; apply the same fix to the other similar block later in computeDeltas (around the second automatedScore/frontierEval handling).src/results/aggregate.ts (1)
374-421: ⚠️ Potential issue | 🟠 Major
Don't keep only the first verification status for a merged machine profile. Lines 384-388 seed each `machineProfileKey` with the first status seen, and Lines 404-417 emit that value unchanged. Now that multiple runs and instances collapse under one profile, the summary can report a profile as `verified` even when some contributing runs are only `self_reported` or `rejected`.
🛠️ Possible fix
```diff
   const machineSummary = new Map<
     string,
-    { machineProfileLabel?: string; verificationStatus: VerificationStatus; itemCount: number }
+    {
+      machineProfileLabel?: string;
+      verificationStatuses: Set<VerificationStatus>;
+      itemCount: number;
+    }
   >();
   machineSummary.set(machineProfileKey, {
     machineProfileLabel: metadata.machineProfileLabel,
-    verificationStatus: metadata.verificationStatus,
+    verificationStatuses: new Set([metadata.verificationStatus]),
     itemCount: 0,
   });
@@
     if (!current) {
       machineSummary.set(item.machineProfileKey, {
         machineProfileLabel: item.machineProfileLabel,
-        verificationStatus: item.verificationStatus,
+        verificationStatuses: new Set([item.verificationStatus]),
         itemCount: 1,
       });
       continue;
     }
     current.itemCount += 1;
+    current.verificationStatuses.add(item.verificationStatus);
   }
@@
-      verificationStatus: value.verificationStatus,
+      verificationStatus:
+        value.verificationStatuses.has("rejected")
+          ? "rejected"
+          : value.verificationStatuses.size === 1
+            ? [...value.verificationStatuses][0]!
+            : "self_reported",
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/results/aggregate.ts` around lines 374 - 421, The code seeds machineSummary with the first VerificationStatus seen (in the matchedRuns loop) and never updates it, so merged profiles can show an incorrect status; change this by computing an aggregated status per profile instead of taking the first one — add a small helper (e.g., computeAggregateVerificationStatus or a verification precedence map where rejected > self_reported > verified) and use it whenever you set or update machineSummary entries (referencing machineSummary, matchedRuns, items, and the VerificationStatus type); update the logic in both the matchedRuns seeding block and the items-processing block to combine the existing status with the new status using that helper so the final MachineAggregateSummary reflects the worst/most conservative status across all runs and instances (profileRunSet/profileInstanceSet remain unchanged).
♻️ Duplicate comments (2)
src/cli/migrate-machine-command.ts (1)
127-135: ⚠️ Potential issue | 🟡 Minor
Wrap normalization failures with the artifact path too. `normalizeKnownRunPayload()`/`normalizeKnownPlanPayload()` can still throw before `validate()` runs. Those exceptions bypass the artifact-scoped wrapper, so bulk migration failures still won't tell you which `run.json`/`plan.json` caused them.
🛠️ Minimal patch
```diff
-  const normalized = normalize(parsed);
-  try {
-    validate(normalized);
+  let normalized: unknown;
+  try {
+    normalized = normalize(parsed);
+    validate(normalized);
   } catch (error) {
     throw new Error(
       `Invalid migrated artifact ${artifactPath}: ${error instanceof Error ? error.message : String(error)}`,
       { cause: error },
     );
   }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli/migrate-machine-command.ts` around lines 127 - 135, Normalization errors from normalizeKnownRunPayload()/normalizeKnownPlanPayload() (called via normalize(parsed)) aren't wrapped with the artifact path, so exceptions escape the artifact-scoped error message; wrap the normalize(parsed) call in a try/catch and on error throw a new Error that includes `artifactPath` (same pattern used for validate) and pass the caught error as the cause, updating the block around normalize(parsed) / validate(normalized) so both normalization and validation failures are reported with the artifact path.src/lib/machine-profile/probe-linux.ts (1)
69-77: ⚠️ Potential issue | 🟠 Major
Keep `lspci` entries out of the returned accelerator list unless they're corroborated. This still appends every `lspci` VGA/3D/display line into `accelerators`, and once one vendor probe succeeds `hasCorroboratedAccelerators` makes the whole merged list publishable. A machine with one real CUDA device plus a BMC/virtual display controller will therefore report extra fake accelerators and the wrong `acceleratorCount`/`profileKey`.
🛠️ One safe direction
```diff
-  const accelerators: ObservedAccelerator[] = [];
+  const corroboratedAccelerators: ObservedAccelerator[] = [];
+  let lspciAccelerators: ObservedAccelerator[] = [];
@@
   if (nvidia) {
     probeCount++;
-    accelerators.push(...parseNvidiaAccelerators(nvidia));
+    corroboratedAccelerators.push(...parseNvidiaAccelerators(nvidia));
   }
@@
   if (rocm) {
     probeCount++;
-    accelerators.push(...parseRocmAccelerators(rocm));
+    corroboratedAccelerators.push(...parseRocmAccelerators(rocm));
   }
@@
   if (lspci) {
     probeCount++;
-    accelerators.push(...parseLspciAccelerators(lspci));
+    lspciAccelerators = parseLspciAccelerators(lspci);
   }
-  const dedupedAccelerators = dedupeAccelerators(accelerators);
-  const hasCorroboratedAccelerators = accelerators.some(
-    (accelerator) =>
-      accelerator.backend !== undefined || accelerator.vendor !== undefined,
-  );
+  const hasCorroboratedAccelerators = corroboratedAccelerators.length > 0;
+  const dedupedAccelerators = dedupeAccelerators(corroboratedAccelerators);
```
Also applies to: 107-123
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-linux.ts` around lines 69 - 77, parseLspciAccelerators is currently returning every VGA/3D/display lspci line as a confirmed accelerator; change it to mark those entries as uncorroborated candidates (e.g., add a property like source: "lspci" and corroborated: false or kind: "candidate") instead of treating them as real devices, and update the merging logic that uses hasCorroboratedAccelerators to only promote entries with corroborated: true (or vendor-specific kind) when a vendor probe (e.g., NVIDIA/AMD probe functions) confirms a match; apply the same change to the other lspci parsing block referenced around lines 107-123 so lspci-only entries never inflate acceleratorCount or profileKey unless explicitly corroborated.
🧹 Nitpick comments (4)
test/workspace-prompt-parity.test.ts (1)
64-98: Consider adding assertions for `multiply` and `divide` method signatures.
The test now covers most of the updated prompt contracts, addressing the prior review feedback. However, the calculator prompt at line 13 includes `multiply(n: number)` and `divide(n: number)`, which aren't asserted here. Adding these would complete coverage of all explicit method signatures.
♻️ Suggested additions
```diff
   expect(calculatorPrompt).toContain("`add(n: number)`");
   expect(calculatorPrompt).toContain("`subtract(n: number)`");
+  expect(calculatorPrompt).toContain("`multiply(n: number)`");
+  expect(calculatorPrompt).toContain("`divide(n: number)`");
   expect(calculatorPrompt).toContain("`clear()`");
```
Verify each finding against the current code and only fix it if needed. In `@test/workspace-prompt-parity.test.ts` around lines 64 - 98, The test "blind coding prompts declare the required public method contracts" is missing assertions for the calculator's multiply and divide method signatures; update the test to assert that calculatorPrompt contains "`multiply(n: number)`" and "`divide(n: number)`" (add these alongside the existing expects for add/subtract/clear/memoryRecall/etc.), referencing the calculatorPrompt variable so the new assertions follow the same pattern as the other expect(...) calls.src/runner/item-executor.ts (1)
186-188: Generation-level signal assessment may be lost when scoring succeeds without retry.
When `scoringOutcome.signalAssessment` is undefined (scoring succeeded without using retry), line 187 unconditionally overwrites `signalAssessment`, discarding any generation-level taints (e.g., `tool_permission_denied` from the harness). The subsequent finalization at lines 302-307 builds from `existing: undefined`, losing this context.
💡 Suggested fix
generation = scoringOutcome.generation; - signalAssessment = scoringOutcome.signalAssessment; + // Preserve generation-level assessment if scoring didn't produce one + // (only overwrite if retry was promoted and returned an assessment) + if (scoringOutcome.signalAssessment !== undefined) { + signalAssessment = scoringOutcome.signalAssessment; + }This differs from the past review comment's concern (which focused on replaced generations), but addresses a related issue where generation-level taints are unconditionally discarded.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/runner/item-executor.ts` around lines 186 - 188, The code unconditionally assigns signalAssessment = scoringOutcome.signalAssessment which discards any pre-existing generation-level assessment when scoringOutcome.signalAssessment is undefined; update the assignment in the block that handles scoringOutcome (around variables generation, scoringOutcome, scoringResult) to preserve existing generation-level signalAssessment by only replacing it when scoringOutcome.signalAssessment is defined (e.g., signalAssessment = scoringOutcome.signalAssessment ?? signalAssessment or an explicit undefined check), or merge the two assessments if needed before the finalization logic that builds the final result from existing signalAssessment so generation-level taints (like tool_permission_denied) are not lost.test/plan-builder.test.ts (1)
325-340: Use the current schema version in these new model-profile regressions.
These cases are validating canonical model-profile behavior, but they all still build plans with `schemaVersion: "0.4.0"`. Elsewhere in this PR the persisted plan/run schema is already on `0.5.0`, so these assertions are only exercising the legacy compatibility path. Prefer the current schema constant/version here and keep explicit legacy-input coverage in a dedicated migration test.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/plan-builder.test.ts` around lines 325 - 340, Replace the hard-coded legacy schemaVersion "0.4.0" in these test fixtures with the project's current schema version constant (i.e., use the exported schema/current version constant instead of the string literal) so the model-profile regression cases exercise the current plan/run schema path; update the schemaVersion property in the object shown (and the other occurrences you noted at 421-436, 490-505, 543-558) to reference the canonical schema constant (e.g., SCHEMA_VERSION or CURRENT_SCHEMA_VERSION) rather than the old "0.4.0".src/results/compare.ts (1)
142-220: Split the summary/availability logic into a helper module.
This file is now over the repo's 500-line limit, and the new scoring/frontier/signal code makes `compare.ts` handle matching, per-row deltas, aggregate math, trusted filtering, and formatting in one place. Extracting the aggregate-summary pieces would keep the compare entrypoint readable and bring the file back under the project limit. As per coding guidelines, "Keep files under 500 lines; split aggressively by responsibility".
Also applies to: 270-429
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/results/compare.ts` around lines 142 - 220, The compare.ts file is too large and should have the aggregate-summary logic split into a helper module: extract the AggregateScoring and AggregateFrontier interfaces and the functions calculatePassRate, computeAggregateScoring, and computeAggregateFrontier into a new module (e.g., results/aggregate.ts), export them, and replace their original definitions in compare.ts with imports; update any references to these symbols (calculatePassRate, computeAggregateScoring, computeAggregateFrontier, AggregateScoring, AggregateFrontier) to use the new module, ensure types/exports align with existing signatures (same parameters and return types), and run type checks to fix any missing imports or exported names.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/lib/schemas.ts`:
- Around line 494-511: The transform in DashboardIndexLegacyOrCurrentSchema
returns raw RunListItem[] when Array.isArray(index) which leaves callers with an
array instead of the normalized object shape; change the transform so the Array
branch normalizes the legacy array into the current object shape (same shape
produced for legacy object inputs): return an object with schemaVersion: 3,
generatedAt (derive from the array items or set a sensible default),
latestCheckpointId (derive or null), runs mapped from the array, and checkpoints
mapped converting machineCount -> instanceCount (same mapping used in the
non-array legacy branch), ensuring both array and legacy-object inputs always
yield the current Dashboard index shape.
In `@src/harnesses/code-output-policy.ts`:
- Around line 127-145: getAcceptedOutputTaintReasons currently treats every
extracted.method === "raw" as untainted; change it to run the same purity checks
as non-raw outputs: for raw extractions, compare trimmed rawOutput to
extracted.code.trim() and test SINGLE_CODE_BLOCK_REGEX, returning [] only when
the raw text is pure code (exact match or single code block); if mixed prose is
detected return the appropriate taint reason (e.g., "mixed_prose_salvaged" or
"output_contract_violation") instead of []. Apply the same purity-check logic to
the other spots noted (the other uses of raw acceptance logic referenced in
evaluateCodeOnlyOutput()/hasRawCodeSignals) so accepted raw outputs are
validated before being marked untainted.
In `@src/harnesses/direct-adapter.ts`:
- Around line 99-112: The current retry branch uses the nested retryResult's
promptTokens/completionTokens but drops the initial call's token usage,
undercounting tokens; change the logic in the evaluateCodeOnlyOutput/retry flow
so you store the initial generation result (the first call to
createDirectAdapter().generate) and, if a retry is performed
(decision.shouldRetry && !isRetryAttempt), sum promptTokens and completionTokens
(and totalTokens if present) from the initial result and the retryResult, then
return an object that merges the retryResult data but with aggregated
promptTokens/completionTokens/totalTokens and the durationMs computed from
Math.round(performance.now() - startTime); update any references to
promptTokens/completionTokens in the returned object (and tests if present) to
use these aggregated values.
In `@src/lib/machine-profile/instance-id.ts`:
- Around line 45-68: resolveDefaultInstanceIdPath currently uses os.homedir()
directly so injected env is ignored; change it to prefer the injected env home
values (e.g. readNonEmpty(env.HOME) or platform-appropriate HOME-like vars) and
only fall back to os.homedir() if those are unset; update the const homeDir
assignment in resolveDefaultInstanceIdPath (and any callers such as
resolveMachineInstanceId) to use the injected env first so XDG_STATE_HOME /
LOCALAPPDATA fallbacks behave correctly under test injection.
In `@src/lib/machine-profile/normalization.ts`:
- Around line 310-320: buildMachineProfileLabel currently only shows the primary
accelerator (via selectPrimaryAccelerator and accelerator?.modelRaw) which makes
mixed or multi-device summaries ambiguous; change the acceleratorLabel logic in
buildMachineProfileLabel to use normalizedProfile.acceleratorSummary when the
summary indicates a mixed set or when any device in
observedHardware.accelerators has count > 1, otherwise fall back to the primary
accelerator modelRaw; inspect observedHardware.accelerators for counts and
normalizedProfile.acceleratorSummary for a precomputed human string and include
counts (e.g., "2x RTX 4090" or "RTX 4090 + RTX 4060") in the returned label.
In `@src/lib/machine-profile/probe-macos.ts`:
- Around line 51-61: The code currently allows parsed VRAM value 0 to be treated
as “missing” because a falsy-coalescing operator was used; ensure you use
nullish coalescing so 0 is preserved: change any `||` (or other falsy checks) to
`??` when assigning `memoryBytes` from `parseMemoryBytes(...)` for
`display.spdisplays_vram` and `display.spdisplays_vram_shared` (and the same
pattern at the other occurrence around the second GPU check), so a returned 0 is
kept instead of being overridden.
In `@src/lib/machine-profile/probe-utils.ts`:
- Around line 100-109: When deduping by accelerator.modelRaw (key) in
probe-utils.ts you must not sum counts on collision; locate the block where key,
current, and merged (type ObservedAccelerator) are created and remove the
addition logic that sets count to (current.count ?? 1) + (accelerator.count ??
1). Instead preserve the already-recorded count from current (e.g., set
merged.count = current.count ?? 1 or omit count so current’s value remains), so
duplicate probe reports don’t inflate device counts.
In `@src/lib/machine-profile/probe-windows.ts`:
- Around line 37-55: The current entries.flatMap callback drops entries that
lack a usable Name, causing collectWindowsAccelerators() to report
"none_detected"; instead, convert malformed entries into an ObservedAccelerator
with kind: "unavailable" so they surface as unavailable rather than omitted.
Modify the flatMap callback in parseWindowsAccelerators()/probe-windows.ts so
that when (entry as { Name?: unknown }).Name is missing or empty you return an
ObservedAccelerator object (include memoryBytes if AdapterRAM is present, and
set modelRaw to undefined or an empty string per the ObservedAccelerator shape)
with kind: "unavailable" rather than returning []; make the same change in the
duplicate block referenced (lines ~81-87) to ensure consistent behavior.
In `@src/lib/model-profile/normalization.ts`:
- Around line 251-270: The code currently sets family = args.profile.family ??
args.profileKey which incorrectly uses the whole profileKey as the family for
minimal backfilled profiles; change this to derive family (and related
parametersBillions/tuning fallbacks) from the profileKey instead of using
args.profileKey verbatim — e.g., parse args.profileKey (or reuse
configuredVariant.modelName if that encodes the model) to extract the base
family string, then feed that extracted family into buildCanonicalProfileLabel
and into the detectParametersBillions/detectTuning fallbacks so
canonical.family, parametersBillions, and tuning are computed from the parsed
profileKey rather than the full key.
In `@src/lib/model-profiles.ts`:
- Around line 1-6: Add the missing "Invariant:" header block at the top of
src/lib/model-profiles.ts describing the file-level guarantees required by
src/**/*.{ts,tsx} files; include brief, declarative invariants such as stable
exported API names (loadModelProfiles, parseInlineModelProfile,
parseInlineModelProfiles, mergeModelProfiles, resolveModelSelection,
buildResolvedModelProfile, getModelIdentityKey), expected input/output contracts
(e.g., deterministic results, no global side effects), and any
preconditions/postconditions or error behavior so the header contains Purpose,
Exports, and Invariant sections.
In `@src/schemas/model-profile.schema.ts`:
- Around line 97-106: The ConfiguredModelProfileSchema currently allows any
non-empty string as keys for the variants record, so invalid runtime names slip
through; change the variants definition to validate keys using the existing
RuntimeNameSchema instead of z.string(), i.e. make variants a z.record keyed by
RuntimeNameSchema with values of ConfiguredModelVariantValueSchema so invalid
runtime keys are rejected at parse time (this touches
ConfiguredModelProfileSchema, the variants property, and uses RuntimeNameSchema
and ConfiguredModelVariantValueSchema that normalization.ts and registry.ts
expect).
In `@src/tests/workspace-smoke/test.meta.json`:
- Around line 7-10: The test's requiredHarnessCapabilities array
(requiredHarnessCapabilities) no longer includes "workspace-mkdir", but the test
writes logs/session.log and needs the harness to create the logs/ directory; add
"workspace-mkdir" back into the requiredHarnessCapabilities array so the harness
knows to allow directory creation for this test.
In `@test/stats.test.ts`:
- Around line 35-52: The createRunStats fixture replaces the entire timing
object when overrides include timing, losing defaults; modify createRunStats to
deep-merge the provided overrides.timing into the default timing object (rather
than spreading overrides last), so default timing fields (avgGenerationMs,
avgScoringMs, avgFrontierEvalMs, minGenerationMs, maxGenerationMs) are preserved
while allowing selective overrides; ensure the function still returns a RunStats
shape consistent with calculateRunStats by merging timing = { ...defaultTiming,
...(overrides.timing || {}) } and then spreading other overrides.
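A sketch of that deep merge; the timing field names come from the comment above, while `defaultRunStats` and the overall fixture shape are placeholders:

```ts
interface RunStatsTiming {
  avgGenerationMs: number;
  avgScoringMs: number;
  avgFrontierEvalMs: number;
  minGenerationMs: number;
  maxGenerationMs: number;
}

const defaultTiming: RunStatsTiming = {
  avgGenerationMs: 1200,
  avgScoringMs: 300,
  avgFrontierEvalMs: 0,
  minGenerationMs: 900,
  maxGenerationMs: 1500,
};

// Placeholder for the rest of the fixture's defaults.
const defaultRunStats = { totalItems: 0 };

/** Keeps timing defaults while letting a test override individual timing fields. */
function createRunStats(overrides: { timing?: Partial<RunStatsTiming> } = {}) {
  return {
    ...defaultRunStats,
    timing: { ...defaultTiming, ...(overrides.timing ?? {}) },
  };
}

// Only maxGenerationMs changes; the other timing defaults survive the merge.
const stats = createRunStats({ timing: { maxGenerationMs: 2500 } });
```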
---
Outside diff comments:
In `@apps/dashboard/scripts/build-index.ts`:
- Around line 174-231: sanitizePublishedRun currently preserves
run.machine.instanceId (and the build also copies machineInstanceId into
index.json), leaking stable per-host identifiers; update sanitizePublishedRun to
either remove/undefine run.machine.instanceId or replace it with a
non-reversible scrubbed value (e.g. a salted hash or normalized token) before
returning the RunResult, and ensure the same transformation is applied wherever
machineInstanceId is copied into index.json (the code that writes
machineInstanceId should use the same sanitizer/tokenizer); reference the
sanitizePublishedRun function and the run.machine.instanceId / machineInstanceId
fields when making the change.
In `@llm/implementation/multi-runtime-mvp-implementation.md`:
- Around line 189-200: The validation-evidence block is inconsistent: the header
lists canonical model profile `qwen3-27b-instruct` but the evidence lines
reference `qwen2.5` models (e.g., ollama `qwen2.5:14b`, vLLM
`Qwen/Qwen2.5-14B-Instruct`); pick the correct model family and make the section
consistent by either changing the canonical profile to `qwen2.5` or updating the
evidence entries to the corresponding `qwen3-27b-instruct` runtime names (update
all occurrences of `qwen2.5` in the evidence block and any by-model entries to
match `qwen3-27b-instruct` naming).
In `@src/results/aggregate.ts`:
- Around line 374-421: The code seeds machineSummary with the first
VerificationStatus seen (in the matchedRuns loop) and never updates it, so
merged profiles can show an incorrect status; change this by computing an
aggregated status per profile instead of taking the first one — add a small
helper (e.g., computeAggregateVerificationStatus or a verification precedence
map where rejected > self_reported > verified) and use it whenever you set or
update machineSummary entries (referencing machineSummary, matchedRuns, items,
and the VerificationStatus type); update the logic in both the matchedRuns
seeding block and the items-processing block to combine the existing status with
the new status using that helper so the final MachineAggregateSummary reflects
the worst/most conservative status across all runs and instances
(profileRunSet/profileInstanceSet remain unchanged).
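A minimal sketch of such a helper, using the precedence named in the prompt (rejected > self_reported > verified); the literal union values are assumed from that description:

```ts
type VerificationStatus = "verified" | "self_reported" | "rejected";

// Higher number wins; "rejected" is the most conservative outcome.
const VERIFICATION_PRECEDENCE: Record<VerificationStatus, number> = {
  verified: 0,
  self_reported: 1,
  rejected: 2,
};

function computeAggregateVerificationStatus(
  current: VerificationStatus,
  next: VerificationStatus,
): VerificationStatus {
  return VERIFICATION_PRECEDENCE[next] > VERIFICATION_PRECEDENCE[current]
    ? next
    : current;
}
```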
In `@src/results/compare.ts`:
- Around line 231-254: computeDeltas currently substitutes missing
automatedScore/frontierEval with zeroes causing spurious deltas; change the
logic in computeDeltas (the automatedScore and frontierEval blocks) to only
compute and return a ScoreDelta or EvalDelta when both a and b actually have the
metric (i.e., a.automatedScore && b.automatedScore for automatedScore, and
a.frontierEval && b.frontierEval for frontierEval); otherwise set
automatedScore/frontierEval to null and handle availability changes separately
if you need that signal; apply the same fix to the other similar block later in
computeDeltas (around the second automatedScore/frontierEval handling).
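A minimal sketch of the guard for one metric; the `ScoreDelta` shape and the `mean` field are assumptions, not the repository's actual types:

```ts
interface ScoreDelta {
  a: number;
  b: number;
  delta: number;
}

function computeScoreDelta(
  a: { automatedScore?: { mean: number } },
  b: { automatedScore?: { mean: number } },
): ScoreDelta | null {
  // Only compare when both sides actually report the metric; a missing score
  // becomes null rather than being treated as zero.
  if (!a.automatedScore || !b.automatedScore) {
    return null;
  }
  return {
    a: a.automatedScore.mean,
    b: b.automatedScore.mean,
    delta: b.automatedScore.mean - a.automatedScore.mean,
  };
}
```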
---
Duplicate comments:
In `@src/cli/migrate-machine-command.ts`:
- Around line 127-135: Normalization errors from
normalizeKnownRunPayload()/normalizeKnownPlanPayload() (called via
normalize(parsed)) aren't wrapped with the artifact path, so exceptions escape
the artifact-scoped error message; wrap the normalize(parsed) call in a
try/catch and on error throw a new Error that includes `artifactPath` (same
pattern used for validate) and pass the caught error as the cause, updating the
block around normalize(parsed) / validate(normalized) so both normalization and
validation failures are reported with the artifact path.
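A minimal sketch of the wrapping, with `normalize`, `validate`, and `artifactPath` passed in explicitly since their real signatures are assumed:

```ts
function normalizeAndValidateArtifact<T>(
  artifactPath: string,
  parsed: unknown,
  normalize: (value: unknown) => unknown,
  validate: (value: unknown) => T,
): T {
  let normalized: unknown;
  try {
    normalized = normalize(parsed);
  } catch (error) {
    // Same artifact-scoped message pattern the prompt describes for validate().
    throw new Error(`Failed to normalize ${artifactPath}`, { cause: error });
  }
  try {
    return validate(normalized);
  } catch (error) {
    throw new Error(`Failed to validate ${artifactPath}`, { cause: error });
  }
}
```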
In `@src/lib/machine-profile/probe-linux.ts`:
- Around line 69-77: parseLspciAccelerators is currently returning every
VGA/3D/display lspci line as a confirmed accelerator; change it to mark those
entries as uncorroborated candidates (e.g., add a property like source: "lspci"
and corroborated: false or kind: "candidate") instead of treating them as real
devices, and update the merging logic that uses hasCorroboratedAccelerators to
only promote entries with corroborated: true (or vendor-specific kind) when a
vendor probe (e.g., NVIDIA/AMD probe functions) confirms a match; apply the same
change to the other lspci parsing block referenced around lines 107-123 so
lspci-only entries never inflate acceleratorCount or profileKey unless
explicitly corroborated.
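A minimal sketch of the candidate flag and promotion rule; the accelerator shape and probe names are assumptions for illustration:

```ts
interface AcceleratorCandidate {
  modelRaw: string;
  vendor?: string;
  source: "lspci" | "vendor-probe";
  corroborated: boolean;
}

// lspci lines become uncorroborated candidates, never confirmed devices.
function parseLspciAccelerators(lines: string[]): AcceleratorCandidate[] {
  return lines
    .filter((line) => /VGA compatible controller|3D controller|Display controller/i.test(line))
    .map((line) => ({ modelRaw: line.trim(), source: "lspci" as const, corroborated: false }));
}

// Once any vendor probe confirmed real devices, lspci-only candidates are dropped;
// otherwise they are reported only as uncorroborated candidates.
function selectReportedAccelerators(candidates: AcceleratorCandidate[]): AcceleratorCandidate[] {
  const confirmed = candidates.filter((c) => c.corroborated || c.source !== "lspci");
  return confirmed.length > 0 ? confirmed : candidates;
}
```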
---
Nitpick comments:
In `@src/results/compare.ts`:
- Around line 142-220: The compare.ts file is too large and should have the
aggregate-summary logic split into a helper module: extract the AggregateScoring
and AggregateFrontier interfaces and the functions calculatePassRate,
computeAggregateScoring, and computeAggregateFrontier into a new module (e.g.,
results/aggregate.ts), export them, and replace their original definitions in
compare.ts with imports; update any references to these symbols
(calculatePassRate, computeAggregateScoring, computeAggregateFrontier,
AggregateScoring, AggregateFrontier) to use the new module, ensure types/exports
align with existing signatures (same parameters and return types), and run type
checks to fix any missing imports or exported names.
In `@src/runner/item-executor.ts`:
- Around line 186-188: The code unconditionally assigns signalAssessment =
scoringOutcome.signalAssessment which discards any pre-existing generation-level
assessment when scoringOutcome.signalAssessment is undefined; update the
assignment in the block that handles scoringOutcome (around variables
generation, scoringOutcome, scoringResult) to preserve existing generation-level
signalAssessment by only replacing it when scoringOutcome.signalAssessment is
defined (e.g., signalAssessment = scoringOutcome.signalAssessment ??
signalAssessment or an explicit undefined check), or merge the two assessments
if needed before the finalization logic that builds the final result from
existing signalAssessment so generation-level taints (like
tool_permission_denied) are not lost.
In `@test/plan-builder.test.ts`:
- Around line 325-340: Replace the hard-coded legacy schemaVersion "0.4.0" in
these test fixtures with the project's current schema version constant (i.e.,
use the exported schema/current version constant instead of the string literal)
so the model-profile regression cases exercise the current plan/run schema path;
update the schemaVersion property in the object shown (and the other occurrences
you noted at 421-436, 490-505, 543-558) to reference the canonical schema
constant (e.g., SCHEMA_VERSION or CURRENT_SCHEMA_VERSION) rather than the old
"0.4.0".
In `@test/workspace-prompt-parity.test.ts`:
- Around line 64-98: The test "blind coding prompts declare the required public
method contracts" is missing assertions for the calculator's multiply and divide
method signatures; update the test to assert that calculatorPrompt contains
"`multiply(n: number)`" and "`divide(n: number)`" (add these alongside the
existing expects for add/subtract/clear/memoryRecall/etc.), referencing the
calculatorPrompt variable so the new assertions follow the same pattern as the
other expect(...) calls.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 30f49794-99d4-47e3-960e-66ed3b1d9617
📒 Files selected for processing (102)
@CodeRabbit full review
✅ Actions performed
Full review triggered.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/harnesses/code-output-policy.ts (1)
224-233: ⚠️ Potential issue | 🟠 Major
Use the extracted candidate in the raw-signal accept path.
When `extractCode()` finds fenced/extracted code but `code.length < minOutputLength`, this branch still accepts the output via `hasRawCodeSignals()` and persists `trimmed`. That leaves fences/prose in `code` while `method` still reports the extracted method, so short valid snippets can be accepted and then fail import/compile downstream.
🐛 Proposed fix
```diff
   if (hasRawCodeSignals) {
     return {
       shouldRetry: false,
       reason: "ok",
-      code: trimmed,
+      code: candidateCode,
       method: extracted.method,
       taintReasons: getAcceptedOutputTaintReasons(trimmed, extracted),
     };
   }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/code-output-policy.ts` around lines 224 - 233, The raw-signal accept path currently returns trimmed but uses extracted.method, causing fenced/prose to slip through; change the returned code and taint reasoning to use the extracted candidate (e.g., extracted.code or extracted.candidate as provided by extractCode()) when hasRawCodeSignals is true so the accepted payload matches the extracted method: replace usages of trimmed with the extracted code value and pass that extracted value into getAcceptedOutputTaintReasons, keeping method: extracted.method unchanged.apps/dashboard/scripts/build-index.ts (1)
192-239: ⚠️ Potential issue | 🟠 Major
Redact internal transcripts in error/failure strings too.
`INTERNAL_TRACE_PATTERN` only gates `generation.output`. The same transcript in `generation.error` or a structured `generationFailure`/`scoringFailure`/`frontierEvalFailure` message still gets published after path scrubbing, which defeats the new trace redaction.
🔒 Minimal fix
```diff
       ...(item.generation.error
         ? {
-            error: sanitizePublishedText(item.generation.error),
+            error: INTERNAL_TRACE_PATTERN.test(item.generation.error)
+              ? "[redacted internal tool transcript]"
+              : sanitizePublishedText(item.generation.error),
           }
         : {}),
@@
       ...(item.generationFailure
         ? {
             generationFailure: {
               ...item.generationFailure,
-              message: sanitizePublishedText(item.generationFailure.message),
+              message: INTERNAL_TRACE_PATTERN.test(
+                item.generationFailure.message,
+              )
+                ? "[redacted internal tool transcript]"
+                : sanitizePublishedText(item.generationFailure.message),
             },
           }
         : {}),
       ...(item.scoringFailure
         ? {
             scoringFailure: {
               ...item.scoringFailure,
-              message: sanitizePublishedText(item.scoringFailure.message),
+              message: INTERNAL_TRACE_PATTERN.test(item.scoringFailure.message)
+                ? "[redacted internal tool transcript]"
+                : sanitizePublishedText(item.scoringFailure.message),
             },
           }
         : {}),
       ...(item.frontierEvalFailure
         ? {
             frontierEvalFailure: {
               ...item.frontierEvalFailure,
-              message: sanitizePublishedText(item.frontierEvalFailure.message),
+              message: INTERNAL_TRACE_PATTERN.test(
+                item.frontierEvalFailure.message,
+              )
+                ? "[redacted internal tool transcript]"
+                : sanitizePublishedText(item.frontierEvalFailure.message),
             },
           }
         : {}),
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/scripts/build-index.ts` around lines 192 - 239, The code currently only redacts INTERNAL_TRACE_PATTERN for generation.output but still publishes internal transcripts present in generation.error and the various failure.message fields; update the sanitization logic in the mapping for generation.error, generationFailure.message, scoringFailure.message, and frontierEvalFailure.message to first test INTERNAL_TRACE_PATTERN and if it matches replace with "[redacted internal tool transcript]" (same token used for output) otherwise call sanitizePublishedText; ensure you apply the same pattern check for any other places where a raw message/error string is being set (e.g., the generation.error branch and the message fields inside generationFailure, scoringFailure, frontierEvalFailure) so all internal traces are consistently redacted before publishing.
♻️ Duplicate comments (5)
src/tests/workspace-smoke/test.meta.json (1)
7-10: ⚠️ Potential issue | 🟠 Major
Restore `workspace-mkdir` for `workspace-smoke`.
This task still writes into a nested `logs/` path, so dropping `workspace-mkdir` turns a benchmark task into a harness-capability failure. The same bad capability set is then propagated into the checked-in `workspace-smoke` plan artifacts.
🛠️ Minimal fix
"requiredHarnessCapabilities": [ "workspace-read", - "workspace-write" + "workspace-write", + "workspace-mkdir" ],🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tests/workspace-smoke/test.meta.json` around lines 7 - 10, The test meta for workspace-smoke removed the "workspace-mkdir" capability causing tasks that write nested logs/ to fail; restore "workspace-mkdir" in the requiredHarnessCapabilities array in the test.meta.json for the workspace-smoke test (update the requiredHarnessCapabilities JSON array to include "workspace-mkdir" alongside "workspace-read" and "workspace-write") so the harness can create nested directories for logs.test/stats.test.ts (1)
35-54: ⚠️ Potential issue | 🟡 Minor
`timing` still gets overwritten by the final spread.
Line 53 reapplies `overrides.timing`, so partial timing overrides drop the merged defaults again. That means this fixture still doesn't guarantee a stable `RunStats` shape across tests.
💡 Suggested fix
```diff
 function createRunStats(overrides: Partial<RunStats> = {}): RunStats {
+  const { timing: timingOverrides, ...rest } = overrides;
   const timing = {
     avgGenerationMs: 1000,
     avgScoringMs: null,
     avgFrontierEvalMs: null,
     minGenerationMs: 1000,
     maxGenerationMs: 1000,
-    ...(overrides.timing ?? {}),
+    ...(timingOverrides ?? {}),
   };
   return {
     timing,
     tokens: null,
     scoring: null,
     trustedScoring: null,
     frontier: null,
     trustedFrontier: null,
     signal: null,
     generationFailures: null,
-    ...overrides,
+    ...rest,
   };
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/stats.test.ts` around lines 35 - 54, The helper createRunStats merges default timing but then re-applies overrides via the final spread, which lets overrides.timing clobber the merged defaults; fix by excluding timing from the final spread so the merged timing is preserved — e.g., build the merged timing as you already do, then when returning the RunStats object spread the rest of overrides without timing (use a destructuring like const { timing: _, ...rest } = overrides) and return { timing, ...rest } to keep the merged defaults intact; reference createRunStats and the RunStats fixture to locate the change.src/lib/machine-profile/probe-macos.ts (1)
76-84: ⚠️ Potential issue | 🟡 Minor
Truthy check drops valid `memoryBytes: 0`.
Line 80 uses `...(memoryBytes ? { memoryBytes } : {})`, which treats `0` as falsy and excludes it. If `parseMemoryBytes()` ever returns `0`, that GPU becomes indistinguishable from "unknown memory".
Proposed fix
```diff
   return [
     {
       modelRaw,
       ...(vendor ? { vendor } : {}),
-      ...(memoryBytes ? { memoryBytes } : {}),
+      ...(memoryBytes !== undefined ? { memoryBytes } : {}),
       kind,
       backend: "metal",
     },
   ];
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-macos.ts` around lines 76 - 84, The object spread currently uses a truthy check that drops valid zero values: replace the conditional spread ...(memoryBytes ? { memoryBytes } : {}) with a check that only omits memoryBytes when it is null/undefined (e.g. memoryBytes != null or typeof memoryBytes === "number") so that memoryBytes === 0 is preserved; update the returned object in the function that builds the probe result (the place referencing memoryBytes and modelRaw/kind/backend) to use that null/undefined check instead of a truthy check.src/lib/machine-profile/legacy.ts (1)
147-176: ⚠️ Potential issue | 🟠 Major
Parse the legacy branch with Zod instead of shape checks + cast.
This branch still validates only a few fields and then turns the whole payload into `LegacyMachineProfile`. Invalid-but-JSON-valid legacy data can escape `migrateLegacyMachineProfile()` as a typed `MachineProfile`, and direct callers never hit the final `RunPlanSchema`/`RunResultSchema` parse. Please add a dedicated legacy schema and normalize `result.data`.
As per coding guidelines, "Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as unknown and parse with Zod immediately."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/legacy.ts` around lines 147 - 176, The current branch performs ad-hoc shape checks then casts rawMachine to LegacyMachineProfile, which lets malformed JSON slip through; replace those checks with a Zod schema (e.g., LegacyMachineProfileSchema) that fully describes the expected legacy shape, use Schema.parse(rawMachine) instead of casting, and validate/normalize result.data with that schema before calling migrateLegacyObservedHardware, normalizeMachineProfile, buildLegacyInstanceId, buildMachineProfileKey, and buildMachineProfileLabel so only parsed/typed legacy objects reach the rest of the logic.src/lib/machine-profile/probe-linux.ts (1)
107-123: ⚠️ Potential issue | 🟠 Major
Don’t return raw `lspci` controllers once a backend probe succeeds.
`hasCorroboratedAccelerators` only controls `status`; it does not filter the returned set. On a host with `nvidia-smi` plus an ASPEED/BMC or virtual display controller, the `lspci` fallback still lands in `accelerators`, so normalization produces a mixed-accelerator profile and the wrong `profileKey`/label.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-linux.ts` around lines 107 - 123, The current code returns dedupedAccelerators derived from accelerators (which includes raw parseLspciAccelerators results) even when a stronger probe (e.g., nvidia-smi) corroborates real devices; change the return path so that when hasCorroboratedAccelerators is true you filter dedupedAccelerators to only include entries that are corroborated (i.e., accelerator.backend !== undefined || accelerator.vendor !== undefined) before returning so raw lspci-only controllers (from runProbe("lspci") / parseLspciAccelerators) are excluded; update the block that builds the returned object (using dedupeAccelerators, hasCorroboratedAccelerators, and the accelerators array) to apply this filter and return the filtered list instead of the unfiltered dedupedAccelerators.
🧹 Nitpick comments (11)
src/tests/calculator-stateful/prompt.blind.md (1)
24-24: Make all signatures fully explicit in TypeScript notation.
Line 24 is clearer, but the contract still mixes prose with signatures. Consider expressing every method as explicit signatures (e.g., `result(): number`, `memoryStore(): Calculator`, `clear(): Calculator`) to reduce model-output ambiguity.
🤖 Prompt for AI Agents
result(): number,memoryStore(): Calculator,clear(): Calculator) to reduce model-output ambiguity.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tests/calculator-stateful/prompt.blind.md` at line 24, Update the documentation to use explicit TypeScript-style method signatures for the Calculator API instead of prose; e.g., replace descriptions like "result() and memoryRecall() return numbers; the other methods remain chainable" with explicit signatures such as result(): number; memoryRecall(): number; memoryStore(value: number): Calculator; clear(): Calculator; add(value: number): Calculator; subtract(value: number): Calculator (include every method from the prompt example with exact names used in the diff so the contract is unambiguous).src/tests/event-emitter/prompt.blind.md (1)
18-18: Clarify the behavior for events with no listeners.The specification states that
listenerCountreturns "the current listener count for that event," but it doesn't explicitly cover the edge case where an event has never been registered or has zero listeners.Consider adding explicit wording such as: "
listenerCountreturns the current listener count for that event (0 if no listeners are registered)."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tests/event-emitter/prompt.blind.md` at line 18, Update the specification sentence for listenerCount to explicitly state its behavior when no listeners exist: change the line describing listenerCount to read that it "returns the current listener count for that event (0 if no listeners are registered)" so callers and tests know it returns 0 for events that have never been registered or that currently have no listeners; reference the listenerCount API in the EventEmitter spec or test description to ensure clarity.test/workspace-prompt-parity.test.ts (1)
83-83: Use exact method-signature text for `memoryClear()` assertion.
At Line 83, `toContain("memoryClear")` is looser than the other signature checks and can pass non-contract text.
Suggested test assertion adjustment
```diff
- expect(calculatorPrompt).toContain("`memoryClear`");
+ expect(calculatorPrompt).toContain("`memoryClear()`");
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/workspace-prompt-parity.test.ts` at line 83, The assertion for memoryClear is too loose—update the test that checks calculatorPrompt to assert the exact method signature by replacing the current toContain("`memoryClear`") with a check that includes the parentheses (e.g., toContain("`memoryClear()`")), so the test for memoryClear() matches the exact method-signature text used by the other signature checks (target the calculatorPrompt variable and adjust the assertion accordingly).src/tests/rate-limiter/prompt.blind.md (1)
16-16: Tighten wording to require exactly the three methods.
At Line 16, “with … methods” can still allow extra methods. If the contract is meant to be strict, say “returns an object with exactly these methods”.
Suggested wording tweak
```diff
-- Return an object with `allow(key: string, nowMs: number)`, `remaining(key: string, nowMs: number)`, and `reset(key: string)`.
+- Return an object with exactly these methods: `allow(key: string, nowMs: number)`, `remaining(key: string, nowMs: number)`, and `reset(key: string)`.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tests/rate-limiter/prompt.blind.md` at line 16, Update the prompt text to require the exact API surface: replace "Return an object with `allow(key: string, nowMs: number)`, `remaining(key: string, nowMs: number)`, and `reset(key: string)`." with a stricter phrasing that specifies "returns an object with exactly these methods: `allow(key: string, nowMs: number)`, `remaining(key: string, nowMs: number)`, and `reset(key: string)`" so it cannot be interpreted as allowing additional methods; reference the same method names (`allow`, `remaining`, `reset`) to ensure clarity.apps/dashboard/src/components/leaderboard/leaderboard-filters.ts (1)
64-65: Update stale doc wording (“ID” → “key”).
The implementation now keys by `machineProfileKey`, but the comment still says “ID”.
Suggested doc-only patch
```diff
- * Builds machine filter options keyed by stable machine profile ID.
+ * Builds machine filter options keyed by stable machine profile key.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/leaderboard-filters.ts` around lines 64 - 65, Update the stale doc comment that reads "Builds machine filter options keyed by stable machine profile ID." to use "key" instead of "ID" (so it reads "Builds machine filter options keyed by stable machine profile key" or similar), i.e., edit the comment in leaderboard-filters.ts where that sentence appears to reflect that the implementation keys by machineProfileKey.test/machine-profile-instance-id.test.ts (1)
24-31: Consider a stricter assertion for the Windows path.
The darwin and linux tests use `.toBe()` with exact expected paths, but the win32 test uses `.toContain("AppData")`, which is a weaker assertion. This could mask regressions in the Windows path construction logic (e.g., wrong subdirectory or missing `plebdev-bench/machine-instance-id` suffix).
💡 Suggested stricter assertion
it("prefers injected LOCALAPPDATA on win32", () => { expect( resolveDefaultInstanceIdPath("win32", { LOCALAPPDATA: "C:\\Users\\tester\\AppData\\Local", USERPROFILE: "C:\\Users\\tester", }), - ).toContain("AppData"); + ).toBe( + "C:\\Users\\tester\\AppData\\Local\\plebdev-bench\\machine-instance-id", + ); });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/machine-profile-instance-id.test.ts` around lines 24 - 31, The win32 test uses a weak .toContain("AppData") assertion; change it to assert the full exact path returned by resolveDefaultInstanceIdPath("win32", {...}) — e.g., expect(...).toBe("C:\\Users\\tester\\AppData\\Local\\plebdev-bench\\machine-instance-id") — ensuring you escape backslashes in the test string; update the assertion for resolveDefaultInstanceIdPath to match the exact Windows path and include the full plebdev-bench/machine-instance-id suffix.apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx (1)
29-37: Inconsistent fallback chain with `run-card.tsx`.
This helper uses the chain: `machineDisplayLabel → machineLabel → machineProfileLabel → machineProfileKey → machineProfileId`. However, `run-card.tsx` (lines 36-37) uses: `machineDisplayLabel → machineProfileLabel`. The discrepancy may be intentional (this component handles more legacy run shapes), but consider aligning the fallback logic or documenting why they differ to avoid future maintenance confusion.
💡 Consider extracting to a shared utility
```ts
// In a shared module, e.g., apps/dashboard/src/lib/machine-utils.ts
export function getRunMachineBadgeLabel(run: RunListItem): string | undefined {
  return (
    run.machineDisplayLabel?.trim() ||
    run.machineLabel?.trim() ||
    run.machineProfileLabel?.trim() ||
    run.machineProfileKey?.trim() ||
    run.machineProfileId?.trim() ||
    undefined
  );
}
```
Then import and use in both components for consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx` around lines 29 - 37, The fallback chain for deriving the machine badge differs between getRunMachineBadgeLabel (uses machineDisplayLabel → machineLabel → machineProfileLabel → machineProfileKey → machineProfileId) and the logic in run-card.tsx (machineDisplayLabel → machineProfileLabel); extract the canonical logic into a shared utility function (e.g., export function getRunMachineBadgeLabel(run: RunListItem)) that implements the full trimmed fallback chain, replace the local implementations in leaderboard-latest-runs.tsx and run-card.tsx to import and call this shared function, and if the divergence was intentional instead document the reason near the function and align both files to reference that single documented helper.src/cli/compare-command.ts (1)
238-323: Extract the summary rendering out of this CLI module.
This block pushes `src/cli/compare-command.ts` further past the repo's 500-line limit and mixes CLI flow with a lot of presentation branching. Pulling the summary/section printers into a dedicated formatter module would keep `compareCommand` easier to scan and test.
As per coding guidelines, "Keep files under 500 lines; split aggressively by responsibility."
src/harnesses/opencode-adapter.ts (1)
48-59: Extract the signal-assessment/error-decoration path into a helper.
This adapter is already past the repo's 500-line limit, and the permission-denied parsing plus result/error decoration is a separate concern from process orchestration. Pulling this branch into a small helper/module would make the timeout/retry flow easier to follow and bring the file back toward the size cap.
As per coding guidelines, "`src/**/*.{ts,tsx}`: Keep files under 500 lines; split aggressively by responsibility".
Also applies to: 336-340, 449-462, 528-541
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-adapter.ts` around lines 48 - 59, The permission-denied detection and the subsequent signal-assessment/error-decoration logic in this file should be extracted into a small helper module to reduce file size and separate concerns: move the hasPermissionDeniedStderr function and the code that inspects process signals and decorates results/errors (the "signal-assessment/error-decoration" path referenced around lines 336-340, 449-462, 528-541) into a new helper (e.g., createOpenCodeErrorHelper) that exports a clear API like isPermissionDenied(stderr: string) and decorateProcessResult(result, signal, stderr) and update the adapter to call those helpers (replace direct regex checks and inline decoration with isPermissionDenied and decorateProcessResult) so process orchestration remains in the adapter while parsing/decoration lives in the helper.src/results/compare.ts (1)
270-429: Split the summary helpers out of this module.
The new matching + trusted-metric aggregation logic pushes `src/results/compare.ts` past the repo's 500-line cap. Extracting the aggregate/trusted-summary helpers into a sibling module would keep matching, summary math, and formatting separate.
As per coding guidelines, "Keep files under 500 lines; split aggressively by responsibility."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/results/compare.ts` around lines 270 - 429, computeSummary has grown past the file-size limit; extract the aggregate and trusted-summary helper logic (the scoring/frontier aggregation and trusted-metrics checks) into a new sibling module and import them into src/results/compare.ts. Specifically, move implementations and any small helpers used by computeSummary such as computeAggregateScoring, computeAggregateFrontier, hasCompleteSignalAssessments and the trusted-matching/trusted-aggregation logic that computes trustedScoringDelta/trustedFrontierEvalDelta and metricAvailability/trustedComparedRows into a new module (e.g., results/summaryHelpers) and update computeSummary to call the exported functions (keeping the signature/behavior identical), then update imports and run tests to ensure no behavioral changes.src/lib/model-profile/registry.ts (1)
270-292: Consider: Provenance may be lost during merge.
When merging profiles, if the same `profileKey` exists in multiple registries, the later profile's provenance overrides the earlier one. If a `configured_profile` registry is merged over a `legacy_alias` entry, the provenance changes to `configured_profile` even though some fields might have originated from the legacy alias.
This appears intentional given "later definitions taking precedence," but worth confirming this is the desired behavior for provenance tracking.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/model-profile/registry.ts` around lines 270 - 292, mergeModelProfiles currently allows later registries to overwrite the profile.provenance, losing earlier provenance (e.g., configured_profile overwriting legacy_alias); update mergeModelProfiles to preserve provenance when combining profiles by detecting existing.provenance and profile.provenance and merging them (e.g., concatenate into an array or keep a deduplicated list) instead of simply taking the later value—apply this change in the merge branch where existing and profile are combined (the block that sets merged[profileKey] = existing ? { ...existing, ...profile, variants: {...} } : {...}), ensuring provenance is computed and included in the returned merged object.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/harnesses/goose-adapter.ts`:
- Around line 416-429: The ternary that appends "tool_call_not_executed" is dead
because codeFilePath is unconditionally assigned to solutionPath earlier, so the
condition toolCallDetected && !codeFilePath will always be false; fix by either
removing the dead ternary entirely from the appendSignalAssessmentReasons call
in this return, or change the condition to check the actual provenance flag
(e.g., replace !codeFilePath with a real indicator such as a new or existing
variable that denotes whether the code was extracted vs executed—use
toolCallDetected && <provenanceFlag>), keeping references to toolCallDetected,
codeFilePath/solutionPath, appendSignalAssessmentReasons and the
"tool_call_not_executed" tag so the intended taint is applied correctly.
In `@src/lib/machine-profile/instance-id.ts`:
- Around line 81-127: The writeGeneratedInstanceId function can leave a partial
file if openSync("wx") succeeded but writeFileSync or fchmodSync throws; update
the non-EEXIST error handling to remove the file this process created when
persistence fails mid-write: inside the catch block for non-EEXIST errors, if
fileDescriptor !== undefined (meaning openSync succeeded) close the descriptor
(fs.closeSync(fileDescriptor)) and then attempt to unlink targetPath
(fs.unlinkSync(targetPath)) inside a try/catch (best-effort cleanup), before
rethrowing the original error; keep existing finally logic for safe double-close
but ensure the explicit unlink runs when we own the file so concurrent callers
won’t see a malformed file.
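A minimal sketch of that cleanup path, assuming the `wx` open/write flow described above (names and error shape are illustrative):

```ts
import * as fs from "node:fs";

function writeGeneratedInstanceId(targetPath: string, instanceId: string): void {
  let fileDescriptor: number | undefined;
  try {
    fileDescriptor = fs.openSync(targetPath, "wx"); // EEXIST means another process won the race
    fs.writeFileSync(fileDescriptor, `${instanceId}\n`, "utf8");
    fs.fchmodSync(fileDescriptor, 0o600);
  } catch (error) {
    const code = (error as NodeJS.ErrnoException).code;
    if (code !== "EEXIST" && fileDescriptor !== undefined) {
      // We created the file, so best-effort cleanup keeps concurrent readers
      // from ever observing a partially written instance id.
      const ownedDescriptor = fileDescriptor;
      fileDescriptor = undefined;
      try {
        fs.closeSync(ownedDescriptor);
        fs.unlinkSync(targetPath);
      } catch {
        // ignore cleanup failures; the original error is rethrown below
      }
    }
    throw error;
  } finally {
    if (fileDescriptor !== undefined) {
      fs.closeSync(fileDescriptor);
    }
  }
}
```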
In `@src/results/compare.ts`:
- Around line 27-41: buildCompareKey currently collapses different runtime
variants for the same canonical model profile; update buildCompareKey (and the
payload shape it consumes) to include a stable runtime variant discriminator
(e.g., item.runtimeVariant or item.runtimeTag) in the returned key (append after
item.runtime) so variants like different ollama builds produce distinct keys,
and/or add a duplicate-detection guard in the code that inserts into the
comparison Map (the caller that uses Map.set with keys from buildCompareKey) to
throw an error if a key already exists to fail fast instead of silently
overwriting; reference buildCompareKey and getModelIdentityKey when making this
change and also add the Map.set existence check at the Map insertion site.
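A minimal sketch of both pieces; the item fields (`testId`, `modelProfileKey`, `runtimeVariant`) are assumptions standing in for the real payload shape:

```ts
interface ComparableItem {
  testId: string;
  runtime: string;
  runtimeVariant?: string;
  modelProfileKey: string;
}

function buildCompareKey(item: ComparableItem): string {
  // Including the runtime variant keeps two different ollama builds of the same
  // canonical profile from collapsing into one row.
  return [item.testId, item.modelProfileKey, item.runtime, item.runtimeVariant ?? "default"].join("::");
}

function indexByCompareKey(items: ComparableItem[]): Map<string, ComparableItem> {
  const index = new Map<string, ComparableItem>();
  for (const item of items) {
    const key = buildCompareKey(item);
    if (index.has(key)) {
      throw new Error(`Duplicate compare key: ${key}`); // fail fast instead of silently overwriting
    }
    index.set(key, item);
  }
  return index;
}
```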
In `@src/runner/item-retry.ts`:
- Around line 209-216: The scoring helper runScoringWithCompileRetry can return
signalAssessment as undefined, which currently overwrites an earlier
generation-set signalAssessment (variable signalAssessment) in the caller (e.g.,
where scoringOutcome is assigned), losing taint/trust metadata; update callers
to merge instead of overwrite by assigning signalAssessment =
scoringOutcome.signalAssessment ?? signalAssessment (and apply the same
nullish-coalescing pattern wherever runScoringWithCompileRetry results are
consumed) so the existing signalAssessment is preserved when the helper returns
undefined.
In `@src/runner/plan-builder.ts`:
- Around line 167-205: The loop in build plan silently drops model selectors
from config.models that never resolve on any reachable runtime; update the logic
in plan-builder (around resolveModelSelection, discovered, filteredSet,
resolvedModelProfiles) to detect selectors that remain unmatched after checking
all runtimes and throw a clear Error listing those selectors instead of silently
ignoring them; implement this by tracking unresolved modelSpecs (e.g., a Set
initialized from config.models, removing each modelSpec when you push into
filtered/filteredSet or set resolvedModelProfiles) and after the runtime loop,
if the Set is non-empty throw an Error("Requested model selectors not found:
...") including the missing keys.
In `@test/signal-assessment.test.ts`:
- Around line 1-3: Add the repo-standard header to the top of
test/signal-assessment.test.ts by expanding the existing Purpose block to
include an Exports section (list exported test helpers or `none` if the file
exports nothing) and an Invariants section (any preconditions or assumptions the
tests rely on), matching the shape used in test/machine-profile-probe.test.ts;
ensure the header sits before any imports or code and clearly names the file
purpose, exported symbols (or `none`), and invariants.
---
Outside diff comments:
In `@apps/dashboard/scripts/build-index.ts`:
- Around line 192-239: The code currently only redacts INTERNAL_TRACE_PATTERN
for generation.output but still publishes internal transcripts present in
generation.error and the various failure.message fields; update the sanitization
logic in the mapping for generation.error, generationFailure.message,
scoringFailure.message, and frontierEvalFailure.message to first test
INTERNAL_TRACE_PATTERN and if it matches replace with "[redacted internal tool
transcript]" (same token used for output) otherwise call sanitizePublishedText;
ensure you apply the same pattern check for any other places where a raw
message/error string is being set (e.g., the generation.error branch and the
message fields inside generationFailure, scoringFailure, frontierEvalFailure) so
all internal traces are consistently redacted before publishing.
In `@src/harnesses/code-output-policy.ts`:
- Around line 224-233: The raw-signal accept path currently returns trimmed but
uses extracted.method, causing fenced/prose to slip through; change the returned
code and taint reasoning to use the extracted candidate (e.g., extracted.code or
extracted.candidate as provided by extractCode()) when hasRawCodeSignals is true
so the accepted payload matches the extracted method: replace usages of trimmed
with the extracted code value and pass that extracted value into
getAcceptedOutputTaintReasons, keeping method: extracted.method unchanged.
---
Duplicate comments:
In `@src/lib/machine-profile/legacy.ts`:
- Around line 147-176: The current branch performs ad-hoc shape checks then
casts rawMachine to LegacyMachineProfile, which lets malformed JSON slip
through; replace those checks with a Zod schema (e.g.,
LegacyMachineProfileSchema) that fully describes the expected legacy shape, use
Schema.parse(rawMachine) instead of casting, and validate/normalize result.data
with that schema before calling migrateLegacyObservedHardware,
normalizeMachineProfile, buildLegacyInstanceId, buildMachineProfileKey, and
buildMachineProfileLabel so only parsed/typed legacy objects reach the rest of
the logic.
In `@src/lib/machine-profile/probe-linux.ts`:
- Around line 107-123: The current code returns dedupedAccelerators derived from
accelerators (which includes raw parseLspciAccelerators results) even when a
stronger probe (e.g., nvidia-smi) corroborates real devices; change the return
path so that when hasCorroboratedAccelerators is true you filter
dedupedAccelerators to only include entries that are corroborated (i.e.,
accelerator.backend !== undefined || accelerator.vendor !== undefined) before
returning so raw lspci-only controllers (from runProbe("lspci") /
parseLspciAccelerators) are excluded; update the block that builds the returned
object (using dedupeAccelerators, hasCorroboratedAccelerators, and the
accelerators array) to apply this filter and return the filtered list instead of
the unfiltered dedupedAccelerators.
In `@src/lib/machine-profile/probe-macos.ts`:
- Around line 76-84: The object spread currently uses a truthy check that drops
valid zero values: replace the conditional spread ...(memoryBytes ? {
memoryBytes } : {}) with a check that only omits memoryBytes when it is
null/undefined (e.g. memoryBytes != null or typeof memoryBytes === "number") so
that memoryBytes === 0 is preserved; update the returned object in the function
that builds the probe result (the place referencing memoryBytes and
modelRaw/kind/backend) to use that null/undefined check instead of a truthy
check.
In `@src/tests/workspace-smoke/test.meta.json`:
- Around line 7-10: The test meta for workspace-smoke removed the
"workspace-mkdir" capability causing tasks that write nested logs/ to fail;
restore "workspace-mkdir" in the requiredHarnessCapabilities array in the
test.meta.json for the workspace-smoke test (update the
requiredHarnessCapabilities JSON array to include "workspace-mkdir" alongside
"workspace-read" and "workspace-write") so the harness can create nested
directories for logs.
In `@test/stats.test.ts`:
- Around line 35-54: The helper createRunStats merges default timing but then
re-applies overrides via the final spread, which lets overrides.timing clobber
the merged defaults; fix by excluding timing from the final spread so the merged
timing is preserved — e.g., build the merged timing as you already do, then when
returning the RunStats object spread the rest of overrides without timing (use a
destructuring like const { timing: _, ...rest } = overrides) and return {
timing, ...rest } to keep the merged defaults intact; reference createRunStats
and the RunStats fixture to locate the change.
---
Nitpick comments:
In `@apps/dashboard/src/components/leaderboard/leaderboard-filters.ts`:
- Around line 64-65: Update the stale doc comment that reads "Builds machine
filter options keyed by stable machine profile ID." to use "key" instead of "ID"
(so it reads "Builds machine filter options keyed by stable machine profile key"
or similar), i.e., edit the comment in leaderboard-filters.ts where that
sentence appears to reflect that the implementation keys by machineProfileKey.
In `@apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx`:
- Around line 29-37: The fallback chain for deriving the machine badge differs
between getRunMachineBadgeLabel (uses machineDisplayLabel → machineLabel →
machineProfileLabel → machineProfileKey → machineProfileId) and the logic in
run-card.tsx (machineDisplayLabel → machineProfileLabel); extract the canonical
logic into a shared utility function (e.g., export function
getRunMachineBadgeLabel(run: RunListItem)) that implements the full trimmed
fallback chain, replace the local implementations in leaderboard-latest-runs.tsx
and run-card.tsx to import and call this shared function, and if the divergence
was intentional instead document the reason near the function and align both
files to reference that single documented helper.
In `@src/harnesses/opencode-adapter.ts`:
- Around line 48-59: The permission-denied detection and the subsequent
signal-assessment/error-decoration logic in this file should be extracted into a
small helper module to reduce file size and separate concerns: move the
hasPermissionDeniedStderr function and the code that inspects process signals
and decorates results/errors (the "signal-assessment/error-decoration" path
referenced around lines 336-340, 449-462, 528-541) into a new helper (e.g.,
createOpenCodeErrorHelper) that exports a clear API like
isPermissionDenied(stderr: string) and decorateProcessResult(result, signal,
stderr) and update the adapter to call those helpers (replace direct regex
checks and inline decoration with isPermissionDenied and decorateProcessResult)
so process orchestration remains in the adapter while parsing/decoration lives
in the helper.
In `@src/lib/model-profile/registry.ts`:
- Around line 270-292: mergeModelProfiles currently allows later registries to
overwrite the profile.provenance, losing earlier provenance (e.g.,
configured_profile overwriting legacy_alias); update mergeModelProfiles to
preserve provenance when combining profiles by detecting existing.provenance and
profile.provenance and merging them (e.g., concatenate into an array or keep a
deduplicated list) instead of simply taking the later value—apply this change in
the merge branch where existing and profile are combined (the block that sets
merged[profileKey] = existing ? { ...existing, ...profile, variants: {...} } :
{...}), ensuring provenance is computed and included in the returned merged
object.
In `@src/results/compare.ts`:
- Around line 270-429: computeSummary has grown past the file-size limit;
extract the aggregate and trusted-summary helper logic (the scoring/frontier
aggregation and trusted-metrics checks) into a new sibling module and import
them into src/results/compare.ts. Specifically, move implementations and any
small helpers used by computeSummary such as computeAggregateScoring,
computeAggregateFrontier, hasCompleteSignalAssessments and the
trusted-matching/trusted-aggregation logic that computes
trustedScoringDelta/trustedFrontierEvalDelta and
metricAvailability/trustedComparedRows into a new module (e.g.,
results/summaryHelpers) and update computeSummary to call the exported functions
(keeping the signature/behavior identical), then update imports and run tests to
ensure no behavioral changes.
In `@src/tests/calculator-stateful/prompt.blind.md`:
- Line 24: Update the documentation to use explicit TypeScript-style method
signatures for the Calculator API instead of prose; e.g., replace descriptions
like "result() and memoryRecall() return numbers; the other methods remain
chainable" with explicit signatures such as result(): number; memoryRecall():
number; memoryStore(value: number): Calculator; clear(): Calculator; add(value:
number): Calculator; subtract(value: number): Calculator (include every method
from the prompt example with exact names used in the diff so the contract is
unambiguous).
In `@src/tests/event-emitter/prompt.blind.md`:
- Line 18: Update the specification sentence for listenerCount to explicitly
state its behavior when no listeners exist: change the line describing
listenerCount to read that it "returns the current listener count for that event
(0 if no listeners are registered)" so callers and tests know it returns 0 for
events that have never been registered or that currently have no listeners;
reference the listenerCount API in the EventEmitter spec or test description to
ensure clarity.
In `@src/tests/rate-limiter/prompt.blind.md`:
- Line 16: Update the prompt text to require the exact API surface: replace
"Return an object with `allow(key: string, nowMs: number)`, `remaining(key:
string, nowMs: number)`, and `reset(key: string)`." with a stricter phrasing
that specifies "returns an object with exactly these methods: `allow(key:
string, nowMs: number)`, `remaining(key: string, nowMs: number)`, and
`reset(key: string)`" so it cannot be interpreted as allowing additional
methods; reference the same method names (`allow`, `remaining`, `reset`) to
ensure clarity.
In `@test/machine-profile-instance-id.test.ts`:
- Around line 24-31: The win32 test uses a weak .toContain("AppData") assertion;
change it to assert the full exact path returned by
resolveDefaultInstanceIdPath("win32", {...}) — e.g.,
expect(...).toBe("C:\\Users\\tester\\AppData\\Local\\plebdev-bench\\machine-instance-id")
— ensuring you escape backslashes in the test string; update the assertion for
resolveDefaultInstanceIdPath to match the exact Windows path and include the
full plebdev-bench/machine-instance-id suffix.
In `@test/workspace-prompt-parity.test.ts`:
- Line 83: The assertion for memoryClear is too loose—update the test that
checks calculatorPrompt to assert the exact method signature by replacing the
current toContain("`memoryClear`") with a check that includes the parentheses
(e.g., toContain("`memoryClear()`")), so the test for memoryClear() matches the
exact method-signature text used by the other signature checks (target the
calculatorPrompt variable and adjust the assertion accordingly).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 555b9328-c221-4afb-b1e0-ece45de4ad5b
📒 Files selected for processing (104)
Summary
This PR lands two related identity layers that now both exist on staging:
Canonical model identities that group runtime-specific variants under one model profile. This lets us treat the same base model as one thing in the product while still preserving runtime, quantization, and format details underneath.
Examples:
`qwen3-27b-instruct` via Ollama
`qwen3-27b-instruct` profile via vLLM
Canonical hardware profiles that are separate from per-machine instance IDs. This lets us aggregate by shared hardware shape while still distinguishing individual hosts.
Examples:
What Changed
`migrate-machine-profiles` CLI.
Why This Matters
Before this branch, model names were mostly runtime-specific strings and machine identity was too coarse. With these changes:
Validation
`bun run typecheck`
`bun run test`
Summary by CodeRabbit
New Features
Improvements
Schema Updates