Staging #19 · Conversation
Implement canonical machine profiles
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings, and manage or resume reviews with the usual CodeRabbit review commands or the comment checkboxes.
📝 Walkthrough

Standardizes machine identity into per-instance identifiers and normalized hardware profiles.
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant CLI as CLI
participant Instance as InstanceID Resolver
participant Probe as Observed Hardware Probe
participant Normalizer as Profile Normalizer
participant Writer as Publisher / Indexer
CLI->>Instance: resolveMachineInstanceId(opts)
Instance-->>CLI: {instanceId, instanceIdSource}
CLI->>Probe: collectObservedHardwareProfile()
Probe-->>Normalizer: observedHardware
Normalizer-->>CLI: {normalizedProfile, profileKey, profileLabel}
CLI->>Writer: write sanitized run/plan (include instance/profile)
Writer-->>CLI: published run.json / plan.json / index.json
```
```mermaid
sequenceDiagram
participant Migrate as migrate-machine-profiles
participant FS as File system reader
participant Legacy as Legacy normalizer
participant Schema as Zod validation
participant Atomic as Atomic swap writer
participant Indexer as Dashboard indexer
Migrate->>FS: discover run dirs
loop per-run
FS->>FS: read run.json / plan.json
FS->>Legacy: normalizeKnownRunPayload / Plan
Legacy->>Schema: validate against RunResult/RunPlan schemas
Schema-->>Legacy: validated payload
Legacy-->>Atomic: stage changed files (temp + backup)
Atomic->>Atomic: rename temp -> orig (with rollback on failure)
end
alt --rebuild-dashboard-index
Migrate->>Indexer: buildDashboardIndexArtifacts(source, outDir)
Indexer-->>Migrate: index.json + aggregates
end
Migrate-->>Migrate: exit(0)
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 7
🧹 Nitpick comments (8)
apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx (1)
23-39: Extensive fallback chain for machine badge label.

The fallback order ensures backwards compatibility across schema versions. However, including `machineInstanceId` (line 32) in a display label fallback may show technical identifiers like `legacy_profile:anon_6fe55c56c31d` to users. Consider whether this is intentional or if `machineInstanceId` should be excluded from the display fallback chain (keeping it only as internal tracking).

🔧 Optional: Remove machineInstanceId from display fallback

```diff
 function getRunMachineBadgeLabel(run: RunListItem): string | undefined {
   return (
     run.machineDisplayLabel?.trim() ||
-    run.machineInstanceId?.trim() ||
     run.machineLabel?.trim() ||
     run.machineProfileLabel?.trim() ||
     run.machineProfileKey?.trim() ||
     run.machineProfileId?.trim() ||
     undefined
   );
 }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx` around lines 23 - 39, The current getRunMachineBadgeLabel function includes machineInstanceId in the display fallback chain which can expose technical IDs; remove machineInstanceId from the returned fallback sequence so the function only uses human-friendly fields (machineDisplayLabel, machineLabel, machineProfileLabel, machineProfileKey, machineProfileId) and leave machineInstanceId for internal tracking elsewhere.

src/results/aggregate.ts (1)
299-301: Resolve run metadata once and reuse it.
`resolveRunMetadata()` now runs in the checkpoint filter, again inside the main loop, and again for `summary.instances`. Since it can migrate machine payloads, materializing `{ input, metadata }` once after the filter would avoid repeated normalization and make the new instance-count path easier to follow.

Also applies to: 307-317, 412-416
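A minimal sketch of the suggested shape, with the resolver passed in for illustration (the names are placeholders, not the actual aggregate.ts signatures):

```typescript
interface RunInput {
  checkpointId: string;
}

interface RunMetadata {
  instanceId: string;
}

// Resolve metadata exactly once per run, then reuse the { input, metadata }
// pair in the checkpoint filter, the main loop, and summary.instances.
function materializeRuns(
  runs: RunInput[],
  resolveRunMetadata: (input: RunInput) => RunMetadata,
): Array<{ input: RunInput; metadata: RunMetadata }> {
  return runs.map((input) => ({ input, metadata: resolveRunMetadata(input) }));
}

// Distinct-instance counting then reads the already-resolved metadata.
function countDistinctInstances(
  resolved: Array<{ input: RunInput; metadata: RunMetadata }>,
  checkpointId: string,
): number {
  const matched = resolved.filter(({ input }) => input.checkpointId === checkpointId);
  return new Set(matched.map(({ metadata }) => metadata.instanceId)).size;
}
```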
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/results/aggregate.ts` around lines 299 - 301, The code repeatedly calls resolveRunMetadata(input) in the checkpoint filter, later in the main loop and when building summary.instances; instead, map the original runs array to a materialized array of { input, metadata } by calling resolveRunMetadata once per run (use this mapped array when computing matchedRuns by checkpointId, iterating the main aggregation loop, and constructing summary.instances), then update all references that expect the original input to use the stored metadata (e.g., places currently invoking resolveRunMetadata(input) inside the main loop and in summary.instances) so normalization/migration occurs a single time per run and the instance-count path is simplified.

src/schemas/config.schema.ts (1)
96-103: Strip deprecated aliases from the parsed config shape.

Because `machineProfileId` and `machineLabel` remain in `BenchConfigObjectSchema`, the parsed `BenchConfig` type still exposes deprecated fields to internal code. Normalize them here, then drop them so only `machineInstanceId` and `machineDisplayLabel` flow inward.

As per coding guidelines: Validate at boundaries with Zod and pass typed data inward.

Also applies to: 153-163
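A sketch of the alias-stripping step, assuming the migration works on a plain object; the helper name is illustrative and only the four machine fields from this comment are shown:

```typescript
// Copy deprecated keys onto the canonical fields, then delete them so the
// parsed BenchConfig only exposes machineInstanceId / machineDisplayLabel.
function migrateMachineAliases(raw: Record<string, unknown>): Record<string, unknown> {
  const config = { ...raw };
  if (config.machineInstanceId === undefined && typeof config.machineProfileId === "string") {
    config.machineInstanceId = config.machineProfileId;
  }
  if (config.machineDisplayLabel === undefined && typeof config.machineLabel === "string") {
    config.machineDisplayLabel = config.machineLabel;
  }
  delete config.machineProfileId;
  delete config.machineLabel;
  return config;
}
```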
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/schemas/config.schema.ts` around lines 96 - 103, Normalize deprecated aliases before returning the parsed config: after mapping machineProfileId -> machineInstanceId and machineLabel -> machineDisplayLabel, delete the deprecated keys (machineProfileId and machineLabel) from the config object so the returned BenchConfig only exposes machineInstanceId and machineDisplayLabel; apply the same removal logic in the other parsing branch that handles the alternate mapping (the block mirroring the code around machineInstanceId/machineDisplayLabel) and ensure this aligns with BenchConfigObjectSchema validation boundaries.

test/aggregate.test.ts (1)
197-198: Add a same-instance regression case for the new counters.

These assertions only prove accumulation across two different IDs. A regression that counts runs instead of distinct `instanceId`s would still pass. Please add a sibling case where both runs share the same `instanceId` and assert `summary.instances === 1` / `machine.instanceCount === 1`.

Also applies to: 222-223
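A shape sketch of the missing case, assuming `bun:test` and fixture helpers equivalent to what this suite already uses (the helper names here are hypothetical placeholders):

```typescript
import { describe, expect, it } from "bun:test";

// Hypothetical stand-ins for this suite's existing run fixture and aggregator.
declare function makeRun(overrides: { instanceId: string }): unknown;
declare function aggregateRuns(runs: unknown[]): {
  summary: { instances: number };
  machines: Array<{ instanceCount: number }>;
};

describe("instance counters", () => {
  it("collapses two runs sharing an instanceId into one instance", () => {
    const aggregate = aggregateRuns([
      makeRun({ instanceId: "inst-shared" }),
      makeRun({ instanceId: "inst-shared" }),
    ]);
    expect(aggregate.summary.instances).toBe(1);
    expect(aggregate.machines[0]?.instanceCount).toBe(1);
  });
});
```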
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/aggregate.test.ts` around lines 197 - 198, Add a sibling regression test where two runs share the same instanceId and then assert distinct-instance counters collapse to one: create an aggregate from two records with identical instanceId and verify aggregate.summary.instances === 1 and aggregate.machines[0]?.instanceCount === 1 (and mirror the same change for the assertions around the other case mentioned at the second location). This ensures the counters count unique instanceId values rather than run records.

src/cli/migrate-machine-command.ts (2)
114-145: Rollback may leave partial state if backup creation fails mid-loop.

If `fs.copyFile(rewrite.artifactPath, rewrite.backupPath)` fails for the second file after succeeding for the first, the rollback loop attempts to restore from backups. However, the second backup doesn't exist, so only the first file gets "restored" (from an identical backup). This is safe but the error messaging could be clearer.

Additionally, there's a subtle issue: if the rename loop (lines 119-121) fails on the second file after the first rename succeeds, the rollback attempts `copyFile(backupPath, artifactPath)`, but the first file's original content is in `backupPath` while the artifact already has `nextContent`. The rollback correctly restores this.

Consider adding tracking for which operations completed
The current implementation is functionally correct for rollback, but tracking which backup/rename operations completed would allow more precise error messages and cleanup.
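For illustration, a minimal flag-tracking sketch, assuming each staged rewrite records artifactPath/backupPath plus a temp path; the flag names and `tempPath` field are placeholders rather than the actual migrate-machine-command.ts shape:

```typescript
import { promises as fs } from "node:fs";

interface StagedRewrite {
  artifactPath: string;
  backupPath: string;
  tempPath: string;
  backupCreated: boolean; // set right after fs.copyFile to backupPath succeeds
  renameCompleted: boolean; // set right after fs.rename of the temp file succeeds
}

// Undo only the steps that actually completed, so a mid-loop failure never
// "restores" from a backup that was never written.
async function rollbackStagedRewrites(staged: StagedRewrite[]): Promise<void> {
  for (const rewrite of staged) {
    if (rewrite.backupCreated) {
      await fs.copyFile(rewrite.backupPath, rewrite.artifactPath);
    }
    // Temp files are safe to clear whether or not the rename ever ran.
    await fs.rm(rewrite.tempPath, { force: true });
  }
}
```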
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli/migrate-machine-command.ts` around lines 114 - 145, The rollback can attempt operations for files that never had backups or were never renamed; update the logic around stagedRewrites to track which steps completed (e.g., mark per-rewrite flags like backupCreated and renameCompleted while executing fs.writeFile, fs.copyFile, and fs.rename) and use those flags in the catch and finally blocks so you only attempt fs.copyFile(backupPath, artifactPath) for entries with backupCreated and only attempt fs.rename or fs.unlink for entries for which the corresponding temp/backup were actually created; also improve thrown errors and processLogger/error messages to include the rewrite identifier and which operation failed to make rollback outcomes clearer (referencing stagedRewrites, fs.copyFile, fs.writeFile, fs.rename, fs.unlink).
222-228: Redundant validation after Zod parse.

The `dashboardOutputDir === undefined` check at line 224 is redundant since `MigrateMachineCommandOptionsSchema.parse()` at line 214 already validates this via `superRefine`. After successful parsing, `dashboardOutputDir` is guaranteed to be defined when `rebuildDashboardIndex` is true.

Remove redundant check

```diff
 if (parsedOptions.rebuildDashboardIndex) {
   const dashboardOutputDirInput = parsedOptions.dashboardOutputDir;
-  if (dashboardOutputDirInput === undefined) {
-    throw new Error(
-      "--rebuild-dashboard-index requires --dashboard-output-dir to avoid mutating the source results directory",
-    );
-  }
-  const dashboardOutputDir = path.resolve(dashboardOutputDirInput);
+  // dashboardOutputDir is guaranteed defined by schema validation
+  const dashboardOutputDir = path.resolve(dashboardOutputDirInput!);
```

Alternatively, you could use a discriminated union or refine the schema to make TypeScript aware that `dashboardOutputDir` is defined when `rebuildDashboardIndex` is true.
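A hedged sketch of that alternative, assuming Zod and only the two options named here; it also assumes a Zod version that accepts boolean literal discriminators (otherwise a plain `z.union` of the same two objects infers the same type):

```typescript
import { z } from "zod";

// dashboardOutputDir becomes required at the type level whenever
// rebuildDashboardIndex is true, so no runtime re-check is needed.
const MigrateMachineOptionsSketch = z.discriminatedUnion("rebuildDashboardIndex", [
  z.object({
    rebuildDashboardIndex: z.literal(true),
    dashboardOutputDir: z.string().min(1),
  }),
  z.object({
    rebuildDashboardIndex: z.literal(false),
    dashboardOutputDir: z.string().min(1).optional(),
  }),
]);

// After parse, narrowing on rebuildDashboardIndex === true makes
// dashboardOutputDir a plain string with no non-null assertion.
type MigrateMachineOptionsSketchType = z.infer<typeof MigrateMachineOptionsSketch>;
```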
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/cli/migrate-machine-command.ts` around lines 222 - 228, The runtime check for dashboardOutputDir being undefined is redundant because MigrateMachineCommandOptionsSchema.parse() already enforces this when parsedOptions.rebuildDashboardIndex is true; remove the if (dashboardOutputDirInput === undefined) { throw ... } block and directly use parsedOptions.dashboardOutputDir (or add a non-null assertion where used) so TypeScript/Zod guarantees are relied upon instead of duplicating validation in migrate-machine-command.ts around parsedOptions.rebuildDashboardIndex and parsedOptions.dashboardOutputDir.

src/cli/run-command.ts (1)
211-226: Redundant reassignment of options properties.

The `options.machineInstanceId` and `options.machineDisplayLabel` reassignments appear unnecessary since `resolvedMachineId` and `resolvedMachineLabel` are already used directly in `configInput` at lines 250-251. The `options` object mutations aren't consumed elsewhere.

Consider removing the redundant reassignments

```diff
 if (legacyMachineId) {
   logger.warn(
     "Warning: --machine-id is deprecated; use --machine-instance-id",
   );
-  options.machineInstanceId = resolvedMachineId;
-} else if (canonicalMachineId) {
-  options.machineInstanceId = canonicalMachineId;
 }
 if (legacyMachineLabel) {
   logger.warn(
     "Warning: --machine-label is deprecated; use --machine-display-label",
   );
-  options.machineDisplayLabel = resolvedMachineLabel;
-} else if (canonicalMachineLabel) {
-  options.machineDisplayLabel = canonicalMachineLabel;
 }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/cli/run-command.ts` around lines 211 - 226, In run-command.ts the branches that set options.machineInstanceId and options.machineDisplayLabel (inside the legacy/canonical checks for legacyMachineId, canonicalMachineId, legacyMachineLabel, canonicalMachineLabel) are redundant because resolvedMachineId and resolvedMachineLabel are used directly when building configInput; remove the options mutations while keeping the deprecation logger.warn calls, and verify nothing else reads options.machineInstanceId or options.machineDisplayLabel elsewhere (search for usages of options, configInput, resolvedMachineId, resolvedMachineLabel) so behavior is unchanged.

src/lib/machine-profile/probe.ts (1)
1-539: Split the platform-specific probing/parsing out of this module.

This file is 539 lines and currently owns command execution, parsers, dedupe, and per-platform collection. Extracting the OS-specific collectors/parsers into sibling modules will keep this `src/**/*` file under the repo limit and make regressions easier to test in isolation.
src/**/*file under the repo limit and make regressions easier to test in isolation.As per coding guidelines, "Keep files under 500 lines; split aggressively by responsibility."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe.ts` around lines 1 - 539, The file is too large and mixes OS-specific probes/parsers with shared helpers; split responsibilities by moving platform-specific collectors and parsers into sibling modules. Extract collectMacosAccelerators and parseMacosAccelerators to a macos module, collectLinuxAccelerators plus parseNvidiaAccelerators/parseRocmAccelerators/parseLspciAccelerators to a linux module, and collectWindowsAccelerators plus parseWindowsAccelerators to a windows module; keep shared utilities like runProbe, parsePositiveInt, parseMemoryBytes, dedupeAccelerators, BYTES_PER_* constants and collectObservedHardwareProfile (or have it import the collectors) in the current probe module, update imports/exports to use the new named exports (e.g., parseMacosAccelerators, collectLinuxAccelerators), and adjust tests to import the new modules so the top-level file falls under 500 lines.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/scripts/build-index.ts`:
- Around line 431-436: Add a guard to reject overlapping source/output trees:
resolve both outputResultsDir and sourceResultsDir to absolute paths and compute
path.relative(sourceResultsDir, outputResultsDir) and
path.relative(outputResultsDir, sourceResultsDir); if either relative result is
'' (same dir) or does not start with '..' (meaning one is inside the other),
throw an Error and abort before calling rm() in writePublishedRunBundles and
before performing the scan of sourceResultsDir (the code that iterates
subdirectories under sourceResultsDir). This prevents outputResultsDir from
equaling or being a child of sourceResultsDir (and vice versa).
In `@apps/dashboard/src/lib/schemas.ts`:
- Around line 135-147: HardwareProfileSchema is missing the invariant checks
linking acceleratorDetection.status and accelerators; add a superRefine on
HardwareProfileSchema that reads acceleratorDetection.status and enforces: if
status === "detected" then accelerators.length > 0 (add an error on
"accelerators" when empty), and if status === "none_detected" then
accelerators.length === 0 (add an error on "accelerators" when not empty); use
HardwareProfileSchema.superRefine to perform these validations so the dashboard
schema matches the CLI's src/schemas/common.schema.ts invariants.
In `@README.md`:
- Around line 149-150: The example for --machine-instance-id uses a
hardware-like identifier ("mac-mini-m4-pro") which can be reused across multiple
hosts; update the README example command to use a clearly instance-scoped ID
(e.g., "austin-office-mac-mini") and adjust the --machine-display-label if
desired so the sample demonstrates a unique per-machine instance usage (edit the
example line that contains the bun pb invocation with --machine-instance-id and
--machine-display-label).
In `@src/lib/machine-profile/legacy.ts`:
- Around line 142-143: The code currently calls
MachineProfileSchema.safeParse(rawMachine) but ignores the parsed result and
passes rawMachine into normalizeCurrentMachineProfile; change this to capture
the safeParse result (e.g., const parsed =
MachineProfileSchema.safeParse(rawMachine)) and, when parsed.success is true,
pass parsed.data into normalizeCurrentMachineProfile so Zod-applied defaults
(like accelerators defaults) and validated types are used instead of the raw
input.
In `@src/lib/machine-profile/probe.ts`:
- Around line 222-244: The current dedupeAccelerators function collapses
identical vendor|modelRaw keys into a single ObservedAccelerator, which loses
per-device counts and causes acceleratorCount to be wrong downstream; change
dedupeAccelerators to aggregate duplicates instead of dropping them by tracking
and summing an occurrence count (e.g., add or increment a count/acceleratorCount
property on the ObservedAccelerator when merging), while still merging missing
fields (vendor, memoryBytes, backend, kind) from the new entry into the existing
one; ensure the merged object preserves accumulated count so
normalizeMachineProfile() / buildMachineProfileKey() see correct counts.
In `@src/schemas/config.schema.ts`:
- Around line 19-55: The preprocessor (migrateBenchConfigAliases /
readNonEmptyString / validateCanonicalMachineFieldType used in z.preprocess)
must not throw because exceptions escape safeParse; instead remove throws from
readNonEmptyString and validateCanonicalMachineFieldType so they only
normalize/return values, then move all validation into
BenchConfigSchema.superRefine (or, if you must keep checks in the preprocessor,
convert failures into Zod issues by returning an object describing problems
rather than throwing). Implement superRefine to detect blank strings, non-string
canonical fields, and alias conflicts and call ctx.addIssue with appropriate
ZodIssueKinds so BenchConfigSchema.safeParse returns structured ZodError
results.
In `@test/schemas.test.ts`:
- Around line 255-263: The test's expected error regex is incorrect; update the
assertion in the it block that parses HardwareProfileSchema to match the actual
Zod error string by including the "accelerators " prefix and using plain double
quotes (not escaped backslashes). Replace the current toThrow regex with one
that matches: accelerators must be empty when acceleratorDetection\.status is
"none_detected" (e.g. /accelerators must be empty when
acceleratorDetection\.status is "none_detected"/) so the test for
HardwareProfileSchema.parse(...) correctly validates the error message.
---
Nitpick comments:
In `@apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx`:
- Around line 23-39: The current getRunMachineBadgeLabel function includes
machineInstanceId in the display fallback chain which can expose technical IDs;
remove machineInstanceId from the returned fallback sequence so the function
only uses human-friendly fields (machineDisplayLabel, machineLabel,
machineProfileLabel, machineProfileKey, machineProfileId) and leave
machineInstanceId for internal tracking elsewhere.
In `@src/cli/migrate-machine-command.ts`:
- Around line 114-145: The rollback can attempt operations for files that never
had backups or were never renamed; update the logic around stagedRewrites to
track which steps completed (e.g., mark per-rewrite flags like backupCreated and
renameCompleted while executing fs.writeFile, fs.copyFile, and fs.rename) and
use those flags in the catch and finally blocks so you only attempt
fs.copyFile(backupPath, artifactPath) for entries with backupCreated and only
attempt fs.rename or fs.unlink for entries for which the corresponding
temp/backup were actually created; also improve thrown errors and
processLogger/error messages to include the rewrite identifier and which
operation failed to make rollback outcomes clearer (referencing stagedRewrites,
fs.copyFile, fs.writeFile, fs.rename, fs.unlink).
- Around line 222-228: The runtime check for dashboardOutputDir being undefined
is redundant because MigrateMachineCommandOptionsSchema.parse() already enforces
this when parsedOptions.rebuildDashboardIndex is true; remove the if
(dashboardOutputDirInput === undefined) { throw ... } block and directly use
parsedOptions.dashboardOutputDir (or add a non-null assertion where used) so
TypeScript/Zod guarantees are relied upon instead of duplicating validation in
migrate-machine-command.ts around parsedOptions.rebuildDashboardIndex and
parsedOptions.dashboardOutputDir.
In `@src/cli/run-command.ts`:
- Around line 211-226: In run-command.ts the branches that set
options.machineInstanceId and options.machineDisplayLabel (inside the
legacy/canonical checks for legacyMachineId, canonicalMachineId,
legacyMachineLabel, canonicalMachineLabel) are redundant because
resolvedMachineId and resolvedMachineLabel are used directly when building
configInput; remove the options mutations while keeping the deprecation
logger.warn calls, and verify nothing else reads options.machineInstanceId or
options.machineDisplayLabel elsewhere (search for usages of options,
configInput, resolvedMachineId, resolvedMachineLabel) so behavior is unchanged.
In `@src/lib/machine-profile/probe.ts`:
- Around line 1-539: The file is too large and mixes OS-specific probes/parsers
with shared helpers; split responsibilities by moving platform-specific
collectors and parsers into sibling modules. Extract collectMacosAccelerators
and parseMacosAccelerators to a macos module, collectLinuxAccelerators plus
parseNvidiaAccelerators/parseRocmAccelerators/parseLspciAccelerators to a linux
module, and collectWindowsAccelerators plus parseWindowsAccelerators to a
windows module; keep shared utilities like runProbe, parsePositiveInt,
parseMemoryBytes, dedupeAccelerators, BYTES_PER_* constants and
collectObservedHardwareProfile (or have it import the collectors) in the current
probe module, update imports/exports to use the new named exports (e.g.,
parseMacosAccelerators, collectLinuxAccelerators), and adjust tests to import
the new modules so the top-level file falls under 500 lines.
In `@src/results/aggregate.ts`:
- Around line 299-301: The code repeatedly calls resolveRunMetadata(input) in
the checkpoint filter, later in the main loop and when building
summary.instances; instead, map the original runs array to a materialized array
of { input, metadata } by calling resolveRunMetadata once per run (use this
mapped array when computing matchedRuns by checkpointId, iterating the main
aggregation loop, and constructing summary.instances), then update all
references that expect the original input to use the stored metadata (e.g.,
places currently invoking resolveRunMetadata(input) inside the main loop and in
summary.instances) so normalization/migration occurs a single time per run and
the instance-count path is simplified.
In `@src/schemas/config.schema.ts`:
- Around line 96-103: Normalize deprecated aliases before returning the parsed
config: after mapping machineProfileId -> machineInstanceId and machineLabel ->
machineDisplayLabel, delete the deprecated keys (machineProfileId and
machineLabel) from the config object so the returned BenchConfig only exposes
machineInstanceId and machineDisplayLabel; apply the same removal logic in the
other parsing branch that handles the alternate mapping (the block mirroring the
code around machineInstanceId/machineDisplayLabel) and ensure this aligns with
BenchConfigObjectSchema validation boundaries.
In `@test/aggregate.test.ts`:
- Around line 197-198: Add a sibling regression test where two runs share the
same instanceId and then assert distinct-instance counters collapse to one:
create an aggregate from two records with identical instanceId and verify
aggregate.summary.instances === 1 and aggregate.machines[0]?.instanceCount === 1
(and mirror the same change for the assertions around the other case mentioned
at the second location). This ensures the counters count unique instanceId
values rather than run records.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f0c69fb0-cd54-444a-96ec-39416ad8c42c
📒 Files selected for processing (58)
- README.md
- apps/dashboard/public/results/20260305-135118-b11d08/plan.json
- apps/dashboard/public/results/20260305-135118-b11d08/run.json
- apps/dashboard/public/results/20260305-144041-4d5950/plan.json
- apps/dashboard/public/results/20260305-144041-4d5950/run.json
- apps/dashboard/public/results/20260306-135715-91b0ee/plan.json
- apps/dashboard/public/results/20260306-135715-91b0ee/run.json
- apps/dashboard/public/results/20260307-161553-c25c6a/plan.json
- apps/dashboard/public/results/20260307-161553-c25c6a/run.json
- apps/dashboard/public/results/20260316-141600-6e18e2/plan.json
- apps/dashboard/public/results/20260316-141600-6e18e2/run.json
- apps/dashboard/public/results/20260317-082000-7b3bf3/plan.json
- apps/dashboard/public/results/20260317-082000-7b3bf3/run.json
- apps/dashboard/public/results/20260317-170728-697a9a/plan.json
- apps/dashboard/public/results/20260317-170728-697a9a/run.json
- apps/dashboard/public/results/20260318-091214-a0dbda/plan.json
- apps/dashboard/public/results/20260318-091214-a0dbda/run.json
- apps/dashboard/public/results/20260319-090246-e5f8be/plan.json
- apps/dashboard/public/results/20260319-090246-e5f8be/run.json
- apps/dashboard/public/results/aggregates/chk_sha256v1_2c13ad3e58b3.json
- apps/dashboard/public/results/aggregates/chk_sha256v1_61a037f87105.json
- apps/dashboard/public/results/aggregates/latest.json
- apps/dashboard/public/results/index.json
- apps/dashboard/scripts/build-index.ts
- apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
- apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
- apps/dashboard/src/components/run-detail/run-detail-page.tsx
- apps/dashboard/src/components/run-list/checkpoint-groups.ts
- apps/dashboard/src/components/run-list/run-card.tsx
- apps/dashboard/src/components/run-list/run-list-page.tsx
- apps/dashboard/src/lib/api.ts
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- src/cli/index.ts
- src/cli/migrate-machine-command.ts
- src/cli/run-command.ts
- src/lib/hardware-profile.ts
- src/lib/machine-profile/instance-id.ts
- src/lib/machine-profile/legacy.ts
- src/lib/machine-profile/normalization.ts
- src/lib/machine-profile/probe.ts
- src/results/aggregate.ts
- src/results/reader.ts
- src/runner/plan-builder.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- src/schemas/index.ts
- src/schemas/result.schema.ts
- src/tests/workspace-smoke/test.meta.json
- test/aggregate.test.ts
- test/build-index.test.ts
- test/compare-checkpoint-guard.test.ts
- test/hardware-profile.test.ts
- test/machine-profile-migration.test.ts
- test/plan-builder.test.ts
- test/schemas.test.ts
Add trusted signal assessment to benchmark results
Actionable comments posted: 4
🧹 Nitpick comments (1)
src/harnesses/code-output-policy.ts (1)
186-193: Consider the "ok" path with raw code signals at lines 223-231.

When `hasRawCodeSignals` is true, the return at line 229 sets `taintReasons: []`, but this path uses `code: trimmed` and `method: extracted.method`. If extraction was attempted but the code was accepted via raw signal patterns instead, the `taintReasons` should potentially reflect that the output contained mixed content that happened to pass signal checks.

However, since this is a fallback path where structured extraction failed but code-like patterns exist, returning `[]` may be intentional (accepting raw output as "clean enough").
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/harnesses/code-output-policy.ts` around lines 186 - 193, Change the fallback "raw signal" acceptance branch (the hasRawCodeSignals path) so it doesn't always return an empty taintReasons array; instead compute and return appropriate taint reasons similar to the structured-extraction path. Specifically, in the branch that currently returns taintReasons: [], call getAcceptedOutputTaintReasons(trimmed, extracted) (or merge its results with any raw-signal-specific taint markers) and return that value while keeping code: trimmed and method: extracted.method so the response reflects that extraction was attempted but raw-code signals were used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/signal-assessment.ts`:
- Around line 40-50: The createTaintedSignalAssessment function incorrectly
falls back to createTrustworthySignalAssessment when the deduplicated reasons
array is empty; instead, validate and fail fast: after deduping reasons (using
the Set), if uniqueReasons.length === 0 throw a descriptive Error (e.g.
"createTaintedSignalAssessment called with empty reasons") rather than returning
a trustworthy assessment, and otherwise return the tainted assessment with the
uniqueReasons; do not call createTrustworthySignalAssessment in this function.
In `@src/schemas/common.schema.ts`:
- Around line 355-359: The schema currently allows status: "unavailable" with no
explanation; update AcceleratorDetectionSchema so that when status equals the
"unavailable" variant of AcceleratorDetectionStatusSchema the detail string is
required and non-empty (use a conditional validation via z.preprocess/z.refine
or .superRefine on AcceleratorDetectionSchema checking obj.status ===
"unavailable" => Boolean(obj.detail && obj.detail.trim().length)). Reference
AcceleratorDetectionSchema, AcceleratorDetectionStatusSchema, and the
status/detail fields when making the change.
In `@test/goose-adapter.test.ts`:
- Around line 78-118: The test for createGooseAdapter should create and own a
temporary workspace directory and ensure it's removed in a finally block instead
of relying on adapter-managed temp dirs; update the test around
createGooseAdapter() and adapter.generate(...) to (1) create a temp folder
(e.g., with fs.mkdtempSync or a test helper), (2) pass that workspace/path into
the adapter (or set adapter.workspacePath) before calling adapter.generate, and
(3) wrap the test run in try/finally that cleans up the temp workspace
(fs.rmSync or equivalent) to guarantee no artifacts remain after the test
completes.
In `@test/workspace-prompt-parity.test.ts`:
- Around line 59-81: The test "blind coding prompts declare the required public
method contracts" currently asserts only a subset of updated blind-prompt
contracts; add assertions to cover the newly required clauses by updating the
expectations on calculatorPrompt, emitterPrompt, and rateLimiterPrompt: for
calculatorPrompt add expectations for the new calculator method signatures
(e.g., "`reset()`" and any new signatures such as "`subtract(n: number)`" or
"`memoryClear()`" that your prompts require), for emitterPrompt add expectations
for "`once(event, listener)`" and "`off(event, listener)`" (in addition to the
existing `on`, `emit`, `listenerCount`), and for rateLimiterPrompt ensure any
new textual requirements (e.g., the "never go below 0" rule) are asserted;
locate these in the same test block using the existing variables
calculatorPrompt, emitterPrompt, and rateLimiterPrompt and add
expect(...).toContain(...) checks for each missing clause.
---
Nitpick comments:
In `@src/harnesses/code-output-policy.ts`:
- Around line 186-193: Change the fallback "raw signal" acceptance branch (the
hasRawCodeSignals path) so it doesn't always return an empty taintReasons array;
instead compute and return appropriate taint reasons similar to the
structured-extraction path. Specifically, in the branch that currently returns
taintReasons: [], call getAcceptedOutputTaintReasons(trimmed, extracted) (or
merge its results with any raw-signal-specific taint markers) and return that
value while keeping code: trimmed and method: extracted.method so the response
reflects that extraction was attempted but raw-code signals were used.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 40f6afa2-b307-4609-8167-80f5932558ef
📒 Files selected for processing (28)
- src/cli/compare-command.ts
- src/harnesses/code-output-policy.ts
- src/harnesses/direct-adapter.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/harness.ts
- src/harnesses/opencode-adapter.ts
- src/lib/signal-assessment.ts
- src/lib/stats-format.ts
- src/lib/stats.ts
- src/results/compare.ts
- src/runner/generation-retry.ts
- src/runner/item-executor.ts
- src/runner/item-retry.ts
- src/schemas/common.schema.ts
- src/schemas/index.ts
- src/schemas/result.schema.ts
- src/tests/calculator-stateful/prompt.blind.md
- src/tests/event-emitter/prompt.blind.md
- src/tests/file-delete-smoke/prompt.blind.md
- src/tests/rate-limiter/prompt.blind.md
- src/tests/safe-cleanup/prompt.blind.md
- test/code-output-policy.test.ts
- test/compare.test.ts
- test/goose-adapter.test.ts
- test/opencode-adapter.test.ts
- test/signal-assessment.test.ts
- test/stats.test.ts
- test/workspace-prompt-parity.test.ts
✅ Files skipped from review due to trivial changes (3)
- src/tests/file-delete-smoke/prompt.blind.md
- src/tests/safe-cleanup/prompt.blind.md
- test/signal-assessment.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/schemas/result.schema.ts
Actionable comments posted: 4
♻️ Duplicate comments (4)
README.md (1)
154-155: ⚠️ Potential issue | 🟡 Minor

Use an instance-scoped sample ID here. `mac-mini-m4-pro` still reads like a hardware/profile identifier, so the example encourages reusing one ID across identical hosts and collapsing the new instance counts.

Doc tweak

```diff
- bun pb --machine-instance-id mac-mini-m4-pro --machine-display-label "Austin Mac Mini"
+ bun pb --machine-instance-id austin-office-mac-mini --machine-display-label "Austin Office Mac Mini"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 154 - 155, The README example uses a hardware-like sample ID ("mac-mini-m4-pro") that encourages reuse across distinct instances; update the example for the command shown (the bun pb invocation using flags --machine-instance-id and --machine-display-label) to use an instance-scoped sample ID (e.g., a UUID or short unique token like "instance-<unique-id>" or "inst-abc123") and keep the --machine-display-label as a human-friendly name; replace only the machine-instance-id value in the example so readers are guided to provide per-instance unique IDs rather than profile/hardware names.

src/schemas/config.schema.ts (1)
20-83: ⚠️ Potential issue | 🔴 Critical

Keep the `z.preprocess()` path normalization-only.

This migration chain still throws plain `Error`s for blank strings, type mismatches, alias conflicts, and malformed `modelAliases`. Under `z.preprocess()`, those exceptions bypass Zod issue collection, so `BenchConfigSchema.safeParse()` callers get a thrown exception instead of a structured validation failure. Move the validation into `BenchConfigObjectSchema.superRefine()` (or other schema issues) and let the preprocessor only rewrite deprecated keys.

As per coding guidelines, "Use Zod for schema validation at all boundaries (CLI args, env, JSON files, HTTP responses)".

Read-only verification

```bash
#!/bin/bash
# Inspect whether downstream callers rely on BenchConfigSchema.parse/safeParse
# for user-facing config validation paths.
rg -nP -C3 '\bBenchConfigSchema\.(safeParse|parse)\s*\('
rg -nP -C3 '\bz\.preprocess\(\s*migrateBenchConfigAliases\b' src/schemas/config.schema.ts
```

Also applies to: 92-147, 235-239
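A minimal sketch of the intended split between preprocessing and validation, assuming Zod; only two of the machine fields are shown and the issue messages are illustrative:

```typescript
import { z } from "zod";

// The preprocessor only rewrites deprecated keys (no throwing); superRefine
// reports problems as issues so safeParse() returns a structured ZodError.
const BenchConfigSketch = z.preprocess(
  (raw) => {
    if (typeof raw !== "object" || raw === null) return raw;
    const config = { ...(raw as Record<string, unknown>) };
    if (config.machineInstanceId === undefined && config.machineProfileId !== undefined) {
      config.machineInstanceId = config.machineProfileId;
      delete config.machineProfileId;
    }
    return config;
  },
  z
    .object({
      machineInstanceId: z.string().optional(),
      machineProfileId: z.string().optional(), // deprecated alias
    })
    .superRefine((config, ctx) => {
      if (config.machineInstanceId !== undefined && config.machineInstanceId.trim() === "") {
        ctx.addIssue({
          code: z.ZodIssueCode.custom,
          path: ["machineInstanceId"],
          message: "machineInstanceId must not be blank",
        });
      }
      // Both keys survive preprocessing only when the caller supplied both.
      if (config.machineInstanceId !== undefined && config.machineProfileId !== undefined) {
        ctx.addIssue({
          code: z.ZodIssueCode.custom,
          path: ["machineProfileId"],
          message: "machineProfileId conflicts with machineInstanceId; keep only the canonical field",
        });
      }
    }),
);
```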
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/schemas/config.schema.ts` around lines 20 - 83, The preprocessor migrateBenchConfigAliases (and helpers readNonEmptyString, validateCanonicalMachineFieldType, migrateLegacyModelAliases/ModelAliasMapSchema) must only normalize keys and rewrite deprecated shapes without throwing; move all runtime checks into BenchConfigObjectSchema.superRefine so zod collects issues rather than allowing exceptions to escape safeParse/parse. Concretely: make the z.preprocess path return normalized data or undefined and remove throws from readNonEmptyString/validateCanonicalMachineFieldType/migrateLegacyModelAliases (have them return safe values or parse results), then implement validations inside BenchConfigObjectSchema.superRefine to call those helpers and add ctx.addIssue for blank strings, type mismatches, alias conflicts, and malformed modelAliases (convert ModelAliasMapSchema.safeParse errors into zod issues there) so callers receive structured zod validation failures.

test/schemas.test.ts (1)
309-317: ⚠️ Potential issue | 🟡 Minor

Fix the `none_detected` error regex.

`HardwareProfileSchema` emits `accelerators must be empty when acceleratorDetection.status is "none_detected"`, so this matcher will not hit the actual Zod message.

🧪 Suggested fix

```diff
- ).toThrow(/must be empty when acceleratorDetection\.status is \\\"none_detected\\\"/);
+ ).toThrow(
+   /accelerators must be empty when acceleratorDetection\.status is "none_detected"/,
+ );
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/schemas.test.ts` around lines 309 - 317, The test's expected error regex doesn't match the Zod message; update the assertion in the "should reject none_detected accelerators with a non-empty accelerator list" test so the toThrow matcher expects the actual message emitted by HardwareProfileSchema.parse — e.g., change the regex to match 'accelerators must be empty when acceleratorDetection.status is "none_detected"' (refer to the failing test block and the HardwareProfileSchema.parse call to locate the assertion).

apps/dashboard/scripts/build-index.ts (1)
490-507: ⚠️ Potential issue | 🔴 Critical

Reject overlapping source and output trees before reading or deleting.

`writePublishedRunBundles()` removes `outputResultsDir` unconditionally. If the output directory equals the source directory, lives under it, or contains it, this can either delete source runs or make the scan treat published artifacts as inputs.

🛡️ Suggested fix

```diff
-import { basename, join, resolve } from "node:path";
+import { basename, join, relative, resolve } from "node:path";
 ...
 export async function buildDashboardIndexArtifacts(
   options: BuildDashboardIndexArtifactsOptions,
 ): Promise<BuildDashboardIndexArtifactsResult> {
   const sourceResultsDir = resolve(options.sourceResultsDir);
   const outputResultsDir = resolve(options.outputResultsDir);
   const projectRoot = resolve(options.projectRoot ?? DEFAULT_PROJECT_ROOT);
+  const isSameOrNested = (from: string, to: string): boolean => {
+    const rel = relative(from, to);
+    return rel === "" || (rel !== "." && !rel.startsWith(".."));
+  };
+  if (
+    isSameOrNested(sourceResultsDir, outputResultsDir) ||
+    isSameOrNested(outputResultsDir, sourceResultsDir)
+  ) {
+    throw new Error(
+      "sourceResultsDir and outputResultsDir must not overlap",
+    );
+  }
   const indexPath = join(outputResultsDir, "index.json");
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/scripts/build-index.ts` around lines 490 - 507, Detect and reject overlapping source and output directories before any IO by comparing resolved sourceResultsDir and outputResultsDir (and aggregatesDir) for equality or containment; add a pre-check (e.g., isSubpath or canonical resolved path + separator prefix check) at the top of the routine before the entries loop and before calling writePublishedRunBundles to throw/exit with a clear error if outputResultsDir === sourceResultsDir or one path is inside the other, so you never read source files when output will be deleted or write into the scanned tree.
🧹 Nitpick comments (3)
test/machine-profile-migration.test.ts (1)
117-118: Cover the happy-path `parseKnown*Payload()` flow too.

The supported-case assertions stop at the raw schemas, and `parseKnownPlanPayload()` / `parseKnownRunPayload()` are only exercised on the unsupported-version branch. Adding one supported plan/run through those helpers would catch parser-wiring regressions that this suite currently misses.

Also applies to: 321-335
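A shape sketch of the missing supported-version coverage, assuming `bun:test`; the fixture names and return shapes are placeholders for what this suite already builds:

```typescript
import { expect, it } from "bun:test";

// Hypothetical stand-ins for the suite's supported-version fixtures/helpers.
declare function parseKnownPlanPayload(payload: unknown): unknown;
declare function parseKnownRunPayload(payload: unknown): unknown;
declare const supportedPlanFixture: unknown;
declare const supportedRunFixture: unknown;

it("round-trips supported payloads through the known-payload helpers", () => {
  expect(parseKnownPlanPayload(supportedPlanFixture)).toBeDefined();
  expect(parseKnownRunPayload(supportedRunFixture)).toBeDefined();
});
```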
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/machine-profile-migration.test.ts` around lines 117 - 118, The test only asserts raw schema parsing but never exercises the happy-path helpers; add assertions that call parseKnownPlanPayload(plan) and parseKnownRunPayload(run) (or their equivalents) with a supported-version plan/run (use the same sample plan and run used to produce parsedPlan/parsedRun) and assert they return the expected known payload objects (not the unsupported branch). This will exercise the parser wiring for the supported-case: locate where RunPlanSchema.parse(plan) and RunResultSchema.parse(run) are used (parsedPlan/parsedRun) and after those lines call parseKnownPlanPayload and parseKnownRunPayload and add assertions that the results are defined and match the expected shape/fields for the supported version.

src/lib/model-profile/normalization.ts (2)
105-106: Deduplicate tuning token definitions

The same tuning token set is hardcoded in two places. Extract one shared `const` to avoid future drift between detection and family derivation.

♻️ Proposed refactor

```diff
+const TUNING_TOKENS = [
+  "instruct",
+  "chat",
+  "coder",
+  "vision",
+  "reasoning",
+] as const;
+
 /** Detects common tuning suffixes that should stay in the canonical profile. */
 function detectTuning(value: string): string | undefined {
   const normalized = value.toLowerCase();
-  const tuningPatterns = ["instruct", "chat", "coder", "vision", "reasoning"];
-  return tuningPatterns.find((pattern) => normalized.includes(pattern));
+  return TUNING_TOKENS.find((pattern) => normalized.includes(pattern));
 }
@@
-  if (["instruct", "chat", "coder", "vision", "reasoning"].includes(token)) {
+  if ((TUNING_TOKENS as readonly string[]).includes(token)) {
     break;
   }
```

As per coding guidelines, "Favor iteration and modularization over code duplication".
Also applies to: 151-153
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/model-profile/normalization.ts` around lines 105 - 106, Extract the hardcoded array into a single shared top-level constant (e.g. TUNING_PATTERNS = ["instruct","chat","coder","vision","reasoning"]) and replace the local tuningPatterns declarations with references to TUNING_PATTERNS; update both the detection site that uses return tuningPatterns.find((pattern) => normalized.includes(pattern)) and the other occurrence around lines 151-153 (the family-derivation logic) to use the shared constant so the token set is defined only once.
94-99: Normalize fallback `format` casing consistently

`detectFormat` returns mixed casing (`GGUF`, `MLX`, `safetensors`). Consider one canonical style to avoid downstream grouping mismatches.
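One way to pin the casing (lowercase chosen arbitrarily for illustration; the token list mirrors the formats named above):

```typescript
// Canonical, all-lowercase format tokens so grouping keys never diverge.
type ModelFormat = "gguf" | "mlx" | "safetensors";

function detectFormat(value: string): ModelFormat | undefined {
  const normalized = value.toLowerCase();
  if (normalized.includes("gguf")) return "gguf";
  if (normalized.includes("mlx")) return "mlx";
  if (normalized.includes("safetensors")) return "safetensors";
  return undefined;
}
```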
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/lib/model-profile/normalization.ts` around lines 94 - 99, The detectFormat function returns inconsistent casing ("GGUF", "MLX", "safetensors") which can break downstream grouping; update detectFormat to return a single canonical casing (e.g., all-lowercase like "gguf", "mlx", "safetensors" or all-uppercase) for every branch, and adjust any consumers if they expect the old mixed values; specifically change the return values inside detectFormat to the chosen canonical strings so format comparisons and grouping are consistent across the codebase.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/model-profile/normalization.ts`:
- Around line 313-323: The current conditional silently falls back to
buildFallbackModelProfile when only one of args.profileKey or args.profile is
provided, hiding configuration/programmer errors; instead validate presence of
both in the normalization logic and throw a clear error when one is missing
(e.g., check args.profileKey && args.profile, else if only one is present throw
a descriptive Error referencing profileKey/profile and the calling context), so
only call buildConfiguredModelProfile when both are present and otherwise fail
fast rather than calling buildFallbackModelProfile.
- Around line 37-47: Add full TSDoc blocks to every exported function in this
module—specifically humanizeSlug, normalizeConfiguredVariant,
buildConfiguredModelProfile, and buildFallbackModelProfile—replacing the short
comments with standard TSDoc that describes the function purpose, documents each
parameter (`@param`), the return value (`@returns`), and any errors that can be
thrown (`@throws`). Ensure the TSDoc sits immediately above each exported function
declaration and uses clear, concise descriptions for input types and expected
outputs so IDEs and generated docs pick up the information.
In `@src/results/compare.ts`:
- Around line 27-40: The compare key in buildCompareKey currently omits the
runtime, causing different runtimes (e.g., ollama vs vllm) for the same model
profile to collapse; update the function signature and key construction in
buildCompareKey to include item.runtime (add runtime to the item shape if
missing) and append it to the returned string (e.g., include |${item.runtime})
so keys are unique per runtime while still using getModelIdentityKey for the
model identity portion.
In `@src/runner/plan-builder.ts`:
- Around line 169-177: If resolveModelSelection(modelSpec, runtimeName,
modelProfiles) returns undefined for a modelSpec that is a configured profile
key, do not silently skip it; instead detect that modelSpec exists in
modelProfiles (or otherwise was explicitly requested) and throw a clear error
indicating the configured profile has no variant for the given runtimeName.
Update the loop around resolveModelSelection / discovered checks to: (1) check
whether modelSpec refers to a configured profile (e.g., lookup in modelProfiles
or the config source), (2) if so and resolvedSelection is undefined throw a
descriptive exception mentioning modelSpec and runtimeName, and (3) otherwise
continue the existing logic that checks
discovered.includes(resolvedSelection.runtimeModelName). This ensures
resolveModelSelection, modelSpec, runtimeName, modelProfiles, discovered, and
runtimeModelName are the focal symbols for the fix.
---
Duplicate comments:
In `@apps/dashboard/scripts/build-index.ts`:
- Around line 490-507: Detect and reject overlapping source and output
directories before any IO by comparing resolved sourceResultsDir and
outputResultsDir (and aggregatesDir) for equality or containment; add a
pre-check (e.g., isSubpath or canonical resolved path + separator prefix check)
at the top of the routine before the entries loop and before calling
writePublishedRunBundles to throw/exit with a clear error if outputResultsDir
=== sourceResultsDir or one path is inside the other, so you never read source
files when output will be deleted or write into the scanned tree.
In `@README.md`:
- Around line 154-155: The README example uses a hardware-like sample ID
("mac-mini-m4-pro") that encourages reuse across distinct instances; update the
example for the command shown (the bun pb invocation using flags
--machine-instance-id and --machine-display-label) to use an instance-scoped
sample ID (e.g., a UUID or short unique token like "instance-<unique-id>" or
"inst-abc123") and keep the --machine-display-label as a human-friendly name;
replace only the machine-instance-id value in the example so readers are guided
to provide per-instance unique IDs rather than profile/hardware names.
In `@src/schemas/config.schema.ts`:
- Around line 20-83: The preprocessor migrateBenchConfigAliases (and helpers
readNonEmptyString, validateCanonicalMachineFieldType,
migrateLegacyModelAliases/ModelAliasMapSchema) must only normalize keys and
rewrite deprecated shapes without throwing; move all runtime checks into
BenchConfigObjectSchema.superRefine so zod collects issues rather than allowing
exceptions to escape safeParse/parse. Concretely: make the z.preprocess path
return normalized data or undefined and remove throws from
readNonEmptyString/validateCanonicalMachineFieldType/migrateLegacyModelAliases
(have them return safe values or parse results), then implement validations
inside BenchConfigObjectSchema.superRefine to call those helpers and add
ctx.addIssue for blank strings, type mismatches, alias conflicts, and malformed
modelAliases (convert ModelAliasMapSchema.safeParse errors into zod issues
there) so callers receive structured zod validation failures.
In `@test/schemas.test.ts`:
- Around line 309-317: The test's expected error regex doesn't match the Zod
message; update the assertion in the "should reject none_detected accelerators
with a non-empty accelerator list" test so the toThrow matcher expects the
actual message emitted by HardwareProfileSchema.parse — e.g., change the regex
to match 'accelerators must be empty when acceleratorDetection.status is
"none_detected"' (refer to the failing test block and the
HardwareProfileSchema.parse call to locate the assertion).
---
Nitpick comments:
In `@src/lib/model-profile/normalization.ts`:
- Around line 105-106: Extract the hardcoded array into a single shared
top-level constant (e.g. TUNING_PATTERNS =
["instruct","chat","coder","vision","reasoning"]) and replace the local
tuningPatterns declarations with references to TUNING_PATTERNS; update both the
detection site that uses return tuningPatterns.find((pattern) =>
normalized.includes(pattern)) and the other occurrence around lines 151-153 (the
family-derivation logic) to use the shared constant so the token set is defined
only once.
- Around line 94-99: The detectFormat function returns inconsistent casing
("GGUF", "MLX", "safetensors") which can break downstream grouping; update
detectFormat to return a single canonical casing (e.g., all-lowercase like
"gguf", "mlx", "safetensors" or all-uppercase) for every branch, and adjust any
consumers if they expect the old mixed values; specifically change the return
values inside detectFormat to the chosen canonical strings so format comparisons
and grouping are consistent across the codebase.
In `@test/machine-profile-migration.test.ts`:
- Around line 117-118: The test only asserts raw schema parsing but never
exercises the happy-path helpers; add assertions that call
parseKnownPlanPayload(plan) and parseKnownRunPayload(run) (or their equivalents)
with a supported-version plan/run (use the same sample plan and run used to
produce parsedPlan/parsedRun) and assert they return the expected known payload
objects (not the unsupported branch). This will exercise the parser wiring for
the supported-case: locate where RunPlanSchema.parse(plan) and
RunResultSchema.parse(run) are used (parsedPlan/parsedRun) and after those lines
call parseKnownPlanPayload and parseKnownRunPayload and add assertions that the
results are defined and match the expected shape/fields for the supported
version.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 117bc571-762b-4ff5-9000-aa5dac10fa91
📒 Files selected for processing (23)
- README.md
- apps/dashboard/scripts/build-index.ts
- llm/implementation/multi-runtime-mvp-implementation.md
- models.example.json
- models.json
- src/cli/run-command.ts
- src/lib/model-profile/normalization.ts
- src/lib/model-profile/registry.ts
- src/lib/model-profiles.ts
- src/results/compare.ts
- src/runner/index.ts
- src/runner/item-executor.ts
- src/runner/plan-builder.ts
- src/schemas/config.schema.ts
- src/schemas/index.ts
- src/schemas/model-profile.schema.ts
- src/schemas/plan.schema.ts
- src/schemas/result.schema.ts
- test/build-index.test.ts
- test/compare.test.ts
- test/machine-profile-migration.test.ts
- test/plan-builder.test.ts
- test/schemas.test.ts
✅ Files skipped from review due to trivial changes (5)
- llm/implementation/multi-runtime-mvp-implementation.md
- src/lib/model-profiles.ts
- test/compare.test.ts
- src/schemas/model-profile.schema.ts
- src/cli/run-command.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/runner/item-executor.ts
Actionable comments posted: 3
🧹 Nitpick comments (3)
src/lib/machine-profile/probe-utils.ts (1)
103-108: Redundant conditional spread on initial entry.

The `...accelerator` spread already copies `count` when present. The conditional spread on line 106 has no effect.

♻️ Simplified initial entry

```diff
 if (!current) {
-  deduped.set(key, {
-    ...accelerator,
-    ...(accelerator.count === undefined ? {} : { count: accelerator.count }),
-  });
+  deduped.set(key, { ...accelerator });
   continue;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-utils.ts` around lines 103 - 108, The initial-entry branch inside the dedup loop redundantly re-spreads accelerator.count; in the block that checks "if (!current)" simplify the object passed to deduped.set by removing the conditional spread and just use the accelerator (or a shallow copy like {...accelerator}) since the leading "...accelerator" already includes count; update the deduped.set(key, ...) call accordingly (refer to deduped.set, accelerator, current, key in probe-utils.ts).

src/schemas/config.schema.ts (1)
243-257: Unreachable type validation in superRefine.

The `validateCanonicalMachineFieldType` check runs inside `superRefine`, but by that point Zod's field-level validators (`z.string().trim().min(1).optional()`) have already rejected non-string values. This code block will never execute because non-string inputs fail schema validation before `superRefine` runs.

Either remove this dead code or move the type validation into the preprocessor where it can intercept invalid types before the schema validates them.
♻️ Option 1: Remove unreachable code
```diff
   if (config.gooseWorkspaceRetryMaxTurns < config.gooseWorkspaceMaxTurns) {
     context.addIssue({
       code: z.ZodIssueCode.custom,
       path: ["gooseWorkspaceRetryMaxTurns"],
       message:
         "gooseWorkspaceRetryMaxTurns must be greater than or equal to gooseWorkspaceMaxTurns",
     });
   }
-
-  const rawConfig = config as unknown as Record<string, unknown>;
-  for (const fieldName of [
-    "machineInstanceId",
-    "machineDisplayLabel",
-    "machineProfileId",
-    "machineLabel",
-  ] as const) {
-    if (validateCanonicalMachineFieldType(rawConfig, fieldName)) {
-      context.addIssue({
-        code: z.ZodIssueCode.custom,
-        path: [fieldName],
-        message: `Bench config ${fieldName} must be a string when provided`,
-      });
-    }
-  }
 });
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/schemas/config.schema.ts` around lines 243 - 257, The type-check inside superRefine using validateCanonicalMachineFieldType is unreachable because Zod's field-level validators (e.g., z.string().trim().min(1).optional()) already reject non-string values before superRefine runs; remove the entire loop that iterates over ["machineInstanceId","machineDisplayLabel","machineProfileId","machineLabel"] (the rawConfig/validateCanonicalMachineFieldType/context.addIssue block) from superRefine, or alternatively move that type-check into the preprocessor phase where rawConfig is available so it can catch non-string types before schema validation runs.

src/lib/machine-profile/probe-linux.ts (1)
30-33: Remove redundant string parsing.

The `parsePositiveInt(memoryMiB)` call already parses and validates the string, but its result is discarded. The string is then parsed again with `Number.parseInt(memoryMiB, 10)`. Store and reuse the validated result instead:

♻️ Proposed fix

```diff
-  const memoryBytes = parsePositiveInt(memoryMiB)
-    ? Number.parseInt(memoryMiB, 10) * BYTES_PER_MIB
-    : undefined;
+  const parsedMiB = parsePositiveInt(memoryMiB);
+  const memoryBytes = parsedMiB !== undefined
+    ? parsedMiB * BYTES_PER_MIB
+    : undefined;
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-linux.ts` around lines 30 - 33, The code currently calls parsePositiveInt(memoryMiB) to validate/parse the memory string but discards its return value and re-parses using Number.parseInt; instead call parsePositiveInt(memoryMiB) once, store its numeric result (e.g., const memoryValue = parsePositiveInt(memoryMiB)), then compute memoryBytes = memoryValue ? memoryValue * BYTES_PER_MIB : undefined; update the block around modelRaw/memoryMiB parsing in probe-linux.ts to reuse this parsed result and remove the redundant Number.parseInt call.

🤖 Prompt for all review comments with AI agents
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/lib/machine-profile/probe-macos.ts`:
- Around line 20-33: The function parseMacosAccelerators assumes parsed is a
non-null object before checking SPDisplaysDataType, which allows
JSON.parse("null") or scalars to cause a TypeError; update the validation to
first assert that parsed is an object and not null (e.g. typeof parsed ===
"object" && parsed !== null) and only then check that (parsed as {
SPDisplaysDataType?: unknown }).SPDisplaysDataType is an Array, so that reading
SPDisplaysDataType and assigning to displays is safe and yields the intended
deterministic probe error when the shape is wrong; reference
parseMacosAccelerators, the parsed variable, SPDisplaysDataType and displays
when making the change.
In `@src/results/compare.ts`:
- Around line 317-323: trustedMetricsAvailable currently uses
hasCompleteSignalAssessments(runAItems) &&
hasCompleteSignalAssessments(runBItems) which lets any unmatched row without
signalAssessment disable trusted deltas for matched rows; instead compute two
booleans: one (e.g., matchedMetricsComplete) that runs
hasCompleteSignalAssessments over only the items present in matched (extract
itemA/itemB arrays from matched) and use that to gate trustedScoringDelta and
trustedFrontierEvalDelta, and keep a separate wholeRun completeness/taint count
(using runAItems/runBItems) only where needed for whole-run taint reporting;
update the code paths that set trustedMatched and the trusted delta calculations
to rely on matchedMetricsComplete rather than the current
trustedMetricsAvailable.
In `@src/schemas/common.schema.ts`:
- Around line 393-417: The .superRefine on the hardware schema currently
enforces rules for acceleratorDetection.status "detected" and "none_detected"
but misses the "unavailable" case; update the same .superRefine callback (the
hardware parameter with hardware.acceleratorDetection and hardware.accelerators)
to add a branch that if hardware.acceleratorDetection.status === "unavailable"
and hardware.accelerators.length > 0 then call context.addIssue (use
z.ZodIssueCode.custom, path ["accelerators"]) with a clear message like
"accelerators must be empty when acceleratorDetection.status is \"unavailable\""
so the schema rejects any non-empty accelerators array when detection is
unavailable.
---
Nitpick comments:
In `@src/lib/machine-profile/probe-linux.ts`:
- Around line 30-33: The code currently calls parsePositiveInt(memoryMiB) to
validate/parse the memory string but discards its return value and re-parses
using Number.parseInt; instead call parsePositiveInt(memoryMiB) once, store its
numeric result (e.g., const memoryValue = parsePositiveInt(memoryMiB)), then
compute memoryBytes = memoryValue ? memoryValue * BYTES_PER_MIB : undefined;
update the block around modelRaw/memoryMiB parsing in probe-linux.ts to reuse
this parsed result and remove the redundant Number.parseInt call.
In `@src/lib/machine-profile/probe-utils.ts`:
- Around line 103-108: The initial-entry branch inside the dedup loop
redundantly re-spreads accelerator.count; in the block that checks "if
(!current)" simplify the object passed to deduped.set by removing the
conditional spread and just use the accelerator (or a shallow copy like
{...accelerator}) since the leading "...accelerator" already includes count;
update the deduped.set(key, ...) call accordingly (refer to deduped.set,
accelerator, current, key in probe-utils.ts).
In `@src/schemas/config.schema.ts`:
- Around line 243-257: The type-check inside superRefine using
validateCanonicalMachineFieldType is unreachable because Zod's field-level
validators (e.g., z.string().trim().min(1).optional()) already reject non-string
values before superRefine runs; remove the entire loop that iterates over
["machineInstanceId","machineDisplayLabel","machineProfileId","machineLabel"]
(the rawConfig/validateCanonicalMachineFieldType/context.addIssue block) from
superRefine, or alternatively move that type-check into the preprocessor phase
where rawConfig is available so it can catch non-string types before schema
validation runs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f43d3a0f-40d4-4924-863a-09a86fdf5b28
📒 Files selected for processing (33)
- README.md
- apps/dashboard/scripts/build-index.ts
- apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- src/cli/migrate-machine-command.ts
- src/cli/run-command.ts
- src/harnesses/code-output-policy.ts
- src/lib/machine-profile/legacy.ts
- src/lib/machine-profile/normalization.ts
- src/lib/machine-profile/probe-linux.ts
- src/lib/machine-profile/probe-macos.ts
- src/lib/machine-profile/probe-utils.ts
- src/lib/machine-profile/probe-windows.ts
- src/lib/machine-profile/probe.ts
- src/lib/model-profile/normalization.ts
- src/lib/signal-assessment.ts
- src/results/aggregate.ts
- src/results/compare.ts
- src/runner/plan-builder.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- test/aggregate.test.ts
- test/build-index.test.ts
- test/code-output-policy.test.ts
- test/compare.test.ts
- test/goose-adapter.test.ts
- test/machine-profile-migration.test.ts
- test/machine-profile-probe.test.ts
- test/plan-builder.test.ts
- test/schemas.test.ts
- test/signal-assessment.test.ts
- test/workspace-prompt-parity.test.ts
✅ Files skipped from review due to trivial changes (4)
- test/goose-adapter.test.ts
- test/signal-assessment.test.ts
- README.md
- test/machine-profile-migration.test.ts
🚧 Files skipped from review as they are similar to previous changes (8)
- apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
- test/compare.test.ts
- test/aggregate.test.ts
- test/code-output-policy.test.ts
- src/cli/migrate-machine-command.ts
- src/runner/plan-builder.ts
- test/plan-builder.test.ts
- apps/dashboard/src/lib/schemas.ts
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
apps/dashboard/src/lib/schemas.ts (1)
436-443: ⚠️ Potential issue | 🟠 Major
Don't hard-reject persisted v1 aggregates at parse time.
`apps/dashboard/src/lib/api.ts:241` still parses `aggregates/latest.json` with `LeaderboardAggregateSchema.parse(data)`. Tightening this schema to `z.literal(2)` means any existing `schemaVersion: 1` aggregate will now fail the leaderboard until artifacts are rebuilt or migrated. Please add an explicit v1→v2 parse path here, or handle the parse failure with the same rebuild guidance you already give for a missing aggregate.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/schemas.ts` around lines 436 - 443, The current LeaderboardAggregateSchema hard-fails on schemaVersion: 1; update the parsing so aggregates with schemaVersion: 1 are accepted and transformed to the v2 shape instead of rejecting at parse time: modify LeaderboardAggregateSchema (or create a wrapper parse function used by parse(data) in api code) to accept both v1 and v2 (e.g., a union or a preprocess step) and implement a conversion path that maps v1 fields into the v2 structure (keep symbol names: LeaderboardAggregateSchema, LeaderboardAggregateSummarySchema, LeaderboardMachineSummarySchema, LeaderboardAggregatedItemSchema) so existing persisted aggregates parse successfully, and if parsing still fails fall back to the same "missing aggregate / rebuild" guidance currently used by the API.
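If backward parsing is the route taken, one option is a thin compatibility wrapper around the strict v2 schema. This is only a sketch: the stand-in `LeaderboardAggregateV2Schema` below is not the dashboard's real schema, and the actual v1→v2 field mapping belongs wherever the real shapes are defined.

```ts
import { z } from "zod";

// Stand-in for the real v2 schema; only the version literal matters for this sketch.
const LeaderboardAggregateV2Schema = z
  .object({ schemaVersion: z.literal(2) })
  .passthrough();

// Accept persisted v1 payloads by upgrading them before strict v2 validation runs.
const LeaderboardAggregateCompatSchema = z.preprocess((data) => {
  if (
    typeof data === "object" &&
    data !== null &&
    (data as { schemaVersion?: unknown }).schemaVersion === 1
  ) {
    // The real migration should map whichever v1 fields actually changed;
    // this sketch only bumps the version so the v2 schema can validate the rest.
    return { ...(data as Record<string, unknown>), schemaVersion: 2 };
  }
  return data;
}, LeaderboardAggregateV2Schema);

// Usage at the boundary would mirror the existing parse call:
// const aggregate = LeaderboardAggregateCompatSchema.parse(data);
```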
🧹 Nitpick comments (1)
src/lib/machine-profile/probe-utils.ts (1)
107-115: Consider simplifying the field backfill logic for readability.
The nested spread-ternary pattern is correct but dense. A helper or explicit assignments could improve clarity.
♻️ Optional refactor for readability
```diff
-      deduped.set(key, {
-        ...current,
-        ...(current.vendor ? {} : accelerator.vendor ? { vendor: accelerator.vendor } : {}),
-        ...(current.memoryBytes ? {} : accelerator.memoryBytes ? { memoryBytes: accelerator.memoryBytes } : {}),
-        ...(current.backend ? {} : accelerator.backend ? { backend: accelerator.backend } : {}),
-        ...(current.kind !== "unknown" ? {} : accelerator.kind !== "unknown" ? { kind: accelerator.kind } : {}),
-        count: (current.count ?? 1) + (accelerator.count ?? 1),
-      });
+      const merged: ObservedAccelerator = {
+        ...current,
+        vendor: current.vendor ?? accelerator.vendor,
+        memoryBytes: current.memoryBytes ?? accelerator.memoryBytes,
+        backend: current.backend ?? accelerator.backend,
+        kind: current.kind !== "unknown" ? current.kind : accelerator.kind,
+        count: (current.count ?? 1) + (accelerator.count ?? 1),
+      };
+      deduped.set(key, merged);
```
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-utils.ts` around lines 107 - 115, The nested spread-ternary in the deduped.set call (using variables key, current, accelerator) is correct but hard to read; replace it with clearer backfill logic by building an explicit object before calling deduped.set — e.g., create a newRecord from current, then conditionally assign vendor, memoryBytes, backend, and kind only when missing on current and present on accelerator, and finally set count = (current.count ?? 1) + (accelerator.count ?? 1), then call deduped.set(key, newRecord); this keeps the same behavior while making the intent in the deduped.set block much easier to read and maintain.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/results/compare.ts`:
- Around line 281-358: The deltas (passRateDelta / avgScoreDelta) are computed
from aggregates that drop missing automatedScore/frontierEval independently,
causing comparisons across disjoint row sets; fix by filtering matched (and
trustedMatched) to only include rows where both itemA and itemB have the metric
present before calling computeAggregateScoring or computeAggregateFrontier (use
predicates on item.itemA.automatedScore and item.itemB.automatedScore for
scoring, and item.itemA.frontierEval and item.itemB.frontierEval for frontier)
so passRateDelta and avgScoreDelta compare the same rows, and separately compute
and expose metric-availability loss (counts or coverage delta) rather than
folding it into the score deltas (update code paths that set scoringDelta,
trustedScoringDelta, frontierEvalDelta, and trustedFrontierEvalDelta
accordingly).
In `@src/schemas/config.schema.ts`:
- Around line 68-97: The backfill checks use hasOwnProperty, so explicit
properties set to undefined (e.g. {machineInstanceId: undefined}) are treated as
present and skip migration; change those conditions to treat undefined as
"absent" by relying on the parsed/normalized values and direct undefined checks:
use machineInstanceId === undefined (instead of hasMachineInstanceId) when
deciding to copy machineProfileId into config.machineInstanceId, use
machineDisplayLabel === undefined (instead of hasMachineDisplayLabel) when
copying machineLabel into config.machineDisplayLabel, and replace the
hasOwnProperty check for config.modelProfiles with a direct check for
config.modelProfiles === undefined before migrating modelAliases via
migrateLegacyModelAliases; reference symbols: machineInstanceId,
machineProfileId, machineDisplayLabel, machineLabel, modelProfiles,
modelAliases, migrateLegacyModelAliases.
---
Outside diff comments:
In `@apps/dashboard/src/lib/schemas.ts`:
- Around line 436-443: The current LeaderboardAggregateSchema hard-fails on
schemaVersion: 1; update the parsing so aggregates with schemaVersion: 1 are
accepted and transformed to the v2 shape instead of rejecting at parse time:
modify LeaderboardAggregateSchema (or create a wrapper parse function used by
parse(data) in api code) to accept both v1 and v2 (e.g., a union or a preprocess
step) and implement a conversion path that maps v1 fields into the v2 structure
(keep symbol names: LeaderboardAggregateSchema,
LeaderboardAggregateSummarySchema, LeaderboardMachineSummarySchema,
LeaderboardAggregatedItemSchema) so existing persisted aggregates parse
successfully, and if parsing still fails fall back to the same "missing
aggregate / rebuild" guidance currently used by the API.
---
Nitpick comments:
In `@src/lib/machine-profile/probe-utils.ts`:
- Around line 107-115: The nested spread-ternary in the deduped.set call (using
variables key, current, accelerator) is correct but hard to read; replace it
with clearer backfill logic by building an explicit object before calling
deduped.set — e.g., create a newRecord from current, then conditionally assign
vendor, memoryBytes, backend, and kind only when missing on current and present
on accelerator, and finally set count = (current.count ?? 1) +
(accelerator.count ?? 1), then call deduped.set(key, newRecord); this keeps the
same behavior while making the intent in the deduped.set block much easier to
read and maintain.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a036b871-56d8-4b93-a257-5190c8851628
📒 Files selected for processing (10)
- apps/dashboard/src/lib/schemas.ts
- src/lib/machine-profile/probe-linux.ts
- src/lib/machine-profile/probe-macos.ts
- src/lib/machine-profile/probe-utils.ts
- src/results/compare.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- test/compare.test.ts
- test/machine-profile-probe.test.ts
- test/schemas.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- test/machine-profile-probe.test.ts
- src/lib/machine-profile/probe-macos.ts
- src/lib/machine-profile/probe-linux.ts
@CodeRabbit full review
✅ Actions performed
Full review triggered.
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
apps/dashboard/src/lib/schemas.ts (1)
264-282: ⚠️ Potential issue | 🟠 Major
Mirror the new item-level metadata in the dashboard boundary schemas.
These local item schemas still omit `modelProfile`, and result items also omit `signalAssessment`. That means the dashboard boundary cannot carry forward the canonical model identity and trusted/tainted metadata now present in run artifacts. Add local `ModelProfileSchema` and `SignalAssessmentSchema` mirrors and attach them here.
As per coding guidelines, "Validate at boundaries with Zod and pass typed data inward".
🔧 Proposed fix
```diff
 const MatrixItemResultSchema = z.object({
   id: z.string(),
   runtime: z.string(),
   model: z.string(),
   modelAlias: z.string().optional(),
+  modelProfile: ModelProfileSchema.optional(),
   harness: z.string(),
   test: z.string(),
   passType: PassTypeSchema,
   status: ItemStatusSchema,
   startedAt: z.string().optional(),
   completedAt: z.string().optional(),
+  signalAssessment: SignalAssessmentSchema.optional(),
   generation: GenerationResultSchema.optional(),
   generationFailure: GenerationFailureSchema.optional(),
   automatedScore: AutomatedScoreSchema.optional(),
   // ...
 });

 const MatrixItemSchema = z.object({
   id: z.string(),
   runtime: z.string(),
   model: z.string(),
   modelAlias: z.string().optional(),
+  modelProfile: ModelProfileSchema.optional(),
   harness: z.string(),
   test: z.string(),
   passType: PassTypeSchema,
 });
```
Also applies to: 307-315
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/schemas.ts` around lines 264 - 282, The MatrixItemResultSchema (and the similar item-level schema around lines 307-315) must include local mirrors of the new item-level metadata: add a ModelProfileSchema and a SignalAssessmentSchema definitions in this file and attach them to the appropriate item schemas as optional fields (e.g., add modelProfile: ModelProfileSchema.optional() and signalAssessment: SignalAssessmentSchema.optional() to MatrixItemResultSchema and the parallel item schema referenced in the comment). Ensure the local ModelProfileSchema and SignalAssessmentSchema mirror the canonical fields used elsewhere so the dashboard boundary can validate and pass typed model identity and trusted/tainted metadata inward.apps/dashboard/src/lib/types.ts (1)
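For orientation only, a rough shape of what the local mirrors could look like. The field lists below are illustrative guesses, not the CLI's actual `ModelProfile` / `SignalAssessment` definitions; the real fields should be copied from the source schemas rather than re-invented.

```ts
import { z } from "zod";

// Hypothetical mirror of the CLI-side model identity; the authoritative shape
// lives in src/schemas/model-profile.schema.ts and must be kept in sync.
const ModelProfileSchema = z.object({
  profileKey: z.string(),
  profileLabel: z.string().optional(),
});

// Hypothetical mirror of the trusted/tainted metadata carried on result items.
const SignalAssessmentSchema = z.object({
  classification: z.enum(["trusted", "tainted"]),
  reasons: z.array(z.string()).optional(),
});
```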
306-318:⚠️ Potential issue | 🟠 MajorBump the index schema version before requiring
instanceCount.
DashboardCheckpointSummary.instanceCountis now mandatory, butDashboardIndex.schemaVersionstays at2. Older v2 index payloads are now indistinguishable from the new shape, so readers have no way to detect or migrate stale artifacts. You already version-bumpedLeaderboardAggregateon Line 365 for the same kind of contract change, so the index likely needs the same treatment.Also applies to: 364-366
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/types.ts` around lines 306 - 318, DashboardIndex.schemaVersion was left at 2 while DashboardCheckpointSummary now requires instanceCount; bump DashboardIndex.schemaVersion (the schemaVersion field on the DashboardIndex interface) to a new integer (e.g. 3) so readers can detect the changed shape, and update any related constants/consumers that rely on schemaVersion; mirror the same version-bump pattern used for LeaderboardAggregate to ensure older v2 payloads are distinguishable from the new format that includes DashboardCheckpointSummary.instanceCount.
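A small sketch of why the bump matters: with a new literal version, readers can branch on the discriminant instead of guessing whether `instanceCount` is trustworthy. The field names below are placeholders; the real `DashboardIndex` has far more structure.

```ts
// Hypothetical narrowed views of the index payload, distinguished by schemaVersion.
interface DashboardIndexV2 {
  schemaVersion: 2;
  checkpoints: Array<{ checkpointId: string }>;
}

interface DashboardIndexV3 {
  schemaVersion: 3;
  checkpoints: Array<{ checkpointId: string; instanceCount: number }>;
}

type AnyDashboardIndex = DashboardIndexV2 | DashboardIndexV3;

function instanceCountFor(
  index: AnyDashboardIndex,
  checkpointId: string,
): number | undefined {
  if (index.schemaVersion === 3) {
    return index.checkpoints.find((c) => c.checkpointId === checkpointId)
      ?.instanceCount;
  }
  // v2 artifacts predate the field; fall back or surface rebuild guidance.
  return undefined;
}
```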
♻️ Duplicate comments (3)
src/lib/machine-profile/probe-macos.ts (1)
36-46: ⚠️ Potential issue | 🟡 Minor
Validate each `SPDisplaysDataType` entry before reading fields.
The container array is checked, but its members are still blindly cast. A payload like `{"SPDisplaysDataType":[null]}` reaches Line 40 and throws a raw `TypeError`, so callers lose the deterministic malformed-payload path this parser is trying to provide.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-macos.ts` around lines 36 - 46, The flatMap callback over displays assumes each SPDisplaysDataType entry is a non-null object and will throw on null; update the iteration to first validate each entry (e.g., ensure typeof entry === "object" && entry !== null) before reading sppci_model, _name, spdisplays_vendor or spdisplays_ndrvs and return [] for invalid entries so malformed payloads (like {"SPDisplaysDataType":[null]}) follow the parser's deterministic error path; adjust the code around displays.flatMap(...), the entry type guards, and the modelRaw construction to rely only on checked properties.src/lib/machine-profile/probe-utils.ts (1)
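Returning to the `SPDisplaysDataType` point above, a minimal per-entry guard might look like the following sketch. `ObservedAccelerator` is trimmed down here and the field handling is illustrative, not the adapter's actual code.

```ts
type ObservedAccelerator = { modelRaw: string; kind: "unknown" };

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

// Skip malformed SPDisplaysDataType entries instead of letting property access throw.
function collectDisplayAccelerators(displays: unknown[]): ObservedAccelerator[] {
  return displays.flatMap((entry) => {
    if (!isRecord(entry)) {
      return [];
    }
    const modelRaw =
      typeof entry.sppci_model === "string"
        ? entry.sppci_model
        : typeof entry._name === "string"
          ? entry._name
          : undefined;
    return modelRaw ? [{ modelRaw, kind: "unknown" as const }] : [];
  });
}
```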
98-123: ⚠️ Potential issue | 🟠 Major
Don't turn cross-source sightings into extra accelerator count.
This helper is explicitly merging accelerators from multiple probe sources, but Line 109 increments `count` on every key collision. A single device reported by two sources will be normalized as two accelerators, which can change `acceleratorCount` and the derived machine profile key downstream.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-utils.ts` around lines 98 - 123, The current merge in probe-utils.ts incorrectly adds counts across probe sources (in the block building merged: ObservedAccelerator), inflating accelerator totals when the same device is seen by multiple sources; change the count calculation from addition: (current.count ?? 1) + (accelerator.count ?? 1) to use a non-duplicating strategy such as taking the maximum: merged.count = Math.max(current.count ?? 1, accelerator.count ?? 1). Update the merged creation so it sets count using Math.max and leave other merge rules (vendor, memoryBytes, backend, kind) unchanged.src/lib/model-profile/normalization.ts (1)
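For the count rule described just above, a hedged sketch with the types trimmed to the fields the merge touches; `current` and `accelerator` stand for the existing map entry and the incoming sighting.

```ts
interface ObservedAccelerator {
  modelRaw: string;
  vendor?: string;
  count?: number;
}

// Treat repeat sightings of the same device from different probe sources as
// confirmations, not extra hardware: keep the larger reported count.
function mergeSightings(
  current: ObservedAccelerator,
  accelerator: ObservedAccelerator,
): ObservedAccelerator {
  return {
    ...current,
    vendor: current.vendor ?? accelerator.vendor,
    count: Math.max(current.count ?? 1, accelerator.count ?? 1),
  };
}
```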
44-50: ⚠️ Potential issue | 🟡 Minor
Add `@throws` to the remaining exported TSDoc blocks.
`humanizeSlug`, `normalizeConfiguredVariant`, `buildFallbackModelProfile`, and `getModelIdentityKey` are exported APIs, but their doc blocks still omit the required `@throws` entry. Use `@throws none` on the non-throwing helpers so the module matches the repo's public-doc contract.
As per coding guidelines, "All exported functions require TSDoc/JSDoc documentation including purpose, params, returns, and throws".
Also applies to: 212-218, 289-296, 368-376
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/model-profile/normalization.ts` around lines 44 - 50, The exported functions humanizeSlug, normalizeConfiguredVariant, buildFallbackModelProfile, and getModelIdentityKey are missing the required `@throws` TSDoc entry; update each function's top doc block to include "@throws none" (since they are non-throwing helpers) so the module complies with the repo's public-doc contract and the guideline that all exported functions document purpose, params, returns, and throws.
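As a shape reference only, a doc block in the requested format for one non-throwing helper. The summary, parameter text, and body below are stand-ins, not the real `humanizeSlug` documentation or implementation.

```ts
/**
 * Converts a machine-generated slug into a human-readable label.
 *
 * @param slug - Raw slug such as "llama-3-8b-instruct"
 * @returns Title-cased label derived from the slug
 * @throws none
 */
export function humanizeSlug(slug: string): string {
  return slug
    .split(/[-_]/)
    .filter(Boolean)
    .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
    .join(" ");
}
```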
🧹 Nitpick comments (3)
src/lib/machine-profile/instance-id.ts (1)
88-94: Consider adding a small delay in the retry loop.
The busy-wait loop polling for 50 ms can consume unnecessary CPU cycles. A brief sleep between iterations would be more efficient.
♻️ Suggested improvement
```diff
   if ((error as NodeJS.ErrnoException).code === "EEXIST") {
     const deadline = Date.now() + 50;
     while (Date.now() < deadline) {
       const persisted = readPersistedInstanceId(targetPath);
       if (persisted) {
         return persisted;
       }
+      // Brief yield to avoid tight spin
+      Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, 5);
     }
     return repairPersistedInstanceId(targetPath, value);
   }
```
Alternatively, since this is a rare race condition scenario and the 50 ms bound is short, the current implementation is acceptable for pragmatic reasons.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/instance-id.ts` around lines 88 - 94, The tight busy-wait in the retry loop should sleep briefly to avoid CPU spinning: in the block around the loop that calls readPersistedInstanceId(targetPath) (the code that sets deadline = Date.now() + 50 and while (Date.now() < deadline) { ... }), replace the busy loop with an async loop that awaits a short delay (e.g. await new Promise(r => setTimeout(r, 5))) between iterations; update the enclosing function to be async (or use a small synchronous sleep helper if the surrounding code cannot be async) and adjust any callers accordingly so the retry still returns the persisted id if found within ~50ms but yields the CPU between polls.apps/dashboard/src/lib/api.ts (1)
12-15: Consider using a shared package export or alias to reduce relative import complexity.
The import from `"../../../../src/lib/machine-profile/legacy.js"` works correctly and is part of an established pattern in this monorepo (7 similar cross-package imports exist). The Vite configuration explicitly permits this via `fs.allow: ["../.."]`, and module resolution is properly configured. However, consolidating these helpers through a shared package entry point or a workspace-level path alias would improve readability and maintainability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/api.ts` around lines 12 - 15, Replace the deep relative import of parseKnownPlanPayload and parseKnownRunPayload with a shared workspace export or path alias: update the import that currently points to "../../../../src/lib/machine-profile/legacy.js" to import these symbols from the central package entry or configured alias (e.g., a top-level "machine-profile" export), then update any package exports/tsconfig/vite path mappings so parseKnownPlanPayload and parseKnownRunPayload are resolved via the new shared module rather than a long relative path.src/lib/machine-profile/probe-linux.ts (1)
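If the alias route above is chosen, the Vite side could look roughly like this sketch. The alias name `@bench` and the relative root are assumptions, and the same mapping would also need a matching `paths` entry in the dashboard `tsconfig.json` for type checking.

```ts
// vite.config.ts — illustrative only; alias name and target path are placeholders.
import { defineConfig } from "vite";
import { fileURLToPath } from "node:url";

export default defineConfig({
  resolve: {
    alias: {
      // Lets dashboard code import "@bench/machine-profile/legacy.js"
      // instead of "../../../../src/lib/machine-profile/legacy.js".
      "@bench": fileURLToPath(new URL("../../src/lib", import.meta.url)),
    },
  },
});
```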
69-78: Consider enriching the `lspci` parser with vendor extraction.
The `parseLspciAccelerators` function sets `kind: "unknown"` but doesn't attempt to extract vendor information from the lspci output, which typically includes vendor names (e.g., "NVIDIA Corporation", "Advanced Micro Devices").
This could help `dedupeAccelerators` better identify when lspci returns the same device already found by vendor-specific tools.
🔧 Optional enhancement
```diff
 export function parseLspciAccelerators(stdout: string): ObservedAccelerator[] {
   return stdout
     .split("\n")
     .map((line) => line.trim())
     .filter((line) => /(vga|3d|display)/i.test(line))
-    .map((line) => ({
-      modelRaw: line.replace(/^[0-9a-fA-F:.]+\s+[^:]+:\s*/, ""),
-      kind: "unknown" as const,
-    }));
+    .map((line) => {
+      const modelRaw = line.replace(/^[0-9a-fA-F:.]+\s+[^:]+:\s*/, "");
+      const vendor = /nvidia/i.test(modelRaw) ? "NVIDIA"
+        : /amd|radeon/i.test(modelRaw) ? "AMD"
+        : /intel/i.test(modelRaw) ? "Intel"
+        : undefined;
+      return {
+        ...(vendor ? { vendor } : {}),
+        modelRaw,
+        kind: "unknown" as const,
+      };
+    });
 }
```
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-linux.ts` around lines 69 - 78, parseLspciAccelerators currently returns accelerators with kind set to "unknown" and doesn't extract vendor info from lspci output; update parseLspciAccelerators to parse the vendor/manufacturer from each line (e.g., capture strings like "NVIDIA Corporation" or "Advanced Micro Devices" that appear after the device id) and include it on the returned ObservedAccelerator (add a vendor or vendorRaw field and set kind based on vendor when possible). Locate parseLspciAccelerators and modify the mapping of each line to: extract modelRaw as now, extract a vendorRaw by regex or tokenization of the remaining description, set kind to a more specific value when vendorRaw matches known vendors (e.g., "nvidia" => "nvidia"), and ensure the shape matches what dedupeAccelerators expects so duplicates between lspci and vendor-specific tools can be resolved.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/lib/types.ts`:
- Around line 120-189: The MatrixItemResult interface is missing the new
optional signalAssessment field; update the MatrixItemResult type to include
signalAssessment?: SignalAssessment so it mirrors the CLI schema (see
SignalAssessment/SignalAssessmentSchema) and maintain compatibility between
src/schemas/result.schema.ts and apps/dashboard/src/lib/types.ts; locate
MatrixItemResult in this file and add the optional signalAssessment property
with the SignalAssessment type.
In `@README.md`:
- Around line 238-239: The sentence is ambiguous about which config key is
preferred; update the README text to state that while legacy alias-only files
and the `--model-alias` form are still supported and will be normalized, the
canonical and preferred key for new configs is `models` (mention `modelProfiles`
only as an accepted legacy/compat shape), e.g., change "prefer `modelProfiles` /
`models`" to "prefer `models` (legacy `modelProfiles` are accepted and will be
normalized)"; keep the rest of the normalization note intact.
In `@src/cli/migrate-machine-command.ts`:
- Around line 49-61: The validation in .superRefine currently only ensures
options.dashboardOutputDir is provided when options.rebuildDashboardIndex is
true but does not ensure it doesn't point back into the source results directory
(options.dir); update .superRefine to resolve and normalize both
options.dashboardOutputDir and options.dir (use path.resolve/path.normalize) and
then reject when the dashboardOutputDir equals or is nested inside the source
dir (e.g., path.relative(dir, dashboardOutputDir) === "" or does not start with
".."); update the context.addIssue to keep the same path ["dashboardOutputDir"]
and error message but trigger it when the resolved paths indicate output would
mutate the source results tree (apply the same fix for the other similar
validation block referenced).
- Around line 109-119: The staged rewrite objects (created from changedRewrites
into StagedArtifactRewrite) mark backupCreated and renameCompleted on success
but leave rollbackSucceeded false, which causes shouldDeleteBackup to keep
backups; update the post-rewrite logic that sets renameCompleted to also mark
rollbackSucceeded true (or alter the shouldDeleteBackup check to allow deletion
when renameCompleted is true and rollback did not occur). Locate where
StagedArtifactRewrite instances are mutated (fields tempWritten, backupCreated,
renameCompleted) and either set rollbackSucceeded = true after a successful
rename/commit or change the backup-deletion predicate to treat a successful
rename as safe to delete backups.
In `@src/lib/model-profile/normalization.ts`:
- Around line 37-42: formatCompactNumber currently rounds all non-integer values
to one decimal (value.toFixed(1)), which loses precision and changes canonical
identifiers like profileKey, profileLabel, and parameterScaleLabel; update
formatCompactNumber to preserve exact fractional digits instead of forcing a
single decimal place (e.g., convert the number to a string without unnecessary
trailing zeros but without rounding to one decimal), so that values like 750M
and 1.25B remain distinct in profileKey/profileLabel/parameterScaleLabel while
still trimming insignificant trailing zeros.
In `@src/lib/model-profile/registry.ts`:
- Around line 291-296: The code currently sets resolutionSource based solely on
whether configuredProfile.variants contains string values, which incorrectly
treats modern parsed inline profiles and versioned registries as "legacy_alias";
change this by removing the heuristic and instead use an explicit provenance
flag carried when loading alias files (do not infer from
configuredProfile.variants). Update the logic around resolutionSource assignment
in the registry handling so it reads provenance from the normalized registry
entry (added when alias files are loaded) rather than deriving it from variants;
adjust callers such as parseInlineModelProfile and buildResolvedModelProfile to
propagate or respect the explicit provenance field on the registry entry.
- Around line 150-166: parseInlineModelProfile currently parses runtime:model
mappings (the loop over mappings.split(...) that populates variants) but accepts
unknown runtime tokens and silently allows duplicates; update
parseInlineModelProfile to validate each parsed runtime against the canonical
runtime list/enum used by the codebase (e.g., the Runtime enum or known runtime
names) and throw a clear Error via Zod validation when an unknown runtime is
encountered or when a runtime is duplicated, rather than overwriting in
variants; ensure the validation happens inside the loop that processes each
entry (the code block using colonIndex, runtime, modelName, and variants) so
malformed inline runtime mappings are rejected at parse time.
In `@src/results/aggregate.ts`:
- Around line 100-105: buildAggregateKey currently uses raw item.model which
allows the same model persisted under different aliases to create separate
aggregates; replace item.model with a canonical model identity (e.g., call the
existing model-normalization helper or add one) so the key is stable across
aliases. In buildAggregateKey, compute const canonicalModel =
getCanonicalModelId(item) or item.canonicalModelId (or call a new
canonicalizeModel(item.model)) and use `${canonicalModel}` in the template
instead of `${item.model}`; ensure you import or implement the canonicalization
helper used and run/update any tests that rely on aggregate key behavior.
In `@src/runner/item-executor.ts`:
- Around line 186-188: When runScoringWithCompileRetry() returns a
scoringOutcome whose generation replaces the current generation, ensure you
clear stale signal state by resetting signalAssessment to undefined (or null)
instead of using "scoringOutcome.signalAssessment ?? signalAssessment";
specifically update the assignment around where generation =
scoringOutcome.generation and signalAssessment is set so that if
scoringOutcome.signalAssessment is absent you do not retain the previous
signalAssessment. Apply the same fix to the analogous block around the second
occurrence mentioned (the code handling lines ~303-308) so any promoted retry
that lacks signalAssessment does not inherit the prior assessment.
In `@src/runner/plan-builder.ts`:
- Around line 164-203: In the loop that processes config.models (the block using
resolveModelSelection and pushing into filtered), avoid adding the same
resolvedSelection.runtimeModelName more than once: check whether filtered
already contains resolvedSelection.runtimeModelName (or use a Set) before
pushing and before calling resolvedModelProfiles.set and log.debug, so
overlapping selectors that resolve to the same runtimeModelName are deduplicated
prior to expanding the matrix; keep the existing checks against discovered and
the error throw for unresolved configuredProfile.
---
Outside diff comments:
In `@apps/dashboard/src/lib/schemas.ts`:
- Around line 264-282: The MatrixItemResultSchema (and the similar item-level
schema around lines 307-315) must include local mirrors of the new item-level
metadata: add a ModelProfileSchema and a SignalAssessmentSchema definitions in
this file and attach them to the appropriate item schemas as optional fields
(e.g., add modelProfile: ModelProfileSchema.optional() and signalAssessment:
SignalAssessmentSchema.optional() to MatrixItemResultSchema and the parallel
item schema referenced in the comment). Ensure the local ModelProfileSchema and
SignalAssessmentSchema mirror the canonical fields used elsewhere so the
dashboard boundary can validate and pass typed model identity and
trusted/tainted metadata inward.
In `@apps/dashboard/src/lib/types.ts`:
- Around line 306-318: DashboardIndex.schemaVersion was left at 2 while
DashboardCheckpointSummary now requires instanceCount; bump
DashboardIndex.schemaVersion (the schemaVersion field on the DashboardIndex
interface) to a new integer (e.g. 3) so readers can detect the changed shape,
and update any related constants/consumers that rely on schemaVersion; mirror
the same version-bump pattern used for LeaderboardAggregate to ensure older v2
payloads are distinguishable from the new format that includes
DashboardCheckpointSummary.instanceCount.
---
Duplicate comments:
In `@src/lib/machine-profile/probe-macos.ts`:
- Around line 36-46: The flatMap callback over displays assumes each
SPDisplaysDataType entry is a non-null object and will throw on null; update the
iteration to first validate each entry (e.g., ensure typeof entry === "object"
&& entry !== null) before reading sppci_model, _name, spdisplays_vendor or
spdisplays_ndrvs and return [] for invalid entries so malformed payloads (like
{"SPDisplaysDataType":[null]}) follow the parser's deterministic error path;
adjust the code around displays.flatMap(...), the entry type guards, and the
modelRaw construction to rely only on checked properties.
In `@src/lib/machine-profile/probe-utils.ts`:
- Around line 98-123: The current merge in probe-utils.ts incorrectly adds
counts across probe sources (in the block building merged: ObservedAccelerator),
inflating accelerator totals when the same device is seen by multiple sources;
change the count calculation from addition: (current.count ?? 1) +
(accelerator.count ?? 1) to use a non-duplicating strategy such as taking the
maximum: merged.count = Math.max(current.count ?? 1, accelerator.count ?? 1).
Update the merged creation so it sets count using Math.max and leave other merge
rules (vendor, memoryBytes, backend, kind) unchanged.
In `@src/lib/model-profile/normalization.ts`:
- Around line 44-50: The exported functions humanizeSlug,
normalizeConfiguredVariant, buildFallbackModelProfile, and getModelIdentityKey
are missing the required `@throws` TSDoc entry; update each function's top doc
block to include "@throws none" (since they are non-throwing helpers) so the
module complies with the repo's public-doc contract and the guideline that all
exported functions document purpose, params, returns, and throws.
---
Nitpick comments:
In `@apps/dashboard/src/lib/api.ts`:
- Around line 12-15: Replace the deep relative import of parseKnownPlanPayload
and parseKnownRunPayload with a shared workspace export or path alias: update
the import that currently points to
"../../../../src/lib/machine-profile/legacy.js" to import these symbols from the
central package entry or configured alias (e.g., a top-level "machine-profile"
export), then update any package exports/tsconfig/vite path mappings so
parseKnownPlanPayload and parseKnownRunPayload are resolved via the new shared
module rather than a long relative path.
In `@src/lib/machine-profile/instance-id.ts`:
- Around line 88-94: The tight busy-wait in the retry loop should sleep briefly
to avoid CPU spinning: in the block around the loop that calls
readPersistedInstanceId(targetPath) (the code that sets deadline = Date.now() +
50 and while (Date.now() < deadline) { ... }), replace the busy loop with an
async loop that awaits a short delay (e.g. await new Promise(r => setTimeout(r,
5))) between iterations; update the enclosing function to be async (or use a
small synchronous sleep helper if the surrounding code cannot be async) and
adjust any callers accordingly so the retry still returns the persisted id if
found within ~50ms but yields the CPU between polls.
In `@src/lib/machine-profile/probe-linux.ts`:
- Around line 69-78: parseLspciAccelerators currently returns accelerators with
kind set to "unknown" and doesn't extract vendor info from lspci output; update
parseLspciAccelerators to parse the vendor/manufacturer from each line (e.g.,
capture strings like "NVIDIA Corporation" or "Advanced Micro Devices" that
appear after the device id) and include it on the returned ObservedAccelerator
(add a vendor or vendorRaw field and set kind based on vendor when possible).
Locate parseLspciAccelerators and modify the mapping of each line to: extract
modelRaw as now, extract a vendorRaw by regex or tokenization of the remaining
description, set kind to a more specific value when vendorRaw matches known
vendors (e.g., "nvidia" => "nvidia"), and ensure the shape matches what
dedupeAccelerators expects so duplicates between lspci and vendor-specific tools
can be resolved.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f26bd3df-ffa2-461c-b826-48e24a6f648d
📒 Files selected for processing (99)
- README.md
- apps/dashboard/public/results/20260305-135118-b11d08/plan.json
- apps/dashboard/public/results/20260305-135118-b11d08/run.json
- apps/dashboard/public/results/20260305-144041-4d5950/plan.json
- apps/dashboard/public/results/20260305-144041-4d5950/run.json
- apps/dashboard/public/results/20260306-135715-91b0ee/plan.json
- apps/dashboard/public/results/20260306-135715-91b0ee/run.json
- apps/dashboard/public/results/20260307-161553-c25c6a/plan.json
- apps/dashboard/public/results/20260307-161553-c25c6a/run.json
- apps/dashboard/public/results/20260316-141600-6e18e2/plan.json
- apps/dashboard/public/results/20260316-141600-6e18e2/run.json
- apps/dashboard/public/results/20260317-082000-7b3bf3/plan.json
- apps/dashboard/public/results/20260317-082000-7b3bf3/run.json
- apps/dashboard/public/results/20260317-170728-697a9a/plan.json
- apps/dashboard/public/results/20260317-170728-697a9a/run.json
- apps/dashboard/public/results/20260318-091214-a0dbda/plan.json
- apps/dashboard/public/results/20260318-091214-a0dbda/run.json
- apps/dashboard/public/results/20260319-090246-e5f8be/plan.json
- apps/dashboard/public/results/20260319-090246-e5f8be/run.json
- apps/dashboard/public/results/aggregates/chk_sha256v1_2c13ad3e58b3.json
- apps/dashboard/public/results/aggregates/chk_sha256v1_61a037f87105.json
- apps/dashboard/public/results/aggregates/latest.json
- apps/dashboard/public/results/index.json
- apps/dashboard/scripts/build-index.ts
- apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
- apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
- apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
- apps/dashboard/src/components/run-detail/run-detail-page.tsx
- apps/dashboard/src/components/run-list/checkpoint-groups.ts
- apps/dashboard/src/components/run-list/run-card.tsx
- apps/dashboard/src/components/run-list/run-list-page.tsx
- apps/dashboard/src/lib/aggregations-compare.ts
- apps/dashboard/src/lib/api.ts
- apps/dashboard/src/lib/schemas.ts
- apps/dashboard/src/lib/types.ts
- llm/implementation/multi-runtime-mvp-implementation.md
- models.example.json
- models.json
- src/cli/compare-command.ts
- src/cli/index.ts
- src/cli/migrate-machine-command.ts
- src/cli/run-command.ts
- src/harnesses/code-output-policy.ts
- src/harnesses/direct-adapter.ts
- src/harnesses/goose-adapter.ts
- src/harnesses/harness.ts
- src/harnesses/opencode-adapter.ts
- src/lib/hardware-profile.ts
- src/lib/machine-profile/instance-id.ts
- src/lib/machine-profile/legacy.ts
- src/lib/machine-profile/normalization.ts
- src/lib/machine-profile/probe-linux.ts
- src/lib/machine-profile/probe-macos.ts
- src/lib/machine-profile/probe-utils.ts
- src/lib/machine-profile/probe-windows.ts
- src/lib/machine-profile/probe.ts
- src/lib/model-profile/normalization.ts
- src/lib/model-profile/registry.ts
- src/lib/model-profiles.ts
- src/lib/signal-assessment.ts
- src/lib/stats-format.ts
- src/lib/stats.ts
- src/results/aggregate.ts
- src/results/compare.ts
- src/results/reader.ts
- src/runner/generation-retry.ts
- src/runner/index.ts
- src/runner/item-executor.ts
- src/runner/item-retry.ts
- src/runner/plan-builder.ts
- src/schemas/common.schema.ts
- src/schemas/config.schema.ts
- src/schemas/index.ts
- src/schemas/model-profile.schema.ts
- src/schemas/plan.schema.ts
- src/schemas/result.schema.ts
- src/tests/calculator-stateful/prompt.blind.md
- src/tests/event-emitter/prompt.blind.md
- src/tests/file-delete-smoke/prompt.blind.md
- src/tests/rate-limiter/prompt.blind.md
- src/tests/safe-cleanup/prompt.blind.md
- src/tests/workspace-smoke/test.meta.json
- test/aggregate.test.ts
- test/build-index.test.ts
- test/code-output-policy.test.ts
- test/compare-checkpoint-guard.test.ts
- test/compare.test.ts
- test/dashboard-schema-compat.test.ts
- test/goose-adapter.test.ts
- test/hardware-profile.test.ts
- test/machine-profile-migration.test.ts
- test/machine-profile-probe.test.ts
- test/opencode-adapter.test.ts
- test/plan-builder.test.ts
- test/schemas.test.ts
- test/signal-assessment.test.ts
- test/stats.test.ts
- test/workspace-prompt-parity.test.ts
@CodeRabbit full review
✅ Actions performed
Full review triggered.
Actionable comments posted: 17
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
apps/dashboard/src/lib/aggregations-compare.ts (1)
117-159: ⚠️ Potential issue | 🟠 Major
Recompute dashboard deltas from the matched metric subset.
`matchedWithScoring` and `matchedWithFrontierEval` only drive the availability counters here; `scoringDelta` and `frontierEvalDelta` still compare `runA.items` vs `runB.items`. That means the dashboard can report `comparedRows: 1` while the visible delta still includes unmatched rows or rows where the metric exists on only one side.
💡 Suggested fix
```diff
-  const passRateA = computePassRate(runA.items);
-  const passRateB = computePassRate(runB.items);
   const matchedWithScoring = matched.filter(
     (item) => item.itemA.automatedScore && item.itemB.automatedScore,
   );
+  const passRateA = computePassRate(
+    matchedWithScoring.map(({ itemA }) => itemA),
+  );
+  const passRateB = computePassRate(
+    matchedWithScoring.map(({ itemB }) => itemB),
+  );
   const scoringDelta =
     passRateA.total > 0 || passRateB.total > 0
       ? {
           passRateDelta: passRateB.passRate - passRateA.passRate,
           totalTestsDelta: passRateB.total - passRateA.total,
         }
       : null;
-  const frontierA = computeFrontierStats(runA.items);
-  const frontierB = computeFrontierStats(runB.items);
   const matchedWithFrontierEval = matched.filter(
     (item) => item.itemA.frontierEval && item.itemB.frontierEval,
   );
+  const frontierA = computeFrontierStats(
+    matchedWithFrontierEval.map(({ itemA }) => itemA),
+  );
+  const frontierB = computeFrontierStats(
+    matchedWithFrontierEval.map(({ itemB }) => itemB),
+  );
```
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/aggregations-compare.ts` around lines 117 - 159, The deltas are computed from full runA.runB items but should be computed from the matched subsets; update scoringDelta and frontierEvalDelta to use only matched rows that contain the metric. Specifically, for scoringDelta, derive passRateA/passRateB and total counts from matchedWithScoring by mapping to itemA and itemB (e.g., build arrays of the matched itemA items and matched itemB items) and compute pass rates/totals from those instead of passRateA/passRateB; for frontierEvalDelta call computeFrontierStats on the matchedWithFrontierEval itemA and itemB subsets (map to their itemA/itemB arrays) and compute avgScoreDelta from those results. Keep metricAvailability.comparedRows as matchedWithScoring.length / matchedWithFrontierEval.length.src/results/aggregate.ts (1)
374-405: ⚠️ Potential issue | 🟡 Minor
Keep zero-item matched runs in `machines[]` and `summary.machines`.
`profileRunSet` and `profileInstanceSet` already record participating machines before item aggregation, but `machineSummary` is only seeded from deduped items. If a checkpoint matches a run with zero items, the aggregate reports no machines even though the run contributed a machine/instance, which makes this output diverge from `summarizeCheckpoints()`.
Seed the per-profile summary map from `matchedRuns` metadata first, then increment `itemCount` while walking `items`.
Also applies to: 420-425
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/results/aggregate.ts` around lines 374 - 405, machineSummary is only seeded from deduped items so machines with zero matched items are omitted; seed machineSummary from the matchedRuns metadata first (create an entry per matched run's machineProfileKey with itemCount: 0, machineProfileLabel from the matched run and verificationStatus from its metadata), then iterate items and increment itemCount as currently done; this ensures the existing loop that updates current.itemCount and the later mapping to MachineAggregateSummary (the machines array and summary.machines) will include profiles with zero items while keeping the rest of the aggregation logic (profileRunSet, profileInstanceSet, verificationStatus, runCount, instanceCount) unchanged.
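A rough sketch of the seeding order the prompt describes. The shapes below are assumptions standing in for the module's real run metadata and item types, not its actual interfaces.

```ts
interface MatchedRunMeta {
  machineProfileKey: string;
  machineProfileLabel?: string;
}

interface MachineSummaryEntry {
  machineProfileKey: string;
  machineProfileLabel?: string;
  itemCount: number;
}

function buildMachineSummary(
  matchedRuns: MatchedRunMeta[],
  items: Array<{ machineProfileKey: string }>,
): MachineSummaryEntry[] {
  const summary = new Map<string, MachineSummaryEntry>();

  // Seed from run metadata first so zero-item runs still surface their machine.
  for (const run of matchedRuns) {
    if (!summary.has(run.machineProfileKey)) {
      summary.set(run.machineProfileKey, {
        machineProfileKey: run.machineProfileKey,
        machineProfileLabel: run.machineProfileLabel,
        itemCount: 0,
      });
    }
  }

  // Then count items against the already-seeded entries.
  for (const item of items) {
    const entry = summary.get(item.machineProfileKey);
    if (entry) {
      entry.itemCount += 1;
    }
  }

  return [...summary.values()];
}
```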
♻️ Duplicate comments (3)
test/workspace-prompt-parity.test.ts (1)
59-87: ⚠️ Potential issue | 🟡 Minor
Prompt parity assertions still miss some required contract clauses.
The test now checks method names, but it still doesn’t assert newly required semantics (notably emitter return behavior and some calculator signature/return clauses), so regressions can still pass unnoticed.
✅ Suggested parity-assertion additions
```diff
 expect(calculatorPrompt).toContain("`add(n: number)`");
 expect(calculatorPrompt).toContain("`subtract(n: number)`");
+expect(calculatorPrompt).toContain("`multiply(n: number)`");
+expect(calculatorPrompt).toContain("`divide(n: number)`");
 expect(calculatorPrompt).toContain("`clear()`");
+expect(calculatorPrompt).toContain("`result()` returns the current value");
 expect(calculatorPrompt).toContain("`memoryClear`");
 expect(calculatorPrompt).toContain("`memoryRecall()`");
 expect(emitterPrompt).toContain("`listenerCount(event)`");
 expect(emitterPrompt).toContain("`on(event, listener)`");
 expect(emitterPrompt).toContain("`once(event, listener)`");
 expect(emitterPrompt).toContain("`off(event, listener)`");
 expect(emitterPrompt).toContain("`emit(event, payload)`");
+expect(emitterPrompt).toContain(
+  "`on` and `once` return the current listener count for that event",
+);
+expect(emitterPrompt).toContain(
+  "`off` returns `true` when one matching listener is removed",
+);
+expect(emitterPrompt).toContain(
+  "`listenerCount` returns the current listener count for that event",
+);
 expect(rateLimiterPrompt).toContain("`allow(key: string, nowMs: number)`");
 expect(rateLimiterPrompt).toContain("`remaining(key: string, nowMs: number)`");
 expect(rateLimiterPrompt).toContain("`reset(key: string)`");
 expect(rateLimiterPrompt).toContain("never go below 0");
```
Verify each finding against the current code and only fix it if needed. In `@test/workspace-prompt-parity.test.ts` around lines 59 - 87, Update the test to also assert the required return-type semantics: check that calculator blind prompt declares add(n: number) and subtract(n: number) with a numeric return (e.g., "add(n: number): number" and "subtract(n: number): number"), that memoryRecall and memoryClear include their return signatures (e.g., "memoryRecall(): number" or "memoryClear(): void/number" as appropriate), and that clear includes its declared return; and for the emitter prompt assert return behaviors such as "listenerCount(event): number", "on(event, listener): this", "once(event, listener): this", "off(event, listener): this", and "emit(event, payload): boolean" so the test enforces both method names and their required return/contract clauses.src/lib/machine-profile/probe-utils.ts (1)
98-123: ⚠️ Potential issue | 🟠 Major
Don't include optional `vendor` in the dedupe key.
If one probe source reports a device with `vendor` and another reports the same `modelRaw` without it, they end up in different buckets here. That overcounts accelerators and can change the downstream machine profile key for identical hardware.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-utils.ts` around lines 98 - 123, The dedupe key currently includes accelerator.vendor which causes identical devices to be split when vendor is missing; change the key generation in probe-utils.ts to omit vendor and rely on accelerator.modelRaw (and any other stable fields) instead — update the key construction where const key = [...] is created (referencing variables key, deduped, accelerator, modelRaw) so it uses modelRaw.trim().toLowerCase() (and not vendor) when building the dedupe key, then keep the subsequent merging logic (merged vendor/memoryBytes/backend/kind and count) as-is so missing vendor values are filled from other entries.apps/dashboard/scripts/build-index.ts (1)
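For the bucketing change above, a minimal sketch keyed on `modelRaw` alone; the interface is trimmed to the relevant fields and does not mirror the module's full type.

```ts
interface ObservedAccelerator {
  modelRaw: string;
  vendor?: string;
}

// Bucket sightings by the normalized model string only, so a missing vendor on
// one probe source does not split an identical device into two entries.
function dedupeKey(accelerator: ObservedAccelerator): string {
  return accelerator.modelRaw.trim().toLowerCase();
}
```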
62-67: ⚠️ Potential issue | 🔴 Critical
Make the overlap guard separator-safe before `rm()` runs.
`isSameOrNestedPath()` only works for `/`-separated strings. On Windows, paths like `C:\repo\results\published` are treated as unrelated to `C:\repo\results`, so the overlap check can miss exactly the case that `writePublishedRunBundles()` later deletes.
🛠️ Suggested fix
```diff
-import { basename, join, resolve } from "node:path";
+import { basename, join, relative, resolve } from "node:path";
@@
 function isSameOrNestedPath(basePath: string, candidate: string): boolean {
-  if (basePath === candidate) {
-    return true;
-  }
-  const normalizedBase = basePath.endsWith("/") ? basePath : `${basePath}/`;
-  return candidate.startsWith(normalizedBase);
+  const relativePath = relative(basePath, candidate);
+  return (
+    relativePath === "" ||
+    (!relativePath.startsWith("..") && !relativePath.includes(":"))
+  );
 }
```
Run this read-only check; the nested case should come back `false` with the current implementation:
```bash
#!/bin/bash
node <<'NODE'
const path = require('node:path');

function isSameOrNestedPath(basePath, candidate) {
  if (basePath === candidate) {
    return true;
  }
  const normalizedBase = basePath.endsWith("/") ? basePath : `${basePath}/`;
  return candidate.startsWith(normalizedBase);
}

const base = path.win32.resolve('C:\\repo\\results');
const nested = path.win32.resolve('C:\\repo\\results\\published');
const parent = path.win32.resolve('C:\\repo');
console.log('nested should be true:', isSameOrNestedPath(base, nested));
console.log('reverse check should be true:', isSameOrNestedPath(parent, base));
NODE
```
Also applies to: 511-523
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/scripts/build-index.ts` around lines 62 - 67, isSameOrNestedPath currently compares raw strings using "/" which breaks on Windows; fix by normalizing both paths with Node's path utilities (e.g., import path from 'node:path'), resolve/normalize the inputs (path.resolve or path.normalize), then perform the overlap check using path.relative: compute const rel = path.relative(resolvedBase, resolvedCandidate) and return rel === '' || (!rel.startsWith('..') && !path.isAbsolute(rel)); update the implementation in isSameOrNestedPath (and any duplicate instances around lines 511-523) and ensure callers such as writePublishedRunBundles use the updated function so rm()'s guard is separator-safe across platforms.
🧹 Nitpick comments (8)
test/workspace-prompt-parity.test.ts (1)
1-3: Expand file header to include exports and invariants.
Current header documents purpose only; please include explicit exports/invariants to match repository standard.
As per coding guidelines, "Every file begins with a short header documenting purpose, exports, and invariants".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/workspace-prompt-parity.test.ts` around lines 1 - 3, Update the file header at the top of test/workspace-prompt-parity.test.ts to follow the repo standard: keep the Purpose line and add explicit Exports (list any exported symbols from this test file, or state "none" if it exports nothing) and an Invariants section describing guarantees this test enforces (e.g., prompt/score parity must remain stable, deterministic seeding, no external network calls). Ensure the header is a short 3-part block: Purpose, Exports, Invariants.apps/dashboard/src/lib/api.ts (1)
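For reference, a header in the three-part shape the prompt asks for; the wording below is illustrative, not the file's required text.

```ts
/**
 * Purpose: Verifies that blind and standard workspace prompts stay in parity
 *          for method names and contract clauses.
 * Exports: none (test file).
 * Invariants: assertions are string-based and deterministic; no network access
 *             or writes outside the test workspace.
 */
```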
12-15: Consider exposing these helpers via a shared package or shorter alias.
The deep relative import path `../../../../src/lib/machine-profile/legacy.js` traversing from `apps/dashboard/src/lib/` to `src/lib/` is fragile and may break if directory structure changes. Consider either:
- Exposing these helpers from a shared package entry point, or
- Adding a path alias in the dashboard's build config.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/api.ts` around lines 12 - 15, The import of parseKnownPlanPayload and parseKnownRunPayload uses a brittle deep relative path; to fix, re-export these helpers from a stable shared entry (e.g., add exports for parseKnownPlanPayload and parseKnownRunPayload in the library's public index/packaged entry) or add a path alias in the dashboard build config and update the import to the alias; make sure to update the module that currently defines them (the legacy machine-profile module) to export them from the shared entry so imports of parseKnownPlanPayload and parseKnownRunPayload become short and robust.test/dashboard-schema-compat.test.ts (1)
1-3: Complete the new file header metadata.
This header documents purpose only. The TS file convention in this repo also asks for exports and invariants, even when the export list is `none`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/dashboard-schema-compat.test.ts` around lines 1 - 3, The file header in test/dashboard-schema-compat.test.ts only documents purpose; add the required TS file header metadata by expanding the top comment to include an "Exports" section (even if "none") and an "Invariants" section describing any guarantees for the test file, matching the repo's header convention used elsewhere; ensure the header remains a block comment and follows the same phrasing/format as other test files so tooling and readers can rely on the documented purpose, exports (none), and invariants.src/lib/signal-assessment.ts (1)
34-46: Document the new thrown error oncreateTaintedSignalAssessment.This exported helper now throws on an empty reason list, but its JSDoc still omits
@throws.Suggested doc update
/** * Builds a tainted signal assessment from stable reasons. * * `@param` reasons - Taint reasons to record * `@returns` Tainted assessment + * `@throws` {Error} When `reasons` is empty after de-duplication */As per coding guidelines, "All exported functions must have TSDoc/JSDoc documentation including purpose, params, returns, and throws".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/signal-assessment.ts` around lines 34 - 46, The exported function createTaintedSignalAssessment now throws when called with an empty reasons array; update its JSDoc/TSDoc to include an `@throws` tag that documents this behavior (e.g., state that it throws Error when reasons is empty or contains no unique reasons) and mention the error message ("createTaintedSignalAssessment called with empty reasons") so callers and types/docs generators know the failure mode.src/results/compare.ts (1)
142-220: Split the aggregate/trusted-metric helpers into a sibling module.
compare.tsis now 539 lines, and most of the new summary helpers are independent of the outer-join/formatting path. Pulling this block out would get the file back under the size cap and make the summary math easier to test in isolation.As per coding guidelines, "Keep files under 500 lines; split aggressively by responsibility".
Also applies to: 267-429
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/results/compare.ts` around lines 142 - 220, Move the aggregate/trusted-metric helpers into a new sibling module and import them from compare.ts: extract the interfaces AggregateScoring and AggregateFrontier and the functions calculatePassRate, computeAggregateScoring, and computeAggregateFrontier into a new file (e.g., results/summaryMetrics.ts), export them, and replace the local definitions in compare.ts with imports of those exported symbols; ensure all references in compare.ts (and any other files) use the imported names and update any test/module imports accordingly so tests compile against the new module.src/schemas/common.schema.ts (1)
157-219: Split the new signal-assessment and machine-profile blocks out of `common.schema.ts`.
This file is now one non-trivial addition away from the 500-line cap and is carrying several separate domains. Pulling the signal-assessment and machine-profile schemas into their own modules now will make future schema bumps and migrations much easier to reason about.
As per coding guidelines, "Keep files under 500 lines; split aggressively by responsibility".
Also applies to: 266-475
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/schemas/common.schema.ts` around lines 157 - 219, This file mixes unrelated schema domains; extract the signal-assessment block (SignalAssessmentSchema, SignalAssessmentClassificationSchema, SignalAssessmentReasonSchema, SignalAssessment type and const arrays signalAssessmentClassifications and signalAssessmentReasonTypes) into its own module (e.g., signal-assessment.schema.ts) and move the machine-profile related block (lines ~266-475) into a separate module (e.g., machine-profile.schema.ts); update exports/imports so any places importing these symbols now import from the new modules (or re-export from common.schema.ts if you need backward compatibility), and run type checks to ensure no references were broken.apps/dashboard/src/lib/types.ts (1)
84-93: Keep `ModelVariant.runtime` aligned with the shared runtime type.
The CLI schemas only allow canonical runtime names here, but this mirror widens the field to `string`. That makes it easy for dashboard code to construct impossible `ModelProfile` objects without a type error.
♻️ Suggested fix
```diff
-import type { TestCategory } from "../../../../src/schemas/common.schema.js";
+import type {
+  RuntimeName,
+  TestCategory,
+} from "../../../../src/schemas/common.schema.js";
@@
 export interface ModelVariant {
   variantKey: string;
   variantLabel: string;
-  runtime: string;
+  runtime: RuntimeName;
   runtimeModelName: string;
   format?: string;
   quantization?: string;
```
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/types.ts` around lines 84 - 93, The ModelVariant interface widens runtime to string which diverges from the canonical runtime type; change ModelVariant.runtime to use the shared runtime type (e.g., import and use the Runtime/CanonicalRuntime type used by the CLI/shared schemas) instead of string so TypeScript enforces only allowed runtimes; update the import and any affected usages (e.g., places constructing ModelVariant or ModelProfile) to satisfy the new type.src/lib/hardware-profile.ts (1)
99-103: Validate injected hardware before normalizing it.
The `observedHardware`/`hardwareProfile` override is a public input path, but it bypasses the schema entirely. Parsing it here keeps bad fixtures or config from failing later inside `normalizeMachineProfile()` with a much less actionable error.
🔧 Suggested fix
```diff
-import type { HardwareProfile, MachineProfile } from "../schemas/index.js";
+import {
+  HardwareProfileSchema,
+  type HardwareProfile,
+  type MachineProfile,
+} from "../schemas/index.js";
@@
-  const observedHardware =
-    options.observedHardware ??
-    options.hardwareProfile ??
-    (await collectObservedHardwareProfile());
+  const observedHardware = HardwareProfileSchema.parse(
+    options.observedHardware ??
+      options.hardwareProfile ??
+      (await collectObservedHardwareProfile()),
+  );
```
As per coding guidelines, "Validate at boundaries with Zod and pass typed data inward" and "Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as unknown and parse with Zod immediately".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/hardware-profile.ts` around lines 99 - 103, The injected observedHardware/hardwareProfile value must be validated with Zod before calling normalizeMachineProfile; update the logic around observedHardware (and where options.observedHardware / options.hardwareProfile are read) to treat the incoming value as unknown and run it through the appropriate Zod schema (e.g., MachineProfileSchema or ObservedHardwareSchema) using parse or safeParse, throw or log a clear validation error on failure, and only pass the successfully parsed/typed object into normalizeMachineProfile (keeping collectObservedHardwareProfile() results already typed if it returns a validated shape). Ensure you reference and use the existing schema symbol (or create one) and apply it at the boundary where observedHardware is assigned so normalizeMachineProfile always receives a validated type.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/lib/schemas.ts`:
- Around line 73-77: SignalAssessmentSchema currently allows invalid combos
(e.g. classification "trustworthy" with reasons or "tainted" with empty
reasons); update SignalAssessmentSchema (which composes
SignalAssessmentClassificationSchema and SignalAssessmentReasonSchema) to add a
refinement (use .refine or .superRefine) that enforces the invariants: if
classification === "tainted" then reasons.length > 0, if classification ===
"trustworthy" then reasons must be empty (or must not contain specific reasons
like "tool_call_not_executed"), and otherwise validate any other allowed
classification/reason relationships; on violation call ctx.addIssue (for
superRefine) with a clear message so malformed JSON fails at the dashboard
boundary.
In `@llm/implementation/multi-runtime-mvp-implementation.md`:
- Around line 118-120: The example command references a model key
(`qwen2.5-14b-instruct`) that no longer exists in the checked-in
models.example.json; update the example in multi-runtime-mvp-implementation.md
so the command uses one of the current profile keys from models.example.json (or
explicitly mark the command as a historical MVP example). Locate the command
line with `bun pb --runtimes ollama vllm --harnesses direct goose opencode
--models qwen2.5-14b-instruct --model-config models.example.json` and replace
`qwen2.5-14b-instruct` with a valid profile key from models.example.json (or add
a brief note that it’s a historical example).
In `@models.example.json`:
- Around line 14-18: The vllm example variant uses an MLX-formatted model
("Qwen/Qwen3-27B-Instruct-MLX-4bit") which vLLM doesn't support; update the
"vllm" object so modelName is a vLLM-compatible model (e.g.,
"Qwen/Qwen3-27B-Instruct") or point to a vLLM-supported quantized artifact
(GPTQ/AWQ/GGUF) and adjust the "format"/"quantization" fields accordingly to
match that supported format; locate the "vllm" variant in the JSON and replace
the modelName and format/quantization values to a vLLM-supported option.
In `@src/cli/migrate-machine-command.ts`:
- Around line 102-124: prepareArtifactRewrite currently lets JSON.parse and
validate errors surface without context; catch JSON parse errors around
JSON.parse(original) and rethrow a new error that includes artifactPath and the
original error message (preferably using the Error constructor's cause field if
available), and similarly wrap the validate(normalized) call in a try/catch to
rethrow a new error that includes artifactPath and the validation error details;
keep the original error as the cause and preserve message formatting so callers
can identify which run/plan artifact failed.
In `@src/harnesses/goose-adapter.ts`:
- Around line 482-485: The signalAssessment currently adds
"tool_call_not_executed" whenever toolCallDetected is true even if execution
evidence exists; change the appendSignalAssessmentReasons call to only include
"tool_call_not_executed" when toolCallDetected is true AND there is no execution
evidence (e.g., codeFilePath is falsy). Update the argument expression so the
array is passed conditionally (use toolCallDetected && !codeFilePath) to avoid
misclassifying runs where a tool-produced file exists.
In `@src/harnesses/opencode-adapter.ts`:
- Around line 127-140: The current code runs git commands directly against the
caller-supplied workDir (see the execa("git", ...) calls and workDir variable),
which mutates user repos; change this so we never init or reconfigure Git inside
caller workspaces: detect when workDir is user-provided and instead create a
bench-owned temporary directory (use fs.mkdtemp / fs.cp or similar to copy the
workspace or use a temp worktree), run the execa("git", ["init"...]) / git
config calls against that temp dir, and use log.warn({ error: gitErr }, ...)
unchanged for errors; remove any git init/config calls directly targeting the
original workDir.
In `@src/lib/machine-profile/instance-id.ts`:
- Around line 87-95: The EEXIST branch and the similar block at 228-243 must not
call repairPersistedInstanceId on a file another process may be creating; change
the logic in the EEXIST handling (around
readPersistedInstanceId/repairPersistedInstanceId with targetPath/value) to only
retry/re-read until a short deadline and then give up (return undefined or
throw) instead of calling repairPersistedInstanceId; ensure the same change is
applied to the other occurrence so only the process that actually created the
file performs repairPersistedInstanceId, while contenders only wait and re-read
via readPersistedInstanceId.
In `@src/lib/machine-profile/legacy.ts`:
- Around line 147-166: The code casts rawMachine to LegacyMachineProfile
(legacyMachine) after only validating hardware fields, then unconditionally
copies legacyMachine.label into displayLabel when truthy, which allows
non-string labels through; update the legacy branch to validate that
legacyMachine.label is a string before spreading it into displayLabel (e.g.,
check typeof legacyMachine.label === "string"), or parse the entire legacy
object with a strict validator before using
migrateLegacyObservedHardware/normalizeMachineProfile and buildLegacyInstanceId
so only a string label is accepted and propagated.
In `@src/lib/machine-profile/normalization.ts`:
- Around line 244-245: The cpuCoreCount should be derived in a stable way to
avoid re-keying when physical-core probing is absent: change the assignment in
normalization.ts so cpuCoreCount uses normalizedProfile.logicalCores as the
primary source and falls back to normalizedProfile.physicalCores (i.e.,
cpuCoreCount = normalizedProfile.logicalCores ??
normalizedProfile.physicalCores) so machineProfileKey remains stable across runs
even when physicalCores is missing.
- Around line 197-230: The current normalization collapses mixed-accelerator
hosts by emitting a single acceleratorKey plus a total acceleratorCount (using
selectPrimaryAccelerator, normalizeAcceleratorKey and
getObservedAcceleratorCount), which makes different mixes (e.g. 4090+4060 vs
2x4090) collide; fix by replacing the single acceleratorKey/acceleratorCount
pair with a deterministic per-accelerator summary derived from
observedHardware.accelerators: build a map of normalizeAcceleratorKey(acc) ->
count, sort keys deterministically and expose either an acceleratorCounts object
or a compact sorted summary string/array (e.g. ["4090:2","4060:1"]) instead of
acceleratorKey, and keep primary-based acceleratorMemoryGiB as-is; update any
code that reads acceleratorKey/acceleratorCount to use the new
acceleratorCounts/summary and ensure selection logic still uses
selectPrimaryAccelerator for memoryBytes.
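A sketch of the per-model summary under those assumptions (the `ObservedAcceleratorLike` shape is a stand-in for the real probe type):

```ts
interface ObservedAcceleratorLike {
  modelRaw?: string;
  count?: number;
}

/** Builds a deterministic per-model summary such as ["nvidia rtx 4090:2", "nvidia rtx 4060:1"]. */
function buildAcceleratorSummary(accelerators: ObservedAcceleratorLike[]): string[] {
  const counts = new Map<string, number>();
  for (const accelerator of accelerators) {
    const key = accelerator.modelRaw?.trim().toLowerCase() ?? "unknown";
    counts.set(key, (counts.get(key) ?? 0) + (accelerator.count ?? 1));
  }
  // Sort keys so the summary (and anything keyed off it) stays stable across probes.
  return [...counts.entries()]
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([model, count]) => `${model}:${count}`);
}
```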
In `@src/lib/machine-profile/probe-linux.ts`:
- Around line 69-77: parseLspciAccelerators() (and the consumer
collectLinuxAccelerators()) currently treats any VGA/3D/display lspci line as a
detected accelerator, which creates false positives on BMC/virtual-display
hosts; change the behavior so lspci output is treated only as enrichment: modify
parseLspciAccelerators to include a marker (e.g., kind: "enrichment" or a
boolean like inferredFromLspci) instead of producing a definitive accelerators
status, and update collectLinuxAccelerators to ignore lspci-only entries when
deciding status:"detected" (require corroborating evidence such as /dev/dri,
vendor-specific drivers, or other probes) so lspci lines only augment
modelRaw/name but do not flip a host from none_detected to detected.
In `@src/lib/machine-profile/probe-utils.ts`:
- Around line 68-84: The regex in parseMemoryBytes (the normalized.match(...)
line) must be widened to accept binary units (tib|gib|mib|kib) so inputs like "2
GiB" don't fall through as bytes; update the pattern to include those variants
and then normalize the captured unit (from match[2]) to the existing switch
cases (e.g., strip the optional "i" so "gib" becomes "gb") before the switch so
the current case labels ("tb","gb","mb","kb","b") continue to work unchanged;
ensure the default/fallback behavior still returns Math.round(amount) for
byte-only values.
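A sketch of the widened parser; the regex and multipliers here are illustrative rather than the file's exact values:

```ts
/** Parses values like "2 GiB", "16GB", or "512 mb" into bytes, or undefined if unparseable. */
function parseMemoryBytes(input: string): number | undefined {
  const normalized = input.trim().toLowerCase();
  const match = normalized.match(/^([\d.]+)\s*(tib|gib|mib|kib|tb|gb|mb|kb|b)?$/);
  if (!match) return undefined;
  const amount = Number.parseFloat(match[1] ?? "");
  if (!Number.isFinite(amount)) return undefined;
  // Fold binary units onto the existing decimal cases, e.g. "gib" -> "gb".
  const unit = (match[2] ?? "b").replace("ib", "b");
  const multipliers: Record<string, number> = {
    tb: 1024 ** 4,
    gb: 1024 ** 3,
    mb: 1024 ** 2,
    kb: 1024,
    b: 1,
  };
  return Math.round(amount * (multipliers[unit] ?? 1));
}
```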
In `@src/lib/model-profile/normalization.ts`:
- Around line 115-117: The detectTuning function is matching tuning names by
substring which produces false positives (e.g., "starcoder2" -> "coder"); update
detectTuning to match whole tokens instead of substrings by splitting the input
string into tokens (e.g., on non-alphanumeric characters) or using word-boundary
regex checks, then test each token against TUNING_PATTERNS (or test each pattern
against token equality) so only complete token matches return a tuning; keep the
function name detectTuning and the TUNING_PATTERNS reference so the change is
localized.
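A token-based sketch of that check; the pattern list shown is an example, not the repo's real `TUNING_PATTERNS`:

```ts
// Example tuning tokens; the repo's real TUNING_PATTERNS list is longer.
const TUNING_PATTERNS = ["instruct", "chat", "coder"] as const;

/** Matches tuning names against whole tokens so "starcoder2" no longer hits "coder". */
function detectTuning(modelName: string): string | undefined {
  const tokens = modelName.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  return TUNING_PATTERNS.find((pattern) => tokens.includes(pattern));
}
```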
In `@src/lib/model-profile/registry.ts`:
- Around line 127-132: The current combined try/catch around
JSON.parse(fs.readFileSync(...)) conflates filesystem read errors with JSON
parse errors; split the operation into two steps: first call
fs.readFileSync(filePath, "utf-8") inside its own try/catch and rethrow a
descriptive error that preserves the original error (include the caught
error/message and filePath), then JSON.parse the returned string in a separate
try/catch that throws the existing "Invalid JSON in model profile file" error;
update references around the parsed variable and keep filePath in the error
messages to aid debugging.
- Around line 287-301: The function findConfiguredVariantByRuntimeModel
currently returns the first matching profile and silently prefers registry
order; change it to detect multiple matches and throw an error instead of
returning the first hit. Iterate the registry using the existing symbols
(profileKey, profile), call normalizeConfiguredVariant(configuredVariant) as
before and collect all matches where normalized.modelName === runtimeModelName;
if zero matches return undefined, if exactly one return that match, and if more
than one throw a clear Error indicating an ambiguous reverse match listing the
conflicting profile keys and the { runtime, runtimeModelName } pair to aid
debugging.
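A sketch of the stricter lookup, with a simplified registry shape standing in for the real configured-profile types:

```ts
interface ConfiguredVariantLike {
  modelName: string;
}

/** Resolves a runtime model name back to exactly one profile key, or fails loudly. */
function findConfiguredVariantByRuntimeModel(
  registry: ReadonlyMap<string, ConfiguredVariantLike>,
  runtime: string,
  runtimeModelName: string,
): { profileKey: string; variant: ConfiguredVariantLike } | undefined {
  const matches = [...registry.entries()].filter(
    ([, variant]) => variant.modelName === runtimeModelName,
  );
  if (matches.length === 0) return undefined;
  if (matches.length > 1) {
    const keys = matches.map(([profileKey]) => profileKey).join(", ");
    throw new Error(
      `Ambiguous reverse match for ${runtime}/${runtimeModelName}: profiles ${keys}`,
    );
  }
  const [profileKey, variant] = matches[0];
  return { profileKey, variant };
}
```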
In `@src/schemas/model-profile.schema.ts`:
- Around line 123-125: The ModelProfileFileSchema currently uses schemaVersion:
z.string().default(SCHEMA_VERSION) which accepts any string and silently allows
incorrect versions; change it to require the exact current version (e.g.,
replace the defaulting string with a literal/strict validator referencing
SCHEMA_VERSION) so parsing fails for mismatched versions and forces explicit
migrations; update ModelProfileFileSchema to validate schemaVersion equals
SCHEMA_VERSION (using z.literal(SCHEMA_VERSION) or equivalent) to reject unknown
file versions.
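The strict variant is a one-line change; sketched here with a placeholder constant:

```ts
import { z } from "zod";

// Placeholder value; the real constant is exported by the schema module.
const SCHEMA_VERSION = "1.0.0";

export const ModelProfileFileSchema = z.object({
  schemaVersion: z.literal(SCHEMA_VERSION),
  // ...the remaining fields stay unchanged
});
```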
In `@test/schemas.test.ts`:
- Around line 272-303: The tests currently assert validation failures by
checking issue.message substrings; update the assertions to check issue.code ===
"invalid_type" combined with the field path instead (for the BenchConfigSchema
safeParse results for machineInstanceId and machineDisplayLabel) so the checks
use the semantic error code instead of brittle message text—locate the two
places referencing idResult.error.issues and labelResult.error.issues and
replace the .message.includes(...) checks with checks that issue.code ===
"invalid_type" and issue.path[0] === "machineInstanceId" (or
"machineDisplayLabel") respectively.
---
Outside diff comments:
In `@apps/dashboard/src/lib/aggregations-compare.ts`:
- Around line 117-159: The deltas are computed from full runA.runB items but
should be computed from the matched subsets; update scoringDelta and
frontierEvalDelta to use only matched rows that contain the metric.
Specifically, for scoringDelta, derive passRateA/passRateB and total counts from
matchedWithScoring by mapping to itemA and itemB (e.g., build arrays of the
matched itemA items and matched itemB items) and compute pass rates/totals from
those instead of passRateA/passRateB; for frontierEvalDelta call
computeFrontierStats on the matchedWithFrontierEval itemA and itemB subsets (map
to their itemA/itemB arrays) and compute avgScoreDelta from those results. Keep
metricAvailability.comparedRows as matchedWithScoring.length /
matchedWithFrontierEval.length.
In `@src/results/aggregate.ts`:
- Around line 374-405: machineSummary is only seeded from deduped items so
machines with zero matched items are omitted; seed machineSummary from the
matchedRuns metadata first (create an entry per matched run's machineProfileKey
with itemCount: 0, machineProfileLabel from the matched run and
verificationStatus from its metadata), then iterate items and increment
itemCount as currently done; this ensures the existing loop that updates
current.itemCount and the later mapping to MachineAggregateSummary (the machines
array and summary.machines) will include profiles with zero items while keeping
the rest of the aggregation logic (profileRunSet, profileInstanceSet,
verificationStatus, runCount, instanceCount) unchanged.
---
Duplicate comments:
In `@apps/dashboard/scripts/build-index.ts`:
- Around line 62-67: isSameOrNestedPath currently compares raw strings using "/"
which breaks on Windows; fix by normalizing both paths with Node's path
utilities (e.g., import path from 'node:path'), resolve/normalize the inputs
(path.resolve or path.normalize), then perform the overlap check using
path.relative: compute const rel = path.relative(resolvedBase,
resolvedCandidate) and return rel === '' || (!rel.startsWith('..') &&
!path.isAbsolute(rel)); update the implementation in isSameOrNestedPath (and any
duplicate instances around lines 511-523) and ensure callers such as
writePublishedRunBundles use the updated function so rm()'s guard is
separator-safe across platforms.
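A sketch of the separator-safe containment check described above:

```ts
import path from "node:path";

/** True when candidate equals base or lives inside it, independent of path separators. */
function isSameOrNestedPath(base: string, candidate: string): boolean {
  const rel = path.relative(path.resolve(base), path.resolve(candidate));
  return rel === "" || (!rel.startsWith("..") && !path.isAbsolute(rel));
}
```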
In `@src/lib/machine-profile/probe-utils.ts`:
- Around line 98-123: The dedupe key currently includes accelerator.vendor which
causes identical devices to be split when vendor is missing; change the key
generation in probe-utils.ts to omit vendor and rely on accelerator.modelRaw
(and any other stable fields) instead — update the key construction where const
key = [...] is created (referencing variables key, deduped, accelerator,
modelRaw) so it uses modelRaw.trim().toLowerCase() (and not vendor) when
building the dedupe key, then keep the subsequent merging logic (merged
vendor/memoryBytes/backend/kind and count) as-is so missing vendor values are
filled from other entries.
In `@test/workspace-prompt-parity.test.ts`:
- Around line 59-87: Update the test to also assert the required return-type
semantics: check that calculator blind prompt declares add(n: number) and
subtract(n: number) with a numeric return (e.g., "add(n: number): number" and
"subtract(n: number): number"), that memoryRecall and memoryClear include their
return signatures (e.g., "memoryRecall(): number" or "memoryClear():
void/number" as appropriate), and that clear includes its declared return; and
for the emitter prompt assert return behaviors such as "listenerCount(event):
number", "on(event, listener): this", "once(event, listener): this", "off(event,
listener): this", and "emit(event, payload): boolean" so the test enforces both
method names and their required return/contract clauses.
---
Nitpick comments:
In `@apps/dashboard/src/lib/api.ts`:
- Around line 12-15: The import of parseKnownPlanPayload and
parseKnownRunPayload uses a brittle deep relative path; to fix, re-export these
helpers from a stable shared entry (e.g., add exports for parseKnownPlanPayload
and parseKnownRunPayload in the library's public index/packaged entry) or add a
path alias in the dashboard build config and update the import to the alias;
make sure to update the module that currently defines them (the legacy
machine-profile module) to export them from the shared entry so imports of
parseKnownPlanPayload and parseKnownRunPayload become short and robust.
In `@apps/dashboard/src/lib/types.ts`:
- Around line 84-93: The ModelVariant interface widens runtime to string which
diverges from the canonical runtime type; change ModelVariant.runtime to use the
shared runtime type (e.g., import and use the Runtime/CanonicalRuntime type used
by the CLI/shared schemas) instead of string so TypeScript enforces only allowed
runtimes; update the import and any affected usages (e.g., places constructing
ModelVariant or ModelProfile) to satisfy the new type.
In `@src/lib/hardware-profile.ts`:
- Around line 99-103: The injected observedHardware/hardwareProfile value must
be validated with Zod before calling normalizeMachineProfile; update the logic
around observedHardware (and where options.observedHardware /
options.hardwareProfile are read) to treat the incoming value as unknown and run
it through the appropriate Zod schema (e.g., MachineProfileSchema or
ObservedHardwareSchema) using parse or safeParse, throw or log a clear
validation error on failure, and only pass the successfully parsed/typed object
into normalizeMachineProfile (keeping collectObservedHardwareProfile() results
already typed if it returns a validated shape). Ensure you reference and use the
existing schema symbol (or create one) and apply it at the boundary where
observedHardware is assigned so normalizeMachineProfile always receives a
validated type.
In `@src/lib/signal-assessment.ts`:
- Around line 34-46: The exported function createTaintedSignalAssessment now
throws when called with an empty reasons array; update its JSDoc/TSDoc to
include an `@throws` tag that documents this behavior (e.g., state that it throws
Error when reasons is empty or contains no unique reasons) and mention the error
message ("createTaintedSignalAssessment called with empty reasons") so callers
and types/docs generators know the failure mode.
In `@src/results/compare.ts`:
- Around line 142-220: Move the aggregate/trusted-metric helpers into a new
sibling module and import them from compare.ts: extract the interfaces
AggregateScoring and AggregateFrontier and the functions calculatePassRate,
computeAggregateScoring, and computeAggregateFrontier into a new file (e.g.,
results/summaryMetrics.ts), export them, and replace the local definitions in
compare.ts with imports of those exported symbols; ensure all references in
compare.ts (and any other files) use the imported names and update any
test/module imports accordingly so tests compile against the new module.
In `@src/schemas/common.schema.ts`:
- Around line 157-219: This file mixes unrelated schema domains; extract the
signal-assessment block (SignalAssessmentSchema,
SignalAssessmentClassificationSchema, SignalAssessmentReasonSchema,
SignalAssessment type and const arrays signalAssessmentClassifications and
signalAssessmentReasonTypes) into its own module (e.g.,
signal-assessment.schema.ts) and move the machine-profile related block (lines
~266-475) into a separate module (e.g., machine-profile.schema.ts); update
exports/imports so any places importing these symbols now import from the new
modules (or re-export from common.schema.ts if you need backward compatibility),
and run type checks to ensure no references were broken.
In `@test/dashboard-schema-compat.test.ts`:
- Around line 1-3: The file header in test/dashboard-schema-compat.test.ts only
documents purpose; add the required TS file header metadata by expanding the top
comment to include an "Exports" section (even if "none") and an "Invariants"
section describing any guarantees for the test file, matching the repo's header
convention used elsewhere; ensure the header remains a block comment and follows
the same phrasing/format as other test files so tooling and readers can rely on
the documented purpose, exports (none), and invariants.
In `@test/workspace-prompt-parity.test.ts`:
- Around line 1-3: Update the file header at the top of
test/workspace-prompt-parity.test.ts to follow the repo standard: keep the
Purpose line and add explicit Exports (list any exported symbols from this test
file, or state "none" if it exports nothing) and an Invariants section
describing guarantees this test enforces (e.g., prompt/score parity must remain
stable, deterministic seeding, no external network calls). Ensure the header is
a short 3-part block: Purpose, Exports, Invariants.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 35865241-278e-41d0-ab2d-f2eb15a14635
📒 Files selected for processing (100)
README.md
apps/dashboard/public/results/20260305-135118-b11d08/plan.json
apps/dashboard/public/results/20260305-135118-b11d08/run.json
apps/dashboard/public/results/20260305-144041-4d5950/plan.json
apps/dashboard/public/results/20260305-144041-4d5950/run.json
apps/dashboard/public/results/20260306-135715-91b0ee/plan.json
apps/dashboard/public/results/20260306-135715-91b0ee/run.json
apps/dashboard/public/results/20260307-161553-c25c6a/plan.json
apps/dashboard/public/results/20260307-161553-c25c6a/run.json
apps/dashboard/public/results/20260316-141600-6e18e2/plan.json
apps/dashboard/public/results/20260316-141600-6e18e2/run.json
apps/dashboard/public/results/20260317-082000-7b3bf3/plan.json
apps/dashboard/public/results/20260317-082000-7b3bf3/run.json
apps/dashboard/public/results/20260317-170728-697a9a/plan.json
apps/dashboard/public/results/20260317-170728-697a9a/run.json
apps/dashboard/public/results/20260318-091214-a0dbda/plan.json
apps/dashboard/public/results/20260318-091214-a0dbda/run.json
apps/dashboard/public/results/20260319-090246-e5f8be/plan.json
apps/dashboard/public/results/20260319-090246-e5f8be/run.json
apps/dashboard/public/results/aggregates/chk_sha256v1_2c13ad3e58b3.json
apps/dashboard/public/results/aggregates/chk_sha256v1_61a037f87105.json
apps/dashboard/public/results/aggregates/latest.json
apps/dashboard/public/results/index.json
apps/dashboard/scripts/build-index.ts
apps/dashboard/src/components/leaderboard/leaderboard-filters.ts
apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx
apps/dashboard/src/components/leaderboard/leaderboard-results-table.tsx
apps/dashboard/src/components/leaderboard/leaderboard-summary-cards.tsx
apps/dashboard/src/components/run-detail/run-detail-page.tsx
apps/dashboard/src/components/run-list/checkpoint-groups.ts
apps/dashboard/src/components/run-list/run-card.tsx
apps/dashboard/src/components/run-list/run-list-page.tsx
apps/dashboard/src/lib/aggregations-compare.ts
apps/dashboard/src/lib/api.ts
apps/dashboard/src/lib/schemas.ts
apps/dashboard/src/lib/types.ts
llm/implementation/multi-runtime-mvp-implementation.md
models.example.json
models.json
src/cli/compare-command.ts
src/cli/index.ts
src/cli/migrate-machine-command.ts
src/cli/run-command.ts
src/harnesses/code-output-policy.ts
src/harnesses/direct-adapter.ts
src/harnesses/goose-adapter.ts
src/harnesses/harness.ts
src/harnesses/opencode-adapter.ts
src/lib/hardware-profile.ts
src/lib/machine-profile/instance-id.ts
src/lib/machine-profile/legacy.ts
src/lib/machine-profile/normalization.ts
src/lib/machine-profile/probe-linux.ts
src/lib/machine-profile/probe-macos.ts
src/lib/machine-profile/probe-utils.ts
src/lib/machine-profile/probe-windows.ts
src/lib/machine-profile/probe.ts
src/lib/model-profile/normalization.ts
src/lib/model-profile/registry.ts
src/lib/model-profiles.ts
src/lib/signal-assessment.ts
src/lib/stats-format.ts
src/lib/stats.ts
src/results/aggregate.ts
src/results/compare.ts
src/results/reader.ts
src/runner/generation-retry.ts
src/runner/index.ts
src/runner/item-executor.ts
src/runner/item-retry.ts
src/runner/plan-builder.ts
src/schemas/common.schema.ts
src/schemas/config.schema.ts
src/schemas/index.ts
src/schemas/model-profile.schema.ts
src/schemas/plan.schema.ts
src/schemas/result.schema.ts
src/tests/calculator-stateful/prompt.blind.md
src/tests/event-emitter/prompt.blind.md
src/tests/file-delete-smoke/prompt.blind.md
src/tests/rate-limiter/prompt.blind.md
src/tests/safe-cleanup/prompt.blind.md
src/tests/workspace-smoke/test.meta.json
test/aggregate.test.ts
test/build-index.test.ts
test/code-output-policy.test.ts
test/compare-checkpoint-guard.test.ts
test/compare.test.ts
test/dashboard-schema-compat.test.ts
test/goose-adapter.test.ts
test/hardware-profile.test.ts
test/machine-profile-migration.test.ts
test/machine-profile-probe.test.ts
test/model-profile-registry.test.ts
test/opencode-adapter.test.ts
test/plan-builder.test.ts
test/schemas.test.ts
test/signal-assessment.test.ts
test/stats.test.ts
test/workspace-prompt-parity.test.ts
@CodeRabbit full review
✅ Actions performed
Full review triggered.
Actionable comments posted: 13
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
llm/implementation/multi-runtime-mvp-implementation.md (1)
189-200: ⚠️ Potential issue | 🟡 Minor
The validation-evidence block now contradicts itself. Line 189 says this run used canonical profile `qwen3-27b-instruct`, but Lines 199-200 still list `qwen2.5` runtime model names. Either the canonical profile name or the evidence block is stale, so readers cannot tell which model family this section is actually documenting.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@llm/implementation/multi-runtime-mvp-implementation.md` around lines 189 - 200, The validation-evidence block is inconsistent: the header lists canonical model profile `qwen3-27b-instruct` but the evidence lines reference `qwen2.5` models (e.g., ollama `qwen2.5:14b`, vLLM `Qwen/Qwen2.5-14B-Instruct`); pick the correct model family and make the section consistent by either changing the canonical profile to `qwen2.5` or updating the evidence entries to the corresponding `qwen3-27b-instruct` runtime names (update all occurrences of `qwen2.5` in the evidence block and any by-model entries to match `qwen3-27b-instruct` naming).apps/dashboard/scripts/build-index.ts (1)
174-231: ⚠️ Potential issue | 🟠 Major
Don't publish raw machine instance IDs. `sanitizePublishedRun()` keeps `run.machine.instanceId`, and this builder now also copies `machineInstanceId` into `index.json`. If those IDs come from hostnames or other per-host labels, the public dashboard leaks stable machine identifiers even though the published data only needs profile grouping and counts. Hash or omit the raw ID before publication.
Also applies to: 422-423
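One possible scrubbing step, assuming Node's `crypto` is acceptable at publish time (the salt handling and `anon_` prefix are illustrative):

```ts
import { createHash } from "node:crypto";

/** Replaces a raw machine instance id with a short, non-reversible token. */
function scrubInstanceId(instanceId: string, salt: string): string {
  const digest = createHash("sha256").update(`${salt}:${instanceId}`).digest("hex");
  return `anon_${digest.slice(0, 12)}`;
}
```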
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/scripts/build-index.ts` around lines 174 - 231, sanitizePublishedRun currently preserves run.machine.instanceId (and the build also copies machineInstanceId into index.json), leaking stable per-host identifiers; update sanitizePublishedRun to either remove/undefine run.machine.instanceId or replace it with a non-reversible scrubbed value (e.g. a salted hash or normalized token) before returning the RunResult, and ensure the same transformation is applied wherever machineInstanceId is copied into index.json (the code that writes machineInstanceId should use the same sanitizer/tokenizer); reference the sanitizePublishedRun function and the run.machine.instanceId / machineInstanceId fields when making the change.src/results/compare.ts (1)
231-254: ⚠️ Potential issue | 🟠 Major
Keep one-sided metric loss out of matched row deltas. These new summary filters correctly exclude rows where only one side has `automatedScore` or `frontierEval`, but `computeDeltas()` still substitutes missing metrics with zeroes. That makes matched-row output show fake regressions/improvements for coverage drift. Return `null` unless both sides carry the metric, and expose availability changes separately if you still want that signal.
💡 Suggested alignment with the new summary semantics
```diff
-  if (a.automatedScore || b.automatedScore) {
-    const scoreA = a.automatedScore || { passed: 0, failed: 0, total: 0 };
-    const scoreB = b.automatedScore || { passed: 0, failed: 0, total: 0 };
-    const passRateA = calculatePassRate(a.automatedScore) ?? 0;
-    const passRateB = calculatePassRate(b.automatedScore) ?? 0;
+  if (a.automatedScore && b.automatedScore) {
+    const scoreA = a.automatedScore;
+    const scoreB = b.automatedScore;
+    const passRateA = calculatePassRate(scoreA) ?? 0;
+    const passRateB = calculatePassRate(scoreB) ?? 0;
@@
-  if (a.frontierEval || b.frontierEval) {
-    const scoreA = a.frontierEval?.score ?? 0;
-    const scoreB = b.frontierEval?.score ?? 0;
+  if (a.frontierEval && b.frontierEval) {
+    const scoreA = a.frontierEval.score;
+    const scoreB = b.frontierEval.score;
```
Also applies to: 293-330
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/results/compare.ts` around lines 231 - 254, computeDeltas currently substitutes missing automatedScore/frontierEval with zeroes causing spurious deltas; change the logic in computeDeltas (the automatedScore and frontierEval blocks) to only compute and return a ScoreDelta or EvalDelta when both a and b actually have the metric (i.e., a.automatedScore && b.automatedScore for automatedScore, and a.frontierEval && b.frontierEval for frontierEval); otherwise set automatedScore/frontierEval to null and handle availability changes separately if you need that signal; apply the same fix to the other similar block later in computeDeltas (around the second automatedScore/frontierEval handling).src/results/aggregate.ts (1)
374-421: ⚠️ Potential issue | 🟠 Major
Don't keep only the first verification status for a merged machine profile. Lines 384-388 seed each `machineProfileKey` with the first status seen, and Lines 404-417 emit that value unchanged. Now that multiple runs and instances collapse under one profile, the summary can report a profile as `verified` even when some contributing runs are only `self_reported` or `rejected`.
🛠️ Possible fix
```diff
   const machineSummary = new Map<
     string,
-    { machineProfileLabel?: string; verificationStatus: VerificationStatus; itemCount: number }
+    {
+      machineProfileLabel?: string;
+      verificationStatuses: Set<VerificationStatus>;
+      itemCount: number;
+    }
   >();
   machineSummary.set(machineProfileKey, {
     machineProfileLabel: metadata.machineProfileLabel,
-    verificationStatus: metadata.verificationStatus,
+    verificationStatuses: new Set([metadata.verificationStatus]),
     itemCount: 0,
   });
@@
     if (!current) {
       machineSummary.set(item.machineProfileKey, {
         machineProfileLabel: item.machineProfileLabel,
-        verificationStatus: item.verificationStatus,
+        verificationStatuses: new Set([item.verificationStatus]),
         itemCount: 1,
       });
       continue;
     }
     current.itemCount += 1;
+    current.verificationStatuses.add(item.verificationStatus);
   }
@@
-      verificationStatus: value.verificationStatus,
+      verificationStatus:
+        value.verificationStatuses.has("rejected")
+          ? "rejected"
+          : value.verificationStatuses.size === 1
+            ? [...value.verificationStatuses][0]!
+            : "self_reported",
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/results/aggregate.ts` around lines 374 - 421, The code seeds machineSummary with the first VerificationStatus seen (in the matchedRuns loop) and never updates it, so merged profiles can show an incorrect status; change this by computing an aggregated status per profile instead of taking the first one — add a small helper (e.g., computeAggregateVerificationStatus or a verification precedence map where rejected > self_reported > verified) and use it whenever you set or update machineSummary entries (referencing machineSummary, matchedRuns, items, and the VerificationStatus type); update the logic in both the matchedRuns seeding block and the items-processing block to combine the existing status with the new status using that helper so the final MachineAggregateSummary reflects the worst/most conservative status across all runs and instances (profileRunSet/profileInstanceSet remain unchanged).
♻️ Duplicate comments (2)
src/cli/migrate-machine-command.ts (1)
127-135: ⚠️ Potential issue | 🟡 Minor
Wrap normalization failures with the artifact path too. `normalizeKnownRunPayload()`/`normalizeKnownPlanPayload()` can still throw before `validate()` runs. Those exceptions bypass the artifact-scoped wrapper, so bulk migration failures still won't tell you which `run.json`/`plan.json` caused them.
🛠️ Minimal patch
```diff
-  const normalized = normalize(parsed);
-  try {
-    validate(normalized);
+  let normalized: unknown;
+  try {
+    normalized = normalize(parsed);
+    validate(normalized);
   } catch (error) {
     throw new Error(
       `Invalid migrated artifact ${artifactPath}: ${error instanceof Error ? error.message : String(error)}`,
       { cause: error },
     );
   }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/cli/migrate-machine-command.ts` around lines 127 - 135, Normalization errors from normalizeKnownRunPayload()/normalizeKnownPlanPayload() (called via normalize(parsed)) aren't wrapped with the artifact path, so exceptions escape the artifact-scoped error message; wrap the normalize(parsed) call in a try/catch and on error throw a new Error that includes `artifactPath` (same pattern used for validate) and pass the caught error as the cause, updating the block around normalize(parsed) / validate(normalized) so both normalization and validation failures are reported with the artifact path.src/lib/machine-profile/probe-linux.ts (1)
69-77: ⚠️ Potential issue | 🟠 Major
Keep `lspci` entries out of the returned accelerator list unless they're corroborated. This still appends every `lspci` VGA/3D/display line into `accelerators`, and once one vendor probe succeeds `hasCorroboratedAccelerators` makes the whole merged list publishable. A machine with one real CUDA device plus a BMC/virtual display controller will therefore report extra fake accelerators and the wrong `acceleratorCount`/`profileKey`.
🛠️ One safe direction
```diff
-  const accelerators: ObservedAccelerator[] = [];
+  const corroboratedAccelerators: ObservedAccelerator[] = [];
+  let lspciAccelerators: ObservedAccelerator[] = [];
@@
   if (nvidia) {
     probeCount++;
-    accelerators.push(...parseNvidiaAccelerators(nvidia));
+    corroboratedAccelerators.push(...parseNvidiaAccelerators(nvidia));
   }
@@
   if (rocm) {
     probeCount++;
-    accelerators.push(...parseRocmAccelerators(rocm));
+    corroboratedAccelerators.push(...parseRocmAccelerators(rocm));
   }
@@
   if (lspci) {
     probeCount++;
-    accelerators.push(...parseLspciAccelerators(lspci));
+    lspciAccelerators = parseLspciAccelerators(lspci);
   }
-  const dedupedAccelerators = dedupeAccelerators(accelerators);
-  const hasCorroboratedAccelerators = accelerators.some(
-    (accelerator) =>
-      accelerator.backend !== undefined || accelerator.vendor !== undefined,
-  );
+  const hasCorroboratedAccelerators = corroboratedAccelerators.length > 0;
+  const dedupedAccelerators = dedupeAccelerators(corroboratedAccelerators);
```
Also applies to: 107-123
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-linux.ts` around lines 69 - 77, parseLspciAccelerators is currently returning every VGA/3D/display lspci line as a confirmed accelerator; change it to mark those entries as uncorroborated candidates (e.g., add a property like source: "lspci" and corroborated: false or kind: "candidate") instead of treating them as real devices, and update the merging logic that uses hasCorroboratedAccelerators to only promote entries with corroborated: true (or vendor-specific kind) when a vendor probe (e.g., NVIDIA/AMD probe functions) confirms a match; apply the same change to the other lspci parsing block referenced around lines 107-123 so lspci-only entries never inflate acceleratorCount or profileKey unless explicitly corroborated.
🧹 Nitpick comments (4)
test/workspace-prompt-parity.test.ts (1)
64-98: Consider adding assertions for `multiply` and `divide` method signatures.
The test now covers most of the updated prompt contracts, addressing the prior review feedback. However, the calculator prompt at line 13 includes `multiply(n: number)` and `divide(n: number)`, which aren't asserted here. Adding these would complete coverage of all explicit method signatures.
♻️ Suggested additions
```diff
   expect(calculatorPrompt).toContain("`add(n: number)`");
   expect(calculatorPrompt).toContain("`subtract(n: number)`");
+  expect(calculatorPrompt).toContain("`multiply(n: number)`");
+  expect(calculatorPrompt).toContain("`divide(n: number)`");
   expect(calculatorPrompt).toContain("`clear()`");
```
Verify each finding against the current code and only fix it if needed. In `@test/workspace-prompt-parity.test.ts` around lines 64 - 98, The test "blind coding prompts declare the required public method contracts" is missing assertions for the calculator's multiply and divide method signatures; update the test to assert that calculatorPrompt contains "`multiply(n: number)`" and "`divide(n: number)`" (add these alongside the existing expects for add/subtract/clear/memoryRecall/etc.), referencing the calculatorPrompt variable so the new assertions follow the same pattern as the other expect(...) calls.src/runner/item-executor.ts (1)
186-188: Generation-level signal assessment may be lost when scoring succeeds without retry.
When `scoringOutcome.signalAssessment` is undefined (scoring succeeded without using retry), line 187 unconditionally overwrites `signalAssessment`, discarding any generation-level taints (e.g., `tool_permission_denied` from the harness). The subsequent finalization at lines 302-307 builds from `existing: undefined`, losing this context.
💡 Suggested fix
generation = scoringOutcome.generation; - signalAssessment = scoringOutcome.signalAssessment; + // Preserve generation-level assessment if scoring didn't produce one + // (only overwrite if retry was promoted and returned an assessment) + if (scoringOutcome.signalAssessment !== undefined) { + signalAssessment = scoringOutcome.signalAssessment; + }This differs from the past review comment's concern (which focused on replaced generations), but addresses a related issue where generation-level taints are unconditionally discarded.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/runner/item-executor.ts` around lines 186 - 188, The code unconditionally assigns signalAssessment = scoringOutcome.signalAssessment which discards any pre-existing generation-level assessment when scoringOutcome.signalAssessment is undefined; update the assignment in the block that handles scoringOutcome (around variables generation, scoringOutcome, scoringResult) to preserve existing generation-level signalAssessment by only replacing it when scoringOutcome.signalAssessment is defined (e.g., signalAssessment = scoringOutcome.signalAssessment ?? signalAssessment or an explicit undefined check), or merge the two assessments if needed before the finalization logic that builds the final result from existing signalAssessment so generation-level taints (like tool_permission_denied) are not lost.test/plan-builder.test.ts (1)
325-340: Use the current schema version in these new model-profile regressions.
These cases are validating canonical model-profile behavior, but they all still build plans with `schemaVersion: "0.4.0"`. Elsewhere in this PR the persisted plan/run schema is already on `0.5.0`, so these assertions are only exercising the legacy compatibility path. Prefer the current schema constant/version here and keep explicit legacy-input coverage in a dedicated migration test.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/plan-builder.test.ts` around lines 325 - 340, Replace the hard-coded legacy schemaVersion "0.4.0" in these test fixtures with the project's current schema version constant (i.e., use the exported schema/current version constant instead of the string literal) so the model-profile regression cases exercise the current plan/run schema path; update the schemaVersion property in the object shown (and the other occurrences you noted at 421-436, 490-505, 543-558) to reference the canonical schema constant (e.g., SCHEMA_VERSION or CURRENT_SCHEMA_VERSION) rather than the old "0.4.0".src/results/compare.ts (1)
142-220: Split the summary/availability logic into a helper module.
This file is now over the repo's 500-line limit, and the new scoring/frontier/signal code makes `compare.ts` handle matching, per-row deltas, aggregate math, trusted filtering, and formatting in one place. Extracting the aggregate-summary pieces would keep the compare entrypoint readable and bring the file back under the project limit. As per coding guidelines, "Keep files under 500 lines; split aggressively by responsibility".
Also applies to: 270-429
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/results/compare.ts` around lines 142 - 220, The compare.ts file is too large and should have the aggregate-summary logic split into a helper module: extract the AggregateScoring and AggregateFrontier interfaces and the functions calculatePassRate, computeAggregateScoring, and computeAggregateFrontier into a new module (e.g., results/aggregate.ts), export them, and replace their original definitions in compare.ts with imports; update any references to these symbols (calculatePassRate, computeAggregateScoring, computeAggregateFrontier, AggregateScoring, AggregateFrontier) to use the new module, ensure types/exports align with existing signatures (same parameters and return types), and run type checks to fix any missing imports or exported names.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/lib/schemas.ts`:
- Around line 494-511: The transform in DashboardIndexLegacyOrCurrentSchema
returns raw RunListItem[] when Array.isArray(index) which leaves callers with an
array instead of the normalized object shape; change the transform so the Array
branch normalizes the legacy array into the current object shape (same shape
produced for legacy object inputs): return an object with schemaVersion: 3,
generatedAt (derive from the array items or set a sensible default),
latestCheckpointId (derive or null), runs mapped from the array, and checkpoints
mapped converting machineCount -> instanceCount (same mapping used in the
non-array legacy branch), ensuring both array and legacy-object inputs always
yield the current Dashboard index shape.
In `@src/harnesses/code-output-policy.ts`:
- Around line 127-145: getAcceptedOutputTaintReasons currently treats every
extracted.method === "raw" as untainted; change it to run the same purity checks
as non-raw outputs: for raw extractions, compare trimmed rawOutput to
extracted.code.trim() and test SINGLE_CODE_BLOCK_REGEX, returning [] only when
the raw text is pure code (exact match or single code block); if mixed prose is
detected return the appropriate taint reason (e.g., "mixed_prose_salvaged" or
"output_contract_violation") instead of []. Apply the same purity-check logic to
the other spots noted (the other uses of raw acceptance logic referenced in
evaluateCodeOnlyOutput()/hasRawCodeSignals) so accepted raw outputs are
validated before being marked untainted.
In `@src/harnesses/direct-adapter.ts`:
- Around line 99-112: The current retry branch uses the nested retryResult's
promptTokens/completionTokens but drops the initial call's token usage,
undercounting tokens; change the logic in the evaluateCodeOnlyOutput/retry flow
so you store the initial generation result (the first call to
createDirectAdapter().generate) and, if a retry is performed
(decision.shouldRetry && !isRetryAttempt), sum promptTokens and completionTokens
(and totalTokens if present) from the initial result and the retryResult, then
return an object that merges the retryResult data but with aggregated
promptTokens/completionTokens/totalTokens and the durationMs computed from
Math.round(performance.now() - startTime); update any references to
promptTokens/completionTokens in the returned object (and tests if present) to
use these aggregated values.
In `@src/lib/machine-profile/instance-id.ts`:
- Around line 45-68: resolveDefaultInstanceIdPath currently uses os.homedir()
directly so injected env is ignored; change it to prefer the injected env home
values (e.g. readNonEmpty(env.HOME) or platform-appropriate HOME-like vars) and
only fall back to os.homedir() if those are unset; update the const homeDir
assignment in resolveDefaultInstanceIdPath (and any callers such as
resolveMachineInstanceId) to use the injected env first so XDG_STATE_HOME /
LOCALAPPDATA fallbacks behave correctly under test injection.
In `@src/lib/machine-profile/normalization.ts`:
- Around line 310-320: buildMachineProfileLabel currently only shows the primary
accelerator (via selectPrimaryAccelerator and accelerator?.modelRaw) which makes
mixed or multi-device summaries ambiguous; change the acceleratorLabel logic in
buildMachineProfileLabel to use normalizedProfile.acceleratorSummary when the
summary indicates a mixed set or when any device in
observedHardware.accelerators has count > 1, otherwise fall back to the primary
accelerator modelRaw; inspect observedHardware.accelerators for counts and
normalizedProfile.acceleratorSummary for a precomputed human string and include
counts (e.g., "2x RTX 4090" or "RTX 4090 + RTX 4060") in the returned label.
In `@src/lib/machine-profile/probe-macos.ts`:
- Around line 51-61: The code currently allows parsed VRAM value 0 to be treated
as “missing” because a falsy-coalescing operator was used; ensure you use
nullish coalescing so 0 is preserved: change any `||` (or other falsy checks) to
`??` when assigning `memoryBytes` from `parseMemoryBytes(...)` for
`display.spdisplays_vram` and `display.spdisplays_vram_shared` (and the same
pattern at the other occurrence around the second GPU check), so a returned 0 is
kept instead of being overridden.
In `@src/lib/machine-profile/probe-utils.ts`:
- Around line 100-109: When deduping by accelerator.modelRaw (key) in
probe-utils.ts you must not sum counts on collision; locate the block where key,
current, and merged (type ObservedAccelerator) are created and remove the
addition logic that sets count to (current.count ?? 1) + (accelerator.count ??
1). Instead preserve the already-recorded count from current (e.g., set
merged.count = current.count ?? 1 or omit count so current’s value remains), so
duplicate probe reports don’t inflate device counts.
In `@src/lib/machine-profile/probe-windows.ts`:
- Around line 37-55: The current entries.flatMap callback drops entries that
lack a usable Name, causing collectWindowsAccelerators() to report
"none_detected"; instead, convert malformed entries into an ObservedAccelerator
with kind: "unavailable" so they surface as unavailable rather than omitted.
Modify the flatMap callback in parseWindowsAccelerators()/probe-windows.ts so
that when (entry as { Name?: unknown }).Name is missing or empty you return an
ObservedAccelerator object (include memoryBytes if AdapterRAM is present, and
set modelRaw to undefined or an empty string per the ObservedAccelerator shape)
with kind: "unavailable" rather than returning []; make the same change in the
duplicate block referenced (lines ~81-87) to ensure consistent behavior.
In `@src/lib/model-profile/normalization.ts`:
- Around line 251-270: The code currently sets family = args.profile.family ??
args.profileKey which incorrectly uses the whole profileKey as the family for
minimal backfilled profiles; change this to derive family (and related
parametersBillions/tuning fallbacks) from the profileKey instead of using
args.profileKey verbatim — e.g., parse args.profileKey (or reuse
configuredVariant.modelName if that encodes the model) to extract the base
family string, then feed that extracted family into buildCanonicalProfileLabel
and into the detectParametersBillions/detectTuning fallbacks so
canonical.family, parametersBillions, and tuning are computed from the parsed
profileKey rather than the full key.
In `@src/lib/model-profiles.ts`:
- Around line 1-6: Add the missing "Invariant:" header block at the top of
src/lib/model-profiles.ts describing the file-level guarantees required by
src/**/*.{ts,tsx} files; include brief, declarative invariants such as stable
exported API names (loadModelProfiles, parseInlineModelProfile,
parseInlineModelProfiles, mergeModelProfiles, resolveModelSelection,
buildResolvedModelProfile, getModelIdentityKey), expected input/output contracts
(e.g., deterministic results, no global side effects), and any
preconditions/postconditions or error behavior so the header contains Purpose,
Exports, and Invariant sections.
In `@src/schemas/model-profile.schema.ts`:
- Around line 97-106: The ConfiguredModelProfileSchema currently allows any
non-empty string as keys for the variants record, so invalid runtime names slip
through; change the variants definition to validate keys using the existing
RuntimeNameSchema instead of z.string(), i.e. make variants a z.record keyed by
RuntimeNameSchema with values of ConfiguredModelVariantValueSchema so invalid
runtime keys are rejected at parse time (this touches
ConfiguredModelProfileSchema, the variants property, and uses RuntimeNameSchema
and ConfiguredModelVariantValueSchema that normalization.ts and registry.ts
expect).
In `@src/tests/workspace-smoke/test.meta.json`:
- Around line 7-10: The test's requiredHarnessCapabilities array
(requiredHarnessCapabilities) no longer includes "workspace-mkdir", but the test
writes logs/session.log and needs the harness to create the logs/ directory; add
"workspace-mkdir" back into the requiredHarnessCapabilities array so the harness
knows to allow directory creation for this test.
In `@test/stats.test.ts`:
- Around line 35-52: The createRunStats fixture replaces the entire timing
object when overrides include timing, losing defaults; modify createRunStats to
deep-merge the provided overrides.timing into the default timing object (rather
than spreading overrides last), so default timing fields (avgGenerationMs,
avgScoringMs, avgFrontierEvalMs, minGenerationMs, maxGenerationMs) are preserved
while allowing selective overrides; ensure the function still returns a RunStats
shape consistent with calculateRunStats by merging timing = { ...defaultTiming,
...(overrides.timing || {}) } and then spreading other overrides.
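A sketch of that deep merge; the timing field names come from the comment above, while `defaultRunStats` and the overall fixture shape are placeholders:

```ts
interface RunStatsTiming {
  avgGenerationMs: number;
  avgScoringMs: number;
  avgFrontierEvalMs: number;
  minGenerationMs: number;
  maxGenerationMs: number;
}

const defaultTiming: RunStatsTiming = {
  avgGenerationMs: 1200,
  avgScoringMs: 300,
  avgFrontierEvalMs: 0,
  minGenerationMs: 900,
  maxGenerationMs: 1500,
};

// Placeholder for the rest of the fixture's defaults.
const defaultRunStats = { totalItems: 0 };

/** Keeps timing defaults while letting a test override individual timing fields. */
function createRunStats(overrides: { timing?: Partial<RunStatsTiming> } = {}) {
  return {
    ...defaultRunStats,
    timing: { ...defaultTiming, ...(overrides.timing ?? {}) },
  };
}

// Only maxGenerationMs changes; the other timing defaults survive the merge.
const stats = createRunStats({ timing: { maxGenerationMs: 2500 } });
```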
---
Outside diff comments:
In `@apps/dashboard/scripts/build-index.ts`:
- Around line 174-231: sanitizePublishedRun currently preserves
run.machine.instanceId (and the build also copies machineInstanceId into
index.json), leaking stable per-host identifiers; update sanitizePublishedRun to
either remove/undefine run.machine.instanceId or replace it with a
non-reversible scrubbed value (e.g. a salted hash or normalized token) before
returning the RunResult, and ensure the same transformation is applied wherever
machineInstanceId is copied into index.json (the code that writes
machineInstanceId should use the same sanitizer/tokenizer); reference the
sanitizePublishedRun function and the run.machine.instanceId / machineInstanceId
fields when making the change.
In `@llm/implementation/multi-runtime-mvp-implementation.md`:
- Around line 189-200: The validation-evidence block is inconsistent: the header
lists canonical model profile `qwen3-27b-instruct` but the evidence lines
reference `qwen2.5` models (e.g., ollama `qwen2.5:14b`, vLLM
`Qwen/Qwen2.5-14B-Instruct`); pick the correct model family and make the section
consistent by either changing the canonical profile to `qwen2.5` or updating the
evidence entries to the corresponding `qwen3-27b-instruct` runtime names (update
all occurrences of `qwen2.5` in the evidence block and any by-model entries to
match `qwen3-27b-instruct` naming).
In `@src/results/aggregate.ts`:
- Around line 374-421: The code seeds machineSummary with the first
VerificationStatus seen (in the matchedRuns loop) and never updates it, so
merged profiles can show an incorrect status; change this by computing an
aggregated status per profile instead of taking the first one — add a small
helper (e.g., computeAggregateVerificationStatus or a verification precedence
map where rejected > self_reported > verified) and use it whenever you set or
update machineSummary entries (referencing machineSummary, matchedRuns, items,
and the VerificationStatus type); update the logic in both the matchedRuns
seeding block and the items-processing block to combine the existing status with
the new status using that helper so the final MachineAggregateSummary reflects
the worst/most conservative status across all runs and instances
(profileRunSet/profileInstanceSet remain unchanged).
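A minimal sketch of such a helper, using the precedence named in the prompt (rejected > self_reported > verified); the literal union values are assumed from that description:

```ts
type VerificationStatus = "verified" | "self_reported" | "rejected";

// Higher number wins; "rejected" is the most conservative outcome.
const VERIFICATION_PRECEDENCE: Record<VerificationStatus, number> = {
  verified: 0,
  self_reported: 1,
  rejected: 2,
};

function computeAggregateVerificationStatus(
  current: VerificationStatus,
  next: VerificationStatus,
): VerificationStatus {
  return VERIFICATION_PRECEDENCE[next] > VERIFICATION_PRECEDENCE[current]
    ? next
    : current;
}
```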
In `@src/results/compare.ts`:
- Around line 231-254: computeDeltas currently substitutes missing
automatedScore/frontierEval with zeroes causing spurious deltas; change the
logic in computeDeltas (the automatedScore and frontierEval blocks) to only
compute and return a ScoreDelta or EvalDelta when both a and b actually have the
metric (i.e., a.automatedScore && b.automatedScore for automatedScore, and
a.frontierEval && b.frontierEval for frontierEval); otherwise set
automatedScore/frontierEval to null and handle availability changes separately
if you need that signal; apply the same fix to the other similar block later in
computeDeltas (around the second automatedScore/frontierEval handling).
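A minimal sketch of the guard for one metric; the `ScoreDelta` shape and the `mean` field are assumptions, not the repository's actual types:

```ts
interface ScoreDelta {
  a: number;
  b: number;
  delta: number;
}

function computeScoreDelta(
  a: { automatedScore?: { mean: number } },
  b: { automatedScore?: { mean: number } },
): ScoreDelta | null {
  // Only compare when both sides actually report the metric; a missing score
  // becomes null rather than being treated as zero.
  if (!a.automatedScore || !b.automatedScore) {
    return null;
  }
  return {
    a: a.automatedScore.mean,
    b: b.automatedScore.mean,
    delta: b.automatedScore.mean - a.automatedScore.mean,
  };
}
```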
---
Duplicate comments:
In `@src/cli/migrate-machine-command.ts`:
- Around line 127-135: Normalization errors from
normalizeKnownRunPayload()/normalizeKnownPlanPayload() (called via
normalize(parsed)) aren't wrapped with the artifact path, so exceptions escape
the artifact-scoped error message; wrap the normalize(parsed) call in a
try/catch and on error throw a new Error that includes `artifactPath` (same
pattern used for validate) and pass the caught error as the cause, updating the
block around normalize(parsed) / validate(normalized) so both normalization and
validation failures are reported with the artifact path.
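A minimal sketch of the wrapping, with `normalize`, `validate`, and `artifactPath` passed in explicitly since their real signatures are assumed:

```ts
function normalizeAndValidateArtifact<T>(
  artifactPath: string,
  parsed: unknown,
  normalize: (value: unknown) => unknown,
  validate: (value: unknown) => T,
): T {
  let normalized: unknown;
  try {
    normalized = normalize(parsed);
  } catch (error) {
    // Same artifact-scoped message pattern the prompt describes for validate().
    throw new Error(`Failed to normalize ${artifactPath}`, { cause: error });
  }
  try {
    return validate(normalized);
  } catch (error) {
    throw new Error(`Failed to validate ${artifactPath}`, { cause: error });
  }
}
```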
In `@src/lib/machine-profile/probe-linux.ts`:
- Around line 69-77: parseLspciAccelerators is currently returning every
VGA/3D/display lspci line as a confirmed accelerator; change it to mark those
entries as uncorroborated candidates (e.g., add a property like source: "lspci"
and corroborated: false or kind: "candidate") instead of treating them as real
devices, and update the merging logic that uses hasCorroboratedAccelerators to
only promote entries with corroborated: true (or vendor-specific kind) when a
vendor probe (e.g., NVIDIA/AMD probe functions) confirms a match; apply the same
change to the other lspci parsing block referenced around lines 107-123 so
lspci-only entries never inflate acceleratorCount or profileKey unless
explicitly corroborated.
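A minimal sketch of the candidate flag and promotion rule; the accelerator shape and probe names are assumptions for illustration:

```ts
interface AcceleratorCandidate {
  modelRaw: string;
  vendor?: string;
  source: "lspci" | "vendor-probe";
  corroborated: boolean;
}

// lspci lines become uncorroborated candidates, never confirmed devices.
function parseLspciAccelerators(lines: string[]): AcceleratorCandidate[] {
  return lines
    .filter((line) => /VGA compatible controller|3D controller|Display controller/i.test(line))
    .map((line) => ({ modelRaw: line.trim(), source: "lspci" as const, corroborated: false }));
}

// Once any vendor probe confirmed real devices, lspci-only candidates are dropped;
// otherwise they are reported only as uncorroborated candidates.
function selectReportedAccelerators(candidates: AcceleratorCandidate[]): AcceleratorCandidate[] {
  const confirmed = candidates.filter((c) => c.corroborated || c.source !== "lspci");
  return confirmed.length > 0 ? confirmed : candidates;
}
```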
---
Nitpick comments:
In `@src/results/compare.ts`:
- Around line 142-220: The compare.ts file is too large and should have the
aggregate-summary logic split into a helper module: extract the AggregateScoring
and AggregateFrontier interfaces and the functions calculatePassRate,
computeAggregateScoring, and computeAggregateFrontier into a new module (e.g.,
results/aggregate.ts), export them, and replace their original definitions in
compare.ts with imports; update any references to these symbols
(calculatePassRate, computeAggregateScoring, computeAggregateFrontier,
AggregateScoring, AggregateFrontier) to use the new module, ensure types/exports
align with existing signatures (same parameters and return types), and run type
checks to fix any missing imports or exported names.
In `@src/runner/item-executor.ts`:
- Around line 186-188: The code unconditionally assigns signalAssessment =
scoringOutcome.signalAssessment which discards any pre-existing generation-level
assessment when scoringOutcome.signalAssessment is undefined; update the
assignment in the block that handles scoringOutcome (around variables
generation, scoringOutcome, scoringResult) to preserve existing generation-level
signalAssessment by only replacing it when scoringOutcome.signalAssessment is
defined (e.g., signalAssessment = scoringOutcome.signalAssessment ??
signalAssessment or an explicit undefined check), or merge the two assessments
if needed before the finalization logic that builds the final result from
existing signalAssessment so generation-level taints (like
tool_permission_denied) are not lost.
In `@test/plan-builder.test.ts`:
- Around line 325-340: Replace the hard-coded legacy schemaVersion "0.4.0" in
these test fixtures with the project's current schema version constant (i.e.,
use the exported schema/current version constant instead of the string literal)
so the model-profile regression cases exercise the current plan/run schema path;
update the schemaVersion property in the object shown (and the other occurrences
you noted at 421-436, 490-505, 543-558) to reference the canonical schema
constant (e.g., SCHEMA_VERSION or CURRENT_SCHEMA_VERSION) rather than the old
"0.4.0".
In `@test/workspace-prompt-parity.test.ts`:
- Around line 64-98: The test "blind coding prompts declare the required public
method contracts" is missing assertions for the calculator's multiply and divide
method signatures; update the test to assert that calculatorPrompt contains
"`multiply(n: number)`" and "`divide(n: number)`" (add these alongside the
existing expects for add/subtract/clear/memoryRecall/etc.), referencing the
calculatorPrompt variable so the new assertions follow the same pattern as the
other expect(...) calls.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 30f49794-99d4-47e3-960e-66ed3b1d9617
📒 Files selected for processing (102)
@CodeRabbit full review
✅ Actions performed
Full review triggered.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/harnesses/code-output-policy.ts (1)
224-233: ⚠️ Potential issue | 🟠 Major
Use the extracted candidate in the raw-signal accept path.
When `extractCode()` finds fenced/extracted code but `code.length < minOutputLength`, this branch still accepts the output via `hasRawCodeSignals()` and persists `trimmed`. That leaves fences/prose in `code` while `method` still reports the extracted method, so short valid snippets can be accepted and then fail import/compile downstream.
🐛 Proposed fix
```diff
   if (hasRawCodeSignals) {
     return {
       shouldRetry: false,
       reason: "ok",
-      code: trimmed,
+      code: candidateCode,
       method: extracted.method,
       taintReasons: getAcceptedOutputTaintReasons(trimmed, extracted),
     };
   }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/code-output-policy.ts` around lines 224 - 233, The raw-signal accept path currently returns trimmed but uses extracted.method, causing fenced/prose to slip through; change the returned code and taint reasoning to use the extracted candidate (e.g., extracted.code or extracted.candidate as provided by extractCode()) when hasRawCodeSignals is true so the accepted payload matches the extracted method: replace usages of trimmed with the extracted code value and pass that extracted value into getAcceptedOutputTaintReasons, keeping method: extracted.method unchanged.apps/dashboard/scripts/build-index.ts (1)
192-239: ⚠️ Potential issue | 🟠 Major
Redact internal transcripts in error/failure strings too.
`INTERNAL_TRACE_PATTERN` only gates `generation.output`. The same transcript in `generation.error` or a structured `generationFailure`/`scoringFailure`/`frontierEvalFailure` message still gets published after path scrubbing, which defeats the new trace redaction.
🔒 Minimal fix
```diff
       ...(item.generation.error
         ? {
-            error: sanitizePublishedText(item.generation.error),
+            error: INTERNAL_TRACE_PATTERN.test(item.generation.error)
+              ? "[redacted internal tool transcript]"
+              : sanitizePublishedText(item.generation.error),
           }
         : {}),
@@
       ...(item.generationFailure
         ? {
             generationFailure: {
               ...item.generationFailure,
-              message: sanitizePublishedText(item.generationFailure.message),
+              message: INTERNAL_TRACE_PATTERN.test(
+                item.generationFailure.message,
+              )
+                ? "[redacted internal tool transcript]"
+                : sanitizePublishedText(item.generationFailure.message),
             },
           }
         : {}),
       ...(item.scoringFailure
         ? {
             scoringFailure: {
               ...item.scoringFailure,
-              message: sanitizePublishedText(item.scoringFailure.message),
+              message: INTERNAL_TRACE_PATTERN.test(item.scoringFailure.message)
+                ? "[redacted internal tool transcript]"
+                : sanitizePublishedText(item.scoringFailure.message),
             },
           }
         : {}),
       ...(item.frontierEvalFailure
         ? {
             frontierEvalFailure: {
               ...item.frontierEvalFailure,
-              message: sanitizePublishedText(item.frontierEvalFailure.message),
+              message: INTERNAL_TRACE_PATTERN.test(
+                item.frontierEvalFailure.message,
+              )
+                ? "[redacted internal tool transcript]"
+                : sanitizePublishedText(item.frontierEvalFailure.message),
             },
           }
         : {}),
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/scripts/build-index.ts` around lines 192 - 239, The code currently only redacts INTERNAL_TRACE_PATTERN for generation.output but still publishes internal transcripts present in generation.error and the various failure.message fields; update the sanitization logic in the mapping for generation.error, generationFailure.message, scoringFailure.message, and frontierEvalFailure.message to first test INTERNAL_TRACE_PATTERN and if it matches replace with "[redacted internal tool transcript]" (same token used for output) otherwise call sanitizePublishedText; ensure you apply the same pattern check for any other places where a raw message/error string is being set (e.g., the generation.error branch and the message fields inside generationFailure, scoringFailure, frontierEvalFailure) so all internal traces are consistently redacted before publishing.
♻️ Duplicate comments (5)
src/tests/workspace-smoke/test.meta.json (1)
7-10: ⚠️ Potential issue | 🟠 Major
Restore `workspace-mkdir` for `workspace-smoke`.
This task still writes into a nested `logs/` path, so dropping `workspace-mkdir` turns a benchmark task into a harness-capability failure. The same bad capability set is then propagated into the checked-in `workspace-smoke` plan artifacts.
🛠️ Minimal fix
"requiredHarnessCapabilities": [ "workspace-read", - "workspace-write" + "workspace-write", + "workspace-mkdir" ],🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tests/workspace-smoke/test.meta.json` around lines 7 - 10, The test meta for workspace-smoke removed the "workspace-mkdir" capability causing tasks that write nested logs/ to fail; restore "workspace-mkdir" in the requiredHarnessCapabilities array in the test.meta.json for the workspace-smoke test (update the requiredHarnessCapabilities JSON array to include "workspace-mkdir" alongside "workspace-read" and "workspace-write") so the harness can create nested directories for logs.test/stats.test.ts (1)
35-54: ⚠️ Potential issue | 🟡 Minor
`timing` still gets overwritten by the final spread.
Line 53 reapplies `overrides.timing`, so partial timing overrides drop the merged defaults again. That means this fixture still doesn't guarantee a stable `RunStats` shape across tests.
💡 Suggested fix
```diff
 function createRunStats(overrides: Partial<RunStats> = {}): RunStats {
+  const { timing: timingOverrides, ...rest } = overrides;
   const timing = {
     avgGenerationMs: 1000,
     avgScoringMs: null,
     avgFrontierEvalMs: null,
     minGenerationMs: 1000,
     maxGenerationMs: 1000,
-    ...(overrides.timing ?? {}),
+    ...(timingOverrides ?? {}),
   };
   return {
     timing,
     tokens: null,
     scoring: null,
     trustedScoring: null,
     frontier: null,
     trustedFrontier: null,
     signal: null,
     generationFailures: null,
-    ...overrides,
+    ...rest,
   };
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/stats.test.ts` around lines 35 - 54, The helper createRunStats merges default timing but then re-applies overrides via the final spread, which lets overrides.timing clobber the merged defaults; fix by excluding timing from the final spread so the merged timing is preserved — e.g., build the merged timing as you already do, then when returning the RunStats object spread the rest of overrides without timing (use a destructuring like const { timing: _, ...rest } = overrides) and return { timing, ...rest } to keep the merged defaults intact; reference createRunStats and the RunStats fixture to locate the change.src/lib/machine-profile/probe-macos.ts (1)
76-84: ⚠️ Potential issue | 🟡 Minor
Truthy check drops valid `memoryBytes: 0`.
Line 80 uses `...(memoryBytes ? { memoryBytes } : {})`, which treats `0` as falsy and excludes it. If `parseMemoryBytes()` ever returns `0`, that GPU becomes indistinguishable from "unknown memory".
Proposed fix
```diff
   return [
     {
       modelRaw,
       ...(vendor ? { vendor } : {}),
-      ...(memoryBytes ? { memoryBytes } : {}),
+      ...(memoryBytes !== undefined ? { memoryBytes } : {}),
       kind,
       backend: "metal",
     },
   ];
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-macos.ts` around lines 76 - 84, The object spread currently uses a truthy check that drops valid zero values: replace the conditional spread ...(memoryBytes ? { memoryBytes } : {}) with a check that only omits memoryBytes when it is null/undefined (e.g. memoryBytes != null or typeof memoryBytes === "number") so that memoryBytes === 0 is preserved; update the returned object in the function that builds the probe result (the place referencing memoryBytes and modelRaw/kind/backend) to use that null/undefined check instead of a truthy check.src/lib/machine-profile/legacy.ts (1)
147-176: ⚠️ Potential issue | 🟠 Major
Parse the legacy branch with Zod instead of shape checks + cast.
This branch still validates only a few fields and then turns the whole payload into `LegacyMachineProfile`. Invalid-but-JSON-valid legacy data can escape `migrateLegacyMachineProfile()` as a typed `MachineProfile`, and direct callers never hit the final `RunPlanSchema`/`RunResultSchema` parse. Please add a dedicated legacy schema and normalize `result.data`.
As per coding guidelines, "Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as unknown and parse with Zod immediately."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/legacy.ts` around lines 147 - 176, The current branch performs ad-hoc shape checks then casts rawMachine to LegacyMachineProfile, which lets malformed JSON slip through; replace those checks with a Zod schema (e.g., LegacyMachineProfileSchema) that fully describes the expected legacy shape, use Schema.parse(rawMachine) instead of casting, and validate/normalize result.data with that schema before calling migrateLegacyObservedHardware, normalizeMachineProfile, buildLegacyInstanceId, buildMachineProfileKey, and buildMachineProfileLabel so only parsed/typed legacy objects reach the rest of the logic.src/lib/machine-profile/probe-linux.ts (1)
107-123: ⚠️ Potential issue | 🟠 Major
Don’t return raw `lspci` controllers once a backend probe succeeds.
`hasCorroboratedAccelerators` only controls `status`; it does not filter the returned set. On a host with `nvidia-smi` plus an ASPEED/BMC or virtual display controller, the `lspci` fallback still lands in `accelerators`, so normalization produces a mixed-accelerator profile and the wrong `profileKey`/label.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/machine-profile/probe-linux.ts` around lines 107 - 123, The current code returns dedupedAccelerators derived from accelerators (which includes raw parseLspciAccelerators results) even when a stronger probe (e.g., nvidia-smi) corroborates real devices; change the return path so that when hasCorroboratedAccelerators is true you filter dedupedAccelerators to only include entries that are corroborated (i.e., accelerator.backend !== undefined || accelerator.vendor !== undefined) before returning so raw lspci-only controllers (from runProbe("lspci") / parseLspciAccelerators) are excluded; update the block that builds the returned object (using dedupeAccelerators, hasCorroboratedAccelerators, and the accelerators array) to apply this filter and return the filtered list instead of the unfiltered dedupedAccelerators.
🧹 Nitpick comments (11)
src/tests/calculator-stateful/prompt.blind.md (1)
24-24: Make all signatures fully explicit in TypeScript notation.
Line 24 is clearer, but the contract still mixes prose with signatures. Consider expressing every method as explicit signatures (e.g., `result(): number`, `memoryStore(): Calculator`, `clear(): Calculator`) to reduce model-output ambiguity.
🤖 Prompt for AI Agents
result(): number,memoryStore(): Calculator,clear(): Calculator) to reduce model-output ambiguity.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tests/calculator-stateful/prompt.blind.md` at line 24, Update the documentation to use explicit TypeScript-style method signatures for the Calculator API instead of prose; e.g., replace descriptions like "result() and memoryRecall() return numbers; the other methods remain chainable" with explicit signatures such as result(): number; memoryRecall(): number; memoryStore(value: number): Calculator; clear(): Calculator; add(value: number): Calculator; subtract(value: number): Calculator (include every method from the prompt example with exact names used in the diff so the contract is unambiguous).src/tests/event-emitter/prompt.blind.md (1)
18-18: Clarify the behavior for events with no listeners.The specification states that
listenerCountreturns "the current listener count for that event," but it doesn't explicitly cover the edge case where an event has never been registered or has zero listeners.Consider adding explicit wording such as: "
listenerCountreturns the current listener count for that event (0 if no listeners are registered)."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tests/event-emitter/prompt.blind.md` at line 18, Update the specification sentence for listenerCount to explicitly state its behavior when no listeners exist: change the line describing listenerCount to read that it "returns the current listener count for that event (0 if no listeners are registered)" so callers and tests know it returns 0 for events that have never been registered or that currently have no listeners; reference the listenerCount API in the EventEmitter spec or test description to ensure clarity.test/workspace-prompt-parity.test.ts (1)
83-83: Use exact method-signature text for `memoryClear()` assertion.
At Line 83, `toContain("memoryClear")` is looser than the other signature checks and can pass non-contract text.
Suggested test assertion adjustment
```diff
- expect(calculatorPrompt).toContain("`memoryClear`");
+ expect(calculatorPrompt).toContain("`memoryClear()`");
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/workspace-prompt-parity.test.ts` at line 83, The assertion for memoryClear is too loose—update the test that checks calculatorPrompt to assert the exact method signature by replacing the current toContain("`memoryClear`") with a check that includes the parentheses (e.g., toContain("`memoryClear()`")), so the test for memoryClear() matches the exact method-signature text used by the other signature checks (target the calculatorPrompt variable and adjust the assertion accordingly).src/tests/rate-limiter/prompt.blind.md (1)
16-16: Tighten wording to require exactly the three methods.
At Line 16, “with … methods” can still allow extra methods. If the contract is meant to be strict, say “returns an object with exactly these methods”.
Suggested wording tweak
```diff
-- Return an object with `allow(key: string, nowMs: number)`, `remaining(key: string, nowMs: number)`, and `reset(key: string)`.
+- Return an object with exactly these methods: `allow(key: string, nowMs: number)`, `remaining(key: string, nowMs: number)`, and `reset(key: string)`.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tests/rate-limiter/prompt.blind.md` at line 16, Update the prompt text to require the exact API surface: replace "Return an object with `allow(key: string, nowMs: number)`, `remaining(key: string, nowMs: number)`, and `reset(key: string)`." with a stricter phrasing that specifies "returns an object with exactly these methods: `allow(key: string, nowMs: number)`, `remaining(key: string, nowMs: number)`, and `reset(key: string)`" so it cannot be interpreted as allowing additional methods; reference the same method names (`allow`, `remaining`, `reset`) to ensure clarity.apps/dashboard/src/components/leaderboard/leaderboard-filters.ts (1)
64-65: Update stale doc wording (“ID” → “key”).
The implementation now keys by `machineProfileKey`, but the comment still says “ID”.
Suggested doc-only patch
```diff
- * Builds machine filter options keyed by stable machine profile ID.
+ * Builds machine filter options keyed by stable machine profile key.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/leaderboard-filters.ts` around lines 64 - 65, Update the stale doc comment that reads "Builds machine filter options keyed by stable machine profile ID." to use "key" instead of "ID" (so it reads "Builds machine filter options keyed by stable machine profile key" or similar), i.e., edit the comment in leaderboard-filters.ts where that sentence appears to reflect that the implementation keys by machineProfileKey.test/machine-profile-instance-id.test.ts (1)
24-31: Consider a stricter assertion for the Windows path.
The darwin and linux tests use `.toBe()` with exact expected paths, but the win32 test uses `.toContain("AppData")`, which is a weaker assertion. This could mask regressions in the Windows path construction logic (e.g., wrong subdirectory or missing `plebdev-bench/machine-instance-id` suffix).
💡 Suggested stricter assertion
it("prefers injected LOCALAPPDATA on win32", () => { expect( resolveDefaultInstanceIdPath("win32", { LOCALAPPDATA: "C:\\Users\\tester\\AppData\\Local", USERPROFILE: "C:\\Users\\tester", }), - ).toContain("AppData"); + ).toBe( + "C:\\Users\\tester\\AppData\\Local\\plebdev-bench\\machine-instance-id", + ); });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/machine-profile-instance-id.test.ts` around lines 24 - 31, The win32 test uses a weak .toContain("AppData") assertion; change it to assert the full exact path returned by resolveDefaultInstanceIdPath("win32", {...}) — e.g., expect(...).toBe("C:\\Users\\tester\\AppData\\Local\\plebdev-bench\\machine-instance-id") — ensuring you escape backslashes in the test string; update the assertion for resolveDefaultInstanceIdPath to match the exact Windows path and include the full plebdev-bench/machine-instance-id suffix.apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx (1)
29-37: Inconsistent fallback chain with `run-card.tsx`.
This helper uses the chain: `machineDisplayLabel → machineLabel → machineProfileLabel → machineProfileKey → machineProfileId`. However, `run-card.tsx` (lines 36-37) uses: `machineDisplayLabel → machineProfileLabel`. The discrepancy may be intentional (this component handles more legacy run shapes), but consider aligning the fallback logic or documenting why they differ to avoid future maintenance confusion.
💡 Consider extracting to a shared utility
```ts
// In a shared module, e.g., apps/dashboard/src/lib/machine-utils.ts
export function getRunMachineBadgeLabel(run: RunListItem): string | undefined {
  return (
    run.machineDisplayLabel?.trim() ||
    run.machineLabel?.trim() ||
    run.machineProfileLabel?.trim() ||
    run.machineProfileKey?.trim() ||
    run.machineProfileId?.trim() ||
    undefined
  );
}
```
Then import and use in both components for consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx` around lines 29 - 37, The fallback chain for deriving the machine badge differs between getRunMachineBadgeLabel (uses machineDisplayLabel → machineLabel → machineProfileLabel → machineProfileKey → machineProfileId) and the logic in run-card.tsx (machineDisplayLabel → machineProfileLabel); extract the canonical logic into a shared utility function (e.g., export function getRunMachineBadgeLabel(run: RunListItem)) that implements the full trimmed fallback chain, replace the local implementations in leaderboard-latest-runs.tsx and run-card.tsx to import and call this shared function, and if the divergence was intentional instead document the reason near the function and align both files to reference that single documented helper.src/cli/compare-command.ts (1)
238-323: Extract the summary rendering out of this CLI module.
This block pushes `src/cli/compare-command.ts` further past the repo's 500-line limit and mixes CLI flow with a lot of presentation branching. Pulling the summary/section printers into a dedicated formatter module would keep `compareCommand` easier to scan and test.
As per coding guidelines, "Keep files under 500 lines; split aggressively by responsibility."
src/harnesses/opencode-adapter.ts (1)
48-59: Extract the signal-assessment/error-decoration path into a helper.
This adapter is already past the repo's 500-line limit, and the permission-denied parsing plus result/error decoration is a separate concern from process orchestration. Pulling this branch into a small helper/module would make the timeout/retry flow easier to follow and bring the file back toward the size cap.
As per coding guidelines, "`src/**/*.{ts,tsx}`: Keep files under 500 lines; split aggressively by responsibility".
Also applies to: 336-340, 449-462, 528-541
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-adapter.ts` around lines 48 - 59, The permission-denied detection and the subsequent signal-assessment/error-decoration logic in this file should be extracted into a small helper module to reduce file size and separate concerns: move the hasPermissionDeniedStderr function and the code that inspects process signals and decorates results/errors (the "signal-assessment/error-decoration" path referenced around lines 336-340, 449-462, 528-541) into a new helper (e.g., createOpenCodeErrorHelper) that exports a clear API like isPermissionDenied(stderr: string) and decorateProcessResult(result, signal, stderr) and update the adapter to call those helpers (replace direct regex checks and inline decoration with isPermissionDenied and decorateProcessResult) so process orchestration remains in the adapter while parsing/decoration lives in the helper.src/results/compare.ts (1)
270-429: Split the summary helpers out of this module.
The new matching + trusted-metric aggregation logic pushes `src/results/compare.ts` past the repo's 500-line cap. Extracting the aggregate/trusted-summary helpers into a sibling module would keep matching, summary math, and formatting separate.
As per coding guidelines, "Keep files under 500 lines; split aggressively by responsibility."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/results/compare.ts` around lines 270 - 429, computeSummary has grown past the file-size limit; extract the aggregate and trusted-summary helper logic (the scoring/frontier aggregation and trusted-metrics checks) into a new sibling module and import them into src/results/compare.ts. Specifically, move implementations and any small helpers used by computeSummary such as computeAggregateScoring, computeAggregateFrontier, hasCompleteSignalAssessments and the trusted-matching/trusted-aggregation logic that computes trustedScoringDelta/trustedFrontierEvalDelta and metricAvailability/trustedComparedRows into a new module (e.g., results/summaryHelpers) and update computeSummary to call the exported functions (keeping the signature/behavior identical), then update imports and run tests to ensure no behavioral changes.src/lib/model-profile/registry.ts (1)
270-292: Consider: Provenance may be lost during merge.
When merging profiles, if the same `profileKey` exists in multiple registries, the later profile's provenance overrides the earlier one. If a `configured_profile` registry is merged over a `legacy_alias` entry, the provenance changes to `configured_profile` even though some fields might have originated from the legacy alias.
This appears intentional given "later definitions taking precedence," but worth confirming this is the desired behavior for provenance tracking.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib/model-profile/registry.ts` around lines 270 - 292, mergeModelProfiles currently allows later registries to overwrite the profile.provenance, losing earlier provenance (e.g., configured_profile overwriting legacy_alias); update mergeModelProfiles to preserve provenance when combining profiles by detecting existing.provenance and profile.provenance and merging them (e.g., concatenate into an array or keep a deduplicated list) instead of simply taking the later value—apply this change in the merge branch where existing and profile are combined (the block that sets merged[profileKey] = existing ? { ...existing, ...profile, variants: {...} } : {...}), ensuring provenance is computed and included in the returned merged object.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/harnesses/goose-adapter.ts`:
- Around line 416-429: The ternary that appends "tool_call_not_executed" is dead
because codeFilePath is unconditionally assigned to solutionPath earlier, so the
condition toolCallDetected && !codeFilePath will always be false; fix by either
removing the dead ternary entirely from the appendSignalAssessmentReasons call
in this return, or change the condition to check the actual provenance flag
(e.g., replace !codeFilePath with a real indicator such as a new or existing
variable that denotes whether the code was extracted vs executed—use
toolCallDetected && <provenanceFlag>), keeping references to toolCallDetected,
codeFilePath/solutionPath, appendSignalAssessmentReasons and the
"tool_call_not_executed" tag so the intended taint is applied correctly.
In `@src/lib/machine-profile/instance-id.ts`:
- Around line 81-127: The writeGeneratedInstanceId function can leave a partial
file if openSync("wx") succeeded but writeFileSync or fchmodSync throws; update
the non-EEXIST error handling to remove the file this process created when
persistence fails mid-write: inside the catch block for non-EEXIST errors, if
fileDescriptor !== undefined (meaning openSync succeeded) close the descriptor
(fs.closeSync(fileDescriptor)) and then attempt to unlink targetPath
(fs.unlinkSync(targetPath)) inside a try/catch (best-effort cleanup), before
rethrowing the original error; keep existing finally logic for safe double-close
but ensure the explicit unlink runs when we own the file so concurrent callers
won’t see a malformed file.
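A minimal sketch of that cleanup path, assuming the `wx` open/write flow described above (names and error shape are illustrative):

```ts
import * as fs from "node:fs";

function writeGeneratedInstanceId(targetPath: string, instanceId: string): void {
  let fileDescriptor: number | undefined;
  try {
    fileDescriptor = fs.openSync(targetPath, "wx"); // EEXIST means another process won the race
    fs.writeFileSync(fileDescriptor, `${instanceId}\n`, "utf8");
    fs.fchmodSync(fileDescriptor, 0o600);
  } catch (error) {
    const code = (error as NodeJS.ErrnoException).code;
    if (code !== "EEXIST" && fileDescriptor !== undefined) {
      // We created the file, so best-effort cleanup keeps concurrent readers
      // from ever observing a partially written instance id.
      const ownedDescriptor = fileDescriptor;
      fileDescriptor = undefined;
      try {
        fs.closeSync(ownedDescriptor);
        fs.unlinkSync(targetPath);
      } catch {
        // ignore cleanup failures; the original error is rethrown below
      }
    }
    throw error;
  } finally {
    if (fileDescriptor !== undefined) {
      fs.closeSync(fileDescriptor);
    }
  }
}
```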
In `@src/results/compare.ts`:
- Around line 27-41: buildCompareKey currently collapses different runtime
variants for the same canonical model profile; update buildCompareKey (and the
payload shape it consumes) to include a stable runtime variant discriminator
(e.g., item.runtimeVariant or item.runtimeTag) in the returned key (append after
item.runtime) so variants like different ollama builds produce distinct keys,
and/or add a duplicate-detection guard in the code that inserts into the
comparison Map (the caller that uses Map.set with keys from buildCompareKey) to
throw an error if a key already exists to fail fast instead of silently
overwriting; reference buildCompareKey and getModelIdentityKey when making this
change and also add the Map.set existence check at the Map insertion site.
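A minimal sketch of both pieces; the item fields (`testId`, `modelProfileKey`, `runtimeVariant`) are assumptions standing in for the real payload shape:

```ts
interface ComparableItem {
  testId: string;
  runtime: string;
  runtimeVariant?: string;
  modelProfileKey: string;
}

function buildCompareKey(item: ComparableItem): string {
  // Including the runtime variant keeps two different ollama builds of the same
  // canonical profile from collapsing into one row.
  return [item.testId, item.modelProfileKey, item.runtime, item.runtimeVariant ?? "default"].join("::");
}

function indexByCompareKey(items: ComparableItem[]): Map<string, ComparableItem> {
  const index = new Map<string, ComparableItem>();
  for (const item of items) {
    const key = buildCompareKey(item);
    if (index.has(key)) {
      throw new Error(`Duplicate compare key: ${key}`); // fail fast instead of silently overwriting
    }
    index.set(key, item);
  }
  return index;
}
```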
In `@src/runner/item-retry.ts`:
- Around line 209-216: The scoring helper runScoringWithCompileRetry can return
signalAssessment as undefined, which currently overwrites an earlier
generation-set signalAssessment (variable signalAssessment) in the caller (e.g.,
where scoringOutcome is assigned), losing taint/trust metadata; update callers
to merge instead of overwrite by assigning signalAssessment =
scoringOutcome.signalAssessment ?? signalAssessment (and apply the same
nullish-coalescing pattern wherever runScoringWithCompileRetry results are
consumed) so the existing signalAssessment is preserved when the helper returns
undefined.
In `@src/runner/plan-builder.ts`:
- Around line 167-205: The loop in build plan silently drops model selectors
from config.models that never resolve on any reachable runtime; update the logic
in plan-builder (around resolveModelSelection, discovered, filteredSet,
resolvedModelProfiles) to detect selectors that remain unmatched after checking
all runtimes and throw a clear Error listing those selectors instead of silently
ignoring them; implement this by tracking unresolved modelSpecs (e.g., a Set
initialized from config.models, removing each modelSpec when you push into
filtered/filteredSet or set resolvedModelProfiles) and after the runtime loop,
if the Set is non-empty throw an Error("Requested model selectors not found:
...") including the missing keys.
In `@test/signal-assessment.test.ts`:
- Around line 1-3: Add the repo-standard header to the top of
test/signal-assessment.test.ts by expanding the existing Purpose block to
include an Exports section (list exported test helpers or `none` if the file
exports nothing) and an Invariants section (any preconditions or assumptions the
tests rely on), matching the shape used in test/machine-profile-probe.test.ts;
ensure the header sits before any imports or code and clearly names the file
purpose, exported symbols (or `none`), and invariants.
---
Outside diff comments:
In `@apps/dashboard/scripts/build-index.ts`:
- Around line 192-239: The code currently only redacts INTERNAL_TRACE_PATTERN
for generation.output but still publishes internal transcripts present in
generation.error and the various failure.message fields; update the sanitization
logic in the mapping for generation.error, generationFailure.message,
scoringFailure.message, and frontierEvalFailure.message to first test
INTERNAL_TRACE_PATTERN and if it matches replace with "[redacted internal tool
transcript]" (same token used for output) otherwise call sanitizePublishedText;
ensure you apply the same pattern check for any other places where a raw
message/error string is being set (e.g., the generation.error branch and the
message fields inside generationFailure, scoringFailure, frontierEvalFailure) so
all internal traces are consistently redacted before publishing.
In `@src/harnesses/code-output-policy.ts`:
- Around line 224-233: The raw-signal accept path currently returns trimmed but
uses extracted.method, causing fenced/prose to slip through; change the returned
code and taint reasoning to use the extracted candidate (e.g., extracted.code or
extracted.candidate as provided by extractCode()) when hasRawCodeSignals is true
so the accepted payload matches the extracted method: replace usages of trimmed
with the extracted code value and pass that extracted value into
getAcceptedOutputTaintReasons, keeping method: extracted.method unchanged.
---
Duplicate comments:
In `@src/lib/machine-profile/legacy.ts`:
- Around line 147-176: The current branch performs ad-hoc shape checks then
casts rawMachine to LegacyMachineProfile, which lets malformed JSON slip
through; replace those checks with a Zod schema (e.g.,
LegacyMachineProfileSchema) that fully describes the expected legacy shape, use
Schema.parse(rawMachine) instead of casting, and validate/normalize result.data
with that schema before calling migrateLegacyObservedHardware,
normalizeMachineProfile, buildLegacyInstanceId, buildMachineProfileKey, and
buildMachineProfileLabel so only parsed/typed legacy objects reach the rest of
the logic.
In `@src/lib/machine-profile/probe-linux.ts`:
- Around line 107-123: The current code returns dedupedAccelerators derived from
accelerators (which includes raw parseLspciAccelerators results) even when a
stronger probe (e.g., nvidia-smi) corroborates real devices; change the return
path so that when hasCorroboratedAccelerators is true you filter
dedupedAccelerators to only include entries that are corroborated (i.e.,
accelerator.backend !== undefined || accelerator.vendor !== undefined) before
returning so raw lspci-only controllers (from runProbe("lspci") /
parseLspciAccelerators) are excluded; update the block that builds the returned
object (using dedupeAccelerators, hasCorroboratedAccelerators, and the
accelerators array) to apply this filter and return the filtered list instead of
the unfiltered dedupedAccelerators.
In `@src/lib/machine-profile/probe-macos.ts`:
- Around line 76-84: The object spread currently uses a truthy check that drops
valid zero values: replace the conditional spread ...(memoryBytes ? {
memoryBytes } : {}) with a check that only omits memoryBytes when it is
null/undefined (e.g. memoryBytes != null or typeof memoryBytes === "number") so
that memoryBytes === 0 is preserved; update the returned object in the function
that builds the probe result (the place referencing memoryBytes and
modelRaw/kind/backend) to use that null/undefined check instead of a truthy
check.
In `@src/tests/workspace-smoke/test.meta.json`:
- Around line 7-10: The test meta for workspace-smoke removed the
"workspace-mkdir" capability causing tasks that write nested logs/ to fail;
restore "workspace-mkdir" in the requiredHarnessCapabilities array in the
test.meta.json for the workspace-smoke test (update the
requiredHarnessCapabilities JSON array to include "workspace-mkdir" alongside
"workspace-read" and "workspace-write") so the harness can create nested
directories for logs.
In `@test/stats.test.ts`:
- Around line 35-54: The helper createRunStats merges default timing but then
re-applies overrides via the final spread, which lets overrides.timing clobber
the merged defaults; fix by excluding timing from the final spread so the merged
timing is preserved — e.g., build the merged timing as you already do, then when
returning the RunStats object spread the rest of overrides without timing (use a
destructuring like const { timing: _, ...rest } = overrides) and return {
timing, ...rest } to keep the merged defaults intact; reference createRunStats
and the RunStats fixture to locate the change.
---
Nitpick comments:
In `@apps/dashboard/src/components/leaderboard/leaderboard-filters.ts`:
- Around line 64-65: Update the stale doc comment that reads "Builds machine
filter options keyed by stable machine profile ID." to use "key" instead of "ID"
(so it reads "Builds machine filter options keyed by stable machine profile key"
or similar), i.e., edit the comment in leaderboard-filters.ts where that
sentence appears to reflect that the implementation keys by machineProfileKey.
In `@apps/dashboard/src/components/leaderboard/leaderboard-latest-runs.tsx`:
- Around line 29-37: The fallback chain for deriving the machine badge differs
between getRunMachineBadgeLabel (uses machineDisplayLabel → machineLabel →
machineProfileLabel → machineProfileKey → machineProfileId) and the logic in
run-card.tsx (machineDisplayLabel → machineProfileLabel); extract the canonical
logic into a shared utility function (e.g., export function
getRunMachineBadgeLabel(run: RunListItem)) that implements the full trimmed
fallback chain, replace the local implementations in leaderboard-latest-runs.tsx
and run-card.tsx to import and call this shared function, and if the divergence
was intentional instead document the reason near the function and align both
files to reference that single documented helper.
In `@src/harnesses/opencode-adapter.ts`:
- Around line 48-59: The permission-denied detection and the subsequent
signal-assessment/error-decoration logic in this file should be extracted into a
small helper module to reduce file size and separate concerns: move the
hasPermissionDeniedStderr function and the code that inspects process signals
and decorates results/errors (the "signal-assessment/error-decoration" path
referenced around lines 336-340, 449-462, 528-541) into a new helper (e.g.,
createOpenCodeErrorHelper) that exports a clear API like
isPermissionDenied(stderr: string) and decorateProcessResult(result, signal,
stderr) and update the adapter to call those helpers (replace direct regex
checks and inline decoration with isPermissionDenied and decorateProcessResult)
so process orchestration remains in the adapter while parsing/decoration lives
in the helper.
In `@src/lib/model-profile/registry.ts`:
- Around line 270-292: mergeModelProfiles currently allows later registries to
overwrite the profile.provenance, losing earlier provenance (e.g.,
configured_profile overwriting legacy_alias); update mergeModelProfiles to
preserve provenance when combining profiles by detecting existing.provenance and
profile.provenance and merging them (e.g., concatenate into an array or keep a
deduplicated list) instead of simply taking the later value—apply this change in
the merge branch where existing and profile are combined (the block that sets
merged[profileKey] = existing ? { ...existing, ...profile, variants: {...} } :
{...}), ensuring provenance is computed and included in the returned merged
object.
In `@src/results/compare.ts`:
- Around line 270-429: computeSummary has grown past the file-size limit;
extract the aggregate and trusted-summary helper logic (the scoring/frontier
aggregation and trusted-metrics checks) into a new sibling module and import
them into src/results/compare.ts. Specifically, move implementations and any
small helpers used by computeSummary such as computeAggregateScoring,
computeAggregateFrontier, hasCompleteSignalAssessments and the
trusted-matching/trusted-aggregation logic that computes
trustedScoringDelta/trustedFrontierEvalDelta and
metricAvailability/trustedComparedRows into a new module (e.g.,
results/summaryHelpers) and update computeSummary to call the exported functions
(keeping the signature/behavior identical), then update imports and run tests to
ensure no behavioral changes.
In `@src/tests/calculator-stateful/prompt.blind.md`:
- Line 24: Update the documentation to use explicit TypeScript-style method
signatures for the Calculator API instead of prose; e.g., replace descriptions
like "result() and memoryRecall() return numbers; the other methods remain
chainable" with explicit signatures such as result(): number; memoryRecall():
number; memoryStore(value: number): Calculator; clear(): Calculator; add(value:
number): Calculator; subtract(value: number): Calculator (include every method
from the prompt example with exact names used in the diff so the contract is
unambiguous).
In `@src/tests/event-emitter/prompt.blind.md`:
- Line 18: Update the specification sentence for listenerCount to explicitly
state its behavior when no listeners exist: change the line describing
listenerCount to read that it "returns the current listener count for that event
(0 if no listeners are registered)" so callers and tests know it returns 0 for
events that have never been registered or that currently have no listeners;
reference the listenerCount API in the EventEmitter spec or test description to
ensure clarity.
In `@src/tests/rate-limiter/prompt.blind.md`:
- Line 16: Update the prompt text to require the exact API surface: replace
"Return an object with `allow(key: string, nowMs: number)`, `remaining(key:
string, nowMs: number)`, and `reset(key: string)`." with a stricter phrasing
that specifies "returns an object with exactly these methods: `allow(key:
string, nowMs: number)`, `remaining(key: string, nowMs: number)`, and
`reset(key: string)`" so it cannot be interpreted as allowing additional
methods; reference the same method names (`allow`, `remaining`, `reset`) to
ensure clarity.
In `@test/machine-profile-instance-id.test.ts`:
- Around line 24-31: The win32 test uses a weak .toContain("AppData") assertion;
change it to assert the full exact path returned by
resolveDefaultInstanceIdPath("win32", {...}) — e.g.,
expect(...).toBe("C:\\Users\\tester\\AppData\\Local\\plebdev-bench\\machine-instance-id")
— ensuring you escape backslashes in the test string; update the assertion for
resolveDefaultInstanceIdPath to match the exact Windows path and include the
full plebdev-bench/machine-instance-id suffix.
In `@test/workspace-prompt-parity.test.ts`:
- Line 83: The assertion for memoryClear is too loose—update the test that
checks calculatorPrompt to assert the exact method signature by replacing the
current toContain("`memoryClear`") with a check that includes the parentheses
(e.g., toContain("`memoryClear()`")), so the test for memoryClear() matches the
exact method-signature text used by the other signature checks (target the
calculatorPrompt variable and adjust the assertion accordingly).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 555b9328-c221-4afb-b1e0-ece45de4ad5b
📒 Files selected for processing (104)
Summary
This PR lands two related identity layers that now both exist on staging:
Canonical model identities that group runtime-specific variants under one model profile. This lets us treat the same base model as one thing in the product while still preserving runtime, quantization, and format details underneath.
Examples:
`qwen3-27b-instruct` via Ollama
`qwen3-27b-instruct` profile via vLLM
Canonical hardware profiles that are separate from per-machine instance IDs. This lets us aggregate by shared hardware shape while still distinguishing individual hosts.
Examples:
What Changed
`migrate-machine-profiles` CLI.
Why This Matters
Before this branch, model names were mostly runtime-specific strings and machine identity was too coarse. With these changes:
Validation
`bun run typecheck`
`bun run test`
Summary by CodeRabbit
New Features
Improvements
Schema Updates