Feature/multi runtime#6
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR introduces a runtime-first architecture: adds a distinct "runtime" dimension (ollama, vllm) to the benchmark matrix (runtime × harness × model × test × passType), implements runtime adapters and discovery, updates harnesses to be runtime-aware, adds model-aliasing and managed vLLM lifecycle, updates schemas to 0.2.1, and wires runtime support through the runner, CLI, tests, dashboard, and docs. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant CLI as CLI/Config
participant Discovery
participant RuntimeMgr as Runtime Manager
participant Runtime as Runtime (Ollama/vLLM)
participant Harness
participant Scorer
participant Dashboard
User->>CLI: run --runtimes ollama,vllm
CLI->>Discovery: discoverRuntimes(config)
Discovery->>Runtime: ping()
Runtime-->>Discovery: available?
Discovery-->>CLI: [ollama, vllm]
CLI->>RuntimeMgr: listModels(runtime)
RuntimeMgr->>Runtime: listModels()
Runtime-->>RuntimeMgr: [models]
CLI->>CLI: build matrix (runtime × harness × model × test × passType)
loop Each matrix item
CLI->>RuntimeMgr: createRuntime(name, config)
RuntimeMgr->>Runtime: init(baseUrl)
Runtime-->>RuntimeMgr: ready
CLI->>Harness: createHarness(name)
Harness-->>CLI: ready
CLI->>Harness: generate(prompt, { runtime })
Harness->>Runtime: HTTP call (runtime.baseUrl)
Runtime-->>Harness: output + tokens
Harness-->>CLI: GenerateResult
CLI->>Scorer: score(output)
Scorer-->>CLI: AutomatedScore
end
CLI->>Dashboard: write run results (include runtime)
Dashboard->>Dashboard: aggregate by runtime
Dashboard-->>User: render runtime-aware charts/tables
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 14
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
src/harnesses/goose-adapter.ts (2)
30-235: 🛠️ Refactor suggestion | 🟠 MajorSplit Goose parsing helpers to keep the file under 500 lines.
This file now exceeds the 500-line cap; consider moving normalization/extraction helpers into a smallsrc/harnesses/goose-output.ts(or similar) to keep responsibilities separated.As per coding guidelines: "Keep files under 500 lines; split aggressively by responsibility."
211-289:⚠️ Potential issue | 🟠 MajorAdd Zod validation for Goose JSON output.
Goose CLI output is a boundary input; it's parsed and then cast without schema validation. Define a
GoosePayloadSchemausing Zod and replace the unsafe type assertion inparseGooseJsonPayloadwithsafeParse, returning only validated data.Per guidelines: "Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as 'unknown' and parse immediately with Zod."
Example approach
function parseGooseJsonPayload(raw: string): unknown | null { const trimmed = raw.trim(); if (!trimmed) return null; @@ for (const candidate of candidates) { try { - return JSON.parse(candidate) as unknown; + const parsed = JSON.parse(candidate) as unknown; + const result = GoosePayloadSchema.safeParse(parsed); + if (result.success) { + return result.data; + } } catch { // Continue trying other candidate slices. }Import Zod at the top and define the schema for the expected Goose message structure.
apps/dashboard/src/components/run-detail/matrix-table.tsx (1)
1-26:⚠️ Potential issue | 🟡 MinorAdd header details and TSDoc for the exported component.
The file header should document purpose, exports, and invariants, and the exportedMatrixTablefunction needs TSDoc with params/returns/throws.As per coding guidelines: "Every file must start with a descriptive header describing purpose, exports, and invariants" and "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)."📝 Suggested doc additions
/** * Purpose: Matrix table component for displaying run items. * Shows all matrix items with status, model, harness, test, scores, and timing. + * Exports: MatrixTable + * Invariants: items are already validated and include runtime/harness/model/test/passType fields. */ @@ -export function MatrixTable({ items, onRowClick }: MatrixTableProps) { +/** + * Render the matrix table for a run. + * + * `@param` items - Matrix items to display + * `@param` onRowClick - Optional row click handler + * `@returns` Table markup for matrix items + * `@throws` {Error} Never directly + */ +export function MatrixTable({ items, onRowClick }: MatrixTableProps) {apps/dashboard/src/lib/aggregations.ts (1)
106-116:⚠️ Potential issue | 🟡 MinorUpdate
groupByModelHarnessdocs to include runtime.Line 115 now builds keys as
runtime / model / harness, but the doc still says “model + harness”.✍️ Suggested doc fix
-/** - * Groups items by combined model + harness name. - * `@param` items - Matrix items - * `@returns` Map keyed by "model · harness" - */ +/** + * Groups items by combined runtime + model + harness name. + * `@param` items - Matrix items + * `@returns` Map keyed by "runtime / model / harness" + */llm/project/design-rules.md (1)
29-34:⚠️ Potential issue | 🟡 MinorInclude runtime in the naming-consistency note.
Line 34 mentions model/harness/test/passType but runtime is now first-class.
✍️ Suggested edit
- - keep naming consistent everywhere (model/harness/test/passType) + - keep naming consistent everywhere (runtime/model/harness/test/passType)apps/dashboard/src/components/run-detail/dimension-detail-dialog.tsx (1)
350-356:⚠️ Potential issue | 🟡 MinorAdd TSDoc for the exported dialog component.
Exported functions in this repo must include TSDoc/JSDoc with purpose, params, returns, and throws.
📚 Suggested doc block
+/** + * Renders the dimension detail dialog with summary, breakdowns, and item list. + * + * `@param` props - Dialog props including dimension, name, items, and open state. + * `@returns` The dialog JSX, or null when `name` is missing. + * `@throws` {Error} Never. + */ export function DimensionDetailDialog({ dimension, name, items, open, onOpenChange, }: DimensionDetailDialogProps) {As per coding guidelines: “All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws).”
🤖 Fix all issues with AI agents
In `@apps/dashboard/src/lib/aggregations.ts`:
- Around line 64-76: Update the JSDoc for the exported function groupByRuntime
to include `@param` and `@returns` tags: add an `@param` describing the items
parameter (items: MatrixItemResult[] — array of items to group by runtime) and
an `@returns` describing the Map<string, MatrixItemResult[]> result (a Map keyed
by runtime name with arrays of MatrixItemResult). Keep the existing brief
purpose line, and include `@throws` if the function can throw (otherwise omit) so
the doc meets the project's TSDoc/JSDoc requirements.
In `@llm/project/project-rules.md`:
- Line 16: Update the documentation line that lists runtime adapters to include
vLLM alongside Ollama: edit the entry mentioning `src/runtimes/` to reference
both "Ollama" and "vLLM" (or similar phrasing), and ensure it aligns with the
implemented `vllm-runtime` (see `src/runtimes/vllm-runtime.ts`) and the
`RUNTIME_NAMES` constant so the docs reflect the actual supported runtimes.
In `@models.example.json`:
- Around line 6-9: The vLLM reference for the model key "qwen3-4b" is incorrect:
the "vllm" value currently points to an MLX-format package
("mlx-community/Qwen3-4B-Instruct-4bit") which vLLM cannot load; update the
"vllm" entry for "qwen3-4b" to a vLLM-compatible HuggingFace path (e.g.,
"Qwen/Qwen3-4B-Instruct") or to a compatible quantized format
(AWQ/GPTQ/bitsandbytes/GGUF) so vLLM can load it correctly. Ensure you only
change the "vllm" field for the "qwen3-4b" object and keep the "ollama" entry
("qwen3:4b") untouched.
In `@models.json`:
- Around line 2-4: The mapping for "qwen3-8b" is inconsistent: the vLLM target
"Qwen/Qwen2.5-14B-Instruct" is a 14B model while the id implies 8B; either
update the vllm value to an 8B variant (e.g., replace
"Qwen/Qwen2.5-14B-Instruct" with the correct 8B model name) or rename the key
"qwen3-8b" to reflect the 14B target, making sure to keep the "ollama" and
"vllm" entries coherent (check the "qwen3-8b" key, its "ollama" value, and the
"vllm" value when making the change).
In `@src/harnesses/goose-adapter.ts`:
- Around line 294-299: The JSDoc for the exported function createGooseAdapter is
missing a `@throws` tag; update the docblock above export function
createGooseAdapter(): Harness to include a `@throws` entry describing the error
conditions the function may raise (for example initialization/config or
resource-creation failures) and the type of the thrown error (e.g., Error), so
consumers know what exceptions to handle when calling createGooseAdapter.
In `@src/harnesses/harness.ts`:
- Around line 120-125: normalizeHarnessName currently casts any string to
HarnessName without validation; update it to validate input against
LEGACY_HARNESS_ALIAS and the canonical set (HARNESS_NAMES) and throw a clear
error on invalid inputs instead of returning an unsafe cast. Locate
normalizeHarnessName and change its logic to: if name equals
LEGACY_HARNESS_ALIAS return "direct"; else if name is included in HARNESS_NAMES
return name as HarnessName; otherwise throw a descriptive Error (e.g., `Invalid
harness name: ${name}`) so callers receive a programmer/config error rather than
a silent invalid value.
In `@src/lib/ollama-client.ts`:
- Around line 84-101: The streaming loop must buffer partial NDJSON lines and
validate each complete line with Zod instead of blindly JSON.parse; replace the
per-chunk parse in the reader loop (symbols: decoder, reader, output,
promptTokens, completionTokens) with a buffer string that appends
decoder.decode(value, {stream:true}), split complete lines on "\n", keep any
trailing partial line in the buffer for the next iteration, and for each
complete line parse it to JSON then validate against a Zod schema (e.g., create
an OllamaStreamSchema with optional fields response, done, prompt_eval_count,
eval_count) before using the fields; update promptTokens and completionTokens
only when the schema result has non-undefined values (use explicit !== undefined
checks so zero values are preserved) and handle Zod/parse errors by skipping or
logging the invalid line rather than crashing.
In `@src/lib/openai-compat-client.ts`:
- Around line 111-133: The SSE chunk handling currently does an unchecked
JSON.parse into ChatCompletionChunk; replace that cast with runtime Zod
validation: create a ChatCompletionChunkSchema (or equivalent) using zod that
matches the expected shape (choices[0].delta.content optional,
usage.prompt_tokens and usage.completion_tokens optional), then inside the try
block parse the raw data and call ChatCompletionChunkSchema.safeParse(parsed)
(or parse with try/catch) and only if validation succeeds update output,
promptTokens and completionTokens; if validation fails, skip the chunk and
optionally log/debug the validation error so malformed/attacker-controlled SSE
chunks cannot corrupt state.
In `@src/runner/plan-builder.ts`:
- Line 318: Replace the hardcoded schemaVersion "0.2.0" in plan-builder.ts with
the shared constant: import SCHEMA_VERSION from the schemas module (same module
used by src/runner/index.ts) and use that constant where schemaVersion is
assigned (the object or variable that currently contains schemaVersion:
"0.2.0"); update the top of the file to add the import for SCHEMA_VERSION from
src/schemas/common.schema.ts and replace the literal with SCHEMA_VERSION so the
file uses the single source of truth.
In `@src/runtimes/vllm-runtime.ts`:
- Around line 120-151: Replace console.debug calls in the ping() method with the
project Pino logger (e.g., logger.debug) and emit structured logs: when
healthResponse or modelsResponse are non-ok use logger.debug with an object like
{ status: healthResponse.status, url: `${baseUrl}/health` } and when catching
errors log the error object (e.g., logger.debug({ err: error }, 'vLLM /health
failed') or similar). Ensure the file imports/uses the shared Pino instance (or
the existing logger variable) and preserve existing messages and behavior in the
fallback to `${baseUrl}/v1/models` while replacing all console.debug occurrences
in ping().
In `@src/schemas/config.schema.ts`:
- Around line 10-18: BenchConfigSchema currently lacks a version marker; add a
schemaVersion property near the top of the z.object inside BenchConfigSchema
(before runtimes) to track schema versions. Add schemaVersion:
z.number().default(1) (or z.string() if you prefer string versions) as the first
property in BenchConfigSchema so new configs get a default version and existing
code can rely on an explicit version field.
In `@src/schemas/model-alias.schema.ts`:
- Around line 1-16: The file header for the model-alias schema is missing the
required "invariants" section; update the top-of-file comment to include a brief
invariants list (e.g., keys are canonical model names, values map runtime
identifier strings, values must be non-empty strings, and mapping is one-to-one
per runtime) and mention the exported symbols ModelAliasMapSchema and
ModelAliasMap; ensure the comment block directly above the module
exports/docstrings includes Purpose, Exports, and Invariants so readers and
linters can verify expectations for ModelAliasMapSchema and ModelAliasMap.
In `@src/schemas/plan.schema.ts`:
- Around line 65-69: The plan schema's config object validates vllmBaseUrl as
z.string() but should match BenchConfigSchema by using z.string().url(); update
the config object in src/schemas/plan.schema.ts to change the vllmBaseUrl
validator from z.string() to z.string().url(), keeping other fields
(ollamaBaseUrl, generateTimeoutMs, passTypes) unchanged so plan validation
aligns with BenchConfigSchema and enforces a proper URL format.
In `@test/harness-compatibility.test.ts`:
- Around line 1-3: Update the existing top-of-file comment block (the /* ... */
header) to include explicit "Exports" and "Invariants" entries: add "Exports:
none (this is a test file)" and one-line invariants such as "Invariants:
deterministic tests, no real network I/O, only mutate globals via harness mocks;
tests should be idempotent and independent." Put these lines directly beneath
the existing Purpose line in the header comment.
🧹 Nitpick comments (13)
llm/project/project-overview.md (1)
43-44: Consider aligning the “Harnesses” bullet above with the new runtime/harness split.The Architecture & Stack section still reads like Ollama/Goose/OpenCode are all harnesses; updating that wording would prevent confusion now that runtimes and harnesses are separated.
Suggested wording tweak
-- **Harnesses:** shell out to Ollama / Goose / OpenCode CLIs or call APIs directly +- **Runtimes:** inference backends (e.g., Ollama, vLLM) +- **Harnesses:** adapters (direct, goose, opencode) that call CLIs/APIsllm/project/user-flow.md (1)
48-59: Suggestion: mention runtimes in the catalog state for consistency.Since S3 now includes runtime in the matrix, it would help to surface runtime discovery in S2 as well.
Suggested wording tweak
-### State S2 — Catalog Browse (Tests / Harnesses / Models) -- **User sees**: Lists of available tests, harness adapters, and discoverable local models. +### State S2 — Catalog Browse (Tests / Runtimes / Harnesses / Models) +- **User sees**: Lists of available tests, runtimes, harness adapters, and discoverable local models. @@ - - Which harness(es) to run through? - - Which model(s) to benchmark? + - Which runtime(s) to use? + - Which harness(es) to run through? + - Which model(s) to benchmark?test/opencode-model.test.ts (1)
36-40: Consider adding a test for models with multiple slashes.The implementation uses
replaceAll("/", "__"), which handles multiple slashes. A test case like"org/sub/model"→"org__sub__model"would document this behavior explicitly.📝 Suggested additional test case
it("converts slash-separated models to stable keys", () => { expect(toOpenCodeModelKey("Qwen/Qwen2.5-14B-Instruct")).toBe( "Qwen__Qwen2.5-14B-Instruct", ); }); + + it("handles models with multiple slashes", () => { + expect(toOpenCodeModelKey("org/sub/model-name")).toBe( + "org__sub__model-name", + ); + });src/schemas/result.schema.ts (1)
145-146: UseRuntimeNameSchemainstead ofz.string()for theruntimefield.The
RuntimeNameSchemais defined incommon.schema.tsand provides typed validation for runtime names. Using it instead ofz.string()ensures consistent validation with the schema's intent and improves type safety across the codebase. Note thatplan.schema.tsusesz.string()for the same field, so this change would also bring consistency across related schemas.src/schemas/model-alias.schema.ts (1)
20-36: Consider a versioned wrapper schema for persisted alias files.If alias maps are stored independently (e.g., as standalone JSON), add a versioned wrapper to keep schema evolution safe without changing the internal map shape.
♻️ Suggested addition
+import { SCHEMA_VERSION } from "./common.schema.js"; export const ModelAliasMapSchema = z.record( z.string(), // alias name (e.g., "qwen3-8b") ModelAliasEntrySchema, ); + +/** Versioned wrapper for persisted alias files. */ +export const ModelAliasFileSchema = z.object({ + schemaVersion: z.string().default(SCHEMA_VERSION), + aliases: ModelAliasMapSchema, +}); + +export type ModelAliasFile = z.infer<typeof ModelAliasFileSchema>;Based on learnings, "Maintain versioned schemas with 'schemaVersion' early in Zod schema definitions".
src/harnesses/direct-adapter.ts (1)
59-87: Add exhaustive check forruntime.apiFormatswitch.The switch statement handles
"ollama"and"openai-compat"but lacks a default case. If a new API format is added to theApiFormattype in the future, TypeScript won't catch the missing handler at compile time, and the function would silently produceundefinedforoutput,promptTokens, andcompletionTokens.♻️ Proposed fix: add exhaustive check
case "openai-compat": { const response = await generateOpenAiCompat({ baseUrl: runtime.baseUrl, model, prompt: fullPrompt, timeoutMs, apiKey: process.env.VLLM_API_KEY, }); output = response.output; promptTokens = response.promptTokens; completionTokens = response.completionTokens; break; } + + default: { + const _exhaustive: never = runtime.apiFormat; + throw new Error(`Unsupported API format: ${_exhaustive}`); + } }src/runtimes/discovery.ts (1)
63-85: Consider parallelizing runtime discovery checks.The current sequential approach works but could be optimized for faster startup when multiple runtimes are configured.
♻️ Optional: Parallelize discovery checks
export async function discoverRuntimes( config: RuntimeDiscoveryConfig, ): Promise<RuntimeName[]> { - const available: RuntimeName[] = []; - - // Check Ollama - const ollamaAvailable = await isRuntimeAvailable("ollama", config); - if (ollamaAvailable) { - available.push("ollama"); - } else { - logger.debug({ baseUrl: config.ollamaBaseUrl }, "Ollama runtime not available"); - } - - // Check vLLM - const vllmAvailable = await isRuntimeAvailable("vllm", config); - if (vllmAvailable) { - available.push("vllm"); - } else { - logger.debug({ baseUrl: config.vllmBaseUrl }, "vLLM runtime not available"); - } - - return available; + const runtimes: RuntimeName[] = ["ollama", "vllm"]; + const results = await Promise.all( + runtimes.map(async (name) => { + const available = await isRuntimeAvailable(name, config); + if (!available) { + const baseUrl = name === "ollama" ? config.ollamaBaseUrl : config.vllmBaseUrl; + logger.debug({ baseUrl }, `${name} runtime not available`); + } + return { name, available }; + }), + ); + return results.filter((r) => r.available).map((r) => r.name); }src/runner/item-executor.ts (1)
57-61: Consider consolidatingRuntimeUrlswith the schema or runtimes module.The
RuntimeUrlsinterface duplicates fields fromRuntimeConfiginsrc/runtimes/index.ts. This could be imported or shared to maintain single source of truth.test/schemas.test.ts (1)
30-33: Consider adding a test for the "vllm" runtime name.The current test only validates "ollama". Adding "vllm" would ensure complete coverage of the
RuntimeNameSchemaenum.♻️ Add vllm runtime name test
it("should validate runtime names", () => { expect(RuntimeNameSchema.parse("ollama")).toBe("ollama"); + expect(RuntimeNameSchema.parse("vllm")).toBe("vllm"); expect(() => RuntimeNameSchema.parse("unknown")).toThrow(); });src/harnesses/opencode-adapter.ts (3)
1-14: File exceeds 500-line limit; consider splitting by responsibility.At 964 lines, this file is nearly 2x the 500-line guideline. Consider extracting:
- Output parsing (lines 264-367):
normalizeOpenCodeOutput, tool-call extraction →opencode-output.ts- Tool call extraction (lines 117-253): Tool name detection, content extraction →
opencode-tool-call.ts- Process management (lines 382-442):
forceKillProcess,computeStaleOutputTimeoutMs→ shared utilityThis would improve testability and align with the "split by responsibility" guideline.
As per coding guidelines: "Keep files under 500 lines; split aggressively by responsibility."
279-312: JSON parsing without Zod validation.The JSONL parsing at lines 279-312 uses plain
JSON.parsewith type assertions. Per coding guidelines, boundary inputs (including JSON responses) should be validated with Zod.While this is internal parsing of OpenCode's output format (not a public boundary), using Zod would provide better error messages and runtime safety.
As per coding guidelines: "Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as 'unknown' and parse immediately with Zod."
689-762: Complex Promise.race pattern is hard to follow.The timeout and stale-output detection setup (lines 689-762) uses nested closures and shared mutable state (
killAttempted,timedOut,staleKilled). This works but is error-prone.Consider extracting this into a dedicated timeout manager class or using a library pattern for cleaner abort handling.
src/runner/plan-builder.ts (1)
183-200: Redundant runtime creation in error path.When models don't match the filter, the code creates new runtime instances (lines 187-192) just to fetch available models for the error message. These runtimes were already created earlier in the loop.
Consider caching the available models during the first discovery pass to avoid recreating runtimes.
♻️ Proposed refactor to cache discovered models
+ // Track all discovered models for error messages + const allDiscoveredModels = new Map<RuntimeName, string[]>(); + for (const runtimeName of runtimes) { const runtime = createRuntime(runtimeName, { ollamaBaseUrl: config.ollamaBaseUrl, vllmBaseUrl: config.vllmBaseUrl, defaultTimeoutMs: config.generateTimeoutMs, }); // ... ping check ... const discovered = await runtime.listModels(); + allDiscoveredModels.set(runtimeName, discovered); // ... filtering logic ... } // In error path, use cached models instead of re-fetching if (allModels.length === 0) { if (config.models.length > 0) { const availableByRuntime: string[] = []; - for (const runtimeName of runtimes) { - const runtime = createRuntime(runtimeName, { ... }); - const available = await runtime.listModels(); + for (const [runtimeName, available] of allDiscoveredModels) { if (available.length > 0) { availableByRuntime.push(`${runtimeName}: ${available.slice(0, 5).join(", ")}...`); } }
- Harden streaming parsers (NDJSON/SSE) with Zod boundary validation - Split Goose/OpenCode helpers to keep files <500 lines - Fix model alias mismatches and schemaVersion single source of truth - Split dashboard aggregations into focused modules - Document vLLM + OrbStack setup and compose CLI args
There was a problem hiding this comment.
Actionable comments posted: 10
🤖 Fix all issues with AI agents
In `@apps/dashboard/src/lib/aggregations-core.ts`:
- Around line 169-179: The p90 calculation currently uses
Math.floor(durations.length * 0.9) which for length 10 yields index 9 (the max);
update the p90Index calculation to use Math.ceil(durations.length * 0.9) - 1 to
select the element at which ~90% of observations are at or below (ensure you
clamp the index into [0, durations.length-1] and handle empty durations to avoid
division/index errors); modify the p90 assignment (the p90 property) to use this
new p90Index and keep the fallback to durations[durations.length - 1] if needed.
In `@llm/project/design-rules.md`:
- Line 34: The dimension ordering on the noted line uses
"runtime/model/harness/test/passType" which swaps harness and model; update that
instance to "runtime/harness/model/test/passType" to match the rest of the
document (see other occurrences at lines referencing
runtime/harness/model/test/passType) and verify there are no other swapped
occurrences of the phrase "runtime/model/harness/test/passType" remaining; keep
the exact token order "runtime/harness/model/test/passType".
In `@src/harnesses/goose-adapter.ts`:
- Around line 70-106: The switch on runtime.apiFormat in goose-adapter.ts sets
provider for "ollama" and "openai-compat" but lacks a default branch, so
provider can remain undefined; add a default case to the switch that assigns a
sensible fallback provider (e.g., provider = "unknown" or throw a clear error)
and/or log a warning; ensure the default handles extraEnv setup expectations or
short-circuits further logic that relies on provider to avoid runtime errors
(refer to symbols: runtime.apiFormat, provider, extraEnv,
normalizeOpenAiBasePath).
In `@src/harnesses/goose-output.ts`:
- Around line 138-150: The code treats obj.arguments as an object without
guarding against null (since typeof null === "object"), so update the checks
before casting to ensure obj.arguments !== null && typeof obj.arguments ===
"object"; then use the guarded value when creating args and reading
args.file_text / args.fileText in the developer__text_editor branch (and apply
the same null guard to the analogous blocks referenced around the other checks
at the locations that read args in the file, e.g., the other branches at the
spots you noted), so that args is only used when obj.arguments is non-null and
an object to avoid TypeError on property access.
- Around line 68-75: The replacements in decodeEscapedText currently run the
double-backslash replacement last, causing sequences like "\\n" to be
mis-decoded; change the order so the backslash unescape (pattern /\\\\/g ->
"\\") runs before the other escapes (e.g., /\\n/g, /\\r/g, /\\t/g, /\\"/g) to
ensure literal backslash+char sequences remain intact; update the
decodeEscapedText function accordingly so /\\\\/g is applied first.
In `@src/harnesses/opencode-adapter.ts`:
- Around line 44-48: Update the exported OpenCode harness factory JSDoc for the
function createOpenCodeHarness (the OpenCode harness adapter creator) to include
`@throws` tags: document that it may throw a timeout error, a stale-output error,
an error for non-zero exit codes, and an empty-output error; for each `@throws`
include the error type/name (or a brief description if there is no dedicated
Error subclass) and a short sentence describing the condition under which it’s
thrown so callers know what to catch.
In `@src/harnesses/opencode-config.ts`:
- Around line 125-144: The returned object spreads process.env (values
string|undefined) but asserts Record<string,string>, risking undefineds; update
the return in opencode-config.ts to first build a safeEnv by filtering out
undefined values from Object.entries(process.env) (keep only entries where value
!== undefined) and then spread safeEnv along with the OPENCODE_* keys (and the
runtimeName conditional OPENCODE_EXPERIMENTAL_OUTPUT_TOKEN_MAX) so the final
object truly satisfies Record<string,string>; ensure you use
Object.fromEntries/back to an object before returning.
In `@src/lib/openai-compat-client.ts`:
- Around line 66-76: The generateOpenAiCompat function currently uses timeoutMs
without validating it, which can cause immediate aborts or undefined behavior;
add an early guard in generateOpenAiCompat to validate that timeoutMs is a
finite positive number (e.g., Number.isFinite(timeoutMs) && timeoutMs > 0) and
throw a clear Error if not, before creating the AbortController or calling
setTimeout; reference the timeoutMs parameter, the AbortController usage and
timeoutId creation so the check runs at the top of the function and prevents
proceeding with an invalid timeout.
- Around line 131-157: The try/catch around JSON.parse currently swallows parse
errors silently; update the catch to capture the error (e) and call log.debug
with the error and the raw data (e.g. log.debug({ err: e, raw: data }, "Skipping
malformed SSE chunk")) so malformed SSE JSON is recorded like schema failures;
keep behavior of skipping the chunk but ensure you only log parse errors (leave
other logic around ChatCompletionChunkSchema, parsed, chunk, delta,
promptTokens, completionTokens unchanged).
- Around line 111-185: The SSE parsing loop uses a TextDecoder with stream: true
but never flushes its internal buffer when the reader ends; after the
while(reader.read()) loop completes call decoder.decode() (without
{stream:true}) to get any remaining decoded text, append it to the existing
buffer, and then run the same "process remaining buffer content" logic that
checks buffer.trim().startsWith("data: ") and parses the final chunk (the code
around decoder, buffer, the reader.read() loop, and the final buffer processing
using ChatCompletionChunkSchema and output/promptTokens/completionTokens).
🧹 Nitpick comments (11)
src/runtimes/vllm-runtime.ts (2)
180-191: Consider defensive parsing for the models response.The type assertion trusts the response structure. If the API returns an unexpected format, this could throw a confusing error.
♻️ Optional: Add defensive check
// OpenAI-compatible format const data = (await response.json()) as { data: Array<{ id: string; object: string; created?: number; owned_by?: string; }>; }; + if (!Array.isArray(data?.data)) { + throw new Error("Unexpected response format from /v1/models"); + } + return data.data.map((m) => m.id);
193-206: Consider documenting the size estimation heuristic.The 0.6 multiplier for estimating model size is a reasonable approximation for quantized models, but a brief inline comment explaining the rationale would aid maintainability.
📝 Optional: Add explanatory comment
async getModelInfo(model: string): Promise<ModelInfo> { // vLLM doesn't expose parameter counts via API // Estimate from model name const parametersBillions = estimateParametersFromName(model); - // Estimate size in bytes (rough: ~0.5-1 byte per parameter for quantized) + // Estimate size in bytes: ~0.6 bytes/param assumes typical int4/int8 quantization + // fp16 would be ~2 bytes/param, fp32 ~4 bytes/param const sizeBytes = parametersBillions * 1e9 * 0.6;src/harnesses/opencode-process.ts (1)
56-66: Redundant try/catch withreject: false.When
reject: falseis set, execa won't throw on non-zero exit codes, making the surrounding try/catch blocks unnecessary. You can simplify by removing the try/catch wrappers.♻️ Suggested simplification
- try { - await execa("pkill", ["-9", "-P", String(pid)], { reject: false }); - } catch { - // ignore - } + await execa("pkill", ["-9", "-P", String(pid)], { reject: false }); - try { - await execa("kill", ["-9", String(pid)], { reject: false }); - } catch { - // ignore - } + await execa("kill", ["-9", String(pid)], { reject: false });src/runner/plan-builder.ts (1)
182-206: Redundant runtime instantiation in error path.When no models match the filter, the code re-creates runtime instances (lines 187-193) that were already created earlier in the loop (lines 134-138). Consider caching the runtime instances or the discovered models to avoid redundant work.
♻️ Suggested improvement
+ // Cache discovered models for error reporting + const allDiscoveredModels = new Map<RuntimeName, string[]>(); + for (const runtimeName of runtimes) { const runtime = createRuntime(runtimeName, { // ... }); // ... const discovered = await runtime.listModels(); + allDiscoveredModels.set(runtimeName, discovered); // ... } // Later in error path: if (allModels.length === 0) { if (config.models.length > 0) { const availableByRuntime: string[] = []; for (const runtimeName of runtimes) { - const runtime = createRuntime(runtimeName, { ... }); - const available = await runtime.listModels(); + const available = allDiscoveredModels.get(runtimeName) ?? []; // ... } } }src/harnesses/opencode-output.ts (1)
171-183: Regex may miss escaped content in JSON strings.The regex pattern uses
["]to match both backticks and quotes as delimiters, but JSON string values typically use"for quotes. The pattern may not correctly extract content with embedded escaped characters or multiline strings encoded with\n`.💡 Consider more robust JSON value extraction
The current regex approach is a reasonable fallback when JSON.parse fails, but may produce incomplete extractions for complex escaped strings. If this becomes an issue in practice, consider using a more robust JSON string parser or accepting that this is a best-effort fallback.
src/harnesses/goose-adapter.ts (1)
221-228: Consider using a typed error class instead of property augmentation.The pattern of attaching
outputanddurationMsproperties to an Error via type assertion is awkward and loses type safety. Consider creating a custom error class or returning a structured failure result.♻️ Alternative approach
class GooseEmptyOutputError extends Error { constructor( message: string, public readonly output: string, public readonly durationMs: number, ) { super(message); this.name = "GooseEmptyOutputError"; } } // Usage: throw new GooseEmptyOutputError( "Goose returned no usable output and did not produce code", output, durationMs, );src/harnesses/opencode-adapter.ts (2)
304-314: Redundant error checks after Promise.race.If
timeoutPromiseorstalePromisewon the race, they already rejected and execution wouldn't reach lines 305-314. These checks would only trigger if the rejection was caught and execution continued, but that's not the case here since exceptions propagate.However, this defensive coding provides clarity and guards against future refactoring that might change error handling. Consider adding a comment explaining this is intentional defensive coding.
395-429: Code extraction fallback logic duplicates goose-adapter.Lines 396-429 closely mirror the same pattern in
goose-adapter.ts(lines 190-229). Consider extracting this into a shared helper to reduce duplication.♻️ Potential shared helper
// In a shared module like lib/harness-utils.ts export function handleCodeExtraction(opts: { output: string; solutionPath: string; durationMs: number; model: string; harnessName: string; toolCallDetected: boolean; log: pino.Logger; }): { codeFilePath?: string } { const { output, solutionPath, durationMs, model, harnessName, toolCallDetected, log } = opts; if (durationMs < 2000 && (!output || output.trim().length < MIN_OUTPUT_LENGTH)) { throw new Error( `${harnessName} returned empty output instantly (${durationMs}ms) - model "${model}" may not be recognized`, ); } const extracted = extractCode(output); const extractedCode = extracted.code.trim(); if (extractedCode.length >= MIN_OUTPUT_LENGTH && extracted.method !== "raw") { fs.writeFileSync(solutionPath, extractedCode, "utf-8"); log.info({ /* ... */ }, "Persisted extracted code"); return { codeFilePath: solutionPath }; } log.warn({ /* ... */ }, `${harnessName} finished without usable code output`); if (output.trim().length < MIN_OUTPUT_LENGTH) { throw new Error(`${harnessName} returned no usable output and did not produce code`); } return {}; }apps/dashboard/src/lib/aggregations-tooling.ts (1)
30-35: Consider extracting duplicatefailureTypeextraction logic.The pattern to extract
failureTypefrom eithergenerationFailure?.typeorgeneration?.failureTypeis repeated in four places (inferToolHarnesses,computeToolUseStats,computeToolScoreBreakdown,computeCompositeMetrics).♻️ Proposed helper extraction
+/** Extracts failure type from either generationFailure or generation. */ +function getFailureType(item: MatrixItemResult): string | undefined { + return item.generationFailure?.type ?? item.generation?.failureType; +} + export function inferToolHarnesses(items: MatrixItemResult[]): Set<string> { const toolHarnesses = new Set<string>(); for (const item of items) { - const failureType = item.generationFailure?.type ?? item.generation?.failureType; + const failureType = getFailureType(item); if (item.test === TOOL_SMOKE_TEST_SLUG || failureType === "tool_missing") { toolHarnesses.add(item.harness); } }Apply similar changes to the other three locations.
Also applies to: 48-51, 87-90, 185-188
apps/dashboard/src/lib/aggregations-core.ts (2)
127-143: Function name doesn't match key composition.The function is named
groupByModelHarnessbut the key includesruntime,model, andharness(line 137). Consider renaming togroupByRuntimeModelHarnessto accurately reflect the grouping dimensions.♻️ Proposed rename
/** - * Groups items by combined runtime + model + harness name. + * Groups items by combined runtime, model, and harness name. * * `@param` items - Matrix items * `@returns` Map keyed by `"runtime / model / harness"` */ -export function groupByModelHarness( +export function groupByRuntimeModelHarness( items: MatrixItemResult[] ): Map<string, MatrixItemResult[]> {
65-125: Consider a genericgroupByhelper to reduce duplication.The four
groupBy*functions share identical structure, differing only in the property accessor. A generic helper would reduce code duplication and simplify maintenance.♻️ Proposed generic helper
function groupBy<K extends keyof MatrixItemResult>( items: MatrixItemResult[], key: K ): Map<string, MatrixItemResult[]> { return items.reduce((map, item) => { const groupKey = String(item[key]); const group = map.get(groupKey) || []; group.push(item); map.set(groupKey, group); return map; }, new Map<string, MatrixItemResult[]>()); } // Then export thin wrappers: export const groupByModel = (items: MatrixItemResult[]) => groupBy(items, "model"); export const groupByRuntime = (items: MatrixItemResult[]) => groupBy(items, "runtime"); // etc.
- Log JSON.parse errors for SSE chunks at debug level - Flush TextDecoder at end of stream and process trailing buffer - Validate timeoutMs is finite positive
- Filter undefined env values before building OpenCode env map - Document OpenCode adapter throw conditions
- Fix escape decoding order for double-escaped sequences - Guard tool-call arguments against null objects - Add exhaustive default handling for runtime apiFormat
- Standardize dimension order to runtime/harness/model/test/passType - Use ceil-based p90 index with clamping
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@llm/project/design-rules.md`:
- Around line 30-34: The document inconsistently uses "pass-type" and
"passType"; normalize to a single form (use "passType") throughout this section
and any related headings or bullet points (e.g., the runtime × harness × model ×
test × passType matrix and the naming guidance line that lists
runtime/harness/model/test/passType) so all occurrences match; update any inline
references and examples to use "passType" and ensure the naming consistency
sentence uses that exact token.
In `@src/harnesses/opencode-config.ts`:
- Around line 35-55: Update the JSDoc for all exported functions in this file
(including resolveOpenCodeToolOutputRoot and the OpenCode config builder
function) to add `@throws` tags describing the kinds of errors that can be raised
(e.g., validation errors and errors thrown by toOpenAiCompatBaseUrl and
toOpenCodeModelKey); for each function include a brief sentence under `@throws`
specifying the error type/condition (e.g., "throws Error when base URL is
invalid" or "throws ValidationError on invalid model key"). Ensure the `@throws`
entries are present alongside existing `@param/`@returns blocks and reference the
relevant converter/validator functions (toOpenAiCompatBaseUrl,
toOpenCodeModelKey) so readers know the source of exceptions.
- Around line 56-108: buildOpenCodeConfig (and the sibling functions
buildOpenCodeEnv and resolveOpenCodeToolOutputRoot) currently consume
runtime/env-derived inputs directly; add Zod validation at each function entry
to parse the incoming data as unknown and enforce required invariants (e.g.,
non-empty runtimeBaseUrl and model, allowed runtimeName values, optional apiKey
string) before using them. Create small Zod schemas that mirror the expected
BuildOpenCodeConfigOpts shape and use schema.parse(...) at the top of
buildOpenCodeConfig (and analogous schemas/parses in buildOpenCodeEnv and
resolveOpenCodeToolOutputRoot) so downstream code (e.g., use of
toOpenAiCompatBaseUrl, toOpenCodeModelKey, and providerOptions construction)
always receives validated, typed values and throws a clear error on invalid
inputs.
In `@src/lib/openai-compat-client.ts`:
- Around line 17-29: The OpenAiCompatGenerateOpts boundary object is not fully
validated (only timeoutMs is checked) so add a Zod schema for
OpenAiCompatGenerateOpts and parse the incoming opts immediately where it enters
the module (e.g., in the function that sends the request—locate usage of
OpenAiCompatGenerateOpts and any functions like generate/openaiCompatGenerate in
this file); the schema must validate baseUrl (string, url-like), model (nonempty
string), prompt (string), timeoutMs (number, positive integer), and apiKey
(optional string) and you should replace direct property access with the
parsed/validated result; apply the same schema/parse logic to the other entry
points mentioned (the code around lines 66-74) so all external inputs are
treated as unknown and validated before use.
🧹 Nitpick comments (3)
src/lib/openai-compat-client.ts (1)
126-199: Deduplicate SSE data handling to avoid drift.
The loop path and final-buffer path reimplement the same parsing logic. A small helper keeps behavior consistent and easier to maintain.Proposed refactor
+ function handleSseData(data: string): void { + if (data === "[DONE]") return; + try { + const parsed = JSON.parse(data) as unknown; + const result = ChatCompletionChunkSchema.safeParse(parsed); + if (!result.success) { + log.debug( + { issues: result.error.issues }, + "Skipping SSE chunk that failed schema validation", + ); + return; + } + + const chunk = result.data; + const delta = chunk.choices?.[0]?.delta; + if (typeof delta?.content === "string" && delta.content.length > 0) { + output += delta.content; + } + if (chunk.usage?.prompt_tokens !== undefined) { + promptTokens = chunk.usage.prompt_tokens; + } + if (chunk.usage?.completion_tokens !== undefined) { + completionTokens = chunk.usage.completion_tokens; + } + } catch (err) { + log.debug( + { err, raw: data }, + "Skipping malformed SSE chunk", + ); + } + } + while (true) { const { done, value } = await reader.read(); if (done) break; @@ for (const line of lines) { const trimmed = line.trim(); if (!trimmed || !trimmed.startsWith("data: ")) continue; - - const data = trimmed.slice(6); // Remove "data: " prefix - if (data === "[DONE]") continue; - - try { - const parsed = JSON.parse(data) as unknown; - const result = ChatCompletionChunkSchema.safeParse(parsed); - if (!result.success) { - log.debug( - { issues: result.error.issues }, - "Skipping SSE chunk that failed schema validation", - ); - continue; - } - - const chunk = result.data; - const delta = chunk.choices?.[0]?.delta; - if (typeof delta?.content === "string" && delta.content.length > 0) { - output += delta.content; - } - // Token counts often come in the final chunk. - // Preserve zeros by checking undefined explicitly. - if (chunk.usage?.prompt_tokens !== undefined) { - promptTokens = chunk.usage.prompt_tokens; - } - if (chunk.usage?.completion_tokens !== undefined) { - completionTokens = chunk.usage.completion_tokens; - } - } catch (err) { - log.debug( - { err, raw: data }, - "Skipping malformed SSE chunk", - ); - } + handleSseData(trimmed.slice(6)); } } @@ if (buffer.trim().startsWith("data: ")) { - const data = buffer.trim().slice(6); - if (data !== "[DONE]") { - try { - const parsed = JSON.parse(data) as unknown; - const result = ChatCompletionChunkSchema.safeParse(parsed); - if (result.success) { - const chunk = result.data; - const delta = chunk.choices?.[0]?.delta; - if (typeof delta?.content === "string" && delta.content.length > 0) { - output += delta.content; - } - if (chunk.usage?.prompt_tokens !== undefined) { - promptTokens = chunk.usage.prompt_tokens; - } - if (chunk.usage?.completion_tokens !== undefined) { - completionTokens = chunk.usage.completion_tokens; - } - } - } catch (err) { - log.debug( - { err, raw: data }, - "Skipping malformed SSE chunk", - ); - } - } + handleSseData(buffer.trim().slice(6)); }src/harnesses/opencode-adapter.ts (1)
190-323: Cap stdout/stderr buffering to avoid unbounded memory.The current chunk arrays can grow without bound if OpenCode streams large outputs. Consider capping collected output to a max size (keeping the most recent N chars) to prevent memory spikes.
♻️ Suggested bounded buffering
-// Track stdout/stderr for output collection -const stdoutChunks: string[] = []; -const stderrChunks: string[] = []; +// Track stdout/stderr for output collection (bounded) +const MAX_OUTPUT_CHARS = 1_000_000; +let stdout = ""; +let stderr = ""; proc.stdout?.on("data", (chunk: Buffer) => { lastOutputTime = Date.now(); - stdoutChunks.push(chunk.toString()); + stdout = (stdout + chunk.toString()).slice(-MAX_OUTPUT_CHARS); }); proc.stderr?.on("data", (chunk: Buffer) => { lastOutputTime = Date.now(); - stderrChunks.push(chunk.toString()); + stderr = (stderr + chunk.toString()).slice(-MAX_OUTPUT_CHARS); }); -// Collect output from chunks -const stdout = stdoutChunks.join(""); -const stderr = stderrChunks.join(""); +// stdout/stderr already collected (bounded)src/harnesses/goose-adapter.ts (1)
268-272: Temp directories accumulate without cleanup.Each
generate()call creates a new temp directory under/tmp/plebdev-bench-goose-*but no cleanup occurs. For long benchmark sessions, this could consume significant disk space. The comment mentions "leave cleanup to the caller/GC" but Node.js doesn't automatically clean filesystem resources.Consider either:
- Cleaning up after scoring completes (requires coordination with caller)
- Reusing a single work directory per adapter instance
- Documenting this as expected behavior with guidance to clear
/tmp/plebdev-bench-goose-*between runs
- Replace pass-type with passType in matrix wording
|
@CodeRabbit full review |
✅ Actions performedFull review triggered. |
|
Caution Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted. Error details |
2 similar comments
|
Caution Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted. Error details |
|
Caution Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted. Error details |
There was a problem hiding this comment.
Actionable comments posted: 19
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (11)
test/compare.test.ts (1)
30-40:⚠️ Potential issue | 🟠 MajorInclude runtime in compare keys to prevent collisions across different runtime environments.
The key composition in
buildCompareKey(src/results/compare.ts:17-24) and test expectations (test/compare.test.ts:67-79) omitruntimefrom the composite key. This causes items with identicalmodel|harness|test|passTypefrom different runtimes to incorrectly match. For example, comparing runs from Ollama against runs from GPT-4 would collide on the same test configuration even though they are from different runtime environments.Update
buildCompareKeyto includeruntimein the key composition and align test expectations accordingly.src/schemas/result.schema.ts (1)
141-150:⚠️ Potential issue | 🔴 CriticalAddress breaking change:
runtimeis now required in MatrixItemResultSchema.Making
runtimemandatory (line 146) will fail validation on older run.json files that lack this field. WhileschemaVersionexists in the schema, the reader and writer do not implement version gating or migrations. Confirm either:
- All legacy run.json files will be regenerated/migrated, or
- Add version-aware parsing in
src/results/reader.tsto handle pre-runtime resultsapps/dashboard/src/components/charts/composite-score-chart.tsx (2)
1-4:⚠️ Potential issue | 🟡 MinorAdd Exports/Invariants to the file header.
The header documents purpose but omits the required exports/invariants.✍️ Suggested update
/** * Purpose: Composite score bar chart showing pass rate, tool success, and frontier score. * Replaces simple pass rate chart with multi-metric grouped bars. + * Exports: CompositeScoreChart + * Invariants: Expects MatrixItemResult data; metrics are normalized to a 0–100 scale. */As per coding guidelines, "Begin every file with a short header documenting purpose, exports, and invariants".
295-358:⚠️ Potential issue | 🟡 MinorAdd TSDoc for the exported component.
Provide purpose/params/returns/throws forCompositeScoreChart.✍️ Suggested update
-export function CompositeScoreChart({ items, onDimensionClick }: CompositeScoreChartProps) { +/** + * Purpose: Render composite score charts by model/runtime/harness/test. + * `@param` props - Component props. + * `@param` props.items - Matrix item results for the run. + * `@param` props.onDimensionClick - Optional handler for row clicks by dimension. + * `@returns` Composite score chart card element. + * `@throws` None. + */ +export function CompositeScoreChart({ items, onDimensionClick }: CompositeScoreChartProps) {As per coding guidelines, "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".
apps/dashboard/src/lib/types.ts (2)
1-5:⚠️ Potential issue | 🟡 MinorAdd Exports/Invariants to the file header.
The header has purpose but is missing exports and invariants.✍️ Suggested update
/** * Purpose: Dashboard data types mirroring CLI schemas. * These should stay in sync with src/schemas/ in the CLI. * Source of truth: src/schemas/result.schema.ts, plan.schema.ts, common.schema.ts + * Exports: Dashboard schema types and helpers used by the UI. + * Invariants: Mirrors CLI schemas; update alongside src/schemas changes. */As per coding guidelines, "Begin every file with a short header documenting purpose, exports, and invariants".
262-265:⚠️ Potential issue | 🟡 MinorExpand
isToolSmokeItemTSDoc to include params/returns/throws.
This is an exported function and should include the full doc fields.✍️ Suggested update
-/** Checks if an item is a tool-smoke test */ -export function isToolSmokeItem(item: { test: string }): boolean { +/** + * Purpose: Checks if an item is a tool-smoke test. + * `@param` item - Matrix item containing a test slug. + * `@returns` True when the test slug is TOOL_SMOKE_TEST_SLUG. + * `@throws` None. + */ +export function isToolSmokeItem(item: { test: string }): boolean {As per coding guidelines, "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".
test/stats.test.ts (1)
1-3:⚠️ Potential issue | 🟡 MinorAdd Exports/Invariants to the file header.
The header currently only states purpose.✍️ Suggested update
/** * Purpose: Test run statistics calculation and formatting. + * Exports: none (test file) + * Invariants: Unit tests only; no external I/O or time-based flakiness. */As per coding guidelines, "Begin every file with a short header documenting purpose, exports, and invariants".
src/schemas/plan.schema.ts (1)
1-6:⚠️ Potential issue | 🟡 MinorAdd Invariants to the file header.
Purpose and exports are present, but invariants are missing.✍️ Suggested update
/** * Purpose: RunPlan schema capturing the expanded matrix before execution. * Exports: MatrixItemSchema, MatrixItem, RunPlanSchema, RunPlan + * Invariants: schemaVersion is defined first and mirrors plan.json structure. * * The plan is written to results/<runId>/plan.json for reproducibility. */As per coding guidelines, "Begin every file with a short header documenting purpose, exports, and invariants".
test/schemas.test.ts (1)
1-3:⚠️ Potential issue | 🟡 MinorAdd exports/invariants to the file header.
The header lists purpose only; please include exports/invariants (even “Exports: none”).
As per coding guidelines: “Begin every file with a short header documenting purpose, exports, and invariants.”✏️ Suggested header tweak
/** * Purpose: Test Zod schema validation for config, plan, and result schemas. + * Exports: none + * Invariants: Tests are deterministic; no network calls. */llm/implementation/harnesses-implementation.md (2)
110-123:⚠️ Potential issue | 🟡 MinorDoc example still shows Goose
-q, but adapter no longer passes it.Either update the doc snippet or reintroduce the flag so behavior stays consistent.
196-203:⚠️ Potential issue | 🟡 MinorOpenCode CLI example includes
--agent build, but adapter doesn’t.Align the documentation with actual invocation (or add the flag if still required).
🤖 Fix all issues with AI agents
In `@apps/dashboard/src/lib/aggregations-tooling.ts`:
- Around line 85-106: The current logic sets toolSuccessRate to 0 when toolTotal
is 0, penalizing groups that never expect tools; change the fallback to neutral
1.0 by replacing the ternary that computes toolSuccessRate with one that returns
1.0 when toolTotal === 0 (i.e. toolSuccessRate = toolTotal > 0 ? (toolTotal -
toolMissing) / toolTotal : 1.0). Apply the same change to the second occurrence
of this computation (the block around the other mention at lines ~184-201) so
groups with no TOOL_SMOKE_TEST_SLUG/test items are treated as neutral rather
than failing.
In `@llm/implementation/setup-phase-implementation.md`:
- Around line 24-25: Update the stale schema version references in the docs so
they match the code: replace occurrences of "0.2.0" with "0.2.1" wherever the
SCHEMA_VERSION constant is referenced (e.g., in the lines that mention
common.schema.ts and config.schema.ts and the other two occurrences referenced
in the comment); ensure any textual mentions of SCHEMA_VERSION reflect 0.2.1 to
match the actual constant in code.
In `@llm/implementation/vllm-orbstack-setup.md`:
- Around line 100-103: The snippet uses a user-specific absolute path in the cd
command which is not portable; update the markdown example in
vllm-orbstack-setup.md to use a placeholder (e.g., ~/path/to/project,
<project-dir>, or $PROJECT_DIR) or a relative path, and add a brief note telling
readers to replace the placeholder with their own project directory so the
subsequent command (the bun pb invocation) works as copy/paste.
In `@README.md`:
- Around line 115-125: The README example uses a user-specific absolute path (cd
/Users/plebdev/Desktop/code/plebdev-bench); update that line to use a portable
placeholder (e.g., /path/to/plebdev-bench or ${PROJECT_DIR} or ~/<repo>) so the
command block is reusable across machines, keeping the rest of the bun pb
invocation (flags like --runtimes, --harnesses, --vllm-model, etc.) unchanged.
In `@src/harnesses/direct-adapter.ts`:
- Around line 74-81: The code passes process.env.VLLM_API_KEY directly into
generateOpenAiCompat; instead, parse and validate VLLM_API_KEY at module
initialization using Zod (treat env as unknown), export or define a constant
like vllmApiKey and validate it (e.g.,
z.string().nonempty().parse(process.env.VLLM_API_KEY as unknown)); then update
the call inside the "openai-compat" case to use apiKey: vllmApiKey instead of
process.env.VLLM_API_KEY and remove direct env usage elsewhere (refer to
generateOpenAiCompat and the "openai-compat" case to locate the change).
In `@src/harnesses/goose-adapter.ts`:
- Around line 57-65: The tool is using process.cwd() as executionCwd while
solutionPath is in workDir, so files written by Goose may land in the repo root
and be missed; update the code to run Goose with the working directory set to
workDir (i.e., set executionCwd to workDir or pass workDir as the cwd when
spawning/running the tool) wherever executionCwd is used (referencing runId,
workDir, solutionPath, executionCwd, SOLUTION_FILENAME), and apply the same
change to the second invocation block around the later lines (the other
executionCwd usage) so the tool's relative outputs are written into workDir.
In `@src/harnesses/opencode-output.ts`:
- Around line 163-187: The fallback in extractFromToolCall currently manually
unescapes sequences (\\n, \\t, \\\") which can corrupt legitimate code literals;
change the fallback to avoid manual replacements by either returning the raw
matched group or properly decoding it with JSON.parse on a quoted string: after
finding contentMatch[1] in extractFromToolCall, construct a quoted string (e.g.,
`"` + contentMatch[1] + `"`) and pass it to JSON.parse to get correct
unescaping, falling back to returning contentMatch[1] unchanged if JSON.parse
fails; keep the rest of extractFromToolCall (including the call to
extractFromToolCallObject) unchanged.
In `@src/harnesses/opencode-process.ts`:
- Around line 26-33: The JSDoc for the exported function that "forcefully kills
a process and its entire process tree" is missing `@returns` and `@throws` tags;
update the documentation for that function (the one with parameters proc, pid,
log, reason) to include a `@returns` describing the resolved value or void and a
`@throws` describing which errors are thrown (e.g., when killing fails or
process-tree traversal errors occur), and make the same addition for the other
exported function around lines 75–80 so all exported functions include purpose,
`@param`, `@returns`, and `@throws` entries.
- Around line 81-86: The function computeStaleOutputTimeoutMs currently silently
accepts invalid timeoutMs; add a guard at the start of
computeStaleOutputTimeoutMs to validate timeoutMs (check
Number.isFinite(timeoutMs) and timeoutMs > 0) and throw a descriptive RangeError
(or TypeError) for non‑finite or non‑positive inputs so programmer/config errors
surface immediately; keep the rest of the logic using
MAX_STALE_OUTPUT_TIMEOUT_MS and STALE_OUTPUT_TIMEOUT_MS unchanged after the
validation.
In `@src/lib/model-aliases.ts`:
- Around line 68-122: The returned alias map from parseInlineAlias should be
validated with Zod before returning: after constructing the object (the entry
and the final {[aliasName]: entry} map), call ModelAliasMapSchema.safeParse(...)
on that map and, if validation fails, throw a clear Error containing the
validation issues; update parseInlineAlias to use ModelAliasMapSchema.safeParse
on the final ModelAliasMap and return only when validation passes (referencing
parseInlineAlias, ModelAliasMapSchema, and ModelAliasMap to locate the code).
In `@src/runner/index.ts`:
- Around line 151-216: The tool-smoke key (toolSmokeKey) currently uses only
item.harness and item.model, causing tool-smoke failures to incorrectly affect
items on different runtimes; update the key to include item.runtime (e.g.,
`${item.harness}::${item.model}::${item.runtime}`) wherever toolSmokeKey is
created and referenced so toolSmokeStatus entries are scoped per-runtime
(affecting the creation/lookup in the block that computes toolSmokeKey, the
toolSmokeStatus.get/set usage, and any logic relying on that key such as the
skip check with isToolHarness/isToolSmoke and the block that sets the status
after executeItem).
In `@src/runner/item-executor.ts`:
- Around line 101-106: The code is using an unnecessary type assertion on
item.runtime when calling createRuntime; remove the "as RuntimeName" assertion
and fix the typing chain so MatrixItem.runtime is inferred from
MatrixItemSchema/RuntimeNameSchema. Inspect and adjust the types where
MatrixItem is produced (e.g., buildRunPlan → plan.items) and where items are
consumed (executeItem and the createRuntime call) to ensure MatrixItem is
declared as z.infer<typeof MatrixItemSchema> (and that RuntimeName comes from
RuntimeNameSchema) so item.runtime already has type RuntimeName and the
assertion can be dropped.
In `@src/runner/plan-builder.ts`:
- Around line 137-138: The modelCanonicalMap is currently a Map<string,string>
keyed only by model name, causing cross-runtime collisions; change its structure
to be runtime-scoped (either use a Map<string, Map<string,string>> keyed by
runtime -> (modelName -> canonicalName) or use a composite key like
`${runtime}:${modelName}`) and update all usages that set/get entries and
iterate it (references: modelCanonicalMap declarations and all code blocks
interacting with it at the noted spots). Ensure functions that resolve aliases
(alias resolution logic around the modelCanonicalMap uses) pass and use the
runtime identifier when reading/writing the map so alias metadata is attributed
per-runtime.
In `@src/runtimes/index.ts`:
- Around line 1-7: Add an "Invariants" section to the top-of-file header
documenting the file-level guarantees: state that createRuntime(name:
RuntimeName) returns a Runtime whose apiFormat (or equivalent property) matches
the provided name and that createRuntime throws for unknown names; document that
discoverRuntimes() returns the list of available RuntimeName values present on
the system and isRuntimeAvailable(name) returns a boolean indicating presence;
include any type invariants for Runtime and RuntimeName (e.g., RuntimeName is
the canonical set of supported names, Runtime instances conform to the Runtime
interface). Reference createRuntime, discoverRuntimes, isRuntimeAvailable,
Runtime, and RuntimeName in the invariants text.
- Around line 17-52: The createRuntime function uses process.env.VLLM_API_KEY
directly when creating the vLLM runtime; validate and parse this environment
variable with Zod first (similar to src/harnesses/opencode-config.ts) before
passing it to createVllmRuntime: import z from "zod", define a simple schema for
the optional/required VLLM_API_KEY, parse/process.env to get a typed apiKey, and
then pass that validated apiKey into createVllmRuntime (referencing
createRuntime and createVllmRuntime so the change is localized).
In `@src/runtimes/vllm-lifecycle.ts`:
- Around line 111-130: The vLLM readiness loop currently casts the /v1/models
response with "as" which is unsafe; create a Zod schema (e.g., ModelsResponse =
z.object({ data: z.array(z.object({ id: z.string().optional() })).optional() }))
and use ModelsResponse.safeParse on the fetched JSON before mapping ids in the
function that polls using vllmBaseUrl, managed.startupTimeoutMs and
READY_POLL_MS; if validation fails, log the validation error with log.debug or
log.warn and continue the poll (do not throw), otherwise extract ids from the
parsed data and keep the existing readiness checks and logging.
In `@src/schemas/common.schema.ts`:
- Around line 11-20: The runtime names and type are duplicated; remove the local
definitions in src/runtimes/runtime.ts and import the canonical runtimeNames
array and RuntimeName type from src/schemas/common.schema.ts (use
RuntimeNameSchema if needed for runtime validation). Replace any local
RUNTIME_NAMES or typeof-based RuntimeName declarations with the imported
runtimeNames (or create a small alias const RUNTIME_NAMES = runtimeNames as
const if external code expects that identifier) and update all references to use
the imported RuntimeName type and runtimeNames array to ensure a single source
of truth.
In `@src/schemas/config.schema.ts`:
- Around line 11-52: The BenchConfigSchema's runtimes array currently uses
z.array(z.string()) allowing any string; update it to import RuntimeNameSchema
from common.schema.js and use z.array(RuntimeNameSchema).default([]) so each
item is validated against the existing runtime name rules; modify the import
line to include RuntimeNameSchema and change the runtimes field in
BenchConfigSchema accordingly (look for BenchConfigSchema and the runtimes
property to apply the fix).
In `@test/ollama-adapter.test.ts`:
- Around line 1-3: Update the file header that currently reads "Purpose: Unit
tests for Ollama runtime and Direct adapter with mocked fetch." to include
Export and Invariant entries (e.g., add "Exports: none" and an "Invariants:"
line such as "Invariants: tests are hermetic and use mocked fetch"); locate the
existing header comment and append these two lines so the top-of-file comment
documents Purpose, Exports, and Invariants per the coding guideline.
🧹 Nitpick comments (14)
apps/dashboard/src/components/compare/compare-table.tsx (1)
22-22: Consider adding TSDoc for exported functions.The three exported functions (
CompareTable,RegressionsTable,ImprovementsTable) lack proper TSDoc documentation. As per coding guidelines: "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws."📝 Example TSDoc for CompareTable
+/** + * Displays a comparison table of matched benchmark items with deltas. + * `@param` props.items - Array of matched items to compare + * `@param` props.showOnlyChanges - When true, filters to show only items with changes + */ export function CompareTable({ items, showOnlyChanges = false }: CompareTableProps) {Also applies to: 134-134, 175-175
results/20260208-122510-cb6911/plan.json (1)
506-512: Verify if this results file should be committed.According to the project rules,
results/is runtime output and should not be committed by default. If this is intentionally included as a fixture or example, consider moving it to a fixtures directory or documenting why it's committed.Additionally, the
"models": 2count may be misleading since both runtimes use the same logical model (viamodelAlias: "qwen2.5-14b"). The count reflects uniquemodelfield values, not unique aliases. Consider clarifying this distinction in documentation or renaming touniqueModelStrings.src/harnesses/opencode-process.ts (1)
56-64: Consider centralizing process execution insrc/lib/exec.ts.This keeps process orchestration consistent with the repo’s execution conventions and reduces duplication.
Based on learnings: “Put process execution logic in src/lib/exec.ts.”
apps/dashboard/src/components/run-detail/item-detail-dialog.tsx (1)
50-52: Consider a fallback label for missing runtime.Line 51 now renders
item.runtime. If older results without a runtime are still viewable, this will show “undefined · …”. A small fallback avoids a rough UX.Suggested tweak
- {item.runtime} · {item.model} via {item.harness} + {(item.runtime ?? "unknown runtime")} · {item.model} via {item.harness}apps/dashboard/src/components/run-detail/run-detail-page.tsx (1)
144-146: Optional fallback for legacy plan summaries.If older
plan.jsonfiles don’t include these new counts, this will renderundefined. A small guard keeps the UI clean.Suggested tweak
- {plan.summary.runtimes} runtimes · {plan.summary.harnesses} harnesses · {plan.summary.tests} tests + {(plan.summary.runtimes ?? 0)} runtimes · {(plan.summary.harnesses ?? 0)} harnesses · {(plan.summary.tests ?? 0)} testsAGENTS.md (1)
53-54: Documentation should list all supported runtimes.Line 53 mentions only "Ollama" as a runtime backend, but the PR introduces vLLM as well (per
runtimeNames = ["ollama", "vllm"]incommon.schema.ts).📝 Suggested documentation update
-- `src/runtimes/` — runtime adapters (inference backends: Ollama) +- `src/runtimes/` — runtime adapters (inference backends: Ollama, vLLM)src/runtimes/ollama-runtime.ts (2)
99-109: Consider adding Zod validation for HTTP response.The HTTP response is cast with a type assertion rather than validated with Zod. Per coding guidelines, boundary inputs (including HTTP responses) should be parsed with Zod. This would provide runtime safety if the Ollama API changes.
♻️ Proposed Zod validation
+const OllamaTagsResponseSchema = z.object({ + models: z.array(z.object({ + name: z.string(), + size: z.number().optional(), + modified_at: z.string().optional(), + digest: z.string().optional(), + })), +}); + async listModels(): Promise<string[]> { const response = await fetchWithTimeout( `${baseUrl}/api/tags`, defaultTimeoutMs, ); if (!response.ok) { throw new Error( `Failed to list models: ${response.status} ${response.statusText}`, ); } - const data = (await response.json()) as { - models: Array<{ - name: string; - size: number; - modified_at: string; - digest: string; - }>; - }; + const raw: unknown = await response.json(); + const data = OllamaTagsResponseSchema.parse(raw); return data.models.map((m) => m.name); },As per coding guidelines: "Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as 'unknown' and parse immediately with Zod".
128-154: Consider adding Zod validation for model info response.Similar to
listModels(), the/api/showresponse should be validated with Zod for runtime safety.♻️ Proposed Zod validation
+const OllamaShowResponseSchema = z.object({ + details: z.object({ + parameter_size: z.string().optional(), + }).optional(), + model_info: z.object({ + "general.parameter_count": z.number().optional(), + }).optional(), +}).passthrough(); + async getModelInfo(model: string): Promise<ModelInfo> { // ... fetch code ... - const data = (await response.json()) as { - details?: { - parameter_size?: string; - }; - model_info?: { - "general.parameter_count"?: number; - }; - }; + const raw: unknown = await response.json(); + const data = OllamaShowResponseSchema.parse(raw); // ... rest of parsing logic ... }As per coding guidelines: "Validate at boundaries with Zod and pass typed data inward".
src/schemas/config.schema.ts (1)
20-41: MakemanagedVllm.modelrequired only when enabled.Right now
modelis always required even ifenabledis false. Consider optional + refine to reduce config friction.♻️ Suggested refactor
export const ManagedVllmSchema = z.object({ /** Enable managed vLLM lifecycle for a single run. */ enabled: z.boolean().default(false), /** The model to serve in vLLM (sets VLLM_MODEL for docker compose). */ - model: z.string().min(1), + model: z.string().min(1).optional(), ... -}); +}).refine((cfg) => !cfg.enabled || !!cfg.model, { + message: "model is required when enabled", + path: ["model"], +});src/runtimes/vllm-runtime.ts (1)
180-191: Consider validating the OpenAI-compatible response with Zod.The response from
/v1/modelsis cast directly to a typed object without validation. Per coding guidelines, boundary inputs (HTTP responses) should be parsed with Zod.♻️ Suggested validation schema
+const OpenAIModelsResponseSchema = z.object({ + data: z.array( + z.object({ + id: z.string(), + object: z.string().optional(), + created: z.number().optional(), + owned_by: z.string().optional(), + }).passthrough() + ), +}); // In listModels(): - const data = (await response.json()) as { - data: Array<{ - id: string; - object: string; - created?: number; - owned_by?: string; - }>; - }; + const json = await response.json() as unknown; + const data = OpenAIModelsResponseSchema.parse(json);As per coding guidelines: "Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as 'unknown' and parse immediately with Zod".
src/runtimes/discovery.ts (1)
1-8: File header could include invariants section.The header documents purpose and exports but omits invariants. Consider adding a brief invariants note (e.g., "Discovery checks are sequential and short-circuit on first reachable endpoint").
src/harnesses/goose-output.ts (1)
126-155: Inconsistent indentation in extractFromToolCall.The code has mixed indentation levels that reduce readability. Lines 128, 151, and the closing braces don't align with their logical blocks.
♻️ Suggested formatting fix
if (typeof parsed !== "object" || parsed === null) { return null; - } + } - const obj = parsed as Record<string, unknown>; + const obj = parsed as Record<string, unknown>; // ... rest of function with consistent indentationapps/dashboard/src/lib/aggregations-core.ts (2)
127-143: Function name may cause confusion.The function
groupByModelHarnessactually groups byruntime / model / harness(three dimensions), but the name suggests only model and harness. Consider renaming togroupByRuntimeModelHarnessfor clarity.✏️ Suggested rename
-/** - * Groups items by combined runtime + model + harness name. - * - * `@param` items - Matrix items - * `@returns` Map keyed by `"runtime / model / harness"` - */ -export function groupByModelHarness( +/** + * Groups items by combined runtime + model + harness name. + * + * `@param` items - Matrix items + * `@returns` Map keyed by `"runtime / model / harness"` + */ +export function groupByRuntimeModelHarness( items: MatrixItemResult[] ): Map<string, MatrixItemResult[]> {
200-215: Spread operator with Math.min/max has size limits.Using
Math.min(...scores)can cause stack overflow for very large arrays (>100k items). For typical benchmark sizes this is fine, but if runs scale significantly, consider using a reduce-based approach.♻️ More robust min/max calculation
if (scores.length === 0) return null; const sum = scores.reduce((a, b) => a + b, 0); + const minScore = scores.reduce((a, b) => Math.min(a, b), Infinity); + const maxScore = scores.reduce((a, b) => Math.max(a, b), -Infinity); return { avgScore: sum / scores.length, - minScore: Math.min(...scores), - maxScore: Math.max(...scores), + minScore, + maxScore, count: scores.length, };
|
Caution Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted. Error details |
3 similar comments
|
Caution Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted. Error details |
|
Caution Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted. Error details |
|
Caution Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted. Error details |
|
@CodeRabbit full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 17
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (11)
src/lib/openrouter-client.ts (1)
114-187:⚠️ Potential issue | 🟠 MajorUse Zod to validate the OpenRouter response at the boundary.
The
parseOpenRouterResponsefunction manually checks the shape of the HTTP response usingisRecord(),Array.isArray(), and type assertions instead of parsing with Zod. This violates the boundary-validation guideline: "Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as 'unknown' and parse immediately with Zod." Please introduce a Zod schema forOpenRouterResponseand parseresponseTextwithz.parse()immediately after JSON parsing to ensure type safety and catch schema drift.apps/dashboard/src/components/run-detail/timing-stats.tsx (1)
1-3:⚠️ Potential issue | 🟡 MinorExpand the file header to include exports and invariants.
Right now the header only states the purpose. Please add exports and invariants.
Proposed update
/** - * Purpose: Timing statistics component showing generation duration metrics. + * Purpose: Timing statistics component showing generation duration metrics. + * Exports: TimingStats + * Invariants: items are validated MatrixItemResult[]; generation.durationMs is in milliseconds. */As per coding guidelines, begin every file with a short header documenting purpose, exports, and invariants.
apps/dashboard/src/components/ui/dialog.tsx (1)
1-123:⚠️ Potential issue | 🟡 MinorMove/expand the module header and document exported components.
The header should be the first lines of the file and include purpose/exports/invariants, and exported component functions (DialogOverlay, DialogContent, etc.) need JSDoc with purpose/params/returns/throws.
Example header placement/update
- import { cn } from "@/lib/utils"; - import * as DialogPrimitive from "@radix-ui/react-dialog"; - import { X } from "lucide-react"; - /** - * Purpose: Dialog component for modals/drill-downs. - * Based on shadcn/ui Dialog (Radix UI) with terminal-native styling. - */ - import * as React from "react"; +/** + * Purpose: Dialog component primitives for modals/drill-downs. + * Exports: Dialog, DialogPortal, DialogOverlay, DialogTrigger, DialogClose, + * DialogContent, DialogHeader, DialogFooter, DialogTitle, DialogDescription + * Invariants: UI-only components; no side effects beyond rendering. + */ +import { cn } from "@/lib/utils"; +import * as DialogPrimitive from "@radix-ui/react-dialog"; +import { X } from "lucide-react"; +import * as React from "react";As per coding guidelines: “Every file must start with a descriptive header describing purpose, exports, and invariants” and “All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws).”
apps/dashboard/src/hooks/use-compare.ts (1)
1-26:⚠️ Potential issue | 🟡 MinorAlign module + hook JSDoc with required fields.
The module header should be first and include exports/invariants, and
useCompareneeds@returnsand@throws.Proposed doc updates
- import { compareRuns } from "@/lib/aggregations"; - import { fetchRun } from "@/lib/api"; - import type { CompareResult, RunResult } from "@/lib/types"; - /** - * Purpose: React hook for comparing two runs. - * Exports: useCompare - */ - import { useEffect, useState } from "react"; +/** + * Purpose: React hook for comparing two runs. + * Exports: useCompare, UseCompareResult + * Invariants: When runIdA/runIdB are missing, comparison/runA/runB are null and loading is false. + */ +import { compareRuns } from "@/lib/aggregations"; +import { fetchRun } from "@/lib/api"; +import type { CompareResult, RunResult } from "@/lib/types"; +import { useEffect, useState } from "react"; ... /** * Fetches two runs and computes their comparison. * `@param` runIdA - First run ID (baseline) * `@param` runIdB - Second run ID (comparison) + * `@returns` Hook state including comparison, runs, loading, and error. + * `@throws` Never. Errors are surfaced via the returned error field. */As per coding guidelines: “Every file must start with a descriptive header describing purpose, exports, and invariants” and “All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws).”
apps/dashboard/src/lib/api.ts (2)
12-66:⚠️ Potential issue | 🟡 MinorAdd
@returnsto exported function docs.The exported functions have JSDoc but are missing
@returns, which is required for exports.Example JSDoc fix (apply similarly to all exported functions)
/** * Fetches the list of all available runs from index.json. * The index is generated by the build-index script. * `@throws` Error if fetch fails or index doesn't exist + * `@returns` List of available runs. */As per coding guidelines: “All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws).”
17-56:⚠️ Potential issue | 🟠 MajorAdd Zod validation and Pino logging; complete JSDoc
@returnsentries.These fetch functions treat boundary inputs (
response.json()) as trusted, but HTTP responses must be validated at parse time. Add Zod schemas to parse and validate the JSON before returning, matching the coding guideline "Validate at boundaries with Zod and pass typed data inward."Additionally:
- Replace
console.warn(line 22–24) with Pino logger per the "Use Pino for logging" guideline.- Add
@returnsJSDoc entries tofetchRuns(),fetchRun(), andfetchPlan();fetchRunWithPlan()already has it.apps/dashboard/src/components/compare/compare-summary.tsx (1)
1-16:⚠️ Potential issue | 🟡 MinorAdd required module header fields and JSDoc for CompareSummaryCard.
The header should list exports/invariants, and
CompareSummaryCardneeds JSDoc with purpose/params/returns/throws.Proposed doc updates
/** - * Purpose: Compare summary component showing high-level deltas. - * Mirrors the CLI `bench compare` summary output. + * Purpose: Compare summary component showing high-level deltas. + * Exports: CompareSummaryCard + * Invariants: Pure presentational component; renders based on CompareSummary only. + * Mirrors the CLI `bench compare` summary output. */ ... +/** + * Renders high-level comparison deltas. + * `@param` summary - Aggregated comparison summary. + * `@returns` Summary cards for match/status/scoring/frontier deltas. + * `@throws` Never. Rendering errors should surface via React error boundaries. + */ export function CompareSummaryCard({ summary }: CompareSummaryCardProps) {As per coding guidelines: “Every file must start with a descriptive header describing purpose, exports, and invariants” and “All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws).”
apps/dashboard/src/components/ui/tabs.tsx (1)
9-56:⚠️ Potential issue | 🟡 MinorAdd TSDoc for exported components.
Tabs,TabsList,TabsTrigger, andTabsContentare exported but lack TSDoc with purpose/params/returns/throws.📘 Example TSDoc (apply to each exported component)
+/** + * Purpose: Render the tab list container. + * `@param` props - Tabs list props. + * `@returns` Tabs list element. + * `@throws` Does not throw. + */ const TabsList = React.forwardRef< React.ElementRef<typeof TabsPrimitive.List>, React.ComponentPropsWithoutRef<typeof TabsPrimitive.List> >(({ className, ...props }, ref) => (As per coding guidelines: "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".
src/harnesses/discovery.ts (2)
1-11:⚠️ Potential issue | 🟡 MinorAdd an explicit invariants section to the header.
The header lists purpose/exports but doesn’t document invariants.✏️ Suggested change
/** * Purpose: Discover available harnesses on the system. * Exports: discoverHarnesses, isHarnessAvailable * * Checks for: * - direct: Always available (runtime availability checked separately) * - goose: CLI available via `which goose` * - opencode: CLI available via `which opencode` * * Note: Runtime availability (e.g., Ollama) is checked separately. + * + * Invariants: + * - "direct" is always reported available. + * - CLI availability is determined solely by `which`. + * - Unknown harness names return false. */As per coding guidelines, "Begin every file with a short header documenting purpose, exports, and invariants".
31-56:⚠️ Potential issue | 🟡 MinorComplete
@throwsin exported JSDoc blocks.
Both exported functions are missing the required@throwsentry.✏️ Suggested change
/** * Check if a specific harness is available. * * `@param` name - Harness name to check * `@returns` true if the harness is available + * `@throws` {Error} Never. (CLI lookup errors are handled internally.) */ export async function isHarnessAvailable(name: HarnessName): Promise<boolean> { /** * Discover all available harnesses on the system. * * `@returns` Array of available harness names + * `@throws` {Error} Never. (CLI lookup errors are handled internally.) */ export async function discoverHarnesses(): Promise<HarnessName[]> {As per coding guidelines, "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
src/runner/item-executor.ts (1)
102-126:⚠️ Potential issue | 🟠 MajorDon’t swallow runtime/harness creation failures.
createRuntime/createHarnessthrow on unknown names (config/programmer errors). They currently get wrapped as generation failures, which hides misconfiguration. Move these outside the generation try/catch (and let them throw), or explicitly rethrow those errors.Suggested update
- try { - // Load prompt - log.debug("Loading prompt..."); - const prompt = loadPrompt(item.test, item.passType); - - // Create runtime instance - const runtime = createRuntime(item.runtime, { - ollamaBaseUrl: runtimeConfig.ollamaBaseUrl, - vllmBaseUrl: runtimeConfig.vllmBaseUrl, - defaultTimeoutMs: timeoutMs, - }); - - // Create harness adapter - log.debug({ harness: item.harness }, "Creating harness..."); - const harness = createHarness(item.harness); + // Load prompt (throws on missing file) + log.debug("Loading prompt..."); + const prompt = loadPrompt(item.test, item.passType); + + // Create runtime + harness (config errors should throw) + const runtime = createRuntime(item.runtime, { + ollamaBaseUrl: runtimeConfig.ollamaBaseUrl, + vllmBaseUrl: runtimeConfig.vllmBaseUrl, + defaultTimeoutMs: timeoutMs, + }); + log.debug({ harness: item.harness }, "Creating harness..."); + const harness = createHarness(item.harness); + + try {Based on learnings: "Throw on programmer/config errors; record runtime per-item failures in results instead of silent fallbacks".
🤖 Fix all issues with AI agents
In `@apps/dashboard/scripts/build-index.ts`:
- Around line 11-22: Remove the duplicated RunListItem interface declaration in
build-index.ts and replace it with an import of the shared RunListItem type from
the module that exports it; specifically, delete the local interface block named
RunListItem and add an import like `import { RunListItem } from '...';` pointing
to the shared types module that defines RunListItem so the file uses the single
canonical type export.
In `@apps/dashboard/src/App.tsx`:
- Around line 11-27: The exported React component App is missing the required
TSDoc; add a TSDoc block above the App declaration that documents its purpose
(what the component renders/route layout), any props (none — mark as none or
omit params if none), the return type (JSX.Element), and any possible throws
(none) to satisfy the "All exported functions must have TSDoc/JSDoc" rule;
ensure the TSDoc sits immediately above the function App() definition and covers
purpose, `@returns`, and `@throws` (if applicable).
In `@apps/dashboard/src/components/charts/frontier-eval-scatter.tsx`:
- Around line 26-30: Update the HARNESS_COLORS map so the key matches the
current harness name: replace the legacy "ollama" key with "direct" in the
HARNESS_COLORS constant (the mapping used in frontier-eval-scatter.tsx) so items
with harness: "direct" receive the intended "hsl(212, 100%, 67%)" color instead
of falling back to gray.
In `@apps/dashboard/src/components/charts/pass-rate-chart.tsx`:
- Around line 111-141: The exported PassRateChart component is missing JSDoc and
Zod boundary validation for its items prop; add a TSDoc/JSDoc block describing
purpose, params (PassRateChartProps, items: MatrixItemResult[]), returns and any
thrown errors, then validate the incoming items at the top of PassRateChart
using a Zod schema (e.g., z.array(MatrixItemResultSchema).parse(items)) before
calling computeBreakdown and groupByModel/groupByHarness/groupByTest, or if
validation occurs upstream document that assumption in the JSDoc and skip
re-parse to avoid duplicate work.
In `@apps/dashboard/src/components/charts/timing-distribution.tsx`:
- Around line 92-221: Add a TSDoc block above the exported TimingDistribution
component that documents its purpose, parameters, return value, and any possible
throws; specifically reference the TimingDistribution component and its prop
type TimingDistributionProps, describe that it renders a timing histogram
(including p50/p90 and item count) from items with generation.durationMs,
specify the param {TimingDistributionProps items} and the React.ReactElement
return, and note any error conditions (e.g., none thrown) or warnings if input
is empty.
In `@apps/dashboard/src/components/compare/delta-badge.tsx`:
- Around line 18-49: Add a full TSDoc block to the exported DeltaBadge function
(and any other exported components in this module) describing the component
purpose and including `@param` entries for value, suffix, invert, decimals (and
reference DeltaBadgeProps), an `@returns` describing the rendered JSX.Element, and
an `@throws` if the component can throw (or explicitly state it does not); place
the TSDoc immediately above the DeltaBadge declaration so tools and linters pick
up the documentation.
In `@apps/dashboard/src/components/compare/run-selector.tsx`:
- Around line 22-28: Add a TSDoc/JSDoc block above the exported RunSelector
function describing its purpose and usage, including `@param` entries for runs
(array of run objects), value (selected run id or value), onValueChange
(callback invoked when selection changes), placeholder (optional string), and
label (optional string), an `@returns` stating it returns a ReactElement, and an
`@throws` note if the component can throw (or state that it does not throw);
ensure the doc mentions prop types/shape briefly and matches the
RunSelectorProps signature so tooling and reviewers can verify the contract.
In `@apps/dashboard/src/components/run-detail/timing-stats.tsx`:
- Around line 11-16: The API currently returns unvalidated JSON that flows into
TimingStats (via fetchRun/fetchRunWithPlan → useRunDetail → RunDetailPage →
TimingStats), so add a Zod schema for the RunResult shape (including
MatrixItemResult array) and validate the response at the data boundary
(preferably inside fetchRun or fetchRunWithPlan in api.ts). Define a Zod schema
matching MatrixItemResult and the overall RunResult, import it into the fetch
function, call schema.parse (or safeParse and throw/log a clear error) on
response.json() before returning, and ensure downstream callers
(useRunDetail/RunDetailPage/TimingStats) consume the typed, validated data.
In `@apps/dashboard/src/components/ui/skeleton.tsx`:
- Around line 6-15: Add a full TSDoc block above the exported Skeleton function
(Skeleton(props: React.HTMLAttributes<HTMLDivElement>)) explaining its purpose
(renders a pulsing skeleton div), document each parameter (className and the
rest of props), state the return type (JSX.Element) and note throws behavior
(none) or that it does not throw; include brief usage example or remarks if
helpful and ensure the doc contains `@param`, `@returns`, and `@throws` tags per
guidelines.
In `@apps/dashboard/src/lib/types.ts`:
- Around line 207-214: Change MatchedItem.passType from string to the PassType
union so it matches MatrixItem and prevents invalid values: update the interface
MatchedItem to use passType: PassType (not string). Also add complete JSDoc to
the isToolSmokeItem function (including `@param` for the input parameter type and
`@returns` describing the boolean result) so the function has proper documentation
and type expectations; locate isToolSmokeItem by name and update its JSDoc block
accordingly.
In `@apps/dashboard/src/lib/utils.ts`:
- Around line 12-34: Add full TSDoc blocks to all exported utilities: cn,
formatDuration, formatNumber, formatPercent, formatDate, and formatShortDate;
for each function include a one-line description, `@param` entries describing each
parameter and its type/units (e.g., ms in milliseconds for formatDuration),
`@returns` describing the return value and format, and an `@throws` entry if the
function can throw (or explicitly state it does not throw). Update the existing
comment for formatDuration and add new TSDoc stubs for cn, formatNumber,
formatPercent, formatDate, and formatShortDate following the project pattern
(purpose, `@param`, `@returns`, `@throws`) so all exported functions comply with the
guidelines.
In `@apps/dashboard/src/pages/compare.tsx`:
- Around line 7-8: Add a TSDoc block above the exported ComparePage function
describing its purpose (what the component/page renders), any parameters (none),
the return value (JSX.Element) and any runtime errors it may throw; place the
TSDoc immediately above the ComparePage declaration and reference the rendered
ComparePageContent in the purpose and returns fields so reviewers can see the
mapping between ComparePage and ComparePageContent.
In `@apps/dashboard/src/pages/runs.tsx`:
- Around line 7-8: Add a TSDoc block to the exported RunsPage function that
documents its purpose, return value, and any errors it may throw; locate the
RunsPage export (export function RunsPage()) and add a comment above it
describing that the component renders/returns the RunListPage component
(purpose), specify the JSX.Element/ReactNode it returns (returns), and note any
runtime errors or exceptional conditions it may throw (throws) following the
project's TSDoc format and fields requirement.
In `@apps/dashboard/vite.config.ts`:
- Around line 16-36: The middleware registered in configureServer uses req.url
directly to build a filesystem path (in the block that calls
existsSync/readFileSync), which allows path traversal and mishandles query
strings; fix it by parsing req.url (e.g., new URL(req.url, "http://localhost")
or using url.parse) to strip the query, decodeURIComponent the pathname, then
resolve the target path with path.resolve(resultsDir, sanitizedPathSegment) (or
path.join + path.normalize) and verify the resolved path begins with resultsDir
(e.g., resolvedPath.startsWith(resultsDir)) before calling
existsSync/readFileSync; if the containment check fails, call next() to deny
access.
In `@src/runner/index.ts`:
- Around line 160-164: The startup of managed vLLM (the block checking
managedVllm && runtimeName === "vllm") must be wrapped in a try/catch so a
thrown error doesn't leave the container running; call
startManagedVllm(managedVllm, config.vllmBaseUrl) inside a try, set
managedVllmStarted = true only after it succeeds, and in the catch do a
best‑effort cleanup (e.g., call stopManagedVllm(managedVllm) or the equivalent
Docker-compose teardown), log the error, then rethrow so the outer finally can
run reliably.
In `@src/runtimes/discovery.ts`:
- Around line 32-55: The isRuntimeAvailable function currently returns
runtime.ping() directly which will propagate exceptions and abort discovery;
wrap the ping calls for both createOllamaRuntime and createVllmRuntime in
try/catch blocks inside isRuntimeAvailable so that any errors from
runtime.ping() are caught and the function returns false on error, preserving
existing return behavior on success; reference the isRuntimeAvailable function
and the runtime.ping() calls created via createOllamaRuntime and
createVllmRuntime when making this change.
In `@src/runtimes/ollama-runtime.ts`:
- Around line 87-135: The responses from /api/tags and /api/show in listModels
and getModelInfo must be validated with Zod at the boundary: add inline Zod
schemas for the tags response (an object with a models array of objects
containing name:string, size:number, modified_at:string, digest:string) and for
the show response (object with optional details.parameter_size:string and
optional model_info["general.parameter_count"]:number), then replace the type
assertions on (await response.json()) with schema.parse(await response.json())
calls to validate and coerce the data; on parse failure throw a clear Error
including the endpoint and validation errors so callers of listModels and
getModelInfo (which call fetchWithTimeout/baseUrl) fail fast with actionable
diagnostics.
🟡 Minor comments (37)
apps/dashboard/src/components/ui/card.tsx-35-47 (1)
35-47:⚠️ Potential issue | 🟡 MinorType mismatch: ref type should be
HTMLHeadingElement.The ref type parameter is
HTMLParagraphElement(line 36), but the component renders an<h3>element (line 39), which is anHTMLHeadingElement. This inconsistency can cause incorrect type inference when consumers use the ref.🐛 Proposed fix
const CardTitle = React.forwardRef< - HTMLParagraphElement, + HTMLHeadingElement, React.HTMLAttributes<HTMLHeadingElement> >(({ className, ...props }, ref) => (apps/dashboard/src/components/ui/button.tsx-45-55 (1)
45-55:⚠️ Potential issue | 🟡 MinorAdd TSDoc for exported
Button.Exported functions/components require TSDoc with purpose, params, returns, and throws.
As per coding guidelines, All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws.💡 Proposed update
-const Button = React.forwardRef<HTMLButtonElement, ButtonProps>( +/** + * Purpose: Render a button (or Slot child) with variant/size styling. + * `@param` props - Button props including variant, size, and asChild. + * `@param` ref - Forwarded ref to the underlying button element. + * `@returns` The rendered button or Slot child. + * `@throws` {Error} Not expected to throw. + */ +const Button = React.forwardRef<HTMLButtonElement, ButtonProps>(apps/dashboard/src/components/ui/button.tsx-1-8 (1)
1-8:⚠️ Potential issue | 🟡 MinorMove and expand the file header to the top with exports/invariants.
The header is currently after imports and doesn’t list exports/invariants. Please move it to the top and include those details.
As per coding guidelines, Begin every file with a short header documenting purpose, exports, and invariants.💡 Proposed update
-import { cn } from "@/lib/utils"; -import { Slot } from "@radix-ui/react-slot"; -import { type VariantProps, cva } from "class-variance-authority"; -/** - * Purpose: Button component with variants. - * Based on shadcn/ui Button with terminal-native styling. - */ -import * as React from "react"; +/** + * Purpose: Button component with variants for the dashboard UI. + * Exports: buttonVariants, Button, ButtonProps. + * Invariants: className is composed via cn(buttonVariants(...)); asChild toggles Slot vs button. + */ +import { cn } from "@/lib/utils"; +import { Slot } from "@radix-ui/react-slot"; +import { type VariantProps, cva } from "class-variance-authority"; +import * as React from "react";src/lib/failure-classifier.ts-26-28 (1)
26-28:⚠️ Potential issue | 🟡 MinorAdd
@throwsto exported function docs.Exported functions are required to document throws behavior. Both functions currently omit
@throws.📝 Suggested doc update
/** * Classifies a generation error message into a failure type. * * `@param` errorMessage - The error message from harness/execution * `@returns` The classified failure type + * `@throws` - Does not throw; returns "unknown" for unrecognized patterns. * * `@example` * ```typescript * classifyGenerationError("Request timed out after 60s") // => "timeout" * classifyGenerationError("Empty output received") // => "harness_error" * ``` */ @@ /** * Classifies a scoring error message into a failure phase. * * `@param` errorMessage - The error message from scoring * `@returns` The classified scoring failure type + * `@throws` - Does not throw; returns "unknown" for unrecognized patterns. * * `@example` * ```typescript * classifyScoringError("Code extraction failed: no code blocks") // => "extraction" * classifyScoringError("Import failed: SyntaxError") // => "import" * ``` */As per coding guidelines, All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws).
apps/dashboard/src/components/compare/run-selector.tsx-31-33 (1)
31-33:⚠️ Potential issue | 🟡 MinorAdd aria-labelledby to associate the visual label with the Select component.
The
<p>label is purely visual and provides no semantic connection to the Select for screen reader users. Associate the label with the SelectTrigger usingaria-labelledbyby adding anidto the<p>element and referencing it on the Select or SelectTrigger.<div className="space-y-1"> {label && ( <p id="run-selector-label" className="text-sm font-medium text-foreground-muted">{label}</p> )} <Select value={value} onValueChange={onValueChange}> <SelectTrigger aria-labelledby="run-selector-label" className="w-full md:w-[300px]">apps/dashboard/src/components/run-detail/timing-stats.tsx-11-16 (1)
11-16:⚠️ Potential issue | 🟡 MinorAdd TSDoc/JSDoc for the exported
TimingStatsfunction.This exported function needs a doc comment with purpose, params, returns, and throws.
Proposed doc comment
-export function TimingStats({ items }: TimingStatsProps) { +/** + * Renders timing statistics for matrix items. + * `@param` items - Validated matrix items used to compute timing stats. + * `@returns` A timing stats card or an empty-state card when no data exists. + * `@throws` Never. + */ +export function TimingStats({ items }: TimingStatsProps) {As per coding guidelines, all exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws.
src/lib/scorer.ts-88-91 (1)
88-91:⚠️ Potential issue | 🟡 MinorGuard test name construction against non‑serializable args.
JSON.stringify(testCase.args)at line 90 is outside thetryblock (which begins at line 92). IftestCase.argscontains BigInt values or circular references,JSON.stringifywill throw before entering the try-catch, causing the function to abort without returning a result.🔧 Suggested fix
-const name = - testCase.description || - `${testCase.fn}(${JSON.stringify(testCase.args).slice(1, -1)})`; +let name = testCase.description ?? testCase.fn; +if (!testCase.description) { + try { + const args = JSON.stringify(testCase.args).slice(1, -1); + name = `${testCase.fn}(${args})`; + } catch { + name = `${testCase.fn}(...)`; + } +}llm/project/phases/mvp-phase.md-27-28 (1)
27-28:⚠️ Potential issue | 🟡 MinorInclude vLLM in the auto-discover runtimes step.
Line 27 mentions only Ollama for runtime discovery, but the PR implements both Ollama and vLLM runtimes. Update for consistency with the implementation:
📝 Suggested fix
- - auto-discover runtimes (Ollama) + - auto-discover runtimes (Ollama, vLLM)AGENTS.md-52-55 (1)
52-55:⚠️ Potential issue | 🟡 MinorUpdate runtime adapters list to include vLLM.
The runtime bullet still names only Ollama even though vLLM is now supported.
Proposed doc update
- - `src/runtimes/` — runtime adapters (inference backends: Ollama) + - `src/runtimes/` — runtime adapters (inference backends: Ollama, vLLM)llm/project/project-overview.md-38-45 (1)
38-45:⚠️ Potential issue | 🟡 MinorReflect the runtime dimension in later bullets for consistency.
Since runtime is introduced here as a first-class dimension, the “Result Captures” and “Pass rate by dimension” bullets later in the file still omit runtime. Please update those bullets to keep the doc consistent with the 5D matrix.
Proposed doc update
- - Model, harness, test name, pass type (blind/informed) + - Runtime, model, harness, test name, pass type (blind/informed) ... - - Pass rate by dimension (model/harness/test) + - Pass rate by dimension (runtime/model/harness/test)src/harnesses/goose-openai.ts-14-31 (1)
14-31:⚠️ Potential issue | 🟡 MinorAdd
@throwsto the exported helper JSDoc.This export needs an explicit
@throwsentry (even if it never throws) to satisfy the documentation requirement.Proposed JSDoc update
/** * Normalizes an OpenAI-compatible base path for Goose. * * `@param` pathname - URL pathname (e.g., "/v1" or "/v1/chat/completions") * `@returns` Normalized base path (e.g., "v1") + * `@throws` Never. This helper only normalizes strings. */As per coding guidelines: “All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws).”
src/runtimes/vllm-lifecycle.ts-124-129 (1)
124-129:⚠️ Potential issue | 🟡 MinorAdd timeout to the readiness fetch call.
The
fetchcall lacks a timeout. If vLLM becomes unresponsive mid-startup, this could hang indefinitely within each poll iteration.🛡️ Proposed fix
try { - const res = await fetch(`${vllmBaseUrl}/v1/models`, { method: "GET" }); + const res = await fetch(`${vllmBaseUrl}/v1/models`, { + method: "GET", + signal: AbortSignal.timeout(5000), + }); if (!res.ok) {apps/dashboard/src/components/ui/tabs.tsx-1-7 (1)
1-7:⚠️ Potential issue | 🟡 MinorMove and expand the file header to include exports/invariants.
The header starts after imports and doesn’t document exports/invariants. Please move it to the top and include those details.
✍️ Suggested header update
-import { cn } from "@/lib/utils"; -import * as TabsPrimitive from "@radix-ui/react-tabs"; -/** - * Purpose: Tabs component for view switching. - * Based on shadcn/ui Tabs (Radix UI) with terminal-native styling. - */ -import * as React from "react"; +/** + * Purpose: Tabs component for view switching. + * Exports: Tabs, TabsList, TabsTrigger, TabsContent. + * Invariants: Uses Radix Tabs primitives with terminal-native styling classes. + */ +import { cn } from "@/lib/utils"; +import * as TabsPrimitive from "@radix-ui/react-tabs"; +import * as React from "react";As per coding guidelines: "Begin every file with a short header documenting purpose, exports, and invariants".
apps/dashboard/src/pages/run.tsx-1-11 (1)
1-11:⚠️ Potential issue | 🟡 MinorAdd a complete file header (purpose/exports/invariants) at the top.
The header is after imports and omits exports/invariants. Please move it to the top and include those details.
✍️ Suggested header update
-import { PageContainer, PageHeader } from "@/components/layout/page-container"; -import { - RunDetailPage, - RunDetailPageSkeleton, -} from "@/components/run-detail/run-detail-page"; -import { useRunDetail } from "@/hooks/use-run-detail"; -/** - * Purpose: Single run page route component. - * Displays run details by runId from URL params. - */ -import { useParams } from "react-router-dom"; +/** + * Purpose: Single run page route component. + * Exports: RunPage. + * Invariants: Renders loading/error states until both run and plan are available. + */ +import { PageContainer, PageHeader } from "@/components/layout/page-container"; +import { + RunDetailPage, + RunDetailPageSkeleton, +} from "@/components/run-detail/run-detail-page"; +import { useRunDetail } from "@/hooks/use-run-detail"; +import { useParams } from "react-router-dom";As per coding guidelines: "Begin every file with a short header documenting purpose, exports, and invariants".
apps/dashboard/src/pages/run.tsx-13-34 (1)
13-34:⚠️ Potential issue | 🟡 MinorAdd TSDoc for
RunPageexport.The exported component is missing purpose/returns/throws documentation.
📘 Example TSDoc
+/** + * Purpose: Render the run-detail route for the current runId param. + * `@returns` Run detail route content. + * `@throws` Does not throw. + */ export function RunPage() {As per coding guidelines: "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".
apps/dashboard/src/components/run-list/run-list-page.tsx-1-8 (1)
1-8:⚠️ Potential issue | 🟡 MinorAdd a complete file header (purpose/exports/invariants) at the top.
The header is after imports and omits exports/invariants. Please move it to the top and include those details.
✍️ Suggested header update
-import { PageContainer, PageHeader } from "@/components/layout/page-container"; -import { Skeleton } from "@/components/ui/skeleton"; -/** - * Purpose: Run list page component displaying all benchmark runs. - * Shows a grid of RunCard components sorted by date (newest first). - */ -import { useRuns } from "@/hooks/use-runs"; -import { RunCard } from "./run-card"; +/** + * Purpose: Run list page component displaying all benchmark runs. + * Exports: RunListPage. + * Invariants: Renders loading skeletons, empty state, or run cards. + */ +import { PageContainer, PageHeader } from "@/components/layout/page-container"; +import { Skeleton } from "@/components/ui/skeleton"; +import { useRuns } from "@/hooks/use-runs"; +import { RunCard } from "./run-card";As per coding guidelines: "Begin every file with a short header documenting purpose, exports, and invariants".
apps/dashboard/src/components/run-list/run-list-page.tsx-12-61 (1)
12-61:⚠️ Potential issue | 🟡 MinorAdd TSDoc for
RunListPageexport.The exported component is missing purpose/returns/throws documentation.
📘 Example TSDoc
+/** + * Purpose: Render the run list page with loading and empty states. + * `@returns` Run list page content. + * `@throws` Does not throw. + */ export function RunListPage() {As per coding guidelines: "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".
apps/dashboard/src/components/ui/table.tsx-1-6 (1)
1-6:⚠️ Potential issue | 🟡 MinorMove and expand the file header to include exports/invariants.
The header isn’t at the top and doesn’t document exports/invariants. Please move it above imports and include those details.
✍️ Suggested header update
-import { cn } from "@/lib/utils"; -/** - * Purpose: Table components for matrix displays. - * Based on shadcn/ui Table with terminal-native styling. - */ -import * as React from "react"; +/** + * Purpose: Table components for matrix displays. + * Exports: Table, TableHeader, TableBody, TableFooter, TableHead, TableRow, TableCell, TableCaption. + * Invariants: Uses semantic table elements with terminal-native styling classes. + */ +import { cn } from "@/lib/utils"; +import * as React from "react";As per coding guidelines: "Begin every file with a short header documenting purpose, exports, and invariants".
apps/dashboard/src/components/ui/table.tsx-8-120 (1)
8-120:⚠️ Potential issue | 🟡 MinorAdd TSDoc for exported table components.
All exported components are missing purpose/params/returns/throws documentation.
📘 Example TSDoc (apply to each exported component)
+/** + * Purpose: Render a styled table element. + * `@param` props - Table element props. + * `@returns` Table element. + * `@throws` Does not throw. + */ const Table = React.forwardRef< HTMLTableElement, React.HTMLAttributes<HTMLTableElement> >(({ className, ...props }, ref) => (As per coding guidelines: "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".
models.example.json-10-13 (1)
10-13:⚠️ Potential issue | 🟡 MinorDocument HuggingFace access requirement for meta-llama model.
The
meta-llama/Llama-3.1-8B-Instructmodel is a gated repository on HuggingFace requiring:
- Accepting the Llama 3.1 Community License Agreement and Acceptable Use Policy
- A valid HuggingFace account and access token for authentication (via
huggingface-cli loginor by passing anhf_token)Consider adding a note in the example file documenting this requirement so users understand the setup steps needed before using this model.
apps/dashboard/src/lib/types.ts-262-265 (1)
262-265:⚠️ Potential issue | 🟡 MinorAdd full JSDoc tags for exported
isToolSmokeItem.The exported function needs explicit
@param/@returns/@throws tags to satisfy the doc rules.✍️ Suggested update
-/** Checks if an item is a tool-smoke test */ +/** + * Checks if an item is a tool-smoke test. + * + * `@param` item - Item containing the test slug. + * `@returns` True when the item is a tool-smoke test. + * `@throws` Never throws. + */ export function isToolSmokeItem(item: { test: string }): boolean { return item.test === TOOL_SMOKE_TEST_SLUG; }As per coding guidelines: "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
apps/dashboard/src/hooks/use-runs.ts-1-21 (1)
1-21:⚠️ Potential issue | 🟡 MinorMove the file header before imports and add full JSDoc tags for
useRuns.The header should be the first thing in the file and include invariants; exported functions also need explicit
@returns/@throws tags.📄 Suggested update
-import { fetchRuns } from "@/lib/api"; -import type { RunListItem } from "@/lib/types"; -/** - * Purpose: React hook for fetching the list of all runs. - * Exports: useRuns - */ -import { useCallback, useEffect, useState } from "react"; +/** + * Purpose: React hook for fetching the list of all runs. + * Exports: useRuns + * Invariants: returns runs sorted by startedAt (desc); refetch triggers a reload. + */ +import { fetchRuns } from "@/lib/api"; +import type { RunListItem } from "@/lib/types"; +import { useCallback, useEffect, useState } from "react"; @@ /** * Fetches the list of all available benchmark runs. * Returns runs sorted by startedAt descending (newest first). + * + * `@returns` Runs with loading/error state and a refetch callback. + * `@throws` Never throws; errors are surfaced via the `error` state. */As per coding guidelines: "Begin every file with a short header documenting purpose, exports, and invariants" and "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
apps/dashboard/src/hooks/use-runs.ts-25-37 (1)
25-37:⚠️ Potential issue | 🟡 MinorAdd Zod validation for
fetchRuns()result at the API boundary.The
fetchRuns()function inapi.tsfetches external JSON (/results/index.json) and returns it directly without validation. Create a Zod schema forRunListItem[]and parse the response before returning it. This ensures the hook receives validated data and prevents runtime crashes from malformed or missing fields.Per coding guidelines: "Validate at boundaries with Zod and pass typed data inward."
apps/dashboard/src/components/compare/compare-page.tsx-24-24 (1)
24-24:⚠️ Potential issue | 🟡 MinorAdd TSDoc for the exported ComparePageContent.
The exported component is missing the required TSDoc.✏️ Suggested change
+/** + * Renders the compare page content with run selectors and comparison results. + * + * `@returns` Compare page layout with selectors and result panels. + * `@throws` {Error} This component does not throw; errors surface via hook state. + */ export function ComparePageContent() {As per coding guidelines, "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
apps/dashboard/src/components/charts/composite-score-chart.tsx-349-352 (1)
349-352:⚠️ Potential issue | 🟡 MinorAdd full TSDoc for the exported CompositeScoreChart.
The exported component lacks the required TSDoc (purpose/params/returns/throws).✏️ Suggested change
+/** + * Renders the composite score chart tabs for the provided matrix items. + * + * `@param` items - Matrix item results to aggregate. + * `@param` onDimensionClick - Optional handler when a dimension row is clicked. + * `@returns` Composite score chart card. + * `@throws` {Error} This component does not throw; errors surface via data inputs. + */ export function CompositeScoreChart({ items, onDimensionClick, }: CompositeScoreChartProps) {As per coding guidelines, "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
src/runtimes/ollama-runtime.ts-61-67 (1)
61-67:⚠️ Potential issue | 🟡 MinorAdd
@throwsto the exported factory JSDoc.
The exportedcreateOllamaRuntimedocblock is missing a@throwssection.✏️ Suggested change
/** * Creates an Ollama runtime instance. * * `@param` config - Runtime configuration * `@returns` Runtime instance for Ollama + * `@throws` {Error} This factory does not throw; runtime methods surface request failures. */ export function createOllamaRuntime(config: OllamaRuntimeConfig): Runtime {As per coding guidelines, "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
apps/dashboard/src/components/compare/compare-page.tsx-1-13 (1)
1-13:⚠️ Potential issue | 🟡 MinorMove the file header to the top and document exports/invariants.
The header is below imports and doesn’t list exports/invariants. Please move it to the top and include the required sections.✏️ Suggested change
+/** + * Purpose: Compare page component for comparing two benchmark runs. + * Exports: ComparePageContent + * + * Invariants: + * - URL params reflect selected runs when both are chosen. + * - Comparison UI renders only when both runs are selected. + */ import { PageContainer, PageHeader } from "@/components/layout/page-container"; // ... -/** - * Purpose: Compare page component for comparing two benchmark runs. - * Provides run selectors and displays comparison results. - */As per coding guidelines, "Begin every file with a short header documenting purpose, exports, and invariants".
src/harnesses/direct-adapter.ts-34-42 (1)
34-42:⚠️ Potential issue | 🟡 MinorAdd a
@throwsentry to the exported factory JSDoc.
ThecreateDirectAdapterdocblock is missing a@throwsentry, which the guidelines require even when the factory doesn’t throw.✏️ Suggested change
/** * Creates a direct harness adapter. * * The direct adapter always returns true for ping() since availability * is determined by the runtime, not the harness. * * `@returns` Harness instance for direct HTTP communication + * `@throws` {Error} This factory does not throw; errors surface from the returned harness methods. */ export function createDirectAdapter(): Harness {As per coding guidelines, "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
apps/dashboard/src/components/charts/composite-score-chart.tsx-1-19 (1)
1-19:⚠️ Potential issue | 🟡 MinorMove the file header to the top and document exports/invariants.
The header is below imports and doesn’t state exports/invariants. Please place it at the top and include required sections.✏️ Suggested change
+/** + * Purpose: Composite score bar chart showing pass rate, tool success, and frontier score. + * Exports: CompositeScoreChart + * + * Invariants: + * - items represent results from a single run matrix. + * - Display labels may be truncated; selection uses raw names. + */ import { Card, CardContent, CardHeader, CardTitle } from "@/components/ui/card"; // ... -/** - * Purpose: Composite score bar chart showing pass rate, tool success, and frontier score. - * Replaces simple pass rate chart with multi-metric grouped bars. - */As per coding guidelines, "Begin every file with a short header documenting purpose, exports, and invariants".
apps/dashboard/src/components/charts/composite-score-chart.tsx-96-114 (1)
96-114:⚠️ Potential issue | 🟡 MinorSeparate display label from lookup key to avoid name collision on Y-axis click.
prepareChartDatatruncatesnameto fit on the chart, but the Y-axis click handler (lines 270–274) uses this truncated value to find the corresponding data item. When two items share the same first 18 characters, they produce identical truncated names, causingfind()to always return the first match and route the second item's clicks incorrectly.Keep
nameas the full untruncated value for reliable lookup, and add a separatelabelfield with the truncated display text. UpdateClickableYAxisTickand the YAxis configuration to use the label for rendering while passing the full name to the click handler.🐛 Suggested change
interface ChartData { name: string; + label: string; effectiveScore: number; passRate: number; toolSuccess: number | null; frontier: number | null; // Keep raw data for tooltip raw: CompositeMetrics; } function prepareChartData(metrics: CompositeMetrics[]): ChartData[] { - return metrics.map((m) => ({ - name: m.name.length > 20 ? `${m.name.slice(0, 18)}...` : m.name, - effectiveScore: m.effectiveScore * 100, - passRate: m.passRate * 100, - toolSuccess: m.toolTotal > 0 ? m.toolSuccessRate * 100 : null, - frontier: m.frontierAvg !== null ? (m.frontierAvg / 10) * 100 : null, - raw: m, - })); + return metrics.map((m) => { + const label = m.name.length > 20 ? `${m.name.slice(0, 18)}...` : m.name; + return { + name: m.name, + label, + effectiveScore: m.effectiveScore * 100, + passRate: m.passRate * 100, + toolSuccess: m.toolTotal > 0 ? m.toolSuccessRate * 100 : null, + frontier: m.frontierAvg !== null ? (m.frontierAvg / 10) * 100 : null, + raw: m, + }; + }); } function ClickableYAxisTick({ x, y, payload, + label, onClick, }: { x: number; y: number; payload: { value: string }; + label?: string; onClick?: (name: string) => void; }) { const isInteractive = Boolean(onClick); + const displayLabel = label ?? payload.value; const handleActivate = () => { onClick?.(payload.value); }; // ... return ( <g transform={`translate(${x},${y})`} onClick={isInteractive ? handleActivate : undefined} onKeyDown={handleKeyDown} role={isInteractive ? "button" : undefined} tabIndex={isInteractive ? 0 : undefined} - aria-label={isInteractive ? `Select ${payload.value}` : undefined} + aria-label={isInteractive ? `Select ${displayLabel}` : undefined} style={{ cursor: onClick ? "pointer" : "default" }} > <text x={-5} y={0} dy={4} textAnchor="end" fill="hsl(210, 30%, 92%)" fontSize={12} className={onClick ? "hover:fill-amber-400 transition-colors" : ""} > - {payload.value} + {displayLabel} </text> </g> ); } <YAxis // ... tick={(props) => ( <ClickableYAxisTick {...props} - onClick={ - onBarClick - ? (name) => { - const item = data.find((d) => d.name === name); - if (item) onBarClick(item.raw.name); - } - : undefined - } + label={data.find((d) => d.name === props.payload.value)?.label} + onClick={onBarClick} /> )} />apps/dashboard/src/components/run-detail/run-detail-page.tsx-1-37 (1)
1-37:⚠️ Potential issue | 🟡 MinorMove the header to the top and add JSDoc for exported components.
Both
RunDetailPageandRunDetailPageSkeletonshould have TSDoc, and the header must be the first content in the file.Suggested update
-import { BlindVsInformedChart } from "@/components/charts/blind-vs-informed-chart"; +/** + * Purpose: Run detail page component displaying a single run's results. + * Exports: RunDetailPage, RunDetailPageSkeleton + * Invariants: `run` and `plan` are required and already validated. + */ +import { BlindVsInformedChart } from "@/components/charts/blind-vs-informed-chart"; @@ -export function RunDetailPage({ run, plan }: RunDetailPageProps) { +/** + * Run detail page. + * `@param` props - Run data and plan metadata + * `@returns` Rendered run detail page + */ +export function RunDetailPage({ run, plan }: RunDetailPageProps) { @@ -export function RunDetailPageSkeleton() { +/** + * Skeleton for run detail page loading state. + * `@returns` Skeleton layout + */ +export function RunDetailPageSkeleton() {As per coding guidelines: "Every file must start with a descriptive header describing purpose, exports, and invariants" and "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".
Also applies to: 223-223
apps/dashboard/src/components/run-detail/item-detail-dialog.tsx-1-30 (1)
1-30:⚠️ Potential issue | 🟡 MinorPlace the file header first and add JSDoc for
ItemDetailDialog.The header should be the first thing in the file and document exports/invariants, and the exported component needs TSDoc.
Suggested update
-import { Badge } from "@/components/ui/badge"; -/** - * Purpose: Dialog component for displaying detailed information about a matrix item. - * Shows generation output, scores, and frontier eval reasoning. - */ +/** + * Purpose: Dialog component for displaying detailed information about a matrix item. + * Exports: ItemDetailDialog + * Invariants: Requires non-null MatrixItemResult when open. + */ +import { Badge } from "@/components/ui/badge"; @@ -export function ItemDetailDialog({ +/** + * Item detail dialog. + * `@param` props - Dialog props and the selected item + * `@returns` Dialog UI or null when no item is selected + */ +export function ItemDetailDialog({As per coding guidelines: "Every file must start with a descriptive header describing purpose, exports, and invariants" and "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".
apps/dashboard/src/components/run-detail/item-detail-dialog.tsx-31-135 (1)
31-135:⚠️ Potential issue | 🟡 MinorReset
showCodewhen the dialog closes or a new item is selected.
showCodepersists across item switches, so a previously expanded code block can appear for a new item unintentionally.Suggested update
-import { useState } from "react"; +import { useEffect, useState } from "react"; @@ const [showCode, setShowCode] = useState(false); + + useEffect(() => { + if (!open || !item) setShowCode(false); + }, [open, item?.id]);llm/context/codebase-overview.md-95-101 (1)
95-101:⚠️ Potential issue | 🟡 MinorDocs still list only
ollamaruntime.
The PR adds vLLM as a runtime, but the table + schema description still call out onlyollama. Please update these sections to reflect both runtimes.📘 Suggested doc update
-| `ollama` | HTTP | Local Ollama server at configured URL | +| `ollama` | HTTP | Local Ollama server at configured URL | +| `vllm` | HTTP | vLLM OpenAI-compatible server at configured URL | -Discovery: `discoverRuntimes()` checks if Ollama endpoint is reachable. +Discovery: `discoverRuntimes()` checks if Ollama and vLLM endpoints are reachable.-| `RuntimeNameSchema` | common.schema.ts | Valid runtime names ("ollama") | +| `RuntimeNameSchema` | common.schema.ts | Valid runtime names ("ollama", "vllm") |Also applies to: 149-150
apps/dashboard/src/components/run-detail/dimension-detail-dialog.tsx-1-9 (1)
1-9:⚠️ Potential issue | 🟡 MinorMove the file header above imports.
The header must be the first lines in the file; currently an import precedes it.✏️ Suggested fix
-import { Badge } from "@/components/ui/badge"; -/** - * Purpose: Dialog showing detailed breakdown when clicking on a model/harness/test bar. - * Exports: DimensionDetailDialog, DimensionType - * - * Invariants: - * - `items` is the full run matrix; filtering is done inside the dialog - * - Tool-smoke is excluded from pass-rate scoring summaries by default - */ +/** + * Purpose: Dialog showing detailed breakdown when clicking on a model/harness/test bar. + * Exports: DimensionDetailDialog, DimensionType + * + * Invariants: + * - `items` is the full run matrix; filtering is done inside the dialog + * - Tool-smoke is excluded from pass-rate scoring summaries by default + */ +import { Badge } from "@/components/ui/badge";As per coding guidelines: "Begin every file with a short header documenting purpose, exports, and invariants".
src/cli/run-command.ts-93-112 (1)
93-112:⚠️ Potential issue | 🟡 MinorParse alias-related CLI inputs with Zod at the boundary.
modelConfigandmodelAliasare CLI boundary inputs used beforeBenchConfigSchemavalidation. While the merged result is validated downstream, the input parameters themselves should be parsed with Zod immediately upon entry to ensure type safety and follow the boundary validation principle.🔧 Suggested approach
+import { z } from "zod"; + +const AliasOptionsSchema = z.object({ + modelConfig: z.string().min(1).optional(), + modelAlias: z.array(z.string()).optional(), +});- // Build model aliases from file and/or inline definitions - let modelAliases: ModelAliasMap = {}; + const { modelConfig, modelAlias } = AliasOptionsSchema.parse(options); + // Build model aliases from file and/or inline definitions + let modelAliases: ModelAliasMap = {}; - if (options.modelConfig) { - const fileAliases = loadModelAliases(options.modelConfig); + if (modelConfig) { + const fileAliases = loadModelAliases(modelConfig); modelAliases = mergeAliases(modelAliases, fileAliases); logger.info( - { file: options.modelConfig, count: Object.keys(fileAliases).length }, + { file: modelConfig, count: Object.keys(fileAliases).length }, "Loaded model aliases from file", ); } - if (options.modelAlias) { - const inlineAliases = parseInlineAliases(options.modelAlias); + if (modelAlias?.length) { + const inlineAliases = parseInlineAliases(modelAlias);src/harnesses/harness.ts-124-137 (1)
124-137:⚠️ Potential issue | 🟡 MinorDocument the thrown error in
normalizeHarnessName.This function throws on invalid names, but the JSDoc omits
@throws. Add the throws tag to satisfy the exported-function doc requirement.✍️ Suggested doc fix
/** * Normalizes a harness name, mapping legacy names to current ones. * `@param` name - Harness name (may be legacy) * `@returns` Normalized harness name + * `@throws` {Error} When the name is not a supported harness. */ export function normalizeHarnessName(name: string): HarnessName {As per coding guidelines: "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
|
@CodeRabbit full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 2
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (11)
src/cli/compare-command.ts (3)
1-7:⚠️ Potential issue | 🟡 MinorAdd invariants to the file header.
The header documents purpose/exports but omits invariants. Please add a short invariants line (e.g., deterministic stdout formatting, no filesystem writes).
As per coding guidelines: "Every file must start with a descriptive header describing purpose, exports, and invariants."✅ Suggested header tweak
/** * Purpose: `bench compare` command for comparing two benchmark runs. * Exports: compareCommand + * Invariants: Deterministic stdout formatting; no filesystem writes. * * Reads two run.json files, computes deltas, and prints a terminal-native * diff table with status changes, score changes, and duration changes. */
46-54:⚠️ Potential issue | 🟡 MinorMake timestamp output deterministic using fixed UTC timezone.
toLocaleString()without an explicittimeZoneparameter uses the system's local timezone, causing output to differ across machines. UseIntl.DateTimeFormatwithtimeZone: "UTC"and include the year for consistency.♻️ Suggested fix
function formatTimestamp(iso: string): string { const date = new Date(iso); - return date.toLocaleString("en-US", { - month: "short", - day: "2-digit", - hour: "2-digit", - minute: "2-digit", - hour12: false, - }); + return ( + new Intl.DateTimeFormat("en-US", { + month: "short", + day: "2-digit", + year: "numeric", + hour: "2-digit", + minute: "2-digit", + hour12: false, + timeZone: "UTC", + }).format(date) + " UTC" + ); }
275-304:⚠️ Potential issue | 🟠 MajorAdd Zod validation for CLI boundary inputs in compare-command.
The
runA,runB, andoptionsparameters are boundary inputs and must be validated with Zod before use, as required by the coding guidelines: "Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as 'unknown' and parse immediately with Zod."Suggested implementation
+import { z } from "zod"; + +const CompareArgsSchema = z.object({ + runA: z.string().min(1), + runB: z.string().min(1), + output: z.string().min(1), + json: z.boolean().optional(), +}); + export const compareCommand = new Command("compare") .description("Compare two benchmark runs") .argument("<run-a>", "First run ID or path (baseline)") .argument("<run-b>", "Second run ID or path (comparison)") .option( "-o, --output <dir>", "Output directory for results", DEFAULT_OUTPUT_DIR, ) .option("--json", "Output raw JSON instead of formatted table") -.action( - async ( - runA: string, - runB: string, - options: { output: string; json?: boolean }, - ) => { + .action(async (runA: string, runB: string, options: unknown) => { + const parsed = CompareArgsSchema.parse({ + runA, + runB, + ...(options as Record<string, unknown>), + }); try { // Find and read run A - const dirA = findRunDir(options.output, runA); + const dirA = findRunDir(parsed.output, parsed.runA); const resultA = await readResult(dirA); // Find and read run B - const dirB = findRunDir(options.output, runB); + const dirB = findRunDir(parsed.output, parsed.runB); const resultB = await readResult(dirB); // Compare runs const comparison = compareRuns(resultA, resultB); // Output - if (options.json) { + if (parsed.json) { console.log(JSON.stringify(comparison, null, 2)); } else {src/lib/openrouter-client.ts (2)
114-187:⚠️ Potential issue | 🟠 MajorUse Zod to parse and validate OpenRouter HTTP responses at the boundary.
The current manual
isRecordandArray.isArraychecks do not validate nested fields (message.content,finish_reason,model) and violate the requirement to parse all boundary inputs (HTTP responses) with Zod. Replace the type casting with a Zod schema andsafeParse.Proposed fix (Zod schema + safeParse)
-import type { Logger } from "pino"; +import type { Logger } from "pino"; +import { z } from "zod"; import type { FrontierEvalFailure } from "../schemas/index.js"; import { logger } from "./logger.js"; @@ -/** OpenRouter API response structure. */ -interface OpenRouterResponse { - choices?: Array<{ - finish_reason?: string; - message?: { - content?: string; - reasoning?: unknown; - }; - }>; - model?: string; -} +/** OpenRouter API response structure. */ +const OpenRouterResponseSchema = z.object({ + choices: z + .array( + z.object({ + finish_reason: z.string().optional(), + message: z + .object({ + content: z.string().optional(), + reasoning: z.unknown().optional(), + }) + .optional(), + }), + ) + .min(1), + model: z.string().optional(), +}); +type OpenRouterResponse = z.infer<typeof OpenRouterResponseSchema>; @@ - const choices = isRecord(json) - ? (json as { choices?: unknown }).choices - : undefined; - if (!Array.isArray(choices) || choices.length === 0) { - log.warn( - { - hasChoices: Array.isArray(choices), - choicesLength: Array.isArray(choices) ? choices.length : 0, - testSlug, - }, - "Invalid OpenRouter response structure", - ); - return { - ok: false, - type: "invalid_response", - message: "Invalid OpenRouter response structure", - }; - } - - return { ok: true, data: json as OpenRouterResponse }; + const parsed = OpenRouterResponseSchema.safeParse(json); + if (!parsed.success) { + log.warn( + { issues: parsed.error.issues, testSlug }, + "Invalid OpenRouter response structure", + ); + return { + ok: false, + type: "invalid_response", + message: "Invalid OpenRouter response structure", + }; + } + + return { ok: true, data: parsed.data };Per coding guidelines: "/*.{ts,tsx}: Use Zod for schemas and validation" and "src/{cli,lib,harnesses}//*.{ts,tsx}: Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as 'unknown' and parse immediately with Zod".
227-255:⚠️ Potential issue | 🟠 MajorUse a Zod schema for eval JSON (score/reasoning) instead of ad‑hoc parsing.
The model response is a boundary input (HTTP response). Validate it immediately with Zod per guidelines ("src/{cli,lib,harnesses}/**/*.{ts,tsx}: Treat all boundary inputs as 'unknown' and parse immediately with Zod"). The proposed
FrontierEvalResponseSchemawithz.coerce.number()handles type coercion explicitly and is more focused than reusingFrontierEvalSchema(which includesmodelandlatencyMsfields added elsewhere).Proposed fix (Zod coercion + validation)
+const FrontierEvalResponseSchema = z.object({ + score: z.coerce.number().int().min(1).max(10), + reasoning: z.string().min(1), +}); + function parseEvalResponse(responseText: string): { score: number; reasoning: string; } { @@ - const parsed = JSON.parse(jsonStr); - - // Validate score - const score = Number(parsed.score); - if (Number.isNaN(score) || score < 1 || score > 10) { - throw new Error(`Invalid score: ${parsed.score}`); - } - - // Validate reasoning - const reasoning = String(parsed.reasoning || ""); - if (!reasoning) { - throw new Error("Missing reasoning in response"); - } - - return { score, reasoning }; + const parsed = FrontierEvalResponseSchema.safeParse(JSON.parse(jsonStr)); + if (!parsed.success) { + throw new Error("Invalid eval response"); + } + return parsed.data; }src/harnesses/discovery.ts (1)
31-56:⚠️ Potential issue | 🟡 MinorAdd
@throwstags to exported harness discovery docs.These exported functions are documented but missing the required
@throwstag.As per coding guidelines: "All exported functions must have TSDoc/JSDoc with purpose, params, returns, throws".✏️ Suggested doc update
/** * Check if a specific harness is available. * * `@param` name - Harness name to check * `@returns` true if the harness is available + * `@throws` {Error} Never thrown; returns false when unavailable. */ export async function isHarnessAvailable(name: HarnessName): Promise<boolean> { /** * Discover all available harnesses on the system. * * `@returns` Array of available harness names + * `@throws` {Error} Never thrown; returns best-effort list. */ export async function discoverHarnesses(): Promise<HarnessName[]> {src/lib/scorer.ts (1)
310-324:⚠️ Potential issue | 🟡 MinorDocument thrown error behavior for
scoreGeneration.The exported function’s JSDoc is missing the required
@throwstag.As per coding guidelines: "All exported functions must have TSDoc/JSDoc with purpose, params, returns, throws".✏️ Suggested doc update
* `@param` timeoutMs - Timeout for scoring (default: 5s) * `@param` codeFilePath - Optional path to code file written by tool-calling harness * `@returns` Scoring result with pass/fail counts + * `@throws` {Error} Never thrown; errors are captured in the returned ScoringResult.apps/dashboard/src/components/run-detail/tooling-breakdown.tsx (1)
181-185:⚠️ Potential issue | 🟡 MinorAdd
@throwsto the exported function docs.The exported JSDoc needs the required
@throwstag.As per coding guidelines: "All exported functions must have TSDoc/JSDoc with purpose, params, returns, throws".✏️ Suggested doc update
/** * Renders tool usage vs scoring breakdown for a run. * `@param` items - Matrix items for the run * `@returns` Tooling breakdown card + * `@throws` {Error} Never thrown; rendering is best-effort. */ export function ToolingBreakdown({ items }: ToolingBreakdownProps) {src/runner/index.ts (1)
43-50:⚠️ Potential issue | 🟡 MinorAdd
@returnstorunBenchmarkJSDoc.✍️ Suggested doc tweak
/** * Runs the complete benchmark workflow. * * `@param` config - Benchmark configuration + * `@returns` Promise resolved when the run completes * * `@throws` {Error} Only on setup/write failures (crashes). * Item execution failures are recorded, not thrown. */As per coding guidelines: "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
llm/implementation/setup-phase-implementation.md (1)
49-55:⚠️ Potential issue | 🟡 MinorUse
timeoutMsin theGenerateOptsdoc snippet.The code uses
timeoutMs, nottimeout.✍️ Suggested fix
- - `GenerateOpts`: `{ model, prompt, timeout, runtime }` + - `GenerateOpts`: `{ model, prompt, timeoutMs, runtime }`llm/project/phases/setup-phase.md (1)
116-120:⚠️ Potential issue | 🟡 MinorAdd a language identifier to the fenced code block.
✍️ Suggested fix
-``` +```text Update `@llm/project/phases/setup-phase.md` with project-specific steps. Keep scope minimal and shippable. List 3-5 actionable steps per feature.</details> </blockquote></details> </blockquote></details>🤖 Fix all issues with AI agents
In `@apps/dashboard/src/components/run-detail/status-badge.tsx`: - Around line 27-35: Add a TSDoc block for the exported function StatusBadge describing its purpose, the parameters (StatusBadgeProps: status and optional showIcon), the return value (JSX.Element) and any possible thrown errors or edge cases; place the doc above the StatusBadge declaration and reference StatusBadgeProps in the `@param` tags and mention that it uses statusConfig and Badge internally. Ensure the doc includes `@param` entries for "status" and "showIcon", an `@returns` describing the rendered Badge element, and an `@throws` note for invalid/unrecognized status values. In `@llm/context/codebase-overview.md`: - Around line 145-146: Update the documented schema version in the markdown from "0.2.0" to "0.2.1" so it matches the SCHEMA_VERSION constant (SCHEMA_VERSION = "0.2.1") defined in the codebase; locate the schema version text in the codebase-overview.md and replace the old version string with "0.2.1" to keep docs consistent with the implementation.🟡 Minor comments (21)
apps/dashboard/src/components/ui/button.tsx-45-56 (1)
45-56:⚠️ Potential issue | 🟡 MinorAdd TSDoc to the exported
Buttoncomponent.
Exported functions must include purpose, params, returns, and throws.Proposed fix
+/** + * Purpose: Render a styled button or Slot with variant/size styling. + * `@param` props - ButtonProps controlling variants, size, and native button attributes. + * `@param` ref - Forwarded ref to the underlying button element. + * `@returns` A React element rendering the button/Slot. + * `@throws` None. + */ const Button = React.forwardRef<HTMLButtonElement, ButtonProps>( ({ className, variant, size, asChild = false, ...props }, ref) => {As per coding guidelines, "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
apps/dashboard/src/components/ui/button.tsx-1-8 (1)
1-8:⚠️ Potential issue | 🟡 MinorMove and expand the file header to the top (include exports + invariants).
The header is not at the beginning of the file and doesn’t document exports/invariants.Proposed fix
+/** + * Purpose: Button component with variants (terminal-native styling). + * Exports: Button, buttonVariants. + * Invariants: buttonVariants keys define all supported variant/size combinations. + */ import { cn } from "@/lib/utils"; import { Slot } from "@radix-ui/react-slot"; import { type VariantProps, cva } from "class-variance-authority"; -/** - * Purpose: Button component with variants. - * Based on shadcn/ui Button with terminal-native styling. - */ import * as React from "react";As per coding guidelines, "Begin every file with a short header documenting purpose, exports, and invariants".
src/harnesses/goose-openai.ts-24-29 (1)
24-29:⚠️ Potential issue | 🟡 MinorEdge case: path consisting only of the suffix is not normalized to "v1".
The suffix
"/chat/completions"includes a leading/, so after trimming, an input like"/chat/completions"becomes"chat/completions", which does not end with"/chat/completions"and is returned as-is instead of normalizing to"v1".
Input Current Output Expected Output /v1/chat/completionsv1✓v1/chat/completionschat/completions✗v1chat/completionschat/completions✗v1🐛 Proposed fix to handle the bare suffix case
export function normalizeOpenAiBasePath(pathname: string): string { const trimmed = pathname.replace(/^\/+/, "").replace(/\/+$/, ""); if (!trimmed) return "v1"; - const suffix = "/chat/completions"; - const lower = trimmed.toLowerCase(); - if (lower.endsWith(suffix)) { - const without = trimmed.slice(0, -suffix.length); - return without.length > 0 ? without : "v1"; - } + const suffixWithSlash = "/chat/completions"; + const suffixBare = "chat/completions"; + const lower = trimmed.toLowerCase(); + + if (lower.endsWith(suffixWithSlash)) { + const without = trimmed.slice(0, -suffixWithSlash.length); + return without.length > 0 ? without : "v1"; + } + if (lower === suffixBare) { + return "v1"; + } return trimmed; }apps/dashboard/src/hooks/use-runs.ts-16-19 (1)
16-19:⚠️ Potential issue | 🟡 MinorAdd complete TSDoc for
useRuns(returns/throws).The exported function’s docstring is missing required tags.
📝 Suggested TSDoc
/** * Fetches the list of all available benchmark runs. * Returns runs sorted by startedAt descending (newest first). + * `@returns` The runs list, loading/error state, and a refetch function. + * `@throws` {never} Errors are captured in `error` instead of being thrown. */ export function useRuns(): UseRunsResult {As per coding guidelines, “All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws.”
apps/dashboard/src/hooks/use-runs.ts-1-7 (1)
1-7:⚠️ Potential issue | 🟡 MinorMove the file header above imports and include invariants.
The header isn’t at the top and doesn’t document invariants.
🔧 Suggested update
-import { fetchRuns } from "@/lib/api"; -import type { RunListItem } from "@/lib/types"; -/** - * Purpose: React hook for fetching the list of all runs. - * Exports: useRuns - */ -import { useCallback, useEffect, useState } from "react"; +/** + * Purpose: React hook for fetching the list of all runs. + * Exports: useRuns + * Invariants: `runs` is sorted by startedAt desc; `error` is string|null; `refetch` is stable. + */ +import { fetchRuns } from "@/lib/api"; +import type { RunListItem } from "@/lib/types"; +import { useCallback, useEffect, useState } from "react";As per coding guidelines, “Begin every file with a short header documenting purpose, exports, and invariants.”
src/cli/compare-command.ts-314-318 (1)
314-318:⚠️ Potential issue | 🟡 MinorUse the centralized Pino logger instead of console.error.
console.errorbypasses the project's configured logger. Use the centralized logger fromsrc/lib/logger.jsfor consistency with other CLI commands likerun-command.ts, and to respect the coding guideline: "Use Pino for logging."Suggested fix
+import { logger } from "../lib/logger.js"; + } catch (error) { - console.error( - `Error: ${error instanceof Error ? error.message : String(error)}`, - ); + logger.error( + { error }, + "Compare failed: %s", + error instanceof Error ? error.message : String(error), + ); process.exit(1); }apps/dashboard/vite.config.ts-1-7 (1)
1-7:⚠️ Potential issue | 🟡 MinorAdd a top-of-file header that includes exports and invariants.
The header is after imports and doesn’t document exports/invariants. Please move it to the very top and expand it.
✍️ Suggested header update
+/** + * Purpose: Vite configuration for plebdev-bench dashboard. + * Exports: default Vite config (React plugin, aliases, results middleware). + * Invariants: /results/* requests are resolved under ../../results only. + */ import { existsSync, readFileSync } from "node:fs"; import path from "node:path"; import react from "@vitejs/plugin-react"; -/** - * Purpose: Vite configuration for plebdev-bench dashboard. - * Configures React plugin, path aliases, and dev server for results. - */As per coding guidelines: Begin every file with a short header documenting purpose, exports, and invariants.
apps/dashboard/vite.config.ts-57-62 (1)
57-62:⚠️ Potential issue | 🟡 MinorNarrow
server.fs.allowto the results directory for better security.Using
["../.."]is overly broad; Vite's default strict mode can still be bypassed if unnecessary paths are allowed. The allowlist should be specific to what the dev server actually needs. Since your customserveResultsPluginalready computes the results path, extract it as a constant and reuse it:♻️ Suggested adjustment
+const RESULTS_DIR = path.resolve(__dirname, "../../results"); + function serveResultsPlugin() { - const resultsDir = path.resolve(__dirname, "../../results"); + const resultsDir = RESULTS_DIR; return { // ... }; } export default defineConfig({ // ... server: { port: 5173, fs: { // Allow serving files from results directory at project root - allow: ["../.."], + allow: [RESULTS_DIR], }, }, });Alternatively, use the relative path directly:
allow: ["../../results"].apps/dashboard/vite.config.ts-18-31 (1)
18-31:⚠️ Potential issue | 🟡 MinorGuard URL parsing/decoding to avoid unhandled request exceptions.
new URL()throwsTypeErroranddecodeURIComponent()throwsURIErroron malformed input. An invalid/results/request would currently bubble an unhandled exception. Wrap parsing/decoding in try-catch and return a 400 status.🛡️ Proposed fix
- const parsed = new URL(req.url, "http://localhost"); - const relativePath = decodeURIComponent( - parsed.pathname.replace("/results/", ""), - ); + let parsed: URL; + let relativePath: string; + try { + parsed = new URL(req.url, "http://localhost"); + relativePath = decodeURIComponent( + parsed.pathname.replace("/results/", ""), + ); + } catch { + res.statusCode = 400; + res.end("Invalid results URL"); + return; + }apps/dashboard/src/components/ui/card.tsx-35-47 (1)
35-47:⚠️ Potential issue | 🟡 MinorType mismatch in CardTitle ref.
The
forwardRefgeneric specifiesHTMLParagraphElementfor the ref type, but the component renders an<h3>element and acceptsReact.HTMLAttributes<HTMLHeadingElement>. This inconsistency could cause type errors when using the ref.Suggested fix
const CardTitle = React.forwardRef< - HTMLParagraphElement, + HTMLHeadingElement, React.HTMLAttributes<HTMLHeadingElement> >(({ className, ...props }, ref) => (llm/implementation/vllm-orbstack-setup.md-129-134 (1)
129-134:⚠️ Potential issue | 🟡 MinorAvoid user-specific absolute paths in docs.
The OrbStack configuration example uses hardcoded paths that won't work for other users. Use a portable placeholder or environment variable.
✏️ Suggested wording
-/Users/plebdev/.orbstack/bin/orbctl config set memory_mib 49152 -/Users/plebdev/.orbstack/bin/orbctl stop -/Users/plebdev/.orbstack/bin/orbctl start +# Adjust the path if orbctl is not in your PATH +orbctl config set memory_mib 49152 +orbctl stop +orbctl startOr if the full path is needed:
-/Users/plebdev/.orbstack/bin/orbctl config set memory_mib 49152 +~/.orbstack/bin/orbctl config set memory_mib 49152apps/dashboard/src/lib/aggregations-compare.ts-20-26 (1)
20-26:⚠️ Potential issue | 🟡 MinorAdd
@throwsto the exported function docs.The exported JSDoc should include a
@throwstag.As per coding guidelines: "All exported functions must have TSDoc/JSDoc with purpose, params, returns, throws".✏️ Suggested doc update
* `@param` runA - First run (baseline) * `@param` runB - Second run (comparison) * `@returns` Compare result with matched items and deltas + * `@throws` {Error} Never thrown; returns best-effort deltas when data is missing.src/harnesses/opencode-output.ts-194-205 (1)
194-205:⚠️ Potential issue | 🟡 MinorAdd Zod parsing and
@throwsdocumentation for boundary input.The function accepts raw CLI output which is a boundary input per coding guidelines and should be parsed as
unknowntype with Zod. Additionally, the exported function lacks@throwsdocumentation.Suggested update
/** * Normalize OpenCode JSON/JSONL output into plain assistant text or extracted tool-call code. * * `@param` raw - Raw stdout/stderr from OpenCode * `@returns` Normalized output and method indicator + * `@throws` {Error} Never thrown; normalization is best-effort. */ -export function normalizeOpenCodeOutput(raw: string): OpenCodeNormalizedOutput { - const trimmed = raw.trim(); +export function normalizeOpenCodeOutput(raw: unknown): OpenCodeNormalizedOutput { + const parsedRaw = z.string().safeParse(raw); + const input = parsedRaw.success ? parsedRaw.data : String(raw ?? ""); + const trimmed = input.trim(); if (!trimmed) { - return { output: raw, method: "raw" }; + return { output: input, method: "raw" }; } @@ - const directToolCallCode = extractFromToolCall(raw); + const directToolCallCode = extractFromToolCall(input); if (directToolCallCode) { return { output: directToolCallCode, method: "tool_call" }; } - return { output: raw, method: "raw" }; + return { output: input, method: "raw" }; }Per coding guidelines: "Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as 'unknown' and parse immediately with Zod" and "All exported functions must have TSDoc/JSDoc with purpose, params, returns, throws".
src/harnesses/opencode-adapter.ts-217-292 (1)
217-292:⚠️ Potential issue | 🟡 MinorPotential unhandled rejection from dangling timeout/stale promises.
When the process completes normally, the
timeoutPromiseandstalePromiseremain pending with their timers still active. Although timers are cleared afterPromise.raceresolves, if the timer fires between the race resolution and theclearTimeout/clearIntervalcalls, thereject()will create an unhandled rejection.🛡️ Suggested fix: Guard rejection with completion flag
+let completed = false; + const timeoutPromise: Promise<never> = new Promise((_, reject) => { timeoutId = setTimeout(() => { - if (killAttempted) return; + if (killAttempted || completed) return; killAttempted = true; // ... rest of handler }, timeoutMs); }); const stalePromise: Promise<never> = new Promise((_, reject) => { staleCheckId = setInterval(() => { - if (killAttempted) return; + if (killAttempted || completed) return; // ... rest of handler }, STALE_CHECK_INTERVAL_MS); }); const result = await Promise.race([proc, timeoutPromise, stalePromise]); +completed = true;apps/dashboard/src/components/charts/composite-score-chart.tsx-349-352 (1)
349-352:⚠️ Potential issue | 🟡 MinorAdd full TSDoc for exported
CompositeScoreChart.✍️ Suggested doc block
+/** + * Renders the composite score chart with tabs by dimension. + * `@param` props - Chart props including items and click handler + * `@returns` JSX element for the composite score card + */ export function CompositeScoreChart({ items, onDimensionClick, }: CompositeScoreChartProps) {As per coding guidelines: "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
llm/implementation/setup-phase-implementation.md-77-78 (1)
77-78:⚠️ Potential issue | 🟡 MinorUpdate the CLI runtime option example to include vLLM.
✍️ Suggested wording
-| `-r, --runtimes <runtimes...>` | Limit to specific runtimes: ollama | All available | +| `-r, --runtimes <runtimes...>` | Limit to specific runtimes: ollama, vllm | All available |apps/dashboard/src/components/charts/composite-score-chart.tsx-96-115 (1)
96-115:⚠️ Potential issue | 🟡 MinorAvoid truncating the data key used for clicks.
Truncating
namecan collide and route clicks to the wrong row. Keep the raw name as the key and truncate only for display in the tick.🛠️ Suggested adjustment
interface ChartData { name: string; effectiveScore: number; passRate: number; @@ // Keep raw data for tooltip raw: CompositeMetrics; } function prepareChartData(metrics: CompositeMetrics[]): ChartData[] { return metrics.map((m) => ({ - name: m.name.length > 20 ? `${m.name.slice(0, 18)}...` : m.name, + name: m.name, effectiveScore: m.effectiveScore * 100, // Primary ranking metric @@ function ClickableYAxisTick({ x, y, payload, onClick, }: { @@ }) { const isInteractive = Boolean(onClick); + const displayValue = + payload.value.length > 20 + ? `${payload.value.slice(0, 18)}...` + : payload.value; @@ <text x={-5} y={0} dy={4} textAnchor="end" fill="hsl(210, 30%, 92%)" fontSize={12} className={onClick ? "hover:fill-amber-400 transition-colors" : ""} > - {payload.value} + {displayValue} </text>Also applies to: 117-167, 260-278
llm/project/phases/setup-phase.md-49-55 (1)
49-55:⚠️ Potential issue | 🟡 MinorAlign the
GenerateOptssnippet withtimeoutMs.✍️ Suggested update
- - `GenerateOpts`: `{ model, prompt, timeout, runtime }` + - `GenerateOpts`: `{ model, prompt, timeoutMs, runtime }`llm/implementation/setup-phase-implementation.md-28-33 (1)
28-33:⚠️ Potential issue | 🟡 MinorInclude the vLLM runtime in the architecture tree.
The runtimes section lists
ollama-runtime.tsbut omits the vLLM implementation now present in the codebase.✍️ Suggested update
├── runtimes/ # Runtime adapters (inference backends) │ ├── runtime.ts # Runtime interface + types │ ├── ollama-runtime.ts # Ollama HTTP implementation +│ ├── vllm-runtime.ts # vLLM HTTP implementation │ ├── discovery.ts # Detect available runtimes │ └── index.ts # Factory + exportsapps/dashboard/src/components/charts/composite-score-chart.tsx-1-19 (1)
1-19:⚠️ Potential issue | 🟡 MinorMove the file header above imports and include exports/invariants.
The header exists but isn’t the first element and doesn’t list exports/invariants.
✏️ Suggested header placement
+/** + * Purpose: Composite score bar chart showing pass rate, tool success, and frontier score. + * Exports: CompositeScoreChart + * Invariants: + * - Chart values are percent-scaled (0–100) for rendering + */ import { Card, CardContent, CardHeader, CardTitle } from "@/components/ui/card"; import { WithInfoTooltip } from "@/components/ui/info-tooltip"; import { Tabs, TabsContent, TabsList, TabsTrigger } from "@/components/ui/tabs"; @@ -import type { KeyboardEvent } from "react"; -/** - * Purpose: Composite score bar chart showing pass rate, tool success, and frontier score. - * Replaces simple pass rate chart with multi-metric grouped bars. - */ +import type { KeyboardEvent } from "react";As per coding guidelines: "Begin every file with a short header documenting purpose, exports, and invariants".
src/harnesses/harness.ts-124-137 (1)
124-137:⚠️ Potential issue | 🟡 MinorDocument the thrown error in
normalizeHarnessName.✍️ Suggested doc update
/** * Normalizes a harness name, mapping legacy names to current ones. * `@param` name - Harness name (may be legacy) * `@returns` Normalized harness name + * `@throws` {Error} When the name is not a supported harness */ export function normalizeHarnessName(name: string): HarnessName {As per coding guidelines: "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
|
@CodeRabbit full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 2
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (10)
apps/dashboard/src/components/ui/skeleton.tsx (1)
1-3:⚠️ Potential issue | 🟡 MinorComplete the file header (purpose, exports, invariants).
Current header only states purpose; please add exports and invariants as required.
✅ Suggested header update
-/** - * Purpose: Skeleton loading placeholder component. - */ +/** + * Purpose: Skeleton loading placeholder component. + * Exports: `Skeleton`. + * Invariants: Renders a single div with skeleton styles; forwards HTML div props. + */As per coding guidelines, "Begin every file with a short header documenting purpose, exports, and invariants".
apps/dashboard/src/components/ui/dialog.tsx (1)
10-123:⚠️ Potential issue | 🟡 MinorAdd TSDoc/JSDoc for exported components.
Exported components are missing required TSDoc/JSDoc (purpose, params, returns, throws).
📝 Example pattern (apply to each exported component)
+/** + * Purpose: Root dialog component. + * `@returns` React element for Dialog root. + * `@throws` Never. + */ const Dialog = DialogPrimitive.Root;As per coding guidelines: “All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws.”
src/cli/compare-command.ts (1)
124-192:⚠️ Potential issue | 🔴 CriticalAdd RUNTIME to compare key and display in regression/improvement tables.
Runtime is a dimension in the benchmark matrix (
runtimes × harnesses × models × tests × passTypes), butbuildCompareKey()omits it (currently:model|harness|test|passType). This causes collisions when a single run contains items with the same model/harness/test/passType across different runtimes—only the last variant is retained during map building. Additionally, the regression and improvement tables omit the RUNTIME column, making output ambiguous. UpdatebuildCompareKey()to include runtime, expose runtime inMatchedItem, and display it in the tables.apps/dashboard/src/lib/utils.ts (1)
1-4:⚠️ Potential issue | 🟡 MinorComplete the file header (exports + invariants).
The header omits some exports and doesn’t state invariants. Please list all exported utilities and add an invariants line.
🔧 Suggested header update
/** * Purpose: Utility functions for the dashboard. - * Exports: cn (class name merger), formatDuration, formatNumber, formatDate + * Exports: cn (class name merger), formatDuration, formatNumber, formatPercent, formatDate, formatShortDate + * Invariants: Inputs are expected to be valid numbers or ISO date strings. */As per coding guidelines, "Begin every file with a short header documenting purpose, exports, and invariants".
src/lib/failure-classifier.ts (1)
14-28:⚠️ Potential issue | 🟡 MinorAdd
@throwsto exported function docs.Both exported functions should declare
@throws(even if they don’t throw).As per coding guidelines: "All exported functions require TSDoc/JSDoc (purpose, params, returns, throws)."📌 Suggested doc additions
* `@param` errorMessage - The error message from harness/execution * `@returns` The classified failure type + * `@throws` Does not throw. @@ * `@param` errorMessage - The error message from scoring * `@returns` The classified scoring failure type + * `@throws` Does not throw.Also applies to: 81-93
apps/dashboard/src/lib/api.ts (1)
63-70:⚠️ Potential issue | 🟡 MinorAdd
@throwsto the exported function JSDoc.
fetchRunWithPlancan throw when either fetch/validation fails; document that explicitly.📝 Suggested JSDoc update
/** * Fetches both run and plan data for a run. * `@param` runId - The run ID (directory name in results/) * `@returns` Object with run and plan data + * `@throws` Error if run/plan fetch fails or validation fails */As per coding guidelines: "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
src/runner/item-executor.ts (2)
102-183:⚠️ Potential issue | 🟠 MajorDon’t swallow config/programmer errors inside the generation catch.
loadPrompt,createRuntime, andcreateHarnessfailures should crash the run (config/programmer errors), not be recorded as per-item generation failures.Based on learnings, “Throw on programmer/config errors; record runtime per-item failures in results instead of silent fallbacks.”Suggested restructuring
- try { - // Load prompt - log.debug("Loading prompt..."); - const prompt = loadPrompt(item.test, item.passType); - - // Create runtime instance - const runtime = createRuntime(item.runtime, { - ollamaBaseUrl: runtimeConfig.ollamaBaseUrl, - vllmBaseUrl: runtimeConfig.vllmBaseUrl, - defaultTimeoutMs: timeoutMs, - }); - - // Create harness adapter - log.debug({ harness: item.harness }, "Creating harness..."); - const harness = createHarness(item.harness); - - // Generate completion (pass runtime to harness) - generationStartTime = performance.now(); - const result = await harness.generate({ - model: item.model, - prompt, - timeoutMs, - unloadAfter, - runtime, - }); + // Load prompt (config error should throw) + log.debug("Loading prompt..."); + const prompt = loadPrompt(item.test, item.passType); + + // Create runtime instance (config error should throw) + const runtime = createRuntime(item.runtime, { + ollamaBaseUrl: runtimeConfig.ollamaBaseUrl, + vllmBaseUrl: runtimeConfig.vllmBaseUrl, + defaultTimeoutMs: timeoutMs, + }); + + // Create harness adapter (config error should throw) + log.debug({ harness: item.harness }, "Creating harness..."); + const harness = createHarness(item.harness); + + try { + // Generate completion (pass runtime to harness) + generationStartTime = performance.now(); + const result = await harness.generate({ + model: item.model, + prompt, + timeoutMs, + unloadAfter, + runtime, + });
69-80:⚠️ Potential issue | 🟡 MinorAdd
@throwsto the exportedexecuteItemJSDoc.
It should document the config errors you now allow to bubble.As per coding guidelines, “All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws.”Suggested JSDoc tweak
/** * Executes a single matrix item. @@ * `@returns` The execution result + * `@throws` {Error} For configuration/programmer errors (missing prompt, invalid runtime/harness) * * Note: This function does NOT throw on generation failures. * Instead, failures are recorded in the result. */llm/implementation/setup-phase-implementation.md (1)
239-245:⚠️ Potential issue | 🟡 MinorUpdate the execution loop snippet to the new
executeItemsignature.
It still shows the old(item, ollamaBaseUrl, timeout)call.Suggested edit
- result = executeItem(item, ollamaBaseUrl, timeout) + result = executeItem(item, runtimeConfig, timeout)src/cli/run-command.ts (1)
1-11:⚠️ Potential issue | 🟡 MinorAdd an explicit Invariants section to the file header.
The header has purpose/exports but no invariants, which the guidelines require.✏️ Suggested header addition
/** * Purpose: `bench run` command - execute benchmark matrix. * Exports: runCommand * * This command orchestrates: * 1. Config parsing from CLI options * 2. Runtime/Model/Test discovery * 3. Matrix expansion and plan creation * 4. Item execution * 5. Result writing + * + * Invariants: + * - CLI options are validated with BenchConfigSchema before execution + * - Run aborts early on invalid configuration */As per coding guidelines: "Every file must start with a descriptive header describing purpose, exports, and invariants".
🤖 Fix all issues with AI agents
In `@apps/dashboard/scripts/build-index.ts`:
- Around line 29-38: The code reads and JSON.parse's run.json directly; instead
use the existing RunResultSchema to validate the parsed data immediately: after
awaiting readFile(runJsonPath, "utf-8") and JSON.parse(content) treat the parsed
value as unknown and call RunResultSchema.parse (or safeParse and handle errors)
to get a typed run object, then extract run.runId, run.startedAt,
run.completedAt, run.durationMs, run.summary into the runs array; on parse
failure log or skip the file instead of silently proceeding. Ensure you
reference RunResultSchema and the local variables runJsonPath, content/readFile,
and runs when modifying build-index.ts.
In `@apps/dashboard/src/components/compare/compare-page.tsx`:
- Around line 26-33: Extracted route params (urlRunA, urlRunB from useParams)
must be runtime-validated with Zod before being used to seed state; import or
create a Zod schema (e.g., compareRouteParamsSchema or reuse the existing schema
in schemas.ts), parse the params from useParams via
compareRouteParamsSchema.parse or .safeParse, and only call
setSelectedRunA/setSelectedRunB (and initialize selectedRunA/selectedRunB state)
with the validated values (or undefined on failure) so invalid route values are
rejected at the boundary; keep useRuns usage unchanged and ensure any parsing
errors are handled silently or logged but do not propagate invalid strings into
selectedRunA/selectedRunB.
🟡 Minor comments (49)
apps/dashboard/src/components/ui/dialog.tsx-1-8 (1)
1-8:⚠️ Potential issue | 🟡 MinorMove and expand the file header to the very top (include exports + invariants).
The header starts after imports and doesn’t list exports or invariants. Please move it above Line 1 and expand it.
📄 Suggested header placement/content
+/** + * Purpose: Dialog UI primitives for modals/drill-downs. + * Exports: Dialog, DialogTrigger, DialogPortal, DialogClose, DialogOverlay, + * DialogContent, DialogHeader, DialogFooter, DialogTitle, DialogDescription. + * Invariants: Radix Dialog primitives are re-exported with terminal-native styling. + */ import { cn } from "@/lib/utils"; import * as DialogPrimitive from "@radix-ui/react-dialog"; import { X } from "lucide-react"; -/** - * - * Purpose: Dialog component for modals/drill-downs. - * Based on shadcn/ui Dialog (Radix UI) with terminal-native styling. - */ import * as React from "react";As per coding guidelines: “Begin every file with a short header documenting purpose, exports, and invariants.”
apps/dashboard/src/components/ui/button.tsx-45-55 (1)
45-55:⚠️ Potential issue | 🟡 MinorAdd TSDoc for exported
Button.
Buttonis exported but lacks the required TSDoc with purpose, params, returns, and throws.🧾 Suggested docblock
-const Button = React.forwardRef<HTMLButtonElement, ButtonProps>( +/** + * Purpose: Render a styled button (or Slot when asChild is true). + * `@param` props - Button props including variant, size, asChild, and native button attributes. + * `@param` ref - Forwarded ref to the underlying button element. + * `@returns` The rendered button/slot element. + * `@throws` None. + */ +const Button = React.forwardRef<HTMLButtonElement, ButtonProps>(As per coding guidelines, All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws.
apps/dashboard/src/components/ui/button.tsx-1-7 (1)
1-7:⚠️ Potential issue | 🟡 MinorMove/expand the header to the top and document exports/invariants.
The file header appears after imports and does not list exports or invariants. Please move it to the top and include exports/invariants explicitly.
🔧 Suggested update
- import { cn } from "@/lib/utils"; - import { Slot } from "@radix-ui/react-slot"; - import { type VariantProps, cva } from "class-variance-authority"; - /** - * Purpose: Button component with variants. - * Based on shadcn/ui Button with terminal-native styling. - */ - import * as React from "react"; +/** + * Purpose: Button component with variants (terminal-native styling). + * Exports: Button, buttonVariants. + * Invariants: buttonVariants defines allowed {variant,size}; Button merges className via cn and supports asChild Slot. + */ +import { cn } from "@/lib/utils"; +import { Slot } from "@radix-ui/react-slot"; +import { type VariantProps, cva } from "class-variance-authority"; +import * as React from "react";As per coding guidelines, Begin every file with a short header documenting purpose, exports, and invariants.
apps/dashboard/src/hooks/use-runs.ts-16-20 (1)
16-20:⚠️ Potential issue | 🟡 MinorComplete the exported hook’s JSDoc.
Add explicit@returnsand@throws(even if it never throws).✅ Suggested update
/** * Fetches the list of all available benchmark runs. * Returns runs sorted by startedAt descending (newest first). + * `@returns` UseRunsResult containing runs, loading, error, and refetch. + * `@throws` {never} Errors are surfaced via `error` state instead of thrown. */ export function useRuns(): UseRunsResult {As per coding guidelines, "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
apps/dashboard/src/hooks/use-runs.ts-1-7 (1)
1-7:⚠️ Potential issue | 🟡 MinorMove the file header above imports and add invariants.
The header isn’t at the top and doesn’t document invariants.✅ Suggested update
- import { fetchRuns } from "@/lib/api"; - import type { RunListItem } from "@/lib/types"; - /** - * Purpose: React hook for fetching the list of all runs. - * Exports: useRuns - */ - import { useCallback, useEffect, useState } from "react"; +/** + * Purpose: React hook for fetching the list of all runs. + * Exports: useRuns, UseRunsResult + * Invariants: runs is sorted by startedAt descending; loading is true only while fetching. + */ +import { fetchRuns } from "@/lib/api"; +import type { RunListItem } from "@/lib/types"; +import { useCallback, useEffect, useState } from "react";As per coding guidelines, "Begin every file with a short header documenting purpose, exports, and invariants".
.gitignore-141-144 (1)
141-144:⚠️ Potential issue | 🟡 MinorPreserved sample run files will still be ignored.
results/*ignores all files;!results/20260209-080211-751e64/only unignores the directory itself, not its contents, so the sample run won’t be committed. Add an exception for the directory contents as well.✅ Suggested fix
results/* !results/index.json !results/20260209-080211-751e64/ +!results/20260209-080211-751e64/**apps/dashboard/src/components/ui/info-tooltip.tsx-1-7 (1)
1-7:⚠️ Potential issue | 🟡 MinorMove the file header to the top and include exports/invariants.
The header comment is not the first statement and doesn’t document exports/invariants. Please move it above imports and add the missing details.
As per coding guidelines, Begin every file with a short header documenting purpose, exports, and invariants.Suggested update
+/** + * Purpose: Reusable info tooltip component with question mark trigger. + * Exports: InfoTooltip, WithInfoTooltip. + * Invariants: Uses Radix Tooltip primitives; content is required text; side is one of "top" | "right" | "bottom" | "left". + */ import { cn } from "@/lib/utils"; import * as TooltipPrimitive from "@radix-ui/react-tooltip"; -/** - * Purpose: Reusable info tooltip component with question mark trigger. - * Uses Radix UI Tooltip for accessibility and consistent styling. - */ import type * as React from "react";apps/dashboard/src/components/ui/info-tooltip.tsx-15-23 (1)
15-23:⚠️ Potential issue | 🟡 MinorAdd full TSDoc (
@param/@returns/@throws) for exportedInfoTooltip.The current comment is descriptive but missing required tags.
As per coding guidelines, All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws.Suggested update
/** * Info tooltip with question mark icon trigger. * Shows explanatory text on hover/focus. + * `@param` content - Tooltip text to display. + * `@param` side - Tooltip placement relative to trigger. + * `@param` className - Optional class overrides for the trigger. + * `@returns` The tooltip trigger and content. + * `@throws` {Error} Never throws. */ export function InfoTooltip({apps/dashboard/src/components/ui/info-tooltip.tsx-69-81 (1)
69-81:⚠️ Potential issue | 🟡 MinorAdd full TSDoc (
@param/@returns/@throws) for exportedWithInfoTooltip.Same requirement applies here.
As per coding guidelines, All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws.Suggested update
/** * Wrapper to add info tooltip next to any content. * Provides consistent spacing and alignment. + * `@param` children - Content to decorate with an info tooltip. + * `@param` tooltip - Tooltip text to display. + * `@param` side - Tooltip placement relative to trigger. + * `@returns` A span with content and tooltip. + * `@throws` {Error} Never throws. */ export function WithInfoTooltip({src/cli/compare-command.ts-291-298 (1)
291-298:⚠️ Potential issue | 🟡 MinorRemove unnecessary
awaiton synchronousreadResultcalls.
readResultreturnsRunResultsynchronously (lines 294 and 298). Removeawaitfor clarity:Lines 294, 298
- const resultA = await readResult(dirA); + const resultA = readResult(dirA); - const resultB = await readResult(dirB); + const resultB = readResult(dirB);apps/dashboard/src/components/charts/pass-rate-chart.tsx-1-9 (1)
1-9:⚠️ Potential issue | 🟡 MinorMove the required file header to the top and include exports/invariants.
The header comment is not at the beginning of the file and doesn’t document exports/invariants. Please place it before imports and expand it per the guideline.
🔧 Suggested header placement/update
+/** + * Purpose: Pass rate bar chart component using Recharts. + * Exports: PassRateChart. + * Invariants: `items` are validated MatrixItemResult entries; passRate is in [0, 1]. + */ import { Card, CardContent, CardHeader, CardTitle } from "@/components/ui/card"; import { Tabs, TabsContent, TabsList, TabsTrigger } from "@/components/ui/tabs"; import { computeBreakdown, groupByHarness, groupByModel, groupByTest, } from "@/lib/aggregations"; import type { MatrixItemResult } from "@/lib/types"; -/** - * Purpose: Pass rate bar chart component using Recharts. - * Shows pass rate by model, harness, or test dimension. - */ import { Bar, BarChart, CartesianGrid, Cell, ResponsiveContainer, Tooltip, XAxis, YAxis, } from "recharts";As per coding guidelines, "Begin every file with a short header documenting purpose, exports, and invariants".
apps/dashboard/src/components/charts/pass-rate-chart.tsx-29-109 (1)
29-109:⚠️ Potential issue | 🟡 MinorUse block comments for functions instead of line comments.
The functions rely on
//comments, but the guideline asks for descriptive block comments. Consider converting the function headers to JSDoc-style blocks.📝 Example conversion
-// Custom tooltip component +/** + * Custom tooltip for pass-rate bars. + */ function CustomTooltip({ active, payload, }: { active?: boolean; payload?: Array<{ payload: { name: string; passRate: number; passed: number; total: number }; }>; }) { // ... } -// Get bar color based on pass rate +/** + * Map a pass rate to a semantic color. + */ function getBarColor(passRate: number): string { // ... } +/** + * Render a vertical bar chart for pass rate breakdowns. + */ function PassRateBarChart({ data, }: { data: Array<{ name: string; passRate: number; passed: number; total: number; }>; }) { // ... }Based on learnings, "Write concise, technical code with descriptive block comments for functions".
apps/dashboard/vite.config.ts-1-8 (1)
1-8:⚠️ Potential issue | 🟡 MinorMove the file header above imports and include exports/invariants.
The header isn’t at the top of the file and doesn’t document exports/invariants as required.
As per coding guidelines, "Begin every file with a short header documenting purpose, exports, and invariants".💡 Proposed fix
-import { existsSync, readFileSync } from "node:fs"; -import path from "node:path"; -import react from "@vitejs/plugin-react"; -/** - * - * Purpose: Vite configuration for plebdev-bench dashboard. - * Configures React plugin, path aliases, and dev server for results. - */ -import { type ViteDevServer, defineConfig } from "vite"; +/** + * Purpose: Vite configuration for plebdev-bench dashboard. + * Exports: Default Vite config for the dashboard app. + * Invariants: /results/ middleware serves only files under ../../results. + */ +import { existsSync, readFileSync } from "node:fs"; +import path from "node:path"; +import react from "@vitejs/plugin-react"; +import { type ViteDevServer, defineConfig } from "vite";apps/dashboard/vite.config.ts-19-23 (1)
19-23:⚠️ Potential issue | 🟡 MinorGuard
decodeURIComponentto prevent crashes from malformed URLs.
decodeURIComponentthrows aURIErroron invalid percent-encoded strings (e.g., stray%, incomplete escapes, or invalid UTF-8 sequences). Without a guard, this exception can bubble up and crash the middleware. Wrap the call in try-catch and return a 400 response for invalid paths.Proposed fix
- const parsed = new URL(req.url, "http://localhost"); - const relativePath = decodeURIComponent( - parsed.pathname.replace("/results/", ""), - ); + let relativePath: string; + try { + const parsed = new URL(req.url, "http://localhost"); + relativePath = decodeURIComponent( + parsed.pathname.replace("/results/", ""), + ); + } catch { + res.statusCode = 400; + res.end("Invalid results path"); + return; + }apps/dashboard/src/components/layout/header.tsx-8-9 (1)
8-9:⚠️ Potential issue | 🟡 MinorAdd required TSDoc for
Header.The exported
Headerfunction is missing purpose,@returns, and@throwstags.📝 TSDoc example
+/** + * Purpose: Renders the application header with primary navigation. + * `@returns` Header JSX element. + * `@throws` Never. + */ export function Header() {As per coding guidelines, "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
apps/dashboard/src/components/ui/badge.tsx-32-35 (1)
32-35:⚠️ Potential issue | 🟡 MinorAdd required TSDoc for
Badge.The exported
Badgefunction is missing purpose, params, returns, and throws tags.📝 TSDoc example
+/** + * Purpose: Renders a styled badge. + * `@param` className - Additional class names. + * `@param` variant - Visual variant key. + * `@param` props - Remaining HTML div attributes. + * `@returns` Badge JSX element. + * `@throws` Never. + */ function Badge({ className, variant, ...props }: BadgeProps) {As per coding guidelines, "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
apps/dashboard/src/components/layout/header.tsx-1-5 (1)
1-5:⚠️ Potential issue | 🟡 MinorMove the file header to the top and include exports/invariants.
Line 1 import precedes the header, and the header doesn’t list exports or invariants. Please move the header above imports and add the missing fields.
🔧 Suggested header placement
-import { cn } from "@/lib/utils"; /** * Purpose: Application header with navigation. * Provides links to Runs, Compare views. + * Exports: Header + * Invariants: Navigation items remain static and keyed by href. */ +import { cn } from "@/lib/utils";As per coding guidelines, "Begin every file with a short header documenting purpose, exports, and invariants".
apps/dashboard/src/components/ui/badge.tsx-1-7 (1)
1-7:⚠️ Potential issue | 🟡 MinorMove the file header to the top and include exports/invariants.
The header comes after imports and doesn’t list exports or invariants. Please move it above imports and include the missing fields.
🔧 Suggested header placement
-import { cn } from "@/lib/utils"; -import { type VariantProps, cva } from "class-variance-authority"; /** * Purpose: Badge component for status and labels. * Based on shadcn/ui Badge with terminal-native styling. + * Exports: Badge, badgeVariants + * Invariants: Variants map remains the single source of badge styles. */ +import { cn } from "@/lib/utils"; +import { type VariantProps, cva } from "class-variance-authority"; import type * as React from "react";As per coding guidelines, "Begin every file with a short header documenting purpose, exports, and invariants".
apps/dashboard/src/components/run-list/run-card.tsx-11-18 (1)
11-18:⚠️ Potential issue | 🟡 MinorAdd TSDoc for the exported
RunCardcomponent.
Include purpose, params, returns, and throws.📝 Suggested TSDoc
interface RunCardProps { run: RunListItem; /** Optional pass rate (0-1 scale) */ passRate?: number; } +/** + * Purpose: Render a summary card for a benchmark run. + * `@param` run - Run summary data used for rendering. + * `@param` passRate - Optional pass rate in the 0-1 range. + * `@returns` Rendered run card element. + * `@throws` None. + */ export function RunCard({ run, passRate }: RunCardProps) {As per coding guidelines: "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".
apps/dashboard/src/components/run-list/run-card.tsx-1-9 (1)
1-9:⚠️ Potential issue | 🟡 MinorMove the file header above imports and include exports/invariants.
The required header must be the first content in the file and include purpose, exports, and invariants.🔧 Suggested header placement
+/** + * Purpose: Card component for displaying a single run summary. + * Exports: RunCard + * Invariants: passRate is 0-1 when provided; summary counts are non-negative. + */ import { Badge } from "@/components/ui/badge"; import { Card, CardContent, CardHeader, CardTitle } from "@/components/ui/card"; import type { RunListItem } from "@/lib/types"; import { formatDate, formatDuration, formatPercent } from "@/lib/utils"; -/** - * Purpose: Card component for displaying a single run summary. - * Shows runId, timestamp, duration, pass rate, and item counts. - */ import { Link } from "react-router-dom";As per coding guidelines: "Every file must start with a descriptive header describing purpose, exports, and invariants".
apps/dashboard/src/components/run-list/run-list-page.tsx-12-13 (1)
12-13:⚠️ Potential issue | 🟡 MinorAdd TSDoc for the exported
RunListPagecomponent.
Include purpose, params (none), returns, and throws.📝 Suggested TSDoc
const RUN_LIST_SKELETON_KEYS = ["s1", "s2", "s3", "s4", "s5", "s6"] as const; +/** + * Purpose: Render the benchmark runs list page. + * `@returns` Rendered runs list page element. + * `@throws` None. + */ export function RunListPage() {As per coding guidelines: "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".
apps/dashboard/src/components/run-list/run-list-page.tsx-1-6 (1)
1-6:⚠️ Potential issue | 🟡 MinorMove the file header above imports and include exports/invariants.
The required header must be the first content in the file and include purpose, exports, and invariants.🔧 Suggested header placement
+/** + * Purpose: Page component listing benchmark runs. + * Exports: RunListPage + * Invariants: useRuns provides sorted runs; skeleton keys are stable. + */ import { PageContainer, PageHeader } from "@/components/layout/page-container"; import { Skeleton } from "@/components/ui/skeleton"; -/** - * Purpose: Run list page component displaying all benchmark runs. - * Shows a grid of RunCard components sorted by date (newest first). - */ import { useRuns } from "@/hooks/use-runs";As per coding guidelines: "Every file must start with a descriptive header describing purpose, exports, and invariants".
apps/dashboard/src/components/ui/tabs.tsx-1-7 (1)
1-7:⚠️ Potential issue | 🟡 MinorMove the file header to the top and document exports/invariants.
The header no longer begins the file and doesn’t list exports/invariants.
As per coding guidelines: "Begin every file with a short header documenting purpose, exports, and invariants."📝 Suggested header placement and content
-import { cn } from "@/lib/utils"; -import * as TabsPrimitive from "@radix-ui/react-tabs"; -/** - * Purpose: Tabs component for view switching. - * Based on shadcn/ui Tabs (Radix UI) with terminal-native styling. - */ -import * as React from "react"; +/** + * Purpose: Tabs component for view switching. + * Exports: Tabs, TabsList, TabsTrigger, TabsContent + * Invariants: Terminal-native styling wraps Radix Tabs primitives. + */ +import { cn } from "@/lib/utils"; +import * as TabsPrimitive from "@radix-ui/react-tabs"; +import * as React from "react";apps/dashboard/src/components/ui/card.tsx-1-6 (1)
1-6:⚠️ Potential issue | 🟡 MinorMove the file header to the top and document exports/invariants.
The header is no longer the first content in the file and doesn’t list exports/invariants.
As per coding guidelines: "Begin every file with a short header documenting purpose, exports, and invariants."📝 Suggested header placement and content
-import { cn } from "@/lib/utils"; -/** - * Purpose: Card component for content containers. - * Based on shadcn/ui Card with terminal-native styling. - */ -import * as React from "react"; +/** + * Purpose: Card components for content containers. + * Exports: Card, CardHeader, CardTitle, CardDescription, CardContent, CardFooter + * Invariants: Terminal-native styling wraps shadcn/ui Card primitives. + */ +import { cn } from "@/lib/utils"; +import * as React from "react";apps/dashboard/src/components/ui/card.tsx-35-38 (1)
35-38:⚠️ Potential issue | 🟡 MinorFix the ref element type for
CardTitle.
CardTitlerenders an<h3>, so the ref should beHTMLHeadingElement, notHTMLParagraphElement.✅ Correct ref element type
-const CardTitle = React.forwardRef< - HTMLParagraphElement, - React.HTMLAttributes<HTMLHeadingElement> ->(({ className, ...props }, ref) => ( +const CardTitle = React.forwardRef< + HTMLHeadingElement, + React.HTMLAttributes<HTMLHeadingElement> +>(({ className, ...props }, ref) => (apps/dashboard/src/components/compare/run-selector.tsx-31-36 (1)
31-36:⚠️ Potential issue | 🟡 MinorWire the label to
SelectTriggerfor screen reader accessibility.The
<p>label is not programmatically associated with the select control. Addaria-labelledbyto make the label discoverable by assistive technologies.Implementation using `useId`
+ const labelId = useId(); {label && ( - <p className="text-sm font-medium text-foreground-muted">{label}</p> + <p id={labelId} className="text-sm font-medium text-foreground-muted"> + {label} + </p> )} <Select value={value} onValueChange={onValueChange}> - <SelectTrigger className="w-full md:w-[300px]"> + <SelectTrigger + aria-labelledby={label ? labelId : undefined} + className="w-full md:w-[300px]" + > <SelectValue placeholder={placeholder} /> </SelectTrigger>Add the import:
import { useId } from "react";apps/dashboard/src/hooks/use-run-detail.ts-1-7 (1)
1-7:⚠️ Potential issue | 🟡 MinorMove the header comment to the top of the file.
The file currently begins with imports; the header should be the first content.
As per coding guidelines: "Every file must start with a descriptive header describing purpose, exports, and invariants".
apps/dashboard/src/components/compare/compare-table.tsx-1-5 (1)
1-5:⚠️ Potential issue | 🟡 MinorPlace the header comment at the top of the file.
The header currently follows imports; it should be the first content in the file.
As per coding guidelines: "Every file must start with a descriptive header describing purpose, exports, and invariants".
apps/dashboard/src/components/compare/compare-table.tsx-22-25 (1)
22-25:⚠️ Potential issue | 🟡 MinorAdd full TSDoc for CompareTable.
The exported component needs
@param,@returns, and@throws.📝 Suggested JSDoc block
+/** + * Renders the compare table for matched items. + * `@param` props - Compare table props + * `@returns` Table element with delta columns + * `@throws` Never + */ export function CompareTable({ items, showOnlyChanges = false, }: CompareTableProps) {As per coding guidelines: "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".
apps/dashboard/src/hooks/use-run-detail.ts-16-20 (1)
16-20:⚠️ Potential issue | 🟡 MinorAdd
@returns/@throws to the hook JSDoc.The exported hook’s documentation is missing required tags.
📝 Suggested JSDoc update
/** * Fetches both run.json and plan.json for a single run. * `@param` runId - The run ID to fetch + * `@returns` Run/plan data plus loading/error state + * `@throws` Never (errors are captured into state) */As per coding guidelines: "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".
apps/dashboard/src/components/compare/compare-table.tsx-138-181 (1)
138-181:⚠️ Potential issue | 🟡 MinorAdd full TSDoc for RegressionsTable and ImprovementsTable.
Both exported components require
@param,@returns, and@throwstags.📝 Suggested JSDoc blocks
/** Table showing regressions (completed → failed) */ +/** + * Renders the regressions table. + * `@param` props - Items to render + * `@returns` Table element listing regressions + * `@throws` Never + */ export function RegressionsTable({ items }: { items: MatchedItem[] }) { ... /** Table showing improvements (failed → completed) */ +/** + * Renders the improvements table. + * `@param` props - Items to render + * `@returns` Table element listing improvements + * `@throws` Never + */ export function ImprovementsTable({ items }: { items: MatchedItem[] }) {As per coding guidelines: "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".
apps/dashboard/scripts/build-index.ts-16-17 (1)
16-17:⚠️ Potential issue | 🟡 MinorUse Pino logger instead of console.log throughout this script.
The file matches the TS pattern where Pino should be used for logging. Multiple console.log calls (lines 16, 41, 44, 57, 60, 61, 65, 72) should be replaced with Pino logger calls for consistency with project logging standards.
llm/implementation/setup-phase-implementation.md-77-82 (1)
77-82:⚠️ Potential issue | 🟡 MinorUpdate runtime examples to include all supported runtimes.
The options table, MatrixItem example, and discovery notes still imply only “ollama.”Suggested edits
-| `-r, --runtimes <runtimes...>` | Limit to specific runtimes: ollama | All available | +| `-r, --runtimes <runtimes...>` | Limit to specific runtimes: ollama, vllm | All available | - runtime: string // "ollama" + runtime: string // "ollama" | "vllm" -- Runtimes: If `config.runtimes` is empty, discover available runtimes (Ollama) +-- Runtimes: If `config.runtimes` is empty, discover available runtimes (Ollama, vLLM)Also applies to: 136-141, 223-226
apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx-270-270 (1)
270-270:⚠️ Potential issue | 🟡 MinorAdd JSDoc for exported
BlindVsInformedChart.
Exported functions should document purpose/params/returns/throws.As per coding guidelines, “All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws.”Suggested JSDoc
+/** + * Purpose: Render blind vs informed pass-rate comparison by model/harness. + * `@param` items - Matrix item results to summarize + * `@returns` Chart card element + * `@throws` Never (renders empty states when data is missing) + */ export function BlindVsInformedChart({ items }: BlindVsInformedChartProps) {src/harnesses/opencode-output.ts-194-200 (1)
194-200:⚠️ Potential issue | 🟡 MinorAdd an explicit
@throwsline to the exported JSDoc.
Even “never throws” should be documented per guideline.As per coding guidelines, “All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws.”Suggested JSDoc tweak
/** * Normalize OpenCode JSON/JSONL output into plain assistant text or extracted tool-call code. * * `@param` raw - Raw stdout/stderr from OpenCode * `@returns` Normalized output and method indicator + * `@throws` Never (malformed output falls back to `{ method: "raw" }`) */ export function normalizeOpenCodeOutput(raw: string): OpenCodeNormalizedOutput {apps/dashboard/src/components/charts/frontier-eval-scatter.tsx-1-9 (1)
1-9:⚠️ Potential issue | 🟡 MinorMove/expand the file header to the top and include exports/invariants.
Right now the header appears after imports and doesn’t list exports/invariants.As per coding guidelines, “Begin every file with a short header documenting purpose, exports, and invariants.”Suggested update
+/** + * Purpose: Frontier eval scatter plot using Recharts. + * Exports: FrontierEvalScatter + * Invariants: items include automatedScore + frontierEval for plotting. + */ import { Card, CardContent, CardHeader, CardTitle } from "@/components/ui/card"; import { WithInfoTooltip } from "@/components/ui/info-tooltip"; import { computeItemPassRate } from "@/lib/aggregations"; import { scatter as scatterTooltips } from "@/lib/tooltip-content"; import type { MatrixItemResult } from "@/lib/types"; -/** - * Purpose: Frontier eval scatter plot using Recharts. - * Shows relationship between automated pass rate and frontier eval score. - */apps/dashboard/src/components/charts/frontier-eval-scatter.tsx-66-66 (1)
66-66:⚠️ Potential issue | 🟡 MinorAdd JSDoc for exported
FrontierEvalScatter.
Exported functions require purpose/params/returns/throws.As per coding guidelines, “All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws.”Suggested JSDoc
+/** + * Purpose: Render frontier-eval scatter plot for a set of matrix items. + * `@param` items - Matrix item results to plot + * `@returns` Chart card element + * `@throws` Never (renders empty state when no data) + */ export function FrontierEvalScatter({ items }: FrontierEvalScatterProps) {apps/dashboard/src/components/charts/blind-vs-informed-chart.tsx-1-15 (1)
1-15:⚠️ Potential issue | 🟡 MinorMove/expand the file header to the top and include exports/invariants.
The header sits after imports and doesn’t list exports/invariants.As per coding guidelines, “Begin every file with a short header documenting purpose, exports, and invariants.”Suggested update
+/** + * Purpose: Blind vs informed prompt comparison chart. + * Exports: BlindVsInformedChart + * Invariants: items include passType values (blind|informed). + */ import { Card, CardContent, CardHeader, CardTitle } from "@/components/ui/card"; import { WithInfoTooltip } from "@/components/ui/info-tooltip"; import { Tabs, TabsContent, TabsList, TabsTrigger } from "@/components/ui/tabs"; @@ -import type { MatrixItemResult } from "@/lib/types"; -/** - * Purpose: Blind vs informed prompt comparison chart. - * Shows paired bars comparing pass rates between blind and informed prompts. - */ +import type { MatrixItemResult } from "@/lib/types";apps/dashboard/src/components/run-detail/matrix-table.tsx-69-73 (1)
69-73:⚠️ Potential issue | 🟡 MinorClarify column header semantics: "PASS" header shows prompt type, not pass status.
The column header "PASS" (line 71) and tooltip "Prompt type: blind (no hints) or informed (includes guidance and examples)" indicate this column displays the prompt type dimension, but the header label is ambiguous—it could be confused with the TESTS column that shows actual pass rates. The cell content at line 123 confirms it displays
item.passType("blind" or "informed"), not a pass/fail indicator.Consider renaming the header to "PASS TYPE" or "TYPE" to align with the actual content and clarify the distinction from the TESTS column.
apps/dashboard/src/components/charts/composite-score-chart.tsx-106-114 (1)
106-114:⚠️ Potential issue | 🟡 MinorAvoid truncated-name collisions in Y-axis click handling.
Lines 108 and 271–273 use truncated names to look up raw items. If two names share the same prefix, clicks can resolve to the wrong row.
✅ Safer mapping (keep raw name, truncate only for display)
- return metrics.map((m) => ({ - name: m.name.length > 20 ? `${m.name.slice(0, 18)}...` : m.name, + return metrics.map((m) => ({ + name: m.name, effectiveScore: m.effectiveScore * 100, @@ - <text + <text x={-5} y={0} dy={4} textAnchor="end" fill="hsl(210, 30%, 92%)" fontSize={12} className={onClick ? "hover:fill-amber-400 transition-colors" : ""} > - {payload.value} + {payload.value.length > 20 + ? `${payload.value.slice(0, 18)}...` + : payload.value} </text> @@ - onClick={ - onBarClick - ? (name) => { - // Find the raw name from the data - const item = data.find((d) => d.name === name); - if (item) onBarClick(item.raw.name); - } - : undefined - } + onClick={onBarClick ? (name) => onBarClick(name) : undefined} /> )} />Also applies to: 155-165, 260-276
apps/dashboard/src/components/charts/composite-score-chart.tsx-349-352 (1)
349-352:⚠️ Potential issue | 🟡 MinorAdd required TSDoc for
CompositeScoreChart.The exported component needs purpose/params/returns/throws documentation.
📝 Example TSDoc
+/** + * Purpose: Render composite score charts by dimension. + * `@param` items - Matrix items used to compute composite metrics. + * `@param` onDimensionClick - Optional click handler for a selected dimension row. + * `@returns` Composite score chart card element. + * `@throws` Never. + */ export function CompositeScoreChart({ items, onDimensionClick, }: CompositeScoreChartProps) {As per coding guidelines, "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
apps/dashboard/src/components/compare/compare-page.tsx-24-24 (1)
24-24:⚠️ Potential issue | 🟡 MinorAdd required TSDoc for
ComparePageContent.Line 24 exports a component without the required TSDoc (purpose, params, returns, throws).
📝 Example TSDoc
+/** + * Purpose: Compare page component for comparing two benchmark runs. + * `@returns` Compare page content element. + * `@throws` Never. + */ export function ComparePageContent() {As per coding guidelines, "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
apps/dashboard/src/lib/aggregations-tooling.ts-21-26 (1)
21-26:⚠️ Potential issue | 🟡 MinorAdd
@throwsto exported function TSDoc blocks.The exported functions include purpose/params/returns but omit
@throws.📝 Example fix
/** * Infers which harnesses are expected to use tools. * * `@param` items - Matrix items * `@returns` Set of harness names with expected tool usage + * `@throws` Never. */As per coding guidelines, "All exported functions must have TSDoc/JSDoc with purpose, params, returns, and throws".
apps/dashboard/src/components/charts/composite-score-chart.tsx-1-19 (1)
1-19:⚠️ Potential issue | 🟡 MinorMove the header to the top and document exports/invariants.
The current purpose block starts after imports and omits exports/invariants.
💡 Suggested header placement
+/** + * Purpose: Composite score bar chart showing pass rate, tool success, and frontier score. + * Exports: CompositeScoreChart + * Invariants: + * - Metrics are normalized to 0–100 for display. + */ import { Card, CardContent, CardHeader, CardTitle } from "@/components/ui/card"; import { WithInfoTooltip } from "@/components/ui/info-tooltip"; import { Tabs, TabsContent, TabsList, TabsTrigger } from "@/components/ui/tabs"; @@ -import type { KeyboardEvent } from "react"; -/** - * Purpose: Composite score bar chart showing pass rate, tool success, and frontier score. - * Replaces simple pass rate chart with multi-metric grouped bars. - */ +import type { KeyboardEvent } from "react";As per coding guidelines, "Begin every file with a short header documenting purpose, exports, and invariants".
apps/dashboard/src/components/compare/compare-page.tsx-1-13 (1)
1-13:⚠️ Potential issue | 🟡 MinorMove the file header above imports and include exports/invariants.
Line 8 places the header after imports and it doesn’t document exports/invariants. Please move it to the top and include those fields.
💡 Suggested header placement
+/** + * Purpose: Compare page component for comparing two benchmark runs. + * Exports: ComparePageContent + * Invariants: + * - Requires two run selections to render comparison results. + */ import { PageContainer, PageHeader } from "@/components/layout/page-container"; import { Card, CardContent, CardHeader, CardTitle } from "@/components/ui/card"; import { Skeleton } from "@/components/ui/skeleton"; import { Tabs, TabsContent, TabsList, TabsTrigger } from "@/components/ui/tabs"; import { useCompare } from "@/hooks/use-compare"; import { useRuns } from "@/hooks/use-runs"; import { formatDate } from "@/lib/utils"; -/** - * Purpose: Compare page component for comparing two benchmark runs. - * Provides run selectors and displays comparison results. - */ import { useEffect, useState } from "react";As per coding guidelines, "Begin every file with a short header documenting purpose, exports, and invariants".
src/runtimes/discovery.ts-48-48 (1)
48-48:⚠️ Potential issue | 🟡 MinorValidate
VLLM_API_KEYwith Zod before use.
process.env.VLLM_API_KEYis a boundary input and should be validated with Zod. The same validation is already done insrc/runtimes/index.ts- consider importing that validated value or applying the same pattern here.✅ Suggested fix
+import { z } from "zod"; import { logger } from "../lib/logger.js"; import { createOllamaRuntime } from "./ollama-runtime.js"; import type { RuntimeName } from "./runtime.js"; import { createVllmRuntime } from "./vllm-runtime.js"; +const VllmApiKeySchema = z.string().min(1).optional(); +const vllmApiKey = VllmApiKeySchema.parse(process.env.VLLM_API_KEY); + // ... in isRuntimeAvailable: case "vllm": { const runtime = createVllmRuntime({ baseUrl: config.vllmBaseUrl, defaultTimeoutMs: config.timeoutMs, - apiKey: process.env.VLLM_API_KEY, + apiKey: vllmApiKey, });As per coding guidelines: "Treat all boundary inputs (CLI args, env, JSON files, HTTP responses) as 'unknown' and parse immediately with Zod."
src/harnesses/goose-output.ts-222-274 (1)
222-274:⚠️ Potential issue | 🟡 MinorAdd
@throwsto the exportednormalizeGooseOutputJSDoc.
The function doesn’t throw, but the guideline requires explicit@throwscoverage.📘 Doc update
/** * Normalizes Goose JSON output into plain assistant text or extracted code. * * `@param` raw - Raw stdout from Goose * `@returns` Normalized output and method indicator + * `@throws` {Error} Never directly; malformed input returns a raw fallback. */ export function normalizeGooseOutput(raw: string): GooseNormalizedOutput {As per coding guidelines: "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".
src/lib/model-aliases.ts-133-233 (1)
133-233:⚠️ Potential issue | 🟡 MinorAdd
@throwsto the exported helper docs in this section.
parseInlineAliasescan throw viaparseInlineAlias, and the other exported helpers should explicitly document no-throw behavior to satisfy the guideline.📘 Doc updates (apply similarly to the remaining exports)
/** * Parses multiple inline alias definitions. * * `@param` inlines - Array of inline alias strings * `@returns` Combined alias map + * `@throws` {Error} If any inline definition is invalid (bubbles from parseInlineAlias) */ export function parseInlineAliases(inlines: string[]): ModelAliasMap {/** * Checks if a model name is a known alias. * * `@param` modelName - Model name to check * `@param` aliases - Alias map to check against * `@returns` true if the name is a defined alias + * `@throws` {Error} Never directly */ export function isAlias(modelName: string, aliases: ModelAliasMap): boolean {As per coding guidelines: "All exported functions require TSDoc/JSDoc documentation (purpose, params, returns, throws)".
src/runner/index.ts-302-309 (1)
302-309:⚠️ Potential issue | 🟡 MinorConsider swallowing errors in the finally block to avoid masking the original exception.
If
stopManagedVllmthrows in the finally block, it will mask any exception that caused the loop to exit early. The cleanup should be best-effort.🛡️ Proposed fix
} finally { if (managedVllm && managedVllmStarted && managedVllm.stopAfterRun) { log.warn( "Benchmark ended while managed vLLM was running; stopping vLLM in finally...", ); - await stopManagedVllm(managedVllm); + try { + await stopManagedVllm(managedVllm); + } catch (cleanupError) { + log.error({ cleanupError }, "Failed to stop managed vLLM in finally block"); + } } }
Summary by CodeRabbit
New Features
Enhancements
Documentation