dashboard: add coverage diagnostics and improve run alignment #9

AustinKelsay merged 4 commits into main from
Conversation
📝 Walkthrough

Adds diagnostics and frontier-eval failure tracking across aggregations and UI: a new CoverageDiagnostics component wired into RunDetailPage, diagnostics aggregation helpers, expanded FailureStats and failure UI (including frontier-eval), a timing UI refactor, tooltip entries for new failure types, and minor formatting/import tweaks.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 2
🧹 Nitpick comments (4)
src/harnesses/opencode-adapter.ts (1)
471-473: Avoid silently swallowing cleanup failures.

This `.catch(() => {})` drops errors. Please at least log at debug/warn so failures aren't silent. As per coding guidelines: "Throw on invalid config, invalid schemas, and programmer mistakes; avoid silent fallbacks."

♻️ Proposed change

```diff
- fs.promises
-   .rm(workDir, { recursive: true, force: true })
-   .catch(() => {});
+ fs.promises
+   .rm(workDir, { recursive: true, force: true })
+   .catch((error) => {
+     log.warn({ error, workDir }, "Failed to cleanup OpenCode workDir");
+   });
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/harnesses/opencode-adapter.ts` around lines 471-473: The cleanup call `fs.promises.rm(workDir, { recursive: true, force: true }).catch(() => {})` is swallowing errors; change the catch to surface/log the error (e.g., `.catch(err => processLogger?.warn("Failed to remove workDir " + workDir, err))` or at minimum `console.warn`) so failures aren't silent; update the code around `fs.promises.rm` and ensure the log includes the workDir and the error object for debugging.

apps/dashboard/src/components/run-detail/coverage-diagnostics.tsx (2)
31-41: Consider documenting the delta semantics.

The `getDeltaTone` logic treats `actual > base` as warning. While this makes sense for alignment diagnostics (more executed than planned suggests drift), a brief inline comment would clarify the intentional semantics for future maintainers.

📝 Optional: Add clarifying comment

```diff
+/** Returns color class based on alignment: exact match = success, over = warning, under = danger */
 function getDeltaTone(base: number, actual: number): string {
   if (base === actual) return "text-success";
   if (actual > base) return "text-warning";
   return "text-danger";
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/run-detail/coverage-diagnostics.tsx` around lines 31 - 41, Add a brief inline comment above getDeltaTone (and optionally formatDelta) explaining the delta semantics: equal (base === actual) is success, actual > base is a warning because it indicates more was executed than planned (potential drift), and actual < base is danger because less was executed than expected; place the comment immediately above the getDeltaTone function to make intent clear for future maintainers.
166-176: Verify the distinction between "failures" and "missing evals" is clear to users.

The panel displays `frontierEvalFailures` (items where evaluation was attempted but failed) and "missing evals" (`totalItems - frontierEvalItems`, items without any eval data). These are different metrics, but both contribute to incomplete coverage. Consider whether a tooltip would help users understand this distinction.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/run-detail/coverage-diagnostics.tsx` around lines 166-176: The UI shows frontierEvalFailures and computes "missing evals" as coverage.totalItems - coverage.frontierEvalItems, which can confuse users because failures (attempted but failed) and missing evals (no eval attempted) are distinct; update the Frontier Eval Failures card in coverage-diagnostics.tsx to make this distinction explicit (e.g., change the secondary label text and/or add a concise tooltip or info icon) that explains frontierEvalFailures refers to attempted-but-failed evaluations and missing evals are items with no eval data (use the symbols frontierEvalFailures, coverage.totalItems and coverage.frontierEvalItems to locate the values to describe). Ensure the tooltip copy is short, clear, and attached to the warning number or the small description so users can easily see the difference.

apps/dashboard/src/lib/aggregations-core.ts (1)
155-177: Consider consolidating the duplicated `computeTimingStatsFromValues` helper.

This helper is duplicated in `aggregations-diagnostics.ts` (lines 49-71). Consider exporting it from one location to avoid divergent implementations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/aggregations-core.ts` around lines 155 - 177, The computeTimingStatsFromValues helper is duplicated; export the single implementation (the computeTimingStatsFromValues function) from its current module and replace the duplicate in the diagnostics aggregation with an import to that exported function (or move it to a new shared util and import from there), then remove the duplicated implementation so both aggregations use the same symbol; update any imports/exports to ensure computeTimingStatsFromValues is exported (e.g., add an export) and imported where the diagnostics aggregation currently defines its own copy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/components/run-detail/failure-breakdown.tsx`:
- Around line 29-30: The new failure keys missing_export and factory_init_failed
were added to FAILURE_LABELS but have no tooltip text in tooltip-content.ts; add
descriptive entries for these keys in tooltip-content.ts (matching the keys used
by FailureTypeList) so the UI shows consistent tooltips—for example, add tooltip
strings for "missing_export" and "factory_init_failed" alongside the existing
tooltip map entries referenced by FailureTypeList.
In `@apps/dashboard/src/lib/aggregations-core.ts`:
- Around line 161-177: computeTimingStatsFromValues currently computes median
with Math.floor(sortedValues.length / 2) which selects a single middle element
and is incorrect for even-length arrays; update the median calculation in
computeTimingStatsFromValues to match the logic used in the helper in
aggregations-diagnostics.ts by returning the average of the two middle values
when sortedValues.length is even, otherwise return the single middle element,
ensuring p90/other stats remain unchanged.
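The even/odd median fix described in the comment above can be sketched as a standalone function (the real `computeTimingStatsFromValues` also returns min/max/mean/p90/count, which this sketch omits):

```typescript
/**
 * Median that averages the two middle values for even-length samples,
 * matching the helper in aggregations-diagnostics.ts.
 * Standalone sketch; not the project's actual exported helper.
 */
function median(values: number[]): number | null {
  const sorted = [...values].sort((a, b) => a - b);
  if (sorted.length === 0) return null;
  const mid = sorted.length / 2;
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2 // average the two middle values
    : sorted[Math.floor(mid)]; // single middle element
}

// Math.floor(length / 2) alone would pick 30 for [10, 20, 30, 40];
// averaging the two middle values gives 25.
```

This is why the flagged `sortedValues[Math.floor(sortedValues.length / 2)]` is biased upward for even-length samples.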
---
Nitpick comments:
In `@apps/dashboard/src/components/run-detail/coverage-diagnostics.tsx`:
- Around line 31-41: Add a brief inline comment above getDeltaTone (and
optionally formatDelta) explaining the delta semantics: equal (base === actual)
is success, actual > base is a warning because it indicates more was executed
than planned (potential drift), and actual < base is danger because less was
executed than expected; place the comment immediately above the getDeltaTone
function to make intent clear for future maintainers.
- Around line 166-176: The UI shows frontierEvalFailures and computes "missing
evals" as coverage.totalItems - coverage.frontierEvalItems, which can confuse
users because failures (attempted but failed) and missing evals (no eval
attempted) are distinct; update the Frontier Eval Failures card in
coverage-diagnostics.tsx to make this distinction explicit (e.g., change the
secondary label text and/or add a concise tooltip or info icon) that explains
frontierEvalFailures refers to attempted-but-failed evaluations and missing
evals are items with no eval data (use the symbols frontierEvalFailures,
coverage.totalItems and coverage.frontierEvalItems to locate the values to
describe). Ensure the tooltip copy is short, clear, and attached to the warning
number or the small description so users can easily see the difference.
In `@apps/dashboard/src/lib/aggregations-core.ts`:
- Around line 155-177: The computeTimingStatsFromValues helper is duplicated;
export the single implementation (the computeTimingStatsFromValues function)
from its current module and replace the duplicate in the diagnostics aggregation
with an import to that exported function (or move it to a new shared util and
import from there), then remove the duplicated implementation so both
aggregations use the same symbol; update any imports/exports to ensure
computeTimingStatsFromValues is exported (e.g., add an export) and imported
where the diagnostics aggregation currently defines its own copy.
In `@src/harnesses/opencode-adapter.ts`:
- Around line 471-473: The cleanup call fs.promises.rm(workDir, { recursive:
true, force: true }).catch(() => {}) is swallowing errors; change the catch to
surface/log the error (e.g., .catch(err => processLogger?.warn(`Failed to remove
workDir ${workDir}:`, err)) or at minimum console.warn) so failures aren’t
silent—update the code around fs.promises.rm and ensure the log includes the
workDir and the error object for debugging.
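As a runnable illustration of the non-silent cleanup pattern suggested above (using `console.warn` as a stand-in for the project's logger, which isn't shown here):

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

/**
 * Removes a work directory, logging (not swallowing) any failure.
 * Sketch of the review suggestion; the real adapter uses its own logger.
 */
async function cleanupWorkDir(workDir: string): Promise<void> {
  await fs.promises
    .rm(workDir, { recursive: true, force: true })
    .catch((error) => {
      // Surface the failure with both the path and the error object.
      console.warn(`Failed to remove workDir ${workDir}:`, error);
    });
}
```

With `force: true`, calling this on an already-removed directory is a no-op, so the warning only fires on genuine failures (e.g., permission errors).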
🧹 Nitpick comments (1)
apps/dashboard/src/lib/aggregations-core.ts (1)
276-288: Consider extracting the detail item type for clarity.

The inline type for `frontierEvalFailureDetails` is fairly complex. Extracting it to a named interface would improve readability and enable reuse if this shape is needed elsewhere.

♻️ Proposed extraction

```diff
+/** Detail record for a single frontier eval failure. */
+export interface FrontierEvalFailureDetail {
+  id: string;
+  runtime: string;
+  model: string;
+  harness: string;
+  test: string;
+  passType: string;
+  type: string;
+  status?: number;
+  latencyMs?: number;
+  evalModel?: string;
+  attempts?: number;
+}
+
 /** Failure counts by type. */
 export interface FailureStats {
   generationFailures: Map<string, number>;
   scoringFailures: Map<string, number>;
   frontierEvalFailures: Map<string, number>;
-  frontierEvalFailureDetails: Array<{
-    id: string;
-    runtime: string;
-    model: string;
-    harness: string;
-    test: string;
-    passType: string;
-    type: string;
-    status?: number;
-    latencyMs?: number;
-    evalModel?: string;
-    attempts?: number;
-  }>;
+  frontierEvalFailureDetails: FrontierEvalFailureDetail[];
   totalGenerationFailures: number;
   totalScoringFailures: number;
   totalFrontierEvalFailures: number;
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/aggregations-core.ts` around lines 276 - 288, The inline complex object type used for the frontierEvalFailureDetails property should be extracted into a named interface to improve readability and reuse: define an interface (e.g., FrontierEvalFailureDetail) with the fields id, runtime, model, harness, test, passType, type, and optional status, latencyMs, evalModel, attempts, then replace the inline Array<{...}> type on frontierEvalFailureDetails with FrontierEvalFailureDetail[] and update any other places that use this shape to reference the new interface (search for frontierEvalFailureDetails and related usages to change the type annotation).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@apps/dashboard/src/lib/aggregations-core.ts`:
- Around line 276-288: The inline complex object type used for the
frontierEvalFailureDetails property should be extracted into a named interface
to improve readability and reuse: define an interface (e.g.,
FrontierEvalFailureDetail) with the fields id, runtime, model, harness, test,
passType, type, and optional status, latencyMs, evalModel, attempts, then
replace the inline Array<{...}> type on frontierEvalFailureDetails with
FrontierEvalFailureDetail[] and update any other places that use this shape to
reference the new interface (search for frontierEvalFailureDetails and related
usages to change the type annotation).
@CodeRabbit full review

✅ Actions performed: Full review triggered.
Actionable comments posted: 3
🧹 Nitpick comments (3)
apps/dashboard/src/lib/aggregations-diagnostics.ts (1)
49-71: Duplicate `computeTimingStatsFromValues` helper.

This is an exact duplicate of the helper in `aggregations-core.ts` (lines 161-183). To maintain consistency and reduce maintenance burden, consider exporting it from `aggregations-core.ts` and importing it here.

♻️ Suggested refactor

In `aggregations-core.ts`, export the helper:

```diff
-function computeTimingStatsFromValues(values: number[]): TimingStats | null {
+export function computeTimingStatsFromValues(values: number[]): TimingStats | null {
```

Then in this file, import and use it:

```diff
-import type { TimingStats } from "./aggregations-core";
+import { computeTimingStatsFromValues, type TimingStats } from "./aggregations-core";
 import type { MatrixItemResult } from "./types";

-type DimensionCountItem = Pick<
-  MatrixItemResult,
-  "runtime" | "model" | "harness" | "test" | "passType"
->;
-
-/** Timing stats grouped by dimension name. */
-export interface GroupTimingStats {
-  name: string;
-  stats: TimingStats;
-}
-
-...
-
-/**
- * Computes timing statistics from an arbitrary numeric sample.
- *
- * `@param` values - Timing values in milliseconds
- * `@returns` Timing stats or null when no values are present
- */
-function computeTimingStatsFromValues(values: number[]): TimingStats | null {
-  const sortedValues = [...values].sort((a, b) => a - b);
-  if (sortedValues.length === 0) return null;
-
-  const sum = sortedValues.reduce((a, b) => a + b, 0);
-  const rawP90Index = Math.ceil(sortedValues.length * 0.9) - 1;
-  const p90Index = Math.min(sortedValues.length - 1, Math.max(0, rawP90Index));
-
-  const mid = sortedValues.length / 2;
-  const median =
-    sortedValues.length % 2 === 0
-      ? (sortedValues[mid - 1] + sortedValues[mid]) / 2
-      : sortedValues[Math.floor(mid)];
-
-  return {
-    min: sortedValues[0],
-    max: sortedValues[sortedValues.length - 1],
-    median,
-    mean: sum / sortedValues.length,
-    p90: sortedValues[p90Index] || sortedValues[sortedValues.length - 1],
-    count: sortedValues.length,
-  };
-}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/aggregations-diagnostics.ts` around lines 49-71: The computeTimingStatsFromValues function here is a duplicate; remove the local copy and instead export computeTimingStatsFromValues from aggregations-core.ts (preserving its signature and the TimingStats type) and import it into this file, then update any local references to use the imported function; ensure you delete the duplicate implementation in aggregations-diagnostics.ts so there's a single shared implementation.

apps/dashboard/src/components/run-detail/failure-breakdown.tsx (1)
74-81: Add JSDoc documentation for the exported `FailureBreakdown` function.

The exported function lacks TSDoc documentation. Per coding guidelines, all exported functions must have documentation including purpose, params, returns, and throws.

📝 Suggested JSDoc

```diff
+/**
+ * Renders a breakdown card showing failure types grouped by category.
+ * Displays generation, scoring, and frontier eval failures with tooltips.
+ *
+ * `@param` props - Component props
+ * `@param` props.items - Matrix items to compute failure stats from
+ * `@returns` Card component or null if no failures exist
+ */
 export function FailureBreakdown({ items }: FailureBreakdownProps) {
```

As per coding guidelines: "All exported functions must have TSDoc/JSDoc documentation including purpose, params, returns, and throws"

🤖 Prompt for AI Agents
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/run-detail/failure-breakdown.tsx` around lines 74-81: Add a JSDoc/TSDoc block above the exported FailureBreakdown function describing its purpose (renders a failure summary UI for given items), documenting the parameter `items` via the FailureBreakdownProps type, the return value (JSX.Element | null), and any exceptions/edge-cases (e.g., if computeFailureStats throws); reference the function name FailureBreakdown and the helper computeFailureStats so the doc points to the behavior and inputs used.

apps/dashboard/src/lib/aggregations-core.ts (1)
155-183: Extract `computeTimingStatsFromValues` to a shared utility to avoid duplication.

This internal helper is identical in both files and can be exported from `aggregations-core.ts` and imported by `aggregations-diagnostics.ts`. Since `aggregations-diagnostics.ts` already imports `TimingStats` from core, adding this function import maintains the existing dependency pattern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/lib/aggregations-core.ts` around lines 155 - 183, Extract the duplicate helper computeTimingStatsFromValues from aggregations-diagnostics.ts into aggregations-core.ts and export it; then update aggregations-diagnostics.ts to import { computeTimingStatsFromValues } (it already imports TimingStats) and remove the local copy. Ensure the exported function signature and return type remain TimingStats | null and that callers use the imported function instead of the duplicated implementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/components/run-detail/coverage-diagnostics.tsx`:
- Around line 166-175: The "missing evals" count currently uses
(coverage.totalItems - coverage.frontierEvalItems) which double-counts
frontierEvalFailures; update the UI in coverage-diagnostics.tsx so the displayed
missing-evals number excludes failures (e.g., compute missing =
coverage.totalItems - coverage.frontierEvalItems - frontierEvalFailures) or
alternatively change the label to indicate it includes failures (e.g., "missing
or failed evals"); locate the paragraph that renders {coverage.totalItems -
coverage.frontierEvalItems} and adjust the expression or label accordingly to
avoid double-counting.
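A tiny arithmetic sketch of the double-counting described above (the coverage numbers are hypothetical; field names are taken from the comment):

```typescript
// Hypothetical coverage snapshot, using the field names from the comment above.
const coverage = { totalItems: 100, frontierEvalItems: 90 };
const frontierEvalFailures = 6; // evals attempted but failed

// Current label "missing evals" also counts the 6 failures.
const missingOrFailed = coverage.totalItems - coverage.frontierEvalItems;

// Suggested fix: items with no eval data at all.
const trulyMissing =
  coverage.totalItems - coverage.frontierEvalItems - frontierEvalFailures;
```

With these numbers, the card currently shows 10 "missing" while only 4 items genuinely lack eval data; either the subtraction or the label should change.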
In `@apps/dashboard/src/components/run-detail/timing-stats.tsx`:
- Line 154: The exported function TimingStats lacks the required TSDoc/JSDoc;
add a JSDoc block immediately above the TimingStats declaration describing the
function's purpose, its parameter (items: TimingStatsProps) with types and
meaning, the return value (React element / JSX.Element), and any errors/throws
if applicable; ensure the comment uses TSDoc/JSDoc syntax (/** ... */) and
mentions TimingStatsProps so linters and documentation tools recognize the
export.
- Around line 186-203: The tooltip copy used by the TimingGrid is specific to
"generation" and is misleading for scoring metrics; update TimingGrid to accept
a customizable tooltip/description prop (e.g., tooltipText or description) and
use a neutral default ("Timing information for this stage") to preserve current
behavior, then pass a scoring-specific string when rendering scoring blocks (in
the scoringStats branch where TimingGrid is invoked) or rely on the neutral
default for all. Modify the TimingGrid component signature and its internal
tooltip rendering to use the new prop (or default) and update the calls in the
Generation and Scoring usages (the TimingGrid(...) instances) accordingly.
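The tooltip-prop suggestion above could look roughly like this, reduced to a plain function so it runs without React (the prop name, default copy, and scoring string are all illustrative, not the component's actual API):

```typescript
// Illustrative sketch of a customizable TimingGrid tooltip with a neutral
// default. Names here are assumptions, not the real component's props.
interface TimingGridTooltipProps {
  tooltipText?: string;
}

const DEFAULT_TOOLTIP = "Timing information for this stage";

/** Resolves the tooltip copy, falling back to the neutral default. */
function resolveTooltip(props: TimingGridTooltipProps): string {
  return props.tooltipText ?? DEFAULT_TOOLTIP;
}

// Generation block keeps the neutral default:
const generationTooltip = resolveTooltip({});
// Scoring block passes stage-specific copy:
const scoringTooltip = resolveTooltip({
  tooltipText: "How long scoring took for this item",
});
```

The neutral default preserves current behavior for callers that pass nothing, while the scoring branch can opt into accurate copy.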
---
Nitpick comments:
In `@apps/dashboard/src/components/run-detail/failure-breakdown.tsx`:
- Around line 74-81: Add a JSDoc/TSDoc block above the exported FailureBreakdown
function describing its purpose (renders a failure summary UI for given items),
documenting the parameter `items` via the FailureBreakdownProps type, the return
value (JSX.Element | null), and any exceptions/edge-cases (e.g., if
computeFailureStats throws) — reference the function name FailureBreakdown and
the helper computeFailureStats so the doc points to the behavior and inputs
used.
In `@apps/dashboard/src/lib/aggregations-core.ts`:
- Around line 155-183: Extract the duplicate helper computeTimingStatsFromValues
from aggregations-diagnostics.ts into aggregations-core.ts and export it; then
update aggregations-diagnostics.ts to import { computeTimingStatsFromValues }
(it already imports TimingStats) and remove the local copy. Ensure the exported
function signature and return type remain TimingStats | null and that callers
use the imported function instead of the duplicated implementation.
In `@apps/dashboard/src/lib/aggregations-diagnostics.ts`:
- Around line 49-71: The computeTimingStatsFromValues function here is a
duplicate; remove the local copy and instead export computeTimingStatsFromValues
from aggregations-core.ts (preserving its signature and the TimingStats type)
and import it into this file, then update any local references to use the
imported function; ensure you delete the duplicate implementation in
aggregations-diagnostics.ts so there’s a single shared implementation.
Actionable comments posted: 1
🧹 Nitpick comments (1)
apps/dashboard/src/components/run-detail/timing-stats.tsx (1)
1-3: Update file header to reflect expanded scope.

The header mentions only "generation duration metrics" but the component now also handles scoring timing statistics and breakdowns. Consider updating it to reflect the full purpose and exports.

```diff
 /**
- * Purpose: Timing statistics component showing generation duration metrics.
+ * Purpose: Timing statistics component showing generation and scoring duration metrics.
+ * Exports: TimingStats
  */
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/dashboard/src/components/run-detail/timing-stats.tsx` around lines 1 - 3, Update the file header comment to describe the component's expanded responsibilities: mention that TimingStats (and any related exports such as exported types/helpers) displays generation duration metrics, scoring timing statistics, and breakdowns; also note what it exports (e.g., TimingStats component and any helper functions/types) and its purpose in the run-detail UI so the header accurately reflects scope and public API.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/dashboard/src/components/run-detail/coverage-diagnostics.tsx`:
- Around line 43-49: Add a `@throws` tag to the JSDoc for the exported React
component CoverageDiagnostics; update the existing docblock above the
CoverageDiagnostics function to include a "throws" entry (if it never throws,
use "@throws none") so the docblock contains purpose, `@param`(s), `@returns` and
`@throws` as required by the coding guidelines.
---
Nitpick comments:
In `@apps/dashboard/src/components/run-detail/timing-stats.tsx`:
- Around line 1-3: Update the file header comment to describe the component's
expanded responsibilities: mention that TimingStats (and any related exports
such as exported types/helpers) displays generation duration metrics, scoring
timing statistics, and breakdowns; also note what it exports (e.g., TimingStats
component and any helper functions/types) and its purpose in the run-detail UI
so the header accurately reflects scope and public API.
Summary by CodeRabbit
New Features
Improvements