feat(tooling): paired significance testing for result-file comparisons (external-first) #365

## Canonical Plan
This issue body is the source of truth for implementation.

## Objective
Add statistically grounded significance checks to baseline-vs-candidate comparisons of result files.

## Why This Matters
- Raw deltas can be misleading when variance is high.
- We need a principled way to distinguish real improvements from noise.
- Better statistical signals make benchmark decisions more trustworthy.

## Why This Location
- Significance testing is analysis over result artifacts, not evaluation execution logic.
- Tooling can evolve methods faster than core runtime contracts.
- External-first keeps core focused on primitives and execution.

## Architecture Boundary
External-first post-processing (wrapper/tooling), not core evaluator runtime.

## Deliverable Location
- Primary location (in-repo tooling path): `examples/features/benchmark-tooling/`
- Script location: `examples/features/benchmark-tooling/scripts/` (significance utility)
- Usage/docs: `examples/features/benchmark-tooling/README.md`

## Design Latitude
We are free to choose the MVP method and interface (for example, starting with a paired bootstrap or McNemar's test), then iterate.
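
To make one of the candidate methods concrete, here is a minimal sketch of an exact (binomial) McNemar test for paired pass/fail outcomes. The function name `mcnemar_exact` and the input shape (two equal-length lists of booleans, already aligned by test ID) are assumptions for illustration, not a decided interface.

```python
from math import comb


def mcnemar_exact(baseline_pass, candidate_pass):
    """Two-sided exact McNemar test on paired boolean outcomes.

    Hypothetical helper: inputs are assumed to be aligned by test ID.
    """
    if len(baseline_pass) != len(candidate_pass):
        raise ValueError("inputs must be paired and of equal length")

    # Only discordant pairs carry information about a difference.
    # b: baseline passed, candidate failed
    # c: baseline failed, candidate passed
    b = sum(1 for x, y in zip(baseline_pass, candidate_pass) if x and not y)
    c = sum(1 for x, y in zip(baseline_pass, candidate_pass) if not x and y)
    n = b + c
    if n == 0:
        # No discordant pairs: no evidence of any difference.
        return {"b": 0, "c": 0, "p_value": 1.0}

    # Under the null hypothesis, discordant pairs split 50/50,
    # so the smaller count follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return {"b": b, "c": c, "p_value": min(1.0, 2 * tail)}
```

McNemar only applies when each test has a binary outcome. For continuous scores, a paired bootstrap over per-test deltas is the more natural starting point.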

## Acceptance Signals
- We support at least one paired significance method on aligned test IDs.
- We output machine-readable statistics plus a clear human-readable verdict.
- We handle missing/unmatched pairs deterministically and transparently.
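
The three signals above could be satisfied by something like the following sketch: a paired bootstrap over per-test score deltas. Everything here is an assumption for illustration, including the function names (`align_pairs`, `paired_bootstrap`), the input shape (a dict mapping test ID to a numeric score, since the result-file format is not specified in this issue), the fixed seed, and the verdict strings.

```python
import json
import random
from statistics import mean


def align_pairs(baseline, candidate):
    """Split test IDs into shared and unmatched sets, sorted for determinism."""
    shared = sorted(baseline.keys() & candidate.keys())
    only_baseline = sorted(baseline.keys() - candidate.keys())
    only_candidate = sorted(candidate.keys() - baseline.keys())
    return shared, only_baseline, only_candidate


def paired_bootstrap(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-test delta (candidate - baseline)."""
    shared, only_baseline, only_candidate = align_pairs(baseline, candidate)
    if not shared:
        raise ValueError("no shared test IDs to compare")

    deltas = [candidate[t] - baseline[t] for t in shared]
    rng = random.Random(seed)  # fixed seed keeps reruns reproducible
    n = len(deltas)

    # Resample the paired deltas with replacement and record each mean.
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]

    # The difference counts as significant only if the interval excludes zero.
    if lo > 0:
        verdict = "candidate better"
    elif hi < 0:
        verdict = "candidate worse"
    else:
        verdict = "no significant difference"

    return {
        "method": "paired_bootstrap",
        "n_pairs": n,
        "mean_delta": mean(deltas),
        "ci_low": lo,
        "ci_high": hi,
        "alpha": alpha,
        "unmatched_baseline": only_baseline,  # reported, never silently dropped
        "unmatched_candidate": only_candidate,
        "verdict": verdict,
    }


if __name__ == "__main__":
    base = {"t1": 0.5, "t2": 0.6, "t3": 0.4, "only_b": 0.9}
    cand = {"t1": 0.7, "t2": 0.8, "t3": 0.5, "only_c": 0.1}
    print(json.dumps(paired_bootstrap(base, cand), indent=2))
```

In this sketch, unmatched IDs are excluded from the statistic but listed in the output, which is one way to keep the handling both deterministic and visible. Failing hard on any mismatch would be an equally defensible policy.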

## Non-Goals
- No new built-in evaluator primitives.
- No coupling to runtime scoring/execution pipeline.

