feat(tooling): paired significance testing for result-file comparisons (external-first) #365

@christso

Canonical Plan

This issue body is the source of truth for implementation.

Objective

We add statistically grounded significance checks for baseline-vs-candidate comparisons of result files.

Why This Matters

  • Raw deltas can be misleading when variance is high.
  • We need a principled way to distinguish real improvements from noise.
  • Better statistical signals make benchmark decisions more trustworthy.

Why This Location

  • Significance testing is analysis over result artifacts, not evaluation execution logic.
  • Tooling can evolve methods faster than core runtime contracts.
  • External-first keeps core focused on primitives and execution.

Architecture Boundary

External-first post-processing (wrapper/tooling), not core evaluator runtime.

Deliverable Location

  • Primary location (in-repo tooling path): examples/features/benchmark-tooling/
  • Script location: examples/features/benchmark-tooling/scripts/ (significance utility)
  • Usage/docs: examples/features/benchmark-tooling/README.md

Design Latitude

We choose the MVP method and interface (for example, paired bootstrap or McNemar first), then iterate.
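As one illustration of what a first method could look like, the following is a minimal sketch of an exact two-sided McNemar test on paired pass/fail outcomes. The function name `mcnemar_exact` and the list-of-booleans input shape are assumptions made for this example; the issue does not prescribe an interface or a result-file schema.

```python
from math import comb


def mcnemar_exact(baseline_pass, candidate_pass):
    """Exact two-sided McNemar test on paired boolean outcomes.

    Hypothetical helper: both lists must be aligned, i.e. index i in each
    list refers to the same test ID.
    """
    if len(baseline_pass) != len(candidate_pass):
        raise ValueError("inputs must be aligned and of equal length")

    # Only discordant pairs carry information about a difference.
    # b: baseline passed, candidate failed (regressions)
    # c: baseline failed, candidate passed (improvements)
    b = sum(1 for x, y in zip(baseline_pass, candidate_pass) if x and not y)
    c = sum(1 for x, y in zip(baseline_pass, candidate_pass) if not x and y)
    n = b + c
    if n == 0:
        # No discordant pairs: no evidence of any difference.
        return {"regressions": 0, "improvements": 0, "p_value": 1.0}

    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return {"regressions": b, "improvements": c, "p_value": p_value}
```

McNemar only applies when each test has a binary outcome. For continuous scores, a paired bootstrap over per-test score differences is the more natural fit.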

Acceptance Signals

  • We support at least one paired significance method on aligned test IDs.
  • We output machine-readable statistics plus a clear human-readable verdict.
  • We handle missing or unmatched pairs deterministically and report them explicitly in the output.
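The acceptance signals above could be met along the following lines. This is a sketch under stated assumptions, not a proposed final interface: it assumes each result file has already been loaded into a plain `{test_id: score}` dict with string IDs and numeric scores, and the names `align_pairs`, `paired_bootstrap`, and `compare` are hypothetical. Unmatched IDs are excluded from the test and listed in sorted order, and the resampling uses a fixed seed so repeated runs give the same report.

```python
import json
import random


def align_pairs(baseline, candidate):
    """Align two {test_id: score} dicts and list unmatched IDs in sorted order."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        "pairs": [(tid, baseline[tid], candidate[tid]) for tid in shared],
        "only_in_baseline": sorted(baseline.keys() - candidate.keys()),
        "only_in_candidate": sorted(candidate.keys() - baseline.keys()),
    }


def paired_bootstrap(pairs, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference."""
    diffs = [cand - base for _, base, cand in pairs]
    if not diffs:
        raise ValueError("no aligned pairs to compare")
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    n = len(diffs)
    # Resample per-test differences with replacement, keeping pairs intact.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return {"mean_delta": sum(diffs) / n, "ci_low": lo, "ci_high": hi}


def compare(baseline, candidate, **kwargs):
    """Build a machine-readable report with a human-readable verdict."""
    aligned = align_pairs(baseline, candidate)
    stats = paired_bootstrap(aligned["pairs"], **kwargs)
    # The difference is called significant only if the interval excludes zero.
    if stats["ci_low"] > 0:
        verdict = "candidate better"
    elif stats["ci_high"] < 0:
        verdict = "candidate worse"
    else:
        verdict = "no significant difference"
    return {
        "n_pairs": len(aligned["pairs"]),
        "only_in_baseline": aligned["only_in_baseline"],
        "only_in_candidate": aligned["only_in_candidate"],
        "verdict": verdict,
        **stats,
    }


if __name__ == "__main__":
    base = {"t1": 0.5, "t2": 0.4, "t3": 0.6, "only_b": 1.0}
    cand = {"t1": 0.7, "t2": 0.6, "t3": 0.8, "only_c": 0.0}
    print(json.dumps(compare(base, cand), indent=2, sort_keys=True))
```

Dropping unmatched IDs while listing them in the report is one possible policy. Failing fast on any mismatch would be another, and either would satisfy the requirement as long as the behavior is documented.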

Non-Goals

  • No new built-in evaluator primitives.
  • No coupling to runtime scoring/execution pipeline.
