
feat(consistency): add cross-run trend tracking template and methodology #1

@nanookclaw


## Summary

The `consistency.md` dimension asks the right question (does the system behave similarly across repeated runs?), but the current `experiment-log.md` template captures only a single snapshot. There is no mechanism to track consistency trends across multiple experiment-log entries over time.

## The Gap

`experiment-log.md` records one run:

```markdown
- Date:
- Consistency result:
```

To answer whether the system is getting more or less consistent over time, you need to compare across runs. Right now that comparison is manual and informal; teams typically don't do it systematically.
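As a rough illustration of what a systematic version could look like, the sketch below pulls the consistency result out of a directory of past logs. The file-naming pattern and the exact field format are assumptions for the example, not the project's actual layout.

```python
# Minimal sketch: collect consistency results from past experiment logs.
# Assumes (hypothetically) one log file per run, each containing a line
# like "- Consistency result: 0.87" -- the real field format may differ.
import re
from pathlib import Path

def collect_results(log_dir: str) -> list[tuple[str, float]]:
    pattern = re.compile(r"-\s*Consistency result:\s*([0-9.]+)")
    results = []
    for log in sorted(Path(log_dir).glob("experiment-log-*.md")):
        match = pattern.search(log.read_text())
        if match:
            results.append((log.name, float(match.group(1))))
    return results

for name, score in collect_results("logs/"):
    print(f"{name}: consistency = {score:.2f}")
```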

## Proposal

Two additions:

### 1. `templates/baseline-comparison.md`

A structured template for comparing current experiment logs against a known-good baseline:

```markdown
# Baseline Comparison Log

## Baseline (v___)
- Date established:
- System version:
- Consistency result (baseline):
- Correctness result (baseline):

## Comparison Runs

| Date | System version | Consistency Δ | Correctness Δ | Notes |
|------|----------------|---------------|---------------|-------|
|      |                |               |               |       |

## Trend Assessment
- Direction (improving / stable / degrading):
- Sessions since last significant change:
- Action threshold: flag when any dimension Δ > 0.10 vs baseline
```
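For illustration, the action-threshold rule could be checked mechanically. In the sketch below, the dimension names and the 0.10 threshold come from the template itself; the dict inputs are a hypothetical stand-in for however the logs get parsed.

```python
# Minimal sketch of the template's action-threshold rule. The dimension
# names and the 0.10 threshold mirror the template above; the dict inputs
# are a hypothetical stand-in for parsed log entries.
BASELINE = {"consistency": 0.90, "correctness": 0.85}  # illustrative values
THRESHOLD = 0.10

def deltas(current: dict[str, float]) -> dict[str, float]:
    """Signed per-dimension delta: current minus baseline (rounded so
    the printed values are free of floating-point noise)."""
    return {dim: round(current[dim] - BASELINE[dim], 4) for dim in BASELINE}

def flagged(current: dict[str, float]) -> list[str]:
    """Dimensions whose absolute delta exceeds the action threshold."""
    return [dim for dim, d in deltas(current).items() if abs(d) > THRESHOLD]

run = {"consistency": 0.78, "correctness": 0.84}
print(deltas(run))    # {'consistency': -0.12, 'correctness': -0.01}
print(flagged(run))   # ['consistency'] -> degraded beyond the threshold
```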

### 2. Extension to `docs/consistency.md`

Add a section on longitudinal consistency, distinguishing between:

- Within-run stability (current scope): the same agent and scenario, repeated N times in one session
- Cross-run consistency (proposed): the same agent and scenario, measured across sessions over days or weeks

A system that is consistent within a single evaluation run can still drift significantly across runs separated by days or model updates. The cross-run view is what matters for production reliability.
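To make the distinction concrete, here is a small sketch of the two measurements side by side. The data layout and the stdev-based stability metric are illustrative assumptions, not something the framework currently prescribes.

```python
# Illustrative sketch: within-run stability vs. cross-run consistency.
# The data layout and stdev-based metric are assumptions for this
# example, not part of the framework.
from statistics import mean, stdev

# Hypothetical scores: N repeats of the same scenario per session.
sessions = {
    "2024-05-01": [0.90, 0.91, 0.89, 0.90],
    "2024-05-15": [0.84, 0.85, 0.83, 0.84],
}

for date, scores in sessions.items():
    # Within-run stability: spread across repeats inside one session.
    print(f"{date}: mean={mean(scores):.3f}, stdev={stdev(scores):.3f}")

# Cross-run consistency: drift of the per-session means over time.
means = [mean(s) for s in sessions.values()]
print(f"cross-run drift: {means[-1] - means[0]:+.3f}")
```

In this toy data, each session is tight internally (stdev around 0.008) yet the session means drift by 0.06, which is exactly the failure mode a single-run snapshot misses.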

## Why This Matters

Practitioners building on top of this framework will run evaluations repeatedly over time: before and after model updates, after prompt changes, after infrastructure changes. Without a structured cross-run comparison mechanism, they will either:

1. Manually eyeball past logs (error-prone), or
2. Look only at the most recent run in isolation (missing drift entirely)

The baseline comparison template makes cross-run consistency a first-class evaluation artifact.

## Related Work

The PDR (Periodic Delivery Reliability) framework (DOI: 10.5281/zenodo.19339987) tracks this as `delivery_score` and `calibration_delta` across sessions: the same cross-session consistency measurement problem, applied to autonomous agent workflows. The delta-from-baseline approach is validated in that context.

Happy to draft both the template and the consistency.md extension if this direction looks right.
