## Summary
The `consistency.md` dimension asks the right question — does the system behave similarly across repeated runs? — but the current `experiment-log.md` template captures only a single snapshot. There is no mechanism to track consistency trends across multiple experiment log entries over time.
## The Gap
`experiment-log.md` records one run:
- Date:
- Consistency result:
To answer whether the system is getting more or less consistent over time, you need to compare across runs. Right now that comparison is manual and informal — teams typically don't do it systematically.
## Proposal
Two additions:
1. `templates/baseline-comparison.md`
A structured template for comparing current experiment logs against a known-good baseline:
```markdown
# Baseline Comparison Log

## Baseline (v___)
- Date established:
- System version:
- Consistency result (baseline):
- Correctness result (baseline):

## Comparison Runs
| Date | System version | Consistency Δ | Correctness Δ | Notes |
|------|---------------|--------------|--------------|-------|
|      |               |              |              |       |

## Trend Assessment
- Direction (improving / stable / degrading):
- Sessions since last significant change:
- Action threshold: flag when any dimension Δ > 0.10 vs baseline
```
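To make the action threshold concrete, here is a minimal sketch in Python of the delta check a team might run over parsed log entries. Only the dimension names and the 0.10 threshold come from the template; the data structures and scores are invented for illustration, not a real API.

```python
# Delta-from-baseline check implied by the Trend Assessment section.
THRESHOLD = 0.10  # action threshold from the template above

# Hypothetical baseline scores, as recorded in the Baseline section.
baseline = {"consistency": 0.92, "correctness": 0.88}

# Hypothetical comparison runs, one dict per experiment-log entry.
runs = [
    {"date": "2024-05-01", "consistency": 0.90, "correctness": 0.87},
    {"date": "2024-05-15", "consistency": 0.79, "correctness": 0.86},
]

for run in runs:
    for dim in ("consistency", "correctness"):
        delta = run[dim] - baseline[dim]
        if abs(delta) > THRESHOLD:
            print(f"{run['date']}: {dim} Δ = {delta:+.2f} vs baseline, flag for review")
```

Running this flags the 2024-05-15 consistency drop (Δ = -0.13), which is exactly the kind of drift a single-snapshot log would miss.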
2. Extension to `docs/consistency.md`
Add a section on longitudinal consistency — the difference between:
- Within-run stability (current scope): same agent, same scenario, repeated N times in one session
- Cross-run consistency (proposed): same agent, same scenario, measured across sessions over days/weeks
A system that is consistent within a single evaluation run can still drift significantly across runs separated by days or model updates. The cross-run view is what matters for production reliability.
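One way to make the distinction concrete: within-run stability is the spread of scores across N repeats inside one session, while cross-run consistency is the spread of per-session means across sessions. A minimal Python sketch, assuming scores have already been extracted from logs (the numbers are invented to show the failure mode):

```python
import statistics

# Hypothetical per-session scores: each inner list is one session's
# N repeated runs of the same agent on the same scenario.
sessions = [
    [0.91, 0.90, 0.92],  # session 1: tight spread, good within-run stability
    [0.84, 0.85, 0.83],  # session 2: also tight within-run...
    [0.76, 0.77, 0.75],  # session 3: ...yet the sessions drift apart over weeks
]

# Within-run stability (current scope): spread inside each session.
within = [statistics.pstdev(s) for s in sessions]

# Cross-run consistency (proposed scope): spread of session means over time.
session_means = [statistics.mean(s) for s in sessions]
cross = statistics.pstdev(session_means)

print(f"within-run stdev per session: {[round(w, 3) for w in within]}")
print(f"cross-run stdev of session means: {cross:.3f}")
# Every session looks stable in isolation (stdev ~0.008), but the
# cross-run spread (~0.062) reveals exactly the drift described above.
```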
## Why This Matters
Practitioners building on top of this framework will run evaluations repeatedly over time (before/after model updates, after prompt changes, after infrastructure changes). Without a structured cross-run comparison mechanism, they will either:
- Manually eyeball past logs (error-prone), or
- Only look at the most recent run in isolation (missing drift entirely)
The baseline comparison template makes cross-run consistency a first-class evaluation artifact.
## Related Work
The PDR (Periodic Delivery Reliability) framework (DOI: 10.5281/zenodo.19339987) tracks this as `delivery_score` and `calibration_delta` across sessions — the same cross-session consistency measurement problem applied to autonomous agent workflows. The delta-from-baseline approach is validated in that context.
Happy to draft both the template and the `consistency.md` extension if this direction looks right.