## Summary
The `consistency.md` dimension asks the right question — does the system behave similarly across repeated runs? — but the current `experiment-log.md` template captures only a single snapshot. There is no mechanism to track consistency trends across multiple experiment log entries over time.
## The Gap
`experiment-log.md` records one run:
- Date:
- Consistency result:
To answer whether the system is getting more or less consistent over time, you need to compare across runs. Right now that comparison is manual and informal — teams typically don't do it systematically.
## Proposal
Two additions:
1. `templates/baseline-comparison.md`
A structured template for comparing current experiment logs against a known-good baseline:
```markdown
# Baseline Comparison Log

## Baseline (v___)
- Date established:
- System version:
- Consistency result (baseline):
- Correctness result (baseline):

## Comparison Runs
| Date | System version | Consistency Δ | Correctness Δ | Notes |
|------|---------------|--------------|--------------|-------|
|      |               |              |              |       |

## Trend Assessment
- Direction (improving / stable / degrading):
- Sessions since last significant change:
- Action threshold: flag when any dimension Δ > 0.10 vs baseline
```
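To make the action threshold concrete, here is a minimal sketch in Python of the delta check a team might run over parsed log entries. Only the dimension names and the 0.10 threshold come from the template; the data structures and scores are invented for illustration, not a real API.

```python
# Delta-from-baseline check implied by the Trend Assessment section.
THRESHOLD = 0.10  # action threshold from the template above

# Hypothetical baseline scores, as recorded in the Baseline section.
baseline = {"consistency": 0.92, "correctness": 0.88}

# Hypothetical comparison runs, one dict per experiment-log entry.
runs = [
    {"date": "2024-05-01", "consistency": 0.90, "correctness": 0.87},
    {"date": "2024-05-15", "consistency": 0.79, "correctness": 0.86},
]

for run in runs:
    for dim in ("consistency", "correctness"):
        delta = run[dim] - baseline[dim]
        if abs(delta) > THRESHOLD:
            print(f"{run['date']}: {dim} Δ = {delta:+.2f} vs baseline, flag for review")
```

Running this flags the 2024-05-15 consistency drop (Δ = -0.13), which is exactly the kind of drift a single-snapshot log would miss.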
2. Extension to `docs/consistency.md`
Add a section on longitudinal consistency — the difference between:
- Within-run stability (current scope): same agent, same scenario, repeated N times in one session
- Cross-run consistency (proposed): same agent, same scenario, measured across sessions over days/weeks
A system that is consistent within a single evaluation run can still drift significantly across runs separated by days or model updates. The cross-run view is what matters for production reliability.
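One way to make the distinction concrete: within-run stability is the spread of scores across N repeats inside one session, while cross-run consistency is the spread of per-session means across sessions. A minimal Python sketch, assuming scores have already been extracted from logs (the numbers are invented to show the failure mode):

```python
import statistics

# Hypothetical per-session scores: each inner list is one session's
# N repeated runs of the same agent on the same scenario.
sessions = [
    [0.91, 0.90, 0.92],  # session 1: tight spread, good within-run stability
    [0.84, 0.85, 0.83],  # session 2: also tight within-run...
    [0.76, 0.77, 0.75],  # session 3: ...yet the sessions drift apart over weeks
]

# Within-run stability (current scope): spread inside each session.
within = [statistics.pstdev(s) for s in sessions]

# Cross-run consistency (proposed scope): spread of session means over time.
session_means = [statistics.mean(s) for s in sessions]
cross = statistics.pstdev(session_means)

print(f"within-run stdev per session: {[round(w, 3) for w in within]}")
print(f"cross-run stdev of session means: {cross:.3f}")
# Every session looks stable in isolation (stdev ~0.008), but the
# cross-run spread (~0.062) reveals exactly the drift described above.
```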
## Why This Matters
Practitioners building on top of this framework will run evaluations repeatedly over time (before/after model updates, after prompt changes, after infrastructure changes). Without a structured cross-run comparison mechanism, they will either:
- Manually eyeball past logs (error-prone), or
- Only look at the most recent run in isolation (missing drift entirely)
The baseline comparison template makes cross-run consistency a first-class evaluation artifact.
## Related Work
The PDR (Periodic Delivery Reliability) framework (DOI: 10.5281/zenodo.19339987) tracks this as `delivery_score` and `calibration_delta` across sessions — the same cross-session consistency measurement problem applied to autonomous agent workflows. The delta-from-baseline approach is validated in that context.
Happy to draft both the template and the `consistency.md` extension if this direction looks right.