history compare: support cross-suite/cross-model comparison #101

## Context

`protest history compare` (v0.2 sub-command form) currently compares the **two most recent runs of the same model**, which is useful for spotting regressions run-over-run. The naive-agent test (#agent-v2 journal) surfaced a real use case it doesn't cover:

> Comparing the two suites *inside the same run*, e.g. `helpdesk_v1` vs `helpdesk_v2` produced by the same `protest eval` invocation.

This is exactly the workflow the docs encourage when comparing model variants (`Multi-Model Sessions`). With the current `compare`, the only workaround is contrived: run v1 alone, then v2 alone, then compare across the two runs. Even then the comparison can be misleading, because run timestamps and case order differ between the runs.

## Proposed surface

Add cross-suite / cross-model comparison via flags on the existing `compare` sub-command:

```bash
# Compare two suites within the most recent run
protest history compare --suites helpdesk_v1 helpdesk_v2

# Compare two models within the most recent run
protest history compare --models rules-v1 rules-v2
```

The two flags are mutually exclusive with each other; passing neither keeps the default run-vs-run behavior. When a flag is set, the comparison happens within the most recent run that contains both selected suites or models.
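A minimal sketch of how this surface could be wired up with `argparse`, assuming the CLI is built on it. The function names `build_compare_parser` and `pick_run`, and the shape of a run record (a dict with `"suites"` and `"models"` collections, in a list ordered oldest to newest), are hypothetical and not taken from the codebase.

```python
import argparse


def build_compare_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for `protest history compare`."""
    parser = argparse.ArgumentParser(prog="protest history compare")
    # At most one of the two flags may be given. Giving neither keeps
    # the existing run-vs-run behavior.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--suites", nargs=2, metavar="NAME")
    group.add_argument("--models", nargs=2, metavar="NAME")
    return parser


def pick_run(runs, field, names):
    """Return the most recent run that contains both names, else None.

    Assumes `runs` is ordered oldest to newest and that each run is a
    dict whose `field` entry ("suites" or "models") lists the names
    present in that run. Both are assumptions about the history store.
    """
    wanted = set(names)
    for run in reversed(runs):
        if wanted <= set(run[field]):
            return run
    return None
```

With this shape, passing both flags at once makes `argparse` exit with a usage error, and `pick_run` returning `None` is the natural place to report that no run contains both selected suites or models.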

## Decisions to make at implementation time

1. **Run scope**: should `--suites a b` always pick the most recent run that has both, or accept an explicit run via `--run N`?
2. **Symmetry of output**: same `Fixed/Regressed/Modified/New/Deleted` table as run-over-run, with the two suites/models as the two sides?
3. **Models with multiple suites**: if `--models v1 v2` matches multiple suite pairs in the same run, do we compare each pair separately or aggregate?
4. **Naming**: `--suites` plural feels right, but a multi-value `argparse` option (via `nargs`) takes space-separated values, which can be brittle. Alternatives: `--suite-a/--suite-b`, or a repeatable `--suite NAME`.
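For decision 4, the repeatable `--suite NAME` alternative might look like the sketch below. It assumes `argparse`; the helper name `parse_suite_pair` and the error wording are illustrative only.

```python
import argparse


def parse_suite_pair(argv):
    """Parse a repeatable --suite flag and require zero or two values."""
    parser = argparse.ArgumentParser(prog="protest history compare")
    # action="append" collects every occurrence of --suite into a list,
    # so each value is tied to its own flag instead of relying on
    # space-separated values after a single flag.
    parser.add_argument("--suite", action="append", metavar="NAME")
    args = parser.parse_args(argv)
    suites = args.suite or []
    if suites and len(suites) != 2:
        parser.error("--suite must be given exactly twice")
    return suites
```

The trade-off is that the "exactly two" rule moves out of `argparse` (`nargs=2`) and into an explicit check, in exchange for an invocation like `--suite helpdesk_v1 --suite helpdesk_v2` that cannot swallow a following positional argument.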

## Acceptance criteria

- `protest history compare --suites a b` works on a run that contains both suites
- `protest history compare --models v1 v2` works on a run with both models
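- `--suites` and `--models` are mutually exclusive with each other; run-over-run mode remains the default when neither flag is passed
- Per-case `+/-/⟳` markers honor the same case-hash vs eval-hash distinction as run-over-run mode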
- New CLI tests covering the success and error cases
- `cli.md` documents the new flags
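If decision 2 lands on reusing the `Fixed/Regressed/Modified/New/Deleted` table, the per-case classification could be sketched as follows, with the two suites (or models) as the left and right sides. Everything about the data shape here is an assumption: cases are matched by name, each result carries a `passed` flag and a `case_hash`, and a differing `case_hash` is treated as `Modified`. The eval-hash side of the distinction is deliberately left out because its semantics are not described in this issue.

```python
def classify_cases(left, right):
    """Bucket cases by comparing two sides of one run.

    `left` and `right` map a case name to a dict with "passed" (bool)
    and "case_hash" (str). This shape is hypothetical.
    """
    buckets = {
        "Fixed": [], "Regressed": [], "Modified": [], "New": [], "Deleted": [],
    }
    for name in sorted(left.keys() | right.keys()):
        a, b = left.get(name), right.get(name)
        if a is None:
            buckets["New"].append(name)        # only on the right side
        elif b is None:
            buckets["Deleted"].append(name)    # only on the left side
        elif a["case_hash"] != b["case_hash"]:
            buckets["Modified"].append(name)   # case definition differs
        elif not a["passed"] and b["passed"]:
            buckets["Fixed"].append(name)      # fail on left, pass on right
        elif a["passed"] and not b["passed"]:
            buckets["Regressed"].append(name)  # pass on left, fail on right
        # Same hash and same outcome: unchanged, not listed.
    return buckets
```

A function of this shape is also a convenient unit for the new CLI tests, since it can be exercised without touching the history store.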
