
Memory quality benchmark: scenario-based IR metrics with publishable report #24

@CalebisGross

Description


Problem

We have a throughput benchmark (cmd/benchmark/) that measures ingestion speed and basic keyword precision against a live daemon. But we have no way to measure memory quality — whether signal survives, noise fades, and recall returns useful results across realistic scenarios.

The system audit (PR #22) found:

  • 10/10 recall results were noise (Chrome leveldb, GNOME XML files)
  • 10/10 patterns were duplicated junk ("Manifest and Log Recovery" x2, "XML File Creation" x2)
  • 7/10 recent encodings had failed entirely
  • Dreaming was blindly boosting 30 associations on junk data while the LLM was down
  • Consolidation was wasting 147+ seconds on timeout loops against an unavailable LLM

We've fixed the pipeline (#22), but we need proof — for ourselves and for the public — that the system actually works.

Design Philosophy

The system will always ingest some noise (humans do too). The benchmark measures resilience to noise, not perfection at filtering. The key question: "Given realistic input (mix of signal + noise), does the system preserve signal and suppress noise over time?"


What It Measures

Standard IR Metrics (per query)

| Metric | What it answers | Formula |
|--------|-----------------|---------|
| Precision@K | Are the top results relevant? | `relevant_in_top_K / K` |
| Recall@K | Did we find all relevant memories? | `relevant_in_top_K / total_relevant` |
| MRR (Mean Reciprocal Rank) | How high is the first relevant result? | `1 / rank_of_first_relevant` |
| nDCG (Normalized Discounted Cumulative Gain) | Are results in the right order? | `DCG / ideal_DCG` |
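
Each of these is a few lines of Go. A minimal sketch of what metrics.go could contain, assuming results arrive as a ranked slice of memory IDs and ground truth is a set of relevant IDs (all names here are illustrative):

```go
import "math"

// PrecisionAtK = relevant_in_top_K / K.
func PrecisionAtK(ranked []string, relevant map[string]bool, k int) float64 {
	if k > len(ranked) {
		k = len(ranked)
	}
	if k == 0 {
		return 0
	}
	hits := 0
	for _, id := range ranked[:k] {
		if relevant[id] {
			hits++
		}
	}
	return float64(hits) / float64(k)
}

// ReciprocalRank = 1 / rank_of_first_relevant, or 0 if nothing relevant returned.
func ReciprocalRank(ranked []string, relevant map[string]bool) float64 {
	for i, id := range ranked {
		if relevant[id] {
			return 1 / float64(i+1)
		}
	}
	return 0
}

// NDCGAtK = DCG / ideal_DCG with binary gains: a relevant hit at
// 1-based position p contributes 1/log2(p+1).
func NDCGAtK(ranked []string, relevant map[string]bool, k int) float64 {
	if k > len(ranked) {
		k = len(ranked)
	}
	dcg := 0.0
	for i, id := range ranked[:k] {
		if relevant[id] {
			dcg += 1 / math.Log2(float64(i+2))
		}
	}
	ideal := 0.0
	for i := 0; i < k && i < len(relevant); i++ {
		ideal += 1 / math.Log2(float64(i+2))
	}
	if ideal == 0 {
		return 0
	}
	return dcg / ideal
}
```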

System Quality Metrics (across scenarios)

| Metric | What it answers | How |
|--------|-----------------|-----|
| Noise Suppression | Does noise fade over time? | Fraction of noise memories archived/fading after N consolidation cycles |
| Signal Retention | Does signal survive? | Fraction of signal memories still active after the same cycles |
| Dedup Effectiveness | Do near-duplicates merge? | Fraction of duplicate memories merged (LLM-gated) |

Pass Thresholds

| Metric | Pass | Warn | Fail |
|--------|------|------|------|
| Precision@5 (avg) | >= 0.70 | >= 0.50 | < 0.50 |
| MRR (avg) | >= 0.60 | >= 0.40 | < 0.40 |
| Noise Suppression | >= 0.60 | >= 0.40 | < 0.40 |
| Signal Retention | >= 0.80 | >= 0.60 | < 0.60 |
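
Pass/warn/fail then reduces to a two-threshold comparison per metric. A minimal sketch of how scoring.go might encode the table above (the function name is illustrative):

```go
// classify maps a metric value onto the pass/warn thresholds from the table above.
func classify(value, passAt, warnAt float64) string {
	switch {
	case value >= passAt:
		return "PASS"
	case value >= warnAt:
		return "WARN"
	default:
		return "FAIL"
	}
}

// Usage, with the Precision@5 row: classify(avgPrecisionAt5, 0.70, 0.50)
```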

Test Scenarios

Each scenario simulates a realistic developer session with labeled ground truth: every memory is tagged as signal, noise, or duplicate. Scoring is fully automated — just set-membership checks against those labels.
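
Because scoring is pure set membership, scenarios can be plain data. One plausible shape for the definitions in scenarios.go (the type and field names are illustrative, not taken from the repo):

```go
type Label string

const (
	Signal    Label = "signal"
	Noise     Label = "noise"
	Duplicate Label = "duplicate"
)

type LabeledMemory struct {
	Key      string // stable key, mapped to a store ID at ingestion time
	Content  string
	Summary  string
	Concepts []string
	Label    Label
	DupOf    string // for duplicates: the Key of the memory they rephrase
}

type ScenarioQuery struct {
	Text         string
	RelevantKeys []string // ground truth: which memories should come back
}

type Scenario struct {
	Name     string
	Memories []LabeledMemory
	Queries  []ScenarioQuery
}
```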

Scenario 1: "Debugging Session"

Signal (8 memories):

  • Stack trace analysis, root cause identification, fix attempts, the working solution, a regression note
  • Topics: nil pointer bugs, auth crashes, regressions

Noise (12 memories):

  • Chrome tab opens, file manager browsing, clipboard URLs, node_modules lock changes, .DS_Store

Queries:

  • "What was the nil pointer bug?"
  • "How did we fix the auth crash?"
  • "What regressions have we seen?"

What this tests: Can recall reconstruct the debugging narrative? Does noise stay out of results?

Scenario 2: "Architecture Decision"

Signal (8 memories):

  • "Chose SQLite over Postgres because no server needed", "Considered event sourcing vs CRUD", "Decided on 8 agents for separation of concerns", tradeoff discussions, config rationale

Noise (12 memories):

  • GNOME dconf writes, LM Studio model downloads, Trash operations, .DS_Store changes

Duplicates (4 memories):

  • Rephrased versions of 4 decisions (e.g., "We went with SQLite since Postgres requires a server" duplicating the SQLite decision)

Queries:

  • "Why did we choose SQLite?"
  • "What architecture decisions have we made?"
  • "What were the tradeoffs?"

What this tests: Do decisions surface? Do duplicates merge instead of creating duplicate patterns? Does desktop noise stay buried?

Scenario 3: "Learning & Insights"

Signal (8 memories):

  • "Go's sql.NullString needed for nullable columns", "FTS5 rank returns negative BM25 scores", "Spread activation works best with 3 hops max", API quirks, framework lessons

Noise (12 memories):

  • Clipboard pastes of URLs, terminal ls/cd/clear commands, PipeWire audio config changes

Queries:

  • "What did we learn about FTS5?"
  • "Go gotchas we've hit"
  • "What patterns work well?"

What this tests: Do specific learnings surface for specific queries? Are vague clipboard pastes and terminal noise excluded?


Architecture

New binary at cmd/benchmark-quality/. Runs without a daemon — instantiates store + agents directly, controls timing.

```
cmd/benchmark-quality/
  main.go         — CLI flags, component setup, phase orchestration
  scenarios.go    — Scenario definitions (memories + queries + ground truth)
  testdata.go     — Memory builders, synthetic embeddings, noise generators
  metrics.go      — IR metric computation (precision, recall, MRR, nDCG)
  scoring.go      — System quality metrics + pass/fail thresholds
  report.go       — Terminal + markdown report output
```

Why direct component access (not HTTP/daemon)?

  • The existing benchmark requires a running daemon and a 5-minute encoding wait, and it cannot force consolidation. That makes it useless for measuring decay over simulated time.
  • By directly instantiating SQLiteStore, ConsolidationAgent, and RetrievalAgent, the benchmark controls the timeline. It can write pre-encoded memories, run consolidation synchronously via RunOnce(), and query directly.
  • This mirrors the test pattern already used in internal/store/sqlite/sqlite_test.go — real SQLite in t.TempDir().

Execution Phases

Phase 1: Setup

  • Parse flags: --llm, --verbose, --cycles N (default 5), --report markdown
  • Create temp SQLite DB via sqlite.NewSQLiteStore(tempdir)
  • Create event bus via events.NewBus()
  • Instantiate retrieval agent (with or without LLM provider)
  • Instantiate consolidation agent with standard config
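
In code, those five steps might look roughly like this. The constructor signatures are assumptions based on the dependency table later in this issue, and newProviderFromFlags is a hypothetical helper:

```go
type benchEnv struct {
	store         *sqlite.SQLiteStore
	retrieval     *retrieval.RetrievalAgent
	consolidation *consolidation.ConsolidationAgent
}

func setup(useLLM bool, tempdir string) (*benchEnv, error) {
	st, err := sqlite.NewSQLiteStore(tempdir) // real SQLite in a throwaway dir
	if err != nil {
		return nil, err
	}
	bus := events.NewBus()

	var provider llm.Provider // stays nil without --llm; LLM-gated steps are skipped
	if useLLM {
		provider = newProviderFromFlags() // hypothetical helper
	}

	ret := retrieval.NewRetrievalAgent(st, bus, provider)
	cons := consolidation.NewConsolidationAgent(st, bus, provider) // standard config
	return &benchEnv{store: st, retrieval: ret, consolidation: cons}, nil
}
```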

Phase 2: Per-Scenario Loop

For each of the 3 scenarios:

2a. Ingest — Write all memories directly via store.WriteMemory(). Each memory pre-built with:

  • Content, Summary, Concepts — real text for FTS indexing
  • Embedding — synthetic vectors (signal in one cluster, noise in another) or real via LLM if --llm
  • Salience — signal: 0.5-0.8, noise: 0.3-0.4
  • State = "active"
  • Associations between related signal memories via store.CreateAssociation()

FTS5 index and embedding index auto-populate on WriteMemory() — no extra work needed (confirmed via code review: schema triggers populate memories_fts, and embIndex.Add() is called in WriteMemory() for active/fading memories).
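
Per memory, 2a could look like the sketch below. The store.Memory struct and its field names are assumptions; salienceFor and embeddingFor are thin wrappers over the salience ranges above and the synthetic-embedding generator sketched later in this issue:

```go
import "context"

func ingest(ctx context.Context, env *benchEnv, id string, m LabeledMemory) error {
	mem := &store.Memory{
		ID:        id,
		Content:   m.Content, // real text, so the FTS5 triggers have something to index
		Summary:   m.Summary,
		Concepts:  m.Concepts,
		State:     "active",
		Salience:  salienceFor(m.Label), // signal: 0.5-0.8, noise: 0.3-0.4
		Embedding: embeddingFor(m),      // synthetic, or llm.Provider.Embed() with --llm
	}
	// FTS5 and the embedding index populate inside WriteMemory(); no extra step.
	return env.store.WriteMemory(ctx, mem)
}
```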

2b. Baseline query — Run all scenario queries via RetrievalAgent.Query(). Record:

  • Which memories returned (IDs), their scores, their ground-truth labels (signal/noise/duplicate)
  • Compute Precision@5, Recall@5, MRR, nDCG

2c. Access simulation — Re-query signal topics to bump access counts and LastAccessed timestamps. This simulates real usage where signal memories get accessed but noise doesn't. This creates the asymmetry that consolidation's decay should exploit.

2d. Consolidation — Run N cycles via ConsolidationAgent.RunOnce(). Between cycles, fast-forward time by adjusting LastAccessed timestamps (subtracting hours) to simulate days passing without actually waiting (see the sketch after this list). Consolidation runs:

  • Decay (always) — uses BatchUpdateSalience()
  • State transitions (always) — fading/archived based on new salience
  • Association pruning (always)
  • Merge clusters (LLM-gated) — skipped if --llm not set
  • Pattern extraction (LLM-gated) — skipped if --llm not set
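
A sketch of the fast-forward, run directly against the SQLite handle since the benchmark owns the database; the memories table and last_accessed column names are assumptions about the schema:

```go
import (
	"database/sql"
	"fmt"
)

// fastForward ages every memory by shifting last_accessed into the past,
// so the next decay pass sees days elapsing without any wall-clock wait.
func fastForward(db *sql.DB, hours int) error {
	_, err := db.Exec(
		`UPDATE memories SET last_accessed = datetime(last_accessed, ?)`,
		fmt.Sprintf("-%d hours", hours),
	)
	return err
}
```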

2e. Post-consolidation query — Re-run the same queries and recompute the IR metrics. Precision should improve, because noise has decayed below retrieval thresholds.

2f. System quality scoring — Query the store for memory states (see the counting sketch after this list):

  • Count noise memories with state "fading" or "archived" → noise suppression score
  • Count signal memories still "active" → signal retention score
  • Count duplicate memories with state "merged" → dedup effectiveness (LLM-only)
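
Once each stored memory is joined back to its ground-truth label, the scores are simple ratios. A sketch, where labeledState is an illustrative joined struct and Label is the scenario type from earlier:

```go
type labeledState struct {
	Label Label  // ground truth from the scenario definition
	State string // read back from the store after consolidation
}

func qualityScores(mems []labeledState) (noiseSuppression, signalRetention float64) {
	var noise, noiseGone, signal, signalLive int
	for _, m := range mems {
		switch m.Label {
		case Noise:
			noise++
			if m.State == "fading" || m.State == "archived" {
				noiseGone++
			}
		case Signal:
			signal++
			if m.State == "active" {
				signalLive++
			}
		}
	}
	if noise > 0 {
		noiseSuppression = float64(noiseGone) / float64(noise)
	}
	if signal > 0 {
		signalRetention = float64(signalLive) / float64(signal)
	}
	return
}
```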

Phase 3: Aggregate + Report

Aggregate all per-scenario metrics, compute overall scores, print terminal scorecard. With --report markdown, write benchmark-results.md.


Report Output

Terminal (default):

```
╔══════════════════════════════════════════════════════╗
║         Mnemonic Memory Quality Benchmark            ║
╠══════════════════════════════════════════════════════╣
║  Version:  0.6.0   LLM: available   Cycles: 5        ║
╠══════════════════════════════════════════════════════╣
║                                                      ║
║  SCENARIO 1: Debugging Session                       ║
║  ┌────────────────────────────────────────────┐      ║
║  │ Precision@5   0.87  ████████▋   PASS       │      ║
║  │ MRR           0.92  █████████▏  PASS       │      ║
║  │ nDCG          0.84  ████████▍   PASS       │      ║
║  │ Noise Suppr.  0.75  ███████▌    PASS       │      ║
║  │ Signal Ret.   1.00  ██████████  PASS       │      ║
║  └────────────────────────────────────────────┘      ║
║                                                      ║
║  SCENARIO 2: Architecture Decision                   ║
║  ┌────────────────────────────────────────────┐      ║
║  │ Precision@5   0.80  ████████    PASS       │      ║
║  │ MRR           0.78  ███████▊    PASS       │      ║
║  │ nDCG          0.76  ███████▋    PASS       │      ║
║  │ Noise Suppr.  0.83  ████████▎   PASS       │      ║
║  │ Signal Ret.   0.88  ████████▊   PASS       │      ║
║  │ Dedup Effect. 0.75  ███████▌    PASS [LLM] │      ║
║  └────────────────────────────────────────────┘      ║
║                                                      ║
║  SCENARIO 3: Learning & Insights                     ║
║  ┌────────────────────────────────────────────┐      ║
║  │ Precision@5   0.73  ███████▎    PASS       │      ║
║  │ MRR           0.67  ██████▋     PASS       │      ║
║  │ nDCG          0.71  ███████     PASS       │      ║
║  │ Noise Suppr.  0.67  ██████▋     PASS       │      ║
║  │ Signal Ret.   0.88  ████████▊   PASS       │      ║
║  └────────────────────────────────────────────┘      ║
║                                                      ║
╠══════════════════════════════════════════════════════╣
║  AGGREGATE                                           ║
║  Precision@5  0.80    MRR  0.79    nDCG  0.77        ║
║  Noise Suppression  0.75    Signal Retention  0.92   ║
║                                                      ║
║  Overall: PASS                                       ║
╚══════════════════════════════════════════════════════╝
```

With --report markdown, writes a benchmark-results.md file suitable for linking in the project README.


Key Dependencies (no modifications to existing code)

All existing APIs needed have been verified via code review:

| Dependency | File | What it does |
|------------|------|--------------|
| `sqlite.NewSQLiteStore()` | `internal/store/sqlite/sqlite.go` | Creates real DB in temp dir |
| `store.WriteMemory()` | `internal/store/sqlite/sqlite.go:562` | Writes memory + auto-populates FTS & embedding index |
| `store.CreateAssociation()` | `internal/store/sqlite/sqlite.go:951` | Links related signal memories |
| `store.BatchUpdateSalience()` | `internal/store/sqlite/sqlite.go:1111` | For time fast-forwarding between cycles |
| `retrieval.NewRetrievalAgent()` | `internal/agent/retrieval/agent.go:88` | Creates retrieval agent |
| `retrieval.Query()` | `internal/agent/retrieval/agent.go:102` | Returns `[]RetrievalResult` with `.Score` |
| `consolidation.NewConsolidationAgent()` | `internal/agent/consolidation/agent.go:64` | Creates consolidation agent |
| `consolidation.RunOnce()` | `internal/agent/consolidation/agent.go:115` | Runs one full consolidation cycle synchronously |
| `events.NewBus()` | `internal/events/inmemory.go` | In-memory event bus required by agents |
| `llm.Provider` | `internal/llm/provider.go:92` | Optional — for real embeddings, merge, dedup |

Retrieval scoring pipeline (context for metric design)

The retrieval agent scores results through 5 stages:

  1. FTS entry points: BM25-ranked full-text matches, scored as 0.3 + 0.4 * salience
  2. Embedding entry points: Cosine similarity against query embedding
  3. Entry point merging: alpha * emb + (1-alpha) * fts + dual_hit_bonus (alpha=0.6, bonus=0.15)
  4. Spread activation: Traverses association graph with exponential decay per hop (max 3 hops)
  5. Final ranking: activation * (1 + recency_bonus + activity_bonus) * significance_boost

This means the benchmark's synthetic embeddings must place signal and noise in different regions of embedding space for the merge step to work correctly.
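
For reference, stage 3 restated as code, using the constants from the list above (the function name is illustrative):

```go
// mergeEntryScore combines the FTS and embedding entry-point scores for one memory.
func mergeEntryScore(embScore, ftsScore float64, hitBoth bool) float64 {
	const alpha, dualHitBonus = 0.6, 0.15
	score := alpha*embScore + (1-alpha)*ftsScore
	if hitBoth {
		score += dualHitBonus // memory surfaced via both FTS and embeddings
	}
	return score
}
```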


Synthetic Embeddings (LLM-free mode)

For running without an LLM, memories get deterministic vectors:

  • Signal memories: Vectors clustered near specific topic dimensions (e.g., debugging signal near [1,0,0,...], architecture signal near [0,1,0,...])
  • Noise memories: Vectors clustered in a separate region (e.g., near [0,0,1,...])
  • Duplicate memories: Vectors near their originals with small jitter

This lets embedding search, cosine similarity, and the merge step all work correctly without a real model. The tradeoff: synthetic embeddings don't test the LLM's actual encoding quality, only the pipeline's ability to distinguish pre-separated clusters.
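
A minimal sketch of such a generator; the dimension, jitter scale, and axis assignment are illustrative choices, not fixed by this design:

```go
import (
	"math"
	"math/rand"
)

const dim = 64

// syntheticEmbedding places a memory near one unit axis (one axis per signal
// topic, a separate axis for noise) plus small deterministic jitter.
// Duplicates pass the same axis as their original with a nearby seed.
func syntheticEmbedding(axis int, seed int64) []float32 {
	rng := rand.New(rand.NewSource(seed)) // deterministic per memory
	v := make([]float32, dim)
	v[axis] = 1.0
	for i := range v {
		v[i] += float32(0.05 * (rng.Float64()*2 - 1)) // ±0.05 jitter
	}
	return normalize(v)
}

// normalize scales v to unit length so cosine similarity behaves like a dot product.
func normalize(v []float32) []float32 {
	var sum float64
	for _, x := range v {
		sum += float64(x) * float64(x)
	}
	if n := float32(math.Sqrt(sum)); n > 0 {
		for i := range v {
			v[i] /= n
		}
	}
	return v
}
```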

With --llm, real embeddings are generated via llm.Provider.Embed(), testing the full end-to-end chain including LLM encoding quality.


Makefile Addition

```make
benchmark-quality: build
	CGO_ENABLED=1 go build $(TAGS) -o $(BUILD_DIR)/benchmark-quality ./cmd/benchmark-quality
```

Usage

```sh
make benchmark-quality
./bin/benchmark-quality                          # fast mode, synthetic embeddings (~30s)
./bin/benchmark-quality --llm                    # full mode with real LLM (~2-5min)
./bin/benchmark-quality --llm --report markdown  # publishable report
./bin/benchmark-quality --verbose --cycles 10    # detailed output, more consolidation
```

Exit code: 0 = PASS, 1 = FAIL.

