
Memory quality benchmark: scenario-based IR metrics with publishable report #24

@CalebisGross

Description


Problem

We have a throughput benchmark (cmd/benchmark/) that measures ingestion speed and basic keyword precision against a live daemon. But we have no way to measure memory quality — whether signal survives, noise fades, and recall returns useful results across realistic scenarios.

The system audit (PR #22) found:

  • 10/10 recall results were noise (Chrome leveldb, GNOME XML files)
  • 10/10 patterns were duplicated junk ("Manifest and Log Recovery" x2, "XML File Creation" x2)
  • 7/10 recent encodings had failed entirely
  • Dreaming was blindly boosting 30 associations on junk data while the LLM was down
  • Consolidation was wasting 147+ seconds on timeout loops against an unavailable LLM

We've fixed the pipeline (#22), but we need proof — for ourselves and for the public — that the system actually works.

Design Philosophy

The system will always ingest some noise (humans do too). The benchmark measures resilience to noise, not perfection at filtering. The key question: "Given realistic input (mix of signal + noise), does the system preserve signal and suppress noise over time?"


What It Measures

Standard IR Metrics (per query)

| Metric | What it answers | Formula |
|--------|-----------------|---------|
| Precision@K | Are the top results relevant? | `relevant_in_top_K / K` |
| Recall@K | Did we find all relevant memories? | `relevant_in_top_K / total_relevant` |
| MRR (Mean Reciprocal Rank) | How high is the first relevant result? | `1 / rank_of_first_relevant` |
| nDCG (Normalized Discounted Cumulative Gain) | Are results in the right order? | `DCG / ideal_DCG` |
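
Each of these is a few lines of Go. A minimal sketch of what metrics.go could contain, assuming results arrive as a ranked slice of memory IDs and ground truth is a set of relevant IDs (all names here are illustrative):

```go
import "math"

// PrecisionAtK = relevant_in_top_K / K.
func PrecisionAtK(ranked []string, relevant map[string]bool, k int) float64 {
	if k > len(ranked) {
		k = len(ranked)
	}
	if k == 0 {
		return 0
	}
	hits := 0
	for _, id := range ranked[:k] {
		if relevant[id] {
			hits++
		}
	}
	return float64(hits) / float64(k)
}

// ReciprocalRank = 1 / rank_of_first_relevant, or 0 if nothing relevant returned.
func ReciprocalRank(ranked []string, relevant map[string]bool) float64 {
	for i, id := range ranked {
		if relevant[id] {
			return 1 / float64(i+1)
		}
	}
	return 0
}

// NDCGAtK = DCG / ideal_DCG with binary gains: a relevant hit at
// 1-based position p contributes 1/log2(p+1).
func NDCGAtK(ranked []string, relevant map[string]bool, k int) float64 {
	if k > len(ranked) {
		k = len(ranked)
	}
	dcg := 0.0
	for i, id := range ranked[:k] {
		if relevant[id] {
			dcg += 1 / math.Log2(float64(i+2))
		}
	}
	ideal := 0.0
	for i := 0; i < k && i < len(relevant); i++ {
		ideal += 1 / math.Log2(float64(i+2))
	}
	if ideal == 0 {
		return 0
	}
	return dcg / ideal
}
```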

System Quality Metrics (across scenarios)

| Metric | What it answers | How |
|--------|-----------------|-----|
| Noise Suppression | Does noise fade over time? | Fraction of noise memories archived/fading after N consolidation cycles |
| Signal Retention | Does signal survive? | Fraction of signal memories still active after the same cycles |
| Dedup Effectiveness | Do near-duplicates merge? | Fraction of duplicate memories merged (LLM-gated) |

Pass Thresholds

| Metric | Pass | Warn | Fail |
|--------|------|------|------|
| Precision@5 (avg) | >= 0.70 | >= 0.50 | < 0.50 |
| MRR (avg) | >= 0.60 | >= 0.40 | < 0.40 |
| Noise Suppression | >= 0.60 | >= 0.40 | < 0.40 |
| Signal Retention | >= 0.80 | >= 0.60 | < 0.60 |
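
Pass/warn/fail then reduces to a two-threshold comparison per metric. A minimal sketch of how scoring.go might encode the table above (the function name is illustrative):

```go
// classify maps a metric value onto the pass/warn thresholds from the table above.
func classify(value, passAt, warnAt float64) string {
	switch {
	case value >= passAt:
		return "PASS"
	case value >= warnAt:
		return "WARN"
	default:
		return "FAIL"
	}
}

// Usage, with the Precision@5 row: classify(avgPrecisionAt5, 0.70, 0.50)
```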

Test Scenarios

Each scenario simulates a realistic developer session with labeled ground truth: every memory is tagged as signal, noise, or duplicate. Scoring is fully automated — just set-membership checks against those labels.
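
Because scoring is pure set membership, scenarios can be plain data. One plausible shape for the definitions in scenarios.go (the type and field names are illustrative, not taken from the repo):

```go
type Label string

const (
	Signal    Label = "signal"
	Noise     Label = "noise"
	Duplicate Label = "duplicate"
)

type LabeledMemory struct {
	Key      string // stable key, mapped to a store ID at ingestion time
	Content  string
	Summary  string
	Concepts []string
	Label    Label
	DupOf    string // for duplicates: the Key of the memory they rephrase
}

type ScenarioQuery struct {
	Text         string
	RelevantKeys []string // ground truth: which memories should come back
}

type Scenario struct {
	Name     string
	Memories []LabeledMemory
	Queries  []ScenarioQuery
}
```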

Scenario 1: "Debugging Session"

Signal (8 memories):

  • Stack trace analysis, root cause identification, fix attempts, the working solution, a regression note
  • Topics: nil pointer bugs, auth crashes, regressions

Noise (12 memories):

  • Chrome tab opens, file manager browsing, clipboard URLs, node_modules lock changes, .DS_Store

Queries:

  • "What was the nil pointer bug?"
  • "How did we fix the auth crash?"
  • "What regressions have we seen?"

What this tests: Can recall reconstruct the debugging narrative? Does noise stay out of results?

Scenario 2: "Architecture Decision"

Signal (8 memories):

  • "Chose SQLite over Postgres because no server needed", "Considered event sourcing vs CRUD", "Decided on 8 agents for separation of concerns", tradeoff discussions, config rationale

Noise (12 memories):

  • GNOME dconf writes, LM Studio model downloads, Trash operations, .DS_Store changes

Duplicates (4 memories):

  • Rephrased versions of 4 decisions (e.g., "We went with SQLite since Postgres requires a server" duplicating the SQLite decision)

Queries:

  • "Why did we choose SQLite?"
  • "What architecture decisions have we made?"
  • "What were the tradeoffs?"

What this tests: Do decisions surface? Do duplicates merge instead of creating duplicate patterns? Does desktop noise stay buried?

Scenario 3: "Learning & Insights"

Signal (8 memories):

  • "Go's sql.NullString needed for nullable columns", "FTS5 rank returns negative BM25 scores", "Spread activation works best with 3 hops max", API quirks, framework lessons

Noise (12 memories):

  • Clipboard pastes of URLs, terminal ls/cd/clear commands, PipeWire audio config changes

Queries:

  • "What did we learn about FTS5?"
  • "Go gotchas we've hit"
  • "What patterns work well?"

What this tests: Do specific learnings surface for specific queries? Are vague clipboard pastes and terminal noise excluded?


Architecture

New binary at cmd/benchmark-quality/. Runs without a daemon — instantiates store + agents directly, controls timing.

```
cmd/benchmark-quality/
  main.go         — CLI flags, component setup, phase orchestration
  scenarios.go    — Scenario definitions (memories + queries + ground truth)
  testdata.go     — Memory builders, synthetic embeddings, noise generators
  metrics.go      — IR metric computation (precision, recall, MRR, nDCG)
  scoring.go      — System quality metrics + pass/fail thresholds
  report.go       — Terminal + markdown report output
```

Why direct component access (not HTTP/daemon)?

  • The existing benchmark requires a running daemon and a 5-minute encoding wait, and it cannot force consolidation. That makes it useless for measuring decay over simulated time.
  • By directly instantiating SQLiteStore, ConsolidationAgent, and RetrievalAgent, the benchmark controls the timeline. It can write pre-encoded memories, run consolidation synchronously via RunOnce(), and query directly.
  • This mirrors the test pattern already used in internal/store/sqlite/sqlite_test.go — real SQLite in t.TempDir().

Execution Phases

Phase 1: Setup

  • Parse flags: --llm, --verbose, --cycles N (default 5), --report markdown
  • Create temp SQLite DB via sqlite.NewSQLiteStore(tempdir)
  • Create event bus via events.NewBus()
  • Instantiate retrieval agent (with or without LLM provider)
  • Instantiate consolidation agent with standard config
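
In code, those five steps might look roughly like this. The constructor signatures are assumptions based on the dependency table later in this issue, and newProviderFromFlags is a hypothetical helper:

```go
type benchEnv struct {
	store         *sqlite.SQLiteStore
	retrieval     *retrieval.RetrievalAgent
	consolidation *consolidation.ConsolidationAgent
}

func setup(useLLM bool, tempdir string) (*benchEnv, error) {
	st, err := sqlite.NewSQLiteStore(tempdir) // real SQLite in a throwaway dir
	if err != nil {
		return nil, err
	}
	bus := events.NewBus()

	var provider llm.Provider // stays nil without --llm; LLM-gated steps are skipped
	if useLLM {
		provider = newProviderFromFlags() // hypothetical helper
	}

	ret := retrieval.NewRetrievalAgent(st, bus, provider)
	cons := consolidation.NewConsolidationAgent(st, bus, provider) // standard config
	return &benchEnv{store: st, retrieval: ret, consolidation: cons}, nil
}
```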

Phase 2: Per-Scenario Loop

For each of the 3 scenarios:

2a. Ingest — Write all memories directly via store.WriteMemory(). Each memory pre-built with:

  • Content, Summary, Concepts — real text for FTS indexing
  • Embedding — synthetic vectors (signal in one cluster, noise in another) or real via LLM if --llm
  • Salience — signal: 0.5-0.8, noise: 0.3-0.4
  • State = "active"
  • Associations between related signal memories via store.CreateAssociation()

FTS5 index and embedding index auto-populate on WriteMemory() — no extra work needed (confirmed via code review: schema triggers populate memories_fts, and embIndex.Add() is called in WriteMemory() for active/fading memories).
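
Per memory, 2a could look like the sketch below. The store.Memory struct and its field names are assumptions; salienceFor and embeddingFor are thin wrappers over the salience ranges above and the synthetic-embedding generator sketched later in this issue:

```go
import "context"

func ingest(ctx context.Context, env *benchEnv, id string, m LabeledMemory) error {
	mem := &store.Memory{
		ID:        id,
		Content:   m.Content, // real text, so the FTS5 triggers have something to index
		Summary:   m.Summary,
		Concepts:  m.Concepts,
		State:     "active",
		Salience:  salienceFor(m.Label), // signal: 0.5-0.8, noise: 0.3-0.4
		Embedding: embeddingFor(m),      // synthetic, or llm.Provider.Embed() with --llm
	}
	// FTS5 and the embedding index populate inside WriteMemory(); no extra step.
	return env.store.WriteMemory(ctx, mem)
}
```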

2b. Baseline query — Run all scenario queries via RetrievalAgent.Query(). Record:

  • Which memories returned (IDs), their scores, their ground-truth labels (signal/noise/duplicate)
  • Compute Precision@5, Recall@5, MRR, nDCG

2c. Access simulation — Re-query signal topics to bump access counts and LastAccessed timestamps. This simulates real usage where signal memories get accessed but noise doesn't. This creates the asymmetry that consolidation's decay should exploit.

2d. Consolidation — Run N cycles via ConsolidationAgent.RunOnce(). Between cycles, fast-forward time by adjusting LastAccessed timestamps (subtracting hours) to simulate days passing without actually waiting (see the sketch after this list). Consolidation runs:

  • Decay (always) — uses BatchUpdateSalience()
  • State transitions (always) — fading/archived based on new salience
  • Association pruning (always)
  • Merge clusters (LLM-gated) — skipped if --llm not set
  • Pattern extraction (LLM-gated) — skipped if --llm not set
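
A sketch of the fast-forward, run directly against the SQLite handle since the benchmark owns the database; the memories table and last_accessed column names are assumptions about the schema:

```go
import (
	"database/sql"
	"fmt"
)

// fastForward ages every memory by shifting last_accessed into the past,
// so the next decay pass sees days elapsing without any wall-clock wait.
func fastForward(db *sql.DB, hours int) error {
	_, err := db.Exec(
		`UPDATE memories SET last_accessed = datetime(last_accessed, ?)`,
		fmt.Sprintf("-%d hours", hours),
	)
	return err
}
```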

2e. Post-consolidation query — Re-run the same queries and recompute the IR metrics. Precision should improve, because noise has decayed below retrieval thresholds.

2f. System quality scoring — Query the store for memory states (see the counting sketch after this list):

  • Count noise memories with state "fading" or "archived" → noise suppression score
  • Count signal memories still "active" → signal retention score
  • Count duplicate memories with state "merged" → dedup effectiveness (LLM-only)
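
Once each stored memory is joined back to its ground-truth label, the scores are simple ratios. A sketch, where labeledState is an illustrative joined struct and Label is the scenario type from earlier:

```go
type labeledState struct {
	Label Label  // ground truth from the scenario definition
	State string // read back from the store after consolidation
}

func qualityScores(mems []labeledState) (noiseSuppression, signalRetention float64) {
	var noise, noiseGone, signal, signalLive int
	for _, m := range mems {
		switch m.Label {
		case Noise:
			noise++
			if m.State == "fading" || m.State == "archived" {
				noiseGone++
			}
		case Signal:
			signal++
			if m.State == "active" {
				signalLive++
			}
		}
	}
	if noise > 0 {
		noiseSuppression = float64(noiseGone) / float64(noise)
	}
	if signal > 0 {
		signalRetention = float64(signalLive) / float64(signal)
	}
	return
}
```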

Phase 3: Aggregate + Report

Aggregate all per-scenario metrics, compute overall scores, print terminal scorecard. With --report markdown, write benchmark-results.md.


Report Output

Terminal (default):

```
╔══════════════════════════════════════════════════════╗
║         Mnemonic Memory Quality Benchmark            ║
╠══════════════════════════════════════════════════════╣
║  Version:  0.6.0   LLM: available   Cycles: 5        ║
╠══════════════════════════════════════════════════════╣
║                                                      ║
║  SCENARIO 1: Debugging Session                       ║
║  ┌────────────────────────────────────────────┐      ║
║  │ Precision@5   0.87  ████████▋   PASS       │      ║
║  │ MRR           0.92  █████████▏  PASS       │      ║
║  │ nDCG          0.84  ████████▍   PASS       │      ║
║  │ Noise Suppr.  0.75  ███████▌    PASS       │      ║
║  │ Signal Ret.   1.00  ██████████  PASS       │      ║
║  └────────────────────────────────────────────┘      ║
║                                                      ║
║  SCENARIO 2: Architecture Decision                   ║
║  ┌────────────────────────────────────────────┐      ║
║  │ Precision@5   0.80  ████████    PASS       │      ║
║  │ MRR           0.78  ███████▊    PASS       │      ║
║  │ nDCG          0.76  ███████▋    PASS       │      ║
║  │ Noise Suppr.  0.83  ████████▎   PASS       │      ║
║  │ Signal Ret.   0.88  ████████▊   PASS       │      ║
║  │ Dedup Effect. 0.75  ███████▌    PASS [LLM] │      ║
║  └────────────────────────────────────────────┘      ║
║                                                      ║
║  SCENARIO 3: Learning & Insights                     ║
║  ┌────────────────────────────────────────────┐      ║
║  │ Precision@5   0.73  ███████▎    PASS       │      ║
║  │ MRR           0.67  ██████▋     PASS       │      ║
║  │ nDCG          0.71  ███████     PASS       │      ║
║  │ Noise Suppr.  0.67  ██████▋     PASS       │      ║
║  │ Signal Ret.   0.88  ████████▊   PASS       │      ║
║  └────────────────────────────────────────────┘      ║
║                                                      ║
╠══════════════════════════════════════════════════════╣
║  AGGREGATE                                           ║
║  Precision@5  0.80    MRR  0.79    nDCG  0.77        ║
║  Noise Suppression  0.75    Signal Retention  0.92   ║
║                                                      ║
║  Overall: PASS                                       ║
╚══════════════════════════════════════════════════════╝
```

With --report markdown, writes a benchmark-results.md file suitable for linking in the project README.


Key Dependencies (no modifications to existing code)

All existing APIs needed have been verified via code review:

| Dependency | File | What it does |
|------------|------|--------------|
| `sqlite.NewSQLiteStore()` | `internal/store/sqlite/sqlite.go` | Creates real DB in temp dir |
| `store.WriteMemory()` | `internal/store/sqlite/sqlite.go:562` | Writes memory + auto-populates FTS & embedding index |
| `store.CreateAssociation()` | `internal/store/sqlite/sqlite.go:951` | Links related signal memories |
| `store.BatchUpdateSalience()` | `internal/store/sqlite/sqlite.go:1111` | For time fast-forwarding between cycles |
| `retrieval.NewRetrievalAgent()` | `internal/agent/retrieval/agent.go:88` | Creates retrieval agent |
| `retrieval.Query()` | `internal/agent/retrieval/agent.go:102` | Returns `[]RetrievalResult` with `.Score` |
| `consolidation.NewConsolidationAgent()` | `internal/agent/consolidation/agent.go:64` | Creates consolidation agent |
| `consolidation.RunOnce()` | `internal/agent/consolidation/agent.go:115` | Runs one full consolidation cycle synchronously |
| `events.NewBus()` | `internal/events/inmemory.go` | In-memory event bus required by agents |
| `llm.Provider` | `internal/llm/provider.go:92` | Optional — for real embeddings, merge, dedup |

Retrieval scoring pipeline (context for metric design)

The retrieval agent scores results through 5 stages:

  1. FTS entry points: BM25-ranked full-text matches, scored as 0.3 + 0.4 * salience
  2. Embedding entry points: Cosine similarity against query embedding
  3. Entry point merging: alpha * emb + (1-alpha) * fts + dual_hit_bonus (alpha=0.6, bonus=0.15)
  4. Spread activation: Traverses association graph with exponential decay per hop (max 3 hops)
  5. Final ranking: activation * (1 + recency_bonus + activity_bonus) * significance_boost

This means the benchmark's synthetic embeddings must place signal and noise in different regions of embedding space for the merge step to work correctly.
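
For reference, stage 3 restated as code, using the constants from the list above (the function name is illustrative):

```go
// mergeEntryScore combines the FTS and embedding entry-point scores for one memory.
func mergeEntryScore(embScore, ftsScore float64, hitBoth bool) float64 {
	const alpha, dualHitBonus = 0.6, 0.15
	score := alpha*embScore + (1-alpha)*ftsScore
	if hitBoth {
		score += dualHitBonus // memory surfaced via both FTS and embeddings
	}
	return score
}
```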


Synthetic Embeddings (LLM-free mode)

For running without an LLM, memories get deterministic vectors:

  • Signal memories: Vectors clustered near specific topic dimensions (e.g., debugging signal near [1,0,0,...], architecture signal near [0,1,0,...])
  • Noise memories: Vectors clustered in a separate region (e.g., near [0,0,1,...])
  • Duplicate memories: Vectors near their originals with small jitter

This lets embedding search, cosine similarity, and the merge step all work correctly without a real model. The tradeoff: synthetic embeddings don't test the LLM's actual encoding quality, only the pipeline's ability to distinguish pre-separated clusters.
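
A minimal sketch of such a generator; the dimension, jitter scale, and axis assignment are illustrative choices, not fixed by this design:

```go
import (
	"math"
	"math/rand"
)

const dim = 64

// syntheticEmbedding places a memory near one unit axis (one axis per signal
// topic, a separate axis for noise) plus small deterministic jitter.
// Duplicates pass the same axis as their original with a nearby seed.
func syntheticEmbedding(axis int, seed int64) []float32 {
	rng := rand.New(rand.NewSource(seed)) // deterministic per memory
	v := make([]float32, dim)
	v[axis] = 1.0
	for i := range v {
		v[i] += float32(0.05 * (rng.Float64()*2 - 1)) // ±0.05 jitter
	}
	return normalize(v)
}

// normalize scales v to unit length so cosine similarity behaves like a dot product.
func normalize(v []float32) []float32 {
	var sum float64
	for _, x := range v {
		sum += float64(x) * float64(x)
	}
	if n := float32(math.Sqrt(sum)); n > 0 {
		for i := range v {
			v[i] /= n
		}
	}
	return v
}
```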

With --llm, real embeddings are generated via llm.Provider.Embed(), testing the full end-to-end chain including LLM encoding quality.


Makefile Addition

```make
benchmark-quality: build
	CGO_ENABLED=1 go build $(TAGS) -o $(BUILD_DIR)/benchmark-quality ./cmd/benchmark-quality
```

Usage

```sh
make benchmark-quality
./bin/benchmark-quality                          # fast mode, synthetic embeddings (~30s)
./bin/benchmark-quality --llm                    # full mode with real LLM (~2-5min)
./bin/benchmark-quality --llm --report markdown  # publishable report
./bin/benchmark-quality --verbose --cycles 10    # detailed output, more consolidation
```

Exit code: 0 = PASS, 1 = FAIL.

