Problem
We have a throughput benchmark (cmd/benchmark/) that measures ingestion speed and basic keyword precision against a live daemon. But we have no way to measure memory quality — whether signal survives, noise fades, and recall returns useful results across realistic scenarios.
The system audit (PR #22) found:
- 10/10 recall results were noise (Chrome leveldb, GNOME XML files)
- 10/10 patterns were duplicated junk ("Manifest and Log Recovery" x2, "XML File Creation" x2)
- 7/10 recent encodings had failed entirely
- Dreaming was blindly boosting 30 associations on junk data when the LLM was down
- Consolidation was wasting 147+ seconds on timeout loops against an unavailable LLM
We've fixed the pipeline (#22), but we need proof — for ourselves and for the public — that the system actually works.
Design Philosophy
The system will always ingest some noise (humans do too). The benchmark measures resilience to noise, not perfection at filtering. The key question: "Given realistic input (mix of signal + noise), does the system preserve signal and suppress noise over time?"
What It Measures
Standard IR Metrics (per query)
| Metric | What it answers | Formula |
|---|---|---|
| Precision@K | Are the top results relevant? | relevant_in_top_K / K |
| Recall@K | Did we find all relevant memories? | relevant_in_top_K / total_relevant |
| MRR (Mean Reciprocal Rank) | How high is the first relevant result? | 1 / rank_of_first_relevant |
| nDCG (Normalized Discounted Cumulative Gain) | Are results in the right order? | DCG / ideal_DCG |
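With ranked result IDs and a ground-truth relevance set, each of these reduces to a few lines of Go. A minimal sketch of what metrics.go could contain (binary relevance assumed; function names are illustrative, not the final API):

```go
package metrics

import "math"

// PrecisionAtK: fraction of the top K results that are relevant.
func PrecisionAtK(ranked []string, relevant map[string]bool, k int) float64 {
	if k > len(ranked) {
		k = len(ranked)
	}
	if k == 0 {
		return 0
	}
	hits := 0
	for _, id := range ranked[:k] {
		if relevant[id] {
			hits++
		}
	}
	return float64(hits) / float64(k)
}

// RecallAtK: fraction of all relevant memories found in the top K.
func RecallAtK(ranked []string, relevant map[string]bool, k int) float64 {
	if len(relevant) == 0 {
		return 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	hits := 0
	for _, id := range ranked[:k] {
		if relevant[id] {
			hits++
		}
	}
	return float64(hits) / float64(len(relevant))
}

// MRR: reciprocal rank of the first relevant result, 0 if none appear.
func MRR(ranked []string, relevant map[string]bool) float64 {
	for i, id := range ranked {
		if relevant[id] {
			return 1.0 / float64(i+1)
		}
	}
	return 0
}

// NDCG with binary relevance: DCG of the actual ranking divided by the
// DCG of an ideal ranking that places every relevant memory first.
func NDCG(ranked []string, relevant map[string]bool) float64 {
	dcg, idcg := 0.0, 0.0
	for i, id := range ranked {
		if relevant[id] {
			dcg += 1.0 / math.Log2(float64(i)+2)
		}
	}
	for i := 0; i < len(relevant); i++ {
		idcg += 1.0 / math.Log2(float64(i)+2)
	}
	if idcg == 0 {
		return 0
	}
	return dcg / idcg
}
```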
System Quality Metrics (across scenarios)
| Metric | What it answers | How |
|---|---|---|
| Noise Suppression | Does noise fade over time? | Fraction of noise memories archived/fading after N consolidation cycles |
| Signal Retention | Does signal survive? | Fraction of signal memories still active after the same cycles |
| Dedup Effectiveness | Do near-duplicates merge? | Fraction of duplicate memories merged (LLM-gated) |
Pass Thresholds
| Metric | Pass | Warn | Fail |
|---|---|---|---|
| Precision@5 (avg) | >= 0.70 | >= 0.50 | < 0.50 |
| MRR (avg) | >= 0.60 | >= 0.40 | < 0.40 |
| Noise Suppression | >= 0.60 | >= 0.40 | < 0.40 |
| Signal Retention | >= 0.80 | >= 0.60 | < 0.60 |
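Grading a metric against these bands is a single comparison; a sketch of how scoring.go might map a value to a verdict (type and function names are illustrative):

```go
// Verdict is the grade a metric receives against its thresholds.
type Verdict string

const (
	Pass Verdict = "PASS"
	Warn Verdict = "WARN"
	Fail Verdict = "FAIL"
)

// Grade compares a metric value against its pass/warn thresholds,
// e.g. Grade(avgPrecision5, 0.70, 0.50) for average Precision@5.
func Grade(value, passAt, warnAt float64) Verdict {
	switch {
	case value >= passAt:
		return Pass
	case value >= warnAt:
		return Warn
	default:
		return Fail
	}
}
```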
Test Scenarios
Each scenario simulates a realistic developer session with labeled ground truth: every memory is tagged as signal, noise, or duplicate. Scoring is fully automated — just set-membership checks.
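That ground truth could be expressed as plain data; a hypothetical shape for the scenario definitions (type and field names are illustrative, not the final scenarios.go API):

```go
// Label marks each test memory's ground-truth role.
type Label string

const (
	Signal    Label = "signal"
	Noise     Label = "noise"
	Duplicate Label = "duplicate"
)

// LabeledMemory pairs a memory's content with its ground-truth label.
type LabeledMemory struct {
	ID      string
	Content string
	Label   Label
}

// Query pairs a recall query with the set of memory IDs that count as
// relevant, so scoring is a set-membership check per result.
type Query struct {
	Text     string
	Relevant map[string]bool
}

// Scenario bundles one simulated developer session.
type Scenario struct {
	Name     string
	Memories []LabeledMemory
	Queries  []Query
}
```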
Scenario 1: "Debugging Session"
Signal (8 memories):
- Stack trace analysis, root cause identification, fix attempts, the working solution, a regression note
- Topics: nil pointer bugs, auth crashes, regressions
Noise (12 memories):
- Chrome tab opens, file manager browsing, clipboard URLs, node_modules lock changes, .DS_Store
Queries:
- "What was the nil pointer bug?"
- "How did we fix the auth crash?"
- "What regressions have we seen?"
What this tests: Can recall reconstruct the debugging narrative? Does noise stay out of results?
Scenario 2: "Architecture Decision"
Signal (8 memories):
- "Chose SQLite over Postgres because no server needed", "Considered event sourcing vs CRUD", "Decided on 8 agents for separation of concerns", tradeoff discussions, config rationale
Noise (12 memories):
- GNOME dconf writes, LM Studio model downloads, Trash operations, .DS_Store changes
Duplicates (4 memories):
- Rephrased versions of 4 decisions (e.g., "We went with SQLite since Postgres requires a server" duplicating the SQLite decision)
Queries:
- "Why did we choose SQLite?"
- "What architecture decisions have we made?"
- "What were the tradeoffs?"
What this tests: Do decisions surface? Do duplicates merge instead of creating duplicate patterns? Does desktop noise stay buried?
Scenario 3: "Learning & Insights"
Signal (8 memories):
- "Go's sql.NullString needed for nullable columns", "FTS5 rank returns negative BM25 scores", "Spread activation works best with 3 hops max", API quirks, framework lessons
Noise (12 memories):
- Clipboard pastes of URLs, terminal ls/cd/clear commands, PipeWire audio config changes
Queries:
- "What did we learn about FTS5?"
- "Go gotchas we've hit"
- "What patterns work well?"
What this tests: Do specific learnings surface for specific queries? Are vague clipboard pastes and terminal noise excluded?
Architecture
New binary at cmd/benchmark-quality/. Runs without a daemon — instantiates store + agents directly, controls timing.
cmd/benchmark-quality/
- main.go — CLI flags, component setup, phase orchestration
- scenarios.go — Scenario definitions (memories + queries + ground truth)
- testdata.go — Memory builders, synthetic embeddings, noise generators
- metrics.go — IR metric computation (precision, recall, MRR, nDCG)
- scoring.go — System quality metrics + pass/fail thresholds
- report.go — Terminal + markdown report output
Why direct component access (not HTTP/daemon)?
- The existing benchmark requires a running daemon, a 5-minute encoding wait, and cannot force consolidation. That makes it useless for measuring decay over simulated time.
- By directly instantiating SQLiteStore, ConsolidationAgent, and RetrievalAgent, the benchmark controls the timeline. It can write pre-encoded memories, run consolidation synchronously via RunOnce(), and query directly.
- This mirrors the test pattern already used in internal/store/sqlite/sqlite_test.go — real SQLite in t.TempDir().
Execution Phases
Phase 1: Setup
- Parse flags: --llm, --verbose, --cycles N (default 5), --report markdown
- Create temp SQLite DB via sqlite.NewSQLiteStore(tempdir)
- Create event bus via events.NewBus()
- Instantiate retrieval agent (with or without LLM provider)
- Instantiate consolidation agent with standard config
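Wiring this up is a handful of constructor calls; a sketch assuming the constructors listed under Key Dependencies below (import paths and any arguments beyond store/bus/provider are assumptions):

```go
package main

import (
	"log"
	"os"

	// Placeholder import paths; substitute the project's actual module path.
	"mnemonic/internal/agent/consolidation"
	"mnemonic/internal/agent/retrieval"
	"mnemonic/internal/events"
	"mnemonic/internal/llm"
	"mnemonic/internal/store/sqlite"
)

func setup(useLLM bool) {
	dir, err := os.MkdirTemp("", "benchmark-quality-*")
	if err != nil {
		log.Fatal(err)
	}
	defer os.RemoveAll(dir)

	st, err := sqlite.NewSQLiteStore(dir) // real SQLite DB in a temp dir
	if err != nil {
		log.Fatal(err)
	}
	bus := events.NewBus()

	var provider llm.Provider // stays nil unless --llm is set
	// Assumed argument lists; adjust to the real constructor signatures.
	ret := retrieval.NewRetrievalAgent(st, bus, provider)
	cons := consolidation.NewConsolidationAgent(st, bus, provider)
	_, _ = ret, cons
}
```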
Phase 2: Per-Scenario Loop
For each of the 3 scenarios:
2a. Ingest — Write all memories directly via store.WriteMemory(). Each memory pre-built with:
- Content, Summary, Concepts — real text for FTS indexing
- Embedding — synthetic vectors (signal in one cluster, noise in another) or real via LLM if --llm
- Salience — signal: 0.5-0.8, noise: 0.3-0.4
- State = "active"
- Associations between related signal memories via store.CreateAssociation()
FTS5 index and embedding index auto-populate on WriteMemory() — no extra work needed (confirmed via code review: schema triggers populate memories_fts, and embIndex.Add() is called in WriteMemory() for active/fading memories).
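Building one pre-encoded memory might look like this (the Memory struct's exact name, location, and field set are assumptions based on the fields listed above; signalVector is the synthetic-embedding helper described later):

```go
// Sketch: field names follow 2a; the type location is hypothetical.
mem := &sqlite.Memory{
	ID:        "debug-004",
	Content:   "Root cause: auth session token freed before the callback ran, causing the nil pointer dereference.",
	Summary:   "Nil pointer in auth callback traced to premature token free",
	Concepts:  []string{"nil pointer", "auth", "debugging"},
	Embedding: signalVector(0), // debugging cluster
	Salience:  0.7,             // signal range 0.5-0.8
	State:     "active",
}
if err := st.WriteMemory(ctx, mem); err != nil {
	return err
}
// No manual indexing: WriteMemory() populates memories_fts via schema
// triggers and adds active/fading memories to the embedding index.
```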
2b. Baseline query — Run all scenario queries via RetrievalAgent.Query(). Record:
- Which memories returned (IDs), their scores, their ground-truth labels (signal/noise/duplicate)
- Compute Precision@5, Recall@5, MRR, nDCG
2c. Access simulation — Re-query signal topics to bump access counts and LastAccessed timestamps. This simulates real usage, where signal memories get accessed and noise doesn't, creating the asymmetry that consolidation's decay should exploit.
2d. Consolidation — Run N cycles via ConsolidationAgent.RunOnce(). Between cycles, fast-forward time by adjusting LastAccessed timestamps (subtract hours) to simulate days passing without actually waiting. Consolidation runs:
- Decay (always) — uses BatchUpdateSalience()
- State transitions (always) — fading/archived based on new salience
- Association pruning (always)
- Merge clusters (LLM-gated) — skipped if --llm not set
- Pattern extraction (LLM-gated) — skipped if --llm not set
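The cycle loop could fast-forward time by rewinding timestamps between runs; a sketch where the direct SQL rewind (including the table and column names) is an assumption about the schema:

```go
// Run N consolidation cycles with ~12 simulated hours between each one.
for i := 0; i < cycles; i++ {
	// Rewind LastAccessed so decay sees elapsed time without the
	// benchmark actually waiting. Table/column names are assumed.
	if _, err := db.ExecContext(ctx,
		`UPDATE memories SET last_accessed = datetime(last_accessed, '-12 hours')`,
	); err != nil {
		return err
	}
	// Decay, state transitions, and pruning always run; merge and
	// pattern extraction only when an LLM provider is configured.
	if err := cons.RunOnce(ctx); err != nil {
		return err
	}
}
```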
2e. Post-consolidation query — Re-run the same queries and recompute the IR metrics. Precision should increase because noise has decayed below retrieval thresholds.
2f. System quality scoring — Query store for memory states:
- Count noise memories with state "fading" or "archived" → noise suppression score
- Count signal memories still "active" → signal retention score
- Count duplicate memories with state "merged" → dedup effectiveness (LLM-only)
Phase 3: Aggregate + Report
Aggregate all per-scenario metrics, compute overall scores, print terminal scorecard. With --report markdown, write benchmark-results.md.
Report Output
Terminal (default):
```
╔══════════════════════════════════════════════════════╗
║ Mnemonic Memory Quality Benchmark ║
╠══════════════════════════════════════════════════════╣
║ Version: 0.6.0 LLM: available Cycles: 5 ║
╠══════════════════════════════════════════════════════╣
║ ║
║ SCENARIO 1: Debugging Session ║
║ ┌────────────────────────────────────────────┐ ║
║ │ Precision@5 0.87 ████████▋ PASS │ ║
║ │ MRR 0.92 █████████▏ PASS │ ║
║ │ nDCG 0.84 ████████▍ PASS │ ║
║ │ Noise Suppr. 0.75 ███████▌ PASS │ ║
║ │ Signal Ret. 1.00 ██████████ PASS │ ║
║ └────────────────────────────────────────────┘ ║
║ ║
║ SCENARIO 2: Architecture Decision ║
║ ┌────────────────────────────────────────────┐ ║
║ │ Precision@5 0.80 ████████ PASS │ ║
║ │ MRR 0.78 ███████▊ PASS │ ║
║ │ nDCG 0.76 ███████▋ PASS │ ║
║ │ Noise Suppr. 0.83 ████████▎ PASS │ ║
║ │ Signal Ret. 0.88 ████████▊ PASS │ ║
║ │ Dedup Effect. 0.75 ███████▌ PASS [LLM] │ ║
║ └────────────────────────────────────────────┘ ║
║ ║
║ SCENARIO 3: Learning & Insights ║
║ ┌────────────────────────────────────────────┐ ║
║ │ Precision@5 0.73 ███████▎ PASS │ ║
║ │ MRR 0.67 ██████▋ PASS │ ║
║ │ nDCG 0.71 ███████ PASS │ ║
║ │ Noise Suppr. 0.67 ██████▋ PASS │ ║
║ │ Signal Ret. 0.88 ████████▊ PASS │ ║
║ └────────────────────────────────────────────┘ ║
║ ║
╠══════════════════════════════════════════════════════╣
║ AGGREGATE ║
║ Precision@5 0.80 MRR 0.79 nDCG 0.77 ║
║ Noise Suppression 0.75 Signal Retention 0.92 ║
║ ║
║ Overall: PASS ║
╚══════════════════════════════════════════════════════╝
```
With --report markdown, writes a benchmark-results.md file suitable for linking in the project README.
Key Dependencies (no modifications to existing code)
All existing APIs needed have been verified via code review:
| Dependency | File | What it does |
|---|---|---|
| sqlite.NewSQLiteStore() | internal/store/sqlite/sqlite.go | Creates real DB in temp dir |
| store.WriteMemory() | internal/store/sqlite/sqlite.go:562 | Writes memory + auto-populates FTS & embedding index |
| store.CreateAssociation() | internal/store/sqlite/sqlite.go:951 | Links related signal memories |
| store.BatchUpdateSalience() | internal/store/sqlite/sqlite.go:1111 | For time fast-forwarding between cycles |
| retrieval.NewRetrievalAgent() | internal/agent/retrieval/agent.go:88 | Creates retrieval agent |
| retrieval.Query() | internal/agent/retrieval/agent.go:102 | Returns []RetrievalResult with .Score |
| consolidation.NewConsolidationAgent() | internal/agent/consolidation/agent.go:64 | Creates consolidation agent |
| consolidation.RunOnce() | internal/agent/consolidation/agent.go:115 | Runs one full consolidation cycle synchronously |
| events.NewBus() | internal/events/inmemory.go | In-memory event bus required by agents |
| llm.Provider | internal/llm/provider.go:92 | Optional — for real embeddings, merge, dedup |
Retrieval scoring pipeline (context for metric design)
The retrieval agent scores results through 5 stages:
- FTS entry points: BM25-ranked full-text matches, scored as 0.3 + 0.4 * salience
- Embedding entry points: Cosine similarity against query embedding
- Entry point merging: alpha * emb + (1-alpha) * fts + dual_hit_bonus (alpha=0.6, bonus=0.15)
- Spread activation: Traverses association graph with exponential decay per hop (max 3 hops)
- Final ranking: activation * (1 + recency_bonus + activity_bonus) * significance_boost
This means the benchmark's synthetic embeddings must place signal and noise in different regions of embedding space for the merge step to work correctly.
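For stage 3 concretely, the merge is a weighted sum with the constants given above; a sketch of the formula as stated, not the agent's actual code:

```go
// mergeEntryScores combines FTS and embedding scores for one candidate.
// alpha = 0.6 weights embeddings over FTS; dual-hit bonus = 0.15 rewards
// memories found by both entry-point paths.
func mergeEntryScores(embScore, ftsScore float64, hitBoth bool) float64 {
	const alpha, dualHitBonus = 0.6, 0.15
	score := alpha*embScore + (1-alpha)*ftsScore
	if hitBoth {
		score += dualHitBonus
	}
	return score
}
```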
Synthetic Embeddings (LLM-free mode)
For running without an LLM, memories get deterministic vectors:
- Signal memories: Vectors clustered near specific topic dimensions (e.g., debugging signal near [1,0,0,...], architecture signal near [0,1,0,...])
- Noise memories: Vectors clustered in a separate region (e.g., near [0,0,1,...])
- Duplicate memories: Vectors near their originals with small jitter
This lets embedding search, cosine similarity, and the merge step all work correctly without a real model. The tradeoff: synthetic embeddings don't test the LLM's actual encoding quality, only the pipeline's ability to distinguish pre-separated clusters.
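A sketch of such a generator (the vector dimension, jitter magnitudes, and float32 element type are illustrative choices, not settled decisions):

```go
package testdata

import "math/rand"

const dim = 64 // illustrative; match whatever the embedding index expects

// clusterVector returns a vector near the basis axis for `topic`, with
// small seeded jitter so cluster members differ deterministically.
func clusterVector(topic int, seed int64) []float32 {
	rng := rand.New(rand.NewSource(seed))
	v := make([]float32, dim)
	v[topic] = 1.0
	for i := range v {
		v[i] += float32(rng.Float64()-0.5) * 0.05
	}
	return v
}

// jitter returns a slightly perturbed copy of v, used for duplicates so
// they sit near their originals in embedding space.
func jitter(v []float32, seed int64) []float32 {
	rng := rand.New(rand.NewSource(seed))
	out := make([]float32, len(v))
	for i := range v {
		out[i] = v[i] + float32(rng.Float64()-0.5)*0.02
	}
	return out
}
```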
With --llm, real embeddings are generated via llm.Provider.Embed(), testing the full end-to-end chain including LLM encoding quality.
Makefile Addition
```make
benchmark-quality: build
	CGO_ENABLED=1 go build $(TAGS) -o $(BUILD_DIR)/benchmark-quality ./cmd/benchmark-quality
```
Usage
```sh
make benchmark-quality
./bin/benchmark-quality                          # fast mode, synthetic embeddings (~30s)
./bin/benchmark-quality --llm                    # full mode with real LLM (~2-5min)
./bin/benchmark-quality --llm --report markdown  # publishable report
./bin/benchmark-quality --verbose --cycles 10    # detailed output, more consolidation
```
Exit code: 0 = PASS, 1 = FAIL.
Related