
EXP-25: Faithfulness probe — can Qwen 2B spokes learn to encode diverse inputs? #381

@CalebisGross

Problem

Live quality testing (2026-04-07) of the Qwen 3.5 2B RQ4 spokes model revealed that while JSON schema compliance is 100%, content faithfulness is critically broken. The model produces structurally valid but semantically wrong encodings.

Failure Modes Observed

| Mode | Example Input | Expected | Got |
| --- | --- | --- | --- |
| Template echoing | Any input | Actual summary | "What happened and why it matters in under 100 characters." (instruction text) |
| Cross-contamination | PostgreSQL MVCC explanation | PostgreSQL content | "Testing the Go runtime's garbage collector" (from context memory) |
| Content fabrication | Forum communication layer description | Forum system details | "Scheduling dreaming for 2am-6am tripled insights and boosted recall precision from 0.42 to 0.67" (fabricated) |

Root Cause Analysis

  1. Monotone training data: All 4,254 training examples are encoding-task, tech-domain, Gemini-generated. The model learned to produce plausible mnemonic-domain output rather than faithfully compress the input.
  2. Prompt distribution mismatch: Training uses ENCODING_SYSTEM_PROMPT + raw input. Production adds concept vocabulary (50+ terms), episode context, related memories, coaching instructions, and SOURCE:/TYPE: metadata. The model has never seen this format.
  3. Low input entropy: Raw inputs are all synthetic tech narratives of similar length/structure. The model can "fake" outputs from domain priors alone.

Experiment: EXP-25 — Faithfulness Probe

Hypothesis: The Qwen 3.5 2B architecture with 25M spoke parameters has sufficient capacity to learn faithful input-to-output encoding on diverse content. The current failure is a data problem, not a model capacity problem.

Null hypothesis: The 2B model at RQ4 quantization lacks the capacity to reliably follow the encoding prompt with complex contextual inputs — more data won't fix it.

Why this matters: If confirmed, we build v7 training data and retrain. If refuted, we either need a larger model (Gemma 4 E2B), a simpler prompt, or architectural changes.


Phase 1: Build 10 Maximally Diverse Training Examples

Hand-craft 10 inputs designed to force the model to read the actual content. The model cannot fake these from domain priors.

The 10 Inputs

| # | Category | Input Description | Why It Forces Faithfulness |
| --- | --- | --- | --- |
| 1 | Out-of-domain: Recipe | A detailed pasta carbonara recipe with specific measurements, timing, and technique warnings | Zero overlap with training distribution. Output MUST reflect eggs, guanciale, pecorino; can't fake it. |
| 2 | Out-of-domain: Legal | A software license clause with specific permissions, restrictions, and liability terms | Legal language is structurally very different from tech narratives. Must preserve exact terms. |
| 3 | Out-of-domain: Medical | A clinical note about a patient presenting with specific symptoms, vitals, and differential diagnosis | Completely outside training domain. Entities (drug names, measurements) must be preserved exactly. |
| 4 | Out-of-domain: Sports | A basketball game recap with box scores, specific player stats, and play-by-play details | Dense numbers + proper nouns. Easy to verify: did the model preserve "LeBron: 32pts/8reb/7ast"? |
| 5 | Adversarial twin A | "Decided to use PostgreSQL over SQLite because we need concurrent write support for the multi-node deployment" | Paired with #6. If the model produces the same encoding for both, it's not reading the input. |
| 6 | Adversarial twin B | "Decided to use SQLite over PostgreSQL because the local-first architecture doesn't need concurrent writes and we want zero deployment dependencies" | Must produce a meaningfully different encoding from #5. Same structure, opposite decision. |
| 7 | Minimal input | "WAL mode on." | 3 words. The model must produce a minimal encoding without padding it with hallucinated context. Should get low salience. |
| 8 | Dense numbers | A monitoring alert with 15+ specific metrics: CPU 94.2%, memory 12.8/16GB, disk I/O 450MB/s, p99 latency 2847ms, error rate 3.2%, 47 active connections, etc. | Every number must appear in the output. Easy to score programmatically. |
| 9 | Foreign language mixed | A bilingual (English + Mandarin) code review comment discussing a race condition, with technical terms in both languages | Tests character-level attention and whether the model preserves non-Latin script. |
| 10 | Production handoff note | A real-format mnemonic session handoff with bullet lists, file paths, known issues, and next steps (mimicking the exact format the daemon sees) | This is the production format that currently fails. Must preserve structure and specific paths/versions. |

Output Format

Each example needs a gold-standard encoding — the correct JSON output that a perfect model would produce. Generate via Gemini, then hand-verify every field for:

  • Entity preservation (all names, numbers, paths from input appear in output)
  • Zero fabrication (nothing in output that wasn't in input)
  • Appropriate salience (recipe = 0.3, critical decision = 0.8)
  • Correct significance/tone
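
To make the target concrete: a plausible gold-standard encoding for input #7 ("WAL mode on.") might look like the following. Field names and enum values here are illustrative; the authoritative schema is whatever the encoding prompt specifies.

```json
{
  "summary": "WAL mode enabled",
  "content": "WAL mode is on.",
  "salience": 0.2,
  "significance": "routine"
}
```

Note what's absent: no database name, no invented rationale, and a salience well under the 0.4 MIH threshold defined in Phase 3.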

Prompt Format

Critical: Use the production prompt format, not the training prompt format. Each example must include:

  • The production system prompt from buildCompressionPrompt() (not ENCODING_SYSTEM_PROMPT)
  • Concept vocabulary list (the 50+ terms from config.yaml)
  • SOURCE: mcp and TYPE: <appropriate> metadata
  • For 2 of the 10: include mock episode context and related memory context

This ensures the model learns to handle the actual prompt it will see in production.
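
As a sketch of what assembling one probe example could look like (the helper below is hypothetical; the real section order and wording must be copied from the daemon's buildCompressionPrompt(), and PRODUCTION_ENCODING_PROMPT is the constant this issue proposes adding to training_constants.py):

```python
# Hypothetical assembly of a production-format probe prompt. Section order and
# labels here are assumptions; mirror buildCompressionPrompt() exactly in practice.
from training_constants import PRODUCTION_ENCODING_PROMPT  # constant proposed below

def build_probe_prompt(raw_input: str, source: str, type_: str,
                       vocab: list[str], context: str | None = None) -> str:
    sections = [PRODUCTION_ENCODING_PROMPT]
    sections.append("CONCEPT VOCABULARY: " + ", ".join(vocab))
    if context is not None:                   # only 2 of the 10 examples carry this
        sections.append(context)              # mock episode + related-memory stubs
    sections.append(f"SOURCE: {source}\nTYPE: {type_}")
    sections.append(raw_input)
    return "\n\n".join(sections)
```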


Phase 2: Train

Config

Base model: Qwen/Qwen3.5-2B
Spokes: All 24 layers, standard config
Dataset: 10 examples (overfit intentionally)
Steps: 200 (enough to memorize 10 examples)
LR: 1e-3 (proven in EXP-18)
Seq length: 2048
Eval: Same 10 examples (we WANT overfitting here)
Hardware: RX 7800 XT (16GB VRAM, ROCm)
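
The same config as a literal block for the experiment registry (key names are just for readability, not tied to any particular trainer's schema; values come from the list above):

```python
# EXP-25 run config; illustrative key names, values from the Config section.
EXP25_CONFIG = {
    "base_model": "Qwen/Qwen3.5-2B",
    "spoke_layers": list(range(24)),  # all 24 layers, standard spoke config
    "dataset_size": 10,               # intentional overfit
    "max_steps": 200,
    "learning_rate": 1e-3,            # proven in EXP-18
    "seq_length": 2048,
    "eval_on_train": True,            # we WANT overfitting here
}
```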

Success criteria for Phase 2

The model should perfectly reproduce the gold-standard outputs for all 10 training examples. If it can't overfit to 10 examples, the architecture fundamentally can't learn this task.
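
One way to operationalize this check, as a minimal sketch: assume outputs and gold standards are keyed by example ID, and compare parsed JSON rather than raw strings so key order and whitespace don't count as misses.

```python
import json

def memorization_rate(outputs: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of training examples whose output matches the gold standard exactly."""
    hits = 0
    for example_id, gold_json in gold.items():
        try:
            if json.loads(outputs[example_id]) == json.loads(gold_json):
                hits += 1
        except (KeyError, json.JSONDecodeError):
            pass  # missing or malformed output counts as a miss
    return hits / len(gold)

# Phase 2 passes only if memorization_rate(...) == 1.0 on all 10 examples.
```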


Phase 3: Evaluate Faithfulness

Evaluation Set (20 inputs total)

  • 10 training inputs (should be near-perfect — verifies the model memorized correctly)
  • 10 held-out inputs from real production:
    • 5 from ~/.mnemonic/training-data/capture_*.jsonl (real daemon encoding requests)
    • 5 hand-written edge cases (empty input, pure code block, URL-only, emoji-heavy, XML/HTML)

Faithfulness Metrics (NEW — does not exist yet)

Build eval_faithfulness.py with these automated checks:

| Metric | Definition | Target |
| --- | --- | --- |
| Entity Preservation Rate (EPR) | % of named entities (names, numbers, versions, paths) from input that appear in output content or summary fields | >90% |
| Fabrication Rate (FR) | % of entities in output that do NOT appear in input (false positives) | <5% |
| Template Echo Detection (TED) | Binary: does any output field contain known instruction text ("under 60 characters", "what happened", "keyword", etc.) | 0% |
| Cross-Contamination Score (CCS) | For adversarial twins: cosine similarity between their encoded outputs. Should be LOW (distinct encodings). | <0.7 |
| Minimal Input Handling (MIH) | For input #7 ("WAL mode on."): output salience <0.4, content length <100 chars, no fabricated detail | Pass/Fail |
| Number Preservation (NP) | For input #8 (dense numbers): % of numeric values preserved exactly in output | >95% |
| Schema Compliance (SC) | Existing: valid JSON, all required fields, correct enum values | 100% |
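
A first-pass sketch of the core checks for eval_faithfulness.py. The regex-based entity extractor is a deliberate simplification (a NER pass or per-example hand-labeled entity lists would be stricter), the instruction-snippet list is illustrative, and CCS is omitted because it additionally needs an embedding model for the cosine similarity:

```python
import re

# Known instruction fragments for Template Echo Detection; extend as observed.
INSTRUCTION_SNIPPETS = ["under 60 characters", "what happened", "keyword"]

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")
# Crude entity proxy; version alternative goes first so 1.2.3 isn't split.
ENTITY_RE = re.compile(
    r"v?\d+(?:\.\d+){2,}"   # version strings like 1.2.3 / v0.4.1
    r"|\d+(?:\.\d+)?%?"     # plain numbers, decimals, percentages
    r"|/[\w./*-]+"          # file paths
    r"|[A-Z][\w-]+"         # capitalized tokens as a proper-noun proxy
)

def entities(text: str) -> set[str]:
    return set(ENTITY_RE.findall(text))

def entity_preservation_rate(inp: str, out: str) -> float:
    src = entities(inp)
    return len(src & entities(out)) / len(src) if src else 1.0

def fabrication_rate(inp: str, out: str) -> float:
    produced = entities(out)
    return len(produced - entities(inp)) / len(produced) if produced else 0.0

def template_echo(out: str) -> bool:
    lowered = out.lower()
    return any(snippet in lowered for snippet in INSTRUCTION_SNIPPETS)

def number_preservation(inp: str, out: str) -> float:
    nums = set(NUMBER_RE.findall(inp))
    return len(nums & set(NUMBER_RE.findall(out))) / len(nums) if nums else 1.0
```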

Verdict Matrix

| Training Set EPR | Held-Out EPR | Verdict |
| --- | --- | --- |
| >90% | >80% | CONFIRMED: architecture works, scale to v7 data |
| >90% | <60% | PARTIAL: model can learn but doesn't generalize; need more diverse data (still scale to v7) |
| <70% | any | REFUTED: architecture/capacity issue, need different approach |
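
Encoded as a helper for the eval script (note the matrix leaves some regions undefined, e.g. held-out EPR between 60% and 80%, or training EPR between 70% and 90%; those fall through to manual judgment):

```python
def verdict(train_epr: float, heldout_epr: float) -> str:
    # Thresholds taken directly from the verdict matrix above.
    if train_epr < 0.70:
        return "REFUTED"      # capacity issue regardless of held-out score
    if train_epr > 0.90 and heldout_epr > 0.80:
        return "CONFIRMED"
    if train_epr > 0.90 and heldout_epr < 0.60:
        return "PARTIAL"
    return "INCONCLUSIVE"     # undefined region of the matrix: decide by hand
```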

Phase 4: If Confirmed → Build V7 Dataset

Only proceed here if Phase 3 confirms the hypothesis.

V7 Data Mix (~1,500 new examples + 4,254 existing)

| Category | Count | Source | Purpose |
| --- | --- | --- | --- |
| Out-of-domain diverse | 300 | Gemini generation from diverse seed topics (cooking, law, medicine, sports, music, history, etc.) | Break domain monotony |
| Adversarial twins | 100 pairs (200) | Hand-crafted pairs that differ in one key detail | Force careful reading |
| Minimal inputs | 100 | 1-10 word inputs | Prevent hallucinated padding |
| Dense-number inputs | 100 | Monitoring alerts, benchmark tables, config dumps | Train number preservation |
| Production-format prompts | 300 | Real daemon prompts with vocabulary + context + coaching | Close the prompt distribution gap |
| Real MCP memories | 300 | From ~/.mnemonic/training-data/capture_*.jsonl | Train on actual production inputs |
| Negative examples | 100 | Inputs paired with WRONG outputs (template echoes, fabrications), given a low salience label | Teach the model what NOT to do |
| Existing v6 encoding | 4,254 | Current dataset | Maintain schema compliance |

Total: ~5,754 examples (35% increase, but dramatically more diverse)

Generation Pipeline

  1. generate_diverse_inputs.py — Create raw inputs for each category using Gemini with explicit diversity constraints
  2. batch_encode.py — Generate gold-standard outputs via Gemini Batch API
  3. eval_faithfulness.py — Validate every example passes faithfulness checks before inclusion
  4. validate.py — Existing 3-level quality pipeline
  5. Tokenize with production prompt format (including vocab list, context stubs)
  6. Manual spot-check: read 50 random examples, verify gold-standard quality
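
A possible driver tying steps 1-4 together (the CLI flags below are assumptions; only the script names come from this issue, and steps 5-6 stay manual):

```python
# Hypothetical v7 pipeline driver; adjust flags to each script's real CLI.
import subprocess

STEPS = [
    ["python", "training/scripts/generate_diverse_inputs.py", "--out", "v7_inputs.jsonl"],
    ["python", "training/scripts/batch_encode.py", "--inputs", "v7_inputs.jsonl", "--out", "v7_raw.jsonl"],
    ["python", "training/scripts/eval_faithfulness.py", "--data", "v7_raw.jsonl", "--reject-failures"],
    ["python", "training/scripts/validate.py", "--data", "v7_raw.jsonl"],
]

for step in STEPS:
    subprocess.run(step, check=True)  # fail fast: a bad step poisons downstream data
```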

Files to Create/Modify

| File | Action | Description |
| --- | --- | --- |
| training/data/faithfulness_probe/ | Create | Directory for EXP-25 data (10 train + 10 eval) |
| training/scripts/eval_faithfulness.py | Create | New faithfulness evaluation script with EPR, FR, TED, CCS, MIH, NP metrics |
| training/scripts/generate_diverse_inputs.py | Create | V7 diverse input generator (Phase 4 only) |
| training/scripts/stress_test_hallucination.py | Modify | Add faithfulness metrics alongside existing hallucination checks |
| training/scripts/training_constants.py | Modify | Add PRODUCTION_ENCODING_PROMPT matching the daemon's buildCompressionPrompt() output |
| training/docs/experiment_registry.md | Modify | Pre-register EXP-25 |

Definition of Done

  • 10 diverse training examples hand-crafted with gold-standard outputs
  • Examples use production prompt format (vocab list, source/type, context stubs)
  • eval_faithfulness.py implemented with all 7 metrics
  • EXP-25 trained (200 steps on 10 examples)
  • Evaluation run on 10 train + 10 held-out inputs
  • Results recorded in experiment registry with verdict
  • If confirmed: V7 dataset plan refined with specific generation scripts
  • If refuted: alternative approaches documented (larger model, simpler prompt, etc.)

Time Estimate

  • Phase 1 (build examples): ~2 hours (hand-craft + Gemini gold-standard + verify)
  • Phase 2 (train): ~15 minutes (200 steps on 10 examples)
  • Phase 3 (evaluate): ~1 hour (build eval script + run + analyze)
  • Phase 4 (v7 data): ~4 hours if confirmed (generate + encode + validate + tokenize)

Total minimum: ~3.5 hours through verdict. ~7.5 hours if scaling to v7.
