## Problem
Live quality testing (2026-04-07) of the Qwen 3.5 2B RQ4 spokes model revealed that while JSON schema compliance is 100%, content faithfulness is critically broken. The model produces structurally valid but semantically wrong encodings.
### Failure Modes Observed

| Mode | Example Input | Expected | Got |
|------|---------------|----------|-----|
| Template echoing | Any input | Actual summary | "What happened and why it matters in under 100 characters." (instruction text) |
| Cross-contamination | PostgreSQL MVCC explanation | PostgreSQL content | "Testing the Go runtime's garbage collector" (from context memory) |
| Content fabrication | Forum communication layer description | Forum system details | "Scheduling dreaming for 2am-6am tripled insights and boosted recall precision from 0.42 to 0.67" (fabricated) |
### Root Cause Analysis

- Monotone training data: All 4,254 training examples are encoding-task, tech-domain, Gemini-generated. The model learned to produce plausible mnemonic-domain output rather than faithfully compress the input.
- Prompt distribution mismatch: Training uses `ENCODING_SYSTEM_PROMPT` + raw input. Production adds concept vocabulary (50+ terms), episode context, related memories, coaching instructions, and `SOURCE:`/`TYPE:` metadata. The model has never seen this format.
- Low input entropy: Raw inputs are all synthetic tech narratives of similar length/structure. The model can "fake" outputs from domain priors alone.
## Experiment: EXP-25 — Faithfulness Probe
Hypothesis: The Qwen 3.5 2B architecture with 25M spoke parameters has sufficient capacity to learn faithful input-to-output encoding on diverse content. The current failure is a data problem, not a model capacity problem.
Null hypothesis: The 2B model at RQ4 quantization lacks the capacity to reliably follow the encoding prompt with complex contextual inputs — more data won't fix it.
Why this matters: If confirmed, we build v7 training data and retrain. If refuted, we either need a larger model (Gemma 4 E2B), a simpler prompt, or architectural changes.
### Phase 1: Build 10 Maximally Diverse Training Examples
Hand-craft 10 inputs designed to force the model to read the actual content. The model cannot fake these from domain priors.
#### The 10 Inputs

| # | Category | Input Description | Why It Forces Faithfulness |
|---|----------|-------------------|----------------------------|
| 1 | Out-of-domain: Recipe | A detailed pasta carbonara recipe with specific measurements, timing, and technique warnings | Zero overlap with training distribution. Output MUST reflect eggs, guanciale, pecorino — can't fake it. |
| 2 | Out-of-domain: Legal | A software license clause with specific permissions, restrictions, and liability terms | Legal language is structurally very different from tech narratives. Must preserve exact terms. |
| 3 | Out-of-domain: Medical | A clinical note about a patient presenting with specific symptoms, vitals, and differential diagnosis | Completely outside training domain. Entities (drug names, measurements) must be preserved exactly. |
| 4 | Out-of-domain: Sports | A basketball game recap with box scores, specific player stats, and play-by-play details | Dense numbers + proper nouns. Easy to verify — did the model preserve "LeBron: 32pts/8reb/7ast"? |
| 5 | Adversarial twin A | "Decided to use PostgreSQL over SQLite because we need concurrent write support for the multi-node deployment" | Paired with #6. If the model produces the same encoding for both, it's not reading the input. |
| 6 | Adversarial twin B | "Decided to use SQLite over PostgreSQL because the local-first architecture doesn't need concurrent writes and we want zero deployment dependencies" | Must produce meaningfully different encoding from #5. Same structure, opposite decision. |
| 7 | Minimal input | "WAL mode on." | 3 words. The model must produce a minimal encoding without padding with hallucinated context. Should get low salience. |
| 8 | Dense numbers | A monitoring alert with 15+ specific metrics: CPU 94.2%, memory 12.8/16GB, disk I/O 450MB/s, p99 latency 2847ms, error rate 3.2%, 47 active connections, etc. | Every number must appear in the output. Easy to score programmatically. |
| 9 | Foreign language mixed | A bilingual (English + Mandarin) code review comment discussing a race condition, with technical terms in both languages | Tests character-level attention and whether the model preserves non-Latin script. |
| 10 | Production handoff note | A real-format mnemonic session handoff with bullet lists, file paths, known issues, and next steps (mimicking the exact format the daemon sees) | This is the production format that currently fails. Must preserve structure and specific paths/versions. |
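The adversarial-twin check (#5 vs #6) can be scored automatically. A minimal sketch of the comparison, using bag-of-words term-frequency vectors as a stand-in for whatever embedding the final CCS metric settles on; the twin strings here are illustrative paraphrases, not the real encodings:

```python
# Sketch of the Cross-Contamination Score (CCS) check for adversarial twins.
# Term-frequency cosine similarity is an assumption standing in for a real
# embedding model: identical outputs score 1.0, distinct ones score lower.
import math
import re
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of two texts over lowercase word counts."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

twin_a = "Chose PostgreSQL for concurrent writes in multi node deployment"
twin_b = "Chose SQLite: local first, no concurrent writes, zero deployment dependencies"

ccs = cosine_similarity(twin_a, twin_b)
print(f"CCS = {ccs:.2f}")  # target: < 0.7 -- identical encodings would score 1.0
```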
#### Output Format
Each example needs a gold-standard encoding — the correct JSON output that a perfect model would produce. Generate via Gemini, then hand-verify every field for:
- Entity preservation (all names, numbers, paths from input appear in output)
- Zero fabrication (nothing in output that wasn't in input)
- Appropriate salience (recipe = 0.3, critical decision = 0.8)
- Correct significance/tone
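The entity-preservation and zero-fabrication checks can be roughed out programmatically before hand-verification. A sketch under the assumption that a crude regex extractor (numbers, paths, capitalized names) is good enough for a first pass; `extract_entities` and the sample texts are hypothetical, and a real pass may want proper NER:

```python
# Sketch of the zero-fabrication check used when verifying gold-standard
# encodings: every "entity" in the output should trace back to the input.
import re

def extract_entities(text: str) -> set[str]:
    """Crude entity set: numbers (incl. percentages), file paths, Capitalized words."""
    numbers = re.findall(r"\d+(?:\.\d+)*%?", text)
    paths = re.findall(r"[\w.~/-]*/[\w.*/-]+", text)
    names = re.findall(r"\b[A-Z][a-zA-Z]+\b", text)
    return set(numbers) | set(paths) | set(names)

def fabricated_entities(input_text: str, output_text: str) -> set[str]:
    """Entities present in the output but absent from the input."""
    return extract_entities(output_text) - extract_entities(input_text)

inp = "Switched WAL checkpoint interval to 512MB after p99 latency hit 2847ms."
good = "WAL checkpoint interval set to 512MB; p99 latency was 2847ms."
bad = "Scheduling dreaming for 2am boosted recall precision from 0.42 to 0.67."

print(fabricated_entities(inp, good))  # set() -> passes
print(fabricated_entities(inp, bad))   # fabricated numbers -> fails review
```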
#### Prompt Format

Critical: Use the production prompt format, not the training prompt format. Each example must include:

- The production system prompt from `buildCompressionPrompt()` (not `ENCODING_SYSTEM_PROMPT`)
- Concept vocabulary list (the 50+ terms from `config.yaml`)
- `SOURCE: mcp` and `TYPE: <appropriate>` metadata
- For 2 of the 10: include mock episode context and related memory context

This ensures the model learns to handle the actual prompt it will see in production.
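A sketch of how one production-format example might be assembled, assuming a placeholder `PRODUCTION_ENCODING_PROMPT` constant (the real text comes from the daemon's `buildCompressionPrompt()`); the vocabulary and context stubs are illustrative, not the actual `config.yaml` contents:

```python
# Sketch: assemble one training example in the production prompt format.
# PRODUCTION_ENCODING_PROMPT, the section labels, and the sample vocabulary
# are assumptions standing in for the daemon's real output.
PRODUCTION_ENCODING_PROMPT = "You are the mnemonic encoder. Compress the input faithfully."

def build_example(raw_input: str, source: str = "mcp", mem_type: str = "episodic",
                  vocabulary=None, episode_context=None) -> str:
    parts = [PRODUCTION_ENCODING_PROMPT]
    if vocabulary:
        parts.append("CONCEPTS: " + ", ".join(vocabulary))
    if episode_context:  # only ~2 of the 10 probe examples include this
        parts.append("EPISODE CONTEXT:\n" + episode_context)
    parts.append(f"SOURCE: {source}")
    parts.append(f"TYPE: {mem_type}")
    parts.append("INPUT:\n" + raw_input)
    return "\n\n".join(parts)

prompt = build_example("WAL mode on.", vocabulary=["wal", "sqlite", "checkpoint"])
print("SOURCE: mcp" in prompt and "TYPE: episodic" in prompt)  # True
```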
### Phase 2: Train

#### Config

- Base model: `Qwen/Qwen3.5-2B`
- Spokes: All 24 layers, standard config
- Dataset: 10 examples (overfit intentionally)
- Steps: 200 (enough to memorize 10 examples)
- LR: 1e-3 (proven in EXP-18)
- Seq length: 2048
- Eval: Same 10 examples (we WANT overfitting here)
- Hardware: RX 7800 XT (16GB VRAM, ROCm)
#### Success criteria for Phase 2
The model should perfectly reproduce the gold-standard outputs for all 10 training examples. If it can't overfit to 10 examples, the architecture fundamentally can't learn this task.
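The overfit check itself is simple: parse both the model output and the gold standard and compare them as objects, so key order and whitespace don't matter. A sketch (`matches_gold` and the sample encoding are hypothetical helpers, not an existing script):

```python
# Sketch of the Phase 2 memorization check: the trained model should
# reproduce each gold-standard encoding exactly, up to JSON formatting.
import json

def matches_gold(model_output: str, gold: dict) -> bool:
    """Exact semantic match: parse the model's JSON and compare as objects."""
    try:
        return json.loads(model_output) == gold
    except json.JSONDecodeError:
        return False

gold = {"summary": "WAL mode enabled", "salience": 0.3}
print(matches_gold('{"salience": 0.3, "summary": "WAL mode enabled"}', gold))  # True
print(matches_gold('{"summary": "WAL mode enabled"', gold))  # False (invalid JSON)
```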
### Phase 3: Evaluate Faithfulness
#### Evaluation Set (20 inputs total)

- 10 training inputs (should be near-perfect — verifies the model memorized correctly)
- 10 held-out inputs from real production:
  - 5 from `~/.mnemonic/training-data/capture_*.jsonl` (real daemon encoding requests)
  - 5 hand-written edge cases (empty input, pure code block, URL-only, emoji-heavy, XML/HTML)
#### Faithfulness Metrics (NEW — does not exist yet)

Build `eval_faithfulness.py` with these automated checks:

| Metric | Definition | Target |
|--------|------------|--------|
| Entity Preservation Rate (EPR) | % of named entities (names, numbers, versions, paths) from input that appear in output `content` or `summary` fields | >90% |
| Fabrication Rate (FR) | % of entities in output that do NOT appear in input (false positives) | <5% |
| Template Echo Detection (TED) | Binary: does any output field contain known instruction text ("under 60 characters", "what happened", "keyword", etc.)? | 0% |
| Cross-Contamination Score (CCS) | For adversarial twins: cosine similarity between their encoded outputs. Should be LOW (distinct encodings). | <0.7 |
| Minimal Input Handling (MIH) | For input #7 ("WAL mode on."): output salience <0.4, content length <100 chars, no fabricated detail | Pass/Fail |
| Number Preservation (NP) | For input #8 (dense numbers): % of numeric values preserved exactly in output | >95% |
| Schema Compliance (SC) | Existing: valid JSON, all required fields, correct enum values | 100% |
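As a rough sketch of how two of these checks (EPR and TED) might look in `eval_faithfulness.py`: the regex entity extraction is a deliberately crude assumption, and the instruction-snippet blocklist is partial and illustrative:

```python
# Sketch of two eval_faithfulness.py checks: Entity Preservation Rate (EPR)
# and Template Echo Detection (TED). Entity extraction via regex and the
# INSTRUCTION_SNIPPETS list are simplifying assumptions.
import re

INSTRUCTION_SNIPPETS = ["under 60 characters", "under 100 characters",
                        "what happened", "keyword"]

def entities(text: str) -> set[str]:
    return set(re.findall(r"\d+(?:\.\d+)*%?|\b[A-Z][a-zA-Z]+\b", text))

def epr(input_text: str, output_fields: dict) -> float:
    """Fraction of input entities that survive into content/summary."""
    inp = entities(input_text)
    if not inp:
        return 1.0
    out = entities(output_fields.get("content", "") + " " + output_fields.get("summary", ""))
    return len(inp & out) / len(inp)

def template_echo(output_fields: dict) -> bool:
    """True if any output field contains known instruction text."""
    blob = " ".join(str(v).lower() for v in output_fields.values())
    return any(s in blob for s in INSTRUCTION_SNIPPETS)

out = {"summary": "CPU hit 94.2% on Node7", "content": "CPU 94.2%, 47 connections"}
print(epr("CPU at 94.2% with 47 active connections on Node7", out))  # 1.0
print(template_echo({"summary": "What happened and why it matters"}))  # True
```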
#### Verdict Matrix

| Training Set EPR | Held-Out EPR | Verdict |
|------------------|--------------|---------|
| >90% | >80% | CONFIRMED — architecture works, scale to v7 data |
| >90% | <60% | PARTIAL — model can learn but doesn't generalize, need more diverse data (still scale to v7) |
| <70% | any | REFUTED — architecture/capacity issue, need different approach |
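The matrix can be encoded directly. Note that a held-out EPR between 60% and 80% (with training EPR above 90%) falls outside the matrix; the sketch below surfaces that band as INCONCLUSIVE, which is an assumption on our part, not something the matrix specifies:

```python
# Sketch of the verdict matrix as a function. The INCONCLUSIVE branch is an
# assumption covering bands the matrix above leaves undefined.
def verdict(train_epr: float, heldout_epr: float) -> str:
    if train_epr < 0.70:
        return "REFUTED"       # can't even memorize 10 examples
    if train_epr > 0.90 and heldout_epr > 0.80:
        return "CONFIRMED"     # architecture works, scale to v7 data
    if train_epr > 0.90 and heldout_epr < 0.60:
        return "PARTIAL"       # learns but doesn't generalize; still scale to v7
    return "INCONCLUSIVE"      # bands the matrix doesn't cover

print(verdict(0.95, 0.85))  # CONFIRMED
print(verdict(0.65, 0.90))  # REFUTED
```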
### Phase 4: If Confirmed → Build V7 Dataset
Only proceed here if Phase 3 confirms the hypothesis.
#### V7 Data Mix (~1,500 new examples + 4,254 existing)

| Category | Count | Source | Purpose |
|----------|-------|--------|---------|
| Out-of-domain diverse | 300 | Gemini generation from diverse seed topics (cooking, law, medicine, sports, music, history, etc.) | Break domain monotony |
| Adversarial twins | 100 pairs (200) | Hand-crafted pairs that differ in one key detail | Force careful reading |
| Minimal inputs | 100 | 1-10 word inputs | Prevent hallucinated padding |
| Dense-number inputs | 100 | Monitoring alerts, benchmark tables, config dumps | Train number preservation |
| Production-format prompts | 300 | Real daemon prompts with vocabulary + context + coaching | Close the prompt distribution gap |
| Real MCP memories | 300 | From `~/.mnemonic/training-data/capture_*.jsonl` | Train on actual production inputs |
| Negative examples | 100 | Inputs paired with WRONG outputs (template echoes, fabrications) — low salience label | Teach the model what NOT to do |
| Existing v6 encoding | 4,254 | Current dataset | Maintain schema compliance |
Total: ~5,754 examples (35% increase, but dramatically more diverse)
#### Generation Pipeline

1. `generate_diverse_inputs.py` — Create raw inputs for each category using Gemini with explicit diversity constraints
2. `batch_encode.py` — Generate gold-standard outputs via Gemini Batch API
3. `eval_faithfulness.py` — Validate every example passes faithfulness checks before inclusion
4. `validate.py` — Existing 3-level quality pipeline
5. Tokenize with production prompt format (including vocab list, context stubs)
6. Manual spot-check: read 50 random examples, verify gold-standard quality
## Files to Create/Modify

| File | Action | Description |
|------|--------|-------------|
| `training/data/faithfulness_probe/` | Create | Directory for EXP-25 data (10 train + 10 eval) |
| `training/scripts/eval_faithfulness.py` | Create | New faithfulness evaluation script with EPR, FR, TED, CCS, MIH, NP metrics |
| `training/scripts/generate_diverse_inputs.py` | Create | V7 diverse input generator (Phase 4 only) |
| `training/scripts/stress_test_hallucination.py` | Modify | Add faithfulness metrics alongside existing hallucination checks |
| `training/scripts/training_constants.py` | Modify | Add `PRODUCTION_ENCODING_PROMPT` matching daemon's `buildCompressionPrompt()` output |
| `training/docs/experiment_registry.md` | Modify | Pre-register EXP-25 |
## Definition of Done

- `eval_faithfulness.py` implemented with all 7 metrics
## Time Estimate
- Phase 1 (build examples): ~2 hours (hand-craft + Gemini gold-standard + verify)
- Phase 2 (train): ~15 minutes (200 steps on 10 examples)
- Phase 3 (evaluate): ~1 hour (build eval script + run + analyze)
- Phase 4 (v7 data): ~4 hours if confirmed (generate + encode + validate + tokenize)
Total minimum: ~3.5 hours through verdict. ~7.5 hours if scaling to v7.