
EXP-25: Faithfulness probe — can Qwen 2B spokes learn to encode diverse inputs? #381

@CalebisGross

Problem

Live quality testing (2026-04-07) of the Qwen 3.5 2B RQ4 spokes model revealed that while JSON schema compliance is 100%, content faithfulness is critically broken. The model produces structurally valid but semantically wrong encodings.

Failure Modes Observed

| Mode | Example Input | Expected | Got |
| --- | --- | --- | --- |
| Template echoing | Any input | Actual summary | "What happened and why it matters in under 100 characters." (instruction text) |
| Cross-contamination | PostgreSQL MVCC explanation | PostgreSQL content | "Testing the Go runtime's garbage collector" (from context memory) |
| Content fabrication | Forum communication layer description | Forum system details | "Scheduling dreaming for 2am-6am tripled insights and boosted recall precision from 0.42 to 0.67" (fabricated) |

Root Cause Analysis

  1. Monotone training data: All 4,254 training examples are encoding-task, tech-domain, Gemini-generated. The model learned to produce plausible mnemonic-domain output rather than faithfully compress the input.
  2. Prompt distribution mismatch: Training uses ENCODING_SYSTEM_PROMPT + raw input. Production adds concept vocabulary (50+ terms), episode context, related memories, coaching instructions, and SOURCE:/TYPE: metadata. The model has never seen this format.
  3. Low input entropy: Raw inputs are all synthetic tech narratives of similar length/structure. The model can "fake" outputs from domain priors alone.

Experiment: EXP-25 — Faithfulness Probe

Hypothesis: The Qwen 3.5 2B architecture with 25M spoke parameters has sufficient capacity to learn faithful input-to-output encoding on diverse content. The current failure is a data problem, not a model capacity problem.

Null hypothesis: The 2B model at RQ4 quantization lacks the capacity to reliably follow the encoding prompt with complex contextual inputs — more data won't fix it.

Why this matters: If confirmed, we build v7 training data and retrain. If refuted, we either need a larger model (Gemma 4 E2B), a simpler prompt, or architectural changes.


Phase 1: Build 10 Maximally Diverse Training Examples

Hand-craft 10 inputs designed to force the model to read the actual content. The model cannot fake these from domain priors.

The 10 Inputs

| # | Category | Input Description | Why It Forces Faithfulness |
| --- | --- | --- | --- |
| 1 | Out-of-domain: Recipe | A detailed pasta carbonara recipe with specific measurements, timing, and technique warnings | Zero overlap with training distribution. Output MUST reflect eggs, guanciale, pecorino; can't fake it. |
| 2 | Out-of-domain: Legal | A software license clause with specific permissions, restrictions, and liability terms | Legal language is structurally very different from tech narratives. Must preserve exact terms. |
| 3 | Out-of-domain: Medical | A clinical note about a patient presenting with specific symptoms, vitals, and differential diagnosis | Completely outside training domain. Entities (drug names, measurements) must be preserved exactly. |
| 4 | Out-of-domain: Sports | A basketball game recap with box scores, specific player stats, and play-by-play details | Dense numbers + proper nouns. Easy to verify: did the model preserve "LeBron: 32pts/8reb/7ast"? |
| 5 | Adversarial twin A | "Decided to use PostgreSQL over SQLite because we need concurrent write support for the multi-node deployment" | Paired with #6. If the model produces the same encoding for both, it's not reading the input. |
| 6 | Adversarial twin B | "Decided to use SQLite over PostgreSQL because the local-first architecture doesn't need concurrent writes and we want zero deployment dependencies" | Must produce a meaningfully different encoding from #5. Same structure, opposite decision. |
| 7 | Minimal input | "WAL mode on." | 3 words. The model must produce a minimal encoding without padding it with hallucinated context. Should get low salience. |
| 8 | Dense numbers | A monitoring alert with 15+ specific metrics: CPU 94.2%, memory 12.8/16GB, disk I/O 450MB/s, p99 latency 2847ms, error rate 3.2%, 47 active connections, etc. | Every number must appear in the output. Easy to score programmatically. |
| 9 | Foreign language mixed | A bilingual (English + Mandarin) code review comment discussing a race condition, with technical terms in both languages | Tests character-level attention and whether the model preserves non-Latin script. |
| 10 | Production handoff note | A real-format mnemonic session handoff with bullet lists, file paths, known issues, and next steps (mimicking the exact format the daemon sees) | This is the production format that currently fails. Must preserve structure and specific paths/versions. |

Output Format

Each example needs a gold-standard encoding — the correct JSON output that a perfect model would produce. Generate via Gemini, then hand-verify every field for:

  • Entity preservation (all names, numbers, paths from input appear in output)
  • Zero fabrication (nothing in output that wasn't in input)
  • Appropriate salience (recipe = 0.3, critical decision = 0.8)
  • Correct significance/tone
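
To make the target concrete: a plausible gold-standard encoding for input #7 ("WAL mode on.") might look like the following. Field names and enum values here are illustrative; the authoritative schema is whatever the encoding prompt specifies.

```json
{
  "summary": "WAL mode enabled",
  "content": "WAL mode is on.",
  "salience": 0.2,
  "significance": "routine"
}
```

Note what's absent: no database name, no invented rationale, and a salience well under the 0.4 MIH threshold defined in Phase 3.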

Prompt Format

Critical: Use the production prompt format, not the training prompt format. Each example must include:

  • The production system prompt from buildCompressionPrompt() (not ENCODING_SYSTEM_PROMPT)
  • Concept vocabulary list (the 50+ terms from config.yaml)
  • SOURCE: mcp and TYPE: <appropriate> metadata
  • For 2 of the 10: include mock episode context and related memory context

This ensures the model learns to handle the actual prompt it will see in production.
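
As a sketch of what assembling one probe example could look like (the helper below is hypothetical; the real section order and wording must be copied from the daemon's buildCompressionPrompt(), and PRODUCTION_ENCODING_PROMPT is the constant this issue proposes adding to training_constants.py):

```python
# Hypothetical assembly of a production-format probe prompt. Section order and
# labels here are assumptions; mirror buildCompressionPrompt() exactly in practice.
from training_constants import PRODUCTION_ENCODING_PROMPT  # constant proposed below

def build_probe_prompt(raw_input: str, source: str, type_: str,
                       vocab: list[str], context: str | None = None) -> str:
    sections = [PRODUCTION_ENCODING_PROMPT]
    sections.append("CONCEPT VOCABULARY: " + ", ".join(vocab))
    if context is not None:                   # only 2 of the 10 examples carry this
        sections.append(context)              # mock episode + related-memory stubs
    sections.append(f"SOURCE: {source}\nTYPE: {type_}")
    sections.append(raw_input)
    return "\n\n".join(sections)
```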


Phase 2: Train

Config

Base model: Qwen/Qwen3.5-2B
Spokes: All 24 layers, standard config
Dataset: 10 examples (overfit intentionally)
Steps: 200 (enough to memorize 10 examples)
LR: 1e-3 (proven in EXP-18)
Seq length: 2048
Eval: Same 10 examples (we WANT overfitting here)
Hardware: RX 7800 XT (16GB VRAM, ROCm)
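
The same config as a literal block for the experiment registry (key names are just for readability, not tied to any particular trainer's schema; values come from the list above):

```python
# EXP-25 run config; illustrative key names, values from the Config section.
EXP25_CONFIG = {
    "base_model": "Qwen/Qwen3.5-2B",
    "spoke_layers": list(range(24)),  # all 24 layers, standard spoke config
    "dataset_size": 10,               # intentional overfit
    "max_steps": 200,
    "learning_rate": 1e-3,            # proven in EXP-18
    "seq_length": 2048,
    "eval_on_train": True,            # we WANT overfitting here
}
```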

Success criteria for Phase 2

The model should perfectly reproduce the gold-standard outputs for all 10 training examples. If it can't overfit to 10 examples, the architecture fundamentally can't learn this task.
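
One way to operationalize this check, as a minimal sketch: assume outputs and gold standards are keyed by example ID, and compare parsed JSON rather than raw strings so key order and whitespace don't count as misses.

```python
import json

def memorization_rate(outputs: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of training examples whose output matches the gold standard exactly."""
    hits = 0
    for example_id, gold_json in gold.items():
        try:
            if json.loads(outputs[example_id]) == json.loads(gold_json):
                hits += 1
        except (KeyError, json.JSONDecodeError):
            pass  # missing or malformed output counts as a miss
    return hits / len(gold)

# Phase 2 passes only if memorization_rate(...) == 1.0 on all 10 examples.
```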


Phase 3: Evaluate Faithfulness

Evaluation Set (20 inputs total)

  • 10 training inputs (should be near-perfect — verifies the model memorized correctly)
  • 10 held-out inputs from real production:
    • 5 from ~/.mnemonic/training-data/capture_*.jsonl (real daemon encoding requests)
    • 5 hand-written edge cases (empty input, pure code block, URL-only, emoji-heavy, XML/HTML)

Faithfulness Metrics (NEW — does not exist yet)

Build eval_faithfulness.py with these automated checks:

| Metric | Definition | Target |
| --- | --- | --- |
| Entity Preservation Rate (EPR) | % of named entities (names, numbers, versions, paths) from input that appear in output content or summary fields | >90% |
| Fabrication Rate (FR) | % of entities in output that do NOT appear in input (false positives) | <5% |
| Template Echo Detection (TED) | Binary: does any output field contain known instruction text ("under 60 characters", "what happened", "keyword", etc.) | 0% |
| Cross-Contamination Score (CCS) | For adversarial twins: cosine similarity between their encoded outputs. Should be LOW (distinct encodings). | <0.7 |
| Minimal Input Handling (MIH) | For input #7 ("WAL mode on."): output salience <0.4, content length <100 chars, no fabricated detail | Pass/Fail |
| Number Preservation (NP) | For input #8 (dense numbers): % of numeric values preserved exactly in output | >95% |
| Schema Compliance (SC) | Existing: valid JSON, all required fields, correct enum values | 100% |
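
A first-pass sketch of the core checks for eval_faithfulness.py. The regex-based entity extractor is a deliberate simplification (a NER pass or per-example hand-labeled entity lists would be stricter), the instruction-snippet list is illustrative, and CCS is omitted because it additionally needs an embedding model for the cosine similarity:

```python
import re

# Known instruction fragments for Template Echo Detection; extend as observed.
INSTRUCTION_SNIPPETS = ["under 60 characters", "what happened", "keyword"]

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")
# Crude entity proxy; version alternative goes first so 1.2.3 isn't split.
ENTITY_RE = re.compile(
    r"v?\d+(?:\.\d+){2,}"   # version strings like 1.2.3 / v0.4.1
    r"|\d+(?:\.\d+)?%?"     # plain numbers, decimals, percentages
    r"|/[\w./*-]+"          # file paths
    r"|[A-Z][\w-]+"         # capitalized tokens as a proper-noun proxy
)

def entities(text: str) -> set[str]:
    return set(ENTITY_RE.findall(text))

def entity_preservation_rate(inp: str, out: str) -> float:
    src = entities(inp)
    return len(src & entities(out)) / len(src) if src else 1.0

def fabrication_rate(inp: str, out: str) -> float:
    produced = entities(out)
    return len(produced - entities(inp)) / len(produced) if produced else 0.0

def template_echo(out: str) -> bool:
    lowered = out.lower()
    return any(snippet in lowered for snippet in INSTRUCTION_SNIPPETS)

def number_preservation(inp: str, out: str) -> float:
    nums = set(NUMBER_RE.findall(inp))
    return len(nums & set(NUMBER_RE.findall(out))) / len(nums) if nums else 1.0
```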

Verdict Matrix

| Training Set EPR | Held-Out EPR | Verdict |
| --- | --- | --- |
| >90% | >80% | CONFIRMED: architecture works, scale to v7 data |
| >90% | <60% | PARTIAL: model can learn but doesn't generalize; need more diverse data (still scale to v7) |
| <70% | any | REFUTED: architecture/capacity issue, need different approach |
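
Encoded as a helper for the eval script (note the matrix leaves some regions undefined, e.g. held-out EPR between 60% and 80%, or training EPR between 70% and 90%; those fall through to manual judgment):

```python
def verdict(train_epr: float, heldout_epr: float) -> str:
    # Thresholds taken directly from the verdict matrix above.
    if train_epr < 0.70:
        return "REFUTED"      # capacity issue regardless of held-out score
    if train_epr > 0.90 and heldout_epr > 0.80:
        return "CONFIRMED"
    if train_epr > 0.90 and heldout_epr < 0.60:
        return "PARTIAL"
    return "INCONCLUSIVE"     # undefined region of the matrix: decide by hand
```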

Phase 4: If Confirmed → Build V7 Dataset

Only proceed here if Phase 3 confirms the hypothesis.

V7 Data Mix (~1,500 new examples + 4,254 existing)

| Category | Count | Source | Purpose |
| --- | --- | --- | --- |
| Out-of-domain diverse | 300 | Gemini generation from diverse seed topics (cooking, law, medicine, sports, music, history, etc.) | Break domain monotony |
| Adversarial twins | 100 pairs (200) | Hand-crafted pairs that differ in one key detail | Force careful reading |
| Minimal inputs | 100 | 1-10 word inputs | Prevent hallucinated padding |
| Dense-number inputs | 100 | Monitoring alerts, benchmark tables, config dumps | Train number preservation |
| Production-format prompts | 300 | Real daemon prompts with vocabulary + context + coaching | Close the prompt distribution gap |
| Real MCP memories | 300 | From ~/.mnemonic/training-data/capture_*.jsonl | Train on actual production inputs |
| Negative examples | 100 | Inputs paired with WRONG outputs (template echoes, fabrications), given a low salience label | Teach the model what NOT to do |
| Existing v6 encoding | 4,254 | Current dataset | Maintain schema compliance |

Total: ~5,754 examples (35% increase, but dramatically more diverse)

Generation Pipeline

  1. generate_diverse_inputs.py — Create raw inputs for each category using Gemini with explicit diversity constraints
  2. batch_encode.py — Generate gold-standard outputs via Gemini Batch API
  3. eval_faithfulness.py — Validate every example passes faithfulness checks before inclusion
  4. validate.py — Existing 3-level quality pipeline
  5. Tokenize with production prompt format (including vocab list, context stubs)
  6. Manual spot-check: read 50 random examples, verify gold-standard quality
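
A possible driver tying steps 1-4 together (the CLI flags below are assumptions; only the script names come from this issue, and steps 5-6 stay manual):

```python
# Hypothetical v7 pipeline driver; adjust flags to each script's real CLI.
import subprocess

STEPS = [
    ["python", "training/scripts/generate_diverse_inputs.py", "--out", "v7_inputs.jsonl"],
    ["python", "training/scripts/batch_encode.py", "--inputs", "v7_inputs.jsonl", "--out", "v7_raw.jsonl"],
    ["python", "training/scripts/eval_faithfulness.py", "--data", "v7_raw.jsonl", "--reject-failures"],
    ["python", "training/scripts/validate.py", "--data", "v7_raw.jsonl"],
]

for step in STEPS:
    subprocess.run(step, check=True)  # fail fast: a bad step poisons downstream data
```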

Files to Create/Modify

| File | Action | Description |
| --- | --- | --- |
| training/data/faithfulness_probe/ | Create | Directory for EXP-25 data (10 train + 10 eval) |
| training/scripts/eval_faithfulness.py | Create | New faithfulness evaluation script with EPR, FR, TED, CCS, MIH, NP metrics |
| training/scripts/generate_diverse_inputs.py | Create | V7 diverse input generator (Phase 4 only) |
| training/scripts/stress_test_hallucination.py | Modify | Add faithfulness metrics alongside existing hallucination checks |
| training/scripts/training_constants.py | Modify | Add PRODUCTION_ENCODING_PROMPT matching the daemon's buildCompressionPrompt() output |
| training/docs/experiment_registry.md | Modify | Pre-register EXP-25 |

Definition of Done

  • 10 diverse training examples hand-crafted with gold-standard outputs
  • Examples use production prompt format (vocab list, source/type, context stubs)
  • eval_faithfulness.py implemented with all 7 metrics
  • EXP-25 trained (200 steps on 10 examples)
  • Evaluation run on 10 train + 10 held-out inputs
  • Results recorded in experiment registry with verdict
  • If confirmed: V7 dataset plan refined with specific generation scripts
  • If refuted: alternative approaches documented (larger model, simpler prompt, etc.)

Time Estimate

  • Phase 1 (build examples): ~2 hours (hand-craft + Gemini gold-standard + verify)
  • Phase 2 (train): ~15 minutes (200 steps on 10 examples)
  • Phase 3 (evaluate): ~1 hour (build eval script + run + analyze)
  • Phase 4 (v7 data): ~4 hours if confirmed (generate + encode + validate + tokenize)

Total minimum: ~3.5 hours through verdict. ~7.5 hours if scaling to v7.
