
fix: type-filtered recall surfaces recent memories first #395

Merged
CalebisGross merged 5 commits into main from feat/gemma-e2b-spokes
Apr 11, 2026
Conversation

@CalebisGross
Collaborator

Summary

Fixes #394. Recall with a type filter (e.g. type:"handoff") now reliably surfaces the most recent memory of that type, instead of older memories with richer association graphs.

  • Type-filtered recency boost: New config params TypeFilterRecencyWeight (0.5) and TypeFilterRecencyHalfLife (7 days) override general recency (0.2 / 30 days) for type-filtered queries. When you filter by type, you've already constrained what — the system now prioritizes when.
  • check_memory content: Output now includes the full Content field (previously the field was omitted from the format string entirely, not merely truncated).

Changes

File                                               What
internal/agent/retrieval/agent.go                  Config struct + ranking branch for type-filtered recency
internal/config/config.go                          Config struct + defaults (0.5 weight, 7-day half-life)
cmd/mnemonic/runtime.go                            Wire new config fields to retrieval agent
internal/mcp/server.go                             Add Content line to check_memory output
internal/agent/retrieval/config_behavior_test.go   2 new tests for type-filter recency
internal/mcp/server_test.go                        1 new test for check_memory content

Verified

  • Recency bonus for ~11h-old handoff: 0.469 (was 0.197)
  • Most recent handoff now gets recency_bonus: 0.499 (near max)
  • check_memory shows full content
  • All tests pass, lint clean

Test plan

  • TestConfigTypeFilterRecencyBoostsRecent — recent handoff ranks above older one with more associations
  • TestConfigTypeFilterRecencyParamsUsed — aggressive params override general ones
  • TestHandleCheckMemoryIncludesContent — content field present in output
  • Live verification via daemon HTTP endpoint (curl to /mcp)
  • Live verification via MCP tools after killing stale subprocesses

🤖 Generated with Claude Code

CalebisGross and others added 5 commits April 10, 2026 22:32
…ysis

Best eval loss: 1.2002 (PPL 3.3) at step 4800. Early stopped at step
5800 after 9.5h on RX 7800 XT. Two-phase learning: peak LR caused
instability (regression steps 1200-1600), minimum LR produced steady
second descent through 14 consecutive new bests. Full per-checkpoint
loss table in registry. Evaluation of SC/EPR/FR/NP pending.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eval loss improved (1.68→1.20) but generation is degenerate: 2/13
valid JSON (15%), 0 SC. Base model without spokes achieves 24/25 valid.

Root cause: autoregressive generation compounds spoke perturbations
through NF4 dequantization noise. Teacher-forced eval loss does not
predict generation quality for spoke adapters on quantized models.

Production path: Gemma E2B + faithful prompt + GBNF grammar (no
spokes). Spoke training requires full bf16 (MI300X).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multiple eval runs confirm: 1/10 valid JSON (10%), 0 SC. Base model
without spokes achieves 24/25 valid. The spokes generate faithful
content but cannot maintain JSON structure despite training on 5,238
perfectly structured examples.

Eval loss (-0.483 improvement) does not predict generation quality
for NF4 spoke adapters. Teacher-forced training and autoregressive
generation have fundamentally different error dynamics on quantized
models.

Production path: Gemma E2B + faithful prompt + GBNF grammar.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Python HF generate() produces valid faithful JSON with trained spokes.
llama.cpp server produces garbage with the same GGUF. The discrepancy
is an inference engine bug, not a training failure. GBNF grammar was
never tested through a working path. Verdict suspended pending
llama.cpp debugging and spokes + GBNF evaluation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Type-filtered queries (e.g. type:"handoff") now use stronger recency
scoring: weight 0.5 with 7-day half-life (vs general 0.2/30-day).
When you filter by type, you've already constrained relevance — recency
should dominate. Also adds Content field to check_memory output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@CalebisGross merged commit 86673ea into main Apr 11, 2026
@CalebisGross deleted the feat/gemma-e2b-spokes branch April 11, 2026 13:22


Development

Successfully merging this pull request may close these issues.

Recall fails to surface most recent handoffs
