
fix: training crash loop safety + episode synthesis quality #407

Merged

CalebisGross merged 1 commit into main from fix/training-safety-and-episode-echo on Apr 14, 2026

Conversation

@CalebisGross
Collaborator

Summary

  • Root cause fix: MarkExperienceUsedInTraining() was never called — untrained count never dropped, causing every dreaming cycle to re-trigger training (21 failed runs overnight, crashing the PC)
  • Training safety: e-stop file, circuit breaker (3 failures), 24h cooldown, VRAM pre-flight check, seq_len 2048→1024, OOM abort after 3 consecutive errors
  • Episode synthesis rewrite: simplified 4-field schema, directive TASK/RULES/SCHEMA prompt (no placeholder values), logit confidence validation with heuristic fallback, event budget truncation

Test plan

  • go test ./... — all pass
  • golangci-lint run — 0 issues
  • go vet ./... — clean
  • Daemon rebuilt with embedded LLM, running healthy
  • Wait for episode to close, verify real title/summary on dashboard
  • Create ~/.mnemonic/training.disabled, verify auto-trigger skipped in logs
  • Verify train_model MCP tool respects e-stop file

Closes #391

🤖 Generated with Claude Code

The Phase C continuous learning pipeline ran 21 failed GPU training cycles
overnight, crashing the PC repeatedly. Root cause: MarkExperienceUsedInTraining
was never called, so the untrained count never dropped and every dreaming cycle
re-triggered training on the same data. This compounded with Gemma 4 E2B at
seq_len=2048 exceeding the 16GB VRAM budget.
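The crash loop above reduces to one missing call. A minimal sketch of the failure mode and fix, with a hypothetical in-memory store standing in for the real persistence layer:

```go
package main

import "fmt"

type Experience struct {
	ID             int64
	UsedInTraining bool
}

type Store struct{ experiences map[int64]*Experience }

// MarkExperienceUsedInTraining flags one experience as consumed by training.
func (s *Store) MarkExperienceUsedInTraining(id int64) {
	if e, ok := s.experiences[id]; ok {
		e.UsedInTraining = true
	}
}

// UntrainedCount is what the dreaming cycle consults to decide whether
// to trigger a training run.
func (s *Store) UntrainedCount() int {
	n := 0
	for _, e := range s.experiences {
		if !e.UsedInTraining {
			n++
		}
	}
	return n
}

// assembleBatch collects untrained experiences and, crucially, marks them
// used. Omitting the marked line is the bug: UntrainedCount never drops,
// so every cycle re-trains on the same data.
func assembleBatch(s *Store) []*Experience {
	var batch []*Experience
	for _, e := range s.experiences {
		if !e.UsedInTraining {
			batch = append(batch, e)
			s.MarkExperienceUsedInTraining(e.ID) // the previously missing call
		}
	}
	return batch
}

func main() {
	s := &Store{experiences: map[int64]*Experience{1: {ID: 1}, 2: {ID: 2}}}
	assembleBatch(s)
	fmt.Println(s.UntrainedCount()) // drops to 0; no re-trigger next cycle
}
```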

Training pipeline fixes:
- Call MarkExperienceUsedInTraining after batch assembly (root cause)
- Add e-stop sentinel file (~/.mnemonic/training.disabled)
- Circuit breaker: 3 consecutive failures disables auto-trigger
- Cooldown: 24h wait after failed run before retrying
- Fix circuit breaker SQL to count stale "requested" runs as failures
- Reduce seq_len 2048→1024 in continuous_train.sh (VRAM ~15.4GB→~8-9GB)
- Add VRAM pre-flight check (abort if >1GB still in use after daemon stop)
- Lower OOM abort threshold to 3 consecutive errors in train_spokes.py
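The VRAM pre-flight check from the list above can be sketched in shell. This is an assumption-laden sketch, not the actual continuous_train.sh: variable names and the log message are hypothetical, and it assumes an NVIDIA GPU where `nvidia-smi --query-gpu=memory.used` reports used VRAM in MiB.

```shell
# Abort the training run if, after stopping the daemon, more than
# ~1 GiB of VRAM is still allocated (e.g. by a stuck process).

vram_preflight_ok() {
    # $1: VRAM currently in use, in MiB
    [ "${1:-0}" -le 1024 ]
}

used_mib=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits 2>/dev/null | head -n1)
if ! vram_preflight_ok "$used_mib"; then
    echo "pre-flight: ${used_mib} MiB VRAM still in use after daemon stop; aborting" >&2
    exit 1
fi
```

Checking before the run, rather than relying on OOM aborts mid-training, avoids burning a failed cycle (and a cooldown window) on a predictably doomed start.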

Episode synthesis fixes:
- Simplify LLM schema from 7 fields to 4 (title, summary, concepts, salience)
- Rewrite prompt: directive TASK/RULES/SCHEMA format, no placeholder values
- Add logit confidence validation (MeanProb < 0.10 falls back to heuristic)
- Add heuristic enrichment for emotional_tone, outcome, narrative
- Add event budget truncation (bookend strategy, 6000 char limit)
- Update GBNF grammar for 4-field schema with constrained salience
- Reduce temperature 0.3→0.2, max_tokens 1024→256

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@CalebisGross CalebisGross merged commit f0a8a95 into main Apr 14, 2026
@CalebisGross CalebisGross deleted the fix/training-safety-and-episode-echo branch April 14, 2026 13:15


Development

Successfully merging this pull request may close these issues.

Continuous learning: encoding model that improves from operational experience
