
fix: training crash loop safety + episode synthesis quality #407

Merged

CalebisGross merged 1 commit into main from fix/training-safety-and-episode-echo on Apr 14, 2026

Conversation

@CalebisGross
Collaborator

Summary

  • Root cause fix: MarkExperienceUsedInTraining() was never called — untrained count never dropped, causing every dreaming cycle to re-trigger training (21 failed runs overnight, crashing the PC)
  • Training safety: e-stop file, circuit breaker (3 failures), 24h cooldown, VRAM pre-flight check, seq_len 2048→1024, OOM abort after 3 consecutive errors
  • Episode synthesis rewrite: simplified 4-field schema, directive TASK/RULES/SCHEMA prompt (no placeholder values), logit confidence validation with heuristic fallback, event budget truncation

Test plan

  • go test ./... — all pass
  • golangci-lint run — 0 issues
  • go vet ./... — clean
  • Daemon rebuilt with embedded LLM, running healthy
  • Wait for episode to close, verify real title/summary on dashboard
  • Create ~/.mnemonic/training.disabled, verify auto-trigger skipped in logs
  • Verify train_model MCP tool respects e-stop file

Closes #391

🤖 Generated with Claude Code

The Phase C continuous learning pipeline ran 21 failed GPU training cycles
overnight, crashing the PC repeatedly. Root cause: MarkExperienceUsedInTraining
was never called, so the untrained count never dropped and every dreaming cycle
re-triggered training on the same data. This compounded with Gemma 4 E2B at
seq_len=2048 exceeding the 16GB VRAM budget.
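The crash loop above reduces to one missing call. A minimal sketch of the failure mode and fix, with a hypothetical in-memory store standing in for the real persistence layer:

```go
package main

import "fmt"

type Experience struct {
	ID             int64
	UsedInTraining bool
}

type Store struct{ experiences map[int64]*Experience }

// MarkExperienceUsedInTraining flags one experience as consumed by training.
func (s *Store) MarkExperienceUsedInTraining(id int64) {
	if e, ok := s.experiences[id]; ok {
		e.UsedInTraining = true
	}
}

// UntrainedCount is what the dreaming cycle consults to decide whether
// to trigger a training run.
func (s *Store) UntrainedCount() int {
	n := 0
	for _, e := range s.experiences {
		if !e.UsedInTraining {
			n++
		}
	}
	return n
}

// assembleBatch collects untrained experiences and, crucially, marks them
// used. Omitting the marked line is the bug: UntrainedCount never drops,
// so every cycle re-trains on the same data.
func assembleBatch(s *Store) []*Experience {
	var batch []*Experience
	for _, e := range s.experiences {
		if !e.UsedInTraining {
			batch = append(batch, e)
			s.MarkExperienceUsedInTraining(e.ID) // the previously missing call
		}
	}
	return batch
}

func main() {
	s := &Store{experiences: map[int64]*Experience{1: {ID: 1}, 2: {ID: 2}}}
	assembleBatch(s)
	fmt.Println(s.UntrainedCount()) // drops to 0; no re-trigger next cycle
}
```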

Training pipeline fixes:
- Call MarkExperienceUsedInTraining after batch assembly (root cause)
- Add e-stop sentinel file (~/.mnemonic/training.disabled)
- Circuit breaker: 3 consecutive failures disables auto-trigger
- Cooldown: 24h wait after failed run before retrying
- Fix circuit breaker SQL to count stale "requested" runs as failures
- Reduce seq_len 2048→1024 in continuous_train.sh (VRAM ~15.4GB→~8-9GB)
- Add VRAM pre-flight check (abort if >1GB still in use after daemon stop)
- Lower OOM abort threshold to 3 consecutive errors in train_spokes.py
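The VRAM pre-flight check from the list above can be sketched in shell. This is an assumption-laden sketch, not the actual continuous_train.sh: variable names and the log message are hypothetical, and it assumes an NVIDIA GPU where `nvidia-smi --query-gpu=memory.used` reports used VRAM in MiB.

```shell
# Abort the training run if, after stopping the daemon, more than
# ~1 GiB of VRAM is still allocated (e.g. by a stuck process).

vram_preflight_ok() {
    # $1: VRAM currently in use, in MiB
    [ "${1:-0}" -le 1024 ]
}

used_mib=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits 2>/dev/null | head -n1)
if ! vram_preflight_ok "$used_mib"; then
    echo "pre-flight: ${used_mib} MiB VRAM still in use after daemon stop; aborting" >&2
    exit 1
fi
```

Checking before the run, rather than relying on OOM aborts mid-training, avoids burning a failed cycle (and a cooldown window) on a predictably doomed start.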

Episode synthesis fixes:
- Simplify LLM schema from 7 fields to 4 (title, summary, concepts, salience)
- Rewrite prompt: directive TASK/RULES/SCHEMA format, no placeholder values
- Add logit confidence validation (MeanProb < 0.10 falls back to heuristic)
- Add heuristic enrichment for emotional_tone, outcome, narrative
- Add event budget truncation (bookend strategy, 6000 char limit)
- Update GBNF grammar for 4-field schema with constrained salience
- Reduce temperature 0.3→0.2, max_tokens 1024→256

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@CalebisGross CalebisGross merged commit f0a8a95 into main Apr 14, 2026
@CalebisGross CalebisGross deleted the fix/training-safety-and-episode-echo branch April 14, 2026 13:15


Development

Successfully merging this pull request may close these issues.

Continuous learning: encoding model that improves from operational experience
