Merged
31 changes: 7 additions & 24 deletions .claude/rules/code-quality.md
@@ -3,30 +3,13 @@
## Finish the Work First

- The goal is working, tested code — not a PR. PRs are the delivery mechanism, not the deliverable.
- Do NOT rush to commit, push, or open a PR until the actual work is done and verified.
- A task is not done when the code compiles. It's done when it runs correctly and is tested.
- Stay in the problem. If something is "working," verify it properly before moving to git/PR mechanics.
- Only create a PR when explicitly asked, or when the work is genuinely complete and tested.
- Never treat "open a PR" as a natural next step — wait for the user to decide when the work is ready.
- Do NOT rush to commit/push/PR until work is done and verified. A task is done when it runs correctly and is tested, not when it compiles.
- Only create a PR when explicitly asked, or when work is genuinely complete and tested.

## Scope

- Only change what was asked for — don't touch surrounding code
- If you spot something worth fixing but it wasn't requested, call it out instead of silently doing it
- No drive-by refactors, no "while I'm here" improvements
- One logical change per task — don't bundle unrelated fixes

## Change Safety

- Read before edit — always understand a file before modifying it
- Build and test after changes, don't assume it works
- No new dependencies without discussing it first
- Don't delete code you don't fully understand

## Review Mindset

- Don't add comments, docstrings, or type annotations to untouched code
- Don't rename things that aren't part of the task
- Don't "improve" error messages or formatting in adjacent code
- Keep PRs reviewable — small, focused diffs
- If a change is getting large, pause and check in with the user
- Only change what was asked for. No drive-by refactors, no "while I'm here" improvements.
- If you spot something worth fixing, call it out — don't silently do it.
- One logical change per task. Keep PRs reviewable — small, focused diffs.
- No new dependencies without discussing it first.
- If a change is getting large, pause and check in with the user.
69 changes: 0 additions & 69 deletions .claude/rules/experiment-logging.md

This file was deleted.

46 changes: 9 additions & 37 deletions .claude/rules/git-safety.md
@@ -2,48 +2,20 @@

## Branch Workflow

- Remote: `origin` (https://github.com/appsprout-dev/mnemonic.git)
- Primary branch: `main`
- **All new work starts on a feature branch** — never commit directly to `main`
- Branch naming: `feat/<description>`, `fix/<description>`
- Remote: `origin` (https://github.com/appsprout-dev/mnemonic.git), primary branch: `main`
- All new work on feature branches (`feat/<desc>`, `fix/<desc>`) — never commit directly to `main`
- Before branching: `git stash` (if dirty), `git pull origin main`, then `git checkout -b <branch>`
- **Before committing:** Run `git branch --show-current` to verify you're on the intended branch. Bash tool does not persist shell state — a prior `git checkout` may not have taken effect.
- **All changes go through a PR** — push the branch, open a PR with `gh pr create`, get it reviewed
- **Closing issues:** When a PR resolves a GitHub issue, comment on the issue with a reference to the PR before or after closing it. Never close issues silently.
- No blind commits to main, no YOLO pushes
- **Before committing:** `git branch --show-current` to verify — Bash tool doesn't persist shell state
- All changes go through PRs (`gh pr create`). When a PR resolves an issue, comment on the issue with a PR reference.
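
The workflow above, sketched end-to-end. This is a runnable sketch against a throwaway repo, not the real clone: the branch name is illustrative, and the `git pull` step is commented out because the temp repo has no `origin` remote.

```shell
# Sketch of the branch workflow, exercised in a throwaway repo so it runs
# as-is. Branch name "feat/memory-source-tracking" is illustrative.
set -e
cd "$(mktemp -d)"
git init -q -b main
git config user.email dev@example.com
git config user.name Dev
git commit -q --allow-empty -m "chore: initial commit"

# In the real repo, sync main first:
#   git stash              # only if the working tree is dirty
#   git pull origin main
git checkout -q -b feat/memory-source-tracking

# Verify before committing -- the Bash tool does not persist shell state,
# so an earlier checkout may not have taken effect.
git branch --show-current   # prints: feat/memory-source-tracking
```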

## Forbidden Operations
## Forbidden (enforced by hooks)

Enforced by `.claude/hooks/protect-git.sh` and `.claude/hooks/no-secrets.sh`:

- `git push --force` / `git push -f` -- destroys remote history
- `git reset --hard` -- destroys local changes
- `git clean -f` -- permanently deletes untracked files
- `git checkout .` / `git restore .` -- discards all unstaged changes
- Staging `.env`, `credentials`, `*.db`, `settings.local.json`
`.claude/hooks/protect-git.sh` and `.claude/hooks/no-secrets.sh` block: force push, `reset --hard`, `clean -f`, `checkout .`/`restore .`, staging `.env`/`credentials`/`*.db`/`settings.local.json`.
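
For illustration only — a minimal sketch of the kind of pattern check such a hook might perform. The real `protect-git.sh` and `no-secrets.sh` may work differently; the patterns here are deliberately coarse.

```shell
# Hypothetical guard in the spirit of protect-git.sh -- not the actual hook.
# Returns 0 (success) when the candidate command matches a forbidden pattern.
is_forbidden() {
  case "$1" in
    *"push --force"*|*"push -f"*|*"reset --hard"*|\
    *"clean -f"*|*"checkout ."|*"restore .")
      return 0 ;;
    *) return 1 ;;
  esac
}

is_forbidden "git push --force origin main" && echo "blocked"  # prints: blocked
is_forbidden "git commit -m 'fix: typo'"    || echo "allowed"  # prints: allowed
```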

## Commit Messages (Conventional Commits)

Use [Conventional Commits](https://www.conventionalcommits.org/) format — release-please uses these to auto-generate changelogs and version bumps:

- `feat: add memory source tracking` — new feature (bumps minor)
- `fix: prevent nil pointer in retrieval` — bug fix (bumps patch)
- `docs: update README with Gemini setup` — documentation only
- `refactor: simplify consolidation loop` — code change, no behavior change
- `test: add encoding agent coverage` — tests only
- `chore: update dependencies` — maintenance
- `ci: fix release workflow runner` — CI/CD changes

Rules:

- Short, direct subject line describing the change
- Body for context when non-obvious
- No issue-closing keywords in commit messages unless explicitly asked
- Use Co-Authored-By for Claude contributions
- Append `!` after the type for breaking changes: `feat!: redesign store interface`
Format: `type: description` — release-please uses these for changelogs/version bumps.

## Secrets
Types: `feat` (minor), `fix` (patch), `docs`, `refactor`, `test`, `chore`, `ci`. Append `!` for breaking changes.

- `settings.local.json` contains machine-specific permissions -- NEVER commit
- `*.db` files contain user data -- gitignored
- Never include API tokens in commit messages or code
Rules: short subject, body when non-obvious, no issue-closing keywords unless asked, Co-Authored-By for Claude, `settings.local.json` and `*.db` never committed.
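
A commit shaped by these rules might look like the following. The message content is illustrative, and the sketch runs in a throwaway repo so it is self-contained.

```shell
# Illustrative breaking-change commit with a body and Co-Authored-By trailer.
set -e
cd "$(mktemp -d)"
git init -q -b main
git config user.email dev@example.com
git config user.name Dev
git commit -q --allow-empty \
  -m "feat!: redesign store interface" \
  -m "Callers must now pass a context handle as the first argument." \
  -m "Co-Authored-By: Claude <noreply@anthropic.com>"
git log -1 --format=%s   # prints: feat!: redesign store interface
```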
29 changes: 0 additions & 29 deletions .claude/rules/peer-review-standard.md

This file was deleted.

69 changes: 69 additions & 0 deletions .claude/rules/research-standards.md
@@ -0,0 +1,69 @@
# Research Standards

This project is under review by Aaron Gokaslan and Andrej Karpathy. All work must meet the standard of a published research project. Follow the scientific method — every experiment is a test of a hypothesis, not a fishing expedition.

## Core Principles

- **Let the data decide.** Judge results by numbers, not desire. No reinterpreting negative results as "needs more training" without evidence.
- **No motivated reasoning.** Report the number you got, not the number you wanted. Negative results get the same documentation quality.
- **Actively disprove.** After a positive result: could this be an LR artifact? Param count mismatch? Training duration effect? Random seed noise? Not confirmed until alternatives are ruled out.
- **Reproducibility.** Every result must be reproducible from the registry entry alone — exact commands, configs, hardware, data paths.

## Pre-Registration (BEFORE any training or sweep)

Create an entry in `training/docs/experiment_registry.md`:

```markdown
### EXP-{number}: {name}
- **Date:** {YYYY-MM-DD}
- **Status:** REGISTERED | RUNNING | COMPLETED | FAILED
- **Hypothesis:** {What you expect and why}
- **Variable:** {The ONE thing changed vs control}
- **Control:** {Comparison target with its result}
- **Prediction:** {Quantitative — e.g., "expect LR 1e-3 to beat 6e-4 by 5-10%"}
- **Config:** {model, HP, hardware, data}
- **Result:** {filled after run}
- **Verdict:** CONFIRMED | REFUTED | INCONCLUSIVE
- **Analysis:** {What happened, why, what it means}
```

## After Every Run

1. Record result in registry (Status -> COMPLETED)
2. Compare to prediction — was your mental model right?
3. Positive result: list alternative explanations, which are ruled out
4. Negative result: what does this tell us about config/architecture?
5. Update `training/sweep_results.tsv` with raw numbers
6. Update `training/docs/experiments.md` with analysis paragraph (not bullet points)
7. If result changes prior conclusions, update those entries too
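
Step 5 in practice might look like the following — a hedged sketch run in a temp directory. The TSV column names and the row values are assumptions for illustration; match the file's actual header.

```shell
# Hypothetical append of one raw-numbers row to the sweep results TSV.
# Column names and values are illustrative, not real results.
set -e
cd "$(mktemp -d)"
mkdir -p training
tsv="training/sweep_results.tsv"
[ -f "$tsv" ] || printf 'exp\tlr\tsteps\tloss\tppl\n' > "$tsv"
printf 'EXP-042\t1e-3\t2000\t3.412\t30.33\n' >> "$tsv"
tail -1 "$tsv"   # prints the row just appended
```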

## Experiment Document Structure

`training/docs/experiments.md` follows: Overview, Experimental Protocol, Baselines, HP Sweep Results (by variable), Pretraining Runs, Planned Experiments, Summary.

Every entry needs: header line (name/date/config/hardware), control and variable, results table (loss and PPL at a minimum), analysis paragraph (quantitative, mechanistic, implications). Sweep phases get a combined table + cross-group analysis.

## Benchmark Logging

Benchmarks require: exact command, software state (commit hash, version, config), environment (hardware, provider, model), ALL metrics, comparison context (baseline and target).

## Evaluation Protocol

Standard budgets (RX 7800 XT): short test 1K-2K optimizer steps, full sweep 4K+ micro-steps, full pretrain ~400K micro-steps.

Metrics — Training: loss, PPL, tokens/sec, VRAM peak (report ALL). Quality: nDCG@5 (primary), Precision@5, Recall@5, MRR, JSON compliance, latency.

## Claims Bar

- "Doesn't hallucinate" requires evidence. "X > Y" requires controlled comparison on matched conditions.
- Fabrication rate of 10% is not "low." 25 test inputs is a pilot, not proof.
- Look at actual outputs, not just aggregates.
- Code: no dead code, scripts self-documenting, pipelines run from clean checkout, evals deterministic.

## Red Flags

- Running without a hypothesis -> pre-register first
- 3+ experiments all confirmed -> testing hard enough?
- Comparing across different LRs/steps/batch sizes -> unfair
- Explaining away negatives -> data is probably right
- Registry config drifts from actual run -> update immediately
116 changes: 0 additions & 116 deletions .claude/rules/scientific-method.md

This file was deleted.
