╔══════════════════════════════════════════════════════════════════════════════╗
║ ║
║ ███╗ ███╗██╗ ██╗████████╗██╗ ██╗ ██████╗ ███████╗ ║
║ ████╗ ████║╚██╗ ██╔╝╚══██╔══╝██║ ██║██╔═══██╗██╔════╝ ║
║ ██╔████╔██║ ╚████╔╝ ██║ ███████║██║ ██║███████╗ ║
║ ██║╚██╔╝██║ ╚██╔╝ ██║ ██╔══██║██║ ██║╚════██║ ║
║ ██║ ╚═╝ ██║ ██║ ██║ ██║ ██║╚██████╔╝███████║ ║
║ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═════╝ ╚══════╝ ║
║ ║
║ OpenMythos — recurrent-depth transformer, fixed. ║
║ ║
║ v0.2.0-improved ║
║ ║
║ @QBe1n (fork of @kyegomez) ║
║ ║
║ ⣠⣴⣶⣶⣤⡀ ║
║ ⢀⣴⡟⠋⠁ ⠙⢿⣦⡀ ┌─────────────────────────┐ ║
║ ⢠⣾⠋⠁ ⠈⠻⣷⡄ │ Loop the block. │ ║
║ ⣰⡟⠁ ⣀⣤⣤⣀ ⠘⢿⣆ │ Halt when ready. │ ║
║ ⢠⡿⠁ ⢠⣾⠟⠛⠛⠻⣷⡄ ⠈⢿⡄ │ Extrapolate depth. │ ║
║ ⢸⡇ ⢠⡿⠋ ⠙⢿⣆ ⢸⡇ │ │ ║
║ ⢸⡇ ⢸⡇ ● ⢸⡇ ⢸⡇ │ ρ(A) ≤ 1 by design. │ ║
║ ⢸⡇ ⠸⣧ ⣼⠇ ⢸⡇ └─────────────────────────┘ ║
║ ⠸⣧ ⠻⣦⣄⣀⣠⣴⠟ ⣼⠇ ║
║ ⠹⣦⡀ ⠉⠉⠉ ⢀⣴⠏ ║
║ ⠙⢷⣤⡀ ⢀⣤⡶⠋ ║
║ ⠙⠻⢶⣤⣄⣀⣠⣤⡶⠟⠁ ║
║ ⠉⠛⠛⠛⠉ ║
║ ║
╚══════════════════════════════════════════════════════════════════════════════╝
Author: kubalskiy / QBEin (@QBe1n)
Upstream: kyegomez/OpenMythos by Kye Gomez
License: MIT (see LICENSE)
Repository: https://github.com/QBe1n/OpenMythos
Status: 73/73 tests passing. Trainable. MoE 2.94× faster. ACT actually halts.
OpenMythos is an open-source theoretical reconstruction of the Claude Mythos
architecture — a Recurrent-Depth Transformer (RDT) that loops a shared
middle block T times between a Prelude and a Coda, halts adaptively per
token, and supports depth extrapolation: train with 4 loops, run
inference with 16.
This fork takes the original reference implementation and makes it actually work end-to-end: fixes correctness bugs, modernizes kernels, and adds a training loop that proves the architecture learns.
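The control flow in a nutshell, as a minimal sketch with stand-in layers (`prelude`, `recurrent`, and `coda` below are plain placeholder modules, not the classes in `open_mythos/main.py`); the real model adds ACT halting, per-loop LoRA adapters, loop-index embeddings, and the MoE FFN on top of this loop:

```python
import torch
import torch.nn as nn

dim, vocab = 64, 100

# Stand-in blocks; in the real model these are transformer stacks, not single layers.
prelude   = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, dim))
recurrent = nn.Linear(2 * dim, dim)      # one shared block, reused every loop
coda      = nn.Linear(dim, vocab)        # decodes the final state to logits

def forward(ids: torch.Tensor, n_loops: int) -> torch.Tensor:
    e = prelude(ids)                     # encode once; e is re-injected every loop
    h = torch.zeros_like(e)              # recurrent state
    for _ in range(n_loops):             # the same weights, looped T times
        h = recurrent(torch.cat([h, e], dim=-1))
    return coda(h)                       # logits

ids = torch.randint(0, vocab, (2, 16))
print(forward(ids, n_loops=4).shape)     # depth used at training time
print(forward(ids, n_loops=16).shape)    # depth extrapolation: same weights, more loops
```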
Core building blocks, implemented from first principles:
- Prelude — 2 dense transformer blocks encoding input tokens.
- Recurrent block — shared weights, looped `T` times at inference.
- Coda — 2 dense transformer blocks decoding to logits.
- MLA or GQA attention — switchable; MLA uses DeepSeek-V2-style low-rank KV compression.
- Fine-grained MoE — 64 routed + 2 shared experts, top-K routing, DeepSeekMoE load balancing.
- ACT halting — per-token adaptive computation time, Graves 2016 remainder formulation (see the sketch after this list).
- LTI injection — stability-guaranteed state update with `ρ(A) ≤ 1` by construction.
- LoRA adapters — depth-wise, one per loop step, low-rank on top of the shared block.
- Loop-index embeddings — sinusoidal positional signal for the current loop iteration.
- RoPE — rotary position embeddings on the sequence axis.
- Scatter-based MoE dispatch — sort-and-group, 2.94× faster than the naive loop.
- Training loop — CE + annealed ponder cost + MoE aux loss, cosine LR, grad clip.
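To make the halting item concrete, here is a minimal sketch of the Graves (2016) remainder formulation it refers to: halting probabilities accumulate per position, the step that crosses the threshold takes the remainder, already-halted positions get weight zero, and the per-step states are mixed with those weights so they form a convex combination summing to 1. All names and shapes (`act_combine`, `threshold`, the `(T, B, S, D)` layout) are illustrative, not the API in `open_mythos/main.py`.

```python
import torch

# Illustrative sketch of Graves-2016-style halting, not the module's implementation.
def act_combine(states: torch.Tensor, halt_logits: torch.Tensor, threshold: float = 0.99):
    """states: (T, B, S, D) state after each loop; halt_logits: (T, B, S) raw scores.
    Returns the mixed state (B, S, D) and the expected loop count per position (B, S)."""
    T = states.shape[0]
    p = torch.sigmoid(halt_logits)                    # halting probability per step
    weights = torch.zeros_like(p)
    cum = torch.zeros_like(p[0])                      # accumulated halting mass
    running = torch.ones_like(cum, dtype=torch.bool)
    n_updates = torch.zeros_like(cum)                 # expected loops per position

    for t in range(T):
        crosses = cum + p[t] > threshold
        if t == T - 1:                                # force everyone to halt at the last step
            crosses = torch.ones_like(crosses)
        halts_now = running & crosses
        w = torch.where(halts_now, 1.0 - cum, p[t])   # halting step takes the remainder
        w = torch.where(running, w, torch.zeros_like(w))  # halted positions: no further update
        weights[t] = w
        cum = cum + w
        n_updates = n_updates + running.float()
        running = running & ~halts_now

    mixed = (weights.unsqueeze(-1) * states).sum(dim=0)   # convex combination, weights sum to 1
    return mixed, n_updates

mixed, loops_used = act_combine(torch.randn(4, 2, 8, 16), torch.randn(4, 2, 8))
print(mixed.shape, loops_used.shape)                  # (2, 8, 16), (2, 8)
```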
Disclaimer: it's a theoretical reconstruction, and I made it less broken. This is not Claude. It's not even close to Claude. It's a reference implementation of what the architecture could look like based on public research. The original fork had a fake load-balancer, broken ACT math, and a test that failed on its own headline claim. I fixed those. That's the whole pitch.
- MoE dispatch rewritten — nested `O(topk × n_experts)` Python loop replaced with a single scatter-sort (see the sketch below). 2.94× faster at 64 experts, topk=4.
- Router load-balancing actually works — `update_router_bias()` per DeepSeekMoE Eq. 17 + auxiliary load-balance loss. Previously the bias was a buffer that nothing updated.
- ACT halting fixed — halted positions no longer receive updates; output is a proper convex combination summing to 1. Per-position ponder cost exposed for training.
- Attention modernized — `F.scaled_dot_product_attention` for both GQA and MLA paths. Flash attention on GPU, tuned kernel on CPU.
- Loop embeddings cached — precomputed once per `(n_loops, device, dtype)`, not rebuilt inside the hot loop.
- Stability test corrected — the ZOH construction bounds `ρ(A)` in `(0, 1]`, not `(0, 1)`. Test now asserts the mathematically true bound.
- Training loop — `train.py` runs on Dyck-1 depth, demonstrates depth extrapolation (more inference loops → higher OOD accuracy).
- New tests — 7 added (73/73 passing, up from 66/67 in upstream).
- Dead code removed — `example.py` now runs both GQA and MLA branches.
See CHANGELOG.md for details and before/after benchmarks.
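For the dispatch rewrite specifically, the underlying pattern is standard sort-and-group routing: flatten the (token, expert) assignments, sort them by expert id so each expert's tokens form one contiguous slice, run every expert once on its slice, and scatter the weighted outputs back with `index_add_`. The sketch below is a minimal illustration of that pattern with toy linear experts and plain softmax top-k weights; it is not the `MoEFFN` code itself (which adds shared experts and DeepSeekMoE-style balancing).

```python
import torch
import torch.nn as nn

def scatter_dispatch(x, router_logits, experts, topk=2):
    """Sort-and-group MoE dispatch: each expert runs once on a contiguous slice of
    its tokens, instead of a nested (topk x n_experts) Python loop."""
    n_tokens, _ = x.shape
    n_experts = router_logits.shape[1]

    weights, expert_ids = router_logits.softmax(dim=-1).topk(topk, dim=-1)   # (N, topk)
    flat_expert = expert_ids.reshape(-1)                                      # (N*topk,)
    flat_token = torch.arange(n_tokens, device=x.device).repeat_interleave(topk)
    flat_weight = weights.reshape(-1, 1)

    order = flat_expert.argsort()                     # group assignments by expert id
    flat_expert = flat_expert[order]
    flat_token = flat_token[order]
    flat_weight = flat_weight[order]
    counts = torch.bincount(flat_expert, minlength=n_experts).tolist()

    out = torch.zeros_like(x)
    start = 0
    for eid, count in enumerate(counts):              # one contiguous slice per expert
        if count == 0:
            continue
        tok = flat_token[start:start + count]
        out.index_add_(0, tok, experts[eid](x[tok]) * flat_weight[start:start + count])
        start += count
    return out

experts = nn.ModuleList([nn.Linear(32, 32) for _ in range(8)])
x = torch.randn(64, 32)
print(scatter_dispatch(x, torch.randn(64, 8), experts, topk=2).shape)   # (64, 32)
```

The per-expert `index_add_` loop at the bottom is the part the contributing list further down flags as still CPU-bound.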
# 1. Clone
git clone https://github.com/QBe1n/OpenMythos.git
cd OpenMythos
# 2. Install (editable)
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install -e .
# 3. Run the example (builds GQA + MLA models, forwards, generates)
python example.py
# 4. Run the tests
pip install pytest
python -m pytest test_main.py -q
# 5. Train on the toy task
python train.py --steps 2000 --log-every 200

import torch
from open_mythos.main import MythosConfig, OpenMythos, MoEFFN  # MoEFFN used in the hooks below
cfg = MythosConfig(
vocab_size=1000, dim=256, n_heads=8, n_kv_heads=2,
max_seq_len=128, max_loop_iters=4,
prelude_layers=1, coda_layers=1,
n_experts=8, n_shared_experts=1, n_experts_per_tok=2,
expert_dim=64, lora_rank=8, attn_type="gqa",
)
model = OpenMythos(cfg)
ids = torch.randint(0, cfg.vocab_size, (2, 16))
logits = model(ids, n_loops=4) # (2, 16, 1000)
out = model.generate(ids, max_new_tokens=8, n_loops=8)
# Depth extrapolation: trained at n_loops=4, run inference with n_loops=16
logits_deep = model(ids, n_loops=16)

# After any forward pass:
model.recurrent.last_ponder_cost # (B, T) — expected loops per position
for mod in model.modules():
    if isinstance(mod, MoEFFN):
        mod.last_aux_loss          # scalar — add to training loss
        mod.last_expert_load       # (n_experts,) — token counts
        mod.update_router_bias()   # call once per step
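Putting those hooks together, a single training step looks roughly like the sketch below, which continues the snippet above (`model`, `cfg`, `MoEFFN` already defined): cross-entropy on the logits, plus a weighted ponder cost, plus each MoE layer's auxiliary loss, with `update_router_bias()` called once per step after the backward pass. The loss coefficients and the optimizer choice are illustrative assumptions; `train.py` anneals the ponder coefficient and sets the real hyperparameters.

```python
import torch.nn.functional as F

# Illustrative training step; coefficients are assumptions, see train.py for the real ones.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
ponder_weight, aux_weight = 1e-2, 1e-2

def train_step(ids, targets, n_loops=4):
    logits = model(ids, n_loops=n_loops)
    loss = F.cross_entropy(logits.reshape(-1, cfg.vocab_size), targets.reshape(-1))
    loss = loss + ponder_weight * model.recurrent.last_ponder_cost.mean()
    for mod in model.modules():
        if isinstance(mod, MoEFFN):
            loss = loss + aux_weight * mod.last_aux_loss

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

    for mod in model.modules():        # gradient-free router-bias update, once per step
        if isinstance(mod, MoEFFN):
            mod.update_router_bias()
    return loss.item()
```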
   input tokens
        │
        ▼
┌───────────────┐
│    Prelude    │   2× dense transformer blocks
└───────┬───────┘
        │  e (frozen, injected each loop)
        ▼
┌───────────────┐ ◀─┐
│   Recurrent   │   │   ×T loops (shared weights)
│     block     │   │   per-loop LoRA adapter
└───────┬───────┘   │   LTI injection: h ← A·h + B·e + out
        │           │   ACT halting: per-token
        └───────────┘
        │
        ▼
┌───────────────┐
│     Coda      │   2× dense transformer blocks
└───────┬───────┘
        ▼
 RMSNorm + LM head
        │
        ▼
  output logits
- Attention: GQA (small, fast) or MLA (DeepSeek-V2 low-rank KV compression).
- FFN: fine-grained MoE with scatter dispatch. Shared experts always fire.
- Halting: ACT on the recurrent block only. Prelude and Coda are dense.
- Stability: `A = exp(-exp(log_dt + log_A))` guarantees `ρ(A) ∈ (0, 1]` for any parameter values.
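A quick numeric sanity check of that bound (a sketch that treats `A` elementwise, i.e. as if `A` were diagonal — an assumption about the parameterization, not a statement about the module's internals):

```python
import torch

z = torch.linspace(-30.0, 30.0, 6001)   # stands in for log_dt + log_A, pushed far out of range
A = torch.exp(-torch.exp(z))
# exp(z) > 0, so A = exp(-exp(z)) is always in (0, 1]. In float32 the upper end rounds to
# exactly 1.0 for very negative z (which is why the corrected test asserts (0, 1] rather
# than (0, 1)), and the lower end underflows toward 0 for very positive z.
assert torch.all(A >= 0) and torch.all(A <= 1)
print(A.min().item(), A.max().item())    # ~0.0 and 1.0
```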
See: CHANGELOG.md for what changed vs upstream, docs/open_mythos.md for the theory.
| Benchmark (CPU, 2 threads) | Upstream | This fork | Speedup |
|---|---|---|---|
| Small fwd (B=4, T=32, 1.8M) | 23.1 ms | 17.6 ms | 1.31× |
| Small training step | 75.3 ms | 54.0 ms | 1.39× |
| MoE dispatch, 64 experts, topk=4 | 36.5 ms | 12.4 ms | 2.94× |
| Tests passing | 66 / 67 | 73 / 73 | — |
ACT early-halt verified empirically: n_loops=8 and n_loops=16 cost the
same wall-clock as n_loops=4 once positions halt.
| Component | Upstream | This fork |
|---|---|---|
| LTI stability guarantee | ❌ Test asserted the wrong bound (failing) | ✅ ρ(A) ≤ 1 verified |
| ACT halting math | ❌ Halted positions kept updating | ✅ Proper convex combination |
| Ponder cost exposure | ❌ Not exposed | ✅ last_ponder_cost available |
| MoE load balancing | ❌ Buffer never updated | ✅ update_router_bias() + aux loss |
| MoE aux loss | ❌ Not computed | ✅ last_aux_loss exposed |
| Training loop | ❌ Absent | ✅ train.py with CE + ponder + aux |
| MLA attention kernel | ❌ No SDPA/flash path | ✅ F.scaled_dot_product_attention |
| GQA attention kernel | ❌ No SDPA/flash path | ✅ F.scaled_dot_product_attention |
| Loop-index embeddings | ❌ Rebuilt inside the hot loop | ✅ Precomputed + cached |
| example.py GQA branch | ❌ Dead code | ✅ Both branches run |
open_mythos/
    main.py           # all modules: GQA, MLA, MoE, LoRA, ACT, LTI, RecurrentBlock, OpenMythos
    variants.py       # config presets
    __init__.py
example.py            # build + forward + generate, both attn types
train.py              # end-to-end training on Dyck-1 depth
test_main.py          # 73 tests
CHANGELOG.md          # what this fork changes
docs/
    open_mythos.md    # theoretical background
This is a research toy, not a product. PRs welcome for:
- Depth-extrapolation benchmarks on harder tasks (ListOps, Long Range Arena, modular arithmetic with grokking).
- GPU kernel for the MoE dispatch — the current scatter is still CPU-bound on the per-expert `index_add_` loop.
- Proper checkpoint I/O — save/load is unimplemented.
- Tokenizer — everything currently assumes token IDs are already integers.
- Your idea here — the architecture has a lot of headroom.
Open issues or ping @QBe1n.
MIT License — Copyright (c) 2026 kubalskiy / QBEin
Original work Copyright (c) 2026 Kye Gomez. See LICENSE for full text.
- Issues: https://github.com/QBe1n/OpenMythos/issues
- Upstream: https://github.com/kyegomez/OpenMythos
- Author: @QBe1n
- Changelog: CHANGELOG.md
Disclaimer: OpenMythos is an independent community reconstruction based solely on public research. Not affiliated with Anthropic. The name "Claude Mythos" refers to the rumored architecture described in community speculation, not any shipping product.