╔══════════════════════════════════════════════════════════════════════════════╗
║ ║
║ ███╗ ███╗██╗ ██╗████████╗██╗ ██╗ ██████╗ ███████╗ ║
║ ████╗ ████║╚██╗ ██╔╝╚══██╔══╝██║ ██║██╔═══██╗██╔════╝ ║
║ ██╔████╔██║ ╚████╔╝ ██║ ███████║██║ ██║███████╗ ║
║ ██║╚██╔╝██║ ╚██╔╝ ██║ ██╔══██║██║ ██║╚════██║ ║
║ ██║ ╚═╝ ██║ ██║ ██║ ██║ ██║╚██████╔╝███████║ ║
║ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═╝ ╚═════╝ ╚══════╝ ║
║ ║
║ OpenMythos — recurrent-depth transformer, fixed. ║
║ ║
║ v0.2.0-improved ║
║ ║
║ @QBe1n (fork of @kyegomez) ║
║ ║
║ ⣠⣴⣶⣶⣤⡀ ║
║ ⢀⣴⡟⠋⠁ ⠙⢿⣦⡀ ┌─────────────────────────┐ ║
║ ⢠⣾⠋⠁ ⠈⠻⣷⡄ │ Loop the block. │ ║
║ ⣰⡟⠁ ⣀⣤⣤⣀ ⠘⢿⣆ │ Halt when ready. │ ║
║ ⢠⡿⠁ ⢠⣾⠟⠛⠛⠻⣷⡄ ⠈⢿⡄ │ Extrapolate depth. │ ║
║ ⢸⡇ ⢠⡿⠋ ⠙⢿⣆ ⢸⡇ │ │ ║
║ ⢸⡇ ⢸⡇ ● ⢸⡇ ⢸⡇ │ ρ(A) ≤ 1 by design. │ ║
║ ⢸⡇ ⠸⣧ ⣼⠇ ⢸⡇ └─────────────────────────┘ ║
║ ⠸⣧ ⠻⣦⣄⣀⣠⣴⠟ ⣼⠇ ║
║ ⠹⣦⡀ ⠉⠉⠉ ⢀⣴⠏ ║
║ ⠙⢷⣤⡀ ⢀⣤⡶⠋ ║
║ ⠙⠻⢶⣤⣄⣀⣠⣤⡶⠟⠁ ║
║ ⠉⠛⠛⠛⠉ ║
║ ║
╚══════════════════════════════════════════════════════════════════════════════╝
Author: kubalskiy / QBEin (@QBe1n)
Upstream: kyegomez/OpenMythos by Kye Gomez
License: MIT (see LICENSE)
Repository: https://github.com/QBe1n/OpenMythos
Status: 73/73 tests passing. Trainable. MoE 2.94× faster. ACT actually halts.
OpenMythos is an open-source theoretical reconstruction of the Claude Mythos
architecture — a Recurrent-Depth Transformer (RDT) that loops a shared
middle block T times between a Prelude and a Coda, halts adaptively per
token, and supports depth extrapolation: train with 4 loops, run
inference with 16.
This fork takes the original reference implementation and makes it actually work end-to-end: fixes correctness bugs, modernizes kernels, and adds a training loop that proves the architecture learns.
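The control flow in a nutshell, as a minimal sketch with stand-in layers (`prelude`, `recurrent`, and `coda` below are plain placeholder modules, not the classes in `open_mythos/main.py`); the real model adds ACT halting, per-loop LoRA adapters, loop-index embeddings, and the MoE FFN on top of this loop:

```python
import torch
import torch.nn as nn

dim, vocab = 64, 100

# Stand-in blocks; in the real model these are transformer stacks, not single layers.
prelude   = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, dim))
recurrent = nn.Linear(2 * dim, dim)      # one shared block, reused every loop
coda      = nn.Linear(dim, vocab)        # decodes the final state to logits

def forward(ids: torch.Tensor, n_loops: int) -> torch.Tensor:
    e = prelude(ids)                     # encode once; e is re-injected every loop
    h = torch.zeros_like(e)              # recurrent state
    for _ in range(n_loops):             # the same weights, looped T times
        h = recurrent(torch.cat([h, e], dim=-1))
    return coda(h)                       # logits

ids = torch.randint(0, vocab, (2, 16))
print(forward(ids, n_loops=4).shape)     # depth used at training time
print(forward(ids, n_loops=16).shape)    # depth extrapolation: same weights, more loops
```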
Core building blocks, implemented from first principles:
- Prelude — 2 dense transformer blocks encoding input tokens.
- Recurrent block — shared weights, looped `T` times at inference.
- Coda — 2 dense transformer blocks decoding to logits.
- MLA or GQA attention — switchable; MLA uses DeepSeek-V2-style low-rank KV compression.
- Fine-grained MoE — 64 routed + 2 shared experts, top-K routing, DeepSeekMoE load balancing.
- ACT halting — per-token adaptive computation time, Graves 2016 remainder formulation (see the sketch after this list).
- LTI injection — stability-guaranteed state update with `ρ(A) ≤ 1` by construction.
- LoRA adapters — depth-wise, one per loop step, low-rank on top of the shared block.
- Loop-index embeddings — sinusoidal positional signal for the current loop iteration.
- RoPE — rotary position embeddings on the sequence axis.
- Scatter-based MoE dispatch — sort-and-group, 2.94× faster than the naive loop.
- Training loop — CE + annealed ponder cost + MoE aux loss, cosine LR, grad clip.
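To make the halting item concrete, here is a minimal sketch of the Graves (2016) remainder formulation it refers to: halting probabilities accumulate per position, the step that crosses the threshold takes the remainder, already-halted positions get weight zero, and the per-step states are mixed with those weights so they form a convex combination summing to 1. All names and shapes (`act_combine`, `threshold`, the `(T, B, S, D)` layout) are illustrative, not the API in `open_mythos/main.py`.

```python
import torch

# Illustrative sketch of Graves-2016-style halting, not the module's implementation.
def act_combine(states: torch.Tensor, halt_logits: torch.Tensor, threshold: float = 0.99):
    """states: (T, B, S, D) state after each loop; halt_logits: (T, B, S) raw scores.
    Returns the mixed state (B, S, D) and the expected loop count per position (B, S)."""
    T = states.shape[0]
    p = torch.sigmoid(halt_logits)                    # halting probability per step
    weights = torch.zeros_like(p)
    cum = torch.zeros_like(p[0])                      # accumulated halting mass
    running = torch.ones_like(cum, dtype=torch.bool)
    n_updates = torch.zeros_like(cum)                 # expected loops per position

    for t in range(T):
        crosses = cum + p[t] > threshold
        if t == T - 1:                                # force everyone to halt at the last step
            crosses = torch.ones_like(crosses)
        halts_now = running & crosses
        w = torch.where(halts_now, 1.0 - cum, p[t])   # halting step takes the remainder
        w = torch.where(running, w, torch.zeros_like(w))  # halted positions: no further update
        weights[t] = w
        cum = cum + w
        n_updates = n_updates + running.float()
        running = running & ~halts_now

    mixed = (weights.unsqueeze(-1) * states).sum(dim=0)   # convex combination, weights sum to 1
    return mixed, n_updates

mixed, loops_used = act_combine(torch.randn(4, 2, 8, 16), torch.randn(4, 2, 8))
print(mixed.shape, loops_used.shape)                  # (2, 8, 16), (2, 8)
```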
Disclaimer: it's a theoretical reconstruction, and I made it less broken. This is not Claude. It's not even close to Claude. It's a reference implementation of what the architecture could look like based on public research. The original fork had a fake load-balancer, broken ACT math, and a test that failed on its own headline claim. I fixed those. That's the whole pitch.
- MoE dispatch rewritten — nested `O(topk × n_experts)` Python loop replaced with a single scatter-sort (see the sketch below). 2.94× faster at 64 experts, topk=4.
- Router load-balancing actually works — `update_router_bias()` per DeepSeekMoE Eq. 17 + auxiliary load-balance loss. Previously the bias was a buffer that nothing updated.
- ACT halting fixed — halted positions no longer receive updates; output is a proper convex combination summing to 1. Per-position ponder cost exposed for training.
- Attention modernized — `F.scaled_dot_product_attention` for both GQA and MLA paths. Flash attention on GPU, tuned kernel on CPU.
- Loop embeddings cached — precomputed once per `(n_loops, device, dtype)`, not rebuilt inside the hot loop.
- Stability test corrected — the ZOH construction bounds `ρ(A)` in `(0, 1]`, not `(0, 1)`. Test now asserts the mathematically true bound.
- Training loop — `train.py` runs on Dyck-1 depth, demonstrates depth extrapolation (more inference loops → higher OOD accuracy).
- New tests — 7 added (73/73 passing, up from 66/67 in upstream).
- Dead code removed — `example.py` now runs both GQA and MLA branches.
See CHANGELOG.md for details and before/after benchmarks.
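For the dispatch rewrite specifically, the underlying pattern is standard sort-and-group routing: flatten the (token, expert) assignments, sort them by expert id so each expert's tokens form one contiguous slice, run every expert once on its slice, and scatter the weighted outputs back with `index_add_`. The sketch below is a minimal illustration of that pattern with toy linear experts and plain softmax top-k weights; it is not the `MoEFFN` code itself (which adds shared experts and DeepSeekMoE-style balancing).

```python
import torch
import torch.nn as nn

def scatter_dispatch(x, router_logits, experts, topk=2):
    """Sort-and-group MoE dispatch: each expert runs once on a contiguous slice of
    its tokens, instead of a nested (topk x n_experts) Python loop."""
    n_tokens, _ = x.shape
    n_experts = router_logits.shape[1]

    weights, expert_ids = router_logits.softmax(dim=-1).topk(topk, dim=-1)   # (N, topk)
    flat_expert = expert_ids.reshape(-1)                                      # (N*topk,)
    flat_token = torch.arange(n_tokens, device=x.device).repeat_interleave(topk)
    flat_weight = weights.reshape(-1, 1)

    order = flat_expert.argsort()                     # group assignments by expert id
    flat_expert = flat_expert[order]
    flat_token = flat_token[order]
    flat_weight = flat_weight[order]
    counts = torch.bincount(flat_expert, minlength=n_experts).tolist()

    out = torch.zeros_like(x)
    start = 0
    for eid, count in enumerate(counts):              # one contiguous slice per expert
        if count == 0:
            continue
        tok = flat_token[start:start + count]
        out.index_add_(0, tok, experts[eid](x[tok]) * flat_weight[start:start + count])
        start += count
    return out

experts = nn.ModuleList([nn.Linear(32, 32) for _ in range(8)])
x = torch.randn(64, 32)
print(scatter_dispatch(x, torch.randn(64, 8), experts, topk=2).shape)   # (64, 32)
```

The per-expert `index_add_` loop at the bottom is the part the contributing list further down flags as still CPU-bound.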
# 1. Clone
git clone https://github.com/QBe1n/OpenMythos.git
cd OpenMythos
# 2. Install (editable)
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install -e .
# 3. Run the example (builds GQA + MLA models, forwards, generates)
python example.py
# 4. Run the tests
pip install pytest
python -m pytest test_main.py -q
# 5. Train on the toy task
python train.py --steps 2000 --log-every 200

import torch
from open_mythos.main import MythosConfig, OpenMythos, MoEFFN  # MoEFFN used in the hooks below
cfg = MythosConfig(
vocab_size=1000, dim=256, n_heads=8, n_kv_heads=2,
max_seq_len=128, max_loop_iters=4,
prelude_layers=1, coda_layers=1,
n_experts=8, n_shared_experts=1, n_experts_per_tok=2,
expert_dim=64, lora_rank=8, attn_type="gqa",
)
model = OpenMythos(cfg)
ids = torch.randint(0, cfg.vocab_size, (2, 16))
logits = model(ids, n_loops=4) # (2, 16, 1000)
out = model.generate(ids, max_new_tokens=8, n_loops=8)
# Depth extrapolation: trained at n_loops=4, run inference with n_loops=16
logits_deep = model(ids, n_loops=16)

# After any forward pass:
model.recurrent.last_ponder_cost # (B, T) — expected loops per position
for mod in model.modules():
    if isinstance(mod, MoEFFN):
        mod.last_aux_loss          # scalar — add to training loss
        mod.last_expert_load       # (n_experts,) — token counts
        mod.update_router_bias()   # call once per step
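Putting those hooks together, a single training step looks roughly like the sketch below, which continues the snippet above (`model`, `cfg`, `MoEFFN` already defined): cross-entropy on the logits, plus a weighted ponder cost, plus each MoE layer's auxiliary loss, with `update_router_bias()` called once per step after the backward pass. The loss coefficients and the optimizer choice are illustrative assumptions; `train.py` anneals the ponder coefficient and sets the real hyperparameters.

```python
import torch.nn.functional as F

# Illustrative training step; coefficients are assumptions, see train.py for the real ones.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
ponder_weight, aux_weight = 1e-2, 1e-2

def train_step(ids, targets, n_loops=4):
    logits = model(ids, n_loops=n_loops)
    loss = F.cross_entropy(logits.reshape(-1, cfg.vocab_size), targets.reshape(-1))
    loss = loss + ponder_weight * model.recurrent.last_ponder_cost.mean()
    for mod in model.modules():
        if isinstance(mod, MoEFFN):
            loss = loss + aux_weight * mod.last_aux_loss

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

    for mod in model.modules():        # gradient-free router-bias update, once per step
        if isinstance(mod, MoEFFN):
            mod.update_router_bias()
    return loss.item()
```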
   input tokens
        │
        ▼
┌───────────────┐
│    Prelude    │   2× dense transformer blocks
└───────┬───────┘
        │  e (frozen, injected each loop)
        ▼
┌───────────────┐ ◀─┐
│   Recurrent   │   │   ×T loops (shared weights)
│     block     │   │   per-loop LoRA adapter
└───────┬───────┘   │   LTI injection: h ← A·h + B·e + out
        │           │   ACT halting: per-token
        └───────────┘
        │
        ▼
┌───────────────┐
│     Coda      │   2× dense transformer blocks
└───────┬───────┘
        ▼
 RMSNorm + LM head
        │
        ▼
  output logits
- Attention: GQA (small, fast) or MLA (DeepSeek-V2 low-rank KV compression).
- FFN: fine-grained MoE with scatter dispatch. Shared experts always fire.
- Halting: ACT on the recurrent block only. Prelude and Coda are dense.
- Stability: `A = exp(-exp(log_dt + log_A))` guarantees `ρ(A) ∈ (0, 1]` for any parameter values.
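A quick numeric sanity check of that bound (a sketch that treats `A` elementwise, i.e. as if `A` were diagonal — an assumption about the parameterization, not a statement about the module's internals):

```python
import torch

z = torch.linspace(-30.0, 30.0, 6001)   # stands in for log_dt + log_A, pushed far out of range
A = torch.exp(-torch.exp(z))
# exp(z) > 0, so A = exp(-exp(z)) is always in (0, 1]. In float32 the upper end rounds to
# exactly 1.0 for very negative z (which is why the corrected test asserts (0, 1] rather
# than (0, 1)), and the lower end underflows toward 0 for very positive z.
assert torch.all(A >= 0) and torch.all(A <= 1)
print(A.min().item(), A.max().item())    # ~0.0 and 1.0
```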
See: CHANGELOG.md for what changed vs upstream, docs/open_mythos.md for the theory.
| Benchmark (CPU, 2 threads) | Upstream | This fork | Speedup |
|---|---|---|---|
| Small fwd (B=4, T=32, 1.8M) | 23.1 ms | 17.6 ms | 1.31× |
| Small training step | 75.3 ms | 54.0 ms | 1.39× |
| MoE dispatch, 64 experts, topk=4 | 36.5 ms | 12.4 ms | 2.94× |
| Tests passing | 66 / 67 | 73 / 73 | — |
ACT early-halt verified empirically: n_loops=8 and n_loops=16 cost the
same wall-clock as n_loops=4 once positions halt.
| Component | Upstream | This fork |
|---|---|---|
| LTI stability guarantee | ❌ Test asserted the wrong bound (failing) | ✅ ρ(A) ≤ 1 verified |
| ACT halting math | ❌ Halted positions kept updating | ✅ Proper convex combination |
| Ponder cost exposure | ❌ Not exposed | ✅ last_ponder_cost available |
| MoE load balancing | ❌ Buffer never updated | ✅ update_router_bias() + aux loss |
| MoE aux loss | ❌ Not computed | ✅ last_aux_loss exposed |
| Training loop | ❌ Absent | ✅ train.py with CE + ponder + aux |
| MLA attention kernel | ❌ No SDPA/flash path | ✅ F.scaled_dot_product_attention |
| GQA attention kernel | ❌ No SDPA/flash path | ✅ F.scaled_dot_product_attention |
| Loop-index embeddings | ❌ Rebuilt inside the hot loop | ✅ Precomputed + cached |
| example.py GQA branch | ❌ Dead code | ✅ Both branches run |
open_mythos/
    main.py           # all modules: GQA, MLA, MoE, LoRA, ACT, LTI, RecurrentBlock, OpenMythos
    variants.py       # config presets
    __init__.py
example.py            # build + forward + generate, both attn types
train.py              # end-to-end training on Dyck-1 depth
test_main.py          # 73 tests
CHANGELOG.md          # what this fork changes
docs/
    open_mythos.md    # theoretical background
This is a research toy, not a product. PRs welcome for:
- Depth-extrapolation benchmarks on harder tasks (ListOps, Long Range Arena, modular arithmetic with grokking).
- GPU kernel for the MoE dispatch — the current scatter is still CPU-bound on the per-expert `index_add_` loop.
- Proper checkpoint I/O — save/load is unimplemented.
- Tokenizer — everything currently assumes token IDs are already integers.
- Your idea here — the architecture has a lot of headroom.
Open issues or ping @QBe1n.
MIT License — Copyright (c) 2026 kubalskiy / QBEin
Original work Copyright (c) 2026 Kye Gomez. See LICENSE for full text.
- Issues: https://github.com/QBe1n/OpenMythos/issues
- Upstream: https://github.com/kyegomez/OpenMythos
- Author: @QBe1n
- Changelog: CHANGELOG.md
Disclaimer: OpenMythos is an independent community reconstruction based solely on public research. Not affiliated with Anthropic. The name "Claude Mythos" refers to the rumored architecture described in community speculation, not any shipping product.