
╔══════════════════════════════════════════════════════════════════════════════╗
║                                                                              ║
║   ███╗   ███╗██╗   ██╗████████╗██╗  ██╗ ██████╗ ███████╗                     ║
║   ████╗ ████║╚██╗ ██╔╝╚══██╔══╝██║  ██║██╔═══██╗██╔════╝                     ║
║   ██╔████╔██║ ╚████╔╝    ██║   ███████║██║   ██║███████╗                     ║
║   ██║╚██╔╝██║  ╚██╔╝     ██║   ██╔══██║██║   ██║╚════██║                     ║
║   ██║ ╚═╝ ██║   ██║      ██║   ██║  ██║╚██████╔╝███████║                     ║
║   ╚═╝     ╚═╝   ╚═╝      ╚═╝   ╚═╝  ╚═╝ ╚═════╝ ╚══════╝                     ║
║                                                                              ║
║              OpenMythos — recurrent-depth transformer, fixed.                ║
║                                                                              ║
║                              v0.2.0-improved                                 ║
║                                                                              ║
║                       @QBe1n (fork of @kyegomez)                             ║
║                                                                              ║
║                ⣠⣴⣶⣶⣤⡀                                                        ║
║             ⢀⣴⡟⠋⠁  ⠙⢿⣦⡀              ┌─────────────────────────┐             ║
║           ⢠⣾⠋⠁        ⠈⠻⣷⡄           │   Loop the block.       │             ║
║          ⣰⡟⠁    ⣀⣤⣤⣀    ⠘⢿⣆          │   Halt when ready.      │             ║
║         ⢠⡿⠁   ⢠⣾⠟⠛⠛⠻⣷⡄   ⠈⢿⡄         │   Extrapolate depth.    │             ║
║         ⢸⡇   ⢠⡿⠋    ⠙⢿⣆   ⢸⡇         │                         │             ║
║         ⢸⡇   ⢸⡇   ●   ⢸⡇   ⢸⡇         │   ρ(A) ≤ 1 by design.   │             ║
║         ⢸⡇   ⠸⣧      ⣼⠇   ⢸⡇         └─────────────────────────┘             ║
║         ⠸⣧    ⠻⣦⣄⣀⣠⣴⠟    ⣼⠇                                                  ║
║          ⠹⣦⡀    ⠉⠉⠉    ⢀⣴⠏                                                   ║
║           ⠙⢷⣤⡀        ⢀⣤⡶⠋                                                   ║
║             ⠙⠻⢶⣤⣄⣀⣠⣤⡶⠟⠁                                                      ║
║                ⠉⠛⠛⠛⠉                                                         ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝

OpenMythos — recurrent-depth transformer, fixed



Author: kubalskiy / QBEin (@QBe1n)

Upstream: kyegomez/OpenMythos by Kye Gomez

License: MIT (see LICENSE)

Repository: https://github.com/QBe1n/OpenMythos

Status: 73/73 tests passing. Trainable. MoE 2.94× faster. ACT actually halts.


What is OpenMythos?

OpenMythos is an open-source theoretical reconstruction of the Claude Mythos architecture — a Recurrent-Depth Transformer (RDT) that loops a shared middle block T times between a Prelude and a Coda, halts adaptively per token, and supports depth extrapolation: train with 4 loops, run inference with 16.

This fork takes the original reference implementation and makes it actually work end-to-end: fixes correctness bugs, modernizes kernels, and adds a training loop that proves the architecture learns.

Core building blocks, implemented from first principles:

  1. Prelude — 2 dense transformer blocks encoding input tokens.
  2. Recurrent block — shared weights, looped T times at inference.
  3. Coda — 2 dense transformer blocks decoding to logits.
  4. MLA or GQA attention — switchable; MLA uses DeepSeek-V2-style low-rank KV compression.
  5. Fine-grained MoE — 64 routed + 2 shared experts, top-K routing, DeepSeekMoE load balancing.
  6. ACT halting — per-token adaptive computation time, Graves 2016 remainder formulation; see the loop sketch after this list.
  7. LTI injection — stability-guaranteed state update with ρ(A) ≤ 1 by construction.
  8. LoRA adapters — depth-wise, one per loop step, low-rank on top of the shared block.
  9. Loop-index embeddings — sinusoidal positional signal for the current loop iteration.
  10. RoPE — rotary position embeddings on the sequence axis.
  11. Scatter-based MoE dispatch — sort-and-group, 2.94× faster than the naive loop.
  12. Training loop — CE + annealed ponder cost + MoE aux loss, cosine LR, grad clip.
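
To make the loop concrete, here is a minimal, illustrative sketch of how the recurrent block composes LTI injection and ACT halting. The function and tensor names below are illustrative placeholders, not the open_mythos.main API (see Usage for the real entry points):

import torch

def recurrent_depth_forward(h, e, block, halt_head, A, B, n_loops, eps=0.01):
    # h, e: (batch, seq, dim); A, B: injection coefficients with rho(A) <= 1.
    bsz, seq, _ = h.shape
    cum_p = torch.zeros(bsz, seq)               # cumulative halting probability
    out = torch.zeros_like(h)                   # convex combination of loop states
    running = torch.ones(bsz, seq, dtype=torch.bool)

    for t in range(n_loops):
        h = A * h + B * e + block(h, loop_idx=t)      # LTI injection; per-loop LoRA / loop embedding inside block
        p = torch.sigmoid(halt_head(h)).squeeze(-1)   # per-position halt probability
        # Graves-2016 remainder: positions crossing 1 - eps (or reaching the last
        # loop) receive 1 - cum_p, so their weights sum to exactly 1.
        halting = running & ((cum_p + p >= 1 - eps) | (t == n_loops - 1))
        p = torch.where(halting, 1.0 - cum_p, p) * running
        out = out + p.unsqueeze(-1) * h               # halted positions stop accumulating
        cum_p = cum_p + p
        running = running & ~halting
        if not running.any():                         # every position halted: stop early
            break
    return out

The real module additionally tracks a per-position ponder cost (expected number of loops), which the training loop adds to the loss.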

Disclaimer: it's a theoretical reconstruction, and I made it less broken. This is not Claude. It's not even close to Claude. It's a reference implementation of what the architecture could look like based on public research. The original fork had a fake load-balancer, broken ACT math, and a test that failed on its own headline claim. I fixed those. That's the whole pitch.


What's improved in this fork?

  • MoE dispatch rewritten — nested O(topk × n_experts) Python loop replaced with a single scatter-sort; 2.94× faster at 64 experts, topk=4 (see the sketch after this list).
  • Router load-balancing actually works — update_router_bias() per DeepSeekMoE Eq. 17 + auxiliary load-balance loss. Previously the bias was a buffer that nothing updated.
  • ACT halting fixed — halted positions no longer receive updates; output is a proper convex combination summing to 1. Per-position ponder cost exposed for training.
  • Attention modernized — F.scaled_dot_product_attention for both GQA and MLA paths. Flash attention on GPU, tuned kernel on CPU.
  • Loop embeddings cached — precomputed once per (n_loops, device, dtype), not rebuilt inside the hot loop.
  • Stability test corrected — the ZOH construction bounds ρ(A) in (0, 1], not (0, 1). The test now asserts the mathematically true bound.
  • Training loop — train.py runs on Dyck-1 depth and demonstrates depth extrapolation (more inference loops → higher OOD accuracy).
  • New tests — 7 added (73/73 passing, up from 66/67 in upstream).
  • Dead code removed — example.py now runs both GQA and MLA branches.
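
For reference, the scatter-sort idea is roughly the following. This is a simplified sketch, not the exact dispatch in main.py; top-k weight renormalization and the always-on shared experts are omitted:

import torch

def scatter_moe_dispatch(x, router_logits, experts, topk):
    # x: (N, D) flattened tokens; router_logits: (N, n_experts)
    weights, expert_ids = router_logits.softmax(-1).topk(topk, dim=-1)
    flat_ids = expert_ids.reshape(-1)                               # (N * topk,)
    flat_w = weights.reshape(-1, 1)
    token_idx = torch.arange(x.size(0)).repeat_interleave(topk)     # source token for each routing slot

    order = flat_ids.argsort()                                      # group assignments by expert
    counts = torch.bincount(flat_ids, minlength=len(experts)).tolist()
    out, start = torch.zeros_like(x), 0
    for eid, cnt in enumerate(counts):                              # one contiguous batch per expert
        if cnt == 0:
            continue
        slots = order[start:start + cnt]
        src = token_idx[slots]
        out.index_add_(0, src, experts[eid](x[src]) * flat_w[slots])
        start += cnt
    return out

This replaces the O(topk × n_experts) nested loop with one sort plus a single pass over experts; the per-expert index_add_ is the remaining CPU-bound piece mentioned under Contribute.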

See CHANGELOG.md for details and before/after benchmarks.


Quick Start

# 1. Clone
git clone https://github.com/QBe1n/OpenMythos.git
cd OpenMythos

# 2. Install (editable)
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install -e .

# 3. Run the example (builds GQA + MLA models, forwards, generates)
python example.py

# 4. Run the tests
pip install pytest
python -m pytest test_main.py -q

# 5. Train on the toy task
python train.py --steps 2000 --log-every 200

Usage

Minimal model

import torch
from open_mythos.main import MythosConfig, OpenMythos

cfg = MythosConfig(
    vocab_size=1000, dim=256, n_heads=8, n_kv_heads=2,
    max_seq_len=128, max_loop_iters=4,
    prelude_layers=1, coda_layers=1,
    n_experts=8, n_shared_experts=1, n_experts_per_tok=2,
    expert_dim=64, lora_rank=8, attn_type="gqa",
)
model = OpenMythos(cfg)

ids = torch.randint(0, cfg.vocab_size, (2, 16))
logits = model(ids, n_loops=4)          # (2, 16, 1000)
out = model.generate(ids, max_new_tokens=8, n_loops=8)

# Depth extrapolation: trained at n_loops=4, run inference with n_loops=16
logits_deep = model(ids, n_loops=16)

Accessing training-side telemetry

from open_mythos.main import MoEFFN    # MoEFFN is defined in main.py alongside the other modules

# After any forward pass:
model.recurrent.last_ponder_cost   # (B, T) — expected loops per position
for mod in model.modules():
    if isinstance(mod, MoEFFN):
        mod.last_aux_loss          # scalar — add to training loss
        mod.last_expert_load       # (n_experts,) — token counts
        mod.update_router_bias()   # call once per step
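
Putting the pieces together, a training step along the lines of train.py (CE + annealed ponder cost + MoE aux loss, grad clip) might look like the sketch below. The optimizer, targets, clip threshold, and weighting coefficients here are placeholders, not the repo's exact values:

import torch
import torch.nn.functional as F
from open_mythos.main import MoEFFN

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)    # placeholder; train.py uses a cosine schedule
targets = ids.clone()                                         # placeholder targets for illustration
ponder_weight, aux_weight = 0.01, 0.01                        # placeholders; the ponder weight is annealed in train.py

logits = model(ids, n_loops=4)
ce = F.cross_entropy(logits.reshape(-1, cfg.vocab_size), targets.reshape(-1))
ponder = model.recurrent.last_ponder_cost.mean()              # expected loops per position
aux = sum(m.last_aux_loss for m in model.modules() if isinstance(m, MoEFFN))

loss = ce + ponder_weight * ponder + aux_weight * aux
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)       # gradient clipping (threshold is a placeholder)
optimizer.step()
optimizer.zero_grad()

for m in model.modules():                                     # DeepSeekMoE router-bias update, once per step
    if isinstance(m, MoEFFN):
        m.update_router_bias()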

Architecture

            input tokens
                 │
                 ▼
         ┌───────────────┐
         │    Prelude    │  2× dense transformer blocks
         └───────┬───────┘
                 │ e (frozen, injected each loop)
                 ▼
         ┌───────────────┐ ◀─┐
         │  Recurrent    │   │  ×T loops (shared weights)
         │     block     │   │  per-loop LoRA adapter
         └───────┬───────┘   │  LTI injection: h ← A·h + B·e + out
                 │           │  ACT halting: per-token
                 └───────────┘
                 │
                 ▼
         ┌───────────────┐
         │     Coda      │  2× dense transformer blocks
         └───────┬───────┘
                 ▼
            RMSNorm + LM head
                 │
                 ▼
            output logits
  • Attention: GQA (small, fast) or MLA (DeepSeek-V2 low-rank KV compression).
  • FFN: fine-grained MoE with scatter dispatch. Shared experts always fire.
  • Halting: ACT on the recurrent block only. Prelude and Coda are dense.
  • Stability: A = exp(-exp(log_dt + log_A)) guarantees ρ(A) ∈ (0, 1] for any parameter values.
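
A quick numerical check of the last bullet (variable names are illustrative): exp(-exp(x)) lies in (0, 1) for any finite x and approaches 1 from below, so asserting the closed bound (0, 1] is safe even when floating-point rounding pushes values to exactly 1.0.

import torch

log_dt = torch.randn(10_000) * 10          # deliberately extreme parameter values
log_A = torch.randn(10_000) * 10
A_diag = torch.exp(-torch.exp(log_dt + log_A))
assert torch.all(A_diag > 0) and torch.all(A_diag <= 1)   # rho(A) in (0, 1] for any inputs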

See: CHANGELOG.md for what changed vs upstream, docs/open_mythos.md for the theory.


Benchmarks

Benchmark (CPU, 2 threads)          Upstream    This fork    Speedup
Small fwd (B=4, T=32, 1.8M)         23.1 ms     17.6 ms      1.31×
Small training step                 75.3 ms     54.0 ms      1.39×
MoE dispatch, 64 experts, topk=4    36.5 ms     12.4 ms      2.94×
Tests passing                       66 / 67     73 / 73

ACT early-halt verified empirically: n_loops=8 and n_loops=16 cost the same wall-clock as n_loops=4 once positions halt.
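
A simple way to check this on your own machine (the timing harness below is not part of the repo):

import time
import torch

ids = torch.randint(0, cfg.vocab_size, (4, 64))
model.eval()
with torch.no_grad():
    for k in (4, 8, 16):
        t0 = time.perf_counter()
        model(ids, n_loops=k)
        print(f"n_loops={k}: {(time.perf_counter() - t0) * 1e3:.1f} ms")

On a trained model the three timings should converge once every position has halted; on a freshly initialized model the halting head has not yet learned when to stop, so the effect may not appear.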


Correctness status

Component                  Upstream                            This fork
LTI stability guarantee    ⚠️ Test asserted wrong bound         ✅ ρ(A) ≤ 1 verified
ACT halting math           ❌ Halted positions kept updating    ✅ Proper convex combination
Ponder cost exposure       ❌ Not exposed                       ✅ last_ponder_cost available
MoE load balancing         ❌ Buffer never updated              ✅ update_router_bias() + aux loss
MoE aux loss               ❌ Not computed                      ✅ last_aux_loss exposed
Training loop              ❌ Absent                            ✅ train.py with CE + ponder + aux
MLA attention kernel       ⚠️ Manual softmax                    ✅ F.scaled_dot_product_attention
GQA attention kernel       ⚠️ Manual softmax                    ✅ F.scaled_dot_product_attention
Loop-index embeddings      ⚠️ Allocated every step              ✅ Precomputed + cached
example.py GQA branch      ❌ Dead code                         ✅ Both branches run

Project layout

open_mythos/
  main.py           # all modules: GQA, MLA, MoE, LoRA, ACT, LTI, RecurrentBlock, OpenMythos
  variants.py       # config presets
  __init__.py
example.py          # build + forward + generate, both attn types
train.py            # end-to-end training on Dyck-1 depth
test_main.py        # 73 tests
CHANGELOG.md        # what this fork changes
docs/
  open_mythos.md    # theoretical background

Contribute

This is a research toy, not a product. PRs welcome for:

  • Depth-extrapolation benchmarks on harder tasks (ListOps, Long Range Arena, modular arithmetic with grokking).
  • GPU kernel for the MoE dispatch — the current scatter is still CPU-bound on the per-expert index_add_ loop.
  • Proper checkpoint I/O — save/load is unimplemented.
  • Tokenizer — everything currently assumes token IDs are already integers.
  • Your idea here — the architecture has a lot of headroom.

Open issues or ping @QBe1n.


License

MIT License — Copyright (c) 2026 kubalskiy / QBEin

Original work Copyright (c) 2026 Kye Gomez. See LICENSE for full text.


Support

Disclaimer: OpenMythos is an independent community reconstruction based solely on public research. Not affiliated with Anthropic. The name "Claude Mythos" refers to the rumored architecture described in community speculation, not any shipping product.
