
Moe train #2939

Open
Acatsama0871 wants to merge 11 commits into pytorch:main from Acatsama0871:moe-train

Conversation

@Acatsama0871

No description provided.

Acatsama0871 and others added 6 commits April 7, 2026 19:54
- Models: keep only llama3 (dense) and deepseek_v3 (MoE), remove flux/gpt_oss/qwen3/llama4/llama3_ft
- Remove experiments/, docs/, scripts/, CI/docker/GitHub workflows
- Remove quantization (float8/MX), fault tolerance (TorchFT), TensorBoard logging
- Remove model_converter protocol, deepep backend, moe_deepep
- Remove deprecated config fields (Experimental, MemoryEstimation, tokenizer_path, etc.)
- Flatten model directory structure (model/model.py -> model_def.py)
- Relocate shared MoE parallelization utils to models/parallelize.py
- Replace tyro with simple TOML + CLI override parser
- Simplify tokenizer.py to tokenizer.json-only loading
- Clean dead FT code paths from checkpoint.py
- Add test suite (tests/test_codebase.py, 40 tests)
- Fix all ruff lint/format issues and silence pyright type errors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Guided reading order covering all 47 source files with rationale,
data flow diagrams, and file index for navigating the stripped-down framework.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename torchtitan/ to src/, strip unused models (llama3, deepseek_v3),
remove unused distributed features (PP, CP, TP, DeepEP), and replace
the config system with pydantic + typer + YAML.

- New: config.py (pydantic), train.py (typer entry), src/training/trainer.py
- New: src/models/moe/model.py (MoETransformer with GQA + RoPE + MoE blocks)
- New: configs/moe_tiny.yaml, configs/moe_15b.yaml
- New: scripts/smoke_test.sh, scripts/ep_correctness.{sh,py}
- Auto-download tokenizer from HF model ID on first run
- EP correctness verified: EP=1 vs EP=4 loss matches exactly (0.000000 diff)
- Smoke test passes: 100 steps, loss 12.48 → 7.27, 195k tok/s on 4xH200
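The pydantic + typer + YAML config system described above could be structured roughly as follows. The field names and defaults here are illustrative assumptions, not the PR's actual schema; the typer CLI entry point (which would just take the YAML path) is omitted.

```python
import yaml
from pydantic import BaseModel


class ModelConfig(BaseModel):
    dim: int = 256
    n_layers: int = 4
    n_experts: int = 8


class TrainingConfig(BaseModel):
    steps: int = 100
    lr: float = 3e-4


class Config(BaseModel):
    model: ModelConfig = ModelConfig()
    training: TrainingConfig = TrainingConfig()


# Values present in the YAML file override the pydantic defaults;
# missing keys fall back to the defaults with full type validation.
raw = yaml.safe_load("""
model:
  n_experts: 16
training:
  steps: 1000
""")
cfg = Config(**raw)
```

This gives typed, validated access (`cfg.model.n_experts`) instead of raw dict lookups, and a bad type in the YAML fails fast at startup rather than mid-run.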

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dead code removed:
- protocols/ (unused model registration system)
- components/loss.py, metrics.py, dataloader.py, train_state.py
- models/utils.py, models/attention.py (absorbed into moe/model.py)
- Profiling and Validation config dataclasses
- FlexAttention, VarlenAttention, build_hf_tokenizer, build_text_validation_dataloader

Structure flattened:
- hf_datasets/ + dataloader.py → src/data.py
- training/trainer.py + train_state.py → src/trainer.py
- tools/logging.py → src/logging.py
- tools/utils.py → src/utils.py
- models/attention.py → absorbed into models/moe/model.py
- Removed models/__init__.py (unused)

Other:
- Logging switched from stdlib logging to logfire
- Wired up torch.compile support (config + trainer)
- Auto-download tokenizer from HF model ID
- Added activation_checkpoint and compile to pydantic config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Model / kernels:
- Switch attention to explicit flash_attn_func call (FA2 for now, FA3-ready)
- Drop SDPA wrapper, repeat_kv, and attention transposes (flash_attn handles GQA natively)
- Add QuackConfig and RMSNorm wrapper that dispatches to quack.rmsnorm/cross_entropy
- Mutual exclusion check between quack and torch.compile
- Work around quack's stride-0 grad bug in CE backward via _ContiguousGrad
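The `_ContiguousGrad` workaround mentioned above can be sketched as a small autograd identity function: forward passes the tensor through untouched, and backward forces the incoming gradient contiguous before it reaches a kernel that cannot handle stride-0 (broadcast-expanded) gradients. This is a generic sketch of the pattern, not the PR's exact code.

```python
import torch


class _ContiguousGrad(torch.autograd.Function):
    """Identity in forward; materializes the gradient in backward.

    Works around backward kernels that emit stride-0 gradients
    (e.g. from broadcasting) which downstream ops choke on.
    """

    @staticmethod
    def forward(ctx, x):
        # view_as keeps autograd happy when returning an input unchanged.
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # .contiguous() is a no-op if the gradient is already dense.
        return grad_out.contiguous()


x = torch.randn(4, 8, requires_grad=True)
y = _ContiguousGrad.apply(x)
# Simulate a stride-0 gradient by broadcast-expanding a single row.
g = torch.ones(1, 8).expand(4, 8)
y.backward(g)
```

After `backward`, `x.grad` is a dense, contiguous copy of the expanded gradient rather than a stride-0 view.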

Dataset:
- Drop c4 registry; data.py now takes a local dataset path directly
- Supports HF save_to_disk format and raw parquet/jsonl via load_dataset
- Auto-detect vocab_size from tokenizer (no more hardcoded 151936)
- Move dataset_path to the `data:` section
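The format auto-detection for a local dataset path could work along these lines. This is an assumed sketch: the marker files (`state.json` / `dataset_info.json` for HF `save_to_disk` output) are real conventions of the `datasets` library, but the function name and dispatch are illustrative.

```python
from pathlib import Path


def detect_dataset_format(path: str) -> str:
    """Guess how to load a local dataset directory.

    Returns "save_to_disk" (use datasets.load_from_disk), or a
    load_dataset builder name ("parquet" / "json") for raw files.
    """
    p = Path(path)
    # HF save_to_disk output carries these metadata files at the top level.
    if (p / "state.json").exists() or (p / "dataset_info.json").exists():
        return "save_to_disk"
    suffixes = {f.suffix for f in p.iterdir() if f.is_file()}
    if ".parquet" in suffixes:
        return "parquet"
    if ".jsonl" in suffixes or ".json" in suffixes:
        return "json"
    raise ValueError(f"unrecognized dataset layout at {path}")
```

`data.py` would then branch on the returned tag to call either `load_from_disk(path)` or `load_dataset(builder, data_files=...)`.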

Eval:
- New EvalConfig with enable / dataset_path / eval_step
- run_eval() iterates full eval set once, computes loss / ppl / top1
- Uses all-reduce MIN of has_batch flag so ranks stop together (FSDP-safe)
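The all-reduce-MIN stopping trick can be sketched as below. The idea: each rank contributes a `has_batch` flag, and MIN means every rank exits the loop on the same iteration, as soon as any rank runs out of data, so FSDP's collectives inside the eval step never deadlock on a straggler. The single-process gloo group here is only for illustration; real runs use torchrun with multiple ranks, and the per-batch "loss" is a stand-in.

```python
import os

import torch
import torch.distributed as dist

# Single-process group for illustration; real runs launch N ranks via torchrun.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29517")
dist.init_process_group("gloo", rank=0, world_size=1)


def run_eval(batches):
    """Average loss over an eval set, stopping all ranks together."""
    it = iter(batches)
    total, count = 0.0, 0
    while True:
        batch = next(it, None)
        has_batch = torch.tensor([0 if batch is None else 1])
        # MIN across ranks: the loop ends as soon as ANY rank is exhausted,
        # keeping FSDP collectives inside the eval step aligned.
        dist.all_reduce(has_batch, op=dist.ReduceOp.MIN)
        if has_batch.item() == 0:
            break
        total += float(batch)  # stand-in for the per-batch eval loss
        count += 1
    return total / max(count, 1)


avg_loss = run_eval([2.0, 4.0])
dist.destroy_process_group()
```

With uneven shards, the rank with the shortest eval split dictates when everyone stops, trading a few dropped tail batches for deadlock-free evaluation.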

Logging & checkpoint:
- Split dump_folder into logging.log_dump and checkpoint.checkpoint_dump
- Rename checkpoint.interval -> checkpoint.checkpoint_step
- New log_step field in logging section
- src/logging.py now writes plain-text train.log alongside logfire console
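Put together, the renamed logging/checkpoint keys might appear in a config file like this. The key names come from the commit message above; the values and paths are purely illustrative.

```yaml
logging:
  log_dump: ./outputs/logs        # split out of the old dump_folder
  log_step: 10                    # new: emit metrics every N steps

checkpoint:
  checkpoint_dump: ./outputs/ckpt # split out of the old dump_folder
  checkpoint_step: 500            # renamed from checkpoint.interval
```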

Pins:
- torch==2.11.0+cu130 via explicit PyTorch index
- flash-attn==2.8.3 prebuilt wheel for cu130/torch2.11/py313
- quack-kernels[cu13]>=0.3.9
- Python ==3.13.11

Scripts:
- make_example_dataset.py generates train + eval example datasets

Verified:
- Smoke test passes (train+eval) with loss/ppl/top1 improving over 100 steps
- EP correctness still matches exactly (EP=1 vs EP=4 diff 0.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- WALKTHROUGH.md: general codebase tour, reading order, and troubleshooting
- MOE_PARALLELISM.md: detailed guided walk through EP mesh construction,
  all-to-all dispatch/combine, and how ExpertParallel hooks into MoE.forward

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@meta-cla

meta-cla bot commented Apr 12, 2026

Hi @Acatsama0871!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
