
Moe train #2939

Open
Acatsama0871 wants to merge 11 commits into pytorch:main from Acatsama0871:moe-train

Conversation

@Acatsama0871

No description provided.

Acatsama0871 and others added 6 commits April 7, 2026 19:54
- Models: keep only llama3 (dense) and deepseek_v3 (MoE), remove flux/gpt_oss/qwen3/llama4/llama3_ft
- Remove experiments/, docs/, scripts/, CI/docker/GitHub workflows
- Remove quantization (float8/MX), fault tolerance (TorchFT), TensorBoard logging
- Remove model_converter protocol, deepep backend, moe_deepep
- Remove deprecated config fields (Experimental, MemoryEstimation, tokenizer_path, etc.)
- Flatten model directory structure (model/model.py -> model_def.py)
- Relocate shared MoE parallelization utils to models/parallelize.py
- Replace tyro with simple TOML + CLI override parser
- Simplify tokenizer.py to tokenizer.json-only loading
- Clean dead FT code paths from checkpoint.py
- Add test suite (tests/test_codebase.py, 40 tests)
- Fix all ruff lint/format issues and silence pyright type errors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Guided reading order covering all 47 source files with rationale,
data flow diagrams, and file index for navigating the stripped-down framework.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename torchtitan/ to src/, strip unused models (llama3, deepseek_v3),
remove unused distributed features (PP, CP, TP, DeepEP), and replace
the config system with pydantic + typer + YAML.

- New: config.py (pydantic), train.py (typer entry), src/training/trainer.py
- New: src/models/moe/model.py (MoETransformer with GQA + RoPE + MoE blocks)
- New: configs/moe_tiny.yaml, configs/moe_15b.yaml
- New: scripts/smoke_test.sh, scripts/ep_correctness.{sh,py}
- Auto-download tokenizer from HF model ID on first run
- EP correctness verified: EP=1 vs EP=4 loss matches exactly (0.000000 diff)
- Smoke test passes: 100 steps, loss 12.48 → 7.27, 195k tok/s on 4xH200
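The pydantic + typer + YAML config system described above could be structured roughly as follows. The field names and defaults here are illustrative assumptions, not the PR's actual schema; the typer CLI entry point (which would just take the YAML path) is omitted.

```python
import yaml
from pydantic import BaseModel


class ModelConfig(BaseModel):
    dim: int = 256
    n_layers: int = 4
    n_experts: int = 8


class TrainingConfig(BaseModel):
    steps: int = 100
    lr: float = 3e-4


class Config(BaseModel):
    model: ModelConfig = ModelConfig()
    training: TrainingConfig = TrainingConfig()


# Values present in the YAML file override the pydantic defaults;
# missing keys fall back to the defaults with full type validation.
raw = yaml.safe_load("""
model:
  n_experts: 16
training:
  steps: 1000
""")
cfg = Config(**raw)
```

This gives typed, validated access (`cfg.model.n_experts`) instead of raw dict lookups, and a bad type in the YAML fails fast at startup rather than mid-run.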

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dead code removed:
- protocols/ (unused model registration system)
- components/loss.py, metrics.py, dataloader.py, train_state.py
- models/utils.py, models/attention.py (absorbed into moe/model.py)
- Profiling and Validation config dataclasses
- FlexAttention, VarlenAttention, build_hf_tokenizer, build_text_validation_dataloader

Structure flattened:
- hf_datasets/ + dataloader.py → src/data.py
- training/trainer.py + train_state.py → src/trainer.py
- tools/logging.py → src/logging.py
- tools/utils.py → src/utils.py
- models/attention.py → absorbed into models/moe/model.py
- Removed models/__init__.py (unused)

Other:
- Logging switched from stdlib logging to logfire
- Wired up torch.compile support (config + trainer)
- Auto-download tokenizer from HF model ID
- Added activation_checkpoint and compile to pydantic config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Model / kernels:
- Switch attention to explicit flash_attn_func call (FA2 for now, FA3-ready)
- Drop SDPA wrapper, repeat_kv, and attention transposes (flash_attn handles GQA natively)
- Add QuackConfig and RMSNorm wrapper that dispatches to quack.rmsnorm/cross_entropy
- Mutual exclusion check between quack and torch.compile
- Work around quack's stride-0 grad bug in CE backward via _ContiguousGrad
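The `_ContiguousGrad` workaround mentioned above can be sketched as a small autograd identity function: forward passes the tensor through untouched, and backward forces the incoming gradient contiguous before it reaches a kernel that cannot handle stride-0 (broadcast-expanded) gradients. This is a generic sketch of the pattern, not the PR's exact code.

```python
import torch


class _ContiguousGrad(torch.autograd.Function):
    """Identity in forward; materializes the gradient in backward.

    Works around backward kernels that emit stride-0 gradients
    (e.g. from broadcasting) which downstream ops choke on.
    """

    @staticmethod
    def forward(ctx, x):
        # view_as keeps autograd happy when returning an input unchanged.
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # .contiguous() is a no-op if the gradient is already dense.
        return grad_out.contiguous()


x = torch.randn(4, 8, requires_grad=True)
y = _ContiguousGrad.apply(x)
# Simulate a stride-0 gradient by broadcast-expanding a single row.
g = torch.ones(1, 8).expand(4, 8)
y.backward(g)
```

After `backward`, `x.grad` is a dense, contiguous copy of the expanded gradient rather than a stride-0 view.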

Dataset:
- Drop c4 registry; data.py now takes a local dataset path directly
- Supports HF save_to_disk format and raw parquet/jsonl via load_dataset
- Auto-detect vocab_size from tokenizer (no more hardcoded 151936)
- Move dataset_path to the `data:` section
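The format auto-detection for a local dataset path could work along these lines. This is an assumed sketch: the marker files (`state.json` / `dataset_info.json` for HF `save_to_disk` output) are real conventions of the `datasets` library, but the function name and dispatch are illustrative.

```python
from pathlib import Path


def detect_dataset_format(path: str) -> str:
    """Guess how to load a local dataset directory.

    Returns "save_to_disk" (use datasets.load_from_disk), or a
    load_dataset builder name ("parquet" / "json") for raw files.
    """
    p = Path(path)
    # HF save_to_disk output carries these metadata files at the top level.
    if (p / "state.json").exists() or (p / "dataset_info.json").exists():
        return "save_to_disk"
    suffixes = {f.suffix for f in p.iterdir() if f.is_file()}
    if ".parquet" in suffixes:
        return "parquet"
    if ".jsonl" in suffixes or ".json" in suffixes:
        return "json"
    raise ValueError(f"unrecognized dataset layout at {path}")
```

`data.py` would then branch on the returned tag to call either `load_from_disk(path)` or `load_dataset(builder, data_files=...)`.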

Eval:
- New EvalConfig with enable / dataset_path / eval_step
- run_eval() iterates full eval set once, computes loss / ppl / top1
- Uses all-reduce MIN of has_batch flag so ranks stop together (FSDP-safe)
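The all-reduce-MIN stopping trick can be sketched as below. The idea: each rank contributes a `has_batch` flag, and MIN means every rank exits the loop on the same iteration, as soon as any rank runs out of data, so FSDP's collectives inside the eval step never deadlock on a straggler. The single-process gloo group here is only for illustration; real runs use torchrun with multiple ranks, and the per-batch "loss" is a stand-in.

```python
import os

import torch
import torch.distributed as dist

# Single-process group for illustration; real runs launch N ranks via torchrun.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29517")
dist.init_process_group("gloo", rank=0, world_size=1)


def run_eval(batches):
    """Average loss over an eval set, stopping all ranks together."""
    it = iter(batches)
    total, count = 0.0, 0
    while True:
        batch = next(it, None)
        has_batch = torch.tensor([0 if batch is None else 1])
        # MIN across ranks: the loop ends as soon as ANY rank is exhausted,
        # keeping FSDP collectives inside the eval step aligned.
        dist.all_reduce(has_batch, op=dist.ReduceOp.MIN)
        if has_batch.item() == 0:
            break
        total += float(batch)  # stand-in for the per-batch eval loss
        count += 1
    return total / max(count, 1)


avg_loss = run_eval([2.0, 4.0])
dist.destroy_process_group()
```

With uneven shards, the rank with the shortest eval split dictates when everyone stops, trading a few dropped tail batches for deadlock-free evaluation.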

Logging & checkpoint:
- Split dump_folder into logging.log_dump and checkpoint.checkpoint_dump
- Rename checkpoint.interval -> checkpoint.checkpoint_step
- New log_step field in logging section
- src/logging.py now writes plain-text train.log alongside logfire console
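Put together, the renamed logging/checkpoint keys might appear in a config file like this. The key names come from the commit message above; the values and paths are purely illustrative.

```yaml
logging:
  log_dump: ./outputs/logs        # split out of the old dump_folder
  log_step: 10                    # new: emit metrics every N steps

checkpoint:
  checkpoint_dump: ./outputs/ckpt # split out of the old dump_folder
  checkpoint_step: 500            # renamed from checkpoint.interval
```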

Pins:
- torch==2.11.0+cu130 via explicit PyTorch index
- flash-attn==2.8.3 prebuilt wheel for cu130/torch2.11/py313
- quack-kernels[cu13]>=0.3.9
- Python ==3.13.11

Scripts:
- make_example_dataset.py generates train + eval example datasets

Verified:
- Smoke test passes (train+eval) with loss/ppl/top1 improving over 100 steps
- EP correctness still matches exactly (EP=1 vs EP=4 diff 0.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- WALKTHROUGH.md: general codebase tour, reading order, and troubleshooting
- MOE_PARALLELISM.md: detailed guided walk through EP mesh construction,
  all-to-all dispatch/combine, and how ExpertParallel hooks into MoE.forward

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@meta-cla

meta-cla bot commented Apr 12, 2026

Hi @Acatsama0871!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
