Conversation
- Models: keep only llama3 (dense) and deepseek_v3 (MoE); remove flux/gpt_oss/qwen3/llama4/llama3_ft
- Remove experiments/, docs/, scripts/, CI/docker/GitHub workflows
- Remove quantization (float8/MX), fault tolerance (TorchFT), TensorBoard logging
- Remove model_converter protocol, deepep backend, moe_deepep
- Remove deprecated config fields (Experimental, MemoryEstimation, tokenizer_path, etc.)
- Flatten model directory structure (model/model.py -> model_def.py)
- Relocate shared MoE parallelization utils to models/parallelize.py
- Replace tyro with simple TOML + CLI override parser
- Simplify tokenizer.py to tokenizer.json-only loading
- Clean dead FT code paths from checkpoint.py
- Add test suite (tests/test_codebase.py, 40 tests)
- Fix all ruff lint/format issues and silence pyright type errors

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Guided reading order covering all 47 source files, with rationale, data-flow diagrams, and a file index for navigating the stripped-down framework. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename torchtitan/ to src/, strip unused models (llama3, deepseek_v3),
remove unused distributed features (PP, CP, TP, DeepEP), and replace
the config system with pydantic + typer + YAML.
- New: config.py (pydantic), train.py (typer entry), src/training/trainer.py
- New: src/models/moe/model.py (MoETransformer with GQA + RoPE + MoE blocks)
- New: configs/moe_tiny.yaml, configs/moe_15b.yaml
- New: scripts/smoke_test.sh, scripts/ep_correctness.{sh,py}
- Auto-download tokenizer from HF model ID on first run
- EP correctness verified: EP=1 vs EP=4 loss matches exactly (0.000000 diff)
- Smoke test passes: 100 steps, loss 12.48 → 7.27, 195k tok/s on 4xH200
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
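The RoPE piece of the MoETransformer's attention mentioned above can be sketched in pure Python. The real model would apply this with torch over (batch, seq, heads, dim) tensors; this scalar version only shows the math. The base of 10000.0 and the even/odd pairwise rotation layout are the standard RoPE conventions, assumed here rather than read from the actual model code.

```python
"""Pure-Python sketch of rotary position embeddings (RoPE).

Each adjacent pair of dimensions is rotated by an angle proportional to
the token position, with a per-pair frequency base**(-i/d)."""
import math


def rope(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # frequency for this dimension pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s          # 2-D rotation of the (x, y) pair
        out[i + 1] = x * s + y * c
    return out


# Key property: dot(rope(q, m), rope(k, n)) depends only on m - n,
# which is what makes attention scores relative-position aware.
q, k = [1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5]
a = sum(x * y for x, y in zip(rope(q, 3), rope(k, 7)))    # offset -4
b = sum(x * y for x, y in zip(rope(q, 13), rope(k, 17)))  # offset -4
print(abs(a - b) < 1e-9)  # True
```

Because rotations are norm-preserving, RoPE changes only the relative phase between queries and keys, not their magnitudes.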
Dead code removed:
- protocols/ (unused model registration system)
- components/loss.py, metrics.py, dataloader.py, train_state.py
- models/utils.py, models/attention.py (absorbed into moe/model.py)
- Profiling and Validation config dataclasses
- FlexAttention, VarlenAttention, build_hf_tokenizer, build_text_validation_dataloader

Structure flattened:
- hf_datasets/ + dataloader.py → src/data.py
- training/trainer.py + train_state.py → src/trainer.py
- tools/logging.py → src/logging.py
- tools/utils.py → src/utils.py
- models/attention.py → absorbed into models/moe/model.py
- Removed models/__init__.py (unused)

Other:
- Logging switched from stdlib logging to logfire
- Wired up torch.compile support (config + trainer)
- Auto-download tokenizer from HF model ID
- Added activation_checkpoint and compile to pydantic config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
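A YAML config after this commit might look like the fragment below. Only `compile` and `activation_checkpoint` are named by the commit itself; every other key, the section layout, and the placeholder model ID are assumptions for illustration.

```yaml
# Hypothetical sketch of a pydantic-validated config (e.g. configs/moe_tiny.yaml).
# Field names other than compile / activation_checkpoint are assumptions.
model:
  name: moe_tiny
training:
  steps: 100
  compile: true               # wires up torch.compile in the trainer
  activation_checkpoint: true
data:
  tokenizer: org/model-id     # placeholder HF model ID, auto-downloaded on first run
```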
Model / kernels:
- Switch attention to explicit flash_attn_func call (FA2 for now, FA3-ready)
- Drop SDPA wrapper, repeat_kv, and attention transposes (flash_attn handles GQA natively)
- Add QuackConfig and RMSNorm wrapper that dispatches to quack.rmsnorm/cross_entropy
- Mutual exclusion check between quack and torch.compile
- Work around quack's stride-0 grad bug in CE backward via _ContiguousGrad

Dataset:
- Drop c4 registry; data.py now takes a local dataset path directly
- Supports HF save_to_disk format and raw parquet/jsonl via load_dataset
- Auto-detect vocab_size from tokenizer (no more hardcoded 151936)
- Move dataset_path to the `data:` section

Eval:
- New EvalConfig with enable / dataset_path / eval_step
- run_eval() iterates full eval set once, computes loss / ppl / top1
- Uses all-reduce MIN of has_batch flag so ranks stop together (FSDP-safe)

Logging & checkpoint:
- Split dump_folder into logging.log_dump and checkpoint.checkpoint_dump
- Rename checkpoint.interval -> checkpoint.checkpoint_step
- New log_step field in logging section
- src/logging.py now writes plain-text train.log alongside logfire console

Pins:
- torch==2.11.0+cu130 via explicit PyTorch index
- flash-attn==2.8.3 prebuilt wheel for cu130/torch2.11/py313
- quack-kernels[cu13]>=0.3.9
- Python ==3.13.11

Scripts:
- make_example_dataset.py generates train + eval example datasets

Verified:
- Smoke test passes (train+eval) with loss/ppl/top1 improving over 100 steps
- EP correctness still matches exactly (EP=1 vs EP=4 diff 0.0)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
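The FSDP-safe stopping rule for run_eval() can be simulated without torch: each rank all-reduces a `has_batch` flag with MIN, so every rank exits the eval loop on the same iteration and no rank blocks in a collective while another has already returned. Below, `torch.distributed.all_reduce(..., op=ReduceOp.MIN)` is replaced by a plain `min()` over per-rank flags; the per-rank batch counts are made up for illustration.

```python
"""Toy single-process simulation of the all-reduce-MIN eval stopping rule."""

def run_eval_sim(batches_per_rank: list[int]) -> list[int]:
    """Return how many eval steps each simulated rank actually runs."""
    n_ranks = len(batches_per_rank)
    steps_done = [0] * n_ranks
    step = 0
    while True:
        # Each rank reports whether it still has a local batch (1) or not (0).
        has_batch = [1 if batches_per_rank[r] > step else 0
                     for r in range(n_ranks)]
        # all-reduce(MIN): if any rank ran out, everyone stops together.
        if min(has_batch) == 0:
            break
        for r in range(n_ranks):
            steps_done[r] += 1
        step += 1
    return steps_done


# Ranks hold 5, 3, and 4 eval batches; all ranks process exactly 3 steps,
# bounded by the smallest shard, so FSDP collectives stay in lockstep.
print(run_eval_sim([5, 3, 4]))  # [3, 3, 3]
```

With a plain `any`-style check instead of MIN, the rank that ran out of data would leave the loop while the others issued another forward pass, and the FSDP all-gathers inside that forward would hang.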
- WALKTHROUGH.md: general codebase tour, reading order, and troubleshooting
- MOE_PARALLELISM.md: detailed guided walk through EP mesh construction, all-to-all dispatch/combine, and how ExpertParallel hooks into MoE.forward

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hi @Acatsama0871! Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!