
fix backward pass test in tensor parallel for Dense model #42811

Merged
3outeille merged 2 commits into v5-test_tensor_parallel_moe from fix-test_tensor_parallel_backward_dense on Dec 11, 2025

Conversation

@3outeille
Member

No description provided.

@3outeille 3outeille changed the base branch from main to v5-test_tensor_parallel_moe December 11, 2025 14:06
@3outeille 3outeille merged commit 5f548ed into v5-test_tensor_parallel_moe Dec 11, 2025
12 of 17 checks passed
@3outeille 3outeille deleted the fix-test_tensor_parallel_backward_dense branch December 11, 2025 14:13
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker pushed a commit that referenced this pull request Dec 16, 2025
ArthurZucker pushed a commit that referenced this pull request Dec 17, 2025
* Add ColwiseParallelReplicate and RowwiseParallelReplicate classes for replicated layouts
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
3outeille added a commit that referenced this pull request Jan 30, 2026
* begin MoE test tensor parallel

* create tiny moe model + fix test tensor parallel MoE

fix tensor parallel MoE test

* fix backward pass test in tensor parallel for Dense model (#42811) (a parity-test sketch follows this group)

* fix

* linting

* use mixtral instead for testing

* fix dtensor and tensor mismatch

* linting

* checkout test tensor parallel to be like main

* avoid hack and create class instead
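
The group above boils down to checking that gradients under tensor parallelism match a single-process run. A minimal sketch of such a backward-parity check, assuming Hugging Face-style models with a `.logits` output (illustrative only, not the exact test in this PR):

```python
# Sketch of a backward-parity check between a TP-sharded model and a
# single-process reference model. Names and the loss are assumptions.
import torch

def assert_grads_close(tp_model, ref_model, batch):
    # Run the same scalar loss through both models.
    tp_model(**batch).logits.float().pow(2).mean().backward()
    ref_model(**batch).logits.float().pow(2).mean().backward()
    for (name, p_tp), (_, p_ref) in zip(
        tp_model.named_parameters(), ref_model.named_parameters()
    ):
        grad = p_tp.grad
        # DTensor gradients must be gathered into a plain tensor before comparing.
        if hasattr(grad, "full_tensor"):
            grad = grad.full_tensor()
        torch.testing.assert_close(grad, p_ref.grad, msg=name)
```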

* fix loading ep

* add moe test

* now EP inference works again but the backward pass still fails

* linting

* now load from checkpoint: creating an nn.Parameter from param_value will not transfer its attributes (especially _is_hf_initialized)
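
The attribute issue is easy to reproduce: wrapping a tensor in `nn.Parameter` builds a new object, so python attributes set on the original tensor (such as the `_is_hf_initialized` flag used during loading) are silently dropped. A toy illustration:

```python
import torch
from torch import nn

t = torch.zeros(4)
t._is_hf_initialized = True  # marker attribute set on the raw tensor

p = nn.Parameter(t)  # wrapping creates a new tensor object
print(hasattr(p, "_is_hf_initialized"))  # False: the attribute did not transfer
```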

* forward now works (add LocalPackedColwise + don't use EP router)

* for now test in float32

* don't do the all_reduce manually for GatherParallel; convert to the DTensor approach
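
For context, the "manual" pattern being replaced looks roughly like an all_reduce in a forward hook (a sketch; the real GatherParallel internals in transformers may differ):

```python
import torch.distributed as dist

def _all_reduce_output(module, inputs, output):
    # Sum the partial outputs produced by each tensor-parallel rank.
    dist.all_reduce(output, op=dist.ReduceOp.SUM)
    return output

# registered on the layer whose output is partial on each rank:
# layer.register_forward_hook(_all_reduce_output)
```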

* Remove dtensor dependency in Tensor Parallel (#43157)

* dense test is passing

* Refactor tensor parallel implementation by removing unused partition_tensor methods

* keep removing dependencies on Dtensor

* rename test file

* Update tensor parallel plans to use "colwise_gather_output" across multiple models

* Remove unused "gather" references and update tensor parallel plans to "colwise_gather_output" in multiple model configurations.

* Refactor tensor parallel plans in Fbgemm and FineGrained quantizers by removing unused configurations and comments related to "gather" operations.

* add 'split_input' option in RowwiseParallel + replace rowwise_replicate with 'rowwise_split_input'
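
The idea behind a rowwise layer with split input, as a sketch (function name and shapes are illustrative, not the transformers API):

```python
import torch
import torch.distributed as dist

def rowwise_linear(x_shard, w_shard, bias=None):
    # x_shard: [..., in_features // world_size], input split along the last dim
    # w_shard: [out_features, in_features // world_size]
    y = torch.nn.functional.linear(x_shard, w_shard)  # partial sums on each rank
    dist.all_reduce(y, op=dist.ReduceOp.SUM)          # combine partials into the full output
    if bias is not None:
        y = y + bias  # add the bias once, after the reduction
    return y
```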

* Add PackedColwiseParallel and PackedRowwiseParallel + Update configuration plans

* fixing files and some fixes for tp and tp_plan

* clean tensor parallel API

* linting

* linting

* Refactor core model loading and tensor parallel utilities: improve parameter handling in `set_param_for_module`, update the tensor sharding functions, remove deprecated code, and add utility functions for block size calculations

* code quality

* make fixup

* tp works for dense and moe in float32 only

* fix merge conflicts that broke TP

* revert parsing for tp plan

* all reduce after experts

* compile compatible dist ops

* fix gate_up_proj gradient test by doing a split that takes into account that it is fused + all_reduce to get the full gradient before functional.linear
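
A sketch of why the fused weight needs special handling: a plain chunk over the fused dimension would give some ranks only gate rows and others only up rows, so each half is split separately (layout assumption: gate rows stacked on top of up rows):

```python
import torch

def shard_fused_gate_up(weight, rank, world_size):
    # weight: [2 * intermediate_size, hidden_size]
    gate, up = weight.chunk(2, dim=0)           # undo the fusion first
    return torch.cat(
        [gate.chunk(world_size, dim=0)[rank],   # this rank's slice of the gate half
         up.chunk(world_size, dim=0)[rank]],    # the matching slice of the up half
        dim=0,
    )
```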

* fix moe backward fp32

* remove functional.linear and use nn.Linear in experts (this way we can attach hooks)

* moe works with tied embeddings as well

* style

* all tests pass

* make fix-up

* typo

* use the transformers seed + pytest parametrize
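
As a sketch of that test setup (the parameter grid is a placeholder):

```python
import pytest
from transformers import set_seed

@pytest.mark.parametrize("dtype", ["float32"])  # placeholder values
def test_tensor_parallel_backward(dtype):
    set_seed(42)  # transformers seed helper: TP and reference runs get identical init
    ...
```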

* Moved weight and bias dim mapping to ParallelInterface

* simplified shard_tensor signature

* sync shard_tensor logic with the one in origin/main

* add a function check to avoid a mismatch check during set_param_for_module

* remove disable; I was on an older torch version

* Add pytest skip condition for tensor parallel tests requiring PyTorch >= 2.9
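
Such a skip condition could look like this (a sketch; the exact marker in the PR may differ):

```python
import pytest
import torch
from packaging import version

requires_torch_2_9 = pytest.mark.skipif(
    version.parse(torch.__version__) < version.parse("2.9"),
    reason="tensor parallel tests require PyTorch >= 2.9",
)
```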

* linting

* linting

* fixing remaining modular

* linting

* Refactor get_expected_sharded_shape to be only one call

* Remove redundant prepare_module_tp method from TensorParallelLayer subclasses

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Arthur <arthur.zucker@gmail.com>