🚨 fix + tests dense & MoE TP all reduce (decoder only) #43722

Merged: 3outeille merged 188 commits into main from fix-moe-ep on Mar 4, 2026
Conversation

@3outeille (Member) commented Feb 3, 2026

Let's make sure it works for decoder-only models first (we skip VLM + encoder-decoder for now).

Initialization, forward, backward, and generation (with convert mapping triggering) are tested against a TP vs non-TP baseline. The repro script below is launched with `torchrun`, which sets the `RANK` and `WORLD_SIZE` environment variables:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os
from torch.distributed.elastic.multiprocessing.errors import record

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
# model_id = "Qwen/Qwen1.5-MoE-A2.7B-Chat"
# model_id = "Qwen/Qwen3-30B-A3B-Instruct-2507"

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
device = torch.device(f"cuda:{rank}")
# The process group needs to be initialized explicitly so that a `barrier`
# can be used before loading.
torch.distributed.init_process_group(backend="nccl", rank=rank, world_size=world_size, device_id=device)

@record
def main():
    # `tp_plan="auto"` shards the model across all ranks with tensor parallelism.
    model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, tp_plan="auto")
    # model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    messages = [
        {"role": "user", "content": "What do you think about life?"},
    ]
    # `return_dict=True` is needed so that `inputs.input_ids` and `**inputs` work below.
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
    ).to(model.device)
    input_size = inputs.input_ids.shape[-1]
    output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    text = tokenizer.batch_decode(output[:, input_size:])[0]
    print(text)

main()

torch.distributed.destroy_process_group()
(Output screenshot omitted.)

Test runs (result screenshots omitted):
  • `./run_dense_tests.sh results_dense`
  • `./run_moe_tests.sh results_moe`
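For context, the core comparison these tests perform might look like the sketch below (a minimal sketch: the model id, tolerance, and single-process baseline are illustrative assumptions, not the PR's actual test code):

# Minimal sketch of a TP vs non-TP forward comparison (illustrative only;
# the PR's actual tests live in the TensorParallelTesterMixin).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compare_tp_to_baseline(model_id, prompt, atol=1e-2):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")

    # Baseline: single device, no tensor parallelism.
    baseline = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)
    with torch.no_grad():
        ref_logits = baseline(**inputs).logits

    # TP: weights sharded across ranks; the all-reduce at the end of each
    # decoder layer should make every rank reproduce the full logits.
    tp_model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, tp_plan="auto")
    tp_inputs = {k: v.to(tp_model.device) for k, v in inputs.items()}
    with torch.no_grad():
        tp_logits = tp_model(**tp_inputs).logits

    torch.testing.assert_close(tp_logits.cpu(), ref_logits, atol=atol, rtol=0)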

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

…ed GPU management

- Updated `run_dense_tests.sh` and `run_moe_tests.sh` to support parallel execution of tests using available GPU pairs (see the sketch after this list).
- Changed variable names for clarity, replacing `NUM_GPUS` with `GPUS_PER_TEST`.
- Enhanced output messages to reflect the number of parallel test slots and GPU usage.
- Implemented logic to handle skipped tests and updated result reporting to include skipped counts.
- Removed `TensorParallelTesterMixin` from `CausalLMModelTest` and integrated it into `ModelTesterMixin` for better structure in test classes.
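A minimal Python sketch of the GPU-pair scheduling described above (the actual run scripts are bash; slot handling, function names, and the pytest invocation here are illustrative assumptions):

# Rough Python equivalent of the GPU-pair scheduling in the run scripts.
import os
import subprocess
import time

GPUS_PER_TEST = 2  # each test gets a pair of GPUs

def run_tests_in_parallel(test_files, num_gpus):
    # Each slot owns a GPU pair, e.g. "0,1", "2,3", ...
    slots = [
        ",".join(str(g) for g in range(i, i + GPUS_PER_TEST))
        for i in range(0, num_gpus, GPUS_PER_TEST)
    ]
    pending = list(test_files)
    running = {}  # Popen -> slot
    results = {}
    while pending or running:
        # Fill every free slot with the next pending test file.
        free_slots = [s for s in slots if s not in running.values()]
        while pending and free_slots:
            slot, test = free_slots.pop(), pending.pop(0)
            env = {**os.environ, "CUDA_VISIBLE_DEVICES": slot}
            running[subprocess.Popen(["pytest", "-k", "test_tp_", test], env=env)] = slot
        # Reap finished processes; pytest exit code 5 means "no tests
        # collected", which the scripts count as skipped.
        for proc in list(running):
            if proc.poll() is not None:
                results[proc.args[-1]] = proc.returncode
                del running[proc]
        time.sleep(1)
    return results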
@Cyrilvallez (Member) left a comment

Just a few very early thoughts!

Comment thread on run_dense_tests.sh (outdated)
Comment thread on tests/test_tensor_parallel_mixin.py (outdated)
@3outeille changed the base branch from main to fix-ep on February 4, 2026 13:38
@3outeille changed the title from "EP all reduce" to "tests EP all reduce" on Feb 4, 2026
ArthurZucker and others added 11 commits February 4, 2026 13:44
- Modified `run_dense_tests.sh` and `run_moe_tests.sh` to change the pytest keyword from "test_tensor_parallel" to "test_tp_" for improved test targeting (see the naming sketch after this list).
- Cleaned up comments and removed unused code in `test_tensor_parallel_mixin.py` to streamline the testing process and enhance readability.
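For context, `pytest -k` matches test names by substring, so giving the mixin's tests a common `test_tp_` prefix lets the scripts select exactly the tensor-parallel tests. A hypothetical naming sketch (method names are illustrative, not the mixin's real API):

# Hypothetical naming sketch: a shared "test_tp_" prefix makes these tests
# selectable with `pytest -k "test_tp_"` (the real mixin's methods may differ).
class TensorParallelTesterMixin:
    def test_tp_forward(self):
        ...

    def test_tp_backward(self):
        ...

    def test_tp_generate(self):
        ...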
@3outeille changed the title from "tests EP all reduce" to "tests EP all reduce (decoder only)" on Feb 4, 2026
Collaborator

cc @SunMarc this is valid but happy if you can have a look

Member

SG !

Comment thread on src/transformers/models/longcat_flash/modular_longcat_flash.py (outdated)
@ArthurZucker changed the title from "fix + tests dense & MoE TP all reduce (decoder only)" to "🚨 fix + tests dense & MoE TP all reduce (decoder only)" on Mar 3, 2026
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@3outeille (Member, Author) commented:

run-slow: apertus, deepseek_v2, deepseek_v3, dots1, ernie4_5_moe, exaone4, exaone_moe, flex_olmo, gemma3, gemma3n, glm4_moe, glm4_moe_lite, glm_moe_dsa


github-actions Bot commented Mar 3, 2026

Workflow Run ⚙️💔 This comment contains run-slow, but an unknown error occurred and the workflow run was aborted!


github-actions Bot commented Mar 3, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: apertus, deepseek_v2, deepseek_v3, dots1, ernie4_5_moe, exaone4, exaone_moe, flex_olmo, gemma3, gemma3n, glm4_moe, glm4_moe_lite, glm_moe_dsa

@3outeille (Member, Author) commented:

run-slow: apertus, deepseek_v2, deepseek_v3, dots1, ernie4_5_moe, exaone4, exaone_moe, flex_olmo, gemma3, gemma3n, glm4_moe, glm4_moe_lite, glm_moe_dsa


github-actions Bot commented Mar 3, 2026

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/apertus", "models/deepseek_v2", "models/deepseek_v3", "models/dots1", "models/ernie4_5_moe", "models/exaone4", "models/exaone_moe", "models/flex_olmo", "models/gemma3", "models/gemma3n", "models/glm4_moe", "models/glm4_moe_lite", "models/glm_moe_dsa"]
quantizations: []

@3outeille (Member, Author) commented:

run-slow: apertus, deepseek_v2, deepseek_v3, dots1, ernie4_5_moe, exaone4, exaone_moe, flex_olmo, gemma3, gemma3n, glm4_moe, glm4_moe_lite, glm_moe_dsa


github-actions Bot commented Mar 3, 2026

CI Results

Workflow Run ⚙️

Commit Info

Context  Commit    Description
RUN      176c5137  workflow commit (merge commit)
PR       ebc29a8e  branch commit (from PR)
main     5c1c72be  base commit (on main)

⚠️ No tests reported (jobs were skipped or cancelled)!


github-actions Bot commented Mar 3, 2026

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/apertus", "models/deepseek_v2", "models/deepseek_v3", "models/dots1", "models/ernie4_5_moe", "models/exaone4", "models/exaone_moe", "models/flex_olmo", "models/gemma3", "models/gemma3n", "models/glm4_moe", "models/glm4_moe_lite", "models/glm_moe_dsa"]
quantizations: []


github-actions Bot commented Mar 3, 2026

CI Results

Workflow Run ⚙️

Commit Info

Context  Commit    Description
RUN      176c5137  workflow commit (merge commit)
PR       ebc29a8e  branch commit (from PR)
main     5c1c72be  base commit (on main)

Model CI Report

3 new failed tests from this PR 😭

  • deepseek_v3:
    tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_eager_matches_sdpa_generate (❌ ⟹ ❌)

  • glm4_moe:
    tests/models/glm4_moe/test_modeling_glm4_moe.py::Glm4MoeIntegrationTest::test_compile_static_cache (❌ ⟹ ❌)

  • glm4_moe_lite:
    tests/models/glm4_moe_lite/test_modeling_glm4_moe_lite.py::Glm4MoeIntegrationTest::test_compile_static_cache (❌ ⟹ ❌)

@3outeille enabled auto-merge (squash) on March 4, 2026 06:56
@3outeille disabled auto-merge on March 4, 2026 06:56
@3outeille merged commit f49c720 into main on Mar 4, 2026
27 of 28 checks passed
@3outeille deleted the fix-moe-ep branch on March 4, 2026 08:57

github-actions Bot commented Mar 4, 2026

CI Results

Workflow Run ⚙️

Commit Info

Context  Commit    Description
RUN      176c5137  workflow commit (merge commit)
PR       ebc29a8e  branch commit (from PR)
main     5c1c72be  base commit (on main)

⚠️ No tests reported (jobs were skipped or cancelled)!
