🚨 fix + tests dense & MoE TP all reduce (decoder only) #43722

Merged: 3outeille merged 188 commits into main from fix-moe-ep on Mar 4, 2026
Conversation

@3outeille (Member) commented Feb 3, 2026

Let's make sure it works for decoder-only models first (we skip VLM + encoder-decoder for now).

Initialization, forward, backward, and generation (with convert mapping triggering) are tested against a TP vs non-TP baseline. The repro script below is launched with `torchrun`, which sets the `RANK` and `WORLD_SIZE` environment variables:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os
from torch.distributed.elastic.multiprocessing.errors import record

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
# model_id = "Qwen/Qwen1.5-MoE-A2.7B-Chat"
# model_id = "Qwen/Qwen3-30B-A3B-Instruct-2507"

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
device = torch.device(f"cuda:{rank}")
# The process group needs to be initialized explicitly so that a `barrier`
# can be used before loading.
torch.distributed.init_process_group(backend="nccl", rank=rank, world_size=world_size, device_id=device)

@record
def main():
    # `tp_plan="auto"` shards the model across all ranks with tensor parallelism.
    model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, tp_plan="auto")
    # model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    messages = [
        {"role": "user", "content": "What do you think about life?"},
    ]
    # `return_dict=True` is needed so that `inputs.input_ids` and `**inputs` work below.
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
    ).to(model.device)
    input_size = inputs.input_ids.shape[-1]
    output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    text = tokenizer.batch_decode(output[:, input_size:])[0]
    print(text)

main()

torch.distributed.destroy_process_group()
(Output screenshot omitted.)

Test runs (result screenshots omitted):
  • `./run_dense_tests.sh results_dense`
  • `./run_moe_tests.sh results_moe`
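For context, the core comparison these tests perform might look like the sketch below (a minimal sketch: the model id, tolerance, and single-process baseline are illustrative assumptions, not the PR's actual test code):

# Minimal sketch of a TP vs non-TP forward comparison (illustrative only;
# the PR's actual tests live in the TensorParallelTesterMixin).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compare_tp_to_baseline(model_id, prompt, atol=1e-2):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")

    # Baseline: single device, no tensor parallelism.
    baseline = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)
    with torch.no_grad():
        ref_logits = baseline(**inputs).logits

    # TP: weights sharded across ranks; the all-reduce at the end of each
    # decoder layer should make every rank reproduce the full logits.
    tp_model = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16, tp_plan="auto")
    tp_inputs = {k: v.to(tp_model.device) for k, v in inputs.items()}
    with torch.no_grad():
        tp_logits = tp_model(**tp_inputs).logits

    torch.testing.assert_close(tp_logits.cpu(), ref_logits, atol=atol, rtol=0)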

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

…ed GPU management

- Updated `run_dense_tests.sh` and `run_moe_tests.sh` to support parallel execution of tests using available GPU pairs (see the sketch after this list).
- Changed variable names for clarity, replacing `NUM_GPUS` with `GPUS_PER_TEST`.
- Enhanced output messages to reflect the number of parallel test slots and GPU usage.
- Implemented logic to handle skipped tests and updated result reporting to include skipped counts.
- Removed `TensorParallelTesterMixin` from `CausalLMModelTest` and integrated it into `ModelTesterMixin` for better structure in test classes.
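A minimal Python sketch of the GPU-pair scheduling described above (the actual run scripts are bash; slot handling, function names, and the pytest invocation here are illustrative assumptions):

# Rough Python equivalent of the GPU-pair scheduling in the run scripts.
import os
import subprocess
import time

GPUS_PER_TEST = 2  # each test gets a pair of GPUs

def run_tests_in_parallel(test_files, num_gpus):
    # Each slot owns a GPU pair, e.g. "0,1", "2,3", ...
    slots = [
        ",".join(str(g) for g in range(i, i + GPUS_PER_TEST))
        for i in range(0, num_gpus, GPUS_PER_TEST)
    ]
    pending = list(test_files)
    running = {}  # Popen -> slot
    results = {}
    while pending or running:
        # Fill every free slot with the next pending test file.
        free_slots = [s for s in slots if s not in running.values()]
        while pending and free_slots:
            slot, test = free_slots.pop(), pending.pop(0)
            env = {**os.environ, "CUDA_VISIBLE_DEVICES": slot}
            running[subprocess.Popen(["pytest", "-k", "test_tp_", test], env=env)] = slot
        # Reap finished processes; pytest exit code 5 means "no tests
        # collected", which the scripts count as skipped.
        for proc in list(running):
            if proc.poll() is not None:
                results[proc.args[-1]] = proc.returncode
                del running[proc]
        time.sleep(1)
    return results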
@Cyrilvallez (Member) left a comment

Just a few very early thoughts!

Comment thread on run_dense_tests.sh (outdated)
Comment thread on tests/test_tensor_parallel_mixin.py (outdated)
@3outeille changed the base branch from main to fix-ep on February 4, 2026 13:38
@3outeille changed the title from "EP all reduce" to "tests EP all reduce" on Feb 4, 2026
ArthurZucker and others added 11 commits February 4, 2026 13:44
- Modified `run_dense_tests.sh` and `run_moe_tests.sh` to change the pytest keyword from "test_tensor_parallel" to "test_tp_" for improved test targeting (see the naming sketch after this list).
- Cleaned up comments and removed unused code in `test_tensor_parallel_mixin.py` to streamline the testing process and enhance readability.
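For context, `pytest -k` matches test names by substring, so giving the mixin's tests a common `test_tp_` prefix lets the scripts select exactly the tensor-parallel tests. A hypothetical naming sketch (method names are illustrative, not the mixin's real API):

# Hypothetical naming sketch: a shared "test_tp_" prefix makes these tests
# selectable with `pytest -k "test_tp_"` (the real mixin's methods may differ).
class TensorParallelTesterMixin:
    def test_tp_forward(self):
        ...

    def test_tp_backward(self):
        ...

    def test_tp_generate(self):
        ...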
@3outeille changed the title from "tests EP all reduce" to "tests EP all reduce (decoder only)" on Feb 4, 2026
Collaborator

cc @SunMarc this is valid but happy if you can have a look

Member

SG !

Comment thread on src/transformers/models/longcat_flash/modular_longcat_flash.py (outdated)
@ArthurZucker changed the title from "fix + tests dense & MoE TP all reduce (decoder only)" to "🚨 fix + tests dense & MoE TP all reduce (decoder only)" on Mar 3, 2026
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@3outeille (Member, Author) commented:

run-slow: apertus, deepseek_v2, deepseek_v3, dots1, ernie4_5_moe, exaone4, exaone_moe, flex_olmo, gemma3, gemma3n, glm4_moe, glm4_moe_lite, glm_moe_dsa


github-actions Bot commented Mar 3, 2026

Workflow Run ⚙️💔 This comment contains run-slow, but an unknown error occurred and the workflow run was aborted!


github-actions Bot commented Mar 3, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: apertus, deepseek_v2, deepseek_v3, dots1, ernie4_5_moe, exaone4, exaone_moe, flex_olmo, gemma3, gemma3n, glm4_moe, glm4_moe_lite, glm_moe_dsa

@3outeille (Member, Author) commented:

run-slow: apertus, deepseek_v2, deepseek_v3, dots1, ernie4_5_moe, exaone4, exaone_moe, flex_olmo, gemma3, gemma3n, glm4_moe, glm4_moe_lite, glm_moe_dsa


github-actions Bot commented Mar 3, 2026

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/apertus", "models/deepseek_v2", "models/deepseek_v3", "models/dots1", "models/ernie4_5_moe", "models/exaone4", "models/exaone_moe", "models/flex_olmo", "models/gemma3", "models/gemma3n", "models/glm4_moe", "models/glm4_moe_lite", "models/glm_moe_dsa"]
quantizations: []

@3outeille (Member, Author) commented:

run-slow: apertus, deepseek_v2, deepseek_v3, dots1, ernie4_5_moe, exaone4, exaone_moe, flex_olmo, gemma3, gemma3n, glm4_moe, glm4_moe_lite, glm_moe_dsa


github-actions Bot commented Mar 3, 2026

CI Results

Workflow Run ⚙️

Commit Info

Context  Commit    Description
RUN      176c5137  workflow commit (merge commit)
PR       ebc29a8e  branch commit (from PR)
main     5c1c72be  base commit (on main)

⚠️ No tests reported (jobs were skipped or cancelled)!


github-actions Bot commented Mar 3, 2026

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/apertus", "models/deepseek_v2", "models/deepseek_v3", "models/dots1", "models/ernie4_5_moe", "models/exaone4", "models/exaone_moe", "models/flex_olmo", "models/gemma3", "models/gemma3n", "models/glm4_moe", "models/glm4_moe_lite", "models/glm_moe_dsa"]
quantizations: []


github-actions Bot commented Mar 3, 2026

CI Results

Workflow Run ⚙️

Commit Info

Context  Commit    Description
RUN      176c5137  workflow commit (merge commit)
PR       ebc29a8e  branch commit (from PR)
main     5c1c72be  base commit (on main)

Model CI Report

3 new failed tests from this PR 😭

  • deepseek_v3:
    tests/models/deepseek_v3/test_modeling_deepseek_v3.py::DeepseekV3ModelTest::test_eager_matches_sdpa_generate (❌ ⟹ ❌)

  • glm4_moe:
    tests/models/glm4_moe/test_modeling_glm4_moe.py::Glm4MoeIntegrationTest::test_compile_static_cache (❌ ⟹ ❌)

  • glm4_moe_lite:
    tests/models/glm4_moe_lite/test_modeling_glm4_moe_lite.py::Glm4MoeIntegrationTest::test_compile_static_cache (❌ ⟹ ❌)

@3outeille enabled auto-merge (squash) on March 4, 2026 06:56
@3outeille disabled auto-merge on March 4, 2026 06:56
@3outeille merged commit f49c720 into main on Mar 4, 2026
27 of 28 checks passed
@3outeille deleted the fix-moe-ep branch on March 4, 2026 08:57

github-actions Bot commented Mar 4, 2026

CI Results

Workflow Run ⚙️

Commit Info

Context  Commit    Description
RUN      176c5137  workflow commit (merge commit)
PR       ebc29a8e  branch commit (from PR)
main     5c1c72be  base commit (on main)

⚠️ No tests reported (jobs were skipped or cancelled)!
