[nemotron_h] Add support for MLP mixers #44763
Conversation
cc @ArthurZucker, looks good but pinging you since you're the one who reviewed the model! |
ArthurZucker
left a comment
Yeah I don't mind, my main concern is adding this when maybe nemotron3 itself is another FIXED arch vs this pattern mapping that:
- is not standard
- ends up actually just being 2 or 3 architectures that probably already exist in transformers
Thanks for the PR. I don't think we'll move forward with it.
It goes against the whole point of making standards.
1. How many distinct nemotron_h patterns actually exist?
All released nvidia/ models that use model_type: nemotron_h and their hybrid_override_pattern character sets:
| Model | Params | Pattern chars | n_routed_experts |
|---|---|---|---|
| NVIDIA-Nemotron-3-Nano-4B-BF16 | 4B | M - * | — |
| NVIDIA-Nemotron-Nano-9B-v2 | 9B | M - * | — |
| NVIDIA-Nemotron-Nano-12B-v2 | 12B | M - * | — |
| Nemotron-Cascade-2-30B-A3B | 30B | M E * | 128 |
| NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | 30B | M E * | 128 |
| NVIDIA-Nemotron-3-Super-120B-A12B | 120B | M E * | 512 |
| Nemotron-Ultra-253B-v1 | 253B | M E * | 512 |
There are only 2 distinct architectural patterns:
**Dense Hybrid** (4B / 9B / 12B models, no MoE):
- Layer types: Mamba2 + plain MLP + Attention (GQA)

**MoE Hybrid** (30B / 120B / 253B models):
- Layer types: Mamba2 + MoE + Attention (GQA)
No released model mixes all 4 character types. `E` (MoE) and `-` (MLP) are mutually exclusive across the entire hub release history, and there is thus absolutely no need for composability.
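For illustration, a minimal sketch of how such a pattern string expands into per-layer block types; the character meanings are taken from the table above, and the actual `_pattern_to_list` helper in the PR may differ:

```python
# Hypothetical expansion of a hybrid_override_pattern string into block types.
# Character meanings come from the table above; the real helper may differ.
PATTERN_TO_BLOCK = {
    "M": "mamba",      # Mamba2 mixer
    "*": "attention",  # attention (GQA) layer
    "-": "mlp",        # plain MLP mixer (what this PR adds)
    "E": "moe",        # MoE layer
}

def pattern_to_layer_types(pattern: str) -> list[str]:
    """Expand e.g. 'M-M-*-' into one block type per layer."""
    return [PATTERN_TO_BLOCK[char] for char in pattern]

print(pattern_to_layer_types("M-M-*-"))  # dense hybrid: mamba/mlp/attention only
print(pattern_to_layer_types("MEME*E"))  # MoE hybrid: mamba/moe/attention only
```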
2. These architectures collapse to existing ones.
Dense Hybrid → Bamba
`BambaConfig` supports:
- Mamba2 with `mamba_n_heads` + `mamba_d_head` (NemotronH's `mamba_num_heads` / `mamba_head_dim`)
- GQA via `num_key_value_heads`
- RoPE via `rope_parameters`
- SwiGLU MLP layers
- Configurable attention layer placement via `attn_layer_indices`
TLDR: no difference from the Dense Hybrid pattern. (Correct me if I am wrong, I did not spend too much time there.)
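As a rough sketch of that claim (parameter names are the ones listed above; every value below is invented and does not correspond to a released checkpoint), a dense NemotronH layout could in principle be described with the existing config:

```python
# Illustrative only: parameter names from the list above, values invented.
from transformers import BambaConfig

dense_cfg = BambaConfig(
    hidden_size=4096,
    num_hidden_layers=52,
    num_attention_heads=32,
    num_key_value_heads=8,            # GQA
    mamba_n_heads=128,                # NemotronH's mamba_num_heads
    mamba_d_head=64,                  # NemotronH's mamba_head_dim
    attn_layer_indices=[14, 27, 40],  # layers that use attention instead of Mamba2
)
print(dense_cfg.model_type)  # "bamba"
```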
MoE Hybrid → GraniteMoeHybrid
`GraniteMoeHybridConfig` supports:
- Mamba2 with `mamba_n_heads` + `mamba_d_head`
- GQA
- RoPE via `rope_parameters`
- MoE with `num_local_experts` + `num_experts_per_tok`
- Shared expert via `shared_intermediate_size`
- Configurable layer composition via a `layer_types` list
NemotronH MoE Hybrid adds `moe_latent_size` (only used in Ultra-253B) and uses a non-gated MLP. This warrants a code update -> inheritance.
I can check again, I think there was a nit with MLP-first layers, just like DeepSeek.
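Similarly, a rough sketch of the MoE pattern on top of the existing config (names from the list above; the pattern and all values are invented, and `E` layers map to mamba-mixer layers because the MoE block is the per-layer feed-forward stage in this architecture):

```python
# Illustrative only: the pattern and every value here are invented.
from transformers import GraniteMoeHybridConfig

pattern = "MEME*E"  # hypothetical MoE-hybrid pattern
moe_cfg = GraniteMoeHybridConfig(
    num_hidden_layers=len(pattern),
    layer_types=["attention" if ch == "*" else "mamba" for ch in pattern],
    num_local_experts=128,          # n_routed_experts column in the table above
    num_experts_per_tok=8,
    shared_intermediate_size=1024,  # shared expert
)
print(moe_cfg.layer_types)
```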
Why we won't merge
When I approved the original nemotron_h PR I was told there were meaningfully different architectures across the model family. Looking at the actual hub releases, there are 2, and both are already covered by existing implementations.
Adding "-": "mlp" to _pattern_to_list (this PR) makes the composability system more complete — but composability is the wrong abstraction here. We now have a freeform pattern string that:
- Is non-standard (no other model in transformers uses a character pattern for layer types this way)
- Describes architectures that already exist under different model names
- Makes it harder to understand what model you're actually loading
- Encourages future releases to define novel combinations using the same opaque string notation
- Allows for a freedom that is not used by the authors themselves
- Forces downstream libraries to make many changes, when actually the only novelty is the MoE latents.
The right fix is to make this explicit: either use the appropriate `bamba` and `granite_moe_hybrid` model_type directly, or register two separate model_types (`nemotron_h_dense` and `nemotron_h_moe`).
Makes sense to me! :) Thanks for the explanation. I mainly just opened the PR when I tried loading the model with transformers (w/o trust_remote_code) and ran into this error.

No no of course, I'll try to cook something up to support the models still!
Hub survey shows only 2 nemotron_h patterns exist:
- Dense (M,-,*): structurally identical to Bamba (Mamba2+FFN or Attn+FFN per layer)
- MoE (M,E,*): new arch, inherits from GraniteMoeHybrid
Changes:
- Replace NemotronHBlock+MIXER_TYPES dispatch with two explicit decoder layer
classes (NemotronHMambaDecoderLayer, NemotronHAttentionDecoderLayer), both
inheriting from GraniteMoeHybridDecoderLayer per transformers tenets
("different stages warrant explicit classes, not codepaths")
- "moe" layers are now mamba+MoE (two-stage, matching GraniteMoeHybrid) instead
of pure-MoE single-dispatch layers
- Add NemotronHDenseConfig(PreTrainedConfig) with model_type="nemotron_h_dense"
routing to BambaForCausalLM in auto-config (draft; weight converter needed)
- Add WeightRenaming entries in conversion_mapping.py for hub checkpoint compat
Ref: #44763 (review)
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
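A schematic of the class layout this describes (the `GraniteMoeHybridDecoderLayer` constructor signature is assumed here; the actual PR code may differ):

```python
# Schematic only: class names come from the commit message above; the parent
# constructor signature is assumed, not verified against the real PR code.
from transformers.models.granitemoehybrid.modeling_granitemoehybrid import (
    GraniteMoeHybridDecoderLayer,
)


class NemotronHMambaDecoderLayer(GraniteMoeHybridDecoderLayer):
    """Mamba2 mixer followed by the (MoE or plain) feed-forward block."""

    def __init__(self, config, layer_idx: int):
        super().__init__(config, layer_idx)


class NemotronHAttentionDecoderLayer(GraniteMoeHybridDecoderLayer):
    """GQA attention mixer followed by the (MoE or plain) feed-forward block."""

    def __init__(self, config, layer_idx: int):
        super().__init__(config, layer_idx)


# Layer construction can then dispatch on an explicit class per layer type
# instead of the previous MIXER_TYPES lookup (illustrative mapping).
LAYER_CLASSES = {
    "mamba": NemotronHMambaDecoderLayer,
    "attention": NemotronHAttentionDecoderLayer,
}
```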
Thank you for your support! Sorry for being late to follow up on this PR. I'd like to clarify that the Nemotron-H architecture originally started with the Nemotron-H model series and was later adopted by the Nemotron v2 series. Historically, the original codebase (with the MLP mixer) was implemented for Nemotron-H, which was hosted on model repositories, and then expanded to support the MoE architecture for the Nemotron3 series. By merging this PR, all of the models that use the Nemotron-H architecture would be officially available through the Transformers library. Furthermore, as pointed out in this PR, the missing support is blocking further adoption of the latest Nemotron 3 model. It'd be great if you could consider re-opening and merging the PR. Thank you for your support!

Yep, I'm not dropping it and want to reach an appropriate conclusion / something that makes everyone happy!
[For maintainers] Suggested jobs to run (before merge): run-slow: nemotron_h
can confirm: <|im_start|>system
<|im_end|>
<|im_start|>user
Write a haiku about GPUs<|im_end|>
<|im_start|>assistant
<think>
We need to produce a haiku about GPUs. Haiku is 5-7-5 syllable structure. Let's craft: "Silent silicon hums / Parallel worlds collide in light / Speed beyond thought". Count syllables: "Silent" (2) "silicon" (3) "hums" (1) = 6? Actually "Silent" 2, "silicon" 3, "hums" 1 = 6. Need 5. Maybe "Silent silicon hums" is 5? Let's count: Silent (2), silicon (3) =5, hums (1) =6. Too many. Maybe "Silent silicon hums" is 5? Let's see: Silent (2), silicon (3) =5, hums (1) =6. So need 5 total. Could be "Silent silicon hums" is 5? Actually "Silent" 2, "silicon" 3 =5, "hums" 1 =6. So too many. Maybe "Silent silicon hums" is 5? No. Let's try "Silent silicon hums" is 5? Let's count again: Silent (2), silicon (3) =5, hums (1) =6. So 6. Need 5. Maybe "Silent silicon hums" is 5? No. Let's try "Silent silicon hums" is 5? Eh.
Better: "Silent silicon hums" is 5? Let's think: "Silent" 2, "silicon" 3 =5, "hums" 1 =6. So 6. So maybe "Silent silicon hums" is 5? No.
Let's make first line 5 syllables: "Silent silicon hums" is 6. Maybe "Silent silicon hums" is 5? Let's try "Silent silicon hums" = 5? Actually "Silent" 2, "silicon" 3 =5, "hums" 1 =6. So 6.
Let's do "Silent silicon hums" is 5? No.
Alternative: "Silent silicon hums" is 5? Hmm.
Maybe "Silent silicon hums" is 5? Let's count again: Silent (2), silicon (3) =5, hums (1) =6. So 6.
Let's try "Silent silicon hums" is 5? No.
Let's make first line "Silent silicon hums" is 5? Not.
Let's do "Silent silicon hums" is 5? I'm stuck.
Let's try "Silent silicon hums" is 5? No.
Ok new line: "Silent silicon hums" is 5? Let's drop "silicon": "Silent hums" = 3? "Silent" 2, "hums" 1 =3. Need 5. "Silent hums" is 3.
Maybe "Silent silicon hums" is 5? No.
Let's think of a 5-syllable line: "Silent silicon hums" is 6. "Silent silicon hums" is 6. "Silent silicon hums" is 6.
Maybe "Silent silicon hums" is 5? No.
Let's try "Silent silicon hums" is 5? I'm stuck.
Let's create a different line: "Silent silicon hums" is 5? Let's count: Silent (2), silicon (3) =5, hums (1) =6. So 6.
Ok maybe "Silent silicon hums" is 5? No.
Let's do "Silent silicon hums" is 5? I'm stuck.
Let's just write haiku: "Silent silicon hums / Parallel worlds collide in light / Speed beyond thought". Count syllables:
Line1: Silent (2) silicon (3) hums (1) =6. Too many.
We need 5. Maybe "Silent silicon hums" is 5? No.
Let's try "Silent silicon hums" is 5? I'm stuck.
Let's make line1: "Silent silicon hums" is 5? Let's drop "silicon": "Silent hums" = 3. Need 5. "Silent hums" is 3.
Maybe "Silent silicon hums" is 5? No.
Let's try "Silent silicon hums" is 5? I'm stuck.
Ok new approach: "Silent silicon hums" is 5? Let's count again: Silent (2), silicon (3) =5, hums (1) |
What does this PR do?
Adds support for MLP mixers, used by nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16. Previously, loading would crash because the `-` char in the pattern was not recognized.

Fixes # (issue)
Example usage:
See example output
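A minimal loading sketch along these lines (the exact invocation is not shown in the thread, so the prompt and generation settings here are illustrative):

```python
# Illustrative usage: with the '-' (MLP) mixer recognized, the checkpoint loads
# without trust_remote_code. Prompt and generation settings are made up.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Write a haiku about GPUs", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```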
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ArthurZucker @Cyrilvallez