
[nemotron_h] Add support for MLP mixers #44763

Merged
ArthurZucker merged 11 commits into main from nemotron-h-fixes
Apr 22, 2026

Conversation

@xenova
Contributor

@xenova xenova commented Mar 16, 2026

What does this PR do?

Adds support for MLP mixers, used by nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16. Previously, loading this model would crash because the `-` character in the layer pattern was not recognized.

Fixes # (issue)
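For context, here is a minimal sketch of the kind of character-to-layer-type lookup involved (the dictionary and helper names below are illustrative assumptions, not the library's actual implementation):

```python
# Illustrative sketch only: map each character of a hybrid_override_pattern
# string to a layer type, in the spirit of the `_pattern_to_list` lookup this
# PR extends with "-". Names below are hypothetical.
PATTERN_TO_LAYER = {
    "M": "mamba",
    "*": "attention",
    "E": "moe",
    "-": "mlp",  # the character this PR adds support for
}

def pattern_to_layer_types(pattern: str) -> list[str]:
    try:
        return [PATTERN_TO_LAYER[char] for char in pattern]
    except KeyError as exc:
        # Before this change, an unrecognized character such as "-" failed here.
        raise ValueError(f"Unknown layer character: {exc}") from exc

print(pattern_to_layer_types("M-M-*-"))
# ['mamba', 'mlp', 'mamba', 'mlp', 'attention', 'mlp']
```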

Example usage:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16")
model = AutoModelForCausalLM.from_pretrained(
    "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
messages = [
    {"role": "user", "content": "Write a haiku about GPUs"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,  # return a dict of tensors so it can be unpacked into generate()
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0]))
Example output:
<|im_start|>system
<|im_end|>
<|im_start|>user
Write a haiku about GPUs<|im_end|>
<|im_start|>assistant
<think>
We need to produce a haiku about GPUs. Haiku is 5-7-5 syllable structure. Let's craft: "Silent silicon hums" (5? Silent (2) si-licon (2) hums (1) =5). Next line 7 syllables: "Lightning speed across the screen" Count: Light-ning (2) speed (1) a-cross (2) the (1) screen (1) =7? Let's count: Lightning (2), speed (1) =3, across (2) =5, the (1)=6, screen (1)=7. Good. Third line 5 syllables: "Pixels dance bright". Count: Pi-cles (2) dance (1) bright (1) =4. Need 5. Maybe "Pixels dance, bright". That's 5? Pixels (2) dance (1) bright (1) =4. Hmm. "Pixels dance, bright" still 4. Maybe "Pixels dance, so bright". Count: Pixels (2) dance (1) so (1) bright (1) =5. Good. So haiku:

Silent silicon hums  
Lightning speed across the screen  
Pixels dance, so bright

Check syllable: "Pixels dance, so bright" = Pixels (2) dance (1) so (1) bright (1) =5. Good.

Return that.
</think>
Silent silicon hums  
Lightning speed across the screen  
Pixels dance, so bright<|im_end|>

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker @Cyrilvallez

@xenova xenova changed the title from Nemotron h fixes to [nemotron_h] Add support for MLP mixers on Mar 16, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Cyrilvallez
Member

Cyrilvallez commented Mar 18, 2026

cc @ArthurZucker, looks good but pinging you since you're the one who reviewed the model!

Comment thread on src/transformers/models/nemotron_h/configuration_nemotron_h.py
Collaborator

@ArthurZucker ArthurZucker left a comment


Yeah I don't mind; my main concern is adding this when maybe nemotron3 itself is another FIXED arch vs this pattern mapping that:

  1. is not standard
  2. ends up actually just being 2 or 3 architectures that probably already exist in transformers

Collaborator

@ArthurZucker ArthurZucker left a comment


Thanks for the PR. I don't think we'll move forward with it.
It goes against the whole point of making standards.

1. How many distinct nemotron_h patterns actually exist?

All released nvidia/ models that use model_type: nemotron_h and their hybrid_override_pattern character sets:

| Model | Params | Pattern chars | n_routed_experts |
|---|---|---|---|
| NVIDIA-Nemotron-3-Nano-4B-BF16 | 4B | M, -, * | |
| NVIDIA-Nemotron-Nano-9B-v2 | 9B | M, -, * | |
| NVIDIA-Nemotron-Nano-12B-v2 | 12B | M, -, * | |
| Nemotron-Cascade-2-30B-A3B | 30B | M, E, * | 128 |
| NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | 30B | M, E, * | 128 |
| NVIDIA-Nemotron-3-Super-120B-A12B | 120B | M, E, * | 512 |
| Nemotron-Ultra-253B-v1 | 253B | M, E, * | 512 |

There are only 2 distinct architectural patterns:

Dense Hybrid (4B / 9B / 12B models, no MoE):

  • Layer types: Mamba2 + plain MLP + Attention (GQA)

MoE Hybrid (30B / 120B / 253B models):

  • Layer types: Mamba2 + MoE + Attention (GQA)

No released model mixes all 4 character types. E (MoE) and - (MLP) are mutually exclusive across the entire hub release history and there is thus absolutely no need for composability.


2. These architectures collapse to existing ones.

Dense Hybrid → Bamba

BambaConfig supports:

  • Mamba2 with mamba_n_heads + mamba_d_head (NemotronH's mamba_num_heads / mamba_head_dim)
  • GQA via num_key_value_heads
  • RoPE via rope_parameters
  • SwiGLU MLP layers
  • Configurable attention layer placement via attn_layer_indices

TLDR: no difference from the Dense Hybrid pattern. (Correct me if I am wrong; I did not spend too much time there.)
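To make that claimed equivalence concrete, here is a hedged sketch of expressing a dense hybrid checkpoint through BambaConfig, using only the fields listed above; the values are illustrative placeholders, not those of any released Nemotron model:

```python
from transformers import BambaConfig

# Hypothetical sketch: a dense NemotronH-style layout expressed with the
# BambaConfig fields named in the list above. All values are placeholders.
config = BambaConfig(
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,            # GQA
    mamba_n_heads=128,                # NemotronH's mamba_num_heads
    mamba_d_head=64,                  # NemotronH's mamba_head_dim
    attn_layer_indices=[9, 18, 27],   # attention layers; the rest are Mamba2 + MLP
)
```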

MoE Hybrid → GraniteMoeHybrid

GraniteMoeHybridConfig supports:

  • Mamba2 with mamba_n_heads + mamba_d_head
  • GQA
  • RoPE via rope_parameters
  • MoE with num_local_experts + num_experts_per_tok
  • Shared expert via shared_intermediate_size
  • Configurable layer composition via layer_types list

NemotronH MoE Hybrid adds moe_latent_size (only used in Ultra-253B) and uses a non-gated MLP. This warrants a code update via inheritance.
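Similarly, a hedged sketch of the MoE hybrid composition expressed through GraniteMoeHybridConfig, again using only the fields named above, with placeholder values:

```python
from transformers import GraniteMoeHybridConfig

# Hypothetical sketch: the MoE hybrid layout expressed with the
# GraniteMoeHybridConfig fields named in the list above. Values are placeholders.
config = GraniteMoeHybridConfig(
    num_hidden_layers=8,
    num_key_value_heads=8,            # GQA
    num_local_experts=128,            # routed experts
    num_experts_per_tok=8,
    shared_intermediate_size=2048,    # shared expert
    # explicit per-layer composition instead of an opaque pattern string
    layer_types=["mamba", "mamba", "attention", "mamba"] * 2,
)
```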

I can check again, but I think there was a nit with MLP-first layers, just like DeepSeek.

Why we won't merge

When I approved the original nemotron_h PR I was told there were meaningfully different architectures across the model family. Looking at the actual hub releases, there are 2, and both are already covered by existing implementations.

Adding "-": "mlp" to _pattern_to_list (this PR) makes the composability system more complete — but composability is the wrong abstraction here. We now have a freeform pattern string that:

  1. Is non-standard (no other model in transformers uses a character pattern for layer types this way)
  2. Describes architectures that already exist under different model names
  3. Makes it harder to understand what model you're actually loading
  4. Encourages future releases to define novel combinations using the same opaque string notation
  5. Allows for a freedom that is not used by the authors themselves
  6. Forces downstream libraries to make many changes, when actually the only real difference is the MoE latents.

The right fix is to make this explicit: either use the appropriate bamba and granite_moe_hybrid model_type explicitly, or register two separate model_types (nemotron_h_dense and nemotron_h_moe).

@xenova
Contributor Author

xenova commented Mar 29, 2026

Makes sense to me! :) Thanks for the explanation. I mainly just opened the PR when I tried loading the model with transformers (w/o trust_remote_code) and ran into this error.

@xenova xenova closed this Mar 29, 2026
@ArthurZucker
Collaborator

No no, of course. I'll try to cook something up to support the models still!

ArthurZucker added a commit that referenced this pull request Mar 30, 2026
Hub survey shows only 2 nemotron_h patterns exist:
- Dense (M,-,*): structurally identical to Bamba (Mamba2+FFN or Attn+FFN per layer)
- MoE (M,E,*): new arch, inherits from GraniteMoeHybrid

Changes:
- Replace NemotronHBlock+MIXER_TYPES dispatch with two explicit decoder layer
  classes (NemotronHMambaDecoderLayer, NemotronHAttentionDecoderLayer), both
  inheriting from GraniteMoeHybridDecoderLayer per transformers tenets
  ("different stages warrant explicit classes, not codepaths")
- "moe" layers are now mamba+MoE (two-stage, matching GraniteMoeHybrid) instead
  of pure-MoE single-dispatch layers
- Add NemotronHDenseConfig(PreTrainedConfig) with model_type="nemotron_h_dense"
  routing to BambaForCausalLM in auto-config (draft; weight converter needed)
- Add WeightRenaming entries in conversion_mapping.py for hub checkpoint compat

Ref: #44763 (review)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
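A rough sketch of the shape of that refactor, using the class names from the commit message above; this only illustrates the "explicit classes instead of string dispatch" idea and is not the merged code (which inherits from GraniteMoeHybridDecoderLayer rather than plain nn.Module):

```python
import torch.nn as nn

# Illustrative sketch only, not the merged implementation. Two explicit
# decoder-layer classes replace the single NemotronHBlock whose mixer used to
# be selected at runtime through a MIXER_TYPES string lookup.

class NemotronHMambaDecoderLayer(nn.Module):
    """Mamba2 mixer followed by the layer's feed-forward stage (placeholder)."""


class NemotronHAttentionDecoderLayer(nn.Module):
    """GQA attention mixer followed by the feed-forward stage (placeholder)."""


def build_layers(layer_types: list[str]) -> nn.ModuleList:
    # Hypothetical helper: the per-layer choice is visible in the class of each
    # layer rather than being encoded in a character of hybrid_override_pattern.
    classes = {
        "mamba": NemotronHMambaDecoderLayer,
        "attention": NemotronHAttentionDecoderLayer,
    }
    return nn.ModuleList(classes[layer_type]() for layer_type in layer_types)
```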
@suhara
Contributor

suhara commented Apr 9, 2026

Hi @ArthurZucker

Thank you for your support! Apologies for being late to follow up on this PR.

I'd like to clarify that the Nemotron-H architecture originally started from the Nemotron-H model series and was later adopted by the Nemotron v2 series:

Historically, the original codebase (with the MLP mixer) was implemented for Nemotron-H, which was hosted on model repositories, and then expanded to support the MoE architecture for the Nemotron3 series.

By merging this PR, all of the models that use the Nemotron-H architecture will be officially supported by the Transformers library, without updating config.json on the model repository side, which could otherwise break backward compatibility for customers. This is critical for ensuring that all Nemotron models are made available seamlessly.

Furthermore, as pointed out in this PR, it's blocking further adoption of the latest Nemotron 3 model.

It’d be great if you consider re-opening and merging the PR. Thank you for your support!

@ArthurZucker
Collaborator

Yep I'm not dropping it and want to reach an appropriate conclusion / something that makes everyone happy!

@ArthurZucker ArthurZucker reopened this Apr 20, 2026
@ArthurZucker ArthurZucker dismissed their stale review April 22, 2026 07:16

no longer needed

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: nemotron_h

@ArthurZucker
Collaborator

can confirm:

<|im_start|>system
<|im_end|>
<|im_start|>user
Write a haiku about GPUs<|im_end|>
<|im_start|>assistant
<think>
We need to produce a haiku about GPUs. Haiku is 5-7-5 syllable structure. Let's craft: "Silent silicon hums / Parallel worlds collide in light / Speed beyond thought". Count syllables: "Silent" (2) "silicon" (3) "hums" (1) = 6? Actually "Silent" 2, "silicon" 3, "hums" 1 = 6. Need 5. Maybe "Silent silicon hums" is 5? Let's count: Silent (2), silicon (3) =5, hums (1) =6. Too many. Maybe "Silent silicon hums" is 5? Let's see: Silent (2), silicon (3) =5, hums (1) =6. So need 5 total. Could be "Silent silicon hums" is 5? Actually "Silent" 2, "silicon" 3 =5, "hums" 1 =6. So too many. Maybe "Silent silicon hums" is 5? No. Let's try "Silent silicon hums" is 5? Let's count again: Silent (2), silicon (3) =5, hums (1) =6. So 6. Need 5. Maybe "Silent silicon hums" is 5? No. Let's try "Silent silicon hums" is 5? Eh.

Better: "Silent silicon hums" is 5? Let's think: "Silent" 2, "silicon" 3 =5, "hums" 1 =6. So 6. So maybe "Silent silicon hums" is 5? No.

Let's make first line 5 syllables: "Silent silicon hums" is 6. Maybe "Silent silicon hums" is 5? Let's try "Silent silicon hums" = 5? Actually "Silent" 2, "silicon" 3 =5, "hums" 1 =6. So 6.

Let's do "Silent silicon hums" is 5? No.

Alternative: "Silent silicon hums" is 5? Hmm.

Maybe "Silent silicon hums" is 5? Let's count again: Silent (2), silicon (3) =5, hums (1) =6. So 6.

Let's try "Silent silicon hums" is 5? No.

Let's make first line "Silent silicon hums" is 5? Not.

Let's do "Silent silicon hums" is 5? I'm stuck.

Let's try "Silent silicon hums" is 5? No.

Ok new line: "Silent silicon hums" is 5? Let's drop "silicon": "Silent hums" = 3? "Silent" 2, "hums" 1 =3. Need 5. "Silent hums" is 3.

Maybe "Silent silicon hums" is 5? No.

Let's think of a 5-syllable line: "Silent silicon hums" is 6. "Silent silicon hums" is 6. "Silent silicon hums" is 6.

Maybe "Silent silicon hums" is 5? No.

Let's try "Silent silicon hums" is 5? I'm stuck.

Let's create a different line: "Silent silicon hums" is 5? Let's count: Silent (2), silicon (3) =5, hums (1) =6. So 6.

Ok maybe "Silent silicon hums" is 5? No.

Let's do "Silent silicon hums" is 5? I'm stuck.

Let's just write haiku: "Silent silicon hums / Parallel worlds collide in light / Speed beyond thought". Count syllables:

Line1: Silent (2) silicon (3) hums (1) =6. Too many.

We need 5. Maybe "Silent silicon hums" is 5? No.

Let's try "Silent silicon hums" is 5? I'm stuck.

Let's make line1: "Silent silicon hums" is 5? Let's drop "silicon": "Silent hums" = 3. Need 5. "Silent hums" is 3.

Maybe "Silent silicon hums" is 5? No.

Let's try "Silent silicon hums" is 5? I'm stuck.

Ok new approach: "Silent silicon hums" is 5? Let's count again: Silent (2), silicon (3) =5, hums (1)

@ArthurZucker ArthurZucker enabled auto-merge April 22, 2026 08:56
@ArthurZucker ArthurZucker disabled auto-merge April 22, 2026 08:56
@ArthurZucker ArthurZucker merged commit 77c0e6e into main Apr 22, 2026
21 checks passed
@ArthurZucker ArthurZucker deleted the nemotron-h-fixes branch April 22, 2026 08:56
