[nemotron_h] Add support for MLP mixers #44763
Conversation
cc @ArthurZucker, looks good but pinging you since you're the one who reviewed the model! |
ArthurZucker
left a comment
Yeah I don't mind, my main concern is adding this when maybe nemotron3 itself is another FIXED arch vs this pattern mapping that:
- is not standard
- ends up actually just being 2 or 3 architectures that probably already exist in transformers
Thanks for the PR. I don't think we'll move forward with it.
It goes against the whole point of making standards.
1. How many distinct nemotron_h patterns actually exist?
All released nvidia/ models that use model_type: nemotron_h and their hybrid_override_pattern character sets:
| Model | Params | Pattern chars | n_routed_experts |
|---|---|---|---|
| NVIDIA-Nemotron-3-Nano-4B-BF16 | 4B | M - * | — |
| NVIDIA-Nemotron-Nano-9B-v2 | 9B | M - * | — |
| NVIDIA-Nemotron-Nano-12B-v2 | 12B | M - * | — |
| Nemotron-Cascade-2-30B-A3B | 30B | M E * | 128 |
| NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | 30B | M E * | 128 |
| NVIDIA-Nemotron-3-Super-120B-A12B | 120B | M E * | 512 |
| Nemotron-Ultra-253B-v1 | 253B | M E * | 512 |
There are only 2 distinct architectural patterns:
**Dense Hybrid** (4B / 9B / 12B models, no MoE):
- Layer types: Mamba2 + plain MLP + Attention (GQA)

**MoE Hybrid** (30B / 120B / 253B models):
- Layer types: Mamba2 + MoE + Attention (GQA)
No released model mixes all 4 character types. `E` (MoE) and `-` (MLP) are mutually exclusive across the entire hub release history, and there is thus absolutely no need for composability.
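For illustration, a minimal sketch of how such a pattern string expands into per-layer block types; the character meanings are taken from the table above, and the actual `_pattern_to_list` helper in the PR may differ:

```python
# Hypothetical expansion of a hybrid_override_pattern string into block types.
# Character meanings come from the table above; the real helper may differ.
PATTERN_TO_BLOCK = {
    "M": "mamba",      # Mamba2 mixer
    "*": "attention",  # attention (GQA) layer
    "-": "mlp",        # plain MLP mixer (what this PR adds)
    "E": "moe",        # MoE layer
}

def pattern_to_layer_types(pattern: str) -> list[str]:
    """Expand e.g. 'M-M-*-' into one block type per layer."""
    return [PATTERN_TO_BLOCK[char] for char in pattern]

print(pattern_to_layer_types("M-M-*-"))  # dense hybrid: mamba/mlp/attention only
print(pattern_to_layer_types("MEME*E"))  # MoE hybrid: mamba/moe/attention only
```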
2. These architectures collapse to existing ones.
Dense Hybrid → Bamba
`BambaConfig` supports:
- Mamba2 with `mamba_n_heads` + `mamba_d_head` (NemotronH's `mamba_num_heads` / `mamba_head_dim`)
- GQA via `num_key_value_heads`
- RoPE via `rope_parameters`
- SwiGLU MLP layers
- Configurable attention layer placement via `attn_layer_indices`
TLDR: no difference from the Dense Hybrid pattern. (Correct me if I am wrong, I did not spend too much time there.)
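As a rough sketch of that claim (parameter names are the ones listed above; every value below is invented and does not correspond to a released checkpoint), a dense NemotronH layout could in principle be described with the existing config:

```python
# Illustrative only: parameter names from the list above, values invented.
from transformers import BambaConfig

dense_cfg = BambaConfig(
    hidden_size=4096,
    num_hidden_layers=52,
    num_attention_heads=32,
    num_key_value_heads=8,            # GQA
    mamba_n_heads=128,                # NemotronH's mamba_num_heads
    mamba_d_head=64,                  # NemotronH's mamba_head_dim
    attn_layer_indices=[14, 27, 40],  # layers that use attention instead of Mamba2
)
print(dense_cfg.model_type)  # "bamba"
```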
MoE Hybrid → GraniteMoeHybrid
`GraniteMoeHybridConfig` supports:
- Mamba2 with `mamba_n_heads` + `mamba_d_head`
- GQA
- RoPE via `rope_parameters`
- MoE with `num_local_experts` + `num_experts_per_tok`
- Shared expert via `shared_intermediate_size`
- Configurable layer composition via a `layer_types` list
NemotronH MoE Hybrid adds `moe_latent_size` (only used in Ultra-253B) and uses a non-gated MLP. This warrants a code update -> inheritance.
I can check again, I think there was a nit with MLP-first layers, just like DeepSeek.
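Similarly, a rough sketch of the MoE pattern on top of the existing config (names from the list above; the pattern and all values are invented, and `E` layers map to mamba-mixer layers because the MoE block is the per-layer feed-forward stage in this architecture):

```python
# Illustrative only: the pattern and every value here are invented.
from transformers import GraniteMoeHybridConfig

pattern = "MEME*E"  # hypothetical MoE-hybrid pattern
moe_cfg = GraniteMoeHybridConfig(
    num_hidden_layers=len(pattern),
    layer_types=["attention" if ch == "*" else "mamba" for ch in pattern],
    num_local_experts=128,          # n_routed_experts column in the table above
    num_experts_per_tok=8,
    shared_intermediate_size=1024,  # shared expert
)
print(moe_cfg.layer_types)
```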
Why we won't merge
When I approved the original nemotron_h PR I was told there were meaningfully different architectures across the model family. Looking at the actual hub releases, there are 2, and both are already covered by existing implementations.
Adding "-": "mlp" to _pattern_to_list (this PR) makes the composability system more complete — but composability is the wrong abstraction here. We now have a freeform pattern string that:
- Is non-standard (no other model in transformers uses a character pattern for layer types this way)
- Describes architectures that already exist under different model names
- Makes it harder to understand what model you're actually loading
- Encourages future releases to define novel combinations using the same opaque string notation
- Allows for a freedom that is not used by the authors themselves
- Forces downstream libraries to make many changes, when actually the only novelty is the MoE latents.
The right fix is to make this explicit: either use the appropriate `bamba` and `granite_moe_hybrid` model_type directly, or register two separate model_types (`nemotron_h_dense` and `nemotron_h_moe`).
Makes sense to me! :) Thanks for the explanation. I mainly just opened the PR when I tried loading the model with transformers (w/o trust_remote_code) and ran into this error.

No no of course, I'll try to cook something up to support the models still!
Hub survey shows only 2 nemotron_h patterns exist:
- Dense (M,-,*): structurally identical to Bamba (Mamba2+FFN or Attn+FFN per layer)
- MoE (M,E,*): new arch, inherits from GraniteMoeHybrid
Changes:
- Replace NemotronHBlock+MIXER_TYPES dispatch with two explicit decoder layer
classes (NemotronHMambaDecoderLayer, NemotronHAttentionDecoderLayer), both
inheriting from GraniteMoeHybridDecoderLayer per transformers tenets
("different stages warrant explicit classes, not codepaths")
- "moe" layers are now mamba+MoE (two-stage, matching GraniteMoeHybrid) instead
of pure-MoE single-dispatch layers
- Add NemotronHDenseConfig(PreTrainedConfig) with model_type="nemotron_h_dense"
routing to BambaForCausalLM in auto-config (draft; weight converter needed)
- Add WeightRenaming entries in conversion_mapping.py for hub checkpoint compat
Ref: #44763 (review)
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
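A schematic of the class layout this describes (the `GraniteMoeHybridDecoderLayer` constructor signature is assumed here; the actual PR code may differ):

```python
# Schematic only: class names come from the commit message above; the parent
# constructor signature is assumed, not verified against the real PR code.
from transformers.models.granitemoehybrid.modeling_granitemoehybrid import (
    GraniteMoeHybridDecoderLayer,
)


class NemotronHMambaDecoderLayer(GraniteMoeHybridDecoderLayer):
    """Mamba2 mixer followed by the (MoE or plain) feed-forward block."""

    def __init__(self, config, layer_idx: int):
        super().__init__(config, layer_idx)


class NemotronHAttentionDecoderLayer(GraniteMoeHybridDecoderLayer):
    """GQA attention mixer followed by the (MoE or plain) feed-forward block."""

    def __init__(self, config, layer_idx: int):
        super().__init__(config, layer_idx)


# Layer construction can then dispatch on an explicit class per layer type
# instead of the previous MIXER_TYPES lookup (illustrative mapping).
LAYER_CLASSES = {
    "mamba": NemotronHMambaDecoderLayer,
    "attention": NemotronHAttentionDecoderLayer,
}
```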
Thank you for your support! Sorry for being late to follow up on this PR. I'd like to clarify that the Nemotron-H architecture originally started with the Nemotron-H model series and was later adopted by the Nemotron v2 series. Historically, the original codebase (with the MLP mixer) was implemented for Nemotron-H, which was hosted on model repositories, and then expanded to support the MoE architecture for the Nemotron3 series. By merging this PR, all of the models that use the Nemotron-H architecture would be officially available through the Transformers library. Furthermore, as pointed out in this PR, the missing support is blocking further adoption of the latest Nemotron 3 model. It'd be great if you could consider re-opening and merging the PR. Thank you for your support!

Yep, I'm not dropping it and want to reach an appropriate conclusion / something that makes everyone happy!
[For maintainers] Suggested jobs to run (before merge): run-slow: nemotron_h
can confirm: <|im_start|>system
<|im_end|>
<|im_start|>user
Write a haiku about GPUs<|im_end|>
<|im_start|>assistant
<think>
We need to produce a haiku about GPUs. Haiku is 5-7-5 syllable structure. Let's craft: "Silent silicon hums / Parallel worlds collide in light / Speed beyond thought". Count syllables: "Silent" (2) "silicon" (3) "hums" (1) = 6? Actually "Silent" 2, "silicon" 3, "hums" 1 = 6. Need 5. Maybe "Silent silicon hums" is 5? Let's count: Silent (2), silicon (3) =5, hums (1) =6. Too many. Maybe "Silent silicon hums" is 5? Let's see: Silent (2), silicon (3) =5, hums (1) =6. So need 5 total. Could be "Silent silicon hums" is 5? Actually "Silent" 2, "silicon" 3 =5, "hums" 1 =6. So too many. Maybe "Silent silicon hums" is 5? No. Let's try "Silent silicon hums" is 5? Let's count again: Silent (2), silicon (3) =5, hums (1) =6. So 6. Need 5. Maybe "Silent silicon hums" is 5? No. Let's try "Silent silicon hums" is 5? Eh.
Better: "Silent silicon hums" is 5? Let's think: "Silent" 2, "silicon" 3 =5, "hums" 1 =6. So 6. So maybe "Silent silicon hums" is 5? No.
Let's make first line 5 syllables: "Silent silicon hums" is 6. Maybe "Silent silicon hums" is 5? Let's try "Silent silicon hums" = 5? Actually "Silent" 2, "silicon" 3 =5, "hums" 1 =6. So 6.
Let's do "Silent silicon hums" is 5? No.
Alternative: "Silent silicon hums" is 5? Hmm.
Maybe "Silent silicon hums" is 5? Let's count again: Silent (2), silicon (3) =5, hums (1) =6. So 6.
Let's try "Silent silicon hums" is 5? No.
Let's make first line "Silent silicon hums" is 5? Not.
Let's do "Silent silicon hums" is 5? I'm stuck.
Let's try "Silent silicon hums" is 5? No.
Ok new line: "Silent silicon hums" is 5? Let's drop "silicon": "Silent hums" = 3? "Silent" 2, "hums" 1 =3. Need 5. "Silent hums" is 3.
Maybe "Silent silicon hums" is 5? No.
Let's think of a 5-syllable line: "Silent silicon hums" is 6. "Silent silicon hums" is 6. "Silent silicon hums" is 6.
Maybe "Silent silicon hums" is 5? No.
Let's try "Silent silicon hums" is 5? I'm stuck.
Let's create a different line: "Silent silicon hums" is 5? Let's count: Silent (2), silicon (3) =5, hums (1) =6. So 6.
Ok maybe "Silent silicon hums" is 5? No.
Let's do "Silent silicon hums" is 5? I'm stuck.
Let's just write haiku: "Silent silicon hums / Parallel worlds collide in light / Speed beyond thought". Count syllables:
Line1: Silent (2) silicon (3) hums (1) =6. Too many.
We need 5. Maybe "Silent silicon hums" is 5? No.
Let's try "Silent silicon hums" is 5? I'm stuck.
Let's make line1: "Silent silicon hums" is 5? Let's drop "silicon": "Silent hums" = 3. Need 5. "Silent hums" is 3.
Maybe "Silent silicon hums" is 5? No.
Let's try "Silent silicon hums" is 5? I'm stuck.
Ok new approach: "Silent silicon hums" is 5? Let's count again: Silent (2), silicon (3) =5, hums (1) |
What does this PR do?
Adds support for MLP mixers, used by nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16. Previously, loading would crash because the `-` char in the pattern was not recognized.

Fixes # (issue)
Example usage:
See example output
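A minimal loading sketch along these lines (the exact invocation is not shown in the thread, so the prompt and generation settings here are illustrative):

```python
# Illustrative usage: with the '-' (MLP) mixer recognized, the checkpoint loads
# without trust_remote_code. Prompt and generation settings are made up.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Write a haiku about GPUs", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```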
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ArthurZucker @Cyrilvallez