feat(llm): add Hy3-preview (HYV3) SFT support #2072
Merged
HuiyingLi merged 28 commits into NVIDIA-NeMo:main on Apr 29, 2026
Conversation
Adds SFT training support for tencent/Hy3-preview (295B MoE, 192 experts with top-8 routing, 256K context). Requires transformers >= 5.6.0.

New files:
- nemo_automodel/components/models/hy_v3/layers.py: HYV3Attention with GQA, per-head QK RMSNorm, and RoPE
- nemo_automodel/components/models/hy_v3/model.py: HYV3ForCausalLM / HYV3Model / Block wrapping Automodel's MoE infrastructure
- nemo_automodel/components/models/hy_v3/state_dict_adapter.py: HYV3StateDictAdapter handling HF↔native conversion (expert tensor transposition, e_score_correction_bias relocation, MTP layer skipping)
- examples/llm_finetune/hy_v3/hy3_preview_deepep.yaml: example SFT config

Key architecture differences vs Qwen3-MoE:
- Sigmoid routing with e_score_correction_bias (vs softmax)
- first_k_dense_replace=1: only layer 0 is dense
- 1 shared expert alongside the 192 routed experts
- route_scale=2.826 applied to routing weights
- HF expert tensors are pre-grouped [n,2i,h] rather than per-expert, so the state_dict_adapter only needs transposition (not stack/concat)

Registers HYV3ForCausalLM in MODEL_ARCH_MAPPING.

Signed-off-by: khazic <khazzz1c@gmail.com>
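For orientation, a minimal sketch of the sigmoid routing described above. The bias-corrected top-k selection and route_scale follow the commit text; where exactly the correction bias enters (selection vs. combine weights) and the renormalization step are assumptions borrowed from similar MoE routers, not the actual Automodel Gate implementation.

```python
import torch

def route_tokens_sigmoid(hidden, gate_weight, e_score_correction_bias,
                         top_k=8, route_scale=2.826):
    """Hedged sketch of HYV3-style sigmoid routing with a correction bias.

    hidden:                  [tokens, hidden_dim]
    gate_weight:             [n_experts, hidden_dim] router projection
    e_score_correction_bias: [n_experts] load-balancing bias (assumed to be
                             used for expert selection only)
    """
    # Router scores via sigmoid instead of softmax (per the HYV3 config).
    logits = hidden.float() @ gate_weight.float().t()           # [tokens, n_experts]
    scores = torch.sigmoid(logits)

    # Bias-corrected scores pick which experts are active ...
    _, expert_idx = (scores + e_score_correction_bias).topk(top_k, dim=-1)

    # ... while the combine weights come from the uncorrected scores,
    # renormalized over the selected experts and scaled by route_scale.
    weights = scores.gather(-1, expert_idx)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return expert_idx, weights * route_scale
```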
P0 (hy3_4layer_p0_smoke.yaml): 4-layer proxy, pp=2, ep=4, torch dispatcher, 100 steps — validates forward/backward/PP/EP health and e_score_correction_bias updates with no checkpoint I/O.

P1 (hy3_4layer_p1_ckpt.yaml): same topology plus a DCP checkpoint save at step 50; exercises save/resume continuity of the full FSDP2+EP state, including Gate buffers.

P2 (hy3_8layer_p2_deepep.yaml): 8-layer proxy, pp=2, ep=4, DeepEP dispatcher (async_finish=True), 200 steps — validates DeepEP communicate-compute overlap and throughput against the P0 torch baseline.

All three configs: pp=2, ep=4, 8 GPUs, interleaved1f1b schedule, real sigmoid routing (fake_balanced_gate: false).

Signed-off-by: khazic <khazzz1c@gmail.com>
Switch from the tiny proxy model to real tencent/Hy3-preview weights with a truncated layer count (4/8 layers), following the same approach used for DeepSeek V4 Flash validation. Checkpoint keys for layers beyond the truncated num_hidden_layers are ignored via strict=False on load.

P0 (4 layers, torch, pp=2 ep=4): validates real tensor shapes and routing
P1 (4 layers, torch, pp=2 ep=4): adds DCP checkpoint save/resume
P2 (8 layers, deepep, pp=2 ep=4): validates DeepEP with real expert dims

All configs: AutoConfig.from_pretrained + num_hidden_layers override + load_base_model=true + enable_hf_state_dict_adapter=true. 192 experts / ep=4 = 48 experts per rank (~8 GB of parameters per rank at bf16).

Signed-off-by: khazic <khazzz1c@gmail.com>
- Fix optimizer: Adam → AdamW, lr 5e-4 → 1e-5, eps 1e-7 → 1e-8 (follows the official train.py and matches the DSV4 pattern)
- Add gate_precision: float32 to all HYV3 backends (matches the HF router's FP32)
- Add rope_fusion: false to P0/P1/P2 (attn: sdpa; avoids TE mismatch)
- Fix collate_fn to dict form with pad_seq_len_divisible: 64
- Add _target_ to tokenizer fields in dataset/validation_dataset
- Add shuffle: false and drop_last: true to validation_dataloader
- Fix hy3_preview_deepep: pp_schedule 1f1b (pp=1), add moe section, fix optimizer and dataset fields
- Remove num_nextn_predict_layers (DeepSeek-specific, not in the HYV3 config)
- Add update_moe_gate_bias() to HYV3ForCausalLM so the training recipe updates e_score_correction_bias each optimizer step (load balancing)

Signed-off-by: khazic <khazzz1c@gmail.com>
…nyuan3 Signed-off-by: khazic <khazzz1c@gmail.com>
AutoConfig.from_pretrained failed on checkpoints with model_type=hy_v3 because the type was not registered. Add config.py with HYV3Config (PretrainedConfig subclass) and wire it into _CUSTOM_CONFIG_REGISTRATIONS so that trust_remote_code=False keeps working. Signed-off-by: khazic <khazzz1c@gmail.com>
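A minimal sketch of that registration, assuming the usual PretrainedConfig pattern; every field and default value below except model_type is illustrative, not the real HYV3Config.

```python
from transformers import AutoConfig, PretrainedConfig

class HYV3Config(PretrainedConfig):
    """Illustrative stand-in for the real HYV3Config in config.py."""
    model_type = "hy_v3"

    def __init__(self, hidden_size=4096, num_hidden_layers=61,
                 n_routed_experts=192, num_experts_per_tok=8, **kwargs):
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.n_routed_experts = n_routed_experts
        self.num_experts_per_tok = num_experts_per_tok
        super().__init__(**kwargs)

# Once registered, AutoConfig.from_pretrained resolves checkpoints whose
# config.json declares model_type == "hy_v3" without trust_remote_code.
AutoConfig.register("hy_v3", HYV3Config)
```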
…ia DCP Custom models (e.g. HYV3) create training-only buffers (e_score_correction_bias) that are not present in the original HF pretrained checkpoint. Using DefaultLoadPlanner(allow_partial_load=True) for is_init_step loads lets those buffers keep their zero initialization instead of raising "Missing key in checkpoint state_dict". Signed-off-by: khazic <khazzz1c@gmail.com>
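A sketch of that partial-load behavior using the public torch.distributed.checkpoint API; the wrapper function and checkpoint directory are placeholders, and (as a later commit notes) this shim was eventually dropped once the adapter renamed the bias key correctly.

```python
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.default_planner import DefaultLoadPlanner

def load_pretrained_with_partial_load(model: torch.nn.Module, ckpt_dir: str) -> None:
    """Initial (is_init_step) load from a converted HF checkpoint.

    allow_partial_load=True tolerates keys that exist only in the model,
    e.g. training-only e_score_correction_bias buffers, which then keep
    their zero initialization instead of raising a missing-key error.
    """
    state_dict = model.state_dict()
    dcp.load(
        state_dict,
        checkpoint_id=ckpt_dir,  # placeholder directory
        planner=DefaultLoadPlanner(allow_partial_load=True),
    )
    model.load_state_dict(state_dict)
```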
In the standard DCP load path (post-shard, e.g. PP+EP), DCP already distributes expert tensors correctly via DTensor placement (Shard(0)). Passing moe_mesh to from_hf causes a second EP slice on the DTensor, producing a plain tensor of local shape [48,...] that cannot be loaded into the model's DTensor parameter of global shape [192,...]. Fix: pass moe_mesh=None to _maybe_adapt_state_dict_from_hf in the standard DCP path so adapters only rename keys and do not re-slice tensors that DCP has already distributed. The fast path (lines 444-498) is unaffected: it still passes moe_mesh because it loads full plain tensors from disk and needs explicit slicing. Signed-off-by: khazic <khazzz1c@gmail.com>
Replace internal server paths with the public HuggingFace model ID. Signed-off-by: khazic <khazzz1c@gmail.com>
Signed-off-by: khazic <khazzz1c@gmail.com>
Contributor
/ok to test 0c71b04
jgerh reviewed Apr 28, 2026
Contributor
jgerh
left a comment
Completed tech pubs review and provided a few copyedits.
Bump local_batch_size 4->8, set pp_size=4 and ep_size=32 for 128 GPUs (16 nodes x 8), add max_steps=100, and raise val_every_steps to 500. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
The previous HYV3 adapter expected an already-fused HF state dict
(mlp.experts.gate_up_proj / down_proj / e_score_correction_bias /
shared_experts.* / mlp.gate.weight). The on-disk Tencent format
actually stores per-expert split keys plus internal names
(mlp.experts.{i}.{gate,up,down}_proj.weight, mlp.expert_bias,
mlp.router.gate.weight, mlp.shared_mlp.*). Because checkpointing.py:507
zeroes reader_key_mapping when the model has a state_dict_adapter, the
storage reader's renames never ran and DCP found no matching keys for
any MoE tensor -- so router/shared/experts silently stayed at random
init while everything else loaded fine.
Rewrite HYV3StateDictAdapter on top of MoESplitExpertsStateDictMixin
so to_hf produces on-disk-format keys (per-expert split + Tencent
names) that DCP can match against the safetensors, and from_hf merges
them back into the grouped native form. The three HYV3-specific
renames (router.gate <-> gate, expert_bias <-> e_score_correction_bias,
shared_mlp. <-> shared_experts.) are applied around the mixin.
Also fix _maybe_adapt_state_dict_from_hf to pass moe_mesh in the DCP
init path (was None). The mixin's validator/merger needs the EP mesh
to know which expert-id subset is expected on the rank; without it,
required_experts = range(192) and validation fails ("Expert weights
missing from checkpoint: 432/576 ..."). The previous comment said
"DCP already distributed -- don't pass moe_mesh", but the mesh is
needed for subset-aware validation, not re-slicing.
Verified end-to-end on hy3_4layer_p0_smoke (pp=2, ep=4, 8 GPUs): all
56 non-bias tensors are bitwise identical to the HF reference; the 3
e_score_correction_bias tensors agree to <=2.2e-4 (bf16 round-trip
noise). Stitched mlp.gate.weight across 4 EP ranks matches the on-disk
router.gate.weight bitwise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
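As a rough sketch of the rename-plus-split/merge behavior described in this commit: the real adapter delegates the expert split/merge to MoESplitExpertsStateDictMixin, so the helper names, the "mlp." key context, and the gate/up concatenation order below are assumptions; only the three rename pairs come from the commit text.

```python
import torch

# HYV3-specific key fragments renamed around the shared split/merge mixin.
_HF_TO_NATIVE_RENAMES = {
    "mlp.router.gate.weight": "mlp.gate.weight",
    "mlp.expert_bias": "mlp.gate.e_score_correction_bias",
    "mlp.shared_mlp.": "mlp.shared_experts.",
}

def rename_hf_to_native(key: str) -> str:
    """Apply the first matching Tencent-format -> native rename, else pass through."""
    for hf_frag, native_frag in _HF_TO_NATIVE_RENAMES.items():
        if hf_frag in key:
            return key.replace(hf_frag, native_frag)
    return key

def merge_split_experts(gate_w, up_w, down_w):
    """Merge per-expert on-disk tensors into the grouped native layout.

    gate_w / up_w: lists of E tensors of shape [moe_inter, hidden]
    down_w:        list  of E tensors of shape [hidden, moe_inter]
    Returns:
      gate_and_up_projs [E, hidden, 2*moe_inter]  (concatenated, then transposed)
      down_projs        [E, moe_inter, hidden]    (transposed)
    """
    gate_up = [torch.cat([g, u], dim=0).t() for g, u in zip(gate_w, up_w)]
    return torch.stack(gate_up), torch.stack([d.t() for d in down_w])
```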
Add two debug-only hooks behind environment variables, both no-ops by
default:
- AUTOMODEL_PARITY_DUMP=<dir>: after build_model() returns, write each
rank's post-load state_dict to <dir>/rank{R}_state_dict.pt. Used to
verify HF -> Automodel weight loading matches a reference state dict
from the on-disk safetensors.
- AUTOMODEL_PARITY_LOGITS=<dir>: at the end of setup(), register
forward hooks on embed_tokens, every decoder layer, the final norm,
and lm_head; run one deterministic eval-mode forward through the PP
schedule on a fixed input (torch.randint with seed 0, seqlen 8); each
rank dumps its captured tensors to <dir>/rank{R}_outputs.pt. Stage 0
ranks capture hidden_0..hidden_{first_stage_layers}; stage 1 ranks
capture the remaining hidden states + hidden_norm + logits.
Used together with an HF transformers reference forward (same input)
to validate per-layer parity within bf16 noise. No effect on production
runs that don't set either env var.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
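A minimal sketch of the env-var-gated parity hook, assuming the usual model.model.embed_tokens / model.model.layers / lm_head layout; the helper name and dump format are illustrative, and the hook is a no-op when the variable is unset.

```python
import os
import torch

def maybe_register_parity_hooks(model: torch.nn.Module, rank: int) -> dict:
    """No-op unless AUTOMODEL_PARITY_LOGITS is set; then capture per-layer outputs."""
    dump_dir = os.environ.get("AUTOMODEL_PARITY_LOGITS")
    if dump_dir is None:
        return {}  # production runs are unaffected

    captured = {}

    def make_hook(name):
        def hook(_module, _inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            captured[name] = out.detach().cpu()
        return hook

    # Hook the embedding, every decoder layer, the final norm, and lm_head.
    model.model.embed_tokens.register_forward_hook(make_hook("embed"))
    for i, layer in enumerate(model.model.layers):
        layer.register_forward_hook(make_hook(f"hidden_{i}"))
    model.model.norm.register_forward_hook(make_hook("hidden_norm"))
    model.lm_head.register_forward_hook(make_hook("logits"))
    return captured

# After one deterministic forward pass, each rank would dump its captures:
# torch.save(captured, os.path.join(dump_dir, f"rank{rank}_outputs.pt"))
```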
This reverts commit c23bc04. Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
The two truncated 4-layer recipes (hy3_4layer_p0_smoke.yaml, hy3_4layer_p1_ckpt.yaml) were used during initial state-dict and checkpoint validation; the only remaining production recipe for HYV3 is hy3_preview_deepep.yaml. Drop the smoke files and remove the now-stale download links from the model-coverage page. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
The PR branch's modifications to nemo_automodel/components/checkpoint/checkpointing.py were tracking an older state of main; the relevant moe_mesh wiring (_maybe_adapt_state_dict_from_hf(..., moe_mesh=self.moe_mesh)) has been on main since NVIDIA-NeMo#1904 (Adil, 2025-10-16). Sync this file back to main.

Also drop the DefaultLoadPlanner(allow_partial_load=True) shim that was introduced for the previous HYV3 adapter, where the e_score_correction_bias buffer's HF key was not renamed correctly. The new mixin-based adapter (commit cb30a59) renames expert_bias <-> e_score_correction_bias during from_hf/to_hf, so DCP finds matching keys on disk and allow_partial_load is unnecessary.

Verified end-to-end on hy3_4layer_p0_smoke (pp=2, ep=4, 8 GPUs): 3 training steps complete, no missing-key errors, and MoE weights load correctly via the shared mixin path with moe_mesh from the call site.

While here, drop the unused _infer_ep_mesh helper in HYV3StateDictAdapter -- the call site always supplies moe_mesh now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
40 tests covering all behavior introduced by the rewritten adapter:
- Initialization: attribute wiring, default dtype, mixin inheritance.
- Rename tables (_NATIVE_TO_HF_RENAMES / _HF_TO_NATIVE_RENAMES):
parametrized round-trip for each rename pair, plus negative cases that
must NOT be renamed (attention, layernorm, embed, dense MLP, lm_head,
model.norm).
- from_hf (on-disk -> native):
* router.gate.weight -> gate.weight
* expert_bias -> gate.e_score_correction_bias
* shared_mlp.* -> shared_experts.*
* per-expert split keys merged into experts.gate_and_up_projs and
experts.down_projs with the right [E,H,2I]/[E,I,H] native shapes
and value-level transposed/concatenated layout
* MTP layer keys (index >= num_hidden_layers) dropped
* unrelated keys pass through unchanged
- to_hf (native -> on-disk):
* reverse renames produce the on-disk Tencent names
* grouped expert tensors split into per-expert keys with
[moe_inter, hidden] / [hidden, moe_inter] disk shapes
* exclude_key_regex honored
- convert_single_tensor_to_hf:
* non-expert keys renamed or pass through
* expert tensors split + renamed (one input -> 2*E or E pairs)
* exclude_key_regex applied after rename
- Round-trip integrity:
* native -> to_hf -> from_hf recovers every key value-for-value
* disk -> from_hf -> to_hf recovers every non-MTP key value-for-value
- _is_mtp_key: parametrized layer-index classification with/without
the "model." prefix, plus a config-threshold variation test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
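One of the round-trip checks listed above, sketched as a plain pytest-style function under the assumption that the adapter exposes to_hf/from_hf over ordinary key-to-tensor dicts; the fixture arguments are illustrative.

```python
import torch

def test_native_to_hf_round_trip(adapter, native_state_dict):
    """native -> to_hf -> from_hf must recover every key value-for-value."""
    recovered = adapter.from_hf(adapter.to_hf(native_state_dict))
    assert recovered.keys() == native_state_dict.keys()
    for key, tensor in native_state_dict.items():
        torch.testing.assert_close(recovered[key], tensor, rtol=0, atol=0)
```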
Adds tests for every code change in the PR that wasn't already covered:
- tests/unit_tests/models/hy_v3/test_hy_v3_config.py (15 tests):
HYV3Config defaults match the published 295B spec, override
propagation for attention dims / MoE routing / layer truncation /
first_k_dense_replace / router flags / RoPE / token IDs, to_dict
round-trip, and class-level model_type stability.
- tests/unit_tests/models/hy_v3/test_hy_v3_layers.py (10 tests):
HYV3Attention initialization (projection shapes, per-head qk_norm,
attention_bias on/off), forward output shapes through the sdpa
backend, q/k/v/o projections all called, attention_mask propagated,
init_weights resets norms + reseeds linears.
- tests/unit_tests/models/hy_v3/test_hy_v3_model.py (24 tests):
Block dense vs MoE switching at first_k_dense_replace, residual
forward calls attn + mlp, attention_mask -> padding_mask conversion,
init_weights propagates to sub-components; HYV3Model construction
+ dense+MoE structure + moe_config inference + moe_overrides +
moe_config-vs-overrides conflict + forward + position_ids +
init_weights; HYV3ForCausalLM construction, optional
state_dict_adapter wiring, default backend, get/set in/out
embeddings, forward logits shape, initialize_weights, the
update_moe_gate_bias no-op-when-factor-zero contract (regression
test for 564ff4f), from_config/from_pretrained classmethods,
ModelClass alias, module exports.
- tests/unit_tests/_transformers/test_registry_hy_v3.py (6 tests):
HYV3ForCausalLM is registered in MODEL_ARCH_MAPPING and resolves
to the right (module, class). hy_v3 is registered in
_CUSTOM_CONFIG_REGISTRATIONS and the resolved class has
model_type == 'hy_v3'. Negative tests confirm the removed
Ministral3 bidirectional retrieval keys are gone from
SUPPORTED_BACKBONES.
- tests/unit_tests/recipes/test_train_ft.py (3 tests):
PEFT + torch_save raises ValueError (parity with the equivalent
test added in test_finetune_vlm_helpers.py), PEFT + safetensors
succeeds, non-PEFT + torch_save succeeds.
All 107 tests pass (40 from the prior adapter test commit + 67 new
here).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Mirror the MiniMax-M2.7 PR (NVIDIA-NeMo#1785) doc additions for the new HYV3 support: news entry at the top of README.md "What's New" and a row in docs/model-coverage/latest-models.md. Per-model coverage page (docs/model-coverage/llm/tencent/hy3.md) and llm/index.md row are already present. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Contributor
/ok to test 1d511f8
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

# Conflicts:
#   tests/unit_tests/recipes/test_train_ft.py
Contributor
/ok to test fc62a07
HuiyingLi approved these changes Apr 29, 2026
Summary

Adds fine-tuning support for tencent/Hy3-preview, a 295B MoE model with:
- e_score_correction_bias gate buffer for expert-bias correction

References

Files added
- nemo_automodel/components/models/hy_v3/ — model, layers, config, state_dict_adapter
- nemo_automodel/_transformers/registry.py — register HYV3ForCausalLM and hy_v3 config
- nemo_automodel/components/checkpoint/checkpointing.py — partial-load fix for HF base checkpoint init (see below)
- examples/llm_finetune/hy_v3/hy3_preview_deepep.yaml — full 295B SFT recipe

Bug fixes included
gate_bias_update_factor=0.0 for HYV3

The official Tencent Hy3-preview SFT procedure treats e_score_correction_bias as a static pre-trained buffer and never updates it during fine-tuning. The original implementation incorrectly set gate_bias_update_factor=1e-3, which triggered an EMA update on every step. This is now set to 0.0. update_moe_gate_bias() is also guarded to be a no-op when the factor is zero, preventing an assertion error from Gate.update_bias().
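A hedged sketch of that guard: the update_moe_gate_bias name and the zero-factor contract come from the PR, while the surrounding class structure and the gate-iteration helper are illustrative.

```python
import torch

class HYV3ForCausalLMSketch:
    """Illustrative fragment showing only the gate-bias guard."""

    def __init__(self, gate_bias_update_factor: float = 0.0):
        self.gate_bias_update_factor = gate_bias_update_factor

    @torch.no_grad()
    def update_moe_gate_bias(self) -> None:
        # Hy3-preview SFT treats e_score_correction_bias as a frozen
        # pretrained buffer: with factor == 0.0 this is a no-op, so the
        # underlying gate update (which asserts a positive factor) never runs.
        if self.gate_bias_update_factor == 0.0:
            return
        for gate in self._iter_moe_gates():  # hypothetical helper
            gate.update_bias(self.gate_bias_update_factor)

    def _iter_moe_gates(self):
        # Placeholder: the real model walks its MoE layers here.
        return iter(())
```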
Validation status
- hy3_preview_deepep.yaml

Full 295B training curves (EP32 PP4, 16 nodes):