feat(llm): add Hy3-preview (HYV3) SFT support #2072
Merged
HuiyingLi merged 28 commits into NVIDIA-NeMo:main on Apr 29, 2026
Conversation
Adds SFT training support for tencent/Hy3-preview (295B MoE, 192 experts with top-8 routing, 256K context). Requires transformers >= 5.6.0.

New files:
- nemo_automodel/components/models/hy_v3/layers.py: HYV3Attention with GQA, per-head QK RMSNorm, and RoPE
- nemo_automodel/components/models/hy_v3/model.py: HYV3ForCausalLM / HYV3Model / Block wrapping Automodel's MoE infrastructure
- nemo_automodel/components/models/hy_v3/state_dict_adapter.py: HYV3StateDictAdapter handling HF↔native conversion (expert tensor transposition, e_score_correction_bias relocation, MTP layer skipping)
- examples/llm_finetune/hy_v3/hy3_preview_deepep.yaml: example SFT config

Key architecture differences vs Qwen3-MoE:
- Sigmoid routing with e_score_correction_bias (vs softmax)
- first_k_dense_replace=1: only layer 0 is dense
- 1 shared expert alongside the 192 routed experts
- route_scale=2.826 applied to routing weights
- HF expert tensors are pre-grouped [n,2i,h] rather than per-expert, so the state_dict_adapter only needs transposition (not stack/concat)

Registers HYV3ForCausalLM in MODEL_ARCH_MAPPING.

Signed-off-by: khazic <khazzz1c@gmail.com>
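For orientation, a minimal sketch of the sigmoid routing described above. The bias-corrected top-k selection and route_scale follow the commit text; where exactly the correction bias enters (selection vs. combine weights) and the renormalization step are assumptions borrowed from similar MoE routers, not the actual Automodel Gate implementation.

```python
import torch

def route_tokens_sigmoid(hidden, gate_weight, e_score_correction_bias,
                         top_k=8, route_scale=2.826):
    """Hedged sketch of HYV3-style sigmoid routing with a correction bias.

    hidden:                  [tokens, hidden_dim]
    gate_weight:             [n_experts, hidden_dim] router projection
    e_score_correction_bias: [n_experts] load-balancing bias (assumed to be
                             used for expert selection only)
    """
    # Router scores via sigmoid instead of softmax (per the HYV3 config).
    logits = hidden.float() @ gate_weight.float().t()           # [tokens, n_experts]
    scores = torch.sigmoid(logits)

    # Bias-corrected scores pick which experts are active ...
    _, expert_idx = (scores + e_score_correction_bias).topk(top_k, dim=-1)

    # ... while the combine weights come from the uncorrected scores,
    # renormalized over the selected experts and scaled by route_scale.
    weights = scores.gather(-1, expert_idx)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return expert_idx, weights * route_scale
```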
P0 (hy3_4layer_p0_smoke.yaml): 4-layer proxy, pp=2, ep=4, torch dispatcher, 100 steps — validates forward/backward/PP/EP health and e_score_correction_bias updates with no checkpoint I/O.

P1 (hy3_4layer_p1_ckpt.yaml): same topology plus a DCP checkpoint save at step 50; exercises save/resume continuity of the full FSDP2+EP state, including Gate buffers.

P2 (hy3_8layer_p2_deepep.yaml): 8-layer proxy, pp=2, ep=4, DeepEP dispatcher (async_finish=True), 200 steps — validates DeepEP communicate-compute overlap and throughput against the P0 torch baseline.

All three configs: pp=2, ep=4, 8 GPUs, interleaved1f1b schedule, real sigmoid routing (fake_balanced_gate: false).

Signed-off-by: khazic <khazzz1c@gmail.com>
Switch from the tiny proxy model to real tencent/Hy3-preview weights with a truncated layer count (4/8 layers), following the same approach used for DeepSeek V4 Flash validation. Checkpoint keys for layers beyond the truncated num_hidden_layers are ignored via strict=False on load.

P0 (4 layers, torch, pp=2 ep=4): validates real tensor shapes and routing
P1 (4 layers, torch, pp=2 ep=4): adds DCP checkpoint save/resume
P2 (8 layers, deepep, pp=2 ep=4): validates DeepEP with real expert dims

All configs: AutoConfig.from_pretrained + num_hidden_layers override + load_base_model=true + enable_hf_state_dict_adapter=true. 192 experts / ep=4 = 48 experts per rank (~8 GB of parameters per rank at bf16).

Signed-off-by: khazic <khazzz1c@gmail.com>
- Fix optimizer: Adam → AdamW, lr 5e-4 → 1e-5, eps 1e-7 → 1e-8 (follows the official train.py and matches the DSV4 pattern)
- Add gate_precision: float32 to all HYV3 backends (matches the HF router's FP32)
- Add rope_fusion: false to P0/P1/P2 (attn: sdpa; avoids TE mismatch)
- Fix collate_fn to dict form with pad_seq_len_divisible: 64
- Add _target_ to tokenizer fields in dataset/validation_dataset
- Add shuffle: false and drop_last: true to validation_dataloader
- Fix hy3_preview_deepep: pp_schedule 1f1b (pp=1), add moe section, fix optimizer and dataset fields
- Remove num_nextn_predict_layers (DeepSeek-specific, not in the HYV3 config)
- Add update_moe_gate_bias() to HYV3ForCausalLM so the training recipe updates e_score_correction_bias each optimizer step (load balancing)

Signed-off-by: khazic <khazzz1c@gmail.com>
…nyuan3 Signed-off-by: khazic <khazzz1c@gmail.com>
AutoConfig.from_pretrained failed on checkpoints with model_type=hy_v3 because the type was not registered. Add config.py with HYV3Config (PretrainedConfig subclass) and wire it into _CUSTOM_CONFIG_REGISTRATIONS so that trust_remote_code=False keeps working. Signed-off-by: khazic <khazzz1c@gmail.com>
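A minimal sketch of that registration, assuming the usual PretrainedConfig pattern; every field and default value below except model_type is illustrative, not the real HYV3Config.

```python
from transformers import AutoConfig, PretrainedConfig

class HYV3Config(PretrainedConfig):
    """Illustrative stand-in for the real HYV3Config in config.py."""
    model_type = "hy_v3"

    def __init__(self, hidden_size=4096, num_hidden_layers=61,
                 n_routed_experts=192, num_experts_per_tok=8, **kwargs):
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.n_routed_experts = n_routed_experts
        self.num_experts_per_tok = num_experts_per_tok
        super().__init__(**kwargs)

# Once registered, AutoConfig.from_pretrained resolves checkpoints whose
# config.json declares model_type == "hy_v3" without trust_remote_code.
AutoConfig.register("hy_v3", HYV3Config)
```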
…ia DCP Custom models (e.g. HYV3) create training-only buffers (e_score_correction_bias) that are not present in the original HF pretrained checkpoint. Using DefaultLoadPlanner(allow_partial_load=True) for is_init_step loads lets those buffers keep their zero initialization instead of raising "Missing key in checkpoint state_dict". Signed-off-by: khazic <khazzz1c@gmail.com>
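A sketch of that partial-load behavior using the public torch.distributed.checkpoint API; the wrapper function and checkpoint directory are placeholders, and (as a later commit notes) this shim was eventually dropped once the adapter renamed the bias key correctly.

```python
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.default_planner import DefaultLoadPlanner

def load_pretrained_with_partial_load(model: torch.nn.Module, ckpt_dir: str) -> None:
    """Initial (is_init_step) load from a converted HF checkpoint.

    allow_partial_load=True tolerates keys that exist only in the model,
    e.g. training-only e_score_correction_bias buffers, which then keep
    their zero initialization instead of raising a missing-key error.
    """
    state_dict = model.state_dict()
    dcp.load(
        state_dict,
        checkpoint_id=ckpt_dir,  # placeholder directory
        planner=DefaultLoadPlanner(allow_partial_load=True),
    )
    model.load_state_dict(state_dict)
```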
In the standard DCP load path (post-shard, e.g. PP+EP), DCP already distributes expert tensors correctly via DTensor placement (Shard(0)). Passing moe_mesh to from_hf causes a second EP slice on the DTensor, producing a plain tensor of local shape [48,...] that cannot be loaded into the model's DTensor parameter of global shape [192,...]. Fix: pass moe_mesh=None to _maybe_adapt_state_dict_from_hf in the standard DCP path so adapters only rename keys and do not re-slice tensors that DCP has already distributed. The fast path (lines 444-498) is unaffected: it still passes moe_mesh because it loads full plain tensors from disk and needs explicit slicing. Signed-off-by: khazic <khazzz1c@gmail.com>
Replace internal server paths with the public HuggingFace model ID. Signed-off-by: khazic <khazzz1c@gmail.com>
Signed-off-by: khazic <khazzz1c@gmail.com>
Contributor
/ok to test 0c71b04
jgerh reviewed Apr 28, 2026
Contributor
jgerh
left a comment
Completed tech pubs review and provided a few copyedits.
Bump local_batch_size 4->8, set pp_size=4 and ep_size=32 for 128 GPUs (16 nodes x 8), add max_steps=100, and raise val_every_steps to 500. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
The previous HYV3 adapter expected an already-fused HF state dict
(mlp.experts.gate_up_proj / down_proj / e_score_correction_bias /
shared_experts.* / mlp.gate.weight). The on-disk Tencent format
actually stores per-expert split keys plus internal names
(mlp.experts.{i}.{gate,up,down}_proj.weight, mlp.expert_bias,
mlp.router.gate.weight, mlp.shared_mlp.*). Because checkpointing.py:507
zeroes reader_key_mapping when the model has a state_dict_adapter, the
storage reader's renames never ran and DCP found no matching keys for
any MoE tensor -- so router/shared/experts silently stayed at random
init while everything else loaded fine.
Rewrite HYV3StateDictAdapter on top of MoESplitExpertsStateDictMixin
so to_hf produces on-disk-format keys (per-expert split + Tencent
names) that DCP can match against the safetensors, and from_hf merges
them back into the grouped native form. The three HYV3-specific
renames (router.gate <-> gate, expert_bias <-> e_score_correction_bias,
shared_mlp. <-> shared_experts.) are applied around the mixin.
Also fix _maybe_adapt_state_dict_from_hf to pass moe_mesh in the DCP
init path (was None). The mixin's validator/merger needs the EP mesh
to know which expert-id subset is expected on the rank; without it,
required_experts = range(192) and validation fails ("Expert weights
missing from checkpoint: 432/576 ..."). The previous comment said
"DCP already distributed -- don't pass moe_mesh", but the mesh is
needed for subset-aware validation, not re-slicing.
Verified end-to-end on hy3_4layer_p0_smoke (pp=2, ep=4, 8 GPUs): all
56 non-bias tensors are bitwise identical to the HF reference; the 3
e_score_correction_bias tensors agree to <=2.2e-4 (bf16 round-trip
noise). Stitched mlp.gate.weight across 4 EP ranks matches the on-disk
router.gate.weight bitwise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
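As a rough sketch of the rename-plus-split/merge behavior described in this commit: the real adapter delegates the expert split/merge to MoESplitExpertsStateDictMixin, so the helper names, the "mlp." key context, and the gate/up concatenation order below are assumptions; only the three rename pairs come from the commit text.

```python
import torch

# HYV3-specific key fragments renamed around the shared split/merge mixin.
_HF_TO_NATIVE_RENAMES = {
    "mlp.router.gate.weight": "mlp.gate.weight",
    "mlp.expert_bias": "mlp.gate.e_score_correction_bias",
    "mlp.shared_mlp.": "mlp.shared_experts.",
}

def rename_hf_to_native(key: str) -> str:
    """Apply the first matching Tencent-format -> native rename, else pass through."""
    for hf_frag, native_frag in _HF_TO_NATIVE_RENAMES.items():
        if hf_frag in key:
            return key.replace(hf_frag, native_frag)
    return key

def merge_split_experts(gate_w, up_w, down_w):
    """Merge per-expert on-disk tensors into the grouped native layout.

    gate_w / up_w: lists of E tensors of shape [moe_inter, hidden]
    down_w:        list  of E tensors of shape [hidden, moe_inter]
    Returns:
      gate_and_up_projs [E, hidden, 2*moe_inter]  (concatenated, then transposed)
      down_projs        [E, moe_inter, hidden]    (transposed)
    """
    gate_up = [torch.cat([g, u], dim=0).t() for g, u in zip(gate_w, up_w)]
    return torch.stack(gate_up), torch.stack([d.t() for d in down_w])
```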
Add two debug-only hooks behind environment variables, both no-ops by
default:
- AUTOMODEL_PARITY_DUMP=<dir>: after build_model() returns, write each
rank's post-load state_dict to <dir>/rank{R}_state_dict.pt. Used to
verify HF -> Automodel weight loading matches a reference state dict
from the on-disk safetensors.
- AUTOMODEL_PARITY_LOGITS=<dir>: at the end of setup(), register
forward hooks on embed_tokens, every decoder layer, the final norm,
and lm_head; run one deterministic eval-mode forward through the PP
schedule on a fixed input (torch.randint with seed 0, seqlen 8); each
rank dumps its captured tensors to <dir>/rank{R}_outputs.pt. Stage 0
ranks capture hidden_0..hidden_{first_stage_layers}; stage 1 ranks
capture the remaining hidden states + hidden_norm + logits.
Used together with an HF transformers reference forward (same input)
to validate per-layer parity within bf16 noise. No effect on production
runs that don't set either env var.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
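A minimal sketch of the env-var-gated parity hook, assuming the usual model.model.embed_tokens / model.model.layers / lm_head layout; the helper name and dump format are illustrative, and the hook is a no-op when the variable is unset.

```python
import os
import torch

def maybe_register_parity_hooks(model: torch.nn.Module, rank: int) -> dict:
    """No-op unless AUTOMODEL_PARITY_LOGITS is set; then capture per-layer outputs."""
    dump_dir = os.environ.get("AUTOMODEL_PARITY_LOGITS")
    if dump_dir is None:
        return {}  # production runs are unaffected

    captured = {}

    def make_hook(name):
        def hook(_module, _inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            captured[name] = out.detach().cpu()
        return hook

    # Hook the embedding, every decoder layer, the final norm, and lm_head.
    model.model.embed_tokens.register_forward_hook(make_hook("embed"))
    for i, layer in enumerate(model.model.layers):
        layer.register_forward_hook(make_hook(f"hidden_{i}"))
    model.model.norm.register_forward_hook(make_hook("hidden_norm"))
    model.lm_head.register_forward_hook(make_hook("logits"))
    return captured

# After one deterministic forward pass, each rank would dump its captures:
# torch.save(captured, os.path.join(dump_dir, f"rank{rank}_outputs.pt"))
```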
This reverts commit c23bc04. Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
The two truncated 4-layer recipes (hy3_4layer_p0_smoke.yaml, hy3_4layer_p1_ckpt.yaml) were used during initial state-dict and checkpoint validation; the only remaining production recipe for HYV3 is hy3_preview_deepep.yaml. Drop the smoke files and remove the now-stale download links from the model-coverage page. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
The PR branch's modifications to nemo_automodel/components/checkpoint/checkpointing.py were tracking an older state of main; the relevant moe_mesh wiring (_maybe_adapt_state_dict_from_hf(..., moe_mesh=self.moe_mesh)) has been on main since NVIDIA-NeMo#1904 (Adil, 2025-10-16). Sync this file back to main.

Also drop the DefaultLoadPlanner(allow_partial_load=True) shim that was introduced for the previous HYV3 adapter, where the e_score_correction_bias buffer's HF key was not renamed correctly. The new mixin-based adapter (commit cb30a59) renames expert_bias <-> e_score_correction_bias during from_hf/to_hf, so DCP finds matching keys on disk and allow_partial_load is unnecessary.

Verified end-to-end on hy3_4layer_p0_smoke (pp=2, ep=4, 8 GPUs): 3 training steps complete, no missing-key errors, and MoE weights load correctly via the shared mixin path with moe_mesh from the call site.

While here, drop the unused _infer_ep_mesh helper in HYV3StateDictAdapter -- the call site always supplies moe_mesh now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
40 tests covering all behavior introduced by the rewritten adapter:
- Initialization: attribute wiring, default dtype, mixin inheritance.
- Rename tables (_NATIVE_TO_HF_RENAMES / _HF_TO_NATIVE_RENAMES):
parametrized round-trip for each rename pair, plus negative cases that
must NOT be renamed (attention, layernorm, embed, dense MLP, lm_head,
model.norm).
- from_hf (on-disk -> native):
* router.gate.weight -> gate.weight
* expert_bias -> gate.e_score_correction_bias
* shared_mlp.* -> shared_experts.*
* per-expert split keys merged into experts.gate_and_up_projs and
experts.down_projs with the right [E,H,2I]/[E,I,H] native shapes
and value-level transposed/concatenated layout
* MTP layer keys (index >= num_hidden_layers) dropped
* unrelated keys pass through unchanged
- to_hf (native -> on-disk):
* reverse renames produce the on-disk Tencent names
* grouped expert tensors split into per-expert keys with
[moe_inter, hidden] / [hidden, moe_inter] disk shapes
* exclude_key_regex honored
- convert_single_tensor_to_hf:
* non-expert keys renamed or pass through
* expert tensors split + renamed (one input -> 2*E or E pairs)
* exclude_key_regex applied after rename
- Round-trip integrity:
* native -> to_hf -> from_hf recovers every key value-for-value
* disk -> from_hf -> to_hf recovers every non-MTP key value-for-value
- _is_mtp_key: parametrized layer-index classification with/without
the "model." prefix, plus a config-threshold variation test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
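One of the round-trip checks listed above, sketched as a plain pytest-style function under the assumption that the adapter exposes to_hf/from_hf over ordinary key-to-tensor dicts; the fixture arguments are illustrative.

```python
import torch

def test_native_to_hf_round_trip(adapter, native_state_dict):
    """native -> to_hf -> from_hf must recover every key value-for-value."""
    recovered = adapter.from_hf(adapter.to_hf(native_state_dict))
    assert recovered.keys() == native_state_dict.keys()
    for key, tensor in native_state_dict.items():
        torch.testing.assert_close(recovered[key], tensor, rtol=0, atol=0)
```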
Adds tests for every code change in the PR that wasn't already covered:
- tests/unit_tests/models/hy_v3/test_hy_v3_config.py (15 tests):
HYV3Config defaults match the published 295B spec, override
propagation for attention dims / MoE routing / layer truncation /
first_k_dense_replace / router flags / RoPE / token IDs, to_dict
round-trip, and class-level model_type stability.
- tests/unit_tests/models/hy_v3/test_hy_v3_layers.py (10 tests):
HYV3Attention initialization (projection shapes, per-head qk_norm,
attention_bias on/off), forward output shapes through the sdpa
backend, q/k/v/o projections all called, attention_mask propagated,
init_weights resets norms + reseeds linears.
- tests/unit_tests/models/hy_v3/test_hy_v3_model.py (24 tests):
Block dense vs MoE switching at first_k_dense_replace, residual
forward calls attn + mlp, attention_mask -> padding_mask conversion,
init_weights propagates to sub-components; HYV3Model construction
+ dense+MoE structure + moe_config inference + moe_overrides +
moe_config-vs-overrides conflict + forward + position_ids +
init_weights; HYV3ForCausalLM construction, optional
state_dict_adapter wiring, default backend, get/set in/out
embeddings, forward logits shape, initialize_weights, the
update_moe_gate_bias no-op-when-factor-zero contract (regression
test for 564ff4f), from_config/from_pretrained classmethods,
ModelClass alias, module exports.
- tests/unit_tests/_transformers/test_registry_hy_v3.py (6 tests):
HYV3ForCausalLM is registered in MODEL_ARCH_MAPPING and resolves
to the right (module, class). hy_v3 is registered in
_CUSTOM_CONFIG_REGISTRATIONS and the resolved class has
model_type == 'hy_v3'. Negative tests confirm the removed
Ministral3 bidirectional retrieval keys are gone from
SUPPORTED_BACKBONES.
- tests/unit_tests/recipes/test_train_ft.py (3 tests):
PEFT + torch_save raises ValueError (parity with the equivalent
test added in test_finetune_vlm_helpers.py), PEFT + safetensors
succeeds, non-PEFT + torch_save succeeds.
All 107 tests pass (40 from the prior adapter test commit + 67 new
here).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Mirror the MiniMax-M2.7 PR (NVIDIA-NeMo#1785) doc additions for the new HYV3 support: news entry at the top of README.md "What's New" and a row in docs/model-coverage/latest-models.md. Per-model coverage page (docs/model-coverage/llm/tencent/hy3.md) and llm/index.md row are already present. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Contributor
/ok to test 1d511f8
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

# Conflicts:
#   tests/unit_tests/recipes/test_train_ft.py
Contributor
/ok to test fc62a07
HuiyingLi approved these changes Apr 29, 2026
Summary

Adds fine-tuning support for tencent/Hy3-preview, a 295B MoE model with:
- e_score_correction_bias gate buffer for expert-bias correction

References

Files added
- nemo_automodel/components/models/hy_v3/ — model, layers, config, state_dict_adapter
- nemo_automodel/_transformers/registry.py — register HYV3ForCausalLM and hy_v3 config
- nemo_automodel/components/checkpoint/checkpointing.py — partial-load fix for HF base checkpoint init (see below)
- examples/llm_finetune/hy_v3/hy3_preview_deepep.yaml — full 295B SFT recipe

Bug fixes included
gate_bias_update_factor=0.0 for HYV3

The official Tencent Hy3-preview SFT procedure treats e_score_correction_bias as a static pre-trained buffer and never updates it during fine-tuning. The original implementation incorrectly set gate_bias_update_factor=1e-3, which triggered an EMA update on every step. This is now set to 0.0. update_moe_gate_bias() is also guarded to be a no-op when the factor is zero, preventing an assertion error from Gate.update_bias().
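A hedged sketch of that guard: the update_moe_gate_bias name and the zero-factor contract come from the PR, while the surrounding class structure and the gate-iteration helper are illustrative.

```python
import torch

class HYV3ForCausalLMSketch:
    """Illustrative fragment showing only the gate-bias guard."""

    def __init__(self, gate_bias_update_factor: float = 0.0):
        self.gate_bias_update_factor = gate_bias_update_factor

    @torch.no_grad()
    def update_moe_gate_bias(self) -> None:
        # Hy3-preview SFT treats e_score_correction_bias as a frozen
        # pretrained buffer: with factor == 0.0 this is a no-op, so the
        # underlying gate update (which asserts a positive factor) never runs.
        if self.gate_bias_update_factor == 0.0:
            return
        for gate in self._iter_moe_gates():  # hypothetical helper
            gate.update_bias(self.gate_bias_update_factor)

    def _iter_moe_gates(self):
        # Placeholder: the real model walks its MoE layers here.
        return iter(())
```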
Validation status
- hy3_preview_deepep.yaml

Full 295B training curves (EP32 PP4, 16 nodes):