Conversation

Signed-off-by: Huiying Li <willwin.lee@gmail.com>

/ok to test 6a9f3b1

akoumpa approved these changes on Apr 12, 2026.
edjson pushed a commit to edjson/Automodel that referenced this pull request on Apr 17, 2026:
update docs for m27

Signed-off-by: Huiying Li <willwin.lee@gmail.com>
edjson pushed a commit to edjson/Automodel that referenced this pull request on Apr 18, 2026:
update docs for m27

Signed-off-by: Huiying Li <willwin.lee@gmail.com>
Signed-off-by: Edison <edisonggacc@gmail.com>
linnanwang pushed a commit that referenced this pull request on Apr 24, 2026:
update docs for m27

Signed-off-by: Huiying Li <willwin.lee@gmail.com>
HuiyingLi added a commit to khazic/Automodel_lao that referenced this pull request on Apr 25, 2026:
Mirrors the per-model rollout pattern used for MiniMax-M2.7 (NVIDIA-NeMo#1785): news entry at the top of the README, a dedicated model-coverage page under deepseek-ai/, and registration of the new page in the LLM index (architecture table + toctree).

- README.md (news entry)
- docs/model-coverage/llm/deepseek-ai/dsv4-flash.md (new)
- docs/model-coverage/llm/index.md (table + toctree)

Signed-off-by: Huiying Li <huiyingl@nvidia.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HuiyingLi added a commit to khazic/Automodel_lao that referenced this pull request on Apr 25, 2026:
Mirrors the per-model rollout pattern used for MiniMax-M2.7 (NVIDIA-NeMo#1785): news entry at the top of the README, a dedicated model-coverage page under deepseek-ai/, and registration of the new page in the LLM index (architecture table + toctree).

- README.md (news entry)
- docs/model-coverage/llm/deepseek-ai/dsv4-flash.md (new)
- docs/model-coverage/llm/index.md (table + toctree)

Signed-off-by: Huiying Li <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HuiyingLi added a commit that referenced this pull request on Apr 26, 2026:
…2054)

* docs(llm): drop validate-yaml reference from DeepSeek V4 Flash guide

Removes the validate-yaml bullet under "Launch Training" and the "Quick infrastructure validation" subsection. The validate harness is an internal smoke-test config, not a user-facing finetune recipe; the guide should advertise only the HellaSwag recipe.

Follow-up to #2053 (the original change was force-pushed after the PR had already merged, so the deletion did not land on main).

Signed-off-by: khazic <khazzz1c@gmail.com>

* docs(llm): add DeepSeek V4 Flash to README + model-coverage index

Mirrors the per-model rollout pattern used for MiniMax-M2.7 (#1785): news entry at the top of the README, a dedicated model-coverage page under deepseek-ai/, and registration of the new page in the LLM index (architecture table + toctree).

- README.md (news entry)
- docs/model-coverage/llm/deepseek-ai/dsv4-flash.md (new)
- docs/model-coverage/llm/index.md (table + toctree)

Signed-off-by: Huiying Li <willwin.lee@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(llm): use plain link for hellaswag yaml until model PR lands

The {download} directive on the recipe yaml fails the Sphinx build with `download.not_readable` because examples/llm_finetune/deepseek_v4/deepseek_v4_flash_hellaswag.yaml is added by the model PR (#2039), which has not yet landed on main. Use a plain GitHub link until #2039 merges; a follow-up can switch back to {download} once the file is on main.

Signed-off-by: khazic <khazzz1c@gmail.com>

---------

Signed-off-by: khazic <khazzz1c@gmail.com>
Co-authored-by: Huiying Li <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HuiyingLi added a commit to khazic/Automodel_lao that referenced this pull request on Apr 28, 2026:
Mirror the MiniMax-M2.7 PR (NVIDIA-NeMo#1785) doc additions for the new HYV3 support: news entry at the top of README.md "What's New" and a row in docs/model-coverage/latest-models.md. Per-model coverage page (docs/model-coverage/llm/tencent/hy3.md) and llm/index.md row are already present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
HuiyingLi added a commit that referenced this pull request on Apr 29, 2026:
* feat(llm): add Hy3-preview (HYV3) SFT support
Adds SFT training support for tencent/Hy3-preview (295B MoE, 192 experts
top-8, 256K context). Requires transformers >= 5.6.0.
New files:
- nemo_automodel/components/models/hy_v3/layers.py: HYV3Attention with
GQA, per-head QK RMSNorm, and RoPE
- nemo_automodel/components/models/hy_v3/model.py: HYV3ForCausalLM /
HYV3Model / Block wrapping Automodel's MoE infrastructure
- nemo_automodel/components/models/hy_v3/state_dict_adapter.py:
HYV3StateDictAdapter handling HF↔native conversion (expert tensor
transposition, e_score_correction_bias relocation, MTP layer skipping)
- examples/llm_finetune/hy_v3/hy3_preview_deepep.yaml: example SFT config
Key architecture differences vs Qwen3-MoE:
- Sigmoid routing with e_score_correction_bias (vs softmax)
- first_k_dense_replace=1: only layer 0 is dense
- 1 shared expert alongside 192 routed experts
- route_scale=2.826 applied to routing weights
- HF expert tensors are pre-grouped [n,2i,h] vs per-expert; only need
transposition (not stack/concat) in state_dict_adapter
Registers HYV3ForCausalLM in MODEL_ARCH_MAPPING.
Signed-off-by: khazic <khazzz1c@gmail.com>
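The sigmoid routing described above can be pictured with a toy example. This is an illustrative sketch, not the repository code; in particular it assumes DeepSeek-V3-style semantics where e_score_correction_bias influences only expert selection, not the mixing weights (the commit does not spell out that detail), and uses a tiny 4-expert, top-2 configuration instead of the real 192-expert top-8 setup.

```python
import math

def sigmoid_route(logits, bias, top_k=8, route_scale=2.826):
    """Toy sigmoid router: scores = sigmoid(logits); the correction bias
    is added only for top-k selection (assumed DeepSeek-V3-style), and
    the selected experts' original scores are normalized and rescaled
    by route_scale."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    biased = [s + b for s, b in zip(scores, bias)]
    # indices of the top_k experts by biased score
    topk = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in topk)
    return {i: route_scale * scores[i] / total for i in topk}

# 4 experts, top-2, zero bias: experts 0 and 2 win on raw score
w = sigmoid_route([2.0, -1.0, 0.5, 0.0], [0.0, 0.0, 0.0, 0.0], top_k=2)
```

With this normalization the routing weights of the selected experts sum to route_scale, which is one plausible reading of "route_scale=2.826 applied to routing weights".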
* ci(llm): add HYV3 phased test configs (P0/P1/P2)
P0 (hy3_4layer_p0_smoke.yaml): 4-layer proxy, pp=2, ep=4, torch
dispatcher, 100 steps — validates forward/backward/PP/EP health
and e_score_correction_bias updates with no checkpoint I/O.
P1 (hy3_4layer_p1_ckpt.yaml): same topology + DCP checkpoint save
at step 50, exercises save/resume continuity of the full FSDP2+EP
state including Gate buffers.
P2 (hy3_8layer_p2_deepep.yaml): 8-layer proxy, pp=2, ep=4, DeepEP
dispatcher (async_finish=True), 200 steps — validates deepep
communicate-compute overlap and throughput vs P0 torch baseline.
All three configs: pp=2, ep=4, 8 GPUs, interleaved1f1b schedule,
real sigmoid routing (fake_balanced_gate: false).
Signed-off-by: khazic <khazzz1c@gmail.com>
* ci(llm): rewrite HYV3 test configs to use real checkpoint (DSV4 pattern)
Switch from tiny proxy model to real tencent/Hy3-preview weights with a
truncated layer count (4/8 layers), following the same approach used for
DeepSeek V4 Flash validation. Checkpoint keys for layers beyond the
truncated num_hidden_layers are ignored via strict=False on load.
P0 (4 layers, torch, pp=2 ep=4): validates real tensor shapes and routing
P1 (4 layers, torch, pp=2 ep=4): adds DCP checkpoint save/resume
P2 (8 layers, deepep, pp=2 ep=4): validates DeepEP with real expert dims
All configs: AutoConfig.from_pretrained + num_hidden_layers override +
load_base_model=true + enable_hf_state_dict_adapter=true.
192 experts / ep=4 = 48 experts per rank (~8GB params per rank at bf16).
Signed-off-by: khazic <khazzz1c@gmail.com>
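The layer-truncation trick above amounts to ignoring checkpoint keys whose decoder-layer index falls at or beyond the truncated num_hidden_layers. A hypothetical helper (not the loader code; the key pattern is the common HF `model.layers.{i}.` layout) makes the idea concrete:

```python
import re

def filter_truncated_layers(keys, num_hidden_layers):
    """Keep checkpoint keys whose layer index is below the truncated
    num_hidden_layers; non-layer keys pass through. Mirrors what
    strict=False effectively tolerates on load."""
    pat = re.compile(r"model\.layers\.(\d+)\.")
    kept = []
    for k in keys:
        m = pat.match(k)
        if m is None or int(m.group(1)) < num_hidden_layers:
            kept.append(k)
    return kept

keys = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.3.mlp.gate.weight",
    "model.layers.59.mlp.gate.weight",  # beyond a 4-layer truncation
]
kept = filter_truncated_layers(keys, num_hidden_layers=4)
```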
* fix(llm): align HYV3 configs and model with official Hy3-preview specs
- Fix optimizer: Adam → AdamW, lr 5e-4 → 1e-5, eps 1e-7 → 1e-8 (follows
official train.py and matches DSV4 pattern)
- Add gate_precision: float32 to all HYV3 backends (matches HF router FP32)
- Add rope_fusion: false to P0/P1/P2 (attn: sdpa; avoids TE mismatch)
- Fix collate_fn to dict form with pad_seq_len_divisible: 64
- Add _target_ to tokenizer fields in dataset/validation_dataset
- Add shuffle: false and drop_last: true to validation_dataloader
- Fix hy3_preview_deepep: pp_schedule 1f1b (pp=1), add moe section,
fix optimizer and dataset fields
- Remove num_nextn_predict_layers (DeepSeek-specific, not in HYV3 config)
- Add update_moe_gate_bias() to HYV3ForCausalLM so the training recipe
updates e_score_correction_bias each optimizer step (load balancing)
Signed-off-by: khazic <khazzz1c@gmail.com>
* ci(llm): set Hy3-preview checkpoint path to /llm-align/open_models/hunyuan3
Signed-off-by: khazic <khazzz1c@gmail.com>
* feat(llm): add HYV3Config and register hy_v3 with AutoConfig
AutoConfig.from_pretrained failed on checkpoints with model_type=hy_v3
because the type was not registered. Add config.py with HYV3Config
(PretrainedConfig subclass) and wire it into _CUSTOM_CONFIG_REGISTRATIONS
so that trust_remote_code=False keeps working.
Signed-off-by: khazic <khazzz1c@gmail.com>
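The registration fix can be pictured with a toy registry. HYV3Config and the _CUSTOM_CONFIG_REGISTRATIONS name come from the commit above; everything else here is illustrative (the registry is modeled as a plain dict, and the real wiring goes through transformers' AutoConfig.register rather than a hand-rolled resolver):

```python
class HYV3Config:
    """Stand-in for the PretrainedConfig subclass added in config.py."""
    model_type = "hy_v3"

# Toy model of the custom-config registry: model_type string -> config class.
_CUSTOM_CONFIG_REGISTRATIONS = {"hy_v3": HYV3Config}

def resolve_config(model_type):
    """Fail for unregistered types (the pre-fix AutoConfig behavior);
    return the config class once the type is registered."""
    try:
        return _CUSTOM_CONFIG_REGISTRATIONS[model_type]
    except KeyError:
        raise ValueError(f"Unrecognized model_type: {model_type!r}")

cls = resolve_config("hy_v3")
```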
* fix(checkpoint): allow partial load when loading HF base checkpoint via DCP
Custom models (e.g. HYV3) create training-only buffers (e_score_correction_bias)
that are not present in the original HF pretrained checkpoint. Using
DefaultLoadPlanner(allow_partial_load=True) for is_init_step loads lets those
buffers keep their zero initialization instead of raising
"Missing key in checkpoint state_dict".
Signed-off-by: khazic <khazzz1c@gmail.com>
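The allow_partial_load semantics boil down to the following pure-Python sketch (not PyTorch DCP internals; the key names are illustrative): keys present in the checkpoint overwrite the model's values, while model-only buffers keep their initialization instead of raising a missing-key error.

```python
def partial_load(model_state, ckpt_state):
    """Toy allow_partial_load: overwrite only the keys the checkpoint
    actually has; training-only buffers absent from the pretrained
    checkpoint (e.g. e_score_correction_bias) keep their init values."""
    for key, value in ckpt_state.items():
        if key in model_state:
            model_state[key] = value
    return model_state

model_state = {
    "layers.1.mlp.gate.weight": "random-init",
    "layers.1.mlp.gate.e_score_correction_bias": 0.0,  # training-only buffer
}
ckpt_state = {"layers.1.mlp.gate.weight": "pretrained"}
loaded = partial_load(model_state, ckpt_state)
```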
* fix(checkpoint): skip EP slicing in from_hf for standard DCP load path
In the standard DCP load path (post-shard, e.g. PP+EP), DCP already
distributes expert tensors correctly via DTensor placement (Shard(0)).
Passing moe_mesh to from_hf causes a second EP slice on the DTensor,
producing a plain tensor of local shape [48,...] that cannot be loaded
into the model's DTensor parameter of global shape [192,...].
Fix: pass moe_mesh=None to _maybe_adapt_state_dict_from_hf in the
standard DCP path so adapters only rename keys and do not re-slice
tensors that DCP has already distributed.
The fast path (lines 444-498) is unaffected: it still passes moe_mesh
because it loads full plain tensors from disk and needs explicit slicing.
Signed-off-by: khazic <khazzz1c@gmail.com>
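Why the double slice fails is easy to see with plain lists. This is a toy model of EP sharding, with ep_slice standing in for DTensor Shard(0) placement; the exact shapes in the commit differ, but the failure mode is the same: slicing an already-distributed shard a second time yields the wrong local size.

```python
def ep_slice(experts, rank, world):
    """Toy expert-parallel shard: this rank's contiguous slice of the
    expert dimension."""
    per = len(experts) // world
    return experts[rank * per:(rank + 1) * per]

experts = list(range(192))                    # global expert dimension
local = ep_slice(experts, rank=1, world=4)    # DCP-placed shard: 48 experts
double = ep_slice(local, rank=1, world=4)     # second slice: only 12 experts
```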
* ci(llm): use public tencent/Hy3-preview HF path in HYV3 test yamls
Replace internal server paths with the public HuggingFace model ID.
Signed-off-by: khazic <khazzz1c@gmail.com>
* ci(llm): remove P2 DeepEP yaml (not yet validated)
Signed-off-by: khazic <khazzz1c@gmail.com>
* fix(llm): disable e_score_correction_bias EMA update for HYV3
The official Hy3-preview fine-tuning scripts (train.py and
hy_v3_patches.py) treat e_score_correction_bias as a static
pre-trained buffer — it is loaded from the base checkpoint and
used as-is during SFT, with no in-training update.
Set gate_bias_update_factor=0.0 so our implementation matches:
the buffer is still created and loaded from checkpoint
(via force_e_score_correction_bias), but the EMA update path
in Gate.update_bias() is never triggered.
Signed-off-by: khazic <khazzz1c@gmail.com>
* fix(llm): guard update_moe_gate_bias against disabled bias update
When gate_bias_update_factor=0.0, calling update_bias() triggers an
assertion error. Skip the call when the factor is zero.
Signed-off-by: khazic <khazzz1c@gmail.com>
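Taken together, the two commits above amount to the contract sketched below. The function name, its arguments, and the simple load-balancing nudge are all hypothetical (the real Gate.update_bias is not shown here); the point is the zero-factor guard, which keeps the buffer at its loaded value during SFT as the official Hy3-preview scripts do.

```python
def update_gate_bias(bias, expert_load, target_load, factor):
    """Hypothetical EMA-style correction-bias update: nudge each
    expert's bias toward balanced load. With factor == 0.0 the update
    is skipped entirely rather than tripping an assertion."""
    if factor == 0.0:
        return bias  # disabled: static pre-trained buffer, used as-is
    return [b + factor * (target_load - load)
            for b, load in zip(bias, expert_load)]

static = update_gate_bias([0.1, -0.2], expert_load=[10, 2], target_load=6, factor=0.0)
moved = update_gate_bias([0.1, -0.2], expert_load=[10, 2], target_load=6, factor=0.5)
```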
* docs(llm): add model-coverage page for Hy3-preview (HYV3ForCausalLM)
Fixes test_every_registered_arch_has_model_coverage_doc: the new
HYV3ForCausalLM architecture registered in the registry now has a
corresponding docs/model-coverage/llm/tencent/hy3.md page.
Signed-off-by: khazic <khazzz1c@gmail.com>
* docs(llm): add tencent/Hy3-preview to LLM model-coverage toctree
hy3.md was missing from the toctree, causing a Sphinx build warning
(treated as error by CI).
Signed-off-by: khazic <khazzz1c@gmail.com>
* chore(llm): tune Hy3-preview DeepEP recipe for 16-node run
Bump local_batch_size 4->8, set pp_size=4 and ep_size=32 for 128 GPUs
(16 nodes x 8), add max_steps=100, and raise val_every_steps to 500.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
* fix(llm): load Hy3-preview MoE expert weights from HF checkpoint
The previous HYV3 adapter expected an already-fused HF state dict
(mlp.experts.gate_up_proj / down_proj / e_score_correction_bias /
shared_experts.* / mlp.gate.weight). The on-disk Tencent format
actually stores per-expert split keys plus internal names
(mlp.experts.{i}.{gate,up,down}_proj.weight, mlp.expert_bias,
mlp.router.gate.weight, mlp.shared_mlp.*). Because checkpointing.py:507
zeroes reader_key_mapping when the model has a state_dict_adapter, the
storage reader's renames never ran and DCP found no matching keys for
any MoE tensor -- so router/shared/experts silently stayed at random
init while everything else loaded fine.
Rewrite HYV3StateDictAdapter on top of MoESplitExpertsStateDictMixin
so to_hf produces on-disk-format keys (per-expert split + Tencent
names) that DCP can match against the safetensors, and from_hf merges
them back into the grouped native form. The three HYV3-specific
renames (router.gate <-> gate, expert_bias <-> e_score_correction_bias,
shared_mlp. <-> shared_experts.) are applied around the mixin.
Also fix _maybe_adapt_state_dict_from_hf to pass moe_mesh in the DCP
init path (was None). The mixin's validator/merger needs the EP mesh
to know which expert-id subset is expected on the rank; without it,
required_experts = range(192) and validation fails ("Expert weights
missing from checkpoint: 432/576 ..."). The previous comment said
"DCP already distributed -- don't pass moe_mesh", but the mesh is
needed for subset-aware validation, not re-slicing.
Verified end-to-end on hy3_4layer_p0_smoke (pp=2, ep=4, 8 GPUs): all
56 non-bias tensors are bitwise identical to the HF reference; the 3
e_score_correction_bias tensors agree to <=2.2e-4 (bf16 round-trip
noise). Stitched mlp.gate.weight across 4 EP ranks matches the on-disk
router.gate.weight bitwise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
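The three HYV3-specific renames applied around the mixin can be sketched as an ordered substitution table. The key pairs come from the commit above; rename/to_native/to_hf are illustrative helpers, not the adapter's API, and the table order matters so the e_score_correction_bias key is handled before the plain gate prefix.

```python
# On-disk Tencent names (left) <-> grouped native names (right).
# Order matters: the bias entry must precede the gate-prefix entry so
# that reversing the table does not mangle "gate.e_score_correction_bias".
_HF_TO_NATIVE = [
    ("mlp.expert_bias", "mlp.gate.e_score_correction_bias"),
    ("mlp.router.gate.", "mlp.gate."),
    ("mlp.shared_mlp.", "mlp.shared_experts."),
]

def rename(key, table):
    """Apply the first matching substitution; unrelated keys pass through."""
    for src, dst in table:
        if src in key:
            return key.replace(src, dst)
    return key

def to_native(key):
    return rename(key, _HF_TO_NATIVE)

def to_hf(key):
    return rename(key, [(dst, src) for src, dst in _HF_TO_NATIVE])

native = to_native("model.layers.3.mlp.router.gate.weight")
```

Round-tripping every renamed key back to its on-disk form is exactly the property the adapter's unit tests check later in this PR.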
* chore(llm): env-gated parity dumps in train_ft.py setup()
Add two debug-only hooks behind environment variables, both no-ops by
default:
- AUTOMODEL_PARITY_DUMP=<dir>: after build_model() returns, write each
rank's post-load state_dict to <dir>/rank{R}_state_dict.pt. Used to
verify HF -> Automodel weight loading matches a reference state dict
from the on-disk safetensors.
- AUTOMODEL_PARITY_LOGITS=<dir>: at the end of setup(), register
forward hooks on embed_tokens, every decoder layer, the final norm,
and lm_head; run one deterministic eval-mode forward through the PP
schedule on a fixed input (torch.randint with seed 0, seqlen 8); each
rank dumps its captured tensors to <dir>/rank{R}_outputs.pt. Stage 0
ranks capture hidden_0..hidden_{first_stage_layers}; stage 1 ranks
capture the remaining hidden states + hidden_norm + logits.
Used together with an HF transformers reference forward (same input)
to validate per-layer parity within bf16 noise. No effect on production
runs that don't set either env var.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
* Revert "chore(llm): env-gated parity dumps in train_ft.py setup()"
This reverts commit c23bc04.
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
* Apply suggestion from @jgerh
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
* Apply suggestion from @jgerh
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
* Apply suggestion from @jgerh
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
* Apply suggestion from @jgerh
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
* chore(llm): drop Hy3 4-layer smoke/ckpt example yamls
The two truncated 4-layer recipes (hy3_4layer_p0_smoke.yaml,
hy3_4layer_p1_ckpt.yaml) were used during initial state-dict and
checkpoint validation; the only remaining production recipe for
HYV3 is hy3_preview_deepep.yaml. Drop the smoke files and remove
the now-stale download links from the model-coverage page.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
* chore(checkpoint): drop redundant changes already on main
The PR branch's modifications to nemo_automodel/components/checkpoint/checkpointing.py
were tracking older state of main; the relevant moe_mesh wiring
(_maybe_adapt_state_dict_from_hf(..., moe_mesh=self.moe_mesh)) has been
on main since #1904 (Adil, 2025-10-16). Sync this file back to main.
Also drop the DefaultLoadPlanner(allow_partial_load=True) shim that was
introduced for the previous HYV3 adapter where the e_score_correction_bias
buffer's HF key was not renamed correctly. The new mixin-based adapter
(commit cb30a59) renames expert_bias <-> e_score_correction_bias
during from_hf/to_hf, so DCP finds matching keys on disk and
allow_partial_load is unnecessary.
Verified end-to-end on hy3_4layer_p0_smoke (pp=2, ep=4, 8 GPUs):
3 training steps complete, no missing-key errors, MoE weights load
correctly via the shared mixin path with moe_mesh from the call site.
While here, drop the unused _infer_ep_mesh helper in
HYV3StateDictAdapter -- the call site always supplies moe_mesh now.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
* test(llm): unit tests for HYV3StateDictAdapter
40 tests covering all behavior introduced by the rewritten adapter:
- Initialization: attribute wiring, default dtype, mixin inheritance.
- Rename tables (_NATIVE_TO_HF_RENAMES / _HF_TO_NATIVE_RENAMES):
parametrized round-trip for each rename pair, plus negative cases that
must NOT be renamed (attention, layernorm, embed, dense MLP, lm_head,
model.norm).
- from_hf (on-disk -> native):
* router.gate.weight -> gate.weight
* expert_bias -> gate.e_score_correction_bias
* shared_mlp.* -> shared_experts.*
* per-expert split keys merged into experts.gate_and_up_projs and
experts.down_projs with the right [E,H,2I]/[E,I,H] native shapes
and value-level transposed/concatenated layout
* MTP layer keys (index >= num_hidden_layers) dropped
* unrelated keys pass through unchanged
- to_hf (native -> on-disk):
* reverse renames produce the on-disk Tencent names
* grouped expert tensors split into per-expert keys with
[moe_inter, hidden] / [hidden, moe_inter] disk shapes
* exclude_key_regex honored
- convert_single_tensor_to_hf:
* non-expert keys renamed or pass through
* expert tensors split + renamed (one input -> 2*E or E pairs)
* exclude_key_regex applied after rename
- Round-trip integrity:
* native -> to_hf -> from_hf recovers every key value-for-value
* disk -> from_hf -> to_hf recovers every non-MTP key value-for-value
- _is_mtp_key: parametrized layer-index classification with/without
the "model." prefix, plus a config-threshold variation test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
* test(llm): cover remaining HYV3 + recipe changes in this PR
Adds tests for every code change in the PR that wasn't already covered:
- tests/unit_tests/models/hy_v3/test_hy_v3_config.py (15 tests):
HYV3Config defaults match the published 295B spec, override
propagation for attention dims / MoE routing / layer truncation /
first_k_dense_replace / router flags / RoPE / token IDs, to_dict
round-trip, and class-level model_type stability.
- tests/unit_tests/models/hy_v3/test_hy_v3_layers.py (10 tests):
HYV3Attention initialization (projection shapes, per-head qk_norm,
attention_bias on/off), forward output shapes through the sdpa
backend, q/k/v/o projections all called, attention_mask propagated,
init_weights resets norms + reseeds linears.
- tests/unit_tests/models/hy_v3/test_hy_v3_model.py (24 tests):
Block dense vs MoE switching at first_k_dense_replace, residual
forward calls attn + mlp, attention_mask -> padding_mask conversion,
init_weights propagates to sub-components; HYV3Model construction
+ dense+MoE structure + moe_config inference + moe_overrides +
moe_config-vs-overrides conflict + forward + position_ids +
init_weights; HYV3ForCausalLM construction, optional
state_dict_adapter wiring, default backend, get/set in/out
embeddings, forward logits shape, initialize_weights, the
update_moe_gate_bias no-op-when-factor-zero contract (regression
test for 564ff4f), from_config/from_pretrained classmethods,
ModelClass alias, module exports.
- tests/unit_tests/_transformers/test_registry_hy_v3.py (6 tests):
HYV3ForCausalLM is registered in MODEL_ARCH_MAPPING and resolves
to the right (module, class). hy_v3 is registered in
_CUSTOM_CONFIG_REGISTRATIONS and the resolved class has
model_type == 'hy_v3'. Negative tests confirm the removed
Ministral3 bidirectional retrieval keys are gone from
SUPPORTED_BACKBONES.
- tests/unit_tests/recipes/test_train_ft.py (3 tests):
PEFT + torch_save raises ValueError (parity with the equivalent
test added in test_finetune_vlm_helpers.py), PEFT + safetensors
succeeds, non-PEFT + torch_save succeeds.
All 107 tests pass (40 from the prior adapter test commit + 67 new
here).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
* docs(llm): announce Hy3-preview in README and latest-models log
Mirror the MiniMax-M2.7 PR (#1785) doc additions for the new HYV3
support: news entry at the top of README.md "What's New" and a row
in docs/model-coverage/latest-models.md. Per-model coverage page
(docs/model-coverage/llm/tencent/hy3.md) and llm/index.md row are
already present.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
---------
Signed-off-by: khazic <khazzz1c@gmail.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com>