fix: Step-3.5-Flash layer_types mismatch and related recipe fixes#1916

Merged
akoumpa merged 3 commits into main from hemild/fix-step35-moonlight-configs
Apr 21, 2026
Conversation

@hemildesai
Contributor

Summary

  • Add tiktoken to base deps so Moonlight's TikToken-based remote tokenizer can load.
  • Work around upstream configs whose layer_types is longer than num_hidden_layers (e.g. stepfun-ai/Step-3.5-Flash ships 48 vs 45). get_hf_config now catches the validation error, truncates layer_types in the raw config dict, and rebuilds via the resolved config class (remote dynamic module or CONFIG_MAPPING).
  • Tune Qwen MoE recipes: qwen3_moe_30b_hellaswag hf_kl_threshold 1e-3 → 1e-2; qwen3_moe_30b_uccl_ep ep_size 16 → 8.
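The truncation step described above can be sketched as a small pure-dict helper. This is an illustrative sketch, not the actual `_load_config_with_layer_types_fix` implementation; the function name and structure are assumptions based on the summary.

```python
# Illustrative sketch (hypothetical name, not the real helper): when an
# upstream config ships more layer_types entries than num_hidden_layers,
# truncate the list in the raw config dict before rebuilding the config.

def truncate_layer_types(config_dict: dict) -> dict:
    """Return a copy of config_dict with layer_types cut to num_hidden_layers."""
    num_layers = config_dict.get("num_hidden_layers")
    layer_types = config_dict.get("layer_types")
    if num_layers is not None and layer_types is not None and len(layer_types) > num_layers:
        config_dict = dict(config_dict)  # avoid mutating the caller's dict
        config_dict["layer_types"] = layer_types[:num_layers]
    return config_dict


# Example mirroring stepfun-ai/Step-3.5-Flash (48 entries vs 45 layers):
cfg = truncate_layer_types({"num_hidden_layers": 45, "layer_types": ["full"] * 48})
print(len(cfg["layer_types"]))  # 45
```

In the real fix, the truncated dict is then passed back through the resolved config class (remote dynamic module or `CONFIG_MAPPING`) rather than returned directly.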

Test plan

  • New unit tests in tests/unit_tests/_transformers/test_model_init.py cover the helper (dynamic-module path, CONFIG_MAPPING fallback, no-op when lengths match, unresolved class) and get_hf_config retry behavior (triggers fix on validator error, reraises unrelated ValueError, preserves the "does not recognize this architecture" helpful message).
  • Smoke-tested _load_config_with_layer_types_fix against stepfun-ai/Step-3.5-Flash — config loads with matching lengths (45/45).
  • Reran Moonlight and Step-3.5-Flash recipes on the convergence cluster.
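The retry behavior the unit tests exercise can be sketched as follows. This is a simplified stand-in for `get_hf_config`, with hypothetical function names; the point is the control flow: catch the validation error, apply the fix once, and re-raise unrelated errors unchanged.

```python
# Sketch of the retry pattern (illustrative, not the real get_hf_config):
# attempt the load, fix layer_types only when the error is the known
# validation failure, and let any other ValueError propagate.

def load_config_with_retry(load_fn, fix_fn, config_dict):
    try:
        return load_fn(config_dict)
    except ValueError as exc:
        if "layer_types" not in str(exc):
            raise  # unrelated errors are re-raised, preserving the message
        return load_fn(fix_fn(config_dict))


def flaky_load(d):
    # Mimics transformers' validator rejecting a too-long layer_types list.
    if len(d["layer_types"]) > d["num_hidden_layers"]:
        raise ValueError("layer_types length mismatch")
    return d


fixed = load_config_with_retry(
    flaky_load,
    lambda d: {**d, "layer_types": d["layer_types"][: d["num_hidden_layers"]]},
    {"num_hidden_layers": 2, "layer_types": ["a", "b", "c"]},
)
print(fixed["layer_types"])  # ['a', 'b']
```

The unit tests in the PR cover exactly these branches: the fix triggering on the validator error, unrelated `ValueError`s re-raising, and the no-op path when lengths already match.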

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot Bot commented Apr 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa
Contributor

akoumpa commented Apr 20, 2026

/ok to test 2d72807

@hemildesai hemildesai changed the title fix: add tiktoken dep, patch Step-3.5-Flash layer_types mismatch, tune Qwen MoE recipes fix: Step-3.5-Flash layer_types mismatch and related recipe fixes Apr 20, 2026
@hemildesai
Contributor Author

/claude review

@hemildesai
Contributor Author

/ok to test 2d72807

claude[bot] previously approved these changes Apr 20, 2026
Contributor

claude[bot] left a comment


LGTM

Comment thread nemo_automodel/_transformers/model_init.py Outdated
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
@akoumpa akoumpa merged commit a1dc3a6 into main Apr 21, 2026
4 of 5 checks passed
@akoumpa akoumpa deleted the hemild/fix-step35-moonlight-configs branch April 21, 2026 05:21
akoumpa added a commit that referenced this pull request Apr 21, 2026

* fix: add tiktoken dep, patch Step-3.5-Flash layer_types mismatch, tune Qwen MoE recipes

- Add tiktoken to base deps for Moonlight's TikToken-based remote tokenizer.
- Retry AutoConfig.from_pretrained when upstream configs ship layer_types
  longer than num_hidden_layers (e.g. stepfun-ai/Step-3.5-Flash) by
  truncating layer_types in the raw config dict and rebuilding via
  the resolved config class (dynamic module or CONFIG_MAPPING).
- Bump qwen3_moe_30b_hellaswag hf_kl_threshold 1e-3 -> 1e-2 and
  qwen3_moe_30b_uccl_ep ep_size 16 -> 8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>

* Update uv lock

Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

* Apply suggestion from @claude[bot]

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

---------

Signed-off-by: hemildesai <hemild@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
hemildesai added a commit that referenced this pull request Apr 21, 2026
akoumpa added a commit that referenced this pull request Apr 21, 2026
fix: Step-3.5-Flash layer_types mismatch and related recipe fixes (#1916)

linnanwang pushed a commit that referenced this pull request Apr 24, 2026

Labels

r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.

2 participants