fix: Step-3.5-Flash layer_types mismatch and related recipe fixes#1916

Merged
akoumpa merged 3 commits into main from hemild/fix-step35-moonlight-configs
Apr 21, 2026
Conversation

@hemildesai
Contributor

Summary

  • Add tiktoken to base deps so Moonlight's TikToken-based remote tokenizer can load.
  • Work around upstream configs whose layer_types is longer than num_hidden_layers (e.g. stepfun-ai/Step-3.5-Flash ships 48 vs 45). get_hf_config now catches the validation error, truncates layer_types in the raw config dict, and rebuilds via the resolved config class (remote dynamic module or CONFIG_MAPPING).
  • Tune Qwen MoE recipes: qwen3_moe_30b_hellaswag hf_kl_threshold 1e-3 → 1e-2; qwen3_moe_30b_uccl_ep ep_size 16 → 8.
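The truncation step described above can be sketched as a small pure-dict helper. This is an illustrative sketch, not the actual `_load_config_with_layer_types_fix` implementation; the function name and structure are assumptions based on the summary.

```python
# Illustrative sketch (hypothetical name, not the real helper): when an
# upstream config ships more layer_types entries than num_hidden_layers,
# truncate the list in the raw config dict before rebuilding the config.

def truncate_layer_types(config_dict: dict) -> dict:
    """Return a copy of config_dict with layer_types cut to num_hidden_layers."""
    num_layers = config_dict.get("num_hidden_layers")
    layer_types = config_dict.get("layer_types")
    if num_layers is not None and layer_types is not None and len(layer_types) > num_layers:
        config_dict = dict(config_dict)  # avoid mutating the caller's dict
        config_dict["layer_types"] = layer_types[:num_layers]
    return config_dict


# Example mirroring stepfun-ai/Step-3.5-Flash (48 entries vs 45 layers):
cfg = truncate_layer_types({"num_hidden_layers": 45, "layer_types": ["full"] * 48})
print(len(cfg["layer_types"]))  # 45
```

In the real fix, the truncated dict is then passed back through the resolved config class (remote dynamic module or `CONFIG_MAPPING`) rather than returned directly.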

Test plan

  • New unit tests in tests/unit_tests/_transformers/test_model_init.py cover the helper (dynamic-module path, CONFIG_MAPPING fallback, no-op when lengths match, unresolved class) and get_hf_config retry behavior (triggers fix on validator error, reraises unrelated ValueError, preserves the "does not recognize this architecture" helpful message).
  • Smoke-tested _load_config_with_layer_types_fix against stepfun-ai/Step-3.5-Flash — config loads with matching lengths (45/45).
  • Reran Moonlight and Step-3.5-Flash recipes on the convergence cluster.
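The retry behavior the unit tests exercise can be sketched as follows. This is a simplified stand-in for `get_hf_config`, with hypothetical function names; the point is the control flow: catch the validation error, apply the fix once, and re-raise unrelated errors unchanged.

```python
# Sketch of the retry pattern (illustrative, not the real get_hf_config):
# attempt the load, fix layer_types only when the error is the known
# validation failure, and let any other ValueError propagate.

def load_config_with_retry(load_fn, fix_fn, config_dict):
    try:
        return load_fn(config_dict)
    except ValueError as exc:
        if "layer_types" not in str(exc):
            raise  # unrelated errors are re-raised, preserving the message
        return load_fn(fix_fn(config_dict))


def flaky_load(d):
    # Mimics transformers' validator rejecting a too-long layer_types list.
    if len(d["layer_types"]) > d["num_hidden_layers"]:
        raise ValueError("layer_types length mismatch")
    return d


fixed = load_config_with_retry(
    flaky_load,
    lambda d: {**d, "layer_types": d["layer_types"][: d["num_hidden_layers"]]},
    {"num_hidden_layers": 2, "layer_types": ["a", "b", "c"]},
)
print(fixed["layer_types"])  # ['a', 'b']
```

The unit tests in the PR cover exactly these branches: the fix triggering on the validator error, unrelated `ValueError`s re-raising, and the no-op path when lengths already match.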

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot Bot commented Apr 20, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@akoumpa
Contributor

akoumpa commented Apr 20, 2026

/ok to test 2d72807

@hemildesai hemildesai changed the title fix: add tiktoken dep, patch Step-3.5-Flash layer_types mismatch, tune Qwen MoE recipes fix: Step-3.5-Flash layer_types mismatch and related recipe fixes Apr 20, 2026
@hemildesai
Contributor Author

/claude review

@hemildesai
Contributor Author

/ok to test 2d72807

claude[bot] previously approved these changes Apr 20, 2026
Contributor

claude[bot] left a comment


LGTM

Comment thread nemo_automodel/_transformers/model_init.py Outdated
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
@akoumpa akoumpa merged commit a1dc3a6 into main Apr 21, 2026
4 of 5 checks passed
@akoumpa akoumpa deleted the hemild/fix-step35-moonlight-configs branch April 21, 2026 05:21
akoumpa added a commit that referenced this pull request Apr 21, 2026

* fix: add tiktoken dep, patch Step-3.5-Flash layer_types mismatch, tune Qwen MoE recipes

- Add tiktoken to base deps for Moonlight's TikToken-based remote tokenizer.
- Retry AutoConfig.from_pretrained when upstream configs ship layer_types
  longer than num_hidden_layers (e.g. stepfun-ai/Step-3.5-Flash) by
  truncating layer_types in the raw config dict and rebuilding via
  the resolved config class (dynamic module or CONFIG_MAPPING).
- Bump qwen3_moe_30b_hellaswag hf_kl_threshold 1e-3 -> 1e-2 and
  qwen3_moe_30b_uccl_ep ep_size 16 -> 8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: hemildesai <hemild@nvidia.com>

* Update uv lock

Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

* Apply suggestion from @claude[bot]

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>

---------

Signed-off-by: hemildesai <hemild@nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
hemildesai added a commit that referenced this pull request Apr 21, 2026
akoumpa added a commit that referenced this pull request Apr 21, 2026
fix: Step-3.5-Flash layer_types mismatch and related recipe fixes (#1916)

linnanwang pushed a commit that referenced this pull request Apr 24, 2026

Labels

r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge.

2 participants