fix: forward use_te_activation_func flag in non-MoE GPT layer spec #3300
Merged
yaox12 merged 4 commits into NVIDIA:main, Feb 26, 2026
Conversation
The --use-te-activation-func CLI flag was parsed and stored in TransformerConfig but never forwarded to get_gpt_layer_with_transformer_engine_spec() in the non-MoE code path of _get_transformer_layer_spec(). This caused the flag to silently default to False, preventing TE activation functions from being used in non-MoE GPT models. Added use_te_activation_func=config.use_te_activation_func to the function call, consistent with how MoE and experimental attention code paths already forward this parameter. Fixes: NVIDIA#2770
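The bug this commit fixes can be sketched in a few lines. The sketch below is a hypothetical, self-contained model of the pattern, not the real Megatron-LM code: the function names mirror the ones in the PR, but the bodies are stand-ins that only report which activation implementation would be chosen.

```python
# Minimal model of the bug: the layer-spec builder accepts a
# use_te_activation_func keyword that silently defaults to False, and the
# buggy caller never forwards the value stored on the config.
from dataclasses import dataclass


@dataclass
class TransformerConfig:
    use_te_activation_func: bool = False


def get_gpt_layer_with_transformer_engine_spec(use_te_activation_func=False):
    # Stand-in: report which activation implementation would be selected.
    return "te_fused" if use_te_activation_func else "pytorch"


def get_layer_spec_buggy(config):
    # Bug: the flag on the config is never forwarded, so the builder
    # always sees its False default.
    return get_gpt_layer_with_transformer_engine_spec()


def get_layer_spec_fixed(config):
    # Fix: forward the flag explicitly, as this commit does.
    return get_gpt_layer_with_transformer_engine_spec(
        use_te_activation_func=config.use_te_activation_func
    )


config = TransformerConfig(use_te_activation_func=True)
print(get_layer_spec_buggy(config))  # pytorch  (flag silently ignored)
print(get_layer_spec_fixed(config))  # te_fused
```

Because the keyword has a default, the omission raises no error anywhere, which is why the flag could be ignored silently.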
Add a unit test verifying that _get_transformer_layer_spec() correctly forwards use_te_activation_func from the TransformerConfig to get_gpt_layer_with_transformer_engine_spec(). Uses mock patching to isolate the parameter forwarding behavior without requiring CUDA. Regression test for NVIDIA#2770
santhnm2 approved these changes, Feb 9, 2026
Contributor: /ok to test 509ef8d
Member: /ok to test f3b227e
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22453236145
BoxiangW pushed a commit to BoxiangW/Megatron-LM that referenced this pull request, Mar 4, 2026:
…VIDIA#3300) Co-authored-by: Xin Yao <xiny@nvidia.com>
What does this PR do?
Fixes the `--use-te-activation-func` CLI flag being silently ignored for non-MoE GPT models by forwarding the parameter through the `_get_transformer_layer_spec()` code path. Fixes: #2770
Problem
The `--use-te-activation-func` flag is correctly parsed from the command line and stored in `TransformerConfig.use_te_activation_func`, but it is never forwarded to `get_gpt_layer_with_transformer_engine_spec()` when building layer specs for non-MoE GPT models.

In `gpt_builders.py`, the function `_get_transformer_layer_spec()` calls `get_gpt_layer_with_transformer_engine_spec()` without passing `use_te_activation_func`, causing it to silently default to `False`. This means TransformerEngine activation functions are never used in non-MoE GPT models regardless of the CLI flag.

Root Cause
The call chain is:

1. `--use-te-activation-func` is parsed into `args` and transferred to `TransformerConfig`
2. `core_transformer_config_from_args(args)` creates a `TransformerConfig` with `use_te_activation_func=True`
3. `_get_transformer_layer_spec(use_te, config)` calls `get_gpt_layer_with_transformer_engine_spec()` without `use_te_activation_func`
4. `get_gpt_layer_with_transformer_engine_spec(..., use_te_activation_func=False)` defaults to `False`
5. `get_mlp_module_spec_for_backend(..., use_te_activation_func=False)` selects PyTorch activation functions instead of TE's fused implementations

Notably, the MoE code path (`get_gpt_decoder_layer_specs` in `gpt_layer_specs.py`) and the experimental attention variant code path (`experimental_attention_variant_module_specs.py`) both correctly forward `use_te_activation_func=config.use_te_activation_func`. Only the non-MoE path in `gpt_builders.py` has this omission.

Fix
Added `use_te_activation_func=config.use_te_activation_func` to the `get_gpt_layer_with_transformer_engine_spec()` call in `_get_transformer_layer_spec()`, consistent with how other code paths already forward this parameter.

Changed Files

- `gpt_builders.py`: Added one line to forward `use_te_activation_func` from the config to the layer spec builder function.
- `tests/unit_tests/models/test_gpt_model.py`: Added regression test `test_get_transformer_layer_spec_forwards_use_te_activation_func` that uses mock patching to verify the parameter is correctly forwarded from config to the downstream spec function.

Diff
```diff
 def _get_transformer_layer_spec(use_te, config):
     args = get_args()
     if use_te:
         return get_gpt_layer_with_transformer_engine_spec(
             args.num_experts,
             args.moe_grouped_gemm,
             args.qk_layernorm,
             args.multi_latent_attention,
             args.experimental_attention_variant,
             moe_use_legacy_grouped_gemm=args.moe_use_legacy_grouped_gemm,
             qk_l2_norm=args.qk_l2_norm,
             use_kitchen=config.use_kitchen,
+            use_te_activation_func=config.use_te_activation_func,
             use_kitchen_attention=config.use_kitchen_attention,
             kitchen_attention_backend=config.kitchen_attention_backend,
         )
```

How to Reproduce the Original Bug
As described in #2770:

1. Launch GPT training with `--use-te-activation-func` and `--transformer-impl transformer_engine`
2. Observe that PyTorch activation functions are still selected, despite `--use-te-activation-func`

After this fix, setting `--use-te-activation-func` correctly enables TransformerEngine activation functions (e.g., TE's fused GELU/SiLU) in the MLP layer spec for non-MoE GPT models.

Testing
Unit Test Added
A regression test `test_get_transformer_layer_spec_forwards_use_te_activation_func` was added to `tests/unit_tests/models/test_gpt_model.py`. The test:

- Mocks `get_args()` and `get_gpt_layer_with_transformer_engine_spec()` to isolate the forwarding behavior
- Builds a config with `use_te_activation_func=True`
- Calls `_get_transformer_layer_spec(use_te=True, config=config)`
- Asserts that `get_gpt_layer_with_transformer_engine_spec` was called with `use_te_activation_func=True`

This test does not require CUDA and directly validates the fix for issue #2770.
Existing Coverage
The existing `test_gpt_with_te_activation_func` in the same file validates the downstream behavior of `get_gpt_layer_with_transformer_engine_spec(use_te_activation_func=True)` end-to-end with actual model construction and a forward pass.

Verification
- The `config` object passed to `_get_transformer_layer_spec()` already contains `use_te_activation_func` (set from CLI args via `core_transformer_config_from_args`)
- `get_gpt_layer_with_transformer_engine_spec()` already accepts `use_te_activation_func` as a keyword argument (`gpt_layer_specs.py:181`)
- Other call sites (MoE at `gpt_layer_specs.py:536,548` and experimental attention at `experimental_attention_variant_module_specs.py:398,448`) forward this parameter in the same way

Contribution process
Pre-checks