Describe the bug
The --use-te-activation-func command-line flag is correctly parsed, but it is not propagated to the layer spec builder for non-MoE GPT models. As a result, the flag is silently ignored and Transformer Engine’s activation function is never enabled.
Steps/Code to reproduce bug
- Launch training on a non-MoE GPT model
- Enable Transformer Engine and set --use-te-activation-func
- Inspect the logs or the execution path of the MLP activation
- Observe that PyTorch GELU is used instead of the TE activation
Example Command
```bash
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
export MASTER_ADDR=localhost
export MASTER_PORT=6105
torchrun --nnodes=1 --nproc-per-node=1 pretrain_gpt.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    --expert-model-parallel-size 1 \
    --train-samples 200 \
    --tokenizer-type GPT2BPETokenizer \
    --split 1000,0,0 \
    --eval-iters 0 \
    --use-cpu-initialization \
    --num-layers 12 \
    --hidden-size 256 \
    --num-attention-heads 4 \
    --max-position-embeddings 256 \
    --seq-length 256 \
    --micro-batch-size 2 \
    --global-batch-size 2 \
    --lr 0.0001 \
    --distributed-backend nccl \
    --seed 42 \
    --no-bias-gelu-fusion \
    --use-te-activation-func \
    --data-path <your-data-path> \
    --vocab-file <your-vocab-file-path> \
    --merge-file <your-merge-file-path>
```
Expected behavior
When --use-te-activation-func is enabled, the model should use Transformer Engine’s activation function. When the flag is removed, the model should fall back to PyTorch’s GELU.
Therefore, running the example command with and without --use-te-activation-func is expected to produce small numerical differences. However, the two runs produce identical results.
No warning or error is emitted.
Additional context
A likely cause is that, in gpt_builders.py, the _get_transformer_layer_spec() function calls get_gpt_layer_with_transformer_engine_spec() without forwarding the use_te_activation_func argument, so the argument silently falls back to its default of False.
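The suspected failure mode can be illustrated with a self-contained sketch. Only the two function names above come from the report; the stub bodies, the buggy/fixed wrappers, and the Args class are hypothetical stand-ins for the real Megatron code:

```python
# Minimal sketch of the suspected flag-propagation bug: a keyword argument
# with a False default is silently dropped when the caller never forwards it.

def get_gpt_layer_with_transformer_engine_spec(use_te_activation_func=False):
    """Stub: report which activation implementation the spec would select."""
    return "te_activation" if use_te_activation_func else "pytorch_gelu"

def buggy_get_transformer_layer_spec(args):
    # Bug: args.use_te_activation_func is parsed but never forwarded,
    # so the spec silently falls back to the PyTorch GELU default.
    return get_gpt_layer_with_transformer_engine_spec()

def fixed_get_transformer_layer_spec(args):
    # Fix: forward the parsed flag into the spec builder.
    return get_gpt_layer_with_transformer_engine_spec(
        use_te_activation_func=args.use_te_activation_func
    )

class Args:
    use_te_activation_func = True  # what --use-te-activation-func sets

print(buggy_get_transformer_layer_spec(Args()))  # pytorch_gelu, despite the flag
print(fixed_get_transformer_layer_spec(Args()))  # te_activation
```

This also matches the observed symptom: with the buggy call path, runs with and without the flag take the identical default branch, so their outputs are bit-identical.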