--use-te-activation-func Flag Ignored for Non-MoE GPT Models #2770

@vineash

Description

Describe the bug

The --use-te-activation-func command-line flag is correctly parsed, but it is not propagated to the layer spec builder for non-MoE GPT models. As a result, the flag is silently ignored and Transformer Engine’s activation function is never enabled.

Steps/Code to reproduce bug

  1. Launch training on a non-MoE GPT model

  2. Enable Transformer Engine and set --use-te-activation-func

  3. Inspect the logs or the execution path of the MLP activation

  4. Observe that PyTorch GELU is used instead of TE activation

Example Command

#!/bin/bash

export CUDA_DEVICE_MAX_CONNECTIONS=1
export MASTER_ADDR=localhost
export MASTER_PORT=6105

torchrun --nnodes=1 --nproc-per-node=1 pretrain_gpt.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    --expert-model-parallel-size 1 \
    --train-samples 200 \
    --tokenizer-type GPT2BPETokenizer \
    --split 1000,0,0 \
    --eval-iters 0 \
    --use-cpu-initialization \
    --num-layers 12 \
    --hidden-size 256 \
    --num-attention-heads 4 \
    --max-position-embeddings 256 \
    --seq-length 256 \
    --micro-batch-size 2 \
    --global-batch-size 2 \
    --lr 0.0001 \
    --distributed-backend nccl \
    --seed 42 \
    --no-bias-gelu-fusion \
    --use-te-activation-func \
    --data-path <your-data-path> \
    --vocab-file <your-vocab-file-path> \
    --merge-file <your-merge-file-path>

Expected behavior

When --use-te-activation-func is enabled, the model should use Transformer Engine’s activation function. When the flag is removed, the model should fall back to PyTorch’s GELU.

Therefore, running the example command with and without --use-te-activation-func is expected to produce small numerical differences. However, the two runs produce identical results.

No warning or error is emitted.

Additional context

A possible cause is that in gpt_builders.py, the _get_transformer_layer_spec() function calls get_gpt_layer_with_transformer_engine_spec() without forwarding the use_te_activation_func argument, causing it to default to False.
