Using FP8 to train a 1B model on H800 resulted in significantly lower throughput than FP16. Examining the PyTorch profiler output shows a large performance gap in the Linear layers. What is the reason for this performance gap?
Setup: MSAMP + TransformerEngine
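
A minimal standalone benchmark sketch that may help isolate the gap: it times a plain FP16 `nn.Linear` against a TransformerEngine `te.Linear` under `fp8_autocast` for a hidden size typical of a ~1B model. The shapes (`tokens=4096`, `hidden=2048`), iteration counts, and recipe are assumptions for illustration, not taken from the actual training setup above.

```python
import torch
import torch.nn as nn
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

def bench(fn, iters=100, warmup=20):
    # Warm up, then time with CUDA events to avoid host-side noise.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per iteration

tokens, hidden = 4096, 2048  # assumed shapes for a ~1B model
x = torch.randn(tokens, hidden, device="cuda", dtype=torch.float16)

fp16_linear = nn.Linear(hidden, 4 * hidden, device="cuda", dtype=torch.float16)
fp8_linear = te.Linear(hidden, 4 * hidden, params_dtype=torch.float16).cuda()
recipe = DelayedScaling(fp8_format=Format.HYBRID)  # E4M3 forward, E5M2 backward

def run_fp16():
    fp16_linear(x)

def run_fp8():
    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        fp8_linear(x)

print(f"FP16 nn.Linear : {bench(run_fp16):.3f} ms")
print(f"FP8  te.Linear : {bench(run_fp8):.3f} ms")
```

If the FP8 path is slower at these sizes, comparing the profiler traces of the two runs (e.g. under `torch.profiler.profile`) should show whether the time goes into the GEMM itself or into the surrounding cast/amax-scaling kernels.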
