Using FP8 to train a 1B model on H800 resulted in significantly lower throughput than FP16. Examining the PyTorch profiler output shows a large performance gap in the Linear layers. What is the reason for this performance gap?
Setup: MSAMP + TransformerEngine
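
A minimal standalone benchmark sketch that may help isolate the gap: it times a plain FP16 `nn.Linear` against a TransformerEngine `te.Linear` under `fp8_autocast` for a hidden size typical of a ~1B model. The shapes (`tokens=4096`, `hidden=2048`), iteration counts, and recipe are assumptions for illustration, not taken from the actual training setup above.

```python
import torch
import torch.nn as nn
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

def bench(fn, iters=100, warmup=20):
    # Warm up, then time with CUDA events to avoid host-side noise.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per iteration

tokens, hidden = 4096, 2048  # assumed shapes for a ~1B model
x = torch.randn(tokens, hidden, device="cuda", dtype=torch.float16)

fp16_linear = nn.Linear(hidden, 4 * hidden, device="cuda", dtype=torch.float16)
fp8_linear = te.Linear(hidden, 4 * hidden, params_dtype=torch.float16).cuda()
recipe = DelayedScaling(fp8_format=Format.HYBRID)  # E4M3 forward, E5M2 backward

def run_fp16():
    fp16_linear(x)

def run_fp8():
    with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        fp8_linear(x)

print(f"FP16 nn.Linear : {bench(run_fp16):.3f} ms")
print(f"FP8  te.Linear : {bench(run_fp8):.3f} ms")
```

If the FP8 path is slower at these sizes, comparing the profiler traces of the two runs (e.g. under `torch.profiler.profile`) should show whether the time goes into the GEMM itself or into the surrounding cast/amax-scaling kernels.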
