Training the 1B model on H800 resulted in a decrease in throughput #836

@forevergj

Description

Using FP8 to train a 1B model on an H800 results in significantly lower throughput than FP16. Examining the PyTorch profiler trace shows a large performance gap in the Linear layers. What is the reason for this performance gap?
Setup: MS-AMP + TransformerEngine
[screenshot: PyTorch profiler output]
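
A minimal sketch of how the Linear comparison could be isolated and profiled, assuming TransformerEngine's `te.Linear` under `fp8_autocast` versus a plain `torch.nn.Linear` in FP16. The MS-AMP side and the 1B-model training loop are omitted; the shapes, recipe settings, and iteration counts below are illustrative placeholders, not the original configuration:

```python
# Minimal sketch (not the original benchmark): profile a TransformerEngine FP8
# Linear against a plain FP16 nn.Linear on one GPU.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

def run_step(layer, x, use_fp8=False, fp8_recipe=None):
    # One forward/backward pass; FP8 runs wrap the forward in fp8_autocast.
    if use_fp8:
        with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
            out = layer(x)
    else:
        out = layer(x)
    out.sum().backward()

def profile_linear(layer, x, label, **kwargs):
    for _ in range(10):  # warm-up (cuBLAS/TE autotuning, FP8 amax history)
        run_step(layer, x, **kwargs)
    torch.cuda.synchronize()
    with torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CPU,
                        torch.profiler.ProfilerActivity.CUDA]) as prof:
        for _ in range(20):
            run_step(layer, x, **kwargs)
        torch.cuda.synchronize()
    print(label)
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

device = "cuda"
# FP8 GEMMs need the token dimension divisible by 8 and feature dims divisible by 16.
x = torch.randn(8192, 4096, device=device, dtype=torch.float16, requires_grad=True)

fp16_linear = torch.nn.Linear(4096, 4096, device=device, dtype=torch.float16)
profile_linear(fp16_linear, x, "FP16 nn.Linear")

fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)
fp8_linear = te.Linear(4096, 4096, params_dtype=torch.float16, device=device)
profile_linear(fp8_linear, x, "FP8 te.Linear", use_fp8=True, fp8_recipe=fp8_recipe)
```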
