Hi, I am just trying the example provided (https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/te_llama/tutorial_accelerate_hf_llama_with_te.html) with the Llama 2 model.
As it is a 7B model, I assume the GPU memory usage for the model should be around 14 GB when using fp16 (which is the default), and around 7 GB for fp8.
However, it still shows a memory usage of about 14 GB (I used model.get_memory_footprint() and nvidia-smi to check the allocated memory).
Also, when I print out the dtype of the hidden states of the layers, it shows bfloat16 (see the sketch below for how I checked).
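For reference, this is roughly how I checked both the footprint and the per-layer dtype. It is a minimal sketch, not the tutorial's exact code: the model name, bf16 load dtype, and the forward hook are my own assumptions on top of standard Hugging Face APIs.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed checkpoint name; the tutorial loads Llama 2 7B in bf16 by default
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# ~7B params * 2 bytes per param ≈ 14 GB in fp16/bf16
print(f"Footprint: {model.get_memory_footprint() / 1e9:.1f} GB")

# Hook to print the dtype of each decoder layer's output hidden states
def report_dtype(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    print(module.__class__.__name__, hidden.dtype)

for layer in model.model.layers:
    layer.register_forward_hook(report_dtype)

# Dummy forward pass just to trigger the hooks
dummy = torch.randint(0, model.config.vocab_size, (1, 8), device="cuda")
with torch.no_grad():
    model(dummy)
```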
Is this normal, or is something not working correctly on my side?
Please correct me if I am misunderstanding something.
Thanks.