
Shouldn't the memory consumption drop when using fp8? #1261

@JayC1208

Description


Hi, I am trying the example provided at https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/te_llama/tutorial_accelerate_hf_llama_with_te.html with the Llama 2 model.

Since it is a 7B model, I assume the GPU memory usage for the model should be around 14 GB when using fp16 (which is the default), and around 7 GB for fp8.
However, it still shows a memory usage of 14 GB (I used model.get_memory_footprint() and nvidia-smi to check the allocated memory).
Also, when I print out the dtype of the layers' hidden states, it shows bfloat16.
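
For reference, this is roughly how I am checking the memory and dtypes (a minimal sketch; the model path and the bf16 load follow the tutorial's setup and may differ slightly from my actual script):

```python
import torch
from transformers import AutoModelForCausalLM

# Load Llama 2 7B as in the tutorial (model path assumed here).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
).cuda()

# Both of these report roughly 14 GB for the 7B model in bf16
# (about 2 bytes per parameter).
print(f"footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.1f} GB")

# The parameter dtype (and the hidden states printed inside the layers)
# shows up as torch.bfloat16.
print(next(model.parameters()).dtype)
```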

Is this normal, or is something not working correctly on my side?
Please correct me if I have misunderstood something.

Thanks.
