Hi, I am just trying the example provided (https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/te_llama/tutorial_accelerate_hf_llama_with_te.html) with the Llama 2 model.
As it is a 7B model, I assume the GPU memory usage for the model should be around 14 GB when using fp16 (which is the default), and around 7 GB for fp8.
However, it still shows a memory usage of about 14 GB (I used model.get_memory_footprint() and nvidia-smi to check the allocated memory).
Also, when I print out the dtype of the hidden states of the layers, it shows bfloat16 (see the sketch below for how I checked).
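For reference, this is roughly how I checked both the footprint and the per-layer dtype. It is a minimal sketch, not the tutorial's exact code: the model name, bf16 load dtype, and the forward hook are my own assumptions on top of standard Hugging Face APIs.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed checkpoint name; the tutorial loads Llama 2 7B in bf16 by default
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

# ~7B params * 2 bytes per param ≈ 14 GB in fp16/bf16
print(f"Footprint: {model.get_memory_footprint() / 1e9:.1f} GB")

# Hook to print the dtype of each decoder layer's output hidden states
def report_dtype(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    print(module.__class__.__name__, hidden.dtype)

for layer in model.model.layers:
    layer.register_forward_hook(report_dtype)

# Dummy forward pass just to trigger the hooks
dummy = torch.randint(0, model.config.vocab_size, (1, 8), device="cuda")
with torch.no_grad():
    model(dummy)
```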
Is this normal, or is something not working correctly on my side?
Please correct me if I am misunderstanding something.
Thanks.