[Enhancement] add pytorch backend support for llama #611
Conversation
Is there any guide on how to create a correct gemm_config.in for the llama parameters?
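The thread does not answer this, but as a hedged sketch of what such a tuning step needs: the GEMM shapes are determined by the model geometry, listed below for llama-7b. The values are assumed from the public llama-7b config, not from this PR, and the exact tuning binary and its argument order should be checked against the FasterTransformer docs.

```python
# Assumed llama-7b geometry (not taken from this PR); any GEMM autotuning step
# needs these shapes, plus the batch size and tensor-parallel degree you serve with.
llama_7b_gemm_shapes = {
    "head_num": 32,         # attention heads
    "size_per_head": 128,   # hidden_size (4096) / head_num
    "inter_size": 11008,    # FFN intermediate size
    "vocab_size": 32000,    # tokenizer vocabulary size
    "tensor_para_size": 1,  # number of GPUs the weights are split across
    "max_batch_size": 1,    # tune for the batch size you actually run
}
print(" ".join(f"{k}={v}" for k, v in llama_7b_gemm_shapes.items()))
```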
Hello, may I ask whether this change supports the C++ version of llama, or only the PyTorch version?
Fantastic! I changed it a bit and tested the performance of ft-llama against the huggingface implementation. It shows that ft-llama gets roughly a 3x speedup on an A100-80G. However, the output seems different. Can you give a quick review? (The new code, the command to use, and the expected result were attached as collapsed details.)
It seems only the … Should use a fixed …
@veya2ztn Traceback (most recent call last): …
Upgrade your …
@veya2ztn
May I ask whether it supports single-process multi-GPU execution? I also have the same confusion: the length of each output is fixed.
It also fails for the multi-GPU example.
@veya2ztn I have some probably stupid questions about how to compile it. First, I built PyTorch 2.0.1 with the MPI backend from source. Then I followed the guide and ran into: CUDA Error: (null) .../FasterTransformer/3rdparty/trt_fused_multihead_attention/fused_multihead_attention.h 345. I came across issue #177; did you also compile with docker?
I don't compile in docker. If you want to compile PyTorch with an MPI backend from source, I compiled successfully under this configuration. But it seems that the …
@veya2ztn I successfully compiled in docker and achieved performance similar to what you mentioned, thanks a lot!
Thank you for your great work, but when I test the 13B, 30B, and 60B models, the following error occurs:
[FT][ERROR] CUDA runtime error: an illegal memory access was encountered /workspace/Llama-FT/FasterTransformer/src/fastertransformer/utils/allocator.h:462
You can reproduce the problem with the following commands.
Convert the model:
python ../examples/cpp/llama/huggingface_llama_convert.py -saved_dir=./llama-13b-hf/c-model -in_file=./llama-13b-hf -infer_gpu_num=2 -weight_data_type=fp16 -model_name=llama_13b
Run the model:
export CUDA_LAUNCH_BLOCKING=1
mpirun -n 2 --allow-run-as-root python ../examples/pytorch/llama/llama_example.py --tensor_para_size=2 --pipeline_para_size=1 --ckpt_path ./llama-13b-hf/c-model/2-gpu --tokenizer_path ./llama-13b-hf --lib_path ./lib/libth_transformer.so --max_batch_size 4 --inference_data_type fp16 --output_len 170 --time --start_id_file ../examples/pytorch/llama/start_ids.csv
Usually, this means OOM (out of memory).
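For a rough sense of scale, here is an illustrative estimate only; the 13e9 parameter count is an assumption from the model name, though fp16 weights match the conversion command above.

```python
# Back-of-the-envelope: fp16 weights of a 13B-parameter model split across
# two tensor-parallel GPUs (matching -infer_gpu_num=2 above).
params = 13e9                 # approximate parameter count of llama-13b
bytes_per_param = 2           # fp16
tensor_para_size = 2

weight_gib = params * bytes_per_param / tensor_para_size / 1024**3
print(f"~{weight_gib:.1f} GiB of weights per GPU")   # ~12.1 GiB
# Activations, workspace buffers, and the KV cache (which grows with batch size
# and output_len) come on top of this.
```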
I don't think it's OOM. In this test I am using 2x A100 80G with tensor parallelism. In the same environment and tensor-parallel configuration, the 7B model produces normal results, but larger model sizes cause this error.
May I ask why it is written as a fixed value here?
Have you solved the problem?
No, I haven't resolved it yet.
This is probably caused by pytorch. I solved this problem by rebuilding pytorch with MPI from source.
Good news! Can you share more information about your execution environment: pytorch/mpi/cuda versions?
host: …
docker container: …
Hello, what role does MPI play in the CLI? Why does the pytorch example need MPI?
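For background, a minimal sketch (assuming mpi4py and torch are installed; this is not code from the PR): mpirun launches one Python process per rank, and each rank uses its MPI rank to bind to a GPU and act as one tensor/pipeline-parallel worker, which is why the multi-GPU pytorch example is started under mpirun.

```python
# Illustrative only: how an mpirun-launched script maps MPI ranks to GPUs.
# Run with: mpirun -n 2 python rank_demo.py   (rank_demo.py is a hypothetical name)
from mpi4py import MPI
import torch

comm = MPI.COMM_WORLD
rank = comm.Get_rank()        # this process's rank, e.g. its tensor-parallel index
world_size = comm.Get_size()  # total ranks = tensor_para_size * pipeline_para_size

torch.cuda.set_device(rank % torch.cuda.device_count())  # one GPU per rank
print(f"rank {rank}/{world_size} bound to GPU {torch.cuda.current_device()}")
```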
Hi @vitrun, the upstream branch by @void-main now supports int8 inference. Would you consider supporting it as well?
@veya2ztn Any luck with getting batch_size > 1 working? I hit a runtime error when I use an int as the output length and when I use …
Update: got it working with batches by lowering output_len to 512 and using …
@vitrun Thanks for your work. I'm trying to use your code to run the llama 2 13b model on my V100-32G. 1-GPU works well for me, but when I try 2-GPU, I get this error:
[INFO] batch size: 1
[INFO] batch size: 1
[INFO] WARNING: Have initialized the process group
[INFO] WARNING: Have initialized the process group
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][INFO] NCCL initialized rank=1 world_size=2 tensor_para=NcclParam[rank=1, world_size=2, nccl_comm=0x5631dea2d0a0] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0x5631dea80c70]
[FT][INFO] NCCL initialized rank=0 world_size=2 tensor_para=NcclParam[rank=0, world_size=2, nccl_comm=0x5592093795d0] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0x5592093cd030]
[FT][ERROR] CUDA runtime error: an illegal memory access was encountered /mnt/work/llama_ft_vitrun/FasterTransformer/src/fastertransformer/utils/allocator.h:462
[FT][ERROR] CUDA runtime error: an illegal memory access was encountered /mnt/work/llama_ft_vitrun/FasterTransformer/src/fastertransformer/utils/allocator.h:462
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[58456,1],0]
Exit code: 255
--------------------------------------------------------------------------
Here is my command line:
mpirun -n 2 --allow-run-as-root python llama_example.py \
--ckpt_path ${ckpt_path} \
--lib_path ${lib_path} \
--tokenizer_path ${tokenizer_name_or_path} \
--tensor_para_size=2 --pipeline_para_size=1 --max_batch_size 1 --start_id_file start_ids.csv
Any help would be appreciated.
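Not an answer from the thread, but a hedged sanity check that can help localize this kind of multi-GPU failure before digging into the FasterTransformer code: confirm that both devices are visible to each rank, report the expected memory, and (optionally) support peer access.

```python
# Hedged sanity check (not from the PR): list visible GPUs and peer-access status.
import torch

n = torch.cuda.device_count()
print("visible GPUs:", n)
for i in range(n):
    props = torch.cuda.get_device_properties(i)
    print(f"  GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB")
for i in range(n):
    for j in range(n):
        if i != j:
            print(f"peer access {i} -> {j}:", torch.cuda.can_device_access_peer(i, j))
```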
Hello, have you solved this problem (the illegal memory access error)?
Added pytorch backend support for llama, based on #575.
For a simple test, run the example under examples/pytorch/llama/; the printed result should match that of the cpp version.