
Conversation


@vitrun vitrun commented May 16, 2023

Added PyTorch backend support for LLaMA, based on #575.

For a simple test, run under examples/pytorch/llama/:

mpirun -n 4 --allow-run-as-root python llama_example.py --tensor_para_size=4 --pipeline_para_size=1  --ckpt_path path_to_weight --tokenizer_path path_to_tokenizer --lib_path path_to_libth_transformer.so --max_batch_size 1 --start_id_file start_ids.csv

It should print:

[Context]
0, 18637, 29892, 526, 366, 1136, 455, 2470, 29973, 1815, 366, 5193, 304, 592, 29973

[Generated]
[18637 29892   526   366  1136   455  2470 29973  1815   366  5193   304
   592 29973 18637 29892   526   366  1136   455  2470 29973  1815   366
  5193   304   592 29973 18637 29892   526   366  1136   455  2470 29973
  1815   366  5193   304   592 29973 18637 29892   526   366  1136]

[Output]
Hey, are you consciours? Can you talk to me? Hey, are you consciours? Can you talk to me? Hey, are you consciours? Can you talk to me? Hey, are you cons

which matches the result of the C++ version.

@vitrun vitrun changed the title Llama torch [Enhancement]add pytorch backend support for llama May 16, 2023
@RomaA2000

Is there any guide on how to create a correct gemm_config.in for the llama parameters?


hepj987 commented May 18, 2023

Hello, may I ask whether this change also supports the C++ version of llama, or only the PyTorch version?


veya2ztn commented Jun 2, 2023

Fantastic! I made a few changes and compared the performance of ft-llama against the Hugging Face implementation. ft-llama gets roughly a 3x speedup on an A100-80G:

  • FT-LLama: generates 10 batches, taking 4.867 secs to generate 470 tokens, 96.571 tokens/sec.
  • Huggingface: generates 10 batches, taking 9.923 secs to generate 340 tokens, 34.265 tokens/sec.

However, the output is different. Can you give it a quick review?

The new llama_example.py

# Copyright (c) 2021-2023, NVIDIA CORPORATION.  All rights reserved.
# Copyright (c) 2021, NAVER Corp.  Authored by CLOVA.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# from __future__ import print_function

from torch.nn.utils.rnn import pad_sequence
import os
import sys
import argparse
import configparser
import timeit
import torch
import torch.distributed as dist
from transformers import AutoTokenizer, AutoModelForCausalLM

dir_path = os.path.dirname(os.path.realpath(__file__))
sys.path.append(dir_path + "/../../..")
from examples.pytorch.llama.utils.llama import Llama

class Config(object):
    def __init__(self, config_dict):
        for key, val in config_dict.items():
            self.__setattr__(key, val)

def get_model_config(args):
    if "-gpu" not in args.ckpt_path:
        return Config({
            "head_num":32,
            "size_per_head":128,
            "inter_size":11008,
            "vocab_size":32000,
            "layer_num":32,
            "rotary_embedding":128,
            "layernorm_eps":1e-6,
            "start_id":1,
            "end_id":2,
            "use_gptj_residual":False,
            "weight_data_type":"fp16",
        })
    config = configparser.ConfigParser()
    config.read(os.path.join(args.ckpt_path, "config.ini"))
    return Config({
        "head_num":int(config.get('llama', 'head_num')),         
        "size_per_head":int(config.get('llama', 'size_per_head')),    
        "inter_size":int(config.get('llama', 'inter_size')),       
        "vocab_size":int(config.get('llama', 'vocab_size')),       
        "layer_num":int(config.get('llama', 'num_layer')),        
        "rotary_embedding":int(config.get('llama', 'rotary_embedding')), 
        "layernorm_eps":float(config.get('llama', 'layernorm_eps')),    
        "start_id":int(config.get('llama', 'start_id')),         
        "end_id":int(config.get('llama', 'end_id')),           
        "use_gptj_residual":False,
        "weight_data_type":config.get('llama', 'weight_data_type'), 
    })

def get_infer_config(args):
    return Config({
        "output_len":args.output_len,                  
        "beam_width":args.beam_width,                  
        "top_k":args.top_k,                       
        "top_p":args.top_p,                       
        "temperature":args.temperature,                 
        "len_penalty":args.len_penalty,                 
        "beam_search_diversity_rate":args.beam_search_diversity_rate,  
        "tensor_para_size":args.tensor_para_size,            
        "pipeline_para_size":args.pipeline_para_size,          
        "max_batch_size":args.max_batch_size,              
        "max_seq_len":args.max_seq_len,                 
        "repetition_penalty":args.repetition_penalty,          
        "inference_data_type":args.inference_data_type,         
    })

def get_system_config(args):
    return Config({
        "ckpt_path":args.ckpt_path,      
        "tokenizer_path":args.tokenizer_path, 
        "lib_path":args.lib_path,       
    })

def get_model(model_config, infer_config,system_config):
    if "-gpu" in system_config.ckpt_path:
        print('load [fastertransformer] model !')
        model  = Llama(model_config.head_num, model_config.size_per_head, model_config.inter_size, model_config.vocab_size, 
                    model_config.rotary_embedding, model_config.layernorm_eps,
                    model_config.start_id, model_config.end_id, model_config.layer_num, 
                    infer_config.max_seq_len, 
                    infer_config.tensor_para_size, 
                    infer_config.pipeline_para_size, 
                    model_config.use_gptj_residual, 
                    system_config.lib_path, 
                    inference_data_type=infer_config.inference_data_type, 
                    weights_data_type=model_config.weight_data_type)

        if not model.load(ckpt_path=system_config.ckpt_path):
            print("[WARNING] Checkpoint file not found. Model loading is skipped.")
    else:
        print('load [hugging face] model !')
        model = AutoModelForCausalLM.from_pretrained(system_config.ckpt_path).cuda()
    return model

def get_inputs_ids(args, tokenizer,device):
    # Inputs
    contexts = []
    if args.start_id_file:
        with open(args.start_id_file, 'r') as f:
            contexts = f.read().splitlines()
            batch_size = min(len(contexts), args.max_batch_size)
        contexts = contexts[:batch_size]
        start_ids = [torch.IntTensor([int(i) for i in c.strip().split(',')]) for c in contexts]
    elif args.sample_input_file:  # conditional case
        with open(args.sample_input_file, "r") as f:
            contexts = f.read().splitlines()
            batch_size = min(len(contexts), args.max_batch_size)
        contexts = contexts[:batch_size]
        start_ids = [torch.tensor(tokenizer.encode(c), dtype=torch.int32, device=device) for c in contexts]
    else:  # unconditional generation is not supported in this example
        raise ValueError("provide either --start_id_file or --sample_input_file")
    return start_ids, contexts

def get_model_result(model, start_ids,random_seed_tensor,infer_config):
        if isinstance(model, Llama):
            start_lengths = torch.IntTensor([len(ids) for ids in start_ids])
            batch_size    = len(start_ids)
            return model(start_ids    =start_ids,
                        start_lengths=start_lengths,
                        output_len=start_lengths + infer_config.output_len,
                        beam_width=infer_config.beam_width,
                        top_k=infer_config.top_k * torch.ones(size=[batch_size], dtype=torch.int32),
                        top_p=infer_config.top_p * torch.ones(size=[batch_size], dtype=torch.float32),
                        beam_search_diversity_rate=infer_config.beam_search_diversity_rate * torch.ones(size=[batch_size], dtype=torch.float32),
                        temperature=infer_config.temperature * torch.ones(size=[batch_size], dtype=torch.float32),
                        len_penalty=infer_config.len_penalty * torch.ones(size=[batch_size], dtype=torch.float32),
                        repetition_penalty=infer_config.repetition_penalty * torch.ones(size=[batch_size], dtype=torch.float32),
                        random_seed=random_seed_tensor,
                        return_output_length=False,
                        return_cum_log_probs=0)
        else:
            output_ids = model.generate(start_ids,max_length=512)
            return output_ids

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--output_len', type=int, default=32,
                        help='output sequence length to generate.')
    parser.add_argument('--beam_width', type=int, default=1,
                        help='beam width for beam search. Using sampling when beam width is 1.')
    parser.add_argument('--top_k', type=int, default=1,
                        help='top k candidate num')
    parser.add_argument('--top_p', type=float, default=0.,
                        help='top p probability threshold')
    parser.add_argument('--temperature', type=float, default=1.,
                        help='temperature')
    parser.add_argument('--len_penalty', type=float, default=0.,
                        help='len_penalty')
    parser.add_argument('--beam_search_diversity_rate', type=float, default=0.,
                        help='beam_search_diversity_rate')
    parser.add_argument('--tensor_para_size', type=int, default=1,
                        help='tensor parallel size')
    parser.add_argument('--pipeline_para_size', type=int, default=1,
                        help='pipeline parallel size')
    parser.add_argument('--ckpt_path', type=str, 
                        help='path to the checkpoint file.')
    parser.add_argument('--tokenizer_path', type=str, 
                        help='directory where the tokenizer file is located.')
    parser.add_argument('--lib_path', type=str, default='./lib/libth_transformer.so',
                        help='path to the pyt_fastertransformer dynamic lib file.')
    parser.add_argument('--sample_input_file', type=str,
                        help='path to the sample input file.')
    parser.add_argument('--start_id_file', type=str,
                        help='path to the start id file.')
    parser.add_argument('--max_batch_size', type=int, default=8,
                        help='max batch size.')
    parser.add_argument('--repetition_penalty', type=float, default=1.,
                        help='repetition penalty')
    parser.add_argument('--max_seq_len', type=int, default=1024,
                        help='max sequence length for position embedding table.')
    parser.add_argument('--inference_data_type', '--data_type', type=str, choices=['fp32', 'fp16'], default='fp16')
    parser.add_argument('--time', action='store_true',
                        help='whether or not to measure time elapsed.')
    parser.add_argument('--enable_random_seed', action='store_true',
                        help='is enable the random seed.')

    args = parser.parse_args()

    
    print("\n=============== Arguments ===============")
    for arg in vars(args):
        print("{}: {}".format(arg, getattr(args, arg)))
    print("=========================================\n")

    model_config = get_model_config(args)

    #### resource_configuration
    system_config= get_system_config(args)

    #### inference configuration
    infer_config = get_infer_config(args)

    #### set the multiprocess group
    if infer_config.tensor_para_size * infer_config.pipeline_para_size > 1:
        dist.init_process_group(backend=dist.Backend.MPI)
    rank         = dist.get_rank() if dist.is_initialized() else 0
    device_count = dist.get_world_size() if dist.is_initialized() else 1
    device       = rank % device_count
    torch.cuda.set_device(device)
    device       = torch.cuda.current_device()

    # sentencepiece needed
    tokenizer = AutoTokenizer.from_pretrained(system_config.tokenizer_path, use_fast=False)

    # get ids
    start_ids, contexts = get_inputs_ids(args, tokenizer,device)
    batch_size= len(start_ids)
    print("[INFO] batch size: {}".format(batch_size))
    start_ids     = pad_sequence(start_ids, batch_first=True, padding_value=model_config.end_id).cuda()
    
    if args.enable_random_seed == True:
        random_seed_tensor = torch.randint(0, 10000, size=[batch_size], dtype=torch.int64)
    else:
        random_seed_tensor = torch.zeros([batch_size], dtype=torch.int64)

    # Prepare model.
    print("building model...............")
    llama = get_model(model_config, infer_config, system_config)
    print("done!")
    with torch.no_grad():
        print(f"[INFO] input size {start_ids.shape}")
        tokens_batch = get_model_result(llama, start_ids,random_seed_tensor,infer_config)
        print(f"[INFO] output size {tokens_batch.shape}")
        if tokens_batch is not None and rank == 0:
            tokens_batch = tokens_batch.cpu().numpy()
            if not isinstance(llama, Llama): tokens_batch = [tokens_batch]
            start_lengths = torch.IntTensor([len(ids) for ids in start_ids])
            for i, (context, tokens) in enumerate(zip(contexts, tokens_batch)):
                for beam_id in range(infer_config.beam_width):
                    token = tokens[beam_id][start_lengths[i]:]  # exclude context input from the output
                    output = tokenizer.decode(token)
                    print(f'[INFO] batch {i}, beam {beam_id}:\n[Context]\n{context}\n\n[Generated]\n{token}\n\n[Output]\n{output}\n')

        # Measure inference time.
        if args.time:
            iterations = 10
            # warmup
            for i in range(iterations):
                tokens_batch = get_model_result(llama, start_ids,random_seed_tensor,infer_config)
            batch_num = 0
            token_num = 0
            time = timeit.default_timer()
            for i in range(iterations):
                tokens_batch = get_model_result(llama, start_ids,random_seed_tensor,infer_config)
                batch_num += 1
                for j, tokens in enumerate(tokens_batch):
                    token_num += tokens.shape[-1] - start_lengths[j]
            time_elapsed = timeit.default_timer() - time
            throughput = token_num / time_elapsed
            print(f"[INFO] FT-LLAMA:{args.ckpt_path}:\n      generates {batch_num} batches, taking {time_elapsed:0.3f} secs "
                  f"to generate {token_num} tokens, {throughput:0.3f} tokens/sec.")


if __name__ == '__main__':
    main()

And use the commands:

python llama_example.py --tensor_para_size=1 --pipeline_para_size=1  --ckpt_path ~/pretrain_weights/vicuna/vicuna-7b-v1.1/ --tokenizer_path ~/pretrain_weights/vicuna/vicuna-7b-v1.1/ --lib_path ~/projects/FasterTransformer2/FasterTransformer/build/lib/libth_transformer.so --max_batch_size 1 --start_id_file start_ids.csv --time

python llama_example.py --tensor_para_size=1 --pipeline_para_size=1  --ckpt_path ~/pretrain_weights/vicuna/vicuna-7b-fastertransformer_fp16/1-gpu/ --tokenizer_path ~/pretrain_weights/vicuna/vicuna-7b-v1.1/ --lib_path ~/projects/FasterTransformer2/FasterTransformer/build/lib/libth_transformer.so --max_batch_size 1 --start_id_file start_ids.csv --time

The result should be:


=============== Arguments ===============
output_len: 32
beam_width: 1
top_k: 1
top_p: 0.0
temperature: 1.0
len_penalty: 0.0
beam_search_diversity_rate: 0.0
tensor_para_size: 1
pipeline_para_size: 1
ckpt_path: /mnt/lustre/zhangtianning/pretrain_weights/vicuna/vicuna-7b-v1.1/
tokenizer_path: /mnt/lustre/zhangtianning/pretrain_weights/vicuna/vicuna-7b-v1.1/
lib_path: /mnt/lustre/zhangtianning/projects/FasterTransformer2/FasterTransformer/build/lib/libth_transformer.so
sample_input_file: None
start_id_file: start_ids.csv
max_batch_size: 1
repetition_penalty: 1.0
max_seq_len: 1024
inference_data_type: fp16
time: True
enable_random_seed: False
=========================================

[INFO] batch size: 1
building model...............
load [hugging face] model !

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:13<00:13, 13.31s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:18<00:00,  8.55s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:18<00:00,  9.27s/it]
done!
[INFO] input size torch.Size([1, 15])
[INFO] output size torch.Size([1, 49])
[INFO] batch 0, beam 0:
[Context]
0, 18637, 29892, 526, 366, 1136, 455, 2470, 29973, 1815, 366, 5193, 304, 592, 29973

[Generated]
[   13 29902 29915 29885  7423 29892   306 29915 29885   451  9985   411
   278  1840   376  3200   455  2470  1213  6527   366  3113  3867   901
  3030   470  5649   825   366   526 16811   304 29973     2]

[Output]
 
I'm sorry, I'm not familiar with the term "consciours." Could you please provide more context or explain what you are referring to?</s>

[INFO] FT-LLAMA:/mnt/lustre/zhangtianning/pretrain_weights/vicuna/vicuna-7b-v1.1/:
      generates 10 batches, taking 9.923 secs to generate 340 tokens, 34.265 tokens/sec.
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.

=============== Arguments ===============
output_len: 32
beam_width: 1
top_k: 1
top_p: 0.0
temperature: 1.0
len_penalty: 0.0
beam_search_diversity_rate: 0.0
tensor_para_size: 1
pipeline_para_size: 1
ckpt_path: /mnt/lustre/zhangtianning/pretrain_weights/vicuna/vicuna-7b-fastertransformer_fp16/1-gpu/
tokenizer_path: /mnt/lustre/zhangtianning/pretrain_weights/vicuna/vicuna-7b-v1.1/
lib_path: /mnt/lustre/zhangtianning/projects/FasterTransformer2/FasterTransformer/build/lib/libth_transformer.so
sample_input_file: None
start_id_file: start_ids.csv
max_batch_size: 1
repetition_penalty: 1.0
max_seq_len: 1024
inference_data_type: fp16
time: True
enable_random_seed: False
=========================================

[INFO] batch size: 1
building model...............
load [fastertransformer] model !
[INFO] WARNING: Have initialized the process group
done!
[INFO] input size torch.Size([1, 15])
[INFO] output size torch.Size([1, 1, 62])
[INFO] batch 0, beam 0:
[Context]
0, 18637, 29892, 526, 366, 1136, 455, 2470, 29973, 1815, 366, 5193, 304, 592, 29973

[Generated]
[   13    13 29930 18637 29892   526   366  1136   455  2470 29973  1815
   366  5193   304   592 29973    13 29930 18637 29892   526   366  1136
   455  2470 29973  1815   366  5193   304   592 29973    13 29930 18637
 29892   526   366  1136   455  2470 29973  1815   366  5193   304]

[Output]


* Hey, are you consciours? Can you talk to me?
* Hey, are you consciours? Can you talk to me?
* Hey, are you consciours? Can you talk to

[INFO] FT-LLAMA:/mnt/lustre/zhangtianning/pretrain_weights/vicuna/vicuna-7b-fastertransformer_fp16/1-gpu/:
      generates 10 batches, taking 4.867 secs to generate 470 tokens, 96.571 tokens/sec.


veya2ztn commented Jun 2, 2023

It seems only max_batch_size=1 works. Any batch input with batch_size>1 fails:

Traceback (most recent call last):
  File "/mnt/petrelfs/zhangtianning/projects/FasterTransformer2/FasterTransformer/examples/pytorch/llama/llama_example.py", line 233, in <module>
    main()
  File "/mnt/petrelfs/zhangtianning/projects/FasterTransformer2/FasterTransformer/examples/pytorch/llama/llama_example.py", line 163, in main
    tokens_batch = llama(
  File "/mnt/cache/zhangtianning/anaconda3/envs/llm2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/petrelfs/zhangtianning/projects/FasterTransformer2/FasterTransformer/examples/pytorch/llama/../../../examples/pytorch/llama/utils/llama.py", line 283, in forward
    outputs = self.model.forward(input_ids,
RuntimeError: forward() Expected a value of type 'int' for argument '_3' but instead found type 'Tensor'.
Position: 3
Value: tensor([47, 47], dtype=torch.int32)
Declaration: forward(__torch__.torch.classes.FasterTransformer.LlamaOp _0, Tensor _1, Tensor _2, int _3, int? _4, Tensor? _5, Tensor? _6, Tensor? _7, Tensor? _8, Tensor? _9, Tensor? _10, Tensor? _11, int? _12) -> (Tensor[] _0)
Cast error details: Unable to cast Python instance to C++ type (compile in debug mode for details)

A fixed int should be used rather than start_lengths:
output_len=256, # instead of start_lengths + infer_config.output_len
This is consistent with the Hugging Face call model.generate(input_ids, max_length=512); a sketch of the change is below.
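A minimal sketch of that change, written as a drop-in replacement for get_model_result in the script above (the name get_model_result_fixed_len is introduced here; every other argument just mirrors the script, so treat it as an untested sketch):

import torch

def get_model_result_fixed_len(model, start_ids, random_seed_tensor, infer_config):
    # the LlamaOp's forward() declares output_len (argument _3 in the error) as a
    # plain int, so pass one fixed integer for the whole batch instead of the
    # per-sample tensor start_lengths + infer_config.output_len
    start_lengths = torch.IntTensor([len(ids) for ids in start_ids])
    batch_size = len(start_ids)
    return model(start_ids=start_ids,
                 start_lengths=start_lengths,
                 output_len=int(infer_config.output_len),  # e.g. 256, same for every sample
                 beam_width=infer_config.beam_width,
                 top_k=infer_config.top_k * torch.ones(size=[batch_size], dtype=torch.int32),
                 top_p=infer_config.top_p * torch.ones(size=[batch_size], dtype=torch.float32),
                 beam_search_diversity_rate=infer_config.beam_search_diversity_rate * torch.ones(size=[batch_size], dtype=torch.float32),
                 temperature=infer_config.temperature * torch.ones(size=[batch_size], dtype=torch.float32),
                 len_penalty=infer_config.len_penalty * torch.ones(size=[batch_size], dtype=torch.float32),
                 repetition_penalty=infer_config.repetition_penalty * torch.ones(size=[batch_size], dtype=torch.float32),
                 random_seed=random_seed_tensor,
                 return_output_length=False,
                 return_cum_log_probs=0)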


77h2l commented Jun 5, 2023

@veya2ztn
Thanks for your work.
The above llama_example.py script seems to have the following bug:

Traceback (most recent call last):
  File "llama_example.py", line 277, in <module>
    main()
  File "llama_example.py", line 225, in main
    tokenizer = AutoTokenizer.from_pretrained(system_config.tokenizer_path, use_fast=False)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 690, in from_pretrained
    raise ValueError(
ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.


veya2ztn commented Jun 6, 2023

@77h2l Upgrade your transformers package; see the from transformers import AutoTokenizer line in the script.
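For reference, a hedged sketch of a more defensive tokenizer load (load_llama_tokenizer is a name introduced here for illustration; the explicit LlamaTokenizer class needs a reasonably recent transformers release, roughly 4.28 or newer):

from transformers import AutoTokenizer, LlamaTokenizer

def load_llama_tokenizer(tokenizer_path):
    try:
        # same call as in llama_example.py
        return AutoTokenizer.from_pretrained(tokenizer_path, use_fast=False)
    except ValueError:
        # older tokenizer_config.json files declare "LLaMATokenizer", which
        # AutoTokenizer cannot resolve; fall back to the slow LLaMA tokenizer
        return LlamaTokenizer.from_pretrained(tokenizer_path)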


77h2l commented Jun 7, 2023

@veya2ztn Thanks for your reply. Your script does work, but when the batch size is set greater than 1 the error occurs. Setting output_len to a fixed value avoids the problem, but then every output has the same fixed length and needs a lot of padding, which is counterintuitive, isn't it? The output should end when it meets EOS.
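In the meantime, a minimal sketch (not part of this PR) of one way to live with the fixed output length: cut each returned sequence at the first end_id before decoding, so the padding after EOS is dropped (end_id is 2 for llama in this example):

def trim_at_eos(token_ids, end_id=2):
    """Return the generated ids up to, but not including, the first end_id."""
    ids = token_ids.tolist() if hasattr(token_ids, "tolist") else list(token_ids)
    return ids[:ids.index(end_id)] if end_id in ids else ids

# usage inside the decode loop of llama_example.py:
#   token = trim_at_eos(tokens[beam_id][start_lengths[i]:], model_config.end_id)
#   output = tokenizer.decode(token)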


chailt commented Jun 8, 2023

May I ask whether it supports a single process with multiple GPUs? I also share the confusion about every output having a fixed length.


veya2ztn commented Jun 8, 2023

The multi-GPU example also fails for me.


alanxmay commented Jun 8, 2023

@veya2ztn I have a possibly silly question: how did you compile it?

First, I built PyTorch 2.0.1 with the MPI backend from source.

Then I followed the guide docs/decoder_guide.md to compile FT without Docker, ran the example, and got this error:

CUDA Error: (null) .../FasterTransformer/3rdparty/trt_fused_multihead_attention/fused_multihead_attention.h 345

I came across issue #177; did you also compile with Docker?

@veya2ztn


I don't compile in Docker. If you want to compile an MPI backend from source, try:
torch 11.2 + CUDA 11.3 + gcc 7.5

I compiled successfully under this configuration, but it seems that MPI for CUDA is not available. Let me know if you succeed.

@alanxmay

@veya2ztn I successfully compiled it in Docker and achieved performance similar to what you reported, thanks a lot!

@BasicCoder

Thank you for your great work, but when I test the 13B, 30B, and 60B models, the following error occurs:

[FT][ERROR] CUDA runtime error: an illegal memory access was encountered /workspace/Llama-FT/FasterTransformer/src/fastertransformer/utils/allocator.h:462

You can reproduce the problem with the following commands:

convert model:
python ../examples/cpp/llama/huggingface_llama_convert.py -saved_dir=./llama-13b-hf/c-model -in_file=./llama-13b-hf -infer_gpu_num=2 -weight_data_type=fp16 -model_name=llama_13b

run model:
export CUDA_LAUNCH_BLOCKING=1
mpirun -n 2 --allow-run-as-root python ../examples/pytorch/llama/llama_example.py --tensor_para_size=2 --pipeline_para_size=1 --ckpt_path ./llama-13b-hf/c-model/2-gpu --tokenizer_path ./llama-13b-hf --lib_path ./lib/libth_transformer.so --max_batch_size 4 --inference_data_type fp16 --output_len 170 --time --start_id_file ../examples/pytorch/llama/start_ids.csv


veya2ztn commented Jun 20, 2023 via email

@BasicCoder

I don't think it's OOM. In this test I am using 2x A100 80G with tensor parallelism. In the same environment and tensor-parallel configuration, the 7B model gives normal results, but larger model sizes cause errors.
With export CUDA_LAUNCH_BLOCKING=1, the reported error line is:
check_cuda_error(cudaMemset(ptr, val, size));
so I think it's a CUDA error.

@sleepcoo

It is consistent with the Hugging Face model: model.generate(input_ids, max_length=512).

May I ask why it is written as a fixed value here?


illfg commented Jun 27, 2023

@BasicCoder Have you solved the problem?

@BasicCoder

No, I haven't resolved it yet.


illfg commented Jun 27, 2023

This is probably caused by PyTorch. I solved the problem by rebuilding PyTorch with MPI support from source.
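For reference, a rough sketch of that kind of rebuild, assuming OpenMPI and CUDA are already installed on the system (USE_MPI and USE_DISTRIBUTED are the standard PyTorch build switches, but please verify them against your PyTorch version):

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
# enable MPI so torch.distributed can use the MPI backend required by the example
USE_MPI=1 USE_DISTRIBUTED=1 python setup.py install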

@BasicCoder

Good news! Can you share more information about your execution environment (PyTorch/MPI/CUDA versions)?


illfg commented Jun 28, 2023


host:
driver 525.116.03
cuda 12.0

docker container:
driver 525.116.03
cuda 11.7
openmpi 4.1.2
cudnn 11.x
magma 117
pytorch built from source

@sleepwalker2017

Hello, what role does MPI play in the CLI? Why does the PyTorch example need MPI?


jcao-ai commented Jul 6, 2023

Hi @vitrun, the upstream branch by @void-main now supports int8 inference. Would you consider supporting it as well?


savemuri commented Jul 7, 2023

@veya2ztn Any luck getting batch_size>1 working? I hit a runtime error when I use an int as the output length, and when I use output_len=start_lengths + output_len it throws: Cast error details: Unable to cast Python instance to C++ type (compile in debug mode for details)


savemuri commented Jul 7, 2023

Update: Got it working with batches by lowering output_len to 512.

=============== Arguments ===============
output_len: 512
beam_width: 1
top_k: 1
top_p: 0.95
temperature: 0.8
len_penalty: 0.0
beam_search_diversity_rate: 0.0
tensor_para_size: 1
pipeline_para_size: 1
ckpt_path: /workspace/model/llama/1-gpu
tokenizer_path: /workspace/tokenizer
lib_path: /app/FasterTransformer/build/lib/libth_transformer.so
sample_input_file: /app/prompts.txt
start_id_file: None
max_batch_size: 12
repetition_penalty: 1.1
max_seq_len: 1024
inference_data_type: fp16
time: False
enable_random_seed: False
=========================================

and by using output_len=output_len instead of output_len=start_lengths + output_len.

@Louis-y-nlp

@vitrun Thanks for your work. I'm trying to use your code to run the Llama 2 13B model on V100-32G GPUs. 1-gpu works well for me, but when I try 2-gpu I get this error:

[INFO] batch size: 1
[INFO] batch size: 1
[INFO] WARNING: Have initialized the process group
[INFO] WARNING: Have initialized the process group
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][INFO] NCCL initialized rank=1 world_size=2 tensor_para=NcclParam[rank=1, world_size=2, nccl_comm=0x5631dea2d0a0] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0x5631dea80c70]
[FT][INFO] NCCL initialized rank=0 world_size=2 tensor_para=NcclParam[rank=0, world_size=2, nccl_comm=0x5592093795d0] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0x5592093cd030]
[FT][ERROR] CUDA runtime error: an illegal memory access was encountered /mnt/work/llama_ft_vitrun/FasterTransformer/src/fastertransformer/utils/allocator.h:462
[FT][ERROR] CUDA runtime error: an illegal memory access was encountered /mnt/work/llama_ft_vitrun/FasterTransformer/src/fastertransformer/utils/allocator.h:462
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[58456,1],0]
  Exit code:    255
--------------------------------------------------------------------------

Here is my command line:

mpirun -n 2 --allow-run-as-root python llama_example.py \
	--ckpt_path ${ckpt_path} \
	--lib_path ${lib_path} \
	--tokenizer_path ${tokenizer_name_or_path} \
	--tensor_para_size=2 --pipeline_para_size=1 --max_batch_size 1 --start_id_file start_ids.csv

Any help would be appreciated.

@vitrun vitrun closed this Jul 30, 2023
@sleepwalker2017

@BasicCoder Hello, have you solved this problem (the illegal memory access error)?
