
Conversation


@vitrun vitrun commented May 16, 2023

Added PyTorch backend support for LLaMA, based on #575.

For a simple test, run under examples/pytorch/llama/:

mpirun -n 4 --allow-run-as-root python llama_example.py --tensor_para_size=4 --pipeline_para_size=1  --ckpt_path path_to_weight --tokenizer_path path_to_tokenizer --lib_path path_to_libth_transformer.so --max_batch_size 1 --start_id_file start_ids.csv

It should print:

[Context]
0, 18637, 29892, 526, 366, 1136, 455, 2470, 29973, 1815, 366, 5193, 304, 592, 29973

[Generated]
[18637 29892   526   366  1136   455  2470 29973  1815   366  5193   304
   592 29973 18637 29892   526   366  1136   455  2470 29973  1815   366
  5193   304   592 29973 18637 29892   526   366  1136   455  2470 29973
  1815   366  5193   304   592 29973 18637 29892   526   366  1136]

[Output]
Hey, are you consciours? Can you talk to me? Hey, are you consciours? Can you talk to me? Hey, are you consciours? Can you talk to me? Hey, are you cons

which matches the result of the C++ version.

@vitrun vitrun changed the title Llama torch [Enhancement]add pytorch backend support for llama May 16, 2023
@RomaA2000

Is there any guide on how to create a correct gemm_config.in for the llama parameters?


hepj987 commented May 18, 2023

Hello, may I ask whether this change also supports the C++ version of llama, or only the PyTorch version?


veya2ztn commented Jun 2, 2023

Fantastic! I made a few changes and compared the performance of ft-llama against the Hugging Face implementation. ft-llama gets roughly a 3x speedup on an A100-80G:

  • FT-LLama: generates 10 batches, taking 4.867 secs to generate 470 tokens, 96.571 tokens/sec.
  • Huggingface: generates 10 batches, taking 9.923 secs to generate 340 tokens, 34.265 tokens/sec.

However, the output is different. Can you give it a quick review?

The new llama_example.py

# Copyright (c) 2021-2023, NVIDIA CORPORATION.  All rights reserved.
# Copyright (c) 2021, NAVER Corp.  Authored by CLOVA.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# from __future__ import print_function

from torch.nn.utils.rnn import pad_sequence
import os
import sys
import argparse
import configparser
import timeit
import torch
import torch.distributed as dist
from transformers import AutoTokenizer, AutoModelForCausalLM

dir_path = os.path.dirname(os.path.realpath(__file__))
sys.path.append(dir_path + "/../../..")
from examples.pytorch.llama.utils.llama import Llama

class Config(object):
    def __init__(self, config_dict):
        for key, val in config_dict.items():
            self.__setattr__(key, val)

def get_model_config(args):
    if "-gpu" not in args.ckpt_path:
        return Config({
            "head_num":32,
            "size_per_head":128,
            "inter_size":11008,
            "vocab_size":32000,
            "layer_num":32,
            "rotary_embedding":128,
            "layernorm_eps":1e-6,
            "start_id":1,
            "end_id":2,
            "use_gptj_residual":False,
            "weight_data_type":"fp16",
        })
    config = configparser.ConfigParser()
    config.read(os.path.join(args.ckpt_path, "config.ini"))
    return Config({
        "head_num":int(config.get('llama', 'head_num')),         
        "size_per_head":int(config.get('llama', 'size_per_head')),    
        "inter_size":int(config.get('llama', 'inter_size')),       
        "vocab_size":int(config.get('llama', 'vocab_size')),       
        "layer_num":int(config.get('llama', 'num_layer')),        
        "rotary_embedding":int(config.get('llama', 'rotary_embedding')), 
        "layernorm_eps":float(config.get('llama', 'layernorm_eps')),    
        "start_id":int(config.get('llama', 'start_id')),         
        "end_id":int(config.get('llama', 'end_id')),           
        "use_gptj_residual":False,
        "weight_data_type":config.get('llama', 'weight_data_type'), 
    })

def get_infer_config(args):
    return Config({
        "output_len":args.output_len,                  
        "beam_width":args.beam_width,                  
        "top_k":args.top_k,                       
        "top_p":args.top_p,                       
        "temperature":args.temperature,                 
        "len_penalty":args.len_penalty,                 
        "beam_search_diversity_rate":args.beam_search_diversity_rate,  
        "tensor_para_size":args.tensor_para_size,            
        "pipeline_para_size":args.pipeline_para_size,          
        "max_batch_size":args.max_batch_size,              
        "max_seq_len":args.max_seq_len,                 
        "repetition_penalty":args.repetition_penalty,          
        "inference_data_type":args.inference_data_type,         
    })

def get_system_config(args):
    return Config({
        "ckpt_path":args.ckpt_path,      
        "tokenizer_path":args.tokenizer_path, 
        "lib_path":args.lib_path,       
    })

def get_model(model_config, infer_config,system_config):
    if "-gpu" in system_config.ckpt_path:
        print('load [fastertransformer] model !')
        model  = Llama(model_config.head_num, model_config.size_per_head, model_config.inter_size, model_config.vocab_size, 
                    model_config.rotary_embedding, model_config.layernorm_eps,
                    model_config.start_id, model_config.end_id, model_config.layer_num, 
                    infer_config.max_seq_len, 
                    infer_config.tensor_para_size, 
                    infer_config.pipeline_para_size, 
                    model_config.use_gptj_residual, 
                    system_config.lib_path, 
                    inference_data_type=infer_config.inference_data_type, 
                    weights_data_type=model_config.weight_data_type)

        if not model.load(ckpt_path=system_config.ckpt_path):
            print("[WARNING] Checkpoint file not found. Model loading is skipped.")
    else:
        print('load [hugging face] model !')
        model = AutoModelForCausalLM.from_pretrained(system_config.ckpt_path).cuda()
    return model

def get_inputs_ids(args, tokenizer,device):
    # Inputs
    contexts = []
    if args.start_id_file:
        with open(args.start_id_file, 'r') as f:
            contexts = f.read().splitlines()
            batch_size = min(len(contexts), args.max_batch_size)
        contexts = contexts[:batch_size]
        start_ids = [torch.IntTensor([int(i) for i in c.strip().split(',')]) for c in contexts]
    elif args.sample_input_file:  # conditional case
        with open(args.sample_input_file, "r") as f:
            contexts = f.read().splitlines()
            batch_size = min(len(contexts), args.max_batch_size)
        contexts = contexts[:batch_size]
        start_ids = [torch.tensor(tokenizer.encode(c), dtype=torch.int32, device=device) for c in contexts]
    else:  # unconditional generation is not supported in this example
        raise ValueError("provide either --start_id_file or --sample_input_file")
    return start_ids, contexts

def get_model_result(model, start_ids,random_seed_tensor,infer_config):
        if isinstance(model, Llama):
            start_lengths = torch.IntTensor([len(ids) for ids in start_ids])
            batch_size    = len(start_ids)
            return model(start_ids    =start_ids,
                        start_lengths=start_lengths,
                        output_len=start_lengths + infer_config.output_len,
                        beam_width=infer_config.beam_width,
                        top_k=infer_config.top_k * torch.ones(size=[batch_size], dtype=torch.int32),
                        top_p=infer_config.top_p * torch.ones(size=[batch_size], dtype=torch.float32),
                        beam_search_diversity_rate=infer_config.beam_search_diversity_rate * torch.ones(size=[batch_size], dtype=torch.float32),
                        temperature=infer_config.temperature * torch.ones(size=[batch_size], dtype=torch.float32),
                        len_penalty=infer_config.len_penalty * torch.ones(size=[batch_size], dtype=torch.float32),
                        repetition_penalty=infer_config.repetition_penalty * torch.ones(size=[batch_size], dtype=torch.float32),
                        random_seed=random_seed_tensor,
                        return_output_length=False,
                        return_cum_log_probs=0)
        else:
            output_ids = model.generate(start_ids,max_length=512)
            return output_ids

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--output_len', type=int, default=32,
                        help='output sequence length to generate.')
    parser.add_argument('--beam_width', type=int, default=1,
                        help='beam width for beam search. Using sampling when beam width is 1.')
    parser.add_argument('--top_k', type=int, default=1,
                        help='top k candidate num')
    parser.add_argument('--top_p', type=float, default=0.,
                        help='top p probability threshold')
    parser.add_argument('--temperature', type=float, default=1.,
                        help='temperature')
    parser.add_argument('--len_penalty', type=float, default=0.,
                        help='len_penalty')
    parser.add_argument('--beam_search_diversity_rate', type=float, default=0.,
                        help='beam_search_diversity_rate')
    parser.add_argument('--tensor_para_size', type=int, default=1,
                        help='tensor parallel size')
    parser.add_argument('--pipeline_para_size', type=int, default=1,
                        help='pipeline parallel size')
    parser.add_argument('--ckpt_path', type=str, 
                        help='path to the checkpoint file.')
    parser.add_argument('--tokenizer_path', type=str, 
                        help='directory where the tokenizer file is located.')
    parser.add_argument('--lib_path', type=str, default='./lib/libth_transformer.so',
                        help='path to the pyt_fastertransformer dynamic lib file.')
    parser.add_argument('--sample_input_file', type=str,
                        help='path to the sample input file.')
    parser.add_argument('--start_id_file', type=str,
                        help='path to the start id file.')
    parser.add_argument('--max_batch_size', type=int, default=8,
                        help='max batch size.')
    parser.add_argument('--repetition_penalty', type=float, default=1.,
                        help='repetition penalty')
    parser.add_argument('--max_seq_len', type=int, default=1024,
                        help='max sequence length for position embedding table.')
    parser.add_argument('--inference_data_type', '--data_type', type=str, choices=['fp32', 'fp16'], default='fp16')
    parser.add_argument('--time', action='store_true',
                        help='whether or not to measure time elapsed.')
    parser.add_argument('--enable_random_seed', action='store_true',
                        help='is enable the random seed.')

    args = parser.parse_args()

    
    print("\n=============== Arguments ===============")
    for arg in vars(args):
        print("{}: {}".format(arg, getattr(args, arg)))
    print("=========================================\n")

    model_config = get_model_config(args)

    #### resource_configuration
    system_config= get_system_config(args)

    #### inference configuration
    infer_config = get_infer_config(args)

    #### set the multiprocess group
    if infer_config.tensor_para_size * infer_config.pipeline_para_size > 1:
        dist.init_process_group(backend=dist.Backend.MPI)
    rank         = dist.get_rank() if dist.is_initialized() else 0
    device_count = dist.get_world_size() if dist.is_initialized() else 1
    device       = rank % device_count
    torch.cuda.set_device(device)
    device       = torch.cuda.current_device()

    # sentencepiece needed
    tokenizer = AutoTokenizer.from_pretrained(system_config.tokenizer_path, use_fast=False)

    # get ids
    start_ids, contexts = get_inputs_ids(args, tokenizer,device)
    batch_size= len(start_ids)
    print("[INFO] batch size: {}".format(batch_size))
    start_ids     = pad_sequence(start_ids, batch_first=True, padding_value=model_config.end_id).cuda()
    
    if args.enable_random_seed == True:
        random_seed_tensor = torch.randint(0, 10000, size=[batch_size], dtype=torch.int64)
    else:
        random_seed_tensor = torch.zeros([batch_size], dtype=torch.int64)

    # Prepare model.
    print("building model...............")
    llama = get_model(model_config, infer_config, system_config)
    print("done!")
    with torch.no_grad():
        print(f"[INFO] input size {start_ids.shape}")
        tokens_batch = get_model_result(llama, start_ids,random_seed_tensor,infer_config)
        print(f"[INFO] output size {tokens_batch.shape}")
        if tokens_batch is not None and rank == 0:
            tokens_batch = tokens_batch.cpu().numpy()
            if not isinstance(llama, Llama): tokens_batch = [tokens_batch]
            start_lengths = torch.IntTensor([len(ids) for ids in start_ids])
            for i, (context, tokens) in enumerate(zip(contexts, tokens_batch)):
                for beam_id in range(infer_config.beam_width):
                    token = tokens[beam_id][start_lengths[i]:]  # exclude context input from the output
                    output = tokenizer.decode(token)
                    print(f'[INFO] batch {i}, beam {beam_id}:\n[Context]\n{context}\n\n[Generated]\n{token}\n\n[Output]\n{output}\n')

        # Measure inference time.
        if args.time:
            iterations = 10
            # warmup
            for i in range(iterations):
                tokens_batch = get_model_result(llama, start_ids,random_seed_tensor,infer_config)
            batch_num = 0
            token_num = 0
            time = timeit.default_timer()
            for i in range(iterations):
                tokens_batch = get_model_result(llama, start_ids,random_seed_tensor,infer_config)
                batch_num += 1
                for j, tokens in enumerate(tokens_batch):
                    token_num += tokens.shape[-1] - start_lengths[j]
            time_elapsed = timeit.default_timer() - time
            throughput = token_num / time_elapsed
            print(f"[INFO] FT-LLAMA:{args.ckpt_path}:\n      generates {batch_num} batches, taking {time_elapsed:0.3f} secs "
                  f"to generate {token_num} tokens, {throughput:0.3f} tokens/sec.")


if __name__ == '__main__':
    main()

And use the commands:

python llama_example.py --tensor_para_size=1 --pipeline_para_size=1  --ckpt_path ~/pretrain_weights/vicuna/vicuna-7b-v1.1/ --tokenizer_path ~/pretrain_weights/vicuna/vicuna-7b-v1.1/ --lib_path ~/projects/FasterTransformer2/FasterTransformer/build/lib/libth_transformer.so --max_batch_size 1 --start_id_file start_ids.csv --time

python llama_example.py --tensor_para_size=1 --pipeline_para_size=1  --ckpt_path ~/pretrain_weights/vicuna/vicuna-7b-fastertransformer_fp16/1-gpu/ --tokenizer_path ~/pretrain_weights/vicuna/vicuna-7b-v1.1/ --lib_path ~/projects/FasterTransformer2/FasterTransformer/build/lib/libth_transformer.so --max_batch_size 1 --start_id_file start_ids.csv --time

The result should be:


=============== Arguments ===============
output_len: 32
beam_width: 1
top_k: 1
top_p: 0.0
temperature: 1.0
len_penalty: 0.0
beam_search_diversity_rate: 0.0
tensor_para_size: 1
pipeline_para_size: 1
ckpt_path: /mnt/lustre/zhangtianning/pretrain_weights/vicuna/vicuna-7b-v1.1/
tokenizer_path: /mnt/lustre/zhangtianning/pretrain_weights/vicuna/vicuna-7b-v1.1/
lib_path: /mnt/lustre/zhangtianning/projects/FasterTransformer2/FasterTransformer/build/lib/libth_transformer.so
sample_input_file: None
start_id_file: start_ids.csv
max_batch_size: 1
repetition_penalty: 1.0
max_seq_len: 1024
inference_data_type: fp16
time: True
enable_random_seed: False
=========================================

[INFO] batch size: 1
building model...............
load [hugging face] model !

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:13<00:13, 13.31s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:18<00:00,  8.55s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:18<00:00,  9.27s/it]
done!
[INFO] input size torch.Size([1, 15])
[INFO] output size torch.Size([1, 49])
[INFO] batch 0, beam 0:
[Context]
0, 18637, 29892, 526, 366, 1136, 455, 2470, 29973, 1815, 366, 5193, 304, 592, 29973

[Generated]
[   13 29902 29915 29885  7423 29892   306 29915 29885   451  9985   411
   278  1840   376  3200   455  2470  1213  6527   366  3113  3867   901
  3030   470  5649   825   366   526 16811   304 29973     2]

[Output]
 
I'm sorry, I'm not familiar with the term "consciours." Could you please provide more context or explain what you are referring to?</s>

[INFO] FT-LLAMA:/mnt/lustre/zhangtianning/pretrain_weights/vicuna/vicuna-7b-v1.1/:
      generates 10 batches, taking 9.923 secs to generate 340 tokens, 34.265 tokens/sec.
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.

=============== Arguments ===============
output_len: 32
beam_width: 1
top_k: 1
top_p: 0.0
temperature: 1.0
len_penalty: 0.0
beam_search_diversity_rate: 0.0
tensor_para_size: 1
pipeline_para_size: 1
ckpt_path: /mnt/lustre/zhangtianning/pretrain_weights/vicuna/vicuna-7b-fastertransformer_fp16/1-gpu/
tokenizer_path: /mnt/lustre/zhangtianning/pretrain_weights/vicuna/vicuna-7b-v1.1/
lib_path: /mnt/lustre/zhangtianning/projects/FasterTransformer2/FasterTransformer/build/lib/libth_transformer.so
sample_input_file: None
start_id_file: start_ids.csv
max_batch_size: 1
repetition_penalty: 1.0
max_seq_len: 1024
inference_data_type: fp16
time: True
enable_random_seed: False
=========================================

[INFO] batch size: 1
building model...............
load [fastertransformer] model !
[INFO] WARNING: Have initialized the process group
done!
[INFO] input size torch.Size([1, 15])
[INFO] output size torch.Size([1, 1, 62])
[INFO] batch 0, beam 0:
[Context]
0, 18637, 29892, 526, 366, 1136, 455, 2470, 29973, 1815, 366, 5193, 304, 592, 29973

[Generated]
[   13    13 29930 18637 29892   526   366  1136   455  2470 29973  1815
   366  5193   304   592 29973    13 29930 18637 29892   526   366  1136
   455  2470 29973  1815   366  5193   304   592 29973    13 29930 18637
 29892   526   366  1136   455  2470 29973  1815   366  5193   304]

[Output]


* Hey, are you consciours? Can you talk to me?
* Hey, are you consciours? Can you talk to me?
* Hey, are you consciours? Can you talk to

[INFO] FT-LLAMA:/mnt/lustre/zhangtianning/pretrain_weights/vicuna/vicuna-7b-fastertransformer_fp16/1-gpu/:
      generates 10 batches, taking 4.867 secs to generate 470 tokens, 96.571 tokens/sec.


veya2ztn commented Jun 2, 2023

It seems only max_batch_size=1 works. Any batch input with batch_size>1 fails:

Traceback (most recent call last):
  File "/mnt/petrelfs/zhangtianning/projects/FasterTransformer2/FasterTransformer/examples/pytorch/llama/llama_example.py", line 233, in <module>
    main()
  File "/mnt/petrelfs/zhangtianning/projects/FasterTransformer2/FasterTransformer/examples/pytorch/llama/llama_example.py", line 163, in main
    tokens_batch = llama(
  File "/mnt/cache/zhangtianning/anaconda3/envs/llm2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/petrelfs/zhangtianning/projects/FasterTransformer2/FasterTransformer/examples/pytorch/llama/../../../examples/pytorch/llama/utils/llama.py", line 283, in forward
    outputs = self.model.forward(input_ids,
RuntimeError: forward() Expected a value of type 'int' for argument '_3' but instead found type 'Tensor'.
Position: 3
Value: tensor([47, 47], dtype=torch.int32)
Declaration: forward(__torch__.torch.classes.FasterTransformer.LlamaOp _0, Tensor _1, Tensor _2, int _3, int? _4, Tensor? _5, Tensor? _6, Tensor? _7, Tensor? _8, Tensor? _9, Tensor? _10, Tensor? _11, int? _12) -> (Tensor[] _0)
Cast error details: Unable to cast Python instance to C++ type (compile in debug mode for details)

A fixed int should be used rather than start_lengths:
output_len=256, # instead of start_lengths + infer_config.output_len
This is consistent with the Hugging Face call model.generate(input_ids, max_length=512); a sketch of the change is below.
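A minimal sketch of that change, written as a drop-in replacement for get_model_result in the script above (the name get_model_result_fixed_len is introduced here; every other argument just mirrors the script, so treat it as an untested sketch):

import torch

def get_model_result_fixed_len(model, start_ids, random_seed_tensor, infer_config):
    # the LlamaOp's forward() declares output_len (argument _3 in the error) as a
    # plain int, so pass one fixed integer for the whole batch instead of the
    # per-sample tensor start_lengths + infer_config.output_len
    start_lengths = torch.IntTensor([len(ids) for ids in start_ids])
    batch_size = len(start_ids)
    return model(start_ids=start_ids,
                 start_lengths=start_lengths,
                 output_len=int(infer_config.output_len),  # e.g. 256, same for every sample
                 beam_width=infer_config.beam_width,
                 top_k=infer_config.top_k * torch.ones(size=[batch_size], dtype=torch.int32),
                 top_p=infer_config.top_p * torch.ones(size=[batch_size], dtype=torch.float32),
                 beam_search_diversity_rate=infer_config.beam_search_diversity_rate * torch.ones(size=[batch_size], dtype=torch.float32),
                 temperature=infer_config.temperature * torch.ones(size=[batch_size], dtype=torch.float32),
                 len_penalty=infer_config.len_penalty * torch.ones(size=[batch_size], dtype=torch.float32),
                 repetition_penalty=infer_config.repetition_penalty * torch.ones(size=[batch_size], dtype=torch.float32),
                 random_seed=random_seed_tensor,
                 return_output_length=False,
                 return_cum_log_probs=0)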


77h2l commented Jun 5, 2023

@veya2ztn
Thanks for your work.
The above llama_example.py script seems to have the following bug:

Traceback (most recent call last):
  File "llama_example.py", line 277, in <module>
    main()
  File "llama_example.py", line 225, in main
    tokenizer = AutoTokenizer.from_pretrained(system_config.tokenizer_path, use_fast=False)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 690, in from_pretrained
    raise ValueError(
ValueError: Tokenizer class LLaMATokenizer does not exist or is not currently imported.


veya2ztn commented Jun 6, 2023

@77h2l Upgrade your transformers package; see the from transformers import AutoTokenizer line in the script.
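For reference, a hedged sketch of a more defensive tokenizer load (load_llama_tokenizer is a name introduced here for illustration; the explicit LlamaTokenizer class needs a reasonably recent transformers release, roughly 4.28 or newer):

from transformers import AutoTokenizer, LlamaTokenizer

def load_llama_tokenizer(tokenizer_path):
    try:
        # same call as in llama_example.py
        return AutoTokenizer.from_pretrained(tokenizer_path, use_fast=False)
    except ValueError:
        # older tokenizer_config.json files declare "LLaMATokenizer", which
        # AutoTokenizer cannot resolve; fall back to the slow LLaMA tokenizer
        return LlamaTokenizer.from_pretrained(tokenizer_path)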


77h2l commented Jun 7, 2023

@veya2ztn Thanks for your reply. Your script does work, but when the batch size is set greater than 1 the error occurs. Setting output_len to a fixed value avoids the problem, but then every output has the same fixed length and needs a lot of padding, which is counterintuitive, isn't it? The output should end when it meets EOS.
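In the meantime, a minimal sketch (not part of this PR) of one way to live with the fixed output length: cut each returned sequence at the first end_id before decoding, so the padding after EOS is dropped (end_id is 2 for llama in this example):

def trim_at_eos(token_ids, end_id=2):
    """Return the generated ids up to, but not including, the first end_id."""
    ids = token_ids.tolist() if hasattr(token_ids, "tolist") else list(token_ids)
    return ids[:ids.index(end_id)] if end_id in ids else ids

# usage inside the decode loop of llama_example.py:
#   token = trim_at_eos(tokens[beam_id][start_lengths[i]:], model_config.end_id)
#   output = tokenizer.decode(token)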


chailt commented Jun 8, 2023

May I ask whether it supports a single process with multiple GPUs? I also share the confusion about every output having a fixed length.


veya2ztn commented Jun 8, 2023

The multi-GPU example also fails for me.


alanxmay commented Jun 8, 2023

@veya2ztn I have a possibly silly question: how did you compile it?

First, I built PyTorch 2.0.1 with the MPI backend from source.

Then I followed the guide docs/decoder_guide.md to compile FT without Docker, ran the example, and got this error:

CUDA Error: (null) .../FasterTransformer/3rdparty/trt_fused_multihead_attention/fused_multihead_attention.h 345

I came across issue #177; did you also compile with Docker?

@veya2ztn


I don't compile in Docker. If you want to compile an MPI backend from source, try:
torch 11.2 + CUDA 11.3 + gcc 7.5

I compiled successfully under this configuration, but it seems that MPI for CUDA is not available. Let me know if you succeed.

@alanxmay

@veya2ztn I successfully compiled it in Docker and achieved performance similar to what you reported, thanks a lot!

@BasicCoder

Thank you for your great work, but when I test the 13B, 30B, and 60B models, the following error occurs:

[FT][ERROR] CUDA runtime error: an illegal memory access was encountered /workspace/Llama-FT/FasterTransformer/src/fastertransformer/utils/allocator.h:462

You can reproduce the problem with the following commands:

convert model:
python ../examples/cpp/llama/huggingface_llama_convert.py -saved_dir=./llama-13b-hf/c-model -in_file=./llama-13b-hf -infer_gpu_num=2 -weight_data_type=fp16 -model_name=llama_13b

run model:
export CUDA_LAUNCH_BLOCKING=1
mpirun -n 2 --allow-run-as-root python ../examples/pytorch/llama/llama_example.py --tensor_para_size=2 --pipeline_para_size=1 --ckpt_path ./llama-13b-hf/c-model/2-gpu --tokenizer_path ./llama-13b-hf --lib_path ./lib/libth_transformer.so --max_batch_size 4 --inference_data_type fp16 --output_len 170 --time --start_id_file ../examples/pytorch/llama/start_ids.csv


veya2ztn commented Jun 20, 2023 via email

@BasicCoder

I don't think it's OOM. In this test I am using 2x A100 80G with tensor parallelism. In the same environment and tensor-parallel configuration, the 7B model gives normal results, but larger model sizes cause errors.
With export CUDA_LAUNCH_BLOCKING=1, the reported error line is:
check_cuda_error(cudaMemset(ptr, val, size));
so I think it's a CUDA error.

@sleepcoo

It is consistent with the Hugging Face model: model.generate(input_ids, max_length=512).

May I ask why it is written as a fixed value here?


illfg commented Jun 27, 2023

@BasicCoder Have you solved the problem?

@BasicCoder

No, I haven't resolved it yet.


illfg commented Jun 27, 2023

This is probably caused by PyTorch. I solved the problem by rebuilding PyTorch with MPI support from source.
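For reference, a rough sketch of that kind of rebuild, assuming OpenMPI and CUDA are already installed on the system (USE_MPI and USE_DISTRIBUTED are the standard PyTorch build switches, but please verify them against your PyTorch version):

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
# enable MPI so torch.distributed can use the MPI backend required by the example
USE_MPI=1 USE_DISTRIBUTED=1 python setup.py install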

@BasicCoder

Good news! Can you share more information about your execution environment (PyTorch/MPI/CUDA versions)?


illfg commented Jun 28, 2023


host:
driver 525.116.03
cuda 12.0

docker container:
driver 525.116.03
cuda 11.7
openmpi 4.1.2
cudnn 11.x
magma 117
pytorch built from source

@sleepwalker2017

Hello, what role does MPI play in the CLI? Why does the PyTorch example need MPI?


jcao-ai commented Jul 6, 2023

Hi @vitrun, the upstream branch by @void-main now supports int8 inference. Would you consider supporting it as well?


savemuri commented Jul 7, 2023

@veya2ztn Any luck getting batch_size>1 working? I hit a runtime error when I use an int as the output length, and when I use output_len=start_lengths + output_len it throws: Cast error details: Unable to cast Python instance to C++ type (compile in debug mode for details)


savemuri commented Jul 7, 2023

Update: Got it working with batches by lowering output_len to 512.

=============== Arguments ===============
output_len: 512
beam_width: 1
top_k: 1
top_p: 0.95
temperature: 0.8
len_penalty: 0.0
beam_search_diversity_rate: 0.0
tensor_para_size: 1
pipeline_para_size: 1
ckpt_path: /workspace/model/llama/1-gpu
tokenizer_path: /workspace/tokenizer
lib_path: /app/FasterTransformer/build/lib/libth_transformer.so
sample_input_file: /app/prompts.txt
start_id_file: None
max_batch_size: 12
repetition_penalty: 1.1
max_seq_len: 1024
inference_data_type: fp16
time: False
enable_random_seed: False
=========================================

and by using output_len=output_len instead of output_len=start_lengths + output_len.

@Louis-y-nlp

@vitrun Thanks for your work. I'm trying to use your code to run the Llama 2 13B model on V100-32G GPUs. 1-gpu works well for me, but when I try 2-gpu I get this error:

[INFO] batch size: 1
[INFO] batch size: 1
[INFO] WARNING: Have initialized the process group
[INFO] WARNING: Have initialized the process group
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][INFO] NCCL initialized rank=1 world_size=2 tensor_para=NcclParam[rank=1, world_size=2, nccl_comm=0x5631dea2d0a0] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0x5631dea80c70]
[FT][INFO] NCCL initialized rank=0 world_size=2 tensor_para=NcclParam[rank=0, world_size=2, nccl_comm=0x5592093795d0] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0x5592093cd030]
[FT][ERROR] CUDA runtime error: an illegal memory access was encountered /mnt/work/llama_ft_vitrun/FasterTransformer/src/fastertransformer/utils/allocator.h:462
[FT][ERROR] CUDA runtime error: an illegal memory access was encountered /mnt/work/llama_ft_vitrun/FasterTransformer/src/fastertransformer/utils/allocator.h:462
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[58456,1],0]
  Exit code:    255
--------------------------------------------------------------------------

Here is my command line:

mpirun -n 2 --allow-run-as-root python llama_example.py \
	--ckpt_path ${ckpt_path} \
	--lib_path ${lib_path} \
	--tokenizer_path ${tokenizer_name_or_path} \
	--tensor_para_size=2 --pipeline_para_size=1 --max_batch_size 1 --start_id_file start_ids.csv

Any help would be appreciated.

@vitrun vitrun closed this Jul 30, 2023
@sleepwalker2017

@BasicCoder Hello, have you solved this problem (the illegal memory access error)?
