Description
The TinyLlama project aims to pretrain a 1.1B Llama model on 3T tokens, so that model should be an ideal draft model for speculative inference.
https://github.com/jzhang38/TinyLlama
https://huggingface.co/PY007/TinyLlama-1.1B-intermediate-step-240k-503b
I encountered this error when I tried to use TinyLlama-1.1B-intermediate-step-240k-503b as the draft model:
/root/miniconda3/lib/python3.10/site-packages/torch/__init__.py:635: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:450.)
_C._set_default_tensor_type(t)
Creating directory /root/.cache/flexflow/weights/model/chaoscodes/tinyllama-1.1b-intermediate-step-240k-503b/half-precision (if it doesn't exist)...
Loading 'model/chaoscodes/TinyLlama-1.1B-intermediate-step-240k-503b' model weights from the cache...
Loading weight file tok_embeddings_weight
Loading weight file layers_0_attention_norm_weight
Loading weight file layers_0_attention_wq_weight
Loading weight file layers_0_attention_wk_weight
load attention data error 1048576, 8388608, 1, /root/.cache/flexflow/weights/model/chaoscodes/tinyllama-1.1b-intermediate-step-240k-503b/half-precision/layers_0_attention_wk_weight
python: /tmp/pip-install-ijvow1hh/flexflow_0192abbf2b1a40128377649dca2ea9f0/inference/file_loader.cc:252: void load_attention_weights_v2(DT*, int, int, size_t, size_t, std::string, std::string, size_t, int) [with DT = __half; size_t = long unsigned int; std::string = std::__cxx11::basic_string<char>]: Assertion `false && "data size mismatch"' failed.
Aborted (core dumped)
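The two sizes in the error message look consistent with a GQA shape mismatch. If I read them as fp16 byte counts, they line up exactly with TinyLlama's K projection under GQA versus the size it would have if num_key_value_heads were assumed equal to num_attention_heads. A quick back-of-the-envelope check (my own sketch, using TinyLlama's published config; not based on reading FlexFlow's loader):

# Hypothetical check: wk sizes implied by TinyLlama's config.json
# (hidden_size=2048, num_attention_heads=32, num_key_value_heads=4), fp16 = 2 bytes/param.
hidden_size = 2048
num_heads = 32
num_kv_heads = 4
head_dim = hidden_size // num_heads      # 64
bytes_per_param = 2                      # half precision

# K projection with GQA: (num_kv_heads * head_dim) x hidden_size
gqa_wk_bytes = num_kv_heads * head_dim * hidden_size * bytes_per_param
print(gqa_wk_bytes)                      # 1048576 -- the first number in the error

# K projection if every query head had its own KV head (plain MHA assumption)
mha_wk_bytes = num_heads * head_dim * hidden_size * bytes_per_param
print(mha_wk_bytes)                      # 8388608 -- the second number in the error

So the loader may be expecting a full-MHA-sized wk while the checkpoint stores the smaller GQA-sized one.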
The code I use:
import flexflow.serve as ff

ff.init(
    num_gpus=4,
    memory_per_gpu=23000,
    zero_copy_memory_per_node=30000,
    tensor_parallelism_degree=4,
    pipeline_parallelism_degree=1
)
# Specify the LLM
llm = ff.LLM("model/Llama-2-7b-hf")
# Specify a list of SSMs (just one in this case)
ssms = []
ssm = ff.SSM("model/TinyLlama-1.1B-intermediate-step-240k-503b")
ssms.append(ssm)
# Create the sampling configs
generation_config = ff.GenerationConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)
# Compile the SSMs for inference and load the weights into memory
for ssm in ssms:
    ssm.compile(generation_config)
# Compile the LLM for inference and load the weights into memory
llm.compile(generation_config, ssms=ssms)
result = llm.generate("Here are some travel tips for Tokyo:\n")
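For reference, the GQA parameters of the draft model can be confirmed from its HuggingFace config (a quick sanity check independent of FlexFlow, using the same local path passed to ff.SSM above):

from transformers import AutoConfig

# Read the GQA-related fields from the draft model's config.json
cfg = AutoConfig.from_pretrained("model/TinyLlama-1.1B-intermediate-step-240k-503b")
print(cfg.hidden_size, cfg.num_attention_heads, cfg.num_key_value_heads)
# Expected for TinyLlama-1.1B: 2048 32 4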
I believe this is probably not an issue with the TinyLlama weights; more than likely there is a bug in how FlexFlow handles GQA weights and/or RoPE.
The reason I claim this is that the TinyLlama weights work fine with HuggingFace and llama.cpp.
Previously I spotted a bug in llama.cpp: ggml-org/llama.cpp#3364. Basically, the bug was in converting the HF weights to llama.cpp weights (from GPT-NeoX-style RoPE to GPT-J style). Nobody spotted it before because previous GQA models like Llama-2-70B have num_heads = kv_heads ** 2 (64 heads, 8 KV heads), while TinyLlama has num_heads = 32 and kv_heads = 4.
This bug existed in repos such as llama.cpp (fixed now), llama2.c and llama2.mojo.
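To illustrate the class of bug I mean (this is my own sketch, not actual code from llama.cpp or FlexFlow, and the exact layout conventions may differ): the RoPE conversion reorders the rows of each head, and for the K projection the reshape must use the KV head count, not the query head count. With GQA the wrong head count still reshapes cleanly, so nothing crashes; the pairing is just silently scrambled.

import numpy as np

def permute_rope(w, n_heads):
    # Illustrative only: reorder each head's rows so that each rotary pair
    # ends up adjacent (half-split "GPT-NeoX" layout -> interleaved "GPT-J" layout).
    rows, cols = w.shape
    head_dim = rows // n_heads
    return (w.reshape(n_heads, 2, head_dim // 2, cols)
             .swapaxes(1, 2)
             .reshape(rows, cols))

# GQA shapes like TinyLlama's: 32 query heads but only 4 KV heads, head_dim 64.
hidden_size, n_heads, n_kv_heads, head_dim = 2048, 32, 4, 64
wq = np.arange(n_heads * head_dim * hidden_size, dtype=np.float32).reshape(-1, hidden_size)
wk = np.arange(n_kv_heads * head_dim * hidden_size, dtype=np.float32).reshape(-1, hidden_size)

wq_perm = permute_rope(wq, n_heads)       # fine: wq really has 32 heads
wk_good = permute_rope(wk, n_kv_heads)    # correct: reshape wk by its 4 KV heads
wk_bad  = permute_rope(wk, n_heads)       # wrong: the 256-row wk still reshapes as 32 "heads" of 8,
                                          # so no error is raised, but the pairing is scrambled
print(np.array_equal(wk_good, wk_bad))    # False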
I am wondering if something similar may be happening here.
Right now I am wondering whether this line in FlexFlow is correct:
(The above is just a hypothesis on my part and may not be correct. My point is that it would be nice if someone could make FlexFlow work with TinyLlama :) )