
Help needed in testing out TinyLlama #1154

@jzhang38

Description


The TinyLlama project aims to pretrain a 1.1B-parameter Llama model on 3T tokens, so it should be an ideal draft model for speculative inference.

https://github.com/jzhang38/TinyLlama
https://huggingface.co/PY007/TinyLlama-1.1B-intermediate-step-240k-503b

I encountered this error when trying to use TinyLlama-1.1B-intermediate-step-240k-503b as the draft model:

/root/miniconda3/lib/python3.10/site-packages/torch/__init__.py:635: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:450.)
  _C._set_default_tensor_type(t)
Creating directory /root/.cache/flexflow/weights/model/chaoscodes/tinyllama-1.1b-intermediate-step-240k-503b/half-precision (if it doesn't exist)...
Loading 'model/chaoscodes/TinyLlama-1.1B-intermediate-step-240k-503b' model weights from the cache...
Loading weight file tok_embeddings_weight
Loading weight file layers_0_attention_norm_weight
Loading weight file layers_0_attention_wq_weight
Loading weight file layers_0_attention_wk_weight
load attention data error 1048576, 8388608, 1, /root/.cache/flexflow/weights/model/chaoscodes/tinyllama-1.1b-intermediate-step-240k-503b/half-precision/layers_0_attention_wk_weight
python: /tmp/pip-install-ijvow1hh/flexflow_0192abbf2b1a40128377649dca2ea9f0/inference/file_loader.cc:252: void load_attention_weights_v2(DT*, int, int, size_t, size_t, std::string, std::string, size_t, int) [with DT = __half; size_t = long unsigned int; std::string = std::__cxx11::basic_string<char>]: Assertion `false && "data size mismatch"' failed.
Aborted (core dumped)
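For what it's worth, the two byte counts in the "load attention data error" line are consistent with a GQA-shaped vs. an MHA-shaped K projection. This is my own arithmetic, not from the FlexFlow code; the config values assumed below (hidden size 2048, 32 query heads, 4 KV heads) are from TinyLlama's HF config:

```python
# Sanity check: do the two sizes in the error message match a
# GQA-sized vs. a full-MHA-sized wk weight for TinyLlama-1.1B?
hidden_size = 2048
num_heads = 32       # query heads
num_kv_heads = 4     # key/value heads (GQA)
head_dim = hidden_size // num_heads  # 64
bytes_per_elem = 2   # half precision

# Size of wk as actually saved on disk with GQA shapes:
gqa_wk_bytes = num_kv_heads * head_dim * hidden_size * bytes_per_elem
# Size the loader seems to expect (as if kv_heads == num_heads):
mha_wk_bytes = num_heads * head_dim * hidden_size * bytes_per_elem

print(gqa_wk_bytes, mha_wk_bytes)  # 1048576 8388608
```

Both numbers match the error message exactly, which suggests the loader is expecting a full-MHA-sized K weight for a GQA checkpoint.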

The code I used:

import flexflow.serve as ff

ff.init(
    num_gpus=4,
    memory_per_gpu=23000,
    zero_copy_memory_per_node=30000,
    tensor_parallelism_degree=4,
    pipeline_parallelism_degree=1,
)

# Specify the LLM
llm = ff.LLM("model/Llama-2-7b-hf")

# Specify a list of SSMs (just one in this case)
ssms=[]
ssm = ff.SSM("model/TinyLlama-1.1B-intermediate-step-240k-503b")
ssms.append(ssm)


# Create the sampling configs
generation_config = ff.GenerationConfig(
    do_sample=False, temperature=0.9, topp=0.8, topk=1
)

# Compile the SSMs for inference and load the weights into memory
for ssm in ssms:
    ssm.compile(generation_config)

# Compile the LLM for inference and load the weights into memory
llm.compile(generation_config, ssms=ssms)

result = llm.generate("Here are some travel tips for Tokyo:\n")

I believe this is probably not an issue with the TinyLlama weights; more likely there is a bug in how FlexFlow handles GQA weights / rope.

The reason I claim this is that the TinyLlama weights work fine with Hugging Face and llama.cpp.

Previously I spotted a bug in llama.cpp: ggml-org/llama.cpp#3364. In short, the conversion from HF weights to llama.cpp weights (from GPT-NeoX-style rope to GPT-J-style) was buggy. Nobody spotted it because an earlier GQA model like Llama-2-70B has num_heads = kv_heads ** 2, while TinyLlama has num_heads = 32 and kv_heads = 4.
The same bug exists in repos such as llama.cpp (now fixed), llama2.c, and llama2.mojo.
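To illustrate the failure mode: the rope-style conversion reorders the rows of each attention head's weight slice, so the reorder must be parameterized by the number of heads that projection actually has. The `permute` helper below is a sketch of my understanding of that reordering, not the actual llama.cpp or FlexFlow code:

```python
import numpy as np

def permute(w: np.ndarray, n_heads: int) -> np.ndarray:
    # Reorder rows within each head's slice so that half-split
    # (GPT-NeoX-style) rope pairs become interleaved (GPT-J-style).
    rows, cols = w.shape
    return (w.reshape(n_heads, 2, rows // n_heads // 2, cols)
             .swapaxes(1, 2)
             .reshape(rows, cols))

hidden, n_heads, n_kv_heads = 2048, 32, 4  # TinyLlama-like shapes
head_dim = hidden // n_heads
wq = np.arange(hidden * hidden, dtype=np.float32).reshape(hidden, hidden)
wk = np.arange(n_kv_heads * head_dim * hidden, dtype=np.float32).reshape(
    n_kv_heads * head_dim, hidden)

permute(wq, n_heads)      # fine: wq really has n_heads heads
permute(wk, n_kv_heads)   # correct for a GQA wk
# permute(wk, n_heads) does NOT crash here -- it silently reorders
# rows across KV-head boundaries whenever n_heads != n_kv_heads,
# corrupting the rope layout without any error.
```

For Llama-2-70B-style shapes the wrong head count happens to still divide the row count evenly in a way that masked the bug; with TinyLlama's 32-vs-4 split the corruption shows up.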

I am wondering if similar things may happen here.

Right now I am wondering whether this line in FlexFlow is correct:

https://github.com/flexflow/FlexFlow/blob/1d5e0c593a956b7fcc789a1b034e6ff920aad1d4/python/flexflow/serve/serve.py#L265

(The above is just a hypothesis and may not be correct. My point is that it would be nice if someone could make FlexFlow work with TinyLlama :) )
