Repo for SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting (ISCA '25)

Paper: https://arxiv.org/pdf/2504.08850
SpecEE is implemented on top of the HuggingFace framework for the cloud scenario and on top of llama.cpp for the edge scenario. We modified parts of both codebases to support SpecEE, and we plan faster research iteration in the future.

Demo (left: SpecEE; right: llama.cpp).
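For orientation before the setup steps: a lightweight draft model speculates a few candidate tokens, and a small per-layer predictor checks whether an intermediate hidden state already singles out one of them; if so, decoding exits early and skips the remaining layers. The sketch below only illustrates that control flow under assumed names (`draft.propose`, `predictors`, `lm_head`); it is not code from this repository.

```python
import torch

def decode_one_token(layers, predictors, lm_head, draft, hidden):
    """Schematic of speculative early exiting (hypothetical API, not SpecEE's code)."""
    candidates = draft.propose(hidden)  # draft model speculates a few candidate tokens
    for layer, predictor in zip(layers, predictors):
        hidden = layer(hidden)
        # A lightweight per-layer predictor decides whether this intermediate
        # hidden state already singles out one of the speculated candidates.
        if predictor(hidden, candidates):
            break  # early exit: the remaining transformer layers are skipped
    # Verify only among the candidate tokens instead of scoring the full vocabulary.
    logits = lm_head(hidden)
    return candidates[int(torch.argmax(logits[candidates]))]
```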
For the edge scenario, please enter the SpecEE_PC directory. The requirements are:

```
torch~=2.2.1
numpy~=1.26.4
sentencepiece~=0.2.0
transformers>=4.45.1,<5.0.0
gguf>=0.1.0
protobuf>=4.21.0,<5.0.0
```
Set up the environment and install the conversion dependencies:

```bash
cd SpecEE_PC
conda create -n specee python=3.12
conda activate specee
pip install -r ./requirements/requirements-convert_hf_to_gguf.txt
```

- Download SpecEE-7b-chat-hf

```bash
huggingface-cli login
huggingface-cli download YYDH2333/SpecEE-7b-chat-hf
```

- Download Llama-7b-chat-hf and EAGLE-llama2-chat-7B

```bash
huggingface-cli login
huggingface-cli download meta-llama/Llama-2-7b-chat-hf
huggingface-cli download yuhuili/EAGLE-llama2-chat-7B
```

- Copy Llama-7b-chat-hf weights to ./specee_weights
```bash
cp Llama-7b-chat-hf/config.json specee_weights/
cp Llama-7b-chat-hf/generation_config.json specee_weights/
cp Llama-7b-chat-hf/model-00001-of-00002.safetensors specee_weights/
cp Llama-7b-chat-hf/model-00002-of-00002.safetensors specee_weights/
cp Llama-7b-chat-hf/special_tokens_map.json specee_weights/
cp Llama-7b-chat-hf/tokenizer.json specee_weights/
cp Llama-7b-chat-hf/tokenizer.model specee_weights/
cp Llama-7b-chat-hf/tokenizer_config.json specee_weights/
```

- Run ./specee_weights/add_eagle.py

```bash
python ./specee_weights/add_eagle.py
```
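add_eagle.py presumably merges the downloaded EAGLE draft-model weights into specee_weights so that the later gguf conversion packs base and draft weights together. Purely as an illustration of that idea (the file names and the key prefix below are guesses, not the script's actual behavior):

```python
from safetensors.torch import load_file, save_file

# Illustrative guess at what an add_eagle-style merge could do, NOT the shipped
# script: store the EAGLE draft-model tensors alongside the copied base weights
# so that the conversion step sees both. File names and the "eagle." key prefix
# are assumptions.
eagle = load_file("EAGLE-llama2-chat-7B/model.safetensors")
draft = {f"eagle.{name}": tensor for name, tensor in eagle.items()}
save_file(draft, "specee_weights/eagle.safetensors")
```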
- Convert safetensors to gguf

```bash
python convert_hf_to_gguf.py Llama-2-7b-chat-hf --outtype f16 --outfile models/Llama-2-7b-chat-hf.gguf --verbose
python convert_hf_to_gguf.py specee_weights --outtype f16 --outfile models/SpecEE-7b-chat-hf.gguf --verbose
```

- Build with CUDA support

```bash
make GGML_CUDA=1 -j$(nproc)
```
- Run an example generation

```bash
CUDA_VISIBLE_DEVICES=0 \
./llama-cli \
    -m models/SpecEE-7b-chat-hf.gguf \
    -p "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions." \
    -e -ngl 16 -t 4 -n 256 -c 512 -s 8 --top_k 0 --temp 0
```

- Evaluate on the datasets

```bash
python eval_on_alpaca_dataset.py
python eval_on_gsm8k_dataset.py
python eval_on_humaneval_dataset.py
python eval_on_mt-bench_dataset.py
python eval_on_qa_dataset.py
python eval_on_sum_dataset.py
```
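If you prefer to run all six evaluations in one go, a small driver along these lines works; the script names are the ones listed above, and the driver itself is a convenience sketch rather than part of the repository:

```python
import subprocess

# Convenience driver (hypothetical): runs each SpecEE_PC evaluation script in turn.
EVAL_SCRIPTS = [
    "eval_on_alpaca_dataset.py",
    "eval_on_gsm8k_dataset.py",
    "eval_on_humaneval_dataset.py",
    "eval_on_mt-bench_dataset.py",
    "eval_on_qa_dataset.py",
    "eval_on_sum_dataset.py",
]

for script in EVAL_SCRIPTS:
    print(f"=== {script} ===")
    subprocess.run(["python", script], check=True)  # stop on the first failure
```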
For the cloud scenario, please enter the SpecEE-cloud directory.

First, set up the environments. We need three conda environments to evaluate SpecEE with HuggingFace and AWQ.
```bash
cd SpecEE-cloud
conda create -n SpecEE python==3.10
conda activate SpecEE
pip install -r requirements.txt
```

```bash
cd SpecEE-cloud
conda create -n awq python==3.10
conda activate awq
pip install -r requirements_awq.txt
```

```bash
cd SpecEE-cloud
conda create -n specee_awq python==3.10
conda activate specee_awq
pip install -r requirements_awq.txt
cd AutoAWQ-0.2.6
pip install -e .
```

Second, download the models needed.
```bash
huggingface-cli login
huggingface-cli download meta-llama/Llama-2-7b-chat-hf
huggingface-cli download TheBloke/Llama-2-7B-Chat-AWQ
huggingface-cli download yuhuili/EAGLE-llama2-chat-7B
```

With our code, you can evaluate both the speed and the accuracy of SpecEE on top of HuggingFace.

First, activate the SpecEE environment.

```bash
conda activate SpecEE
```

The example command to evaluate speed:
```bash
CUDA_VISIBLE_DEVICES=0 python EEInference.py --base-model-path meta-llama/Llama-2-7b-chat-hf --draft-model-path yuhuili/EAGLE-llama2-chat-7B --dataset mt_bench --task speed --predictor-path [the local path of ./llama-7b]
```

The example command to evaluate accuracy:

```bash
CUDA_VISIBLE_DEVICES=0 python EEInference.py --base-model-path meta-llama/Llama-2-7b-chat-hf --draft-model-path yuhuili/EAGLE-llama2-chat-7B --dataset sst2 --task accuracy --predictor-path [the local path of ./llama-7b]
```
You can replace the model path information in run_hf_specee.sh to reproduce all AE experiments in one go:

```bash
./run_hf_specee.sh
```
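For reference, an illustrative Python equivalent of such a driver might look like the following; the dataset/task grid and the predictor path are placeholders you would adapt, and only mt_bench and sst2 are taken from the examples above:

```python
import subprocess

# Hypothetical driver equivalent to run_hf_specee.sh; edit the paths and the
# dataset/task grid to match your AE setup.
BASE = "meta-llama/Llama-2-7b-chat-hf"
DRAFT = "yuhuili/EAGLE-llama2-chat-7B"
PREDICTOR = "/path/to/llama-7b"  # placeholder for the local predictor path

EXPERIMENTS = [("mt_bench", "speed"), ("sst2", "accuracy")]  # extend as needed

for dataset, task in EXPERIMENTS:
    subprocess.run(
        ["python", "EEInference.py",
         "--base-model-path", BASE,
         "--draft-model-path", DRAFT,
         "--dataset", dataset,
         "--task", task,
         "--predictor-path", PREDICTOR],
        check=True,
    )
```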
In our AE code, we can also evaluate the speed and accuracy of SpecEE on top of AWQ.

First, activate the SpecEE environment.

```bash
conda activate SpecEE
```

The example command:
```bash
CUDA_VISIBLE_DEVICES=0 python EEInference_awq.py --base-model-path TheBloke/Llama-2-7B-Chat-AWQ --draft-model-path yuhuili/EAGLE-llama2-chat-7B --dataset commonsenseqa --task accuracy --predictor-path [the local path of ./llama-7b]
```
For the speed comparison, first activate the awq environment.

```bash
conda activate awq
```

You can get all the AWQ speed results via:
```bash
CUDA_VISIBLE_DEVICES=0 python AwqInference.py --base-model-path TheBloke/Llama-2-7B-Chat-AWQ
```

Next, activate the specee_awq environment.

```bash
conda activate specee_awq
```

You can get all the AWQ+SpecEE speed results via:

```bash
CUDA_VISIBLE_DEVICES=0 python AwqEEInference.py --base-model-path TheBloke/Llama-2-7B-Chat-AWQ --draft-model-path yuhuili/EAGLE-llama2-chat-7B
```

The speed evaluation results of AWQ and AWQ+SpecEE are written to raw_awq.json and specee_awq.json.
You can run calculate_awq_speedup.py to get the speedup ratio of AWQ+SpecEE:

```bash
python calculate_awq_speedup.py
```
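The JSON schema of these files is not documented here; below is a minimal sketch of the computation, assuming each file maps a benchmark name to a tokens-per-second value:

```python
import json

# Hypothetical sketch of the speedup computation; the real calculate_awq_speedup.py
# may use a different JSON layout. Assumption: each file maps a benchmark name to
# a tokens-per-second value.
with open("raw_awq.json") as f:
    awq = json.load(f)
with open("specee_awq.json") as f:
    specee = json.load(f)

for name, baseline in awq.items():
    speedup = specee[name] / baseline  # >1.0 means AWQ+SpecEE decodes faster
    print(f"{name}: {speedup:.2f}x speedup")
```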

