Focus is a hardware–algorithm co-designed architecture that accelerates Vision-Language Model (VLM) inference by eliminating redundancy in visual tokens. It introduces a multilevel concentration pipeline that targets semantic-, block-, and vector-level redundancy, reorganizing VLM inputs into a hardware-friendly, locality-preserving format. An overview of the Focus design is shown below.
Focus Architecture Overview
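To make the idea concrete, here is a minimal, purely illustrative sketch of token-level redundancy elimination: near-duplicate visual tokens are dropped based on cosine similarity. This is not the actual Focus pipeline (which concentrates redundancy at the semantic, block, and vector levels in a hardware-friendly form; see `algorithm/`), only the intuition behind it:

```python
# Purely illustrative: greedy dropping of near-duplicate visual tokens by
# cosine similarity. The actual Focus pipeline performs multilevel
# (semantic/block/vector) concentration; see algorithm/ for the real code.
import torch

def drop_redundant_tokens(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """tokens: (N, D) visual token embeddings; returns a reduced (M, D) subset."""
    normed = torch.nn.functional.normalize(tokens, dim=-1)
    kept_idx, kept_vecs = [], []
    for i in range(tokens.size(0)):
        # Skip a token if it is too similar to any token already kept.
        if kept_vecs and (torch.stack(kept_vecs) @ normed[i]).max() > threshold:
            continue
        kept_idx.append(i)
        kept_vecs.append(normed[i])
    return tokens[kept_idx]

reduced = drop_redundant_tokens(torch.randn(576, 1024))
print(reduced.shape)  # (M, 1024) with M <= 576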
This repository provides a full-stack implementation of Focus, including the algorithm, architecture simulator, RTL hardware design, and baselines. It reproduces all experimental results reported in our evaluation.
📚 Component Documentation
This repository contains three major components, each with a detailed README:
- Algorithm — Focus algorithm, sparse-trace generation, and accuracy evaluation.
- Simulator — Performance modeling and design-space exploration.
- RTL Hardware — Verilog implementation of Focus hardware modules.
Repository Structure

- `algorithm/` – Focus algorithm implementation and accuracy evaluation.
- `simulator/` – Architecture performance simulator.
- `rtl/` – Hardware RTL implementation; includes the systolic array, SEC/SIC, and other hardware blocks.
- `evaluation_scripts/` – Plotting and result-analysis utilities.
  - `plot_scripts/` – Jupyter notebooks for generating paper figures.
- `3rd_party/` – Third-party dependencies.
  - `LLaVA-NeXT/` – LLaVA VLM implementation.
  - `scalesim/` – GEMM performance simulator.
  - `cacti/` – SRAM memory modeling.
  - `DRAMsim3/` – DRAM simulation.
Prerequisites

- Python 3.11 (conda recommended)
- CUDA-capable GPU (≥80 GB HBM recommended)
- G++
- HuggingFace access token (for model checkpoints and datasets; one way to set it up is shown below)
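For the HuggingFace token, one convenient setup path (an assumption on our part, not a Focus-specific requirement) is the `huggingface_hub` login helper, which is installed alongside transformers; the token string below is a placeholder:

```python
# One-time HuggingFace authentication so gated checkpoints and datasets
# can be downloaded. The token value is a placeholder; use your own.
from huggingface_hub import login

login(token="hf_your_token_here")
```

Equivalently, running `huggingface-cli login` in a shell stores the token for all subsequent runs.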
Installation

- Clone the repository

```bash
git clone git@github.com:dubcyfor3/Focus.git
cd Focus
```

- Initialize submodules

```bash
git submodule init
git submodule update
```

- Create and activate the environment

```bash
conda create -n focus python=3.11 -y
conda activate focus
```

- Install dependencies
```bash
# Install LLaVA-NeXT
cd 3rd_party/LLaVA-NeXT
pip install -e .

# Install ScaleSim
cd ../scalesim
pip install -e .

# Build CACTI
cd ../cacti
make

# Build DRAMsim3
cd ../DRAMsim3
make

# Install lmms-eval
cd ../../algorithm/lmms-eval
pip install -e .

# Install Focus
cd ../focus
pip install -e '.[main]'          # '[main]' ensures the correct transformers version
# pip install -e '.[qwen25_vl]'   # use this extra instead when running Qwen2.5-VL
```
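After installation, a quick sanity check (a minimal sketch; it only assumes the steps above installed PyTorch and transformers) confirms that the environment and GPU are visible:

```python
# Quick environment sanity check after installation: verifies that the
# core Python dependencies import cleanly and that a CUDA GPU is visible.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```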
Example command to run LLaVA-Video with Focus on the VideoMME dataset and export sparse traces:

```bash
cd algorithm/
python -m run_eval \
--model llava_vid \
--model_args pretrained=lmms-lab/LLaVA-Video-7B-Qwen2,conv_template=qwen_1_5,max_frames_num=64,mm_spatial_pool_mode=average \
--tasks videomme \
--focus \
--batch_size 1 \
--log_samples --log_samples_suffix llava_vid \
--output_path ./logs_traces/ \
--limit 10 \
--export_focus_trace \
--trace_dir ./output/focus_main/ \
--trace_name llava_vid_videomme \
--use_median \
--trace_meta_dir ./output/
```

See algorithm/README.md for complete usage and scripts for all datasets and models.
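To verify that traces were actually exported, you can list what landed in the trace directory; this small sketch only assumes the `--trace_dir` path used above and makes no assumption about the file format:

```python
# List exported sparse-trace files under the --trace_dir used above.
# The directory layout is whatever run_eval produced; we only walk it.
from pathlib import Path

trace_dir = Path("./output/focus_main")
for f in sorted(trace_dir.rglob("*")):
    if f.is_file():
        print(f.relative_to(trace_dir), f"({f.stat().st_size} bytes)")
```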
Example simulation using the generated traces:

```bash
cd ../simulator
python main.py \
--model llava_vid \
--dataset videomme \
--accelerator focus \
--trace_dir ../algorithm/output \
--output_dir results
```

See simulator/README.md for details on all experiments and configurations.
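To batch several simulator runs, a small driver script like the sketch below works; the accelerator names in the list are placeholders, so substitute the configurations documented in simulator/README.md:

```python
# Hypothetical driver that sweeps simulator configurations via subprocess.
# The accelerator names below are placeholders; consult simulator/README.md
# for the configurations actually supported.
import subprocess

for accelerator in ["focus", "baseline_a"]:  # "baseline_a" is a placeholder
    subprocess.run(
        [
            "python", "main.py",
            "--model", "llava_vid",
            "--dataset", "videomme",
            "--accelerator", accelerator,
            "--trace_dir", "../algorithm/output",
            "--output_dir", f"results_{accelerator}",
        ],
        check=True,
    )
```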
```bash
cd ../evaluation_scripts/plot_scripts/ipynb_src
# Open the Jupyter notebooks and execute them to generate the plots
```

The notebooks provide end-to-end instructions for reproducing all figures and tables from the paper.
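If you prefer to run the notebooks headlessly instead of opening them interactively, `jupyter nbconvert` can execute them in place; the glob below assumes the notebooks sit directly in `ipynb_src`:

```python
# Execute every notebook in ipynb_src headlessly via jupyter nbconvert.
# Assumes the notebooks live directly in the current directory.
import subprocess
from pathlib import Path

for nb in sorted(Path(".").glob("*.ipynb")):
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute",
         "--inplace", str(nb)],
        check=True,
    )
```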
Acknowledgments

This repository is built on top of the following open-source projects:

- LLaVA-NeXT
- lmms-eval
- ScaleSim
- CACTI
- DRAMsim3

We thank the authors and contributors of these projects for their valuable work.
Citation

If you find Focus helpful in your project or research, please consider citing our paper:
```bibtex
@misc{wei2025focus,
      title={Focus: A Streaming Concentration Architecture for Efficient Vision-Language Models},
      author={Chiyue Wei and Cong Guo and Junyao Zhang and Haoxuan Shan and Yifan Xu and Ziyue Zhang and Yudong Liu and Qinsi Wang and Changchun Zhou and Hai "Helen" Li and Yiran Chen},
      year={2025},
      eprint={2512.14661},
      archivePrefix={arXiv},
      primaryClass={cs.AR},
      url={https://arxiv.org/abs/2512.14661},
}
```
