Focus is a hardware–algorithm co-designed architecture that accelerates Vision-Language Model (VLM) inference by eliminating redundancy in visual tokens. It introduces a multilevel concentration pipeline that targets semantic-, block-, and vector-level redundancy, reorganizing VLM inputs into a hardware-friendly, locality-preserving format. An overview of the Focus design is shown below.
Focus Architecture Overview
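To make the idea concrete, here is a minimal, purely illustrative sketch of token-level redundancy elimination: near-duplicate visual tokens are dropped based on cosine similarity. This is not the actual Focus pipeline (which concentrates redundancy at the semantic, block, and vector levels in a hardware-friendly form; see `algorithm/`), only the intuition behind it:

```python
# Purely illustrative: greedy dropping of near-duplicate visual tokens by
# cosine similarity. The actual Focus pipeline performs multilevel
# (semantic/block/vector) concentration; see algorithm/ for the real code.
import torch

def drop_redundant_tokens(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """tokens: (N, D) visual token embeddings; returns a reduced (M, D) subset."""
    normed = torch.nn.functional.normalize(tokens, dim=-1)
    kept_idx, kept_vecs = [], []
    for i in range(tokens.size(0)):
        # Skip a token if it is too similar to any token already kept.
        if kept_vecs and (torch.stack(kept_vecs) @ normed[i]).max() > threshold:
            continue
        kept_idx.append(i)
        kept_vecs.append(normed[i])
    return tokens[kept_idx]

reduced = drop_redundant_tokens(torch.randn(576, 1024))
print(reduced.shape)  # (M, 1024) with M <= 576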
This repository provides a full-stack implementation of Focus, including the algorithm, architecture simulator, RTL hardware design, and baselines. It reproduces all experimental results reported in our evaluation.
📚 Component Documentation
This repository contains three major components, each with a detailed README:
- Algorithm — Focus algorithm, sparse-trace generation, and accuracy evaluation.
- Simulator — Performance modeling and design-space exploration.
- RTL Hardware — Verilog implementation of Focus hardware modules.
Repository Structure

- `algorithm/` – Focus algorithm implementation and accuracy evaluation.
- `simulator/` – Architecture performance simulator.
- `rtl/` – Hardware RTL implementation; includes the systolic array, SEC/SIC, and other hardware blocks.
- `evaluation_scripts/` – Plotting and result-analysis utilities.
  - `plot_scripts/` – Jupyter notebooks for generating paper figures.
- `3rd_party/` – Third-party dependencies.
  - `LLaVA-NeXT/` – LLaVA VLM implementation.
  - `scalesim/` – GEMM performance simulator.
  - `cacti/` – SRAM memory modeling.
  - `DRAMsim3/` – DRAM simulation.
Prerequisites

- Python 3.11 (conda recommended)
- CUDA-capable GPU (≥80 GB HBM recommended)
- G++
- HuggingFace access token (for model checkpoints and datasets; one way to set it up is shown below)
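For the HuggingFace token, one convenient setup path (an assumption on our part, not a Focus-specific requirement) is the `huggingface_hub` login helper, which is installed alongside transformers; the token string below is a placeholder:

```python
# One-time HuggingFace authentication so gated checkpoints and datasets
# can be downloaded. The token value is a placeholder; use your own.
from huggingface_hub import login

login(token="hf_your_token_here")
```

Equivalently, running `huggingface-cli login` in a shell stores the token for all subsequent runs.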
Installation

- Clone the repository

```bash
git clone git@github.com:dubcyfor3/Focus.git
cd Focus
```

- Initialize submodules

```bash
git submodule init
git submodule update
```

- Create and activate the environment

```bash
conda create -n focus python=3.11 -y
conda activate focus
```

- Install dependencies
```bash
# Install LLaVA-NeXT
cd 3rd_party/LLaVA-NeXT
pip install -e .

# Install ScaleSim
cd ../scalesim
pip install -e .

# Build CACTI
cd ../cacti
make

# Build DRAMsim3
cd ../DRAMsim3
make

# Install lmms-eval
cd ../../algorithm/lmms-eval
pip install -e .

# Install Focus
cd ../focus
pip install -e '.[main]'          # '[main]' ensures the correct transformers version
# pip install -e '.[qwen25_vl]'   # use this extra instead when running Qwen2.5-VL
```
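After installation, a quick sanity check (a minimal sketch; it only assumes the steps above installed PyTorch and transformers) confirms that the environment and GPU are visible:

```python
# Quick environment sanity check after installation: verifies that the
# core Python dependencies import cleanly and that a CUDA GPU is visible.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```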
Example command to run LLaVA-Video with Focus on the VideoMME dataset and export sparse traces:

```bash
cd algorithm/
python -m run_eval \
--model llava_vid \
--model_args pretrained=lmms-lab/LLaVA-Video-7B-Qwen2,conv_template=qwen_1_5,max_frames_num=64,mm_spatial_pool_mode=average \
--tasks videomme \
--focus \
--batch_size 1 \
--log_samples --log_samples_suffix llava_vid \
--output_path ./logs_traces/ \
--limit 10 \
--export_focus_trace \
--trace_dir ./output/focus_main/ \
--trace_name llava_vid_videomme \
--use_median \
--trace_meta_dir ./output/
```

See algorithm/README.md for complete usage and scripts for all datasets and models.
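To verify that traces were actually exported, you can list what landed in the trace directory; this small sketch only assumes the `--trace_dir` path used above and makes no assumption about the file format:

```python
# List exported sparse-trace files under the --trace_dir used above.
# The directory layout is whatever run_eval produced; we only walk it.
from pathlib import Path

trace_dir = Path("./output/focus_main")
for f in sorted(trace_dir.rglob("*")):
    if f.is_file():
        print(f.relative_to(trace_dir), f"({f.stat().st_size} bytes)")
```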
Example simulation using the generated traces:

```bash
cd ../simulator
python main.py \
--model llava_vid \
--dataset videomme \
--accelerator focus \
--trace_dir ../algorithm/output \
--output_dir results
```

See simulator/README.md for details on all experiments and configurations.
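To batch several simulator runs, a small driver script like the sketch below works; the accelerator names in the list are placeholders, so substitute the configurations documented in simulator/README.md:

```python
# Hypothetical driver that sweeps simulator configurations via subprocess.
# The accelerator names below are placeholders; consult simulator/README.md
# for the configurations actually supported.
import subprocess

for accelerator in ["focus", "baseline_a"]:  # "baseline_a" is a placeholder
    subprocess.run(
        [
            "python", "main.py",
            "--model", "llava_vid",
            "--dataset", "videomme",
            "--accelerator", accelerator,
            "--trace_dir", "../algorithm/output",
            "--output_dir", f"results_{accelerator}",
        ],
        check=True,
    )
```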
```bash
cd ../evaluation_scripts/plot_scripts/ipynb_src
# Open the Jupyter notebooks and execute them to generate the plots
```

The notebooks provide end-to-end instructions for reproducing all figures and tables from the paper.
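If you prefer to run the notebooks headlessly instead of opening them interactively, `jupyter nbconvert` can execute them in place; the glob below assumes the notebooks sit directly in `ipynb_src`:

```python
# Execute every notebook in ipynb_src headlessly via jupyter nbconvert.
# Assumes the notebooks live directly in the current directory.
import subprocess
from pathlib import Path

for nb in sorted(Path(".").glob("*.ipynb")):
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute",
         "--inplace", str(nb)],
        check=True,
    )
```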
Acknowledgments

This repository is built on top of the following open-source projects:

- LLaVA-NeXT
- lmms-eval
- ScaleSim
- CACTI
- DRAMsim3

We thank the authors and contributors of these projects for their valuable work.
Citation

If you find Focus helpful in your project or research, please consider citing our paper:
```bibtex
@misc{wei2025focus,
      title={Focus: A Streaming Concentration Architecture for Efficient Vision-Language Models},
      author={Chiyue Wei and Cong Guo and Junyao Zhang and Haoxuan Shan and Yifan Xu and Ziyue Zhang and Yudong Liu and Qinsi Wang and Changchun Zhou and Hai "Helen" Li and Yiran Chen},
      year={2025},
      eprint={2512.14661},
      archivePrefix={arXiv},
      primaryClass={cs.AR},
      url={https://arxiv.org/abs/2512.14661},
}
```
