How can we efficiently determine a lookup table (LUT) for LUT-based post-training non-uniform quantization that effectively balances model accuracy and model size?
We propose an optimization-based framework for LUT-based post-training non-uniform quantization:
- **Problem Formulation:** We formulate layer-wise, channel-wise LUT-based non-uniform quantization as a mixed-integer quartic programming problem:
  $$\min_{\mathbf{T}_i,\, \mathbf{S}_i} \left\| \mathbf{X}\mathbf{W}_i^\top - \mathbf{X}\left(\mathbf{T}_i \mathbf{S}_i\right)^\top \right\|_2^2 \quad \text{s.t.} \quad \mathbf{S}_i \in \{0, 1\}^{2^N \times n},\ \mathbf{1}^\top \mathbf{S}_i = \mathbf{1}^\top,$$
  where $\mathbf{X}$ denotes the calibration inputs, $\mathbf{W}_i \in \mathbb{R}^{1 \times n}$ is the $i$-th row of $\mathbf{W}$, $\mathbf{T}_i \in \mathbb{R}^{1 \times 2^N}$ is the $i$-th row of the lookup table $\mathbf{T}$, $\mathbf{S}_i \in \{0, 1\}^{2^N \times n}$ is a column-wise one-hot encoding matrix indicating the mapping of elements from $\mathbf{T}_i$, and $\mathbf{1}$ denotes an all-one vector.
- **Alternating Direction Optimization:** To solve this problem efficiently, we employ an alternating direction optimization framework that decomposes the objective into two subproblems and iteratively updates $\mathbf{S}_i$ and $\mathbf{T}_i$.
- **Solving the $\mathbf{T}_i$-Subproblem:** With $\mathbf{S}_i$ fixed, the $\mathbf{T}_i$-subproblem is an unconstrained quadratic program that admits a closed-form solution:
  $$\mathbf{T}_i = \mathbf{W}_i \mathbf{X}^\top \mathbf{X} \mathbf{S}_i^\top \left(\mathbf{S}_i \mathbf{X}^\top \mathbf{X} \mathbf{S}_i^\top\right)^\dagger,$$
  where $(\cdot)^\dagger$ denotes the Moore-Penrose inverse.
- **Solving the $\mathbf{S}_i$-Subproblem:** With $\mathbf{T}_i$ fixed, the objective can be rewritten as:
  $$\left\| \mathbf{L}^\top \left( \mathbf{W}_i - \mathbf{T}_i \mathbf{S}_i \right)^\top \right\|_2^2,$$
  where $\mathbf{L}$ is derived from the Cholesky decomposition $\mathbf{X}^\top \mathbf{X} = \mathbf{L}\mathbf{L}^\top$. By leveraging the triangular structure of $\mathbf{L}$, we employ a back-substitution approach to efficiently derive a sub-optimal solution for $\mathbf{S}_i$.
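As a rough illustration, the alternating scheme above can be sketched in NumPy for a single weight row. This is a toy sketch, not the GANQ implementation: it assumes a calibration matrix `X` and, for simplicity, replaces the Cholesky/back-substitution step with a plain nearest-value assignment for the $\mathbf{S}_i$-update; only the $\mathbf{T}_i$-update follows the closed form above. All names here are illustrative, not taken from the GANQ codebase.

```python
import numpy as np

def alternating_lut_quantize_row(W_i, X, bits=4, iters=10):
    """Toy alternating optimization of a per-row LUT.

    W_i : (n,) one weight row; X : (m, n) calibration inputs.
    Returns the LUT T (2**bits,) and per-weight LUT indices.
    """
    n = W_i.shape[0]
    K = 2 ** bits
    H = X.T @ X  # second-moment matrix of the calibration inputs
    # Initialize the LUT with K quantiles of the weight row.
    T = np.quantile(W_i, np.linspace(0.0, 1.0, K))
    assign = np.zeros(n, dtype=int)
    for _ in range(iters):
        # S-step (simplification): map each weight to its nearest LUT entry.
        assign = np.abs(W_i[:, None] - T[None, :]).argmin(axis=1)
        # Build the column-wise one-hot selection matrix S (K, n).
        S = np.zeros((K, n))
        S[assign, np.arange(n)] = 1.0
        # T-step: closed-form least-squares LUT update,
        # T = W_i H S^T (S H S^T)^+, using the Moore-Penrose inverse.
        T = np.linalg.pinv(S @ H @ S.T) @ (S @ H @ W_i)
    return T, assign

# Toy usage: quantize one random row to a 3-bit (8-entry) LUT.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 32))
W_i = rng.normal(size=32)
T, assign = alternating_lut_quantize_row(W_i, X, bits=3)
W_q = T[assign]  # dequantized row: every entry is a LUT value
```

Note the design point this sketch makes concrete: the quantized row is never stored directly; only the small LUT `T` and the integer indices `assign` are kept, and `T[assign]` reconstructs the row.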
## Prerequisites

First, install the required Python dependencies:

```shell
pip install -r requirements.txt
```
## Quantizing OPT Models

To quantize an OPT model, use the following command. For example, to quantize opt-125m to 4 bits using 32 calibration samples from the C4 dataset and up to 10 GANQ iterations:

```shell
CUDA_VISIBLE_DEVICES=0 python opt.py ./opt-125m c4 --bits 4 --max_epoch 10 --nsample 32
```
## Quantizing LLaMA Models

To quantize a LLaMA model, use the following command. For example, to quantize Llama-7b to 4 bits using 128 calibration samples from the C4 dataset and up to 10 GANQ iterations:

```shell
CUDA_VISIBLE_DEVICES=0 python llama.py ./Llama-7b c4 --bits 4 --max_epoch 10 --nsample 128
```
If you find GANQ useful for your project or research, please consider citing our paper:

```bibtex
@article{zhao2025ganq,
  title={GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models},
  author={Zhao, Pengxiang and Yuan, Xiaoming},
  journal={arXiv preprint arXiv:2501.12956},
  year={2025}
}
```