How can we efficiently determine a lookup table (LUT) for LUT-based post-training non-uniform quantization that effectively balances model accuracy and model size?
We propose an optimization-based framework for LUT-based post-training non-uniform quantization:
- **Problem Formulation:** We formulate layer-wise, channel-wise LUT-based non-uniform quantization as a mixed-integer quartic programming problem:
  $$\min_{\mathbf{T}_i,\, \mathbf{S}_i} \left\| \mathbf{X}\mathbf{W}_i^\top - \mathbf{X}\left(\mathbf{T}_i \mathbf{S}_i\right)^\top \right\|_2^2 \quad \text{s.t.} \quad \mathbf{S}_i \in \{0, 1\}^{2^N \times n},\ \mathbf{1}^\top \mathbf{S}_i = \mathbf{1}^\top,$$
  where $\mathbf{X}$ denotes the calibration inputs, $\mathbf{W}_i \in \mathbb{R}^{1 \times n}$ is the $i$-th row of $\mathbf{W}$, $\mathbf{T}_i \in \mathbb{R}^{1 \times 2^N}$ is the $i$-th row of the lookup table $\mathbf{T}$, $\mathbf{S}_i \in \{0, 1\}^{2^N \times n}$ is a column-wise one-hot encoding matrix indicating the mapping of elements from $\mathbf{T}_i$, and $\mathbf{1}$ denotes an all-one vector.
- **Alternating Direction Optimization:** To solve this problem efficiently, we employ an alternating direction optimization framework that decomposes the objective into two subproblems and iteratively updates $\mathbf{S}_i$ and $\mathbf{T}_i$.
- **Solving the $\mathbf{T}_i$-Subproblem:** With $\mathbf{S}_i$ fixed, the $\mathbf{T}_i$-subproblem is an unconstrained quadratic program that admits a closed-form solution:
  $$\mathbf{T}_i = \mathbf{W}_i \mathbf{X}^\top \mathbf{X} \mathbf{S}_i^\top \left(\mathbf{S}_i \mathbf{X}^\top \mathbf{X} \mathbf{S}_i^\top\right)^\dagger,$$
  where $(\cdot)^\dagger$ denotes the Moore-Penrose inverse.
- **Solving the $\mathbf{S}_i$-Subproblem:** With $\mathbf{T}_i$ fixed, the objective can be rewritten as:
  $$\left\| \mathbf{L}^\top \left( \mathbf{W}_i - \mathbf{T}_i \mathbf{S}_i \right)^\top \right\|_2^2,$$
  where $\mathbf{L}$ is derived from the Cholesky decomposition $\mathbf{X}^\top \mathbf{X} = \mathbf{L}\mathbf{L}^\top$. By leveraging the triangular structure of $\mathbf{L}$, we employ a back-substitution approach to efficiently derive a sub-optimal solution for $\mathbf{S}_i$.
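As a rough illustration, the alternating scheme above can be sketched in NumPy for a single weight row. This is a toy sketch, not the GANQ implementation: it assumes a calibration matrix `X` and, for simplicity, replaces the Cholesky/back-substitution step with a plain nearest-value assignment for the $\mathbf{S}_i$-update; only the $\mathbf{T}_i$-update follows the closed form above. All names here are illustrative, not taken from the GANQ codebase.

```python
import numpy as np

def alternating_lut_quantize_row(W_i, X, bits=4, iters=10):
    """Toy alternating optimization of a per-row LUT.

    W_i : (n,) one weight row; X : (m, n) calibration inputs.
    Returns the LUT T (2**bits,) and per-weight LUT indices.
    """
    n = W_i.shape[0]
    K = 2 ** bits
    H = X.T @ X  # second-moment matrix of the calibration inputs
    # Initialize the LUT with K quantiles of the weight row.
    T = np.quantile(W_i, np.linspace(0.0, 1.0, K))
    assign = np.zeros(n, dtype=int)
    for _ in range(iters):
        # S-step (simplification): map each weight to its nearest LUT entry.
        assign = np.abs(W_i[:, None] - T[None, :]).argmin(axis=1)
        # Build the column-wise one-hot selection matrix S (K, n).
        S = np.zeros((K, n))
        S[assign, np.arange(n)] = 1.0
        # T-step: closed-form least-squares LUT update,
        # T = W_i H S^T (S H S^T)^+, using the Moore-Penrose inverse.
        T = np.linalg.pinv(S @ H @ S.T) @ (S @ H @ W_i)
    return T, assign

# Toy usage: quantize one random row to a 3-bit (8-entry) LUT.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 32))
W_i = rng.normal(size=32)
T, assign = alternating_lut_quantize_row(W_i, X, bits=3)
W_q = T[assign]  # dequantized row: every entry is a LUT value
```

Note the design point this sketch makes concrete: the quantized row is never stored directly; only the small LUT `T` and the integer indices `assign` are kept, and `T[assign]` reconstructs the row.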
## Prerequisites

First, install the required Python dependencies:

```shell
pip install -r requirements.txt
```
## Quantizing OPT Models

To quantize an OPT model, use the following command. For example, to quantize opt-125m to 4 bits using 32 calibration samples from the C4 dataset and up to 10 GANQ iterations:

```shell
CUDA_VISIBLE_DEVICES=0 python opt.py ./opt-125m c4 --bits 4 --max_epoch 10 --nsample 32
```
## Quantizing LLaMA Models

To quantize a LLaMA model, use the following command. For example, to quantize Llama-7b to 4 bits using 128 calibration samples from the C4 dataset and up to 10 GANQ iterations:

```shell
CUDA_VISIBLE_DEVICES=0 python llama.py ./Llama-7b c4 --bits 4 --max_epoch 10 --nsample 128
```
If you find GANQ useful for your project or research, please consider citing our paper:

```bibtex
@article{zhao2025ganq,
  title={GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models},
  author={Zhao, Pengxiang and Yuan, Xiaoming},
  journal={arXiv preprint arXiv:2501.12956},
  year={2025}
}
```