Quantization is among the most useful recent innovations in machine learning model inference, allowing large models to be compressed and run on common consumer hardware. I could not find a comprehensive explanation of what it is, how it works, what its limitations are, or how the different types of quantization compare, so I started this repository in the hope that others will find it useful.
Maarten Grootendorst wrote a great (if general) overview, with plenty of visual aids, describing quantization, the problems it solves, and some of the problems it introduces: A Visual Guide to Quantization
Quantization is the lossy compression of the floating-point values that make up a machine learning model, done primarily to lower the amount of memory needed to run the model.
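To make "lossy compression" concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. A single per-tensor scale is a simplification; real formats use per-block or per-channel scales, but the principle is the same:

```python
def quantize_int8(weights):
    """Map float weights to int8 values in [-127, 127] with one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats; the rounding error is what makes this lossy."""
    return [v * scale for v in q]

weights = [0.127, -0.3, 0.254, 0.001]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# `restored` is close to, but not exactly, the original weights:
# each value is off by at most half the scale step.
```

Each stored value shrinks from 32 (or 16) bits to 8, at the cost of a small, bounded reconstruction error per weight.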
A typical open-weights model like Llama 3.1 8B-Instruct in BF16 is 16 gigabytes, which only 7% of users can run. Quantizing the weights to the Q6_K format reduces the model to 6.6 gigabytes, small enough for 64% of users to run. With quantization, the median graphics card owner, with at least 8 GB of graphics memory, can run this model on their own hardware.
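The arithmetic behind those sizes is simple, as this sketch shows (assuming roughly 8 billion parameters and ~6.56 bits per weight for Q6_K; real files add some overhead for metadata and layers that are kept at higher precision):

```python
PARAMS = 8e9  # approximate parameter count of an 8B model

def model_gigabytes(bits_per_weight):
    """Approximate model size: parameter count times bits per weight, in GB."""
    return PARAMS * bits_per_weight / 8 / 1e9

print(model_gigabytes(16))      # BF16 is 16 bits/weight -> 16.0 GB
print(model_gigabytes(6.5625))  # Q6_K averages ~6.56 bits/weight -> ~6.6 GB
```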
Memory bandwidth is most often the bottleneck when running models, as every weight must be read from memory to process each token. By reducing model size, less data has to be moved, so each layer loads faster.
By reducing the amount of memory needed to run a model, overall throughput can be increased through larger batch sizes, and latency (time to first token) can be reduced.
The memory used by the KV cache (the stored keys and values of the attention layers) scales linearly with the number of tokens in context. By quantizing the KV cache, a larger context can fit in the same amount of memory.
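A sketch of the KV-cache arithmetic, using Llama 3.1 8B's attention geometry (32 layers, 8 KV heads of dimension 128; treat these numbers as assumptions to be checked against the model's config):

```python
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128  # assumed Llama 3.1 8B attention shape

def kv_cache_gb(tokens, bytes_per_value):
    """Cache size: keys + values, for every layer, for every token in context."""
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value
    return tokens * per_token / 1e9

print(kv_cache_gb(128_000, 2))  # FP16 cache at 128k context: ~16.8 GB
print(kv_cache_gb(128_000, 1))  # 8-bit cache: half that, ~8.4 GB
```

Halving the bytes per cached value doubles the context that fits in a fixed memory budget.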
- GGUF - GPT-Generated Unified Format
- bitsandbytes - BitsandBytes
  - Supported by PyTorch, `bitsandbytes`, `aphrodite-engine`, and `transformers`.
- AWQ - Activation-aware Weight Quantization, which quantizes all but 1% of weights.
  - Supported by Huggingface's `transformers` library if `AutoAWQ` is installed, as well as by `vLLM`, `TensorRT-LLM`, and `aphrodite-engine`.
  - This quantization format works best when you have a calibration dataset, though one is not required.
- GPTQ - Generative Post-Training Quantization
  - Supported by `AutoGPTQ`, `ExLlamaV2`, and `TensorRT-LLM`.
  - Huggingface models can be converted to GPTQ format using the `AutoGPTQ` Python library.
  - `AutoGPTQ` quantization may require a calibration dataset. It's unclear from the docs, though all examples show one being used.
- FBGEMM - Facebook General Matrix Multiplication
  - The PyTorch-native quantization format, supporting 4-bit and 8-bit weights.
- EXL2 - ExLlamaV2 quantization format
  - GPTQ, but with support for mixed precision and different weight sizes. Supported by `ExLlamaV2`.
  - A model can be converted to the EXL2 format using the official `ExLlamaV2` conversion script.
  - EXL2 requires a calibration dataset for quantization; if one is not provided, a default dataset will be used.
- SmoothQuant - an INT8 quantization format
  - Supported by `TensorRT-LLM`, `aphrodite-engine`, and `onnxruntime`.
  - A model can be converted to the SmoothQuant format using the official conversion script.
  - SmoothQuant requires a calibration dataset to generate activation channel scales before quantizing, though prepared scales are provided for some model architectures.
- AQLM - Additive Quantization of Language Models
  - A member of the multi-codebook quantization family.
  - Supported by `AQLM` and `aphrodite-engine`, and usable with `transformers` when the `AQLM` library is installed.
  - The main page of the `AQLM` repository has a guide for quantizing models to the AQLM format.
  - This quantization format works best when you have some of the model's original training data for calibration, though it is not required.
- HQQ - Half-Quadratic Quantization
  - Supported by the `hqq` library, `transformers`, and `oobabooga`.
  - The main page of the `HQQ` repository has a guide for quantizing to this format.
  - No calibration data is needed.
- SpQR - Sparse-Quantized Representation
  - Supported by the `SpQR` library.
  - The main page of the `SpQR` repository has a guide for quantizing to this format.
  - This quantization format works best when you have some of the model's original training data for calibration, though it is not required.
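As one concrete example of on-the-fly quantization from the list above, this is roughly how a model is loaded in 4-bit with `bitsandbytes` through `transformers`. This is a configuration sketch, not a definitive recipe: it assumes a recent `transformers` release, and the model id is a placeholder for whatever model you want to load.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization applied at load time; no calibration dataset needed.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for compute after dequant
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    quantization_config=bnb_config,
)
```

Formats like AWQ, GPTQ, and EXL2 instead quantize ahead of time with their own conversion tooling, producing a new set of weight files that the supporting runtimes then load directly.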