A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models
🚀 News: SegQuant has been accepted by CVPR 2026!
Our project has been tested with Python 3.10 (specifically version 3.10.12) and CUDA 12.5. We highly recommend using a virtual environment, such as Anaconda3, to manage and install the required dependencies.
Before installation, make sure all required Python dependencies are available. You can install them using:

```bash
pip install -r requirements.txt
```

Then, install the segquant package in editable mode (recommended for development):

```bash
pip install -e .
```

This installs the package in place, so changes to the source code are reflected immediately without reinstallation.
Alternatively, you can build and install it as a standard Python package:

```bash
python -m build
pip install dist/segquant-*.whl
```

Note: This project is organized using pyproject.toml and requires Python ≥ 3.10. You should also ensure build tools such as setuptools, wheel, and build are installed.
This project may depend on the CUTLASS library. You'll need to set the CUTLASS_PATH environment variable to point to its installation directory. If it's not set, the project will default to /usr/local/cutlass.
You can set this temporarily in your current terminal session:

```bash
export CUTLASS_PATH=/path/to/your/cutlass
```

To quantize a diffusion model, follow these steps:
- Generate a calibration dataset
Use `generate_calibrate_set` to sample data from the model. This dataset will be used to calibrate and optimize the quantized model.
```python
generate_calibrate_set(
    model,                   # The original model to sample from
    sampler,                 # A data sampler (e.g., NormalSampler, UniformSampler)
    sample_dataloader,       # Dataloader for the calibration images or text prompts
    calib_layer,             # The target layer to extract calibration features
    dump_path="calib_data",  # Optional: where to store calibration data
)
```

This function supports chunked saving and optional compression, making it scalable for large datasets. If no dump_path is specified, the data is returned in memory.
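Conceptually, the calibration set exists to estimate activation statistics from which quantization parameters are derived. The sketch below is plain Python with hypothetical names (SegQuant's actual calibration logic lives inside the library); it shows the simplest form of the idea — a symmetric per-tensor scale computed from the largest absolute activation observed during calibration.

```python
def calibrate_scale(batches, n_bits=8):
    """Derive a symmetric per-tensor scale from calibration activations.

    Illustrative only; real calibrators often use percentiles or
    MSE-optimal clipping instead of the raw max.
    """
    max_abs = max(abs(v) for batch in batches for v in batch)
    qmax = 2 ** (n_bits - 1) - 1  # 127 for int8
    return max_abs / qmax

# Activations gathered from a few calibration batches.
calib_batches = [[0.1, -2.0, 0.5], [1.5, -0.3, 0.9]]
scale = calibrate_scale(calib_batches)  # 2.0 / 127
```

The choice of sampler (e.g., NormalSampler vs. UniformSampler) determines which inputs produce these activations, and therefore how representative the estimated ranges are.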
- Quantize the model
Once calibration data is prepared, call `quantize()`:
```python
quantized_model = quantize(
    model,                  # The original full-precision model
    calib_data_loader,      # The loader of calibration features
    config=quant_config,    # Optional config dict for quantization parameters
    tmp_device=None,        # Optional: multi-device support
    verbose=True,           # Print debug info if needed
    example=input_sample,   # Optional example input for operator tracing
)
```

The result is a quantized model that can be evaluated or deployed.
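To make concrete what a quantized layer ends up storing, here is a toy int8 round trip for a single weight row: quantize with a shared symmetric scale, then dequantize. This is a generic post-training quantization sketch, not SegQuant's kernels.

```python
def quantize_row(weights, scale, qmin=-128, qmax=127):
    """Round weights to the nearest int8 level, clamped to the int8 range."""
    return [max(qmin, min(qmax, round(w / scale))) for w in weights]

def dequantize_row(q, scale):
    """Map integer levels back to real values."""
    return [v * scale for v in q]

w = [0.30, -0.75, 1.20]
scale = 1.20 / 127               # symmetric scale from the row's max
q = quantize_row(w, scale)       # [32, -79, 127]
w_hat = dequantize_row(q, scale) # close to w, within half a scale step
```

The reconstruction error is bounded by half a quantization step, which is why picking good scales (the calibration step above) matters so much.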
To further improve the quality of quantized diffusion models, an affiner can be trained to compensate for errors introduced by quantization.
Use process_affiner() as follows:
```python
affiner = process_affiner(
    config=affiner_cfg,        # Dict containing optimizer & solver settings
    dataset=calib_dataset,     # Dataset used to compute affine corrections
    model_real=fp_model,       # Ground-truth full-precision model
    model_quant=quant_model,   # Already quantized model
    latents=optional_latents,  # Optional: precomputed latents
    shuffle=True,              # Whether to shuffle training data
)
```

This step is optional but highly recommended for high-fidelity tasks such as image generation. You can also plug in a third-party affiner module via the `thirdparty_affiner` argument.
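The principle behind the affiner can be shown with a scalar least-squares fit: learn an affine map y ≈ a·x + b that pulls quantized-model outputs back toward the full-precision model's outputs. `process_affiner()` is considerably more involved (per-timestep corrections, configurable solvers); this sketch only illustrates the error-compensation idea.

```python
def fit_affine(quant_out, real_out):
    """Closed-form least-squares fit of real ≈ a * quant + b."""
    n = len(quant_out)
    mq = sum(quant_out) / n
    mr = sum(real_out) / n
    cov = sum((q - mq) * (r - mr) for q, r in zip(quant_out, real_out))
    var = sum((q - mq) ** 2 for q in quant_out)
    a = cov / var
    b = mr - a * mq
    return a, b

# Suppose quantization introduced a systematic shift: real = 2*quant + 1.
quant = [0.0, 1.0, 2.0, 3.0]
real = [1.0, 3.0, 5.0, 7.0]
a, b = fit_affine(quant, real)  # recovers a ≈ 2.0, b ≈ 1.0
```

Applying the fitted correction at inference time then removes the systematic component of the quantization error, leaving only the residual noise.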
Note: In our implementation of the diffusion model's forward function, we support a stepper argument. You can pass the trained affiner to this argument to seamlessly perform error reconstruction during the generation process.
For detailed usage and parameter descriptions of these APIs, please refer to the full documentation.
During quantization, our framework constructs semantic structures from the model to automatically select appropriate quantization configurations for linear layers. These include segmentation patterns suited to techniques like Chunk-Linear, and activation structures suited to DualScale quantization. We also support graph-based semantic pattern detection, enabling integration with other optimization strategies.
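A small numerical example illustrates why segment-aware scale selection matters. When two channel groups differ sharply in magnitude, a single shared scale rounds the low-magnitude group to zero, while a per-chunk scale (the idea behind Chunk-Linear/SegLinear) preserves it. The code below is purely illustrative and independent of SegQuant's actual implementation.

```python
def quant_error(vals, scale):
    """Max round-trip error when quantizing vals with a given scale."""
    return max(abs(round(v / scale) * scale - v) for v in vals)

small = [0.01, -0.02, 0.015]  # low-magnitude channel group
large = [5.0, -7.9, 6.3]      # high-magnitude channel group
qmax = 127                    # symmetric int8

shared_scale = max(abs(v) for v in small + large) / qmax  # one scale for all
chunk_scale = max(abs(v) for v in small) / qmax           # scale for the small chunk only

err_shared = quant_error(small, shared_scale)   # small values all round to 0
err_chunked = quant_error(small, chunk_scale)   # orders of magnitude smaller
```

With the shared scale the entire small group quantizes to level 0, losing it completely; with its own scale the group keeps the full int8 resolution. Detecting which layers exhibit this structure is exactly what the semantic analysis automates.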
We provide CUDA kernels that implement key optimization strategies. These kernels are designed to be easily reused in other quantization or model inference frameworks. For integration examples, please refer to this.
In the backend directory, we use three popular text-to-image diffusion models (Stable Diffusion 3.5[1], FLUX-1.0-dev[2] and Stable Diffusion XL[3]) as examples. These implementations are adapted from the Diffusers library. Our framework supports any PyTorch-based models, and the quantization and optimization are not limited to diffusion models.
In the dataset directory, we provide several datasets commonly used in the diffusion-model domain, including MS-COCO[4], Densely Captioned Images[5], and MJHQ-30K[6]. To support quantization testing with ControlNet[7], we have created preprocessing scripts that generate ControlNet (Canny[8]) input images from these open-source datasets. These tools make it easy to convert and adapt datasets for various experimental needs.
For detailed instructions on dataset usage, please refer to this.
Note: The pre-processed datasets used in our paper experiments are now publicly available on Hugging Face. You can access the ready-to-use versions at CSunRay/SegQuant-Dataset.
To align with mainstream quantization frameworks such as ModelOPT, we reference parts of their quantization implementation—particularly components related to FP8 quantization.
On the kernel side, we build upon CUTLASS for custom CUDA kernel development. Our framework also integrates unique features such as SegLinear and DualScale quantization, offering improved performance and flexibility.
In terms of quantization optimization algorithms, we draw inspiration from GPTQ[9], SmoothQuant[10] and SVDQuant[11] (INT4-based) implementations. These references help demonstrate the orthogonality of our quantization approach with respect to mainstream methods, showcasing its general applicability and compatibility.
For experimental purposes, we have also implemented several related optimization methods from recent papers, including PTQ4DM[12], Q-Diffusion[13], PTQD[14], and TAC-Diffusion[15].
```bibtex
@misc{zhang2025segquantsemanticsawaregeneralizablequantization,
      title={SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models},
      author={Jiaji Zhang and Ruichao Sun and Hailiang Zhao and Jiaju Wu and Peng Chen and Hao Li and Yuying Liu and Kingsum Chow and Gang Xiong and Shuiguang Deng},
      year={2025},
      eprint={2507.14811},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.14811},
}
```

[1] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. 2024. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the 41st International Conference on Machine Learning (ICML'24), Vol. 235. JMLR.org, Article 503, 12606–12633.
[2] Black Forest Labs. 2024. Flux.1. Retrieved May 5, 2025 from https://blackforestlabs.ai/
[3] Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn, T.; Müller, J.; Penna, J.; and Rombach, R. 2023. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. arXiv:2307.01952.
[4] Lin, TY. et al. (2014). Microsoft COCO: Common Objects in Context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8693. Springer, Cham. https://doi.org/10.1007/978-3-319-10602-1_48
[5] J. Urbanek et al., "A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions," in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2024, pp. 26690-26699, doi: 10.1109/CVPR52733.2024.02521.
[6] Li, Daiqing, et al. "Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation." arXiv preprint arXiv:2402.17245 (2024).
[7] L. Zhang, A. Rao and M. Agrawala, "Adding Conditional Control to Text-to-Image Diffusion Models," 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023, pp. 3813-3824, doi: 10.1109/ICCV51070.2023.00355.
[8] J. Canny, "A Computational Approach to Edge Detection," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-8, no. 6, pp. 679-698, Nov. 1986, doi: 10.1109/TPAMI.1986.4767851.
[9] Frantar, E.; Ashkboos, S.; Hoefler, T.; and Alistarh, D. 2022. GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers. arXiv preprint arXiv:2210.17323.
[10] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning (ICML'23), Vol. 202. JMLR.org, Article 1585, 38087–38099.
[11] Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. 2025. SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR 2025).
[12] Y. Shang, Z. Yuan, B. Xie, B. Wu and Y. Yan, "Post-Training Quantization on Diffusion Models," 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 2023, pp. 1972-1981, doi: 10.1109/CVPR52729.2023.00196.
[13] X. Li et al., "Q-Diffusion: Quantizing Diffusion Models," 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023, pp. 17489-17499, doi: 10.1109/ICCV51070.2023.01608.
[14] Yefei He, Luping Liu, Jing Liu, Weijia Wu, Hong Zhou, and Bohan Zhuang. 2023. PTQD: accurate post-training quantization for diffusion models. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NIPS '23). Curran Associates Inc., Red Hook, NY, USA, Article 580, 13237–13249.
[15] Yuzhe Yao, Feng Tian, Jun Chen, Haonan Lin, Guang Dai, Yong Liu, and Jingdong Wang. 2024. Timestep-Aware Correction for Quantized Diffusion Models. In Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXVI. Springer-Verlag, Berlin, Heidelberg, 215–232. https://doi.org/10.1007/978-3-031-72848-8_13