Model Quantization Technical Report


Table of Contents

  1. Introduction
  2. Quantization Methods
  3. Code Implementation
  4. Dependency Matrix
  5. Performance Comparison
  6. Hardware Support
  7. Implementation Checklist
  8. Troubleshooting Guide
  9. Workflow Diagram
  10. Tools & Libraries
  11. Conclusion
  12. Appendices

1. Introduction

Quantization reduces the numeric precision of a neural network's weights and activations so the model can be deployed more efficiently. Typical gains (a minimal INT8 sketch follows the list):

  • Up to 4× memory reduction (FP32 → INT8)
  • 2-3× faster inference
  • Roughly 60% energy savings on supported hardware
  • Common target formats: INT8, INT4, FP16, BF16
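
To make the INT8 numbers concrete, here is a minimal, self-contained sketch of symmetric per-tensor quantization; the max-abs/127 scale rule is one common convention (not the only one), and the random tensor stands in for real weights.

```python
# Symmetric per-tensor INT8 quantization sketch (max-abs calibration).
import torch

def quantize_int8(x: torch.Tensor):
    scale = x.abs().max() / 127.0            # map the largest magnitude to ±127
    q = torch.clamp((x / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale                 # approximate FP32 reconstruction

w = torch.randn(1024, 1024)                  # FP32 weights: 4 bytes per element
q, s = quantize_int8(w)                      # INT8 weights: 1 byte per element (4x smaller)
print("max abs reconstruction error:", (w - dequantize(q, s)).abs().max().item())
```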

Types of Quantization Methods

| Method | Full Form | Key Information | When to Use |
|--------|-----------|-----------------|-------------|
| PTQ | Post-Training Quantization | No retraining needed; fast deployment; moderate accuracy drop | Edge devices, batch processing, quick prototyping |
| QAT | Quantization-Aware Training | Simulates quantization during training; minimal accuracy loss | High-accuracy requirements, medical imaging, safety-critical systems |
| GPTQ | Post-Training Quantization for Generative Pre-trained Transformers | 4-bit LLM quantization; GPU-optimized; requires calibration data | Large language models (LLaMA, Mistral), chat applications |
| AWQ | Activation-aware Weight Quantization | 4-bit with activation awareness; better outlier preservation | Instruction-tuned models, complex prompt engineering |
| Dynamic | Dynamic Quantization | On-the-fly activation quantization; flexible but higher latency | NLP models, variable input lengths |
| Static | Static Quantization | Pre-calibrated activation ranges; faster inference; needs representative data | Computer vision, fixed input sizes |

Method Selection Flowchart

```mermaid
graph TD
  A[Start Quantization] --> B{Retraining Possible?}
  B -->|Yes| C[QAT Pathway]
  B -->|No| D[PTQ Pathway]
  
  %% QAT Branch
  C --> C1[Implement Fake Quantization Layers]
  C1 --> C2[Fine-tune Model]
  C2 --> C3{Accuracy Valid?}
  C3 -->|Yes| C4[Export Quantized Model 🚀]
  C3 -->|No| C5[Adjust Training Parameters]
  C5 --> C2
  
  %% PTQ Branch
  D --> D1{Model Type?}
  D1 -->|LLM| D2[GPTQ/AWQ 4-bit]
  D1 -->|Vision| D3[Static PTQ]
  D1 -->|NLP| D4[Dynamic PTQ]
  
  %% LLM Subpath
  D2 --> D21[Prepare Calibration Data]
  D21 --> D22[Run GPTQ Optimization]
  D22 --> D23[Validate Perplexity]
  
  %% Vision/NLP Subpaths
  D3 --> D31[Collect Representative Dataset]
  D31 --> D32[Calibrate Activations]
  D32 --> D33[Convert to INT8]
  
  D4 --> D41[Quantize Weights]
  D41 --> D42[Runtime Activation Quantization]
  
  %% Validation Node
  D23 --> E[Performance Validation]
  D33 --> E
  D42 --> E
  
  E --> F{Meets Targets?}
  F -->|Yes| G[Deploy Model 🚀]
  F -->|No| H[Debug Pipeline]
  H -->|Calibration Issues| D21
  H -->|Architecture Issues| B
  
  style A fill:#4CAF50,stroke:#388E3C
  style B fill:#FFC107,stroke:#FFA000
  style C fill:#2196F3,stroke:#1976D2
  style D fill:#2196F3,stroke:#1976D2
  style G fill:#4CAF50,stroke:#388E3C
  style H fill:#F44336,stroke:#D32F2F
```

2. Quantization Methods

2.1 Post-Training Quantization (PTQ)

```python
# TensorFlow Lite PTQ (dynamic-range quantization by default)
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()
```
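
The snippet above performs dynamic-range quantization. For full-integer INT8 (required by many edge accelerators), the converter additionally needs a representative dataset to calibrate activation ranges. A hedged sketch, where `representative_gen` and the 224×224×3 input shape are placeholders for your real data pipeline:

```python
# Full-integer PTQ sketch: calibrate activations with representative inputs.
import numpy as np
import tensorflow as tf

def representative_gen():
    # Yield ~100 samples shaped like the model's real inputs (placeholder data here).
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_int8_model = converter.convert()
```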

2.2 Quantization-Aware Training (QAT)

```python
# PyTorch eager-mode QAT (model is assumed to be defined already)
import torch.ao.quantization

model.train()
model.qconfig = torch.ao.quantization.get_default_qat_qconfig('fbgemm')
torch.ao.quantization.prepare_qat(model, inplace=True)
# ... fine-tuning loop here ...
model.eval()
torch.ao.quantization.convert(model, inplace=True)
```
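
The snippet assumes an existing `model`. Below is a self-contained sketch of the same flow on a toy network, including the QuantStub/DeQuantStub boundaries and the Conv-BN-ReLU fusion that eager-mode QAT expects; the module names passed to `fuse_modules_qat` are specific to this toy class.

```python
# End-to-end eager-mode QAT sketch on a toy network.
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()      # FP32 -> quantized boundary
        self.conv = nn.Conv2d(3, 8, 3)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()
        self.dequant = tq.DeQuantStub()  # quantized -> FP32 boundary

    def forward(self, x):
        return self.dequant(self.relu(self.bn(self.conv(self.quant(x)))))

model = TinyNet().train()
tq.fuse_modules_qat(model, [["conv", "bn", "relu"]], inplace=True)
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
tq.prepare_qat(model, inplace=True)
# ... fine-tune for a few epochs so the fake-quant observers settle ...
model.eval()
quantized = tq.convert(model, inplace=False)
```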

2.3 GPTQ (4-bit)

```python
from auto_gptq import AutoGPTQForCausalLM

# Load a prequantized 4-bit GPTQ checkpoint
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-3-8B-GPTQ",
    use_safetensors=True,
    device_map="auto"
)
```
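
Loading a prequantized checkpoint, as above, is the common case. To run GPTQ yourself, AutoGPTQ also exposes a quantize path; a minimal sketch follows, with a deliberately small model and a single-sentence calibration set standing in for a real one (real calibration needs hundreds of representative samples).

```python
# GPTQ-from-scratch sketch; the calibration example is a placeholder.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "facebook/opt-125m"  # small model, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
examples = [tokenizer("Quantization trades precision for efficiency.")]
model.quantize(examples)                    # runs GPTQ against the calibration batch
model.save_quantized("opt-125m-4bit-gptq")
```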

2.4 AWQ (4-bit)

```python
# AutoAWQ quantizes a loaded model in place (model_path points to an FP16 checkpoint)
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.quantize(
    tokenizer,
    quant_config={"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
)
```
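
After quantization the model can be persisted with `model.save_quantized(out_dir)`, with the tokenizer saved alongside it, and reloaded later via `AutoAWQForCausalLM.from_quantized(out_dir)`.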

3. Code Implementation

3.1 4-bit Loading with bitsandbytes

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-8B",
    quantization_config=bnb_config
)
```
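
For additional memory savings at load time, bitsandbytes can also nest-quantize the quantization constants themselves by passing `bnb_4bit_use_double_quant=True` to `BitsAndBytesConfig`.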

3.2 Dynamic Quantization

```python
# Dynamic quantization: INT8 weights, activations quantized at runtime
import torch

quantized_model = torch.ao.quantization.quantize_dynamic(
    model,               # model is assumed to be defined already
    {torch.nn.LSTM},     # module types to quantize
    dtype=torch.qint8
)
```
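
A quick, illustrative way to confirm the size win is to serialize both models and compare byte counts (`model` and `quantized_model` as defined above):

```python
# Serialize each state_dict to an in-memory buffer and compare sizes.
import io
import torch

def serialized_mb(m: torch.nn.Module) -> float:
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"FP32: {serialized_mb(model):.1f} MB, "
      f"dynamic INT8: {serialized_mb(quantized_model):.1f} MB")
```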

4. Dependency Matrix

| Tool/Library | PyTorch | TensorFlow | ONNX | Requirements |
|---|---|---|---|---|
| Optimum | ✅ | ❌ | ✅ | transformers>=4.25 |
| TFLite | ❌ | ✅ | ❌ | tensorflow>=2.10 |
| Intel Neural Compressor | ✅ | ✅ | ✅ | neural-compressor>=2.0 |

5. Performance Comparison

| Metric | FP32 | INT8 | INT4 |
|---|---|---|---|
| Model Size (MB) | 420 | 105 | 55 |
| Latency (ms) | 45 | 22 | 30 |
| Accuracy | 84.5% | 83.1% | 82.3% |
| RAM Usage (MB) | 1200 | 600 | 450 |
| Energy (J) | 12.5 | 5.8 | 7.2 |

Representative figures; absolute numbers vary with model architecture, runtime, and hardware.
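
Latency in particular should be measured on the deployment target rather than taken from a table. A minimal timing harness sketch (warmup and iteration counts are arbitrary choices):

```python
# Average single-input latency in milliseconds for any callable PyTorch model.
import time
import torch

def measure_latency_ms(model, example, warmup=10, iters=100):
    with torch.inference_mode():
        for _ in range(warmup):      # let caches and lazy init settle before timing
            model(example)
        start = time.perf_counter()
        for _ in range(iters):
            model(example)
    return (time.perf_counter() - start) * 1000 / iters
```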

6. Hardware Support

| Hardware | PTQ | QAT | 4-bit | Notes |
|---|---|---|---|---|
| Intel CPUs | ✅ | ✅ | ❌ | AVX-512 required for INT8 |
| NVIDIA GPUs | ✅ | ✅ | ✅ | Ampere+ for 4-bit |
| ARM Cortex-M | ✅ | ❌ | ❌ | Limited to INT8 |
| Apple M1/M2 | ✅ | ❌ | ❌ | CoreML compatibility |
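
For the NVIDIA row, Ampere and newer GPUs report CUDA compute capability 8.0 or higher, which can be probed locally (the printed message is a heuristic, not a guarantee of kernel support):

```python
# Probe CUDA compute capability; Ampere-class devices report major >= 8.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability {major}.{minor}: "
          f"{'4-bit kernels likely available' if major >= 8 else 'plan for INT8 only'}")
else:
    print("No CUDA device detected")
```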

7. Implementation Checklist

```mermaid
graph TD
  A[Quantization Checklist] --> B[Method Selection]
  A --> C[Data Preparation]
  A --> D[Quantization]
  A --> E[Validation]
  A --> F[Export]
  
  B --> B1[PTQ]
  B --> B2[QAT]
  
  C --> C1[Calibration Data]
  C --> C2[Input Shapes]
  
  D --> D1[Weights: INT8]
  D --> D2[Weights: 4-bit]
  D1 --> D11[Activations: Static]
  D1 --> D12[Activations: Dynamic]
  D2 --> D21[Group-wise Scaling]
  
  E --> E1[Accuracy Check]
  E --> E2[Latency Test]
  E1 --> E11[<2% Drop]
  E2 --> E21[Target Threshold]
  
  F --> F1[ONNX]
  F --> F2[TFLite]
 
  style A fill:#2e7d32,stroke:#1b5e20,color:black
  style B fill:#1565c0,stroke:#0d47a1,color:black
  style C fill:#1565c0,stroke:#0d47a1,color:white
  style D fill:#1565c0,stroke:#0d47a1,color:white
  style E fill:#1565c0,stroke:#0d47a1,color:white
  style F fill:#4CAF50,stroke:#388E3C,color:black
  
```

Visual Legend:

  • 🟩 Green Boxes: Entry and export/deploy steps
  • 🟦 Blue Boxes: Checklist categories
  • Arrows: Workflow sequence

8. Troubleshooting Guide

| Issue | Root Cause | Solution |
|---|---|---|
| Severe accuracy drop | Poor calibration data | Use a larger, more diverse dataset |
| Runtime errors | Unsupported ops | Check framework compatibility (e.g., torch.quantized_lstm) |
| Model bloat | Mixed precision | Force INT8-only conversion |
| Calibration crash | Input range mismatch | Normalize inputs to [0, 1] |

9. Workflow Diagram

```mermaid
graph LR
  A[Original FP32 Model] --> B{Quantization Type}  
  B -->|PTQ| C[Calibrate with Dataset]  
  B -->|QAT| D[Fine-Tune with FakeQuant]  
  C --> E[Validate Metrics]  
  D --> E  
  E -->|Pass| F[Deploy Quantized Model]  
  E -->|Fail| G[Adjust Calibration]  
```

10. Tools & Libraries

Core Tools

| Tool | Framework | Use Case | Command/API Example |
|---|---|---|---|
| Optimum Intel | PyTorch | CPU-optimized quantization | `OVQuantizer.from_pretrained(model)` |
| TFLite Converter | TensorFlow | Mobile deployment | `tf.lite.Optimize.DEFAULT` |
| bitsandbytes | PyTorch | 4-bit LLMs | `BitsAndBytesConfig(load_in_4bit=True)` |
| AutoGPTQ | Transformers | GPTQ 4-bit | `AutoGPTQForCausalLM.from_quantized()` |

Specialized Libraries

  • ONNX Runtime: onnxruntime.quantization.quantize_static() (see the sketch after this list)
  • NVIDIA TensorRT: trt.Builder.create_network() (FP16/INT8)
  • Apple CoreML: coremltools.convert() (iOS/macOS)
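
A hedged sketch of the ONNX Runtime static path named above; the `CalibrationDataReader` subclass, the input tensor name `"input"`, and the file names are placeholders for your model:

```python
# ONNX Runtime static INT8 quantization sketch with a toy calibration reader.
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class ToyReader(CalibrationDataReader):
    def __init__(self, n: int = 32):
        # Feed n random batches; a real reader iterates genuine samples.
        self.batches = iter(
            [{"input": np.random.rand(1, 3, 224, 224).astype(np.float32)} for _ in range(n)]
        )

    def get_next(self):
        return next(self.batches, None)

quantize_static(
    "model_fp32.onnx",
    "model_int8.onnx",
    calibration_data_reader=ToyReader(),
    weight_type=QuantType.QInt8,
)
```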

11. Conclusion

Key Recommendations

| Scenario | Solution | Toolchain |
|---|---|---|
| Low latency | INT8 static PTQ | PyTorch/Optimum + ONNX |
| LLM deployment | 4-bit GPTQ/AWQ | Hugging Face + bitsandbytes |
| Accuracy-critical | QAT with layer-wise calibration | TensorFlow/PyTorch QAT |

Limitations

  • 4-bit Quantization: efficient kernels generally assume Ampere-class GPUs or newer (NVIDIA A100, RTX 30xx+).
  • Dynamic Quantization: not supported for all op types (e.g., LayerNorm is not dynamically quantized by default).

12. Appendices

A. Version Compatibility

| Library | Quantization Support | Version Requirement |
|---|---|---|
| PyTorch | PTQ/QAT | >=2.0 |
| TensorFlow | TFLite PTQ | >=2.10 |
| Transformers | 4-bit/AWQ/GPTQ | >=4.31 |
