- Introduction
- Quantization Methods
- Code Implementation
- Dependency Matrix
- Performance Comparison
- Hardware Support
- Implementation Checklist
- Troubleshooting Guide
- Workflow Diagram
- Tools & Libraries
- Conclusion
- Appendices
Quantization reduces neural network precision for efficient deployment:
- 4× memory reduction (FP32 → INT8)
- 2-3× faster inference
- 60% energy savings
- Supported formats: INT8, INT4, FP16, BF16
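
The 4× figure follows directly from storing each value in 1 byte instead of 4. The sketch below illustrates this with a plain affine INT8 quantizer written in standard-library Python. The helper names (`quantize_int8`, `dequantize_int8`) and the sample weights are illustrative assumptions, not part of any framework API.

```python
import struct

def quantize_int8(values):
    """Affine (asymmetric) INT8 quantization of a list of floats.

    Returns (q, scale, zero_point) with every q in [-128, 127].
    """
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)     # keep 0.0 exactly representable
    scale = (hi - lo) / 255.0 or 1.0        # fall back to 1.0 for an all-zero input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Map INT8 codes back to approximate float values."""
    return [(x - zero_point) * scale for x in q]

# Hypothetical weights, chosen only for illustration.
weights = [-1.5, -0.25, 0.0, 0.1, 0.75, 2.0]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(q, scale, zero_point)

# Storage cost: 4 bytes per FP32 value versus 1 byte per INT8 code.
fp32_bytes = len(struct.pack(f"{len(weights)}f", *weights))
int8_bytes = len(struct.pack(f"{len(q)}b", *q))
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"FP32: {fp32_bytes} B, INT8: {int8_bytes} B, "
      f"ratio: {fp32_bytes / int8_bytes:.1f}x, max error: {max_error:.5f}")
```

The scale and zero point add a few bytes of overhead per tensor, so real models land slightly below an exact 4× reduction.
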
| Method | Full Form | Key Information | When to Use |
|---|---|---|---|
| PTQ | Post-Training Quantization | No retraining needed; fast deployment; moderate accuracy drop | Edge devices, batch processing, quick prototyping |
| QAT | Quantization-Aware Training | Simulates quantization during training; minimal accuracy loss | High-accuracy requirements, medical imaging, safety-critical systems |
| GPTQ | Post-Training Quantization for Generative Pre-trained Transformers | 4-bit LLM quantization; GPU-optimized; requires calibration data | Large language models (LLaMA, Mistral), chat applications |
| AWQ | Activation-aware Weight Quantization | 4-bit with activation awareness; better outlier preservation | Instruction-tuned models, complex prompt engineering |
| Dynamic | Dynamic Quantization | On-the-fly activation quantization; flexible but higher latency | NLP models, variable input lengths |
| Static | Static Quantization | Pre-calibrated ranges; faster inference; needs representative data | Computer vision, fixed input sizes |
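
The 4-bit methods in the table store weights in small groups that each carry their own scale, which is what the `q_group_size` setting in the AWQ configuration controls. The following is a minimal sketch of that group-wise idea using plain round-to-nearest. It is not GPTQ or AWQ themselves, which add error compensation and activation-aware scaling on top, and the function names and sample weights are assumptions made for illustration.

```python
def quantize_groupwise_4bit(weights, group_size=128):
    """Round-to-nearest unsigned 4-bit quantization, one (scale, offset) per group."""
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15.0 or 1.0      # 4 bits give 16 levels: 0..15
        codes = [max(0, min(15, round((w - lo) / scale))) for w in group]
        groups.append((codes, scale, lo))
    return groups

def dequantize_groupwise(groups):
    """Rebuild approximate float weights from the per-group codes."""
    restored = []
    for codes, scale, lo in groups:
        restored.extend(lo + c * scale for c in codes)
    return restored

def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))

# Hypothetical weights: small values plus a single outlier (5.0) in the second half.
weights = [0.01 * i for i in range(8)] + [5.0] + [0.01 * i for i in range(7)]

grouped = dequantize_groupwise(quantize_groupwise_4bit(weights, group_size=8))
per_tensor = dequantize_groupwise(quantize_groupwise_4bit(weights, group_size=len(weights)))

# Compare the error on the first 8 weights, which contain no outlier.
error_grouped = max_abs_error(weights[:8], grouped[:8])
error_per_tensor = max_abs_error(weights[:8], per_tensor[:8])
print(f"group-wise: {error_grouped:.4f}, per-tensor: {error_per_tensor:.4f}")
```

With a single scale for the whole tensor, the outlier stretches the range and the small weights collapse to the same code. With groups, the damage stays inside the group that contains the outlier.
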
graph TD
A[Start Quantization] --> B{Retraining Possible?}
B -->|Yes| C[QAT Pathway]
B -->|No| D[PTQ Pathway]
%% QAT Branch
C --> C1[Implement Fake Quantization Layers]
C1 --> C2[Fine-tune Model]
C2 --> C3{Accuracy Valid?}
C3 -->|Yes| C4[Export Quantized Model]
C3 -->|No| C5[Adjust Training Parameters]
C5 --> C2
%% PTQ Branch
D --> D1{Model Type?}
D1 -->|LLM| D2[GPTQ/AWQ 4-bit]
D1 -->|Vision| D3[Static PTQ]
D1 -->|NLP| D4[Dynamic PTQ]
%% LLM Subpath
D2 --> D21[Prepare Calibration Data]
D21 --> D22[Run GPTQ Optimization]
D22 --> D23[Validate Perplexity]
%% Vision/NLP Subpaths
D3 --> D31[Collect Representative Dataset]
D31 --> D32[Calibrate Activations]
D32 --> D33[Convert to INT8]
D4 --> D41[Quantize Weights]
D41 --> D42[Runtime Activation Quantization]
%% Validation Node
D23 --> E[Performance Validation]
D33 --> E
D42 --> E
E --> F{Meets Targets?}
F -->|Yes| G[Deploy Model]
F -->|No| H[Debug Pipeline]
H -->|Calibration Issues| D21
H -->|Architecture Issues| B
style A fill:#4CAF50,stroke:#388E3C
style B fill:#FFC107,stroke:#FFA000
style C fill:#2196F3,stroke:#1976D2
style D fill:#2196F3,stroke:#1976D2
style G fill:#4CAF50,stroke:#388E3C
style H fill:#F44336,stroke:#D32F2F
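
The first two decisions in the diagram above (whether retraining is possible, then the model type) can be expressed as a small lookup. This is only a sketch of the flowchart's branching. The function name and the lowercase model-type keys are assumptions made for illustration.

```python
def choose_quantization_path(retraining_possible: bool, model_type: str) -> str:
    """Return the pathway the flowchart selects for a given situation."""
    if retraining_possible:
        return "QAT"

    # PTQ branch: the diagram splits on model type.
    ptq_paths = {
        "llm": "GPTQ/AWQ 4-bit",
        "vision": "Static PTQ",
        "nlp": "Dynamic PTQ",
    }
    try:
        return ptq_paths[model_type.lower()]
    except KeyError:
        raise ValueError(
            f"unknown model type {model_type!r}; expected one of {sorted(ptq_paths)}"
        ) from None

print(choose_quantization_path(False, "LLM"))
print(choose_quantization_path(True, "vision"))
```
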
# TensorFlow Lite Example
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model("model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

# PyTorch Example
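import torch.ao.quantization
# `model` is an existing FP32 torch.nn.Module
model.train()  # prepare_qat only works on a model in training mode
model.qconfig = torch.ao.quantization.get_default_qat_qconfig('fbgemm')
torch.ao.quantization.prepare_qat(model, inplace=True)
# Training loop here
model.eval()  # switch to eval mode before converting
torch.ao.quantization.convert(model, inplace=True)

# AutoGPTQ Example
from auto_gptq import AutoGPTQForCausalLM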
model = AutoGPTQForCausalLM.from_quantized(
"TheBloke/Llama-3-8B-GPTQ",
use_safetensors=True,
device_map="auto"
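)

# AutoAWQ Example: quantize() is called on a loaded model and takes the tokenizer;
# model_path is a placeholder for your FP16 checkpoint.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "path/to/fp16-model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoAWQForCausalLM.from_pretrained(model_path)
model.quantize(
    tokenizer,
    quant_config={"zero_point": True, "q_group_size": 128}
)

# bitsandbytes Example
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig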
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3-8B",
quantization_config=bnb_config
)

# Dynamic quantization for LSTM
quantized_model = torch.quantization.quantize_dynamic(
model,
{torch.nn.LSTM},
dtype=torch.qint8
)

| Tool/Library | PyTorch | TensorFlow | ONNX | Requirements |
|---|---|---|---|---|
| Optimum | β | β | β | transformers>=4.25 |
| TFLite | β | β | β | tensorflow>=2.10 |
| Intel Neural Compressor | β | β | β | neural-compressor>=2.0 |
| Metric | FP32 | INT8 | INT4 |
|---|---|---|---|
| Model Size (MB) | 420 | 105 | 55 |
| Latency (ms) | 45 | 22 | 30 |
| Accuracy | 84.5% | 83.1% | 82.3% |
| RAM Usage (MB) | 1200 | 600 | 450 |
| Energy (Joules) | 12.5 | 5.8 | 7.2 |
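
To read the table as ratios rather than raw numbers, the short script below derives compression, speedup, accuracy drop, and energy saving relative to FP32. It uses only the values from the table. The dictionary layout and the function name are illustrative assumptions.

```python
# Values copied from the performance table above.
table = {
    "FP32": {"size_mb": 420, "latency_ms": 45, "accuracy": 84.5, "energy_j": 12.5},
    "INT8": {"size_mb": 105, "latency_ms": 22, "accuracy": 83.1, "energy_j": 5.8},
    "INT4": {"size_mb": 55, "latency_ms": 30, "accuracy": 82.3, "energy_j": 7.2},
}

def relative_to_fp32(name):
    """Express one row of the table relative to the FP32 baseline."""
    base, row = table["FP32"], table[name]
    return {
        "compression": base["size_mb"] / row["size_mb"],
        "speedup": base["latency_ms"] / row["latency_ms"],
        "accuracy_drop": base["accuracy"] - row["accuracy"],   # percentage points
        "energy_saving": 1 - row["energy_j"] / base["energy_j"],
    }

for name in ("INT8", "INT4"):
    r = relative_to_fp32(name)
    print(f"{name}: {r['compression']:.1f}x smaller, {r['speedup']:.1f}x faster, "
          f"-{r['accuracy_drop']:.1f} pts accuracy, {r['energy_saving']:.0%} less energy")
```

On these figures INT8 is 4.0× smaller and about 2.0× faster, while INT4 is about 7.6× smaller but only 1.5× faster, so the smallest model is not the fastest one here.
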
| Hardware | PTQ | QAT | 4-bit | Notes |
|---|---|---|---|---|
| Intel CPUs | β | β | β | AVX-512 required for INT8 |
| NVIDIA GPUs | β | β | β | Ampere+ for 4-bit |
| ARM Cortex-M | β | β | β | Limited to INT8 |
| Apple M1/M2 | β | β | β | CoreML compatibility |
graph TD
A[Quantization Checklist] --> B[Method Selection]
A --> C[Data Preparation]
A --> D[Quantization]
A --> E[Validation]
A --> F[Export]
B --> B1[PTQ]
B --> B2[QAT]
C --> C1[Calibration Data]
C --> C2[Input Shapes]
D --> D1[Weights: INT8]
D --> D2[Weights: 4-bit]
D1 --> D11[Activations: Static]
D1 --> D12[Activations: Dynamic]
D2 --> D21[Group-wise Scaling]
E --> E1[Accuracy Check]
E --> E2[Latency Test]
E1 --> E11[<2% Drop]
E2 --> E21[Target Threshold]
F --> F1[ONNX]
F --> F2[TFLite]
style A fill:#2e7d32,stroke:#1b5e20,color:black
style B fill:#1565c0,stroke:#0d47a1,color:black
style C fill:#1565c0,stroke:#0d47a1,color:white
style D fill:#1565c0,stroke:#0d47a1,color:white
style E fill:#1565c0,stroke:#0d47a1,color:white
style F fill:#4CAF50,stroke:#388E3C,color:black
Visual Legend:
- 🟦 Blue Boxes: Main Checklist Items
- 🟩 Green Boxes: Actionable Tasks
- ⬛ Black Diamonds: Data Requirements
- Arrows: Workflow Sequence
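
The validation branch of the checklist (accuracy drop under 2%, latency within a target threshold) can be automated as a simple gate. The sketch below assumes the 2% limit means percentage points of accuracy. The 25 ms latency target and all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ValidationResult:
    accuracy_drop: float   # percentage points lost versus the FP32 baseline
    accuracy_ok: bool
    latency_ok: bool

    @property
    def passed(self) -> bool:
        return self.accuracy_ok and self.latency_ok

def validate(baseline_acc, quantized_acc, latency_ms, latency_target_ms, max_drop=2.0):
    """Apply the checklist's accuracy and latency checks to one quantized model."""
    drop = baseline_acc - quantized_acc
    return ValidationResult(
        accuracy_drop=drop,
        accuracy_ok=drop < max_drop,
        latency_ok=latency_ms <= latency_target_ms,
    )

# Accuracy and latency taken from the performance table; 25 ms is a made-up target.
int8_result = validate(84.5, 83.1, latency_ms=22, latency_target_ms=25)
int4_result = validate(84.5, 82.3, latency_ms=30, latency_target_ms=25)
print("INT8 passed:", int8_result.passed)
print("INT4 passed:", int4_result.passed)
```
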
| Issue | Root Cause | Solution |
|---|---|---|
| Severe Accuracy Drop | Poor calibration data | Use larger/diverse dataset |
| Runtime Errors | Unsupported ops | Check framework compatibility (e.g., torch.quantized_lstm) |
| Model Bloat | Mixed precision | Force INT8-only conversion |
| Calibration Crash | Input range mismatch | Normalize inputs to [0, 1] |
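
Calibration problems often come down to how the activation range is chosen. The sketch below contrasts a plain min/max range with a percentile-clipped one on synthetic data containing a single outlier. The nearest-rank percentile, the 0.1/99.9 cut-offs, and the sample data are simplifying assumptions, not the behavior of any specific framework's observer.

```python
def minmax_range(samples):
    """Use the full observed range, outliers included."""
    return min(samples), max(samples)

def percentile_range(samples, lower=0.1, upper=99.9):
    """Clip the range to the given percentiles (nearest-rank approximation)."""
    ordered = sorted(samples)

    def pick(percent):
        index = round(percent / 100 * (len(ordered) - 1))
        return ordered[index]

    return pick(lower), pick(upper)

def int8_scale(lo, hi):
    """Step size when the range [lo, hi] is spread over 256 INT8 levels."""
    return (hi - lo) / 255.0

# Synthetic activations: 1000 values in [0, 1) plus one outlier at 50.0.
activations = [i / 1000 for i in range(1000)] + [50.0]

raw = minmax_range(activations)
clipped = percentile_range(activations)
print(f"min/max range {raw} -> scale {int8_scale(*raw):.4f}")
print(f"clipped range {clipped} -> scale {int8_scale(*clipped):.4f}")
```

The clipped range gives a much finer step for the bulk of the values at the cost of saturating the outlier, which is usually the better trade when outliers are rare.
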
graph LR
A[Original FP32 Model] --> B{Quantization Type}
B -->|PTQ| C[Calibrate with Dataset]
B -->|QAT| D[Fine-Tune with FakeQuant]
C --> E[Validate Metrics]
D --> E
E -->|Pass| F[Deploy Quantized Model]
E -->|Fail| G[Adjust Calibration]
| Tool | Framework | Use Case | Command/API Example |
|---|---|---|---|
| Optimum Intel | PyTorch | CPU-Optimized Quantization | OVQuantizer.from_pretrained(model) |
| TFLite Converter | TensorFlow | Mobile Deployment | tf.lite.Optimize.DEFAULT |
| bitsandbytes | PyTorch | 4-bit LLMs | BitsAndBytesConfig(load_in_4bit=True) |
| AutoGPTQ | Transformers | GPTQ 4-bit | AutoGPTQForCausalLM.from_quantized() |
- ONNX Runtime: onnxruntime.quantization.quantize_static()
- NVIDIA TensorRT: trt.Builder.create_network() (FP16/INT8)
- Apple CoreML: coremltools.convert() (iOS/macOS)
| Scenario | Solution | Toolchain |
|---|---|---|
| Low Latency | INT8 Static PTQ | PyTorch/Optimum + ONNX |
| LLM Deployment | 4-bit GPTQ/AWQ | Hugging Face + AutoGPTQ/AutoAWQ (or bitsandbytes for NF4) |
| Accuracy Critical | QAT with Layer-wise Calibration | TensorFlow/PyTorch QAT |
- 4-bit Quantization: Requires Ampere GPUs (NVIDIA A100/RTX 30xx+).
- Dynamic Quantization: Not supported for all ops (e.g., LayerNorm).
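
The Ampere requirement can be checked before loading a model. Ampere GPUs report CUDA compute capability 8.0 or higher, so a sketch of the check looks like the following. The helper names are assumptions, and the PyTorch lookup is optional and returns `None` when PyTorch or a CUDA device is unavailable.

```python
def is_ampere_or_newer(capability):
    """True if a (major, minor) CUDA compute capability is 8.0 or higher."""
    return tuple(capability) >= (8, 0)

def current_gpu_capability():
    """Return the active GPU's compute capability, or None if it cannot be read."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()

if __name__ == "__main__":
    capability = current_gpu_capability()
    if capability is None:
        print("No CUDA GPU detected (or PyTorch is not installed).")
    elif is_ampere_or_newer(capability):
        print(f"Compute capability {capability}: meets the 4-bit requirement above.")
    else:
        print(f"Compute capability {capability}: older than Ampere.")
```
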
| Library | Quantization Support | Version Requirement |
|---|---|---|
| PyTorch | PTQ/QAT | >=2.0 |
| TensorFlow | TFLite PTQ | >=2.10 |
| Transformers | 4-bit/AWQ/GPTQ | >=4.31 |
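
The version requirements above can be verified at startup with the standard library alone. This sketch assumes the pip distribution names `torch`, `tensorflow`, and `transformers`, and compares only the leading numeric part of each version string. The function names are illustrative.

```python
import re
from importlib import metadata

# Minimum versions from the table above, keyed by assumed pip distribution name.
MINIMUMS = {"torch": "2.0", "tensorflow": "2.10", "transformers": "4.31"}

def parse_version(text):
    """Leading numeric release segment as a tuple of ints, e.g. '2.1.0+cu121' -> (2, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(part) for part in match.group().split(".")) if match else ()

def meets_minimum(installed, minimum):
    """Compare versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

def check_environment(minimums=MINIMUMS):
    """Report, per package, whether the installed version satisfies the minimum."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "not installed"
            continue
        if meets_minimum(installed, minimum):
            report[package] = f"ok ({installed})"
        else:
            report[package] = f"needs >={minimum}, found {installed}"
    return report

for package, status in check_environment().items():
    print(f"{package}: {status}")
```

Comparing numeric tuples rather than strings matters here: as text, "2.9" sorts after "2.10", which would wrongly accept TensorFlow 2.9.
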