MindPipe is a unified compression and evaluation framework for large language models and vision-language models. It provides one CLI entrypoint for post-training quantization, quantization-aware training, pruning, perplexity evaluation, zero-shot evaluation, and VLMEvalKit-based multimodal evaluation.
The framework is designed for reproducible research across GPU and NPU backends, with shared model loading, dataset handling, device management, and result serialization.
- Unified `main.py` entrypoint for quantization, pruning, compression pipelines, and evaluation-only runs.
- 11 quantization methods registered in-tree, including PTQ and QAT-style methods.
- 7 pruning methods registered in-tree, covering unstructured, semi-structured, and structured pruning.
- Text and vision-language model support through a shared model adapter layer.
- GPU and NPU device abstraction for cache management, synchronization, seeds, and dtype policy.
- Per-run artifacts and metrics written as JSON for downstream aggregation.
- Reproducibility scripts for common text and multimodal benchmark suites.
```
MindPipe/
├── main.py              # Unified CLI entrypoint
├── algorithm/
│   ├── common/          # Shared model, data, device, IO utilities
│   ├── quantization/
│   │   ├── ptq/         # AWQ, GPTQ, MQuant, OmniQuant, QuaRot, SmoothQuant, SpinQuant
│   │   └── qat/         # FlatQuant, QLoRA, QA-LoRA, SplitQuant
│   └── pruning/
│       ├── structured/  # FLAP, LLM-Pruner, ShortGPT, Wanda-SP
│       └── unstructured/ # ALPS, SparseGPT, Wanda
├── workflow/            # CLI config builder and stage executor
├── evaluation/          # PPL, lm-eval-harness, and VLMEvalKit runners
├── configs/             # Shared and algorithm-specific configs
├── scripts/             # Batch and reproducibility scripts
└── third_party/         # Optional external evaluation tools
```
| Method | Family | Main Coverage | NPU Status |
|---|---|---|---|
| `awq` | PTQ | Weight-only quantization with activation-aware scaling | Ready |
| `gptq` | PTQ | Weight-only GPTQ quantization | Ready |
| `mquant` | PTQ | Multimodal GPTQ/AWQ-style quantization for language and visual branches | Not ready |
| `omniquant` | PTQ | Learnable weight and activation transformation | Ready |
| `quarot` | PTQ | Rotation-based W/A/KV quantization | Not ready |
| `smoothquant` | PTQ | Activation smoothing for W/A quantization | Ready |
| `spinquant` | PTQ | Rotation-based W/A/KV quantization with SpinQuant-style hooks | Not ready |
| `flatquant` | QAT | FlatQuant-style trainable transformations | Ready |
| `qlora` | QAT | QLoRA and low-bit fake-quant adapter training | Ready, with experimental NPU fake-quant fallback |
| `qalora` | QAT | Basic QA-LoRA group-pooled adapter training | CUDA only |
| `splitquant` | QAT | SplitQuant-style trainable transformations | Ready |
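As an illustration, a QAT-family method is invoked through the same entrypoint as the PTQ methods. The sketch below is a minimal QLoRA-style run; method-specific training hyperparameters are not shown here and are read from the per-algorithm configs under `configs/algorithms/`:

```bash
# Minimal QLoRA-style sketch; training options beyond these flags are
# method-specific and omitted here.
CUDA_VISIBLE_DEVICES=0 python main.py \
  --quantization qlora \
  --model_path /path/to/model \
  --device_map auto \
  --weight_bits 4 \
  --eval_ppl true \
  --output_dir ./results/qlora
```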
| Method | Type | Default Calibration Dataset | NPU Status |
|---|---|---|---|
| `alps` | Unstructured and n:m semi-structured | c4 | Ready |
| `flap` | Structured | wikitext2 | Ready |
| `llm_pruner` | Structured | c4 | Ready |
| `shortgpt` | Layer pruning | pg19 | Ready |
| `sparsegpt` | Unstructured and n:m semi-structured | c4 | Ready |
| `wanda` | Unstructured and n:m semi-structured | c4 | Ready |
| `wanda_sp` | Structured | c4 | Ready |
MindPipe has been adapted across text-only and multimodal model families, including:
- LLaMA-family text models, including LLaMA-2 and LLaMA-3 style checkpoints.
- Qwen2.5 text models.
- Qwen3 text models.
- Qwen3.5 text/language-only paths.
- Qwen2-VL, Qwen2.5-VL, and Qwen3-VL multimodal paths.
- MiniCPM-V language and multimodal paths for selected quantization flows.
- LLaVA and InternVL loader compatibility paths where supported by the local Transformers environment.
Model support is algorithm-dependent. The most reliable way to check current support is to inspect each method under `algorithm/quantization/*/*/method.py` or `algorithm/pruning/*/*/method.py`, and the model-specific configs under `configs/algorithms/`.
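For example, a quick way to enumerate the in-tree implementations and the available configs from a checkout (the glob patterns follow the repository layout above; adjust if your checkout differs):

```bash
# List every registered method implementation and the per-algorithm configs.
ls algorithm/quantization/*/*/method.py algorithm/pruning/*/*/method.py
ls configs/algorithms/
```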
- Completed GPU validation for AWQ W4A16 on Qwen3, Qwen3-VL, Qwen3.5, Qwen2-VL, and LLaVA-1.5. Text-side PPL runs completed successfully with no obvious anomalies.
- Qwen2-VL and Qwen3-VL completed VLMEvalKit evaluation on the validated multimodal datasets. AWQ W4A16 showed acceptable accuracy degradation compared with FP16.
- The evaluation framework did not yet support Qwen3.5 and LLaVA-1.5 multimodal evaluation at that time, so only text-side validation was completed for those models.
- Completed NPU validation for AWQ W4A16 on Qwen3, Qwen3-VL, Qwen3.5, Qwen2-VL, and LLaVA-1.5. Text-side PPL runs completed successfully with no obvious anomalies.
- Completed GPU adaptation and validation for MQuant on Qwen3-VL, Qwen2-VL, and Qwen2.5-VL. Text-side PPL runs completed successfully with no obvious anomalies.
- Qwen3-VL, Qwen2-VL, and Qwen2.5-VL completed VLMEvalKit evaluation on the validated multimodal datasets. The visual W8A8 plus language W4A8 setting showed acceptable accuracy degradation compared with FP16.
- Completed GPU adaptation and validation for QuaRot and SpinQuant on Qwen3, Qwen3.5, Qwen3-VL, and Qwen2-VL. Text-side PPL runs completed successfully with no obvious anomalies.
- Qwen3-VL, Qwen2-VL, and Qwen2.5-VL completed VLMEvalKit evaluation on the validated multimodal datasets. The W4A8 setting showed acceptable accuracy degradation compared with FP16.
Set up the environment and dependencies:

```bash
conda activate mindpipe
git submodule update --init --recursive
python -m pip install -r requirements.txt
```

If VLMEvalKit evaluation is required, initialize the VLMEvalKit submodule or set `VLMEVALKIT_ROOT` to an existing checkout.
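For example, pointing MindPipe at an existing checkout (the path below is illustrative):

```bash
# Reuse an existing VLMEvalKit checkout instead of the bundled submodule.
export VLMEVALKIT_ROOT=/path/to/VLMEvalKit
```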
Quantization and pruning runs require `--device_map`; this applies to single-GPU runs as well as multi-GPU runs. The recommended pattern is:
```bash
CUDA_VISIBLE_DEVICES=0 python main.py \
  --quantization awq \
  --model_path /path/to/model \
  --device_map auto \
  --dtype float16 \
  --attn_implementation sdpa \
  --calibration_dataset pileval \
  --evaluation_dataset wikitext2 \
  --calibration_samples 128 \
  --sequence_length 2048 \
  --weight_bits 4 \
  --group_size 128 \
  --eval_ppl true \
  --output_dir ./results/awq
```

This policy keeps model placement under the Hugging Face Accelerate dispatch hooks. Avoid manually moving compressed models with `.to(device)` after loading with `device_map`.
An evaluation-only run (PPL plus lm-eval-harness zero-shot tasks, no compression stage):

```bash
CUDA_VISIBLE_DEVICES=0 python main.py \
  --model_path /path/to/model \
  --device_map auto \
  --dtype float16 \
  --attn_implementation sdpa \
  --evaluation_dataset wikitext2 \
  --sequence_length 2048 \
  --batch_size 1 \
  --max_eval_chunks 64 \
  --eval_ppl true \
  --eval_zero_shot true \
  --zero_shot_tasks boolq piqa rte winogrande arc_easy arc_challenge openbookqa \
  --zero_shot_num_fewshot 0 \
  --zero_shot_batch_size 1 \
  --output_dir ./results/fp_eval
```

A GPTQ W4A16 quantization run:

```bash
CUDA_VISIBLE_DEVICES=0 python main.py \
  --quantization gptq \
  --model_path /path/to/model \
  --device_map auto \
  --dtype float16 \
  --attn_implementation sdpa \
  --calibration_dataset pileval \
  --evaluation_dataset wikitext2 \
  --calibration_samples 128 \
  --sequence_length 2048 \
  --weight_bits 4 \
  --activation_bits 16 \
  --group_size 128 \
  --weight_group_size 128 \
  --eval_ppl true \
  --output_dir ./results/gptq
```

A Wanda unstructured pruning run at 50% sparsity:

```bash
CUDA_VISIBLE_DEVICES=0 python main.py \
  --pruning wanda \
  --model_path /path/to/model \
  --device_map auto \
  --dtype float16 \
  --attn_implementation sdpa \
  --calibration_dataset c4 \
  --calibration_samples 128 \
  --sequence_length 2048 \
  --sparsity_ratio 0.5 \
  --eval_ppl true \
  --output_dir ./results/wanda
```

A combined workflow that prunes with Wanda-SP and then quantizes with GPTQ:

```bash
CUDA_VISIBLE_DEVICES=0,1 python main.py \
  --pruning wanda_sp \
  --quantization gptq \
  --execution_order pruning_then_quantization \
  --model_path /path/to/model \
  --device_map auto \
  --dtype float16 \
  --attn_implementation sdpa \
  --calibration_dataset c4 \
  --calibration_samples 128 \
  --sequence_length 2048 \
  --sparsity_ratio 0.2 \
  --weight_bits 4 \
  --group_size 128 \
  --eval_ppl true \
  --output_dir ./results/workflow
```

MindPipe integrates VLMEvalKit through `evaluation/vlm_eval.py`. A typical VLM evaluation command is:
```bash
CUDA_VISIBLE_DEVICES=0 python main.py \
  --model_path /path/to/vlm \
  --device_map auto \
  --dtype float16 \
  --attn_implementation sdpa \
  --eval_ppl false \
  --eval_zero_shot false \
  --eval_vlm true \
  --vlm_datasets OCRBench TextVQA_VAL ChartQA_TEST InfoVQA_VAL \
  --vlm_mode all \
  --vlm_api_nproc 1 \
  --vlm_eval_kit_root /path/to/VLMEvalKit \
  --output_dir ./results/vlm_eval
```

Use `--num_samples` for smoke tests and `--vlm_resume true` to reuse existing per-dataset artifacts when available.
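For instance, a quick smoke test that limits the sample count and resumes from prior artifacts; the dataset choice and sample count below are illustrative:

```bash
# Smoke test on a single dataset; the --num_samples value is illustrative.
CUDA_VISIBLE_DEVICES=0 python main.py \
  --model_path /path/to/vlm \
  --device_map auto \
  --eval_vlm true \
  --vlm_datasets OCRBench \
  --num_samples 8 \
  --vlm_resume true \
  --vlm_eval_kit_root /path/to/VLMEvalKit \
  --output_dir ./results/vlm_smoke
```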
| Argument | Default | Description |
|---|---|---|
| `--model_path` | Required | Local or Hugging Face model path |
| `--device` | `auto` | Logical device used by runtime helpers |
| `--device_map` | None | Required for pruning and quantization; recommended value: `auto` |
| `--dtype` | `bfloat16` | `auto`, `float16`, or `bfloat16` |
| `--attn_implementation` | `flash_attention_2` | `flash_attention_2`, `sdpa`, or `eager` |
| `--calibration_dataset` | Method default | `wikitext2`, `c4`, `pileval`, `pg19`, or `bookcorpus` |
| `--evaluation_dataset` | `wikitext2` | Dataset used for PPL evaluation |
| `--calibration_samples` | 128 | Number of calibration samples |
| `--sequence_length` | 2048 in many scripts | Sequence length for calibration and evaluation |
| `--batch_size` | 1 | PPL batch size |
| `--max_eval_chunks` | 64 | Optional cap on the number of PPL chunks |
| `--eval_ppl` | false | Enable perplexity evaluation |
| `--eval_zero_shot` | false | Enable lm-eval-harness tasks |
| `--eval_vlm` | false | Enable VLMEvalKit evaluation |
| Argument | Default | Description |
|---|---|---|
| `--quantization` | None | One of the registered quantization methods |
| `--weight_bits` | 4 | Weight quantization bit width |
| `--activation_bits` | 16 | Activation quantization bit width |
| `--query_bits` | 16 | Query activation bit width for supported methods |
| `--key_bits` | 16 | Key cache bit width for supported methods |
| `--value_bits` | 16 | Value cache bit width for supported methods |
| `--group_size` | 128 | Default quantization group size |
| `--weight_group_size` | None | Overrides the weight group size |
| `--activation_group_size` | None | Overrides the activation group size |
| `--kv_group_size` | None | Overrides the KV-cache group size |
| `--weight_method` | `gptq` | Weight method for methods that support GPTQ or RTN |
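As an illustration of how these flags compose, the sketch below configures a rotation-based W4A8 run with a 4-bit KV cache. Whether a given method honors the query/key/value bit widths is method-dependent (see the quantization methods table above), so treat this as a sketch rather than a validated configuration:

```bash
# Illustrative W4A8 + KV4 configuration; the KV-cache flags apply only to
# methods that support them (e.g., the rotation-based methods).
CUDA_VISIBLE_DEVICES=0 python main.py \
  --quantization quarot \
  --model_path /path/to/model \
  --device_map auto \
  --weight_bits 4 \
  --activation_bits 8 \
  --key_bits 4 \
  --value_bits 4 \
  --kv_group_size 128 \
  --eval_ppl true \
  --output_dir ./results/quarot_w4a8kv4
```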
| Argument | Default | Description |
|---|---|---|
| `--pruning` | None | One of the registered pruning methods |
| `--sparsity_ratio` | 0.5 | Target sparsity ratio |
| `--structure_pattern` | `unstructured` | `unstructured`, `2:4`, or `4:8` where supported |
| `--block_size` | 128 | Block size for supported pruning methods |
| `--damp_percent` | 0.01 | Hessian damping ratio for second-order methods |
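For example, a 2:4 semi-structured variant of the earlier Wanda run; pattern support is method-dependent per the pruning methods table:

```bash
# 2:4 semi-structured pruning sketch; wanda supports n:m patterns
# per the pruning methods table above.
CUDA_VISIBLE_DEVICES=0 python main.py \
  --pruning wanda \
  --model_path /path/to/model \
  --device_map auto \
  --calibration_dataset c4 \
  --sparsity_ratio 0.5 \
  --structure_pattern 2:4 \
  --eval_ppl true \
  --output_dir ./results/wanda_2to4
```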
The `scripts/repro/` directory contains serial benchmark launchers for adapted model families and algorithm paths. Examples include:

- `scripts/repro/run_qlora_adapted_models_text_suite.sh`
- `scripts/repro/run_qalora_adapted_models_text_suite.sh`
- `scripts/repro/run_mquantpp_awq_vlm_serial_suite.sh`
- `scripts/repro/run_qwen2_5_vl_gptq_vlm_suite.sh`
- `scripts/repro/run_qwen3_vl_2b_gptq_suite.sh`

Use `DRY_RUN=true` to print commands without executing them, and use `MODEL_FILTER=<model_key>` when the script supports model-level filtering, as in the sketch below.
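A hedged example invocation; the model key is illustrative and depends on the keys defined in the script:

```bash
# Print the suite's commands without running them; the MODEL_FILTER value
# is illustrative.
DRY_RUN=true MODEL_FILTER=qwen3 \
  bash scripts/repro/run_qlora_adapted_models_text_suite.sh
```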
Each run writes metrics and artifacts under the resolved output directory:

```
results/
├── <model>/<algorithm>/<run_spec>/metrics.json
├── <model>/<algorithm>/<run_spec>/artifacts.json
└── <model>/<workflow>/<run_spec>/metrics.json
```

`metrics.json` stores evaluation results and run metadata. `artifacts.json` stores algorithm-specific details such as quantized layers, adapter paths, calibration settings, and generated checkpoint locations.
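Because every run emits JSON, downstream aggregation can stay simple. A minimal sketch using `jq`; the exact keys inside `metrics.json` are method- and evaluation-specific, so inspect a real file for the schema:

```bash
# List every metrics.json under results/ and pretty-print its contents.
# Keys vary by algorithm and evaluation mode; inspect a real run first.
find results -name metrics.json -print | while read -r f; do
  echo "== $f"
  jq . "$f"
done
```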
- QuaRot and SpinQuant are not marked NPU-ready in the current registry.
- MQuant is currently GPU-oriented and not marked NPU-ready.
- QA-LoRA is a basic CUDA-only implementation and does not export an AutoGPTQ packed checkpoint.
- QLoRA uses bitsandbytes for CUDA W4 when available; W2/W3 and NPU paths use the in-tree fake-quant fallback.
- Support for saved-model reload after methods that insert custom runtime wrappers is method-dependent.
MindPipe vendors or adapts ideas and implementation components from several model compression projects, including AWQ, GPTQ, QuaRot, SpinQuant, FlatQuant, SmoothQuant, OmniQuant, SplitQuant, QLoRA, QA-LoRA, Wanda, SparseGPT, FLAP, ShortGPT, LLM-Pruner, and ALPS. Please cite the original method papers when using the corresponding algorithms.