Here is the development roadmap for H2 2025. We will pin this roadmap in Issues and track most of our subsequent work by updating it there. In MLLM's documentation, we will archive each version of the roadmap and provide an outlook alongside it. Contributions and feedback are welcome.
We plan to release one major MLLM version per year. The version for H2 2025 will be 2.0.0; the main updates planned for it are listed in the Focus section.
Focus
- Refactoring from mllm-v1: implement a more streamlined project structure; introduce a simple, user-friendly eager mode; provide an MLLM static graph IR (see the sketch after this list)
- Support for more backends: CANN (P0), CUDA and AMD NPU (P1)
- Experimental: compile the MLLM static graph IR to the NPU backend
- Provide user-friendly components such as pymllm, mllm-cli, and MllmCSdk to expand the adoption of the MLLM project
- Enhance the benchmarking system with a focus on optimizing Arm Kernels
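As a point of reference for the eager-mode and static-graph-IR items above, here is a minimal, self-contained C++ toy. None of these types are the MLLM API, which is still being designed; the point is only the contrast: eager mode executes each op immediately, while a static graph records ops into an IR first and runs them later, which is what makes the experimental IR-to-NPU compilation possible.

```cpp
// Toy illustration only -- none of these types exist in MLLM; the real
// eager mode and static graph IR will have their own APIs.
#include <cstdio>
#include <functional>
#include <vector>

// Eager style: the op executes the moment it is called.
float eager_add(float a, float b) { return a + b; }

// Graph style: calls are recorded into a tiny "IR" and run later,
// giving a compiler the chance to optimize or lower the whole program.
struct Graph {
  std::vector<std::function<void()>> ops;
  void add_op(std::function<void()> op) { ops.push_back(std::move(op)); }
  void run() { for (auto& op : ops) op(); }
};

int main() {
  // Eager mode: immediate result, easy to debug.
  std::printf("eager: %f\n", eager_add(1.0f, 2.0f));

  // Static graph: record first, execute (or compile) afterwards.
  Graph g;
  float out = 0.0f;
  g.add_op([&] { out = eager_add(1.0f, 2.0f); });
  g.run();
  std::printf("graph: %f\n", out);
  return 0;
}
```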
Engine
- `inplace` and `redirect` API for memory reuse. Check `mllm/models/qwen3/modeling_qwen3_fa2.hpp`. @chenghuaWang
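A minimal sketch of the general idea behind this item, with hypothetical names (the real usage lives in the file referenced above): an in-place op writes its result into the input's own buffer, and a redirect makes a "new" tensor alias an existing allocation instead of allocating fresh memory.

```cpp
// Hypothetical sketch -- Tensor, inplace_relu, and redirect are illustrative
// names, not the real MLLM API; see modeling_qwen3_fa2.hpp for real usage.
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

struct Tensor {
  std::shared_ptr<std::vector<float>> buf;  // shared so tensors can alias
  explicit Tensor(std::size_t n) : buf(std::make_shared<std::vector<float>>(n)) {}
};

// In-place op: overwrites the input buffer, no new allocation.
void inplace_relu(Tensor& t) {
  for (float& x : *t.buf) x = std::max(x, 0.0f);
}

// Redirect: the output tensor reuses the storage of an existing one.
Tensor redirect(const Tensor& src) {
  Tensor dst(0);
  dst.buf = src.buf;  // alias, not copy
  return dst;
}
```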
Model coverage
Kernels
- Arm: Support KleidiAI SME kernels, as the latest SoCs include SME capability. @chenghuaWang
- Arm: Improve Mllm-Blas performance.
- X86: FP32 Kernels built on top of highway. @HayzelHan
- X86: Quantized kernels using the GGUF format. @HayzelHan
- ✔️ Commons: Paged Attention kernels based on mllm's zen-file system. @chenghuaWang
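For the Paged Attention item, a minimal sketch of the core bookkeeping under assumed names and sizes (the real kernels tie into mllm's zen-file system): the KV cache lives in fixed-size pages, and a per-sequence block table maps a logical token position to a (physical page, slot) pair, so cache memory need not be contiguous.

```cpp
// Illustrative sketch only; the page size and structures are assumptions,
// not the actual MLLM kernel layout.
#include <cstddef>
#include <utility>
#include <vector>

constexpr std::size_t kPageSize = 16;  // tokens per KV page (assumed)

struct PagedKVCache {
  std::vector<std::vector<float>> pages;   // each page holds kPageSize K/V rows
  std::vector<std::size_t> block_table;    // logical page index -> physical page id

  // Translate a logical token position into (physical page, slot in page).
  std::pair<std::size_t, std::size_t> locate(std::size_t token_pos) const {
    std::size_t logical_page = token_pos / kPageSize;
    std::size_t slot = token_pos % kPageSize;
    return {block_table[logical_page], slot};
  }
};
```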
Backends
Performance
- Benchmark MLLM, llama.cpp, and MNN under Q4_K-like quantization settings. @jialilve
- ✔️ Fast version of Qwen3 via: 1. manual memory planning; 2. fused kernels; 3. in-place operators; etc. @chenghuaWang
- ✔️ Use Tracy and Perfetto for performance measurement (see the sketch after this list). @chenghuaWang
- (Optional) ARM PMU Tools setup
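For the Tracy item above, instrumenting a hot path is typically one macro per scope. Below is a minimal example using Tracy's real C++ API; the kernel itself is just a placeholder, and the program must be built with TRACY_ENABLE defined and linked against the Tracy client.

```cpp
// Requires the Tracy client: compile with -DTRACY_ENABLE and link TracyClient.
#include <tracy/Tracy.hpp>

// Placeholder kernel; any hot function can be zoned the same way.
void matmul_kernel(const float* a, const float* b, float* c, int n) {
  ZoneScopedN("matmul_kernel");  // named zone shown in the Tracy timeline
  for (int i = 0; i < n * n; ++i) c[i] = 0.0f;
  for (int i = 0; i < n; ++i)
    for (int k = 0; k < n; ++k)
      for (int j = 0; j < n; ++j)
        c[i * n + j] += a[i * n + k] * b[k * n + j];
}

int main() {
  float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8}, c[4];
  for (int frame = 0; frame < 3; ++frame) {
    matmul_kernel(a, b, c, 2);
    FrameMark;  // marks a frame boundary in the profiler
  }
  return 0;
}
```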
Quantization
- kai: Quantize on any machine, pack on Arm (make `mllm-convertor --pipeline xxx_kai_pipeline` available on any device).
- GGUF: GGUF Q4_K and Q6_K quantization methods for the `.mllm` file format. @HayzelHan
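For reference on the Q4_K item, here is a C++ mirror of the block_q4_K layout from ggml's k-quants (llama.cpp), with ggml's fp16 type stood in by uint16_t since only the byte layout matters here: each super-block holds 256 weights as 4-bit quants plus packed 6-bit sub-block scales and mins, giving 144 bytes per block, i.e. 4.5 bits per weight.

```cpp
// Layout mirrors ggml's block_q4_K (llama.cpp k-quants); ggml_half is
// represented as uint16_t because only the byte layout is relevant.
#include <cstdint>

constexpr int QK_K = 256;          // weights per super-block
constexpr int K_SCALE_SIZE = 12;   // packed 6-bit scales/mins for 8 sub-blocks

struct block_q4_K {
  uint16_t d;                      // fp16 super-block scale for the scales
  uint16_t dmin;                   // fp16 super-block scale for the mins
  uint8_t scales[K_SCALE_SIZE];    // 8 sub-blocks x (6-bit scale + 6-bit min)
  uint8_t qs[QK_K / 2];            // 256 x 4-bit quantized weights
};

// 2 + 2 + 12 + 128 = 144 bytes per 256 weights -> 4.5 bits per weight.
static_assert(sizeof(block_q4_K) == 144, "Q4_K block is 144 bytes");
```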
Compile
KV Cache Management
- Quantized KV cache: int8 per-token quantization (see the sketch after this list).
- ✔️ Prefix Cache and Paged Attention: support multi-turn chat. @chenghuaWang
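A minimal sketch of per-token int8 KV quantization (the function name and layout are assumptions, not MLLM's actual cache code): each token's K or V row gets its own fp32 scale derived from the row's maximum absolute value, and dequantization multiplies the int8 values back by that scale.

```cpp
// Illustrative sketch; the real cache layout in MLLM may differ.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Quantize one token's K (or V) row to int8 with a per-token scale.
// Returns the scale, which is stored alongside the row for dequantization.
float quantize_token_row(const float* row, std::size_t dim, int8_t* out) {
  float max_abs = 0.0f;
  for (std::size_t i = 0; i < dim; ++i)
    max_abs = std::max(max_abs, std::fabs(row[i]));
  float scale = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
  for (std::size_t i = 0; i < dim; ++i)
    out[i] = static_cast<int8_t>(std::lround(row[i] / scale));
  return scale;  // dequantization: row[i] ~= out[i] * scale
}
```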
Pymllm
- ✔️ Waiting: awaiting PyPI's approval of our organization application. pymllm is now available on macOS (`pip install pymllm`).
Production Stack
- ✔️ mllm-cli: API Server and CLI Chat Interface. @yuerqiqi
- MllmCSdk: MLLM C wrapper for mllm-cli and for use from other languages. @yuerqiqi
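A hypothetical sketch of what the MllmCSdk surface could look like; every symbol below is an assumption, since the wrapper is still in progress. The usual pattern is an opaque handle plus extern "C" functions, so the C++ engine can be called from C and, through C FFI, from most other languages.

```cpp
// Hypothetical API sketch -- none of these symbols exist yet in MllmCSdk.
#include <stddef.h>

#ifdef __cplusplus
extern "C" {
#endif

typedef struct MllmModel MllmModel;  // opaque handle hiding the C++ engine

// Load a .mllm model from disk; returns NULL on failure.
MllmModel* mllm_model_load(const char* path);

// Generate a completion for `prompt` into `out` (NUL-terminated,
// truncated to out_len bytes); returns 0 on success.
int mllm_generate(MllmModel* model, const char* prompt,
                  char* out, size_t out_len);

// Release the model and all associated resources.
void mllm_model_free(MllmModel* model);

#ifdef __cplusplus
}
#endif
```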