
feat: add kai&qnn-vl&opencl #489

Merged
yirongjie merged 23 commits into UbiquitousLearning:main from yirongjie:main
Oct 27, 2025

Conversation

@yirongjie (Collaborator) commented Oct 27, 2025

Summary by CodeRabbit

  • New Features

    • Added OpenCL backend support for GPU acceleration
    • Enhanced QNN/NPU backend capabilities for accelerated device inference
    • Expanded quantization format support with new optimization paths
    • Added example programs for multiple model architectures
  • Improvements

    • Performance optimizations for CPU operations via SIMD acceleration
    • Enhanced attention mechanisms for improved inference speed
    • Build system updates for better dependency management
    • Project structure reorganization for improved maintainability

UbiquitousLearning and others added 22 commits June 14, 2025 15:17
* Squashed commit of the following:

commit efde6d0d014b647b8ceea59441aef1bd3ac424c0
Author: yirongjie <yirj0809@gmail.com>
Date:   Tue May 27 16:09:16 2025 +0000

    fix: merge

commit fe7fb476717e99df2eac23ab7fd1088e03cf8b3c
Merge: f52bb32e 20e94c0
Author: yirongjie <yirj0809@gmail.com>
Date:   Tue May 27 16:09:08 2025 +0000

    Merge branch 'main' of https://github.com/yirongjie/mllm

commit f52bb32e5dbf4edcd4998d664ae071a1b5c8dbbb
Author: yirongjie <yirj0809@gmail.com>
Date:   Tue May 27 12:25:08 2025 +0000

    fix: merge from qnn-qwen2vl;

commit 6f6c2442f750363c6789e7717861ea3a216cf356
Author: yirongjie <yirj0809@gmail.com>
Date:   Tue May 27 12:24:17 2025 +0000

    Squashed commit of the following:

    commit 4862c76
    Author: oreomaker <zh002919@outlook.com>
    Date:   Thu May 15 14:59:37 2025 +0800

        refact: use hvx qnn silu(faster); usable showui npu version

    commit 5df1b07
    Author: oreomaker <zh002919@outlook.com>
    Date:   Wed May 14 22:10:52 2025 +0800

        feat: qnn dequantize_add hvx op

    commit c813f55
    Author: oreomaker <zh002919@outlook.com>
    Date:   Tue May 13 09:50:06 2025 +0800

        chore: format qnn op package code

    commit ea215f0
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon May 12 11:34:38 2025 +0800

        feat: free act tensors after qnn vit embedding

    commit e4f5011
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon May 12 11:14:30 2025 +0800

        chore: remove save data in modeling qwen2vlnpu

    commit 2dcb677
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon May 12 10:48:34 2025 +0800

        fix: seperate weights for embedding-lmhead when using rotated qwen2vl/showui

    commit 4847318
    Author: oreomaker <zh002919@outlook.com>
    Date:   Sun May 11 21:16:59 2025 +0800

        fix: cpu tensor free bug(todo: handle tensor free)

    commit 799b673
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Sat May 10 22:51:11 2025 +0800

        feat : new qwen2_vl model.

    commit dd1817d
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Sat May 10 22:50:35 2025 +0800

        feat : support qwen2-vl rotation model with fp bias.

    commit 305dc5c
    Author: oreomaker <zh002919@outlook.com>
    Date:   Thu May 8 21:37:35 2025 +0800

        feat: runnable qwen2vl qnn showui(2*256)

    commit 8e14815
    Author: oreomaker <zh002919@outlook.com>
    Date:   Thu May 8 21:36:33 2025 +0800

        fix: pre processing of qwen2vl

    commit e041296
    Author: oreomaker <zh002919@outlook.com>
    Date:   Thu May 8 21:34:07 2025 +0800

        refact: qwen vl npu modeling using closetFactor view(64->8x8)
        feat: get_position_id padding in Qwen2VL_ImagePatchAndEmbedding

    commit 5b17204
    Author: oreomaker <zh002919@outlook.com>
    Date:   Thu May 8 21:29:13 2025 +0800

        feat: vit(visual_xx) tensor reuse for qnn (noted as: QNN VLM trick)

    commit 7c42658
    Author: oreomaker <zh002919@outlook.com>
    Date:   Thu May 8 21:26:49 2025 +0800

        feat: finish cpu pipeline mrope

    commit 0962c00
    Author: oreomaker <zh002919@outlook.com>
    Date:   Tue May 6 11:39:29 2025 +0800

        feat: pipeline multimodal rope

    commit 5317933
    Author: oreomaker <zh002919@outlook.com>
    Date:   Tue May 6 11:38:10 2025 +0800

        refactor: use old&fast qnn silu

    commit 5bd14de
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon Apr 28 21:10:48 2025 +0800

        feat: runnable qwen 2 vl npu

    commit 1df6eed
    Author: oreomaker <zh002919@outlook.com>
    Date:   Sun Apr 27 10:13:44 2025 +0800

        refactor: tensor.to(QNN)

    commit d3d29c4
    Author: oreomaker <zh002919@outlook.com>
    Date:   Sat Apr 26 21:22:52 2025 +0800

        chore: remove saveData in qwen2vl modeling

    commit c40e0c0
    Author: oreomaker <zh002919@outlook.com>
    Date:   Sat Apr 26 20:51:16 2025 +0800

        feat: add qnn retrieve context info log

    commit 175d3a2
    Author: oreomaker <zh002919@outlook.com>
    Date:   Sat Apr 26 20:46:14 2025 +0800

        fix: qwen 2 vl npu input tensor backend(correct version)

    commit 871e920
    Author: oreomaker <zh002919@outlook.com>
    Date:   Fri Apr 25 09:50:05 2025 +0800

        fix: quantize i16 arm neon macro

    commit a2b802c
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Wed Apr 23 18:33:26 2025 +0800

        fix : Qwen2-VL prefill bugs: 1.FP32 KVCache. 2.LMHead does not execute.

    commit 8c66604
    Author: oreomaker <zh002919@outlook.com>
    Date:   Fri Apr 18 15:35:03 2025 +0800

        fix: restore qwen2.5 modeling

    commit f138beb
    Author: oreomaker <zh002919@outlook.com>
    Date:   Fri Apr 18 15:28:35 2025 +0800

        fix: restore debug change

    commit 09e12ce
    Merge: d725942 9b271a9
    Author: oreomaker <zh002919@outlook.com>
    Date:   Fri Apr 18 13:39:10 2025 +0800

        Merge branch 'debug-qwen2.5' of github.com:liang1232018/mllm into debug-qwen2.5

    commit d725942
    Author: oreomaker <zh002919@outlook.com>
    Date:   Fri Apr 18 13:39:04 2025 +0800

        dev: qnn sigmoid version silu
        feat: qnn backend f16 type input

    commit 9b271a9
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Fri Apr 18 13:24:52 2025 +0800

        fix : linear W8A8 bias uint8 type bug

    commit 793a6c6
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Fri Apr 18 13:23:49 2025 +0800

        fix : Shadow linear triger condition.

    commit 4e24bca
    Author: oreomaker <zh002919@outlook.com>
    Date:   Wed Apr 16 20:53:07 2025 +0800

        qwen 2.5 debug

    commit 4d74756
    Author: oreomaker <zh002919@outlook.com>
    Date:   Wed Apr 16 20:52:33 2025 +0800

        fix: shadow linear

    commit 5866e2b
    Author: oreomaker <zh002919@outlook.com>
    Date:   Tue Apr 15 22:17:12 2025 +0800

        qwen 2.5 debug

    commit 29e9b92
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon Apr 14 09:28:45 2025 +0800

        fix: remove shadow linear if(round_value) logic

    commit a61e837
    Author: oreomaker <zh002919@outlook.com>
    Date:   Sun Apr 13 22:03:45 2025 +0800

        feat: int16 qkv for qwen2.5 vl npu

    commit 566f21d
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Sun Apr 13 18:45:06 2025 +0800

        fix : modeling input quantize to I8, but dequantize with I16 bug.

    commit 60639d0
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Sun Apr 13 18:44:18 2025 +0800

        fix : LLaMADequantize INT16 to FP32 shuffle order bugs.

    commit a5cc652
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Sun Apr 13 17:31:10 2025 +0800

        fix : LLaMAQuantize FP32 to INT16 round scale error.

    commit f139822
    Author: oreomaker <zh002919@outlook.com>
    Date:   Sat Apr 12 22:24:30 2025 +0800

        fix: qnn int 16 linear bias(use int8 bias scale)

    commit 8831811
    Author: oreomaker <zh002919@outlook.com>
    Date:   Sat Apr 12 15:03:40 2025 +0800

        debug: qnn int16 linear

    commit 088fe09
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Fri Apr 11 23:22:41 2025 +0800

        feat : support INT16 dequantize and quantize.

    commit 73ebe87
    Merge: b73c1c3 6007443
    Author: liang1232018 <40791416+liang1232018@users.noreply.github.com>
    Date:   Wed Apr 9 14:50:25 2025 +0800

        Merge pull request UbiquitousLearning#12 from liang1232018/develop-zh

        Develop zh

    commit 6007443
    Merge: 1c8647e b73c1c3
    Author: liang1232018 <40791416+liang1232018@users.noreply.github.com>
    Date:   Wed Apr 9 14:50:07 2025 +0800

        Merge branch 'develop-xdl' into develop-zh

    commit 1c8647e
    Author: oreomaker <zh002919@outlook.com>
    Date:   Tue Apr 8 21:39:56 2025 +0800

        fix: qnn quant scale pow(2,bit) -> pow(2,bit-1)

    commit cc760ae
    Author: oreomaker <zh002919@outlook.com>
    Date:   Tue Apr 8 17:03:17 2025 +0800

        fix: op create param type->dtype

    commit 6afa80c
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon Apr 7 15:25:21 2025 +0800

        feat: Tensor::saveData only do when STATIC_READY

    commit 2ebded3
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon Apr 7 15:24:11 2025 +0800

        feat: add qnn int16 layer param & op
        todo: qnn llama package implement

    commit 4faeca8
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon Mar 24 15:52:54 2025 +0800

        dev: runnable qwen2vl npu (buggy)

    commit ebf110e
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon Mar 24 15:46:23 2025 +0800

        feat: add qwen vl export tool (todo: simulate infer and profile tools)

    commit bde9a92
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon Mar 24 15:44:25 2025 +0800

        dev: a just working version of qwen 2.5 npu

    commit 126c283
    Merge: 25de8c3 9d33aaf
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon Mar 24 15:43:30 2025 +0800

        Merge branch 'fix-qnn-python' into develop-zh

    commit 9d33aaf
    Author: oreomaker <zh002919@outlook.com>
    Date:   Fri Mar 21 16:01:23 2025 +0800

        fix: qnn profile quant bugs

    commit 25de8c3
    Author: oreomaker <zh002919@outlook.com>
    Date:   Thu Mar 20 16:00:19 2025 +0800

        refactor: add graph split layer for QNN, change the modeling
        note: xnnpack is affected, should not merge

    commit 690a24e
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon Mar 17 17:45:34 2025 +0800

        feat: QNN load cache execute

    commit 4f28330
    Author: oreomaker <zh002919@outlook.com>
    Date:   Sun Mar 9 22:33:21 2025 +0800

        dev: QNN graph merging execute

    commit b73c1c3
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Tue Nov 12 23:28:12 2024 +0800

        feat : support decoding model configuration.

    commit ec3d4e5
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Tue Nov 12 20:31:45 2024 +0800

        feat : support Qwen2.5 npu.

commit 7246d53
Author: yirongjie <yirj0809@gmail.com>
Date:   Tue May 27 07:12:53 2025 +0000

    feat: set run in Backends

commit 1150241
Author: yirongjie <yirj0809@gmail.com>
Date:   Sat May 24 07:57:09 2025 +0000

    fix: getFunc

commit 24db241
Author: yirongjie <yirj0809@gmail.com>
Date:   Fri May 23 05:16:41 2025 +0000

    fix: tensor function <Tensor *> to shared_ptr<Tensor>

commit 0ecce75
Author: yirongjie <yirj0809@gmail.com>
Date:   Thu May 22 14:05:11 2025 +0000

    feat:eager cpu

commit 9835db5
Author: yirongjie <yirj0809@gmail.com>
Date:   Fri Apr 18 14:57:21 2025 +0000

    fix: vtp

commit 30c3046
Author: yirongjie <yirj0809@gmail.com>
Date:   Wed Apr 16 06:49:46 2025 +0000

    fix: vtp

commit b416268
Author: yirongjie <yirj0809@gmail.com>
Date:   Tue Apr 15 08:40:22 2025 +0000

    fix: vtp

commit 6430ca8
Author: yirongjie <yirj0809@gmail.com>
Date:   Mon Apr 14 12:53:58 2025 +0000

    feat: vtp

commit f86bff6
Author: yirongjie <yirj0809@gmail.com>
Date:   Sun Mar 23 09:41:14 2025 +0000

    ref: add ShowUI

* feat: add FlashAttention2 && fix: MULTIMODELROPE

* remove broken submodule

---------

Co-authored-by: yirongjie <yirj0809@gmail.com>
Co-authored-by: yi <yi@U-21T7VPF4-1903.local>
Co-authored-by: liang1232018 <40791416+liang1232018@users.noreply.github.com>
Co-authored-by: oreomaker <70836772+oreomaker@users.noreply.github.com>
Co-authored-by: oreomaker <zh002919@outlook.com>
Co-authored-by: xudaliang <xudaliang@pku.edu.cn>
Co-authored-by: xwk <1263212259@qq.com>
Co-authored-by: yuerqiqi <2500526025@qq.com>
coderabbitai Bot (Contributor) commented Oct 27, 2025

Caution

Review failed

Failed to post review comments

Walkthrough

Major architectural refactoring transitioning the codebase from src/ to mllm/ directory structure. Introduces comprehensive backend infrastructure (CPU, OpenCL, QNN), state management via singleton Context, new quantization types, SIMD-accelerated compute kernels, expanded tensor/module capabilities with smart pointer semantics, and replaces legacy examples with new model demos.
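
As background for the Context items below, a minimal sketch of the singleton pattern the walkthrough describes (the member functions shown are assumptions for illustration, not the actual mllm API):

// Hypothetical sketch: Meyers-singleton Context exposing an
// InferenceStateManager; names beyond those two are illustrative.
#include <cstdint>

class InferenceStateManager {
public:
    void setSequenceLength(int32_t len) { seq_len_ = len; } // assumed setter
    int32_t sequenceLength() const { return seq_len_; }
    void reset() { seq_len_ = 0; } // clear per-prompt inference state
private:
    int32_t seq_len_ = 0;
};

class Context {
public:
    static Context &instance() { // function-local static: thread-safe init since C++11
        static Context ctx;
        return ctx;
    }
    InferenceStateManager &inferenceState() { return state_; }
private:
    Context() = default;
    InferenceStateManager state_;
};

Note that thread-safe initialization of the singleton does not make the state mutations themselves thread-safe; that is the consistency concern raised in the review below.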

Changes

Cohort / File(s) Summary
Build Configuration & Path Migration
.clang-tidy.ignore, .gitignore, .gitmodules, CMakeLists.txt
Path remapping from src/ to mllm/ prefix; updated backend integration (OpenCL, QNN), architecture-specific flags, kleidiai kernel inclusion for ARM; AddressSanitizer support added; ASAN and OpenCL options introduced.
Documentation & Examples CMake
README.md, examples/CMakeLists.txt
Updated model tables with Hexagon NPU INT8 support; revised QNN pipeline references; introduced modular LLM/VLM library targets (mllm_llm, mllm_vlm); conditional executable creation with existence checks.
Core State & Context Management
mllm/Context.hpp, mllm/Context.cpp, mllm/StateManager.hpp
New singleton Context with InferenceStateManager (execution type, sequence lengths, QNN/CPU flags) and SpeculativeDecodingManager for draft state tracking.
Tensor & Type System
mllm/DataType.hpp, mllm/Types.hpp, mllm/Tensor.hpp, mllm/Tensor.cpp, mllm/TensorImpl.hpp
Added FP16 type aliases, quantization block structures; introduced DeviceMemory abstraction; Tensor now inherits from enable_shared_from_this; master/child relationships via weak_ptr; backend-aware allocation and device transitions; extensive operator overloads and tensor operations.
Backend & Operation Infrastructure
mllm/Backend.hpp, mllm/Backend.cpp, mllm/Op.hpp, mllm/OpDefined.hpp, include/OpDefined.hpp (removed)
Backend::global_backends changed to unique_ptr; new runOp signature replacing runFunc; device memory allocation stubs; OpType and TensorFuncType enums consolidated to mllm/OpDefined.hpp; Op::traced() accessor and dtype propagation in setUp.
Module & Layer Architecture
mllm/Module.hpp, mllm/Module.cpp, mllm/Layer.hpp, mllm/ParamLoader.hpp, mllm/ParamLoader.cpp
Module loader changed to shared_ptr; added batch generate() returning vector<vector>; Layer now manages Op creation, backend switching (to/cpu/cl); forwardNoInput() timing; multi-file loading via load_multifile; mmap-based ParamLoader with getParamMetadata, getInputStream APIs.
Generation & Trace
mllm/Generate.hpp, mllm/Trace.cpp, mllm/Parallel.hpp
FP16/FP32 dtype-aware score collection; greedy search method public interface; Trace now registers all inputs in activation_tensors; ChunkPipeline::run extended with clean_tensors parameter for cleanup.
CPU Backend Core
mllm/backends/cpu/CMakeLists.txt, mllm/backends/cpu/CPUBackend.hpp, mllm/backends/cpu/CPUBackend.cpp
CPU backend creation via CPUBackendCreator; comprehensive Op registration (arithmetic, neural nets, functions); convert_fp_data for FP16↔FP32; runOp orchestrates in-graph tracing and activation tensor management; conditional kleidiai/OpenMP support for ARM.
CPU Compute Headers & Utilities
mllm/backends/cpu/compute/ActivationFunction.hpp, mllm/backends/cpu/compute/Convolution.hpp/cpp, mllm/backends/cpu/compute/FeatureCheck.hpp, mllm/backends/cpu/compute/Pooling.hpp/cpp, mllm/backends/cpu/compute/SIMDMemory.hpp, mllm/backends/cpu/compute/Sigmoid.hpp
Updated includes to reference ggml third_party paths; Convolution public API (2D/3D); I8MM feature detection via ARM registers/sysctlbyname; Pooling with VecDotFP32 support; SIMD-accelerated Sigmoid (AVX2/NEON).
CPU GEMM & Matmul Implementations
mllm/backends/cpu/compute/Matmul.hpp/cpp, mllm/backends/cpu/compute/MatmulElastic.hpp/cpp, mllm/backends/cpu/compute/MatmulSparse.hpp/cpp, mllm/backends/cpu/compute/GemmFp.hpp
Replaced SME paths with LlamaFile GEMM; updated function pointer types (gemv_func, gemm_func); includes moved to ggml third_party; GemmFp adds FP32 and FP32↔FP16 micro-kernels (NEON/AVX) with packing.
Kleidiai GEMM Integration
mllm/backends/cpu/compute/GemmKleidiai.hpp, mllm/backends/cpu/compute/GemmKleidiai.cpp
Multi-precision GEMM with QSI4C32 (4-bit), FP16, FP32 paths; workspace management; packing/quantization helpers; BSHD layout support; runtime hardware capability detection.
Quantization GEMM & Q2K
mllm/backends/cpu/compute/GemmQ2K.hpp/cpp
Q2_K × Q8_K quantized GEMM and GEMV operations with NEON micro-kernels (8x16) and reference fallback.
Flash Attention & Sage Attention
mllm/backends/cpu/compute/FlashAttention2H.hpp, mllm/backends/cpu/compute/SageAttention.hpp, mllm/backends/cpu/compute/SageAttentionKVQ8.hpp, mllm/backends/cpu/compute/SageAttentionPT.hpp, mllm/backends/cpu/compute/SageQuantize.hpp
SIMD-accelerated FA2 with FP32 and mixed FP32/FP16 paths (BHSD layout); Sage Attention with KV quantization (Q8_0F), per-row quantization, mean computation, softmax with platform-optimized paths.
Transpose & Split Operations
mllm/backends/cpu/compute/Transpose2D.hpp, mllm/backends/cpu/compute/Transpose3D.hpp, mllm/backends/cpu/compute/Split.hpp
Matrix transpose with AVX/NEON 8x8/4x4 block optimization; 3D tensor transpose with permutation handling; efficient split with mixed FP32/FP16 outputs and type-aware conversion.
CPU Op Implementations
mllm/backends/cpu/op/CPUArgSortFunc.hpp, mllm/backends/cpu/op/CPUBinCountFunc.hpp, mllm/backends/cpu/op/CPUBinaryFunc.hpp, mllm/backends/cpu/op/CPUCat.cpp
Migrated from TensorFunction to Op-based architecture; reshape/execute return ErrorCode; Creator factory pattern; shift from args vector to constructor-injected parameters; multi-input operators (addTwo, subTwo, mulTwo, divTwo).
Removed Example Programs
examples/main_alpaca.cpp, examples/main_llama.cpp, examples/main_clip.cpp, examples/main_fuyu.cpp, examples/main_imagebind.cpp, examples/main_llava.cpp, examples/main_phonelm_npu.cpp, examples/main_phonelm_npu.hpp, examples/main_qwen_npu.cpp, examples/main_qwen_npu.hpp, examples/main_tinyllama.cpp, examples/main_vit.cpp
Deleted legacy inference examples and supporting headers.
New Model Demo Programs
examples/demo_qwen.cpp, examples/demo_qwen_batch.cpp, examples/demo_qwen_npu.cpp, examples/demo_qwen_npu_pipeline.cpp, examples/demo_qwen2.5_vl.cpp, examples/demo_qwen2_vl.cpp, examples/demo_qwen2_vl_npu.cpp, examples/demo_qwen2_vl_vtp.cpp, examples/demo_qwen3.cpp, examples/demo_llama3.cpp, examples/demo_showui.cpp, examples/demo_showui_npu.cpp, examples/demo_showui_vtp.cpp, examples/demo_bailing_moe.cpp, examples/demo_bailing_moe_mbp.cpp, examples/demo_smallthinker.cpp, examples/demo_smallthinker_mbp.cpp, examples/demo_minicpm_moe_mbm.cpp, examples/demo_minicpm_moe_mbp.cpp, examples/demo_tinyllama.cpp, examples/demo_sparse_llama.cpp, examples/demo_ds_qwen2.cpp, examples/demo_phonelm_npu.cpp, examples/demo_qwen2.5_npu.cpp (removed), examples/mllm_benchmark.cpp
New demos for Qwen (batch, NPU pipeline), Qwen2.5VL, Qwen2VL (CPU/NPU/VTP), Qwen3, LLaMA3, ShowUI, BailingMoE, SmallThinker, MiniCPM MoE, TinyLLaMA, SparseLLaMA; NPU variants use v2 API with context-based state management; ARM-specific defaults; OpenCL conditional setup; call clear_kvcache() and profiling() post-generation.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Areas requiring extra attention:

  • Tensor memory management refactoring: master_tensor_ and child_tensors_ conversion from raw pointers to weak_ptr/vector<weak_ptr> introduces safety but requires careful lifecycle verification, especially shallowCopyFrom and reconcileLayouts logic (see the sketch after this list)

  • Context singleton and InferenceStateManager: ensure thread-safe access patterns and state consistency across backends (CPU, QNN, OpenCL)

  • Backend::global_backends unique_ptr conversion: verify all access paths use .get() appropriately and no double-deletion or dangling pointer scenarios exist

  • Module::generate batch overload: new vector<vector> signature for batch generation requires validation of per-batch end detection and tensor cleanup

  • ParamLoader mmap implementation: alignment checks, fallback paths, and interaction with backend load_from_file hooks need verification

  • CPU Op Creator pattern migration: ensure all Op types properly register with CPUBackend and Creator factories wire OpParam → Op construction correctly

  • SIMD compute kernels (GemmFp, GemmKleidiai, FA2, SageAttention): platform-specific codepaths (AVX2, NEON) and fallback correctness; quantization block interpretations

  • Layer backend switching and initOp logic: backend type consistency checks and Op re-initialization on backend change

  • Examples NPU/QNN integration: v2 API usage, context state management, chunk handling, and graph freezing/switching logic

  • Possible missing ownership transfer semantics: verify Backend creation and ownership in Backend::global_backends is clear

  • State manager reset() completeness: ensure all inference state properly resets for new prompts/batches
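
As referenced in the first item above, a minimal sketch of the weak_ptr master/child pattern and its lifecycle check (TensorNode and its methods are illustrative; only the master_tensor_/child_tensors_ names come from the review):

// Hypothetical sketch: child tensors hold weak_ptr to a master tensor and
// must lock() before use, so a destroyed master is detected rather than dereferenced.
#include <memory>
#include <vector>

struct TensorNode : std::enable_shared_from_this<TensorNode> {
    std::weak_ptr<TensorNode> master_tensor_;
    std::vector<std::weak_ptr<TensorNode>> child_tensors_;

    void attachChild(const std::shared_ptr<TensorNode> &child) {
        child->master_tensor_ = shared_from_this(); // requires shared_ptr ownership
        child_tensors_.push_back(child);
    }

    std::shared_ptr<TensorNode> masterOrNull() const {
        return master_tensor_.lock(); // nullptr if the master was freed
    }
};

Nodes must be owned by a shared_ptr (e.g., created via std::make_shared) before attachChild is called, since shared_from_this requires it; lock() is the safe accessor, while checking expired() alone is racy.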

Possibly related PRs

Suggested reviewers

  • oreomaker
  • chenghuaWang
  • liang1232018

Poem

🐰 Whiskers twitching with delight,
Paths refactored left and right,
SIMD kernels now take flight,
State and context, shining bright!
Backends bloom—what a sight!

Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 11.75%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Title Check (❓ Inconclusive): The pull request title "feat: add kai&qnn-vl&opencl" uses vague abbreviations (kai, qnn-vl, opencl) and, while it references real components in the changeset, fails to capture the primary architectural change: a large directory restructuring from src/ to mllm/, along with extensive backend implementations, infrastructure additions (Context, StateManager, device memory management), and new compute kernels. The abbreviated form makes it hard for a developer scanning commit history to grasp the PR's intent. Resolution: consider a more descriptive title, e.g. "refactor: migrate project structure to mllm and add Kleidiai, QNN vision-language, and OpenCL backends" or "feat: add Kleidiai/QNN-VL/OpenCL support with project restructuring".
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.

coderabbitai Bot (Contributor) left a comment


Actionable comments posted: 103

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (19)
README.md (1)

188-199: Specify language for fenced code block.

The code block starting at line 190 should declare a language for proper syntax highlighting. Based on the context (shell output), this should be marked as a shell or plaintext result block.

Apply this diff:

- Result are as followed:
+ Results are as follows (expected output is shown below):
  
- ```
+ ```plaintext
  > ./demo_qwen_npu

Alternatively, if this is intended to be interactive shell output, use:

- ```
+ ```shell
  > ./demo_qwen_npu
mllm/backends/cpu/compute/Matmul.hpp (1)

17-23: Contradictory preprocessor guards; unreachable #error.

#ifndef __ARM_NEON inside #ifdef __ARM_NEON can never trigger.

Apply:

-#ifdef __ARM_NEON
-
-#ifndef __ARM_NEON
-#error \
-    "The mllm-advance Armv8 backend is enbaled but __ARM_NEON is not defined. Pls use cross-compile toolchains(such as NDK) to compile."
-#endif
+#ifndef __ARM_NEON
+#error "The mllm-advance Armv8 backend is enabled but __ARM_NEON is not defined. Use an Armv8 toolchain (e.g., NDK)."
+#else

And keep the existing closing #endif at Line 157 as-is.

mllm/backends/cpu/op/CPUCat.cpp (1)

65-70: Batch concat copy logic breaks for varying batch sizes.

copysize and destination offset assume all inputs have inputs[0]->batch(). This corrupts memory when batches differ.

Apply:

-    for (int n = 0; n < inputs.size(); ++n) {
-        auto copysize = inputs[0]->batch() * inputs[0]->head() * inputs[0]->sequence() * inputs[0]->dimension();
-        memcpy(outputs[0]->ptrAt<float>(n * inputs[0]->batch(), 0, 0, 0), inputs[n]->ptrAt<float>(0, 0, 0, 0), sizeof(float) * copysize);
-    }
+    int dst_batch_offset = 0;
+    const int head = inputs[0]->head();
+    const int seq  = inputs[0]->sequence();
+    const int dim  = inputs[0]->dimension();
+    const size_t elems_per_item = static_cast<size_t>(head) * seq * dim;
+    for (int n = 0; n < inputs.size(); ++n) {
+        const int nb = inputs[n]->batch();
+        const size_t elems = static_cast<size_t>(nb) * elems_per_item;
+        memcpy(outputs[0]->ptrAt<float>(dst_batch_offset, 0, 0, 0),
+               inputs[n]->ptrAt<float>(0, 0, 0, 0),
+               sizeof(float) * elems);
+        dst_batch_offset += nb;
+    }

If non-float dtypes are possible here, use rawHostPtr() plus type_size(dtype) instead. See next comment.

mllm/backends/cpu/compute/MatmulElastic.cpp (1)

187-191: Bitwise & used in loop condition; should be &&.

The condition n < (block + 1) * blck_0 & n < use_N uses bitwise &. Because relational operators bind tighter than &, it parses as (n < (block + 1) * blck_0) & (n < use_N) and yields the same truth value, but it forgoes short-circuit evaluation and obscures the boolean intent.

Apply:

-                    for (int n = block * blck_0; n < (block + 1) * blck_0 & n < use_N; n++) {
+                    for (int n = block * blck_0; n < (block + 1) * blck_0 && n < use_N; n++) {
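
A standalone illustration of the parse, for reference (toy snippet, not project code):

// '<' binds tighter than '&', so a < b & c < d groups as (a < b) & (c < d):
// same truth value as &&, but both sides are always evaluated.
#include <cstdio>

int main() {
    int n = 3, hi1 = 5, hi2 = 2;
    printf("%d %d\n", (n < hi1 & n < hi2), ((n < hi1) && (n < hi2))); // prints: 0 0
    return 0;
}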
mllm/TensorImpl.hpp (1)

252-262: Remove unreachable code in time()

The switch never executes due to the early return.

     int time() {
         assert(shape_.size() == 5);
-        return legacyShape(chls()[TIME]);
-        switch (ctype_) {
-        case BCTHW:
-            return legacyShape(2);
-        case BTHWC:
-            return legacyShape(1);
-        default: return -1;
-        }
+        return legacyShape(chls()[TIME]);
     }

If the intent was a layout-dependent path, decide one behavior and drop the other, or gate by ctype_ before returning.

mllm/backends/cpu/compute/Pooling.cpp (2)

55-63: Bug: assignment inside assert; should be comparison.

assert(padding_top = -blk_h); and assert(padding_left = -blk_w); assign rather than compare; worse, under NDEBUG the assert expression (and with it the assignment) is compiled out, so debug and release builds diverge.

-                        assert(padding_top = -blk_h);
+                        assert(padding_top == -blk_h);
...
-                        assert(padding_left = -blk_w);
+                        assert(padding_left == -blk_w);

If asserts are intended only for debug checks, consider removing the mutation to padding_* entirely and computing start offsets from -blk_* directly.

Also applies to: 129-135


69-73: Use logical &&, not bitwise &, in loop conditions.

Relational operators bind tighter than &, so the truth value is the same here, but bitwise & forgoes short-circuiting and hides the boolean intent; prefer &&.

-                    for (int k_h = start_k_h; k_h < kernel_h & blk_h + k_h < in_height; ++k_h) {
+                    for (int k_h = start_k_h; k_h < kernel_h && blk_h + k_h < in_height; ++k_h) {
...
-                        for (int k_w = start_k_w; k_w < kernel_w & blk_w + k_w < in_width; ++k_w) {
+                        for (int k_w = start_k_w; k_w < kernel_w && blk_w + k_w < in_width; ++k_w) {

Also applies to: 136-139

mllm/ParamLoader.hpp (1)

3-12: Missing standard headers cause TU‑order dependent builds.

This header uses FILE, vector, tuple, set, shared_ptr but doesn’t include them here. Add explicit includes.

 #include <cstdint>
 #include <map>
 #include <string>
 #include <utility>
 #include "Tensor.hpp"
 #include "Types.hpp"
 #include <initializer_list>
 #include <mutex>
+#include <cstdio>     // FILE, fread
+#include <memory>     // std::shared_ptr
+#include <set>        // std::set
+#include <tuple>      // std::tuple
+#include <vector>     // std::vector
examples/demo_qwen2_vl.cpp (1)

47-52: Guard against image/prompt length mismatch.

Indexing in_imgs[i] assumes same length as in_strs. Risk of OOB if prompts > images.

-        auto input_tensor = processor.process(in_str, in_imgs[i]);
+        const auto img_idx = std::min(i, static_cast<int>(in_imgs.size() - 1));
+        auto input_tensor = processor.process(in_str, in_imgs[img_idx]);
mllm/ParamLoader.cpp (3)

437-465: partialLoad mmap path reads from buffer_ which is never set in mmap ctor.

In the new design you only set mmap_buffer_; buffer_ stays null, so mmap partialLoad fails or UB.

Use mmap_buffer_.get():

-        if (!use_mmap_ || buffer_ == nullptr || offsets_.find(name) == offsets_.end()) {
+        if (!use_mmap_ || !mmap_buffer_ || offsets_.find(name) == offsets_.end()) {
             fprintf(stderr, "Error: mmap not initialized or tensor name not found for mmap partialLoad.\n");
             return false;
         }
@@
-            uint8_t *source_ptr_in_mmap = buffer_ + offset_info.first + (static_cast<uint64_t>(row) * colNum) * perValueLength;
+            uint8_t *source_ptr_in_mmap = mmap_buffer_.get() + offset_info.first
+                                        + (static_cast<uint64_t>(row) * colNum) * perValueLength;

Also, consider validating bounds before memcpy.


143-152: Avoid double-unmap; align dtor with shared_ptr-managed mapping.

mmap_buffer_ already unmaps via deleter; dtor also munmaps buffer_. If you ever set both to same region, this double-unmaps.

-    if (use_mmap_) {
-        if (use_mmap_ && buffer_ != nullptr && buffer_ != MAP_FAILED) {
-            munmap(buffer_, size_);
-            buffer_ = nullptr;
-        }
-    }
+    if (use_mmap_ && buffer_ && !mmap_buffer_) {
+        munmap(buffer_, size_);
+        buffer_ = nullptr;
+    }

393-401: MultiFileParamLoader dtor never closes files; condition inverted and local include.

  • #include <set> inside the function is odd.
  • if (closed.find(p.second) != closed.end()) closes only when already present; no files are closed.
-MultiFileParamLoader::~MultiFileParamLoader() {
-#include <set>
-    std::set<FILE *> closed;
-    for (const auto &p : files_) {
-        if (closed.find(p.second) != closed.end()) {
-            fclose(p.second);
-            closed.insert(p.second);
-        }
-    }
-}
+MultiFileParamLoader::~MultiFileParamLoader() {
+    std::set<FILE *> closed;
+    for (const auto &kv : files_) {
+        FILE *fp = kv.second;
+        if (closed.find(fp) == closed.end()) {
+            fclose(fp);
+            closed.insert(fp);
+        }
+    }
+}

Move #include <set> to the top of the file.
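
For reference, a minimal sketch of the shared_ptr-managed mapping the comments above assume, with a deleter that unmaps exactly once (file handling and offset logic elided):

// Hypothetical sketch: wrap an mmap'd region in a shared_ptr whose deleter
// calls munmap, so ownership is unambiguous and no manual dtor cleanup races it.
#include <sys/mman.h>
#include <cstdint>
#include <memory>

std::shared_ptr<uint8_t> map_region(int fd, size_t size) {
    void *p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return nullptr;
    return std::shared_ptr<uint8_t>(
        static_cast<uint8_t *>(p),
        [size](uint8_t *ptr) { munmap(ptr, size); }); // unmaps exactly once
}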

mllm/Parallel.hpp (3)

19-23: Chunk count off-by-one when length is a multiple of chunk_size.

The formula adds an extra chunk for exact multiples: e.g., with real_seq_length = 8 and chunk_size = 4, seq_length_padding = (4 - 8 % 4) + 8 = 12, giving chunk_num = 3 instead of 2.

Use ceiling division:

-        const int seq_length_padding = (chunk_size - real_seq_length % chunk_size) + real_seq_length;
-        chunk_num = seq_length_padding / chunk_size;
+        chunk_num = (real_seq_length + chunk_size - 1) / chunk_size;

68-77: OpenMP misuse and skipped chunks.

  • Loop runs only chunk_num/2; chunk_num==1 executes nothing.
  • #pragma omp barrier outside a parallel region is invalid.
-        for (int chunk_id = 0; chunk_id < chunk_num / 2; ++chunk_id) {
+        for (int chunk_id = 0; chunk_id < (chunk_num + 1) / 2; ++chunk_id) {
             // for every two chunk, start at chunk_id * 2 to avoid no execute for
             for (int i = chunk_id * 2; i < num_graph + chunk_id * 2 + 5; ++i) {
 #pragma omp parallel for num_threads(2)
                 for (int pair_idx = 0; pair_idx < 2; ++pair_idx) {
                     executeFunc((chunk_id * 2) + pair_idx, i - (pair_idx * 4));
                 }
-#pragma omp barrier
-                // std::cout << "---------------------------" << std::endl;
+                // optional: add synchronization inside the parallel region if needed
             }
         }

Additionally, include <omp.h> in this header if not already via transitive includes to use omp_set_max_active_levels.


88-96: Negative index when real_seq_length % chunk_size == 0.

Indexing at -1 is UB. Use (real_seq_length - 1) % chunk_size.

-                auto value = result->dataAt<float>(0, 0, real_seq_length % chunk_size - 1, i);
+                const int pos = (real_seq_length - 1) % chunk_size;
+                auto value = result->dataAt<float>(0, 0, pos, i);
mllm/backends/cpu/op/CPUArgSortFunc.hpp (1)

23-39: Return integer indices and clean up comparator/unused state.

  • argsort writes indices but stores them as float; outputs dtype mirrors inputs. Use int32 indices.
  • compareIndices doesn’t depend on object state; make it static; drop lambda capture.
  • thread_count is unused; either use it or remove it.
-private:
-    int thread_count = 4;
-
-    // 自定义比较函数,用于对索引进行排序
-    bool compareIndices(const std::pair<int, float> &a, const std::pair<int, float> &b) {
+private:
+    int thread_count = 4; // TODO: use for parallelism or remove
+    // 自定义比较函数,用于对索引进行排序
+    static inline bool compareIndices(const std::pair<int, float> &a, const std::pair<int, float> &b) {
         return a.second < b.second;
     }
 
-    void argsort(float *input, int size, float *out_indices) {
+    void argsort(float *input, int size, int *out_indices) {
         std::vector<std::pair<int, float>> indexedInput(size);
         for (int i = 0; i < size; ++i) {
             indexedInput[i] = std::make_pair(i, input[i]);
         }
-        std::sort(indexedInput.begin(), indexedInput.end(), [this](const std::pair<int, float> &a, const std::pair<int, float> &b) {
-            return compareIndices(a, b);
-        });
+        std::sort(indexedInput.begin(), indexedInput.end(), &CPUargsortFunction::compareIndices);
         for (int i = 0; i < size; ++i) {
-            out_indices[i] = static_cast<float>(indexedInput[i].first);
+            out_indices[i] = indexedInput[i].first;
         }
     }
@@
-        outputs[0]->setDtype(inputs[0]->dtype()); // argsortk_values
+        outputs[0]->setDtype(DataType::DATA_TYPE_INT32); // indices
         return ErrorCode::MLLM_NO_ERROR;
     }
     ErrorCode execute(vector<shared_ptr<Tensor>> inputs, vector<shared_ptr<Tensor>> outputs) override {
         int size = inputs[0]->dimension();
         for (int b = 0; b < inputs[0]->batch(); b++) {
             float *data = inputs[0]->ptrAt<float>(b, 0, 0, 0);
-            float *out = outputs[0]->ptrAt<float>(b, 0, 0, 0);
+            int *out = outputs[0]->ptrAt<int>(b, 0, 0, 0);
             argsort(data, size, out);
         }
         return ErrorCode::MLLM_NO_ERROR;
     }

Also applies to: 46-61

mllm/Types.hpp (1)

280-306: Bug: returning -1 in a size_t function causes wraparound.

DataTypeSize returns -1 for several cases, but the function’s return type is size_t, so the value wraps to SIZE_MAX (18446744073709551615 on 64-bit targets) and will corrupt any allocation or stride computed from it.

Apply a safe fallback (0) or throw. Minimal fix:

-    case MLLM_TYPE_Q4_1:
-    case MLLM_TYPE_Q8_1:
-        return -1;
+    case MLLM_TYPE_Q4_1:
+    case MLLM_TYPE_Q8_1:
+        return 0; // unsupported
...
-    case MLLM_TYPE_Q1_K:
-        return -1;
+    case MLLM_TYPE_Q1_K:
+        return 0;
...
-    case MLLM_TYPE_IQ2_XS:
-        return -1;
+    case MLLM_TYPE_IQ2_XS:
+        return 0;
-    case MLLM_TYPE_IQ1_S:
-        return -1;
+    case MLLM_TYPE_IQ1_S:
+        return 0;
-    case MLLM_TYPE_IQ1_M:
-        return -1;
+    case MLLM_TYPE_IQ1_M:
+        return 0;
-    case MLLM_TYPE_IQ2_S:
-        return -1;
+    case MLLM_TYPE_IQ2_S:
+        return 0;

Optionally log or assert on 0 to catch misuse.

mllm/Layer.hpp (1)

1042-1057: Use logical &&, not bitwise & in View dimension checks

Using & here happens to evaluate correctly (== binds tighter than &), but it loses short-circuiting, invites precedence mistakes in later edits, and obscures the boolean intent; use && throughout.

-        if (batch == -1 & seq == -1 & head != -1 & dim != -1) { // keep b&s change h&d
+        if (batch == -1 && seq == -1 && head != -1 && dim != -1) { // keep b&s change h&d
@@
-        } else if (batch == -1 & dim == -1 & head != -1 & seq != -1) { // keep b&d change h&s
+        } else if (batch == -1 && dim == -1 && head != -1 && seq != -1) { // keep b&d change h&s
@@
-        } else if (head == -1 & dim == -1 & batch != -1 & seq != -1) { // keep h&d change b&s
+        } else if (head == -1 && dim == -1 && batch != -1 && seq != -1) { // keep h&d change b&s
@@
-        } else if (batch != -1 & dim != -1 & head != -1 & seq != -1) { // change all dimension.
+        } else if (batch != -1 && dim != -1 && head != -1 && seq != -1) { // change all dimension.
mllm/Tensor.hpp (1)

653-666: Incorrect channel/width index mapping for BCTHW/BTHWC.

WIDTH and CHANNLE indices are wrong; causes bad shape/indexing.

 case BTHWC:
   impl_->chls()[BATCH] = 0;
   impl_->chls()[TIME] = 1;
   impl_->chls()[HEIGHT] = 2;
-  impl_->chls()[WIDTH] = 3;
-  impl_->chls()[CHANNLE] = 3;
+  impl_->chls()[WIDTH] = 3;
+  impl_->chls()[CHANNLE] = 4;
   break;
 case BCTHW:
   impl_->chls()[BATCH] = 0;
-  impl_->chls()[CHANNLE] = 1;
-  impl_->chls()[TIME] = 2;
-  impl_->chls()[HEIGHT] = 3;
-  impl_->chls()[WIDTH] = 3;
+  impl_->chls()[CHANNLE] = 1;
+  impl_->chls()[TIME] = 2;
+  impl_->chls()[HEIGHT] = 3;
+  impl_->chls()[WIDTH] = 4;
   break;
♻️ Duplicate comments (2)
examples/demo_minicpm_moe_mbp.cpp (1)

11-11: Same verification needed as in demo_minicpm_moe_mbm.cpp.

This change disables memory-mapped allocations, identical to the change in demo_minicpm_moe_mbm.cpp. Please refer to the verification request in that file.

mllm/backends/cpu/compute/MatmulSparse.hpp (1)

8-9: Verify the include paths are correct.

Same issue as in Pooling.hpp: the static analysis tool reports that Tensor.hpp cannot be found. Please verify the include paths are correct relative to mllm/backends/cpu/compute/ or ensure the build system's include directories properly resolve these headers.

🧹 Nitpick comments (72)
.gitignore (1)

38-39: Clean up unnecessary blank lines.

Remove the consecutive blank lines at lines 38–39 to improve file organization.

Apply this diff:

  mllm/backends/qnn/sdk*
-
-
examples/demo_tinyllama.cpp (1)

32-32: Consider clarifying the purpose of this commented-out configuration.

This commented-out line has no functional effect. If it serves as a reference for optional configuration, consider adding a brief comment explaining when/why a developer might uncomment it. Otherwise, remove it to reduce code clutter.

mllm/backends/cpu/compute/Sigmoid.hpp (2)

28-69: Overall structure is solid, but wrap in namespace.

The SIMD implementation structure is well-designed with proper conditional compilation paths and scalar fallback. However, placing this function in global scope risks symbol conflicts in larger projects.

Apply this diff to wrap in a namespace:

+namespace mllm {
+namespace cpu {
+
 /**
  * @brief 对一个 float 数组进行 Sigmoid 计算 (支持 AVX 和 NEON 的高性能版本)
  * @param n   数组中元素的数量
  * @param y   指向输出数组的指针
  * @param x   指向输入数组的指针
  */
 void vec_sigmoid_f32(const int n, float *y, const float *x) {
     // ... implementation ...
 }
+
+}  // namespace cpu
+}  // namespace mllm

2-2: Consider using English comments for international collaboration.

The codebase contains Chinese comments (e.g., "包含 NEON 指令集的头文件"). For projects with international contributors, English comments improve accessibility.

Example:

-#include <arm_neon.h> // 包含 NEON 指令集的头文件
+#include <arm_neon.h> // Include NEON intrinsics header

This suggestion applies to all comments in the file (lines 2, 5, 10, 29-32, 37, 51, 65).

mllm/backends/cpu/compute/FeatureCheck.hpp (4)

4-6: Remove commented-out preprocessor directives.

The commented-out #if and #endif directives (lines 4 and 6) add no value and should be removed. The Chinese comment on line 5 should also be translated to English or removed for consistency.

Apply this diff:

-// #if defined(__aarch64__) && !defined(HWCAP_I8MM)
-#include <asm/hwcap.h> // 确保定义 HWCAP_I8MM
-// #endif
+#include <asm/hwcap.h> // Ensure HWCAP_I8MM is defined

27-78: Remove commented-out debug statements.

The function contains numerous commented-out debug statements (lines 27, 31, 37, 40-41, 46, 49, 52, 55, 60, 68, 74, 76). These should be removed for production code. If debugging capabilities are needed, consider using a logging framework or conditional compilation macros instead.


61-73: Consider RAII for file handling.

The C-style file handling works correctly but could benefit from C++ RAII patterns for better exception safety. Consider using std::ifstream or a custom RAII wrapper.

Example refactor using C++ streams:

#if defined(__linux__)
    std::ifstream cpuinfo("/proc/cpuinfo");
    if (cpuinfo) {
        std::string line;
        while (std::getline(cpuinfo, line)) {
            if (line.find("Features") != std::string::npos && 
                line.find("i8mm") != std::string::npos) {
                return true;
            }
        }
    }
#endif

48-51: Consider using named constants for magic numbers.

The bit positions (52) and mask (0xF) for the I8MM field, as well as the valid values (1, 2), should be named constants for better readability and maintainability.

Example:

// ARM Architecture Reference Manual: ID_AA64ISAR1_EL1.I8MM field
static constexpr uint8_t I8MM_FIELD_SHIFT = 52;
static constexpr uint64_t I8MM_FIELD_MASK = 0xF;
static constexpr uint8_t I8MM_SUPPORTED_VALUE_1 = 1;
static constexpr uint8_t I8MM_SUPPORTED_VALUE_2 = 2;

const uint8_t i8mm_field = (isar1 >> I8MM_FIELD_SHIFT) & I8MM_FIELD_MASK;
if (i8mm_field == I8MM_SUPPORTED_VALUE_1 || i8mm_field == I8MM_SUPPORTED_VALUE_2) {
    return true;
}
mllm/backends/cpu/compute/Matmul.cpp (1)

50-50: Consider simplifying the is_0 condition for clarity.

The condition src1->batch() != src0->batch() can be simplified to src0->batch() != 1 since we already know src1->batch() == 1 from the first check. This makes the intent clearer: we're checking if src0 has multiple batches that need broadcasting from src1's single batch.

-        int is_0 = (src1->batch() == 1 && src1->head() == 1 && src1->batch() != src0->batch()) ? 0 : 1;
+        int is_0 = (src1->batch() == 1 && src1->head() == 1 && src0->batch() != 1) ? 0 : 1;

Apply the same change at line 130.

Also applies to: 130-130

examples/demo_qwen3.cpp (1)

70-71: Consider simplifying KV cache management.

The cache is cleared in two places:

  • Line 63: Inside the callback when an end token is detected
  • Line 70: After generation completes (added in this PR)

When generation ends naturally (end token detected), the cache is cleared twice. While this ensures the cache is always cleared regardless of termination reason (end token vs max_new_tokens limit), you could simplify by removing the clearing at line 63 and relying solely on the post-generation clearing at line 70.

The profiling call at line 71 is a useful addition for gathering performance metrics in the demo.

If you prefer to eliminate the redundancy, apply this diff:

             auto [not_end, output_string] = tokenizer.postprocess(out_string);
             if (!not_end) {
-                model.clear_kvcache();
                 return false;
             }
mllm/backends/cpu/op/CPUBinCountFunc.hpp (2)

20-35: Input validation and perf notes (optional)

  • If non-F32 inputs are possible, either convert or reject with an ErrorCode before using dataAt/hostPtr.
  • Perf: compute max via raw pointer once to avoid dataAt overhead; optionally parallelize counting via per-thread histograms then reduce.

Example max via pointer:

-        for (int i = 0; i < size; ++i) {
-            int val = static_cast<int>(inputs[0]->dataAt<float>(0, 0, 0, i));
+        float* ptr = inputs[0]->hostPtr<float>();
+        for (int i = 0; i < size; ++i) {
+            int val = static_cast<int>(ptr[i]);
             if (val > max_val) {
                 max_val = val;
             }
         }

Optional: parallel counting with thread‑local histograms to avoid atomics.

Are non‑F32 input dtypes expected for this op?

Also applies to: 55-60


9-9: Drop unused include

<algorithm> is not used.

-#include <algorithm>
mllm/backends/cpu/compute/FlashAttention2H.hpp (3)

153-154: Make kv_group_size computation consistent and safe.

Prefill uses Q_Head / KV_Head directly; decode uses a guarded ternary. After adding asserts that Q_Head, KV_Head > 0, unify to a single formula.

Apply:

-        const int32_t kv_group_size = Q_Head / KV_Head;
+        const int32_t kv_group_size = Q_Head / KV_Head;
@@
-        const int32_t kv_group_size = (Q_Head > 0 && KV_Head > 0) ? Q_Head / KV_Head : 1;
+        const int32_t kv_group_size = Q_Head / KV_Head;

Alternatively, if you prefer defensive coding without relying on asserts:

+        const int32_t kv_group_size = (KV_Head > 0) ? (Q_Head / KV_Head) : 1;

Also applies to: 204-205, 615-616, 664-665


14-17: Architecture guard is too strict for generic x86 builds.

Only allowing AVX2 or __ARM_NEON will hard-fail SSE4-capable CPUs. Consider a scalar fallback path or a compile-time option to bypass the #error.

Introduce a scalar fallback block (or gate the #error behind a dedicated build flag like MLLM_STRICT_SIMD) to improve portability.

Also applies to: 39-41, 54-55
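
A minimal sketch of that gating, assuming a new opt-in flag named MLLM_STRICT_SIMD (not an existing project macro):

// Hard-fail only when the build explicitly demands SIMD; otherwise compile a scalar fallback.
#if defined(__AVX2__) || defined(__ARM_NEON)
#define MLLM_FA2_HAVE_SIMD 1
#elif defined(MLLM_STRICT_SIMD)
#error "FlashAttention2H requires AVX2 or NEON when MLLM_STRICT_SIMD is defined."
#else
#define MLLM_FA2_HAVE_SIMD 0 // guard the intrinsics blocks with this and add a scalar path
#endif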


89-90: Unused flag: high_precision.

The high_precision member is configured but never read. If intentional for future work, add a brief TODO; otherwise remove to avoid confusion.

Add a comment or wire it into expf/accumulation choices if it is meant to control numerics.

Also applies to: 551-552

mllm/backends/cpu/compute/Matmul.hpp (2)

11-11: Avoid using namespace in a header.

Header-wide using namespace mllm; pollutes consumers.

Prefer either qualifying names (mllm::Tensor, mllm::ErrorCode) or wrap declarations in namespace mllm { ... } and convert namespace mllm::armv8 below to namespace armv8 inside. Happy to provide a patch if you choose either route.
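
For illustration, the wrapped-namespace route could look like this (the mat_mul declaration is a stand-in, not the real signature):

// Matmul.hpp — declarations wrapped in the namespace instead of a header-wide using-directive.
namespace mllm {

ErrorCode mat_mul(Tensor *src0, Tensor *src1, Tensor *dst, int thread_count); // hypothetical

namespace armv8 { // was `namespace mllm::armv8` at file scope
// armv8-specific kernels...
} // namespace armv8

} // namespace mllm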


29-36: Doc and naming nits (low priority).

  • “accpect” -> “accept”; “enbaled” -> “enabled”.
  • GEMM comments say C(M x K); should be C(M x N).
  • Consider renaming parameter src0_ to src0 for consistency.

Also applies to: 84-104, 124-136

mllm/backends/cpu/op/CPUCat.cpp (1)

64-104: Hardcoded float copies; generalize or assert dtype.

All memcpy paths use ptrAt<float>/sizeof(float). If tensors can be F16/BF16, this is UB.

  • If CPUCat guarantees FP32, add assert(outputs[0]->dtype()==MLLM_TYPE_F32) before copies.
  • Otherwise switch to rawHostPtr() and multiply by type_size(tensor->dtype()).
    I can send a complete patch once you confirm the intended dtype contract.
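
As a rough sketch of the second option (dst_offset, src_offset, copy_count, and inputs[n] are placeholders for the existing loop bookkeeping):

// Dtype-agnostic copy: scale byte offsets by the element size instead of hardcoding sizeof(float).
const size_t elem = type_size(outputs[0]->dtype());
auto *dst = static_cast<uint8_t *>(outputs[0]->rawHostPtr()) + dst_offset * elem;
auto *src = static_cast<const uint8_t *>(inputs[n]->rawHostPtr()) + src_offset * elem;
std::memcpy(dst, src, copy_count * elem); // needs <cstring>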
examples/demo_phonelm_npu.cpp (1)

69-71: Prefer dynamic_cast over static_cast for backend downcast.

Defensive in case the global backend at MLLM_QNN isn’t a QNNBackend (misconfig causes UB).

Apply:

-    static_cast<QNNBackend *>(Backend::global_backends[MLLM_QNN].get())->saveQNNContext();
+    if (auto *qnn = dynamic_cast<QNNBackend *>(Backend::global_backends[MLLM_QNN].get())) {
+        qnn->saveQNNContext();
+    } else {
+        std::cerr << "QNN backend not initialized.\n";
+        return -1;
+    }
examples/demo_bailing_moe.cpp (1)

39-41: Guard device selection under build flags.

MLLM_OPENCL path is asserted even when built without OpenCL. Consider failing fast on invalid -d values in non-OpenCL builds.

Wrap the assertion under #ifdef USE_OPENCL or normalize device to MLLM_CPU when OpenCL isn’t compiled.
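
A minimal sketch, assuming the demo parses the -d value into device_str and stores the enum in device (illustrative names):

// Normalize the requested device when the OpenCL backend wasn't compiled in.
#ifdef USE_OPENCL
    if (device_str == "opencl") device = MLLM_OPENCL;
#else
    if (device_str == "opencl") {
        std::cerr << "This build has no OpenCL support; falling back to CPU.\n";
        device = MLLM_CPU;
    }
#endif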

examples/demo_qwen2.5_vl.cpp (2)

38-46: Guard against mismatched image/text counts.

Future edits may uncomment more images or prompts, causing out-of-range access.

     vector<string> in_imgs = {
         // "../assets/bus.png",
         "../assets/two_cats.jpg",
         // "../assets/bird_image.jpg",
     };
     vector<string> in_strs = {
         "<|vision_start|><|image_pad|><|vision_end|>Describe this image.",
     };
+
+    if (in_imgs.size() != in_strs.size()) {
+        std::cerr << "Size mismatch: in_imgs=" << in_imgs.size()
+                  << " vs in_strs=" << in_strs.size() << std::endl;
+        return 1;
+    }

27-27: Normalize "billion" casing for consistency; apply mapping to both 3B and 7B.

The current code only maps "3B" → "3b" but leaves "7B" unchanged. While QWenConfig's constructor performs lowercase conversion internally, explicitly normalizing both values in the demo improves clarity and consistency with the help text "[3B | 7B |]".

-    string model_billion = cmdParser.get<string>("billion") == "3B" ? "3b" : cmdParser.get<string>("billion");
+    string model_billion = cmdParser.get<string>("billion");
+    if (model_billion == "3B") model_billion = "3b";
+    if (model_billion == "7B") model_billion = "7b";
examples/demo_sparse_llama.cpp (1)

16-18: Avoid hardcoded second model path; expose via CLI and validate existence.

Baking "../models/ReLULlama_q4_k.mllm" makes the demo brittle across environments.

-    // cmdParser.add<string>("predictor", 'p', "specify mllm model predictor path", false, "../models/ReLULlama_predictor.mllm");
+    cmdParser.add<string>("extra", 'x', "extra model file (e.g., merged or adapter)", false, "../models/ReLULlama_q4_k.mllm");
...
-    // string predictor_path = cmdParser.get<string>("predictor");
+    string extra_path = cmdParser.get<string>("extra");
...
-    model.load_multifile({model_path, "../models/ReLULlama_q4_k.mllm"});
+    model.load_multifile({model_path, extra_path});

Optionally add a std::filesystem::exists check to emit a helpful error if the file is missing.

Also applies to: 26-33

examples/demo_showui_npu.cpp (1)

64-66: Minor: tidy log output.

-    std::cout << "num_iter" << num_iter << std::endl;
+    std::cout << "num_iter: " << num_iter << std::endl;
mllm/backends/cpu/compute/GemmFp.hpp (2)

30-33: Avoid potential macro clashes with min; prefer a distinct name or std::min.

-static inline int min(int a, int b) {
+static inline int imin(int a, int b) {
     return a < b ? a : b;
 }

And replace call sites: min(...) → imin(...)


104-152: Clarify GEMM semantics (C += AB vs C = AB) and document zeroing requirements.

Micro-kernels and fallback use +=. If callers expect overwrite, ensure C is zeroed beforehand or add a beta parameter.

Add a brief comment above gemm_fp32/gemm_fp32_fp16 documenting accumulation semantics.
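
Something along these lines above the kernels would do (wording illustrative):

// NOTE: gemm_fp32 / gemm_fp32_fp16 ACCUMULATE into C (C += A * B).
// Callers that want C = A * B must zero C first, e.g.
//     std::memset(C, 0, sizeof(float) * M * N);
// or add a `beta` parameter that scales C before accumulating.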

mllm/backends/cpu/compute/Convolution.hpp (1)

8-10: Header hygiene: fix include paths and remove 'using namespace'

  • Use project-qualified paths to match new layout.
  • Avoid 'using namespace' in headers.
-#include "Tensor.hpp"
-#include "Types.hpp"
-using namespace mllm;
+#include "mllm/Tensor.hpp"
+#include "mllm/Types.hpp"

And qualify symbols with mllm:: in declarations if needed.

mllm/backends/cpu/compute/SageQuantize.hpp (3)

4-12: Tidy includes: add <type_traits>, drop unused iostream, qualify project header

-#include <iostream>
+#include <type_traits>
...
-#include "Types.hpp"
+#include "mllm/Types.hpp"

This drops the unused <iostream> weight and fixes compilation of std::is_same_v, which requires <type_traits>.


24-37: Remove unused AVX helper

_mm256_hmax_ps is declared but never used. Drop to reduce code surface.


86-110: Avoid per-call heap allocation in quantize_new_token_to_sage_blocks

Allocating std::vector<float> smoothed_row(dim_size) for every token is costly. Compute per block in-register.

-    std::vector<float> smoothed_row(dim_size);
-    for (int d = 0; d < dim_size; ++d)
-        smoothed_row[d] = new_token_vector[d] - current_mean_data[d];
     for (int g = 0; g < num_k_blocks; ++g) {
         const int offset = g * QK8_0F;
-        const float *smoothed_block_ptr = smoothed_row.data() + offset;
-        float max_abs_val = 0.0f;
-        for (int d = 0; d < QK8_0F; ++d)
-            max_abs_val = std::max(max_abs_val, fabsf(smoothed_block_ptr[d]));
+        float max_abs_val = 0.0f;
+        float tmp[QK8_0F];
+        #pragma unroll
+        for (int d = 0; d < QK8_0F; ++d) {
+            tmp[d] = new_token_vector[offset + d] - current_mean_data[offset + d];
+            max_abs_val = std::max(max_abs_val, fabsf(tmp[d]));
+        }
         const float scale = (max_abs_val > 1e-9f) ? max_abs_val / 127.0f : 0.0f;
         out_ptr[g].scale = scale;
         const float inv_scale = (scale > 1e-9f) ? 1.0f / scale : 0.0f;
-        for (int d = 0; d < QK8_0F; ++d)
-            out_ptr[g].qs[d] =
-                static_cast<int8_t>(roundf(smoothed_block_ptr[d] * inv_scale));
+        #pragma unroll
+        for (int d = 0; d < QK8_0F; ++d) {
+            out_ptr[g].qs[d] = static_cast<int8_t>(roundf(tmp[d] * inv_scale));
+        }
     }
mllm/backends/cpu/compute/Transpose2D.hpp (1)

2-4: Trim includes and qualify project header

-#include <iostream>
-#include "Types.hpp"
+#include "mllm/Types.hpp"

<iostream> is not used; remove it.

mllm/OpDefined.hpp (1)

10-134: Prefer scoped enums to avoid global identifier collisions.

Unscoped names like DIRECT, RANGE, VIEW can clash. Consider enum class OpType : int and enum class TensorFuncType : int. Migration can be phased.
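
A phased start could scope just the enum while keeping the underlying type, so existing int conversions stay explicit (sketch; enumerators abbreviated):

enum class OpType : int {
    ADD,
    MATMUL,
    VIEW, // no longer collides with a global VIEW
    // ...
};

// Call sites then qualify explicitly:
// if (op->type() == OpType::VIEW) { ... }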

CMakeLists.txt (2)

49-51: Be cautious resetting CMAKE_CXX_STANDARD_INCLUDE_DIRECTORIES.

Altering standard include dirs can trigger STL include failures (e.g., “string not found”). Prefer not to override this unless necessary.


258-299: Kleidiai third_party enforcement: add a clearer opt-out or fetch.

Hard FATAL if missing may hinder non-AArch64 devs. Gate with an option (e.g., KLEIDIAI_ENABLE) or provide a message guiding how to obtain it.

mllm/backends/cpu/op/CPUBinaryFunc.hpp (1)

33-43: Use the per-op thread_count or remove the field.

You store thread_count but never use it; OMP uses CPUBackend::cpu_threads.

-#pragma omp parallel for collapse(3) num_threads(CPUBackend::cpu_threads)
+#pragma omp parallel for collapse(3) num_threads(thread_count)

Apply similarly across ops, or drop the member and ctor arg if centralizing on CPUBackend::cpu_threads.

Also applies to: 71-82, 109-121, 149-161, 188-202, 220-246, 269-286, 309-338, 361-384

mllm/backends/cpu/compute/Pooling.cpp (1)

126-132: Initialize max with -inf, not a magic constant.

Use numeric limits for correctness across ranges.

-                    float value = -999999;
+                    float value = -std::numeric_limits<float>::infinity();

Remember to include <limits> if it is not already present in this TU.

mllm/Op.hpp (2)

64-64: Consider verifying inputs vector is non-empty.

The code now propagates ctype from inputs[0] to outputs, but doesn't check if inputs is empty. While this may be guaranteed by calling context, adding a defensive check could prevent crashes.

Consider adding a safety check:

 virtual ErrorCode setUp(vector<shared_ptr<Tensor>> inputs, vector<shared_ptr<Tensor>> outputs) {
+    assert(!inputs.empty() && "setUp requires at least one input");
     for (auto &output : outputs) {
         output->setDtype(activation_dtype_);
         output->setCtype(inputs[0]->ctype());

135-137: Mutable reference to internal state.

Exposing traced_ via a mutable reference allows external code to directly modify the internal tracing state. While this enables fine-grained control for tracing/instrumentation, it also breaks encapsulation. Consider whether a setter method would be more appropriate:

bool traced() const { return traced_; }
void setTraced(bool traced) { traced_ = traced; }
mllm/Context.cpp (1)

15-42: Commented backend init: remove or refactor with smart pointers/registry.

Large commented code with raw new risks bitrot. Either delete it or reintroduce it behind a factory using std::unique_ptr/shared_ptr and a registry to avoid leaks and globals.
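
For reference, a registry-based reintroduction might look roughly like this (BackendRegistry and the BackendType usage are assumptions, not existing project types):

#include <functional>
#include <map>
#include <memory>

using BackendFactory = std::function<std::unique_ptr<Backend>()>;

class BackendRegistry {
public:
    void registerFactory(BackendType type, BackendFactory f) { factories_[type] = std::move(f); }
    std::unique_ptr<Backend> create(BackendType type) const {
        auto it = factories_.find(type);
        return it == factories_.end() ? nullptr : it->second();
    }
private:
    std::map<BackendType, BackendFactory> factories_;
};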

examples/demo_qwen2_vl_vtp.cpp (2)

26-27: Remove unused variables.

thread_num and param_loader are never used.

-    int thread_num = cmdParser.get<int>("thread");
     CPUBackend::cpu_threads = cmdParser.get<int>("thread");
@@
-    ParamLoader param_loader(model_path);

Also applies to: 34-34


66-66: Prefer iostream consistently.

Use std::cout << '\n' instead of printf.

-        printf("\n");
+        std::cout << '\n';
mllm/ParamLoader.hpp (1)

105-108: Tighten API: const/noexcept and consistent alias.

Make getInputStream const and use the mllm_file alias; mark trivial getters noexcept.

-    ParamMetadata getParamMetadata(const std::string &name);
-    FILE *getInputStream();
-    std::string getParamPath() const;
+    ParamMetadata getParamMetadata(const std::string &name) const noexcept;
+    mllm_file *getInputStream() const noexcept;
+    std::string getParamPath() const noexcept;

Note: if metadata lookup mutates caches, drop noexcept and const accordingly.

examples/demo_showui_vtp.cpp (3)

25-26: Remove unused variables.

thread_num and param_loader are unused.

-    int thread_num = cmdParser.get<int>("thread");
     CPUBackend::cpu_threads = cmdParser.get<int>("thread");
@@
-    ParamLoader param_loader(model_path);

Also applies to: 28-29


36-41: Ensure inputs are paired; guard for mismatches.

Protect against future edits causing OOB during batching.

-    vector<string> in_imgs = {
+    vector<string> in_imgs = {
         "../assets/uidemo2.png"};
@@
-    for (int i = 0; i < in_strs.size(); ++i) {
+    if (in_imgs.size() != in_strs.size()) {
+        std::cerr << "in_imgs and in_strs size mismatch\n";
+        return 1;
+    }
+    for (size_t i = 0; i < in_strs.size(); ++i) {

59-59: Use iostream newline.

Prefer std::cout << '\n' over printf.

-        printf("\n");
+        std::cout << '\n';
examples/demo_qwen2_vl.cpp (4)

27-27: Normalize and validate --billion mapping.

Hard-coding only "2B" -> "1.5b" is brittle (case/alias variants, unhandled "7B"). Normalize case and handle known aliases; reject unknowns early.

Apply:

-    string model_billion = cmdParser.get<string>("billion") == "2B" ? "1.5b" : cmdParser.get<string>("billion");
+    auto billion_arg = cmdParser.get<string>("billion");
+    std::transform(billion_arg.begin(), billion_arg.end(), billion_arg.begin(), ::tolower);
+    string model_billion;
+    if (billion_arg == "2b" || billion_arg == "2") {
+        model_billion = "1.5b";
+    } else if (billion_arg == "7b" || billion_arg == "7") {
+        model_billion = "7b";
+    } else {
+        std::cerr << "Unsupported --billion: " << billion_arg << std::endl;
+        return 1;
+    }

Confirm Qwen2VLConfig accepted values for model_billion ("1.5b", "7b", etc.) to avoid runtime mismatches. If needed, I can search and adapt the mapping.


19-21: Polish help text for --billion.

Help string has a dangling | and narrow options. Clarify accepted values.

-    cmdParser.add<string>("billion", 'b', "[2B | 7B |]", false, "2B");
+    cmdParser.add<string>("billion", 'b', "model size: 2B|7B (aliases: 2,7)", false, "2B");

29-31: Remove unused variable.

thread_num is never used.

-    int thread_num = cmdParser.get<int>("thread");
-    CPUBackend::cpu_threads = cmdParser.get<int>("thread");
+    CPUBackend::cpu_threads = cmdParser.get<int>("thread");

32-32: Avoid unused ParamLoader construction.

This opens the model file but is unused; wastes startup time.

-    ParamLoader param_loader(model_path);
mllm/ParamLoader.cpp (2)

95-103: Null fp_ guard in file-IO path.

If fp_ is null (e.g., failed open), fseek/fread will crash. Guard early.

-        if (offsets_.find(name) == offsets_.end()) { return false; }
+        if (!fp_) { return false; }
+        if (offsets_.find(name) == offsets_.end()) { return false; }

Optionally check fread return and propagate errors.


299-305: Avoid raw new[] ownership escape.

Returning a raw heap buffer risks leaks. Prefer std::vector<uint8_t>.

-std::tuple<uint8_t *, uint64_t> ParamLoader::load(string name) {
+std::vector<uint8_t> ParamLoader::load(string name) {
@@
-    auto *data = new uint8_t[length];
+    std::vector<uint8_t> data(length);
@@
-    auto _ = fread(data, sizeof(uint8_t), length, fp_);
-    return std::make_tuple(data, length);
+    auto _ = fread(data.data(), sizeof(uint8_t), length, fp_);
+    data.resize(_);
+    return data;

Note: update declaration in header and call sites.

mllm/Context.hpp (2)

3-6: Trim unused heavy includes in header to speed builds.

Backend.hpp seems only referenced in commented code. Prefer forward declarations or remove.

-#include "Backend.hpp"
+// #include "Backend.hpp" // avoid heavy include; keep in .cpp if needed

9-12: Consider marking Instance() as noexcept.

Singleton creation should not throw; helps callers.

-    static Context &Instance();
+    static Context &Instance() noexcept;

Requires matching change in Context.cpp.

examples/demo_qwen.cpp (1)

27-33: Platform macro and help text polish.

  • ARM macro: use compiler-defined macros.
  • “billion” help string inconsistent with defaults.

Apply:

-#if defined(ARM)
+#if defined(__aarch64__) || defined(__arm__)
     default_model_path = "../models/qwen-2.5-1.5b-instruct-kai_q4_0_lm.mllm";
     default_model_billion = "1.5b-lm";
 #endif
-    cmdParser.add<string>("billion", 'b', "[0.5B | 1.8B | 1.5B | 3B |]", false, default_model_billion);
+    cmdParser.add<string>("billion", 'b', "[0.5b | 1.5b | 3b | 1.5b-lm]", false, default_model_billion);

Also applies to: 32-32

mllm/Parallel.hpp (1)

34-40: Minor: noisy stdout in library code.

Consider guarding the "num_graph" print with DEBUG to reduce console noise in demos.

-        std::cout << "num_graph: " << num_graph << std::endl;
+#ifdef DEBUGPRINT
+        std::cout << "num_graph: " << num_graph << std::endl;
+#endif
mllm/backends/cpu/compute/Transpose3D.hpp (1)

16-18: Guard OpenMP header for non-OpenMP builds.

Unconditional <omp.h> include can break toolchains without OpenMP headers.

-#include <omp.h>
+#ifdef _OPENMP
+#include <omp.h>
+#endif

Note: the pragmas are ignored if OpenMP is off.

examples/demo_qwen2_vl_npu.cpp (1)

39-39: Remove unused variables and streamline thread setting.

  • ParamLoader is unused.
  • Use parsed thread_num to set CPU threads.
-    ParamLoader param_loader(model_path);
+    // ParamLoader not needed here; remove if unused.
@@
-    int thread_num = cmdParser.get<int>("thread");
-    CPUBackend::cpu_threads = cmdParser.get<int>("thread");
+    int thread_num = cmdParser.get<int>("thread");
+    CPUBackend::cpu_threads = thread_num;

Also applies to: 29-31

examples/demo_qwen_npu.cpp (1)

1-3: Header include path sanity-check (Context.hpp).

Static analysis flagged "Context.hpp not found". If include dirs don’t add mllm/, consider switching to "mllm/Context.hpp" (and likewise for other project headers) or fix include paths in the build. Also remove the redundant commented saveQNNContext line to avoid confusion.

-            // static_cast<QNNBackend *>(Backend::global_backends[MLLM_QNN].get())->saveQNNContext();
+            // (removed duplicate commented-out call)

Also applies to: 8-8

mllm/DataType.hpp (1)

39-48: Packed structs + vector loads: verify unaligned access path.

These blocks are packed (pragma pack(1)). NEON vld1/vld1q tolerate unaligned addresses on AArch64 but can be slower. If hotspots load these arrays via NEON, consider aligning parent allocations to 16 bytes or using memcpy to a local aligned buffer in inner loops.

Also applies to: 60-69, 103-118, 167-174
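
In a hot loop, the memcpy workaround could look like this (src_qs stands for a pointer into one packed block; requires <arm_neon.h> and <cstring>):

alignas(16) int8_t lane_buf[16];
std::memcpy(lane_buf, src_qs, sizeof(lane_buf)); // packed, possibly unaligned source
int8x16_t v = vld1q_s8(lane_buf);                // guaranteed-aligned load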

mllm/backends/cpu/compute/GemmQ2K.cpp (1)

44-61: Avoid hard‑coding OpenMP threads; respect runtime/config.

num_threads(4) hard‑codes parallelism and may fight CPUBackend::cpu_threads or OMP settings. Drop it or plumb a parameter.

-#pragma omp parallel for num_threads(4)
+#pragma omp parallel for
 ...
-#pragma omp parallel for num_threads(4)
+#pragma omp parallel for
 ...
-#pragma omp parallel for num_threads(4)
+#pragma omp parallel for

Also applies to: 226-292, 318-348

mllm/Types.hpp (2)

25-33: Header globals: clarify const vs runtime-tunable; ensure C++17.

KVCache_TYPE looks constant while KVCache_Type_eager/KVCache_batch are runtime-tunable. Consider:

  • Make KVCache_TYPE constexpr int (true constant).
  • Keep others as inline int but document thread-safety if mutated at runtime (atomic if written cross-thread).
    Also ensure the project enforces C++17 to support inline variables.
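
A sketch of the split, assuming C++17 and that the tunables really are written cross-thread (initial values illustrative):

#include <atomic>

constexpr int KVCache_TYPE = 32;                    // true compile-time constant
inline std::atomic<bool> KVCache_Type_eager{false}; // runtime-tunable, safe to flip from any thread
inline std::atomic<int>  KVCache_batch{1};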

114-125: New BHSD mapping: double-check key collisions.

Chls2Type now mixes 4D and 5D keys. That’s fine, but ensure no duplicate 4D permutations map inconsistently elsewhere. Add a brief comment documenting BHSD semantics for maintainers.

mllm/backends/cpu/compute/Split.hpp (3)

196-223: F32 path can copy as a single contiguous block (faster, simpler).

Within an outer_idx block, the [split_size * inner_loop_size] region is contiguous. Prefer one memcpy; let libc use tuned SIMD.

-                const size_t copy_bytes = split_size * inner_loop_size * sizeof(float);
-                // memcpy(dst_base, src_base, copy_bytes);
-                for (int split_idx = 0; split_idx < split_size; ++split_idx) {
-                    const float *src = src_base + split_idx * inner_loop_size;
-                    float *dst = dst_base + split_idx * inner_loop_size;
-                    int count = inner_loop_size;
-#if defined(__AVX__)
-                    for (; count >= 8; count -= 8) {
-                        __m256 data = _mm256_loadu_ps(src);
-                        _mm256_storeu_ps(dst, data);
-                        src += 8;
-                        dst += 8;
-                    }
-#elif defined(__ARM_NEON)
-                    for (; count >= 4; count -= 4) {
-                        float32x4_t data = vld1q_f32(src);
-                        vst1q_f32(dst, data);
-                        src += 4;
-                        dst += 4;
-                    }
-#endif
-                    for (; count > 0; --count) *dst++ = *src++;
-                }
+                const size_t copy_bytes = static_cast<size_t>(split_size) * inner_loop_size * sizeof(float);
+                std::memcpy(dst_base, src_base, copy_bytes);

184-186: OpenMP loop index: prefer signed int for widest OMP compatibility.

Some OpenMP toolchains are stricter with signed loop vars. Minor, but avoids warnings.

-    for (size_t i = 0; i < out.size(); ++i) {
+    for (int i = 0; i < static_cast<int>(out.size()); ++i) {

132-134: Header include path robustness.

Including "Types.hpp" from a nested folder relies on include_dirs being set. Prefer "mllm/Types.hpp" for resilience across targets, or ensure include dirs add ${PROJECT_SOURCE_DIR}/mllm.

examples/CMakeLists.txt (2)

31-40: OpenMP linking via flags is brittle; use imported target.

Passing -fopenmp/-static-openmp in target_link_libraries is non‑portable. Prefer OpenMP::OpenMP_CXX set up at configure time; handle static/dynamic elsewhere.

-        if (ARM AND NOT (CMAKE_HOST_SYSTEM_NAME STREQUAL "Darwin" AND NOT CMAKE_CROSSCOMPILING))
-            target_link_libraries(${target} PUBLIC -fopenmp -static-openmp)
-        else()
-            target_link_libraries(${target} PUBLIC -fopenmp)
-        endif()
+        target_link_libraries(${target} PUBLIC OpenMP::OpenMP_CXX)

Also replace ARM detection with a consistent check (CMAKE_SYSTEM_PROCESSOR) if you need special-casing.


52-58: mllm_llm/mllm_vlm should propagate include dirs.

Examples include headers like "Context.hpp". Ensure these libraries set PUBLIC include dirs (e.g., ${PROJECT_SOURCE_DIR} and ${PROJECT_SOURCE_DIR}/mllm) so example targets build without local include tweaks.

# in the same block after add_library(...)
target_include_directories(mllm_llm PUBLIC ${PROJECT_SOURCE_DIR} ${PROJECT_SOURCE_DIR}/mllm)
target_include_directories(mllm_vlm PUBLIC ${PROJECT_SOURCE_DIR} ${PROJECT_SOURCE_DIR}/mllm)
mllm/backends/cpu/CMakeLists.txt (1)

154-171: Avoid undefined ARM/APK vars; use imported target for OpenMP.

The ARM/APK condition uses undeclared cache vars, leading to fragile paths. Prefer a single branch:

-if(OpenMP_FOUND)
-    ...
-    if(ARM AND NOT APK)
-        ...
-    else()
-    target_link_libraries(mllm_cpu
-            PUBLIC
-            OpenMP::OpenMP_CXX
-        )
-    endif()
-endif()
+if(OpenMP_FOUND)
+    target_link_libraries(mllm_cpu PUBLIC OpenMP::OpenMP_CXX)
+endif()

If you truly need static OpenMP on specific cross-ARM, gate it via a well-defined option (e.g., MLLM_OPENMP_STATIC) and documented toolchain.

mllm/StateManager.hpp (1)

123-131: Make getters const and avoid unnecessary copies

These accessors don’t mutate state. Mark them const; optionally return a const reference for the vector to avoid copies.

-    unsigned int getLastDraftLength() {
+    unsigned int getLastDraftLength() const {
         return last_draft_length_;
     }
-    std::vector<unsigned int> getLastVerifiedPositionIds() {
-        return last_verified_position_ids_;
-    }
-    bool isUsingDraft() {
+    const std::vector<unsigned int> &getLastVerifiedPositionIds() const {
+        return last_verified_position_ids_;
+    }
+    bool isUsingDraft() const {
         return using_draft_;
     }
mllm/backends/cpu/CPUBackend.cpp (1)

383-392: DEBUGOPTIME: static counter is not thread-safe

op_count is a function-static shared across threads. If runOp can be called concurrently, this will data-race.

  • Make op_count thread_local, or
  • Protect updates with a mutex when DEBUGOPTIME is enabled.
-    static int op_count = 0;
+    thread_local static int op_count = 0;
mllm/Tensor.hpp (1)

150-161: Child/parent cleanup is not thread-safe.

Erasing from master->childTensors() without synchronization can race if tensors are used cross-thread.

Guard with a mutex on the container or document single-threaded graph construction/destruction assumptions.
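
If you take the mutex route, a sketch (children_mutex_ is a hypothetical new member; a std::vector of raw pointers is assumed for childTensors()):

#include <algorithm>
#include <mutex>

std::mutex children_mutex_; // add next to the child container in the tensor impl

// at the erase site:
{
    std::lock_guard<std::mutex> lk(master->children_mutex_);
    auto &kids = master->childTensors();
    kids.erase(std::remove(kids.begin(), kids.end(), this), kids.end());
}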

mllm/backends/cpu/compute/SageAttentionPT.hpp (1)

18-21: Duplicate AVX2 include.

Minor cleanup.

-#include <immintrin.h>
-#include <immintrin.h>
+#include <immintrin.h>

Comment thread .gitignore
.vscode/
.idea/
.cache/
.DS_Store

⚠️ Potential issue | 🟡 Minor

Remove duplicate .DS_Store entry.

Lines 5 and 40 both define .DS_Store. Keep one and remove the duplicate for clarity.

Apply this diff:

- .DS_Store
  examples/test.cpp

Remove line 40 and consolidate at line 5 (already present).

Also applies to: 40-40

🤖 Prompt for AI Agents
In .gitignore around lines 5 to 40, there is a duplicate .DS_Store entry
(present at line 5 and again at line 40); remove the duplicate at line 40 and
leave the single .DS_Store entry at line 5 so the file contains only one
.DS_Store line for clarity.

* @date 2025-07-01
*
*/
#include "Module.hpp"

⚠️ Potential issue | 🟠 Major

Fix include path or include dirs: 'Module.hpp' not found

Either adjust the include to the new layout or add include directories in CMake.

Suggested change:

-#include "Module.hpp"
+#include "mllm/Module.hpp"

If other headers moved similarly, align them too.

🧰 Tools
🪛 Clang (14.0.6)

[error] 8-8: 'Module.hpp' file not found

(clang-diagnostic-error)

🤖 Prompt for AI Agents
In examples/demo_bailing_moe_mbp.cpp around line 8, the include "#include
\"Module.hpp\"" cannot be found; update the include path to the file's new
location (for example adjust to the correct relative path like
"subdir/Module.hpp") or add the directory containing Module.hpp to the project's
include directories in CMakeLists.txt (target_include_directories or
include_directories) and apply the same fix to any other moved headers so all
includes align with the new layout.

Comment on lines +20 to +21
std::iostream::sync_with_stdio(false);
Module::alloc_mmap = false;

⚠️ Potential issue | 🟡 Minor

Use ios_base for sync_with_stdio

Prefer the canonical call to avoid ADL/scope surprises.

-    std::iostream::sync_with_stdio(false);
+    std::ios_base::sync_with_stdio(false);
🤖 Prompt for AI Agents
In examples/demo_bailing_moe_mbp.cpp around lines 20 to 21, the code calls
std::iostream::sync_with_stdio(false) which can trigger ADL/scope surprises;
replace this call with the canonical ios_base::sync_with_stdio(false) to call
the function on the standard ios_base type and avoid ambiguity.

Comment on lines +79 to +85
LlmTextGeneratorOpts opt{
.max_new_tokens = max_new_tokens > 0 ? max_new_tokens : static_cast<size_t>(tokens_limit - input_tensor.sequence()),
.do_sample = false,
.temperature = 0.3F,
.top_k = 50,
.top_p = 0.F,
};

⚠️ Potential issue | 🔴 Critical

Guard against negative max_new_tokens (size_t underflow → huge generation)

If input length ≥ tokens_limit, (tokens_limit - input_tensor.sequence()) becomes negative and, after cast, a gigantic size_t. Clamp to ≥ 0.

-        LlmTextGeneratorOpts opt{
-            .max_new_tokens = max_new_tokens > 0 ? max_new_tokens : static_cast<size_t>(tokens_limit - input_tensor.sequence()),
+        const int budget = tokens_limit - input_tensor.sequence();
+        LlmTextGeneratorOpts opt{
+            .max_new_tokens = max_new_tokens > 0 ? static_cast<size_t>(max_new_tokens)
+                                                 : static_cast<size_t>(std::max(0, budget)),
             .do_sample = false,
             .temperature = 0.3F,
             .top_k = 50,
             .top_p = 0.F,
         };
🤖 Prompt for AI Agents
In examples/demo_bailing_moe_mbp.cpp around lines 79 to 85, the expression
casting (tokens_limit - input_tensor.sequence()) to size_t can underflow when
input length >= tokens_limit, producing a huge max_new_tokens; change the logic
to clamp the computed new-token budget to zero (e.g., compute a signed
difference or check if input_tensor.sequence() >= tokens_limit and use 0,
otherwise use tokens_limit - input_tensor.sequence()) before casting, then
assign that non-negative value to max_new_tokens so generation is never set to a
massive size_t from underflow.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Confirm C++ standard for designated initializers

Designated initializers require C++20. Ensure the project sets CMAKE_CXX_STANDARD to 20 or adjust initialization.

Run:


🏁 Script executed:

#!/bin/bash
rg -nP 'CMAKE_CXX_STANDARD\s*\(?\s*([0-9]+)\s*\)?' -S -C2 -- cmake CMakeLists.txt

Length of output: 195


🏁 Script executed:

#!/bin/bash
# Verify the specific file and its content
fd -t f "demo_bailing_moe_mbp.cpp" --exec cat {} +

Length of output: 4560


🏁 Script executed:

#!/bin/bash
# Check for target-specific or per-file C++ standard settings
rg -nP '(set_target_properties|target_compile_features|set_source_files_properties)' -C3 --type=cmake

Length of output: 4422


🏁 Script executed:

#!/bin/bash
# Check CMAKE_CXX_FLAGS and compiler-related settings
rg -nP 'CMAKE_CXX_(FLAGS|STANDARD)' -C2 --type=cmake

Length of output: 2204


Fix C++ standard to 20 or use C++17 compatible initialization

The project sets CMAKE_CXX_STANDARD 17 (CMakeLists.txt:13), but the code uses designated initializers which require C++20. Either update CMAKE_CXX_STANDARD 17 to CMAKE_CXX_STANDARD 20 in CMakeLists.txt, or replace the designated initializer syntax with traditional initialization.
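
If you stay on C++17, the member-wise form is equivalent, assuming LlmTextGeneratorOpts is default-constructible (budget is the signed value from the earlier underflow fix):

LlmTextGeneratorOpts opt;
opt.max_new_tokens = max_new_tokens > 0 ? static_cast<size_t>(max_new_tokens)
                                        : static_cast<size_t>(std::max(0, budget));
opt.do_sample = false;
opt.temperature = 0.3F;
opt.top_k = 50;
opt.top_p = 0.F;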

🤖 Prompt for AI Agents
In examples/demo_bailing_moe_mbp.cpp around lines 79 to 85, the code uses C++20
designated initializers for LlmTextGeneratorOpts which conflicts with the
project C++17 setting; either update CMakeLists.txt to set CMAKE_CXX_STANDARD to
20 (or 23) so the compiler allows designated initializers, or change this
initialization to C++17-compatible syntax by constructing the struct with
positional/aggregate initialization or by default-constructing then assigning
each field (e.g., LlmTextGeneratorOpts opt; opt.max_new_tokens = ...;
opt.do_sample = ...; etc.), ensuring no C++20-only features remain.

Comment on lines +8 to +13
#include "Types.hpp"
#include "cmdline.h"
#include "models/ling/configuration_bailing_moe.hpp"
#include "models/ling/modeling_bailing_moe.hpp"
#include "models/ling/tokenization_bailing.hpp"


⚠️ Potential issue | 🔴 Critical

Resolve missing headers and include paths.

Clang: 'Types.hpp' file not found. Also this TU uses std and CPUBackend symbols.

Apply:

-#include "Types.hpp"
+#include "mllm/Types.hpp"
+#include "mllm/backends/cpu/CPUBackend.hpp"
+#include <cassert>
+#include <iostream>
+#include <string>
+#include <vector>
🧰 Tools
🪛 Clang (14.0.6)

[error] 8-8: 'Types.hpp' file not found

(clang-diagnostic-error)

🤖 Prompt for AI Agents
In examples/demo_bailing_moe.cpp around lines 8 to 13, the translation unit is
failing because Types.hpp cannot be found and the file uses std and CPUBackend
symbols without the proper headers; fix by (1) correcting the include to the
real path or adding the containing directory to the compiler's include paths so
"Types.hpp" resolves, (2) adding the missing header that declares CPUBackend
(e.g., the project's backend header that defines CPUBackend), and (3) including
or qualifying the C++ standard headers used (e.g., include <string>, <vector> or
<iostream> as needed, or prefix types with std::) to remove implicit std
references.

Comment thread mllm/Generate.hpp
Comment on lines +167 to +189
inline void _tensor_to_vec_of_multiIndices(Tensor &t, std::vector<std::vector<float>> &scores, std::vector<int> indices) {
assert(t.batch() == 1 && "Batch size of result is not 1. Which is not supported for now.");
assert(t.head() == 1 && "The 3rd dim of result should be one. e.g.:[1, 1, seq, hidden]");
int _dims = t.dimension();
// TODO: handle padding for QNN
// padding prefill for QNN
// if (is_padding) {
// if (chunk_size > 0) {
// _seq = (seq_before_padding - 1) % chunk_size;
// } else {
// _seq = seq_before_padding - 1;
// }
// }
for (int idx = 0; idx < indices.size(); ++idx) {
std::vector<float> values(t.dimension());
int _seq = indices[idx];
for (int i = 0; i < _dims; ++i) {
auto value = t.dataAt<float>(0, 0, _seq, i);
values[i] = value;
}
scores.push_back(values);
}
unsigned int generate(Tensor &t) override {
std::cerr << "Should use generate_SD" << std::endl;
assert(false);
return -1;
};
unsigned int generate_SD(Tensor &t, TracePool &tp);
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Inconsistent dtype handling across tensor conversion methods.

The newly added _tensor_to_vec_of_multiIndices method uses t.dataAt<float>() (line 184) without dtype checking, while the modified _tensor_to_vec method (lines 84-92) properly handles both FP16 and FP32. Since this PR introduces FP16 support infrastructure, FP16 tensors may be passed to this method, causing incorrect data interpretation or runtime errors.

Similarly, other existing methods (_tensor_to_vec_with_idx line 102, _tensor_to_multivec line 124, _tensor_to_multivec_with_idx line 148) also lack dtype checking and should be verified.

Apply this diff to add dtype handling to the new method:

         for (int idx = 0; idx < indices.size(); ++idx) {
             std::vector<float> values(t.dimension());
             int _seq = indices[idx];
             for (int i = 0; i < _dims; ++i) {
-                auto value = t.dataAt<float>(0, 0, _seq, i);
+                float value;
+                if (t.dtype() == MLLM_TYPE_F16) {
+                    value = MLLM_FP16_TO_FP32(t.dataAt<mllm_fp16_t>(0, 0, _seq, i));
+                } else if (t.dtype() == MLLM_TYPE_F32) {
+                    value = t.dataAt<float>(0, 0, _seq, i);
+                } else {
+                    throw std::runtime_error("Unsupported dtype for text generation.");
+                }
                 values[i] = value;
             }
             scores.push_back(values);

Verify whether other tensor-to-vector conversion methods need the same dtype handling:


🏁 Script executed:

#!/bin/bash
# Description: Find all tensor data access patterns in conversion methods to check for dtype handling

# Search for dataAt<float> calls in the file
rg -n "dataAt<float>" mllm/Generate.hpp

Length of output: 308


🏁 Script executed:

#!/bin/bash
# Extract context around each dataAt<float> line to see if dtype checking is present

# Show lines 80-110 to check _tensor_to_vec and _tensor_to_vec_with_idx
echo "=== Lines 80-110: _tensor_to_vec and _tensor_to_vec_with_idx ===" 
sed -n '80,110p' mllm/Generate.hpp

echo ""
echo "=== Lines 120-155: _tensor_to_multivec and _tensor_to_multivec_with_idx ==="
sed -n '120,155p' mllm/Generate.hpp

echo ""
echo "=== Lines 180-189: _tensor_to_vec_of_multiIndices ==="
sed -n '180,189p' mllm/Generate.hpp

echo ""
echo "=== Check for dtype handling patterns in the file ==="
rg -n "MLLM_FP16_TO_FP32|t\.dtype()" mllm/Generate.hpp

Length of output: 3577


Inconsistent dtype handling across tensor conversion methods - verified critical issue.

The verification confirms the review comment is accurate. The _tensor_to_vec method (lines 85-91) has proper dtype handling with MLLM_FP16_TO_FP32 conversion, but the newly added _tensor_to_vec_of_multiIndices (line 184) and three other existing methods (_tensor_to_vec_with_idx line 102, _tensor_to_multivec line 124, _tensor_to_multivec_with_idx line 148) all lack dtype checking and directly use dataAt<float>(). Since FP16 support is being introduced, this will cause incorrect data interpretation when FP16 tensors are passed to these methods.

At minimum, apply the suggested dtype handling to the new method (lines 180-189). The three other existing methods should also be fixed to maintain consistency.

🤖 Prompt for AI Agents
In mllm/Generate.hpp around lines 167-189, _tensor_to_vec_of_multiIndices reads
tensor values with dataAt<float>() without handling FP16; update this function
to mirror the dtype-aware logic used in _tensor_to_vec (lines ~85-91): check the
tensor dtype (or an is_fp16 flag / MLLM_FP16_TO_FP32 macro), and if FP16 convert
values to float (e.g., load as FP16 and apply MLLM_FP16_TO_FP32) before storing
into the float vector; if FP32, continue using dataAt<float>(). Also apply the
same dtype-aware fix to the other methods mentioned (_tensor_to_vec_with_idx
~102, _tensor_to_multivec ~124, _tensor_to_multivec_with_idx ~148) to ensure
consistent FP16 support across all conversions.

Comment thread mllm/TensorImpl.hpp
Comment on lines 3 to 18
#include <cstdio>
#include <iomanip>
#include <cmath>
#include <fstream>
#include <iostream>
#include <map>
#include <memory>
#include <vector>
// #include <vector>
#ifdef _WIN32
#include <direct.h>
#else
#include <sys/stat.h>
#endif
#include <Types.hpp>
#include <assert.h>
// #include <sys/stat.h>

#include "OpDefined.hpp"
#include "Backend.hpp"
#include <Types.hpp>


⚠️ Potential issue | 🟠 Major

Fix missing standard headers and include path

Required headers are missing; Types include should be project-qualified.

 #include <cstdio>
#include <iostream>
 #include <map>
 #include <memory>
-// #include <vector>
+#include <vector>
+#include <string>
+#include <limits>
+#include <cstdint>
...
-#include <Types.hpp>
+#include "mllm/Types.hpp"
🧰 Tools
🪛 Clang (14.0.6)

[error] 3-3: 'cstdio' file not found

(clang-diagnostic-error)

🤖 Prompt for AI Agents
In mllm/TensorImpl.hpp around lines 3 to 18, add the missing standard headers
(at least <vector> and <cstdint> and <cstdlib> as they are commonly required by
tensor implementations) and change the project include from <Types.hpp> to the
project-qualified path "mllm/Types.hpp" (or the correct relative project path)
so the file compiles reliably and resolves the Types header unambiguously.

Comment thread mllm/TensorImpl.hpp
Comment on lines +171 to 208
void to(Backend *target_backend) {
if (backend_ == target_backend) {
return;
}
// Path 1: migrate from any backend to the host (CPU)
if (target_backend->type() == MLLM_CPU) {
if (location_ == ON_DEVICE) { // migrate from device to host
void *new_host_ptr = nullptr;
target_backend->alloc(&new_host_ptr, cntSize() + 16, 128);
backend_->copy_to_host(new_host_ptr, device_memory_);
backend_->free_device(device_memory_);
host_ptr_ = new_host_ptr;
// cl_device_buffer_ = nullptr;
device_memory_.handle = nullptr;
location_ = ON_HOST;
allocated_ = count_;
}
host_ptr_ = nullptr;
allocated_ = 0;
}
// Path 2: migrate from the host (CPU) to a device
else if (backend_->type() == MLLM_CPU) {
if (location_ == ON_HOST) {
device_memory_.size_in_bytes = cntSize();
target_backend->alloc_device(device_memory_, dtype_);
target_backend->copy_from_host(device_memory_, host_ptr_);
if (owns_host_ptr_) {
backend_->free(host_ptr_);
}
host_ptr_ = nullptr;
location_ = ON_DEVICE;
// allocated_ = 0;// todo1418
}
} else {
std::cout << "Device -> Device migration via Host" << std::endl;
this->to(Backend::global_backends[MLLM_CPU].get());
this->to(target_backend);
}
backend_ = target_backend;
}

🛠️ Refactor suggestion | 🟠 Major

Strengthen 'to' migration preconditions and error paths

Guard null backend_ and target_backend, and handle already-on-target+location cases cleanly.

-    void to(Backend *target_backend) {
-        if (backend_ == target_backend) {
+    void to(Backend *target_backend) {
+        assert(target_backend && "target_backend must be non-null");
+        assert(backend_ && "backend_ must be non-null before migration");
+        if (backend_ == target_backend) {
             return;
         }
🤖 Prompt for AI Agents
In mllm/TensorImpl.hpp around lines 171 to 208, add defensive checks and
stronger error handling in to(Backend *target_backend): validate that both
backend_ and target_backend are non-null up-front and return or throw a clear
error if not; early-return if backend_ == target_backend and the current
location_ already matches the intended ON_HOST/ON_DEVICE state to avoid no-op
migrations; when migrating CPU<->device, check results of
alloc/alloc_device/copy_to_host/copy_from_host and handle failures by cleaning
up any partially allocated memory and not updating backend_ or other state until
the operation completes successfully; ensure owns_host_ptr_ and free paths are
used only when safe (avoid double-free by checking host_ptr_ and
owns_host_ptr_), set device_memory_.handle and host_ptr_ only after successful
allocations/copies, and finally assign backend_ = target_backend only after the
migration succeeded.

Comment thread README.md
Comment on lines 33 to 38
- [2025 August 28] 🔥🔥🔥 Support for MLLM V1 is ending soon. Before its retirement, V1 will integrate the following features: GPT-OSS and NPU QWEN2-VL. MLLM will then transition to V2, which can be viewed on the V2 branch.
    V2 will include brand-new capabilities:
    - A more Pythonic model authoring approach with eager execution
    - Compilation support and MLLM IR for easier NPU integration
    - Support for parallel execution of multiple models
    - A more refined engineering implementation

⚠️ Potential issue | 🟡 Minor

Fix nested list indentation per Markdown standards.

Lines 35–38 use 4-space indentation; Markdown expects 2-space indentation for nested lists. This affects rendering consistency across platforms.

Apply this diff to correct the indentation:

 - [2025 August 28] 🔥🔥🔥 Support for MLLM V1 is ending soon. Before its retirement, V1 will integrate the following features: GPT-OSS and NPU QWEN2-VL. MLLM will then transition to V2, which can be viewed on the V2 branch.
-    V2 will include brand-new capabilities:
-    - A more Pythonic model authoring approach with eager execution
-    - Compilation support and MLLM IR for easier NPU integration
-    - Support for parallel execution of multiple models
-    - A more refined engineering implementation
+  V2 will include brand-new capabilities:
+  - A more Pythonic model authoring approach with eager execution
+  - Compilation support and MLLM IR for easier NPU integration
+  - Support for parallel execution of multiple models
+  - A more refined engineering implementation
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

35-35: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


36-36: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


37-37: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


38-38: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

🤖 Prompt for AI Agents
In README.md around lines 33 to 38 the nested bullet list under the V2
description is indented with 4 spaces which breaks Markdown rendering on some
platforms; change the nested list lines (35–38) to use 2-space indentation so
each sub-bullet lines up directly under the parent item (use two spaces before
each "-" and ensure same indentation for all nested bullets).

Comment thread README.md
*`NOTE:` The QNN backend is a preliminary version that can do end-to-end inference. It is still under active development for better performance and more supported models.*

We support running Qwen-1.5-1.8B-Chat using [Qualcomm QNN](https://www.qualcomm.com/developer/software/qualcomm-ai-engine-direct-sdk) to get Hexagon NPU acceleration on devices with Snapdragon 8 Gen3. The details of QNN environment set up and design is [here](./src/backends/qnn/README.md). The prefilling stage is performed by QNN & CPU, and the inference stage is performed by CPU.
We support running several Qwen family models including Qwen-2-vl using [Qualcomm QNN](https://www.qualcomm.com/developer/software/qualcomm-ai-engine-direct-sdk) to get Hexagon NPU acceleration on devices with Snapdragon 8 Gen3. The details of QNN environment set up and design is [here](./mllm/backends/qnn/README.md). The prefilling stage is performed by QNN & CPU, and the inference stage is performed by CPU.

⚠️ Potential issue | 🟡 Minor

Use descriptive link text instead of generic placeholders.

Lines 156 and 170 use generic "[here]" link text. Replace with descriptive text that indicates the target content.

Apply this diff:

- The details of QNN environment set up and design is [here](./mllm/backends/qnn/README.md).
+ The details of QNN environment set up and design is documented in the [QNN backend README](./mllm/backends/qnn/README.md).
- Details can be found in backend specific [README](./mllm/backends/qnn/README.md).
+ Details can be found in the [QNN backend README](./mllm/backends/qnn/README.md).

Also applies to: 170-170

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

156-156: Link text should be descriptive

(MD059, descriptive-link-text)

🤖 Prompt for AI Agents
In README.md around lines 156 and 170, replace the generic "[here]" link text
with descriptive link text that explains the destination (e.g., "QNN environment
setup and design" or "QNN backend README") so readers know what the link points
to; update both occurrences to use clear, specific anchor text and keep the
existing URL paths intact.
