
feat: add kai&qnn-vl&opencl #489

Merged
yirongjie merged 23 commits into UbiquitousLearning:main from yirongjie:main
Oct 27, 2025

Conversation

@yirongjie (Collaborator) commented Oct 27, 2025

Summary by CodeRabbit

  • New Features

    • Added OpenCL backend support for GPU acceleration
    • Enhanced QNN/NPU backend capabilities for accelerated device inference
    • Expanded quantization format support with new optimization paths
    • Added example programs for multiple model architectures
  • Improvements

    • Performance optimizations for CPU operations via SIMD acceleration
    • Enhanced attention mechanisms for improved inference speed
    • Build system updates for better dependency management
    • Project structure reorganization for improved maintainability

UbiquitousLearning and others added 22 commits June 14, 2025 15:17
* Squashed commit of the following:

commit efde6d0d014b647b8ceea59441aef1bd3ac424c0
Author: yirongjie <yirj0809@gmail.com>
Date:   Tue May 27 16:09:16 2025 +0000

    fix: merge

commit fe7fb476717e99df2eac23ab7fd1088e03cf8b3c
Merge: f52bb32e 20e94c0
Author: yirongjie <yirj0809@gmail.com>
Date:   Tue May 27 16:09:08 2025 +0000

    Merge branch 'main' of https://github.com/yirongjie/mllm

commit f52bb32e5dbf4edcd4998d664ae071a1b5c8dbbb
Author: yirongjie <yirj0809@gmail.com>
Date:   Tue May 27 12:25:08 2025 +0000

    fix: merge from qnn-qwen2vl;

commit 6f6c2442f750363c6789e7717861ea3a216cf356
Author: yirongjie <yirj0809@gmail.com>
Date:   Tue May 27 12:24:17 2025 +0000

    Squashed commit of the following:

    commit 4862c76
    Author: oreomaker <zh002919@outlook.com>
    Date:   Thu May 15 14:59:37 2025 +0800

        refact: use hvx qnn silu(faster); usable showui npu version

    commit 5df1b07
    Author: oreomaker <zh002919@outlook.com>
    Date:   Wed May 14 22:10:52 2025 +0800

        feat: qnn dequantize_add hvx op

    commit c813f55
    Author: oreomaker <zh002919@outlook.com>
    Date:   Tue May 13 09:50:06 2025 +0800

        chore: format qnn op package code

    commit ea215f0
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon May 12 11:34:38 2025 +0800

        feat: free act tensors after qnn vit embedding

    commit e4f5011
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon May 12 11:14:30 2025 +0800

        chore: remove save data in modeling qwen2vlnpu

    commit 2dcb677
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon May 12 10:48:34 2025 +0800

        fix: seperate weights for embedding-lmhead when using rotated qwen2vl/showui

    commit 4847318
    Author: oreomaker <zh002919@outlook.com>
    Date:   Sun May 11 21:16:59 2025 +0800

        fix: cpu tensor free bug(todo: handle tensor free)

    commit 799b673
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Sat May 10 22:51:11 2025 +0800

        feat : new qwen2_vl model.

    commit dd1817d
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Sat May 10 22:50:35 2025 +0800

        feat : support qwen2-vl rotation model with fp bias.

    commit 305dc5c
    Author: oreomaker <zh002919@outlook.com>
    Date:   Thu May 8 21:37:35 2025 +0800

        feat: runnable qwen2vl qnn showui(2*256)

    commit 8e14815
    Author: oreomaker <zh002919@outlook.com>
    Date:   Thu May 8 21:36:33 2025 +0800

        fix: pre processing of qwen2vl

    commit e041296
    Author: oreomaker <zh002919@outlook.com>
    Date:   Thu May 8 21:34:07 2025 +0800

        refact: qwen vl npu modeling using closetFactor view(64->8x8)
        feat: get_position_id padding in Qwen2VL_ImagePatchAndEmbedding

    commit 5b17204
    Author: oreomaker <zh002919@outlook.com>
    Date:   Thu May 8 21:29:13 2025 +0800

        feat: vit(visual_xx) tensor reuse for qnn (noted as: QNN VLM trick)

    commit 7c42658
    Author: oreomaker <zh002919@outlook.com>
    Date:   Thu May 8 21:26:49 2025 +0800

        feat: finish cpu pipeline mrope

    commit 0962c00
    Author: oreomaker <zh002919@outlook.com>
    Date:   Tue May 6 11:39:29 2025 +0800

        feat: pipeline multimodal rope

    commit 5317933
    Author: oreomaker <zh002919@outlook.com>
    Date:   Tue May 6 11:38:10 2025 +0800

        refactor: use old&fast qnn silu

    commit 5bd14de
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon Apr 28 21:10:48 2025 +0800

        feat: runnable qwen 2 vl npu

    commit 1df6eed
    Author: oreomaker <zh002919@outlook.com>
    Date:   Sun Apr 27 10:13:44 2025 +0800

        refactor: tensor.to(QNN)

    commit d3d29c4
    Author: oreomaker <zh002919@outlook.com>
    Date:   Sat Apr 26 21:22:52 2025 +0800

        chore: remove saveData in qwen2vl modeling

    commit c40e0c0
    Author: oreomaker <zh002919@outlook.com>
    Date:   Sat Apr 26 20:51:16 2025 +0800

        feat: add qnn retrieve context info log

    commit 175d3a2
    Author: oreomaker <zh002919@outlook.com>
    Date:   Sat Apr 26 20:46:14 2025 +0800

        fix: qwen 2 vl npu input tensor backend(correct version)

    commit 871e920
    Author: oreomaker <zh002919@outlook.com>
    Date:   Fri Apr 25 09:50:05 2025 +0800

        fix: quantize i16 arm neon macro

    commit a2b802c
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Wed Apr 23 18:33:26 2025 +0800

        fix : Qwen2-VL prefill bugs: 1.FP32 KVCache. 2.LMHead does not execute.

    commit 8c66604
    Author: oreomaker <zh002919@outlook.com>
    Date:   Fri Apr 18 15:35:03 2025 +0800

        fix: restore qwen2.5 modeling

    commit f138beb
    Author: oreomaker <zh002919@outlook.com>
    Date:   Fri Apr 18 15:28:35 2025 +0800

        fix: restore debug change

    commit 09e12ce
    Merge: d725942 9b271a9
    Author: oreomaker <zh002919@outlook.com>
    Date:   Fri Apr 18 13:39:10 2025 +0800

        Merge branch 'debug-qwen2.5' of github.com:liang1232018/mllm into debug-qwen2.5

    commit d725942
    Author: oreomaker <zh002919@outlook.com>
    Date:   Fri Apr 18 13:39:04 2025 +0800

        dev: qnn sigmoid version silu
        feat: qnn backend f16 type input

    commit 9b271a9
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Fri Apr 18 13:24:52 2025 +0800

        fix : linear W8A8 bias uint8 type bug

    commit 793a6c6
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Fri Apr 18 13:23:49 2025 +0800

        fix : Shadow linear triger condition.

    commit 4e24bca
    Author: oreomaker <zh002919@outlook.com>
    Date:   Wed Apr 16 20:53:07 2025 +0800

        qwen 2.5 debug

    commit 4d74756
    Author: oreomaker <zh002919@outlook.com>
    Date:   Wed Apr 16 20:52:33 2025 +0800

        fix: shadow linear

    commit 5866e2b
    Author: oreomaker <zh002919@outlook.com>
    Date:   Tue Apr 15 22:17:12 2025 +0800

        qwen 2.5 debug

    commit 29e9b92
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon Apr 14 09:28:45 2025 +0800

        fix: remove shadow linear if(round_value) logic

    commit a61e837
    Author: oreomaker <zh002919@outlook.com>
    Date:   Sun Apr 13 22:03:45 2025 +0800

        feat: int16 qkv for qwen2.5 vl npu

    commit 566f21d
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Sun Apr 13 18:45:06 2025 +0800

        fix : modeling input quantize to I8, but dequantize with I16 bug.

    commit 60639d0
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Sun Apr 13 18:44:18 2025 +0800

        fix : LLaMADequantize INT16 to FP32 shuffle order bugs.

    commit a5cc652
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Sun Apr 13 17:31:10 2025 +0800

        fix : LLaMAQuantize FP32 to INT16 round scale error.

    commit f139822
    Author: oreomaker <zh002919@outlook.com>
    Date:   Sat Apr 12 22:24:30 2025 +0800

        fix: qnn int 16 linear bias(use int8 bias scale)

    commit 8831811
    Author: oreomaker <zh002919@outlook.com>
    Date:   Sat Apr 12 15:03:40 2025 +0800

        debug: qnn int16 linear

    commit 088fe09
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Fri Apr 11 23:22:41 2025 +0800

        feat : support INT16 dequantize and quantize.

    commit 73ebe87
    Merge: b73c1c3 6007443
    Author: liang1232018 <40791416+liang1232018@users.noreply.github.com>
    Date:   Wed Apr 9 14:50:25 2025 +0800

        Merge pull request UbiquitousLearning#12 from liang1232018/develop-zh

        Develop zh

    commit 6007443
    Merge: 1c8647e b73c1c3
    Author: liang1232018 <40791416+liang1232018@users.noreply.github.com>
    Date:   Wed Apr 9 14:50:07 2025 +0800

        Merge branch 'develop-xdl' into develop-zh

    commit 1c8647e
    Author: oreomaker <zh002919@outlook.com>
    Date:   Tue Apr 8 21:39:56 2025 +0800

        fix: qnn quant scale pow(2,bit) -> pow(2,bit-1)

    commit cc760ae
    Author: oreomaker <zh002919@outlook.com>
    Date:   Tue Apr 8 17:03:17 2025 +0800

        fix: op create param type->dtype

    commit 6afa80c
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon Apr 7 15:25:21 2025 +0800

        feat: Tensor::saveData only do when STATIC_READY

    commit 2ebded3
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon Apr 7 15:24:11 2025 +0800

        feat: add qnn int16 layer param & op
        todo: qnn llama package implement

    commit 4faeca8
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon Mar 24 15:52:54 2025 +0800

        dev: runnable qwen2vl npu (buggy)

    commit ebf110e
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon Mar 24 15:46:23 2025 +0800

        feat: add qwen vl export tool (todo: simulate infer and profile tools)

    commit bde9a92
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon Mar 24 15:44:25 2025 +0800

        dev: a just working version of qwen 2.5 npu

    commit 126c283
    Merge: 25de8c3 9d33aaf
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon Mar 24 15:43:30 2025 +0800

        Merge branch 'fix-qnn-python' into develop-zh

    commit 9d33aaf
    Author: oreomaker <zh002919@outlook.com>
    Date:   Fri Mar 21 16:01:23 2025 +0800

        fix: qnn profile quant bugs

    commit 25de8c3
    Author: oreomaker <zh002919@outlook.com>
    Date:   Thu Mar 20 16:00:19 2025 +0800

        refactor: add graph split layer for QNN, change the modeling
        note: xnnpack is affected, should not merge

    commit 690a24e
    Author: oreomaker <zh002919@outlook.com>
    Date:   Mon Mar 17 17:45:34 2025 +0800

        feat: QNN load cache execute

    commit 4f28330
    Author: oreomaker <zh002919@outlook.com>
    Date:   Sun Mar 9 22:33:21 2025 +0800

        dev: QNN graph merging execute

    commit b73c1c3
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Tue Nov 12 23:28:12 2024 +0800

        feat : support decoding model configuration.

    commit ec3d4e5
    Author: xudaliang <xudaliang@pku.edu.cn>
    Date:   Tue Nov 12 20:31:45 2024 +0800

        feat : support Qwen2.5 npu.

commit 7246d53
Author: yirongjie <yirj0809@gmail.com>
Date:   Tue May 27 07:12:53 2025 +0000

    feat: set run in Backends

commit 1150241
Author: yirongjie <yirj0809@gmail.com>
Date:   Sat May 24 07:57:09 2025 +0000

    fix: getFunc

commit 24db241
Author: yirongjie <yirj0809@gmail.com>
Date:   Fri May 23 05:16:41 2025 +0000

    fix: tensor function <Tensor *> to shared_ptr<Tensor>

commit 0ecce75
Author: yirongjie <yirj0809@gmail.com>
Date:   Thu May 22 14:05:11 2025 +0000

    feat:eager cpu

commit 9835db5
Author: yirongjie <yirj0809@gmail.com>
Date:   Fri Apr 18 14:57:21 2025 +0000

    fix: vtp

commit 30c3046
Author: yirongjie <yirj0809@gmail.com>
Date:   Wed Apr 16 06:49:46 2025 +0000

    fix: vtp

commit b416268
Author: yirongjie <yirj0809@gmail.com>
Date:   Tue Apr 15 08:40:22 2025 +0000

    fix: vtp

commit 6430ca8
Author: yirongjie <yirj0809@gmail.com>
Date:   Mon Apr 14 12:53:58 2025 +0000

    feat: vtp

commit f86bff6
Author: yirongjie <yirj0809@gmail.com>
Date:   Sun Mar 23 09:41:14 2025 +0000

    ref: add ShowUI

* feat: add FlashAttention2 && fix: MULTIMODELROPE

* remove broken submodule

---------

Co-authored-by: yirongjie <yirj0809@gmail.com>
Co-authored-by: yi <yi@U-21T7VPF4-1903.local>
Co-authored-by: liang1232018 <40791416+liang1232018@users.noreply.github.com>
Co-authored-by: oreomaker <70836772+oreomaker@users.noreply.github.com>
Co-authored-by: oreomaker <zh002919@outlook.com>
Co-authored-by: xudaliang <xudaliang@pku.edu.cn>
Co-authored-by: xwk <1263212259@qq.com>
Co-authored-by: yuerqiqi <2500526025@qq.com>
coderabbitai Bot (Contributor) commented Oct 27, 2025

Caution

Review failed

Failed to post review comments

Walkthrough

Major architectural refactoring transitioning the codebase from src/ to mllm/ directory structure. Introduces comprehensive backend infrastructure (CPU, OpenCL, QNN), state management via singleton Context, new quantization types, SIMD-accelerated compute kernels, expanded tensor/module capabilities with smart pointer semantics, and replaces legacy examples with new model demos.
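
As background for the Context items below, a minimal sketch of the singleton pattern the walkthrough describes (the member functions shown are assumptions for illustration, not the actual mllm API):

// Hypothetical sketch: Meyers-singleton Context exposing an
// InferenceStateManager; names beyond those two are illustrative.
#include <cstdint>

class InferenceStateManager {
public:
    void setSequenceLength(int32_t len) { seq_len_ = len; } // assumed setter
    int32_t sequenceLength() const { return seq_len_; }
    void reset() { seq_len_ = 0; } // clear per-prompt inference state
private:
    int32_t seq_len_ = 0;
};

class Context {
public:
    static Context &instance() { // function-local static: thread-safe init since C++11
        static Context ctx;
        return ctx;
    }
    InferenceStateManager &inferenceState() { return state_; }
private:
    Context() = default;
    InferenceStateManager state_;
};

Note that thread-safe initialization of the singleton does not make the state mutations themselves thread-safe; that is the consistency concern raised in the review below.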

Changes

Cohort / File(s) Summary
Build Configuration & Path Migration
.clang-tidy.ignore, .gitignore, .gitmodules, CMakeLists.txt
Path remapping from src/ to mllm/ prefix; updated backend integration (OpenCL, QNN), architecture-specific flags, kleidiai kernel inclusion for ARM; AddressSanitizer support added; ASAN and OpenCL options introduced.
Documentation & Examples CMake
README.md, examples/CMakeLists.txt
Updated model tables with Hexagon NPU INT8 support; revised QNN pipeline references; introduced modular LLM/VLM library targets (mllm_llm, mllm_vlm); conditional executable creation with existence checks.
Core State & Context Management
mllm/Context.hpp, mllm/Context.cpp, mllm/StateManager.hpp
New singleton Context with InferenceStateManager (execution type, sequence lengths, QNN/CPU flags) and SpeculativeDecodingManager for draft state tracking.
Tensor & Type System
mllm/DataType.hpp, mllm/Types.hpp, mllm/Tensor.hpp, mllm/Tensor.cpp, mllm/TensorImpl.hpp
Added FP16 type aliases, quantization block structures; introduced DeviceMemory abstraction; Tensor now inherits from enable_shared_from_this; master/child relationships via weak_ptr; backend-aware allocation and device transitions; extensive operator overloads and tensor operations.
Backend & Operation Infrastructure
mllm/Backend.hpp, mllm/Backend.cpp, mllm/Op.hpp, mllm/OpDefined.hpp, include/OpDefined.hpp (removed)
Backend::global_backends changed to unique_ptr; new runOp signature replacing runFunc; device memory allocation stubs; OpType and TensorFuncType enums consolidated to mllm/OpDefined.hpp; Op::traced() accessor and dtype propagation in setUp.
Module & Layer Architecture
mllm/Module.hpp, mllm/Module.cpp, mllm/Layer.hpp, mllm/ParamLoader.hpp, mllm/ParamLoader.cpp
Module loader changed to shared_ptr; added batch generate() returning vector<vector>; Layer now manages Op creation, backend switching (to/cpu/cl); forwardNoInput() timing; multi-file loading via load_multifile; mmap-based ParamLoader with getParamMetadata, getInputStream APIs.
Generation & Trace
mllm/Generate.hpp, mllm/Trace.cpp, mllm/Parallel.hpp
FP16/FP32 dtype-aware score collection; greedy search method public interface; Trace now registers all inputs in activation_tensors; ChunkPipeline::run extended with clean_tensors parameter for cleanup.
CPU Backend Core
mllm/backends/cpu/CMakeLists.txt, mllm/backends/cpu/CPUBackend.hpp, mllm/backends/cpu/CPUBackend.cpp
CPU backend creation via CPUBackendCreator; comprehensive Op registration (arithmetic, neural nets, functions); convert_fp_data for FP16↔FP32; runOp orchestrates in-graph tracing and activation tensor management; conditional kleidiai/OpenMP support for ARM.
CPU Compute Headers & Utilities
mllm/backends/cpu/compute/ActivationFunction.hpp, mllm/backends/cpu/compute/Convolution.hpp/cpp, mllm/backends/cpu/compute/FeatureCheck.hpp, mllm/backends/cpu/compute/Pooling.hpp/cpp, mllm/backends/cpu/compute/SIMDMemory.hpp, mllm/backends/cpu/compute/Sigmoid.hpp
Updated includes to reference ggml third_party paths; Convolution public API (2D/3D); I8MM feature detection via ARM registers/sysctlbyname; Pooling with VecDotFP32 support; SIMD-accelerated Sigmoid (AVX2/NEON).
CPU GEMM & Matmul Implementations
mllm/backends/cpu/compute/Matmul.hpp/cpp, mllm/backends/cpu/compute/MatmulElastic.hpp/cpp, mllm/backends/cpu/compute/MatmulSparse.hpp/cpp, mllm/backends/cpu/compute/GemmFp.hpp
Replaced SME paths with LlamaFile GEMM; updated function pointer types (gemv_func, gemm_func); includes moved to ggml third_party; GemmFp adds FP32 and FP32↔FP16 micro-kernels (NEON/AVX) with packing.
Kleidiai GEMM Integration
mllm/backends/cpu/compute/GemmKleidiai.hpp, mllm/backends/cpu/compute/GemmKleidiai.cpp
Multi-precision GEMM with QSI4C32 (4-bit), FP16, FP32 paths; workspace management; packing/quantization helpers; BSHD layout support; runtime hardware capability detection.
Quantization GEMM & Q2K
mllm/backends/cpu/compute/GemmQ2K.hpp/cpp
Q2_K × Q8_K quantized GEMM and GEMV operations with NEON micro-kernels (8x16) and reference fallback.
Flash Attention & Sage Attention
mllm/backends/cpu/compute/FlashAttention2H.hpp, mllm/backends/cpu/compute/SageAttention.hpp, mllm/backends/cpu/compute/SageAttentionKVQ8.hpp, mllm/backends/cpu/compute/SageAttentionPT.hpp, mllm/backends/cpu/compute/SageQuantize.hpp
SIMD-accelerated FA2 with FP32 and mixed FP32/FP16 paths (BHSD layout); Sage Attention with KV quantization (Q8_0F), per-row quantization, mean computation, softmax with platform-optimized paths.
Transpose & Split Operations
mllm/backends/cpu/compute/Transpose2D.hpp, mllm/backends/cpu/compute/Transpose3D.hpp, mllm/backends/cpu/compute/Split.hpp
Matrix transpose with AVX/NEON 8x8/4x4 block optimization; 3D tensor transpose with permutation handling; efficient split with mixed FP32/FP16 outputs and type-aware conversion.
CPU Op Implementations
mllm/backends/cpu/op/CPUArgSortFunc.hpp, mllm/backends/cpu/op/CPUBinCountFunc.hpp, mllm/backends/cpu/op/CPUBinaryFunc.hpp, mllm/backends/cpu/op/CPUCat.cpp
Migrated from TensorFunction to Op-based architecture; reshape/execute return ErrorCode; Creator factory pattern; shift from args vector to constructor-injected parameters; multi-input operators (addTwo, subTwo, mulTwo, divTwo).
Removed Example Programs
examples/main_alpaca.cpp, examples/main_llama.cpp, examples/main_clip.cpp, examples/main_fuyu.cpp, examples/main_imagebind.cpp, examples/main_llava.cpp, examples/main_phonelm_npu.cpp, examples/main_phonelm_npu.hpp, examples/main_qwen_npu.cpp, examples/main_qwen_npu.hpp, examples/main_tinyllama.cpp, examples/main_vit.cpp
Deleted legacy inference examples and supporting headers.
New Model Demo Programs
examples/demo_qwen.cpp, examples/demo_qwen_batch.cpp, examples/demo_qwen_npu.cpp, examples/demo_qwen_npu_pipeline.cpp, examples/demo_qwen2.5_vl.cpp, examples/demo_qwen2_vl.cpp, examples/demo_qwen2_vl_npu.cpp, examples/demo_qwen2_vl_vtp.cpp, examples/demo_qwen3.cpp, examples/demo_llama3.cpp, examples/demo_showui.cpp, examples/demo_showui_npu.cpp, examples/demo_showui_vtp.cpp, examples/demo_bailing_moe.cpp, examples/demo_bailing_moe_mbp.cpp, examples/demo_smallthinker.cpp, examples/demo_smallthinker_mbp.cpp, examples/demo_minicpm_moe_mbm.cpp, examples/demo_minicpm_moe_mbp.cpp, examples/demo_tinyllama.cpp, examples/demo_sparse_llama.cpp, examples/demo_ds_qwen2.cpp, examples/demo_phonelm_npu.cpp, examples/demo_qwen2.5_npu.cpp (removed), examples/mllm_benchmark.cpp
New demos for Qwen (batch, NPU pipeline), Qwen2.5VL, Qwen2VL (CPU/NPU/VTP), Qwen3, LLaMA3, ShowUI, BailingMoE, SmallThinker, MiniCPM MoE, TinyLLaMA, SparseLLaMA; NPU variants use v2 API with context-based state management; ARM-specific defaults; OpenCL conditional setup; call clear_kvcache() and profiling() post-generation.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Areas requiring extra attention:

  • Tensor memory management refactoring: master_tensor_ and child_tensors_ conversion from raw pointers to weak_ptr/vector<weak_ptr> introduces safety but requires careful lifecycle verification, especially shallowCopyFrom and reconcileLayouts logic (see the sketch after this list)

  • Context singleton and InferenceStateManager: ensure thread-safe access patterns and state consistency across backends (CPU, QNN, OpenCL)

  • Backend::global_backends unique_ptr conversion: verify all access paths use .get() appropriately and no double-deletion or dangling pointer scenarios exist

  • Module::generate batch overload: new vector<vector> signature for batch generation requires validation of per-batch end detection and tensor cleanup

  • ParamLoader mmap implementation: alignment checks, fallback paths, and interaction with backend load_from_file hooks need verification

  • CPU Op Creator pattern migration: ensure all Op types properly register with CPUBackend and Creator factories wire OpParam → Op construction correctly

  • SIMD compute kernels (GemmFp, GemmKleidiai, FA2, SageAttention): platform-specific codepaths (AVX2, NEON) and fallback correctness; quantization block interpretations

  • Layer backend switching and initOp logic: backend type consistency checks and Op re-initialization on backend change

  • Examples NPU/QNN integration: v2 API usage, context state management, chunk handling, and graph freezing/switching logic

  • Possible missing ownership transfer semantics: verify Backend creation and ownership in Backend::global_backends is clear

  • State manager reset() completeness: ensure all inference state properly resets for new prompts/batches
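
As referenced in the first item above, a minimal sketch of the weak_ptr master/child pattern and its lifecycle check (TensorNode and its methods are illustrative; only the master_tensor_/child_tensors_ names come from the review):

// Hypothetical sketch: child tensors hold weak_ptr to a master tensor and
// must lock() before use, so a destroyed master is detected rather than dereferenced.
#include <memory>
#include <vector>

struct TensorNode : std::enable_shared_from_this<TensorNode> {
    std::weak_ptr<TensorNode> master_tensor_;
    std::vector<std::weak_ptr<TensorNode>> child_tensors_;

    void attachChild(const std::shared_ptr<TensorNode> &child) {
        child->master_tensor_ = shared_from_this(); // requires shared_ptr ownership
        child_tensors_.push_back(child);
    }

    std::shared_ptr<TensorNode> masterOrNull() const {
        return master_tensor_.lock(); // nullptr if the master was freed
    }
};

Nodes must be owned by a shared_ptr (e.g., created via std::make_shared) before attachChild is called, since shared_from_this requires it; lock() is the safe accessor, while checking expired() alone is racy.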

Possibly related PRs

Suggested reviewers

  • oreomaker
  • chenghuaWang
  • liang1232018

Poem

🐰 Whiskers twitching with delight,
Paths refactored left and right,
SIMD kernels now take flight,
State and context, shining bright!
Backends bloom—what a sight!

Pre-merge checks

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 11.75%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Title Check (❓ Inconclusive): The pull request title "feat: add kai&qnn-vl&opencl" uses vague abbreviations (kai, qnn-vl, opencl) and, while it references real components in the changeset, fails to capture the primary architectural change: a large directory restructuring from src/ to mllm/, along with extensive backend implementations, infrastructure additions (Context, StateManager, device memory management), and new compute kernels. The abbreviated form makes it hard for a developer scanning commit history to grasp the PR's intent. Resolution: consider a more descriptive title, e.g. "refactor: migrate project structure to mllm and add Kleidiai, QNN vision-language, and OpenCL backends" or "feat: add Kleidiai/QNN-VL/OpenCL support with project restructuring".
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.

coderabbitai Bot (Contributor) left a comment


Actionable comments posted: 103

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (19)
README.md (1)

188-199: Specify language for fenced code block.

The code block starting at line 190 should declare a language for proper syntax highlighting. Based on the context (shell output), this should be marked as a shell or plaintext result block.

Apply this diff:

- Result are as followed:
+ Results are as follows (expected output is shown below):
  
- ```
+ ```plaintext
  > ./demo_qwen_npu

Alternatively, if this is intended to be interactive shell output, use:

- ```
+ ```shell
  > ./demo_qwen_npu
mllm/backends/cpu/compute/Matmul.hpp (1)

17-23: Contradictory preprocessor guards; unreachable #error.

#ifndef __ARM_NEON inside #ifdef __ARM_NEON can never trigger.

Apply:

-#ifdef __ARM_NEON
-
-#ifndef __ARM_NEON
-#error \
-    "The mllm-advance Armv8 backend is enbaled but __ARM_NEON is not defined. Pls use cross-compile toolchains(such as NDK) to compile."
-#endif
+#ifndef __ARM_NEON
+#error "The mllm-advance Armv8 backend is enabled but __ARM_NEON is not defined. Use an Armv8 toolchain (e.g., NDK)."
+#else

And keep the existing closing #endif at Line 157 as-is.

mllm/backends/cpu/op/CPUCat.cpp (1)

65-70: Batch concat copy logic breaks for varying batch sizes.

copysize and destination offset assume all inputs have inputs[0]->batch(). This corrupts memory when batches differ.

Apply:

-    for (int n = 0; n < inputs.size(); ++n) {
-        auto copysize = inputs[0]->batch() * inputs[0]->head() * inputs[0]->sequence() * inputs[0]->dimension();
-        memcpy(outputs[0]->ptrAt<float>(n * inputs[0]->batch(), 0, 0, 0), inputs[n]->ptrAt<float>(0, 0, 0, 0), sizeof(float) * copysize);
-    }
+    int dst_batch_offset = 0;
+    const int head = inputs[0]->head();
+    const int seq  = inputs[0]->sequence();
+    const int dim  = inputs[0]->dimension();
+    const size_t elems_per_item = static_cast<size_t>(head) * seq * dim;
+    for (int n = 0; n < inputs.size(); ++n) {
+        const int nb = inputs[n]->batch();
+        const size_t elems = static_cast<size_t>(nb) * elems_per_item;
+        memcpy(outputs[0]->ptrAt<float>(dst_batch_offset, 0, 0, 0),
+               inputs[n]->ptrAt<float>(0, 0, 0, 0),
+               sizeof(float) * elems);
+        dst_batch_offset += nb;
+    }

If non-float dtypes are possible here, use rawHostPtr() plus type_size(dtype) instead. See next comment.

mllm/backends/cpu/compute/MatmulElastic.cpp (1)

187-191: Bitwise & used in loop condition; should be &&.

The condition n < (block + 1) * blck_0 & n < use_N uses bitwise &. Because relational operators bind tighter than &, it parses as (n < (block + 1) * blck_0) & (n < use_N) and yields the same truth value, but it forgoes short-circuit evaluation and obscures the boolean intent.

Apply:

-                    for (int n = block * blck_0; n < (block + 1) * blck_0 & n < use_N; n++) {
+                    for (int n = block * blck_0; n < (block + 1) * blck_0 && n < use_N; n++) {
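
A standalone illustration of the parse, for reference (toy snippet, not project code):

// '<' binds tighter than '&', so a < b & c < d groups as (a < b) & (c < d):
// same truth value as &&, but both sides are always evaluated.
#include <cstdio>

int main() {
    int n = 3, hi1 = 5, hi2 = 2;
    printf("%d %d\n", (n < hi1 & n < hi2), ((n < hi1) && (n < hi2))); // prints: 0 0
    return 0;
}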
mllm/TensorImpl.hpp (1)

252-262: Remove unreachable code in time()

The switch never executes due to the early return.

     int time() {
         assert(shape_.size() == 5);
-        return legacyShape(chls()[TIME]);
-        switch (ctype_) {
-        case BCTHW:
-            return legacyShape(2);
-        case BTHWC:
-            return legacyShape(1);
-        default: return -1;
-        }
+        return legacyShape(chls()[TIME]);
     }

If the intent was a layout-dependent path, decide one behavior and drop the other, or gate by ctype_ before returning.

mllm/backends/cpu/compute/Pooling.cpp (2)

55-63: Bug: assignment inside assert; should be comparison.

assert(padding_top = -blk_h); and assert(padding_left = -blk_w); assign rather than compare; worse, under NDEBUG the assert expression (and with it the assignment) is compiled out, so debug and release builds diverge.

-                        assert(padding_top = -blk_h);
+                        assert(padding_top == -blk_h);
...
-                        assert(padding_left = -blk_w);
+                        assert(padding_left == -blk_w);

If asserts are intended only for debug checks, consider removing the mutation to padding_* entirely and computing start offsets from -blk_* directly.

Also applies to: 129-135


69-73: Use logical &&, not bitwise &, in loop conditions.

Relational operators bind tighter than &, so the truth value is the same here, but bitwise & forgoes short-circuiting and hides the boolean intent; prefer &&.

-                    for (int k_h = start_k_h; k_h < kernel_h & blk_h + k_h < in_height; ++k_h) {
+                    for (int k_h = start_k_h; k_h < kernel_h && blk_h + k_h < in_height; ++k_h) {
...
-                        for (int k_w = start_k_w; k_w < kernel_w & blk_w + k_w < in_width; ++k_w) {
+                        for (int k_w = start_k_w; k_w < kernel_w && blk_w + k_w < in_width; ++k_w) {

Also applies to: 136-139

mllm/ParamLoader.hpp (1)

3-12: Missing standard headers cause TU‑order dependent builds.

This header uses FILE, vector, tuple, set, shared_ptr but doesn’t include them here. Add explicit includes.

 #include <cstdint>
 #include <map>
 #include <string>
 #include <utility>
 #include "Tensor.hpp"
 #include "Types.hpp"
 #include <initializer_list>
 #include <mutex>
+#include <cstdio>     // FILE, fread
+#include <memory>     // std::shared_ptr
+#include <set>        // std::set
+#include <tuple>      // std::tuple
+#include <vector>     // std::vector
examples/demo_qwen2_vl.cpp (1)

47-52: Guard against image/prompt length mismatch.

Indexing in_imgs[i] assumes same length as in_strs. Risk of OOB if prompts > images.

-        auto input_tensor = processor.process(in_str, in_imgs[i]);
+        const auto img_idx = std::min(i, static_cast<int>(in_imgs.size() - 1));
+        auto input_tensor = processor.process(in_str, in_imgs[img_idx]);
mllm/ParamLoader.cpp (3)

437-465: partialLoad mmap path reads from buffer_ which is never set in mmap ctor.

In the new design you only set mmap_buffer_; buffer_ stays null, so mmap partialLoad fails or UB.

Use mmap_buffer_.get():

-        if (!use_mmap_ || buffer_ == nullptr || offsets_.find(name) == offsets_.end()) {
+        if (!use_mmap_ || !mmap_buffer_ || offsets_.find(name) == offsets_.end()) {
             fprintf(stderr, "Error: mmap not initialized or tensor name not found for mmap partialLoad.\n");
             return false;
         }
@@
-            uint8_t *source_ptr_in_mmap = buffer_ + offset_info.first + (static_cast<uint64_t>(row) * colNum) * perValueLength;
+            uint8_t *source_ptr_in_mmap = mmap_buffer_.get() + offset_info.first
+                                        + (static_cast<uint64_t>(row) * colNum) * perValueLength;

Also, consider validating bounds before memcpy.


143-152: Avoid double-unmap; align dtor with shared_ptr-managed mapping.

mmap_buffer_ already unmaps via deleter; dtor also munmaps buffer_. If you ever set both to same region, this double-unmaps.

-    if (use_mmap_) {
-        if (use_mmap_ && buffer_ != nullptr && buffer_ != MAP_FAILED) {
-            munmap(buffer_, size_);
-            buffer_ = nullptr;
-        }
-    }
+    if (use_mmap_ && buffer_ && !mmap_buffer_) {
+        munmap(buffer_, size_);
+        buffer_ = nullptr;
+    }

393-401: MultiFileParamLoader dtor never closes files; condition inverted and local include.

  • #include <set> inside the function is odd.
  • if (closed.find(p.second) != closed.end()) closes only when already present; no files are closed.
-MultiFileParamLoader::~MultiFileParamLoader() {
-#include <set>
-    std::set<FILE *> closed;
-    for (const auto &p : files_) {
-        if (closed.find(p.second) != closed.end()) {
-            fclose(p.second);
-            closed.insert(p.second);
-        }
-    }
-}
+MultiFileParamLoader::~MultiFileParamLoader() {
+    std::set<FILE *> closed;
+    for (const auto &kv : files_) {
+        FILE *fp = kv.second;
+        if (closed.find(fp) == closed.end()) {
+            fclose(fp);
+            closed.insert(fp);
+        }
+    }
+}

Move #include <set> to the top of the file.
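
For reference, a minimal sketch of the shared_ptr-managed mapping the comments above assume, with a deleter that unmaps exactly once (file handling and offset logic elided):

// Hypothetical sketch: wrap an mmap'd region in a shared_ptr whose deleter
// calls munmap, so ownership is unambiguous and no manual dtor cleanup races it.
#include <sys/mman.h>
#include <cstdint>
#include <memory>

std::shared_ptr<uint8_t> map_region(int fd, size_t size) {
    void *p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return nullptr;
    return std::shared_ptr<uint8_t>(
        static_cast<uint8_t *>(p),
        [size](uint8_t *ptr) { munmap(ptr, size); }); // unmaps exactly once
}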

mllm/Parallel.hpp (3)

19-23: Chunk count off-by-one when length is a multiple of chunk_size.

The formula adds an extra chunk for exact multiples: e.g., with real_seq_length = 8 and chunk_size = 4, seq_length_padding = (4 - 8 % 4) + 8 = 12, giving chunk_num = 3 instead of 2.

Use ceiling division:

-        const int seq_length_padding = (chunk_size - real_seq_length % chunk_size) + real_seq_length;
-        chunk_num = seq_length_padding / chunk_size;
+        chunk_num = (real_seq_length + chunk_size - 1) / chunk_size;

68-77: OpenMP misuse and skipped chunks.

  • Loop runs only chunk_num/2; chunk_num==1 executes nothing.
  • #pragma omp barrier outside a parallel region is invalid.
-        for (int chunk_id = 0; chunk_id < chunk_num / 2; ++chunk_id) {
+        for (int chunk_id = 0; chunk_id < (chunk_num + 1) / 2; ++chunk_id) {
             // for every two chunk, start at chunk_id * 2 to avoid no execute for
             for (int i = chunk_id * 2; i < num_graph + chunk_id * 2 + 5; ++i) {
 #pragma omp parallel for num_threads(2)
                 for (int pair_idx = 0; pair_idx < 2; ++pair_idx) {
                     executeFunc((chunk_id * 2) + pair_idx, i - (pair_idx * 4));
                 }
-#pragma omp barrier
-                // std::cout << "---------------------------" << std::endl;
+                // optional: add synchronization inside the parallel region if needed
             }
         }

Additionally, include <omp.h> in this header if not already via transitive includes to use omp_set_max_active_levels.


88-96: Negative index when real_seq_length % chunk_size == 0.

Indexing at -1 is UB. Use (real_seq_length - 1) % chunk_size.

-                auto value = result->dataAt<float>(0, 0, real_seq_length % chunk_size - 1, i);
+                const int pos = (real_seq_length - 1) % chunk_size;
+                auto value = result->dataAt<float>(0, 0, pos, i);
mllm/backends/cpu/op/CPUArgSortFunc.hpp (1)

23-39: Return integer indices and clean up comparator/unused state.

  • argsort writes indices but stores them as float; outputs dtype mirrors inputs. Use int32 indices.
  • compareIndices doesn’t depend on object state; make it static; drop lambda capture.
  • thread_count is unused; either use it or remove it.
-private:
-    int thread_count = 4;
-
-    // 自定义比较函数,用于对索引进行排序
-    bool compareIndices(const std::pair<int, float> &a, const std::pair<int, float> &b) {
+private:
+    int thread_count = 4; // TODO: use for parallelism or remove
+    // 自定义比较函数,用于对索引进行排序
+    static inline bool compareIndices(const std::pair<int, float> &a, const std::pair<int, float> &b) {
         return a.second < b.second;
     }
 
-    void argsort(float *input, int size, float *out_indices) {
+    void argsort(float *input, int size, int *out_indices) {
         std::vector<std::pair<int, float>> indexedInput(size);
         for (int i = 0; i < size; ++i) {
             indexedInput[i] = std::make_pair(i, input[i]);
         }
-        std::sort(indexedInput.begin(), indexedInput.end(), [this](const std::pair<int, float> &a, const std::pair<int, float> &b) {
-            return compareIndices(a, b);
-        });
+        std::sort(indexedInput.begin(), indexedInput.end(), &CPUargsortFunction::compareIndices);
         for (int i = 0; i < size; ++i) {
-            out_indices[i] = static_cast<float>(indexedInput[i].first);
+            out_indices[i] = indexedInput[i].first;
         }
     }
@@
-        outputs[0]->setDtype(inputs[0]->dtype()); // argsortk_values
+        outputs[0]->setDtype(DataType::DATA_TYPE_INT32); // indices
         return ErrorCode::MLLM_NO_ERROR;
     }
     ErrorCode execute(vector<shared_ptr<Tensor>> inputs, vector<shared_ptr<Tensor>> outputs) override {
         int size = inputs[0]->dimension();
         for (int b = 0; b < inputs[0]->batch(); b++) {
             float *data = inputs[0]->ptrAt<float>(b, 0, 0, 0);
-            float *out = outputs[0]->ptrAt<float>(b, 0, 0, 0);
+            int *out = outputs[0]->ptrAt<int>(b, 0, 0, 0);
             argsort(data, size, out);
         }
         return ErrorCode::MLLM_NO_ERROR;
     }

Also applies to: 46-61

mllm/Types.hpp (1)

280-306: Bug: returning -1 in a size_t function causes wraparound.

DataTypeSize returns -1 for several cases, but the function’s return type is size_t, so the value wraps to SIZE_MAX (18446744073709551615 on 64-bit targets) and will corrupt any allocation or stride computed from it.

Apply a safe fallback (0) or throw. Minimal fix:

-    case MLLM_TYPE_Q4_1:
-    case MLLM_TYPE_Q8_1:
-        return -1;
+    case MLLM_TYPE_Q4_1:
+    case MLLM_TYPE_Q8_1:
+        return 0; // unsupported
...
-    case MLLM_TYPE_Q1_K:
-        return -1;
+    case MLLM_TYPE_Q1_K:
+        return 0;
...
-    case MLLM_TYPE_IQ2_XS:
-        return -1;
+    case MLLM_TYPE_IQ2_XS:
+        return 0;
-    case MLLM_TYPE_IQ1_S:
-        return -1;
+    case MLLM_TYPE_IQ1_S:
+        return 0;
-    case MLLM_TYPE_IQ1_M:
-        return -1;
+    case MLLM_TYPE_IQ1_M:
+        return 0;
-    case MLLM_TYPE_IQ2_S:
-        return -1;
+    case MLLM_TYPE_IQ2_S:
+        return 0;

Optionally log or assert on 0 to catch misuse.

mllm/Layer.hpp (1)

1042-1057: Use logical &&, not bitwise & in View dimension checks

Using & here happens to evaluate correctly (== binds tighter than &), but it loses short-circuiting, invites precedence mistakes in later edits, and obscures the boolean intent; use && throughout.

-        if (batch == -1 & seq == -1 & head != -1 & dim != -1) { // keep b&s change h&d
+        if (batch == -1 && seq == -1 && head != -1 && dim != -1) { // keep b&s change h&d
@@
-        } else if (batch == -1 & dim == -1 & head != -1 & seq != -1) { // keep b&d change h&s
+        } else if (batch == -1 && dim == -1 && head != -1 && seq != -1) { // keep b&d change h&s
@@
-        } else if (head == -1 & dim == -1 & batch != -1 & seq != -1) { // keep h&d change b&s
+        } else if (head == -1 && dim == -1 && batch != -1 && seq != -1) { // keep h&d change b&s
@@
-        } else if (batch != -1 & dim != -1 & head != -1 & seq != -1) { // change all dimension.
+        } else if (batch != -1 && dim != -1 && head != -1 && seq != -1) { // change all dimension.
mllm/Tensor.hpp (1)

653-666: Incorrect channel/width index mapping for BCTHW/BTHWC.

WIDTH and CHANNLE indices are wrong; causes bad shape/indexing.

 case BTHWC:
   impl_->chls()[BATCH] = 0;
   impl_->chls()[TIME] = 1;
   impl_->chls()[HEIGHT] = 2;
-  impl_->chls()[WIDTH] = 3;
-  impl_->chls()[CHANNLE] = 3;
+  impl_->chls()[WIDTH] = 3;
+  impl_->chls()[CHANNLE] = 4;
   break;
 case BCTHW:
   impl_->chls()[BATCH] = 0;
-  impl_->chls()[CHANNLE] = 1;
-  impl_->chls()[TIME] = 2;
-  impl_->chls()[HEIGHT] = 3;
-  impl_->chls()[WIDTH] = 3;
+  impl_->chls()[CHANNLE] = 1;
+  impl_->chls()[TIME] = 2;
+  impl_->chls()[HEIGHT] = 3;
+  impl_->chls()[WIDTH] = 4;
   break;
♻️ Duplicate comments (2)
examples/demo_minicpm_moe_mbp.cpp (1)

11-11: Same verification needed as in demo_minicpm_moe_mbm.cpp.

This change disables memory-mapped allocations, identical to the change in demo_minicpm_moe_mbm.cpp. Please refer to the verification request in that file.

mllm/backends/cpu/compute/MatmulSparse.hpp (1)

8-9: Verify the include paths are correct.

Same issue as in Pooling.hpp: the static analysis tool reports that Tensor.hpp cannot be found. Please verify the include paths are correct relative to mllm/backends/cpu/compute/ or ensure the build system's include directories properly resolve these headers.

🧹 Nitpick comments (72)
.gitignore (1)

38-39: Clean up unnecessary blank lines.

Remove the consecutive blank lines at lines 38–39 to improve file organization.

Apply this diff:

  mllm/backends/qnn/sdk*
-
-
examples/demo_tinyllama.cpp (1)

32-32: Consider clarifying the purpose of this commented-out configuration.

This commented-out line has no functional effect. If it serves as a reference for optional configuration, consider adding a brief comment explaining when/why a developer might uncomment it. Otherwise, remove it to reduce code clutter.

mllm/backends/cpu/compute/Sigmoid.hpp (2)

28-69: Overall structure is solid, but wrap in namespace.

The SIMD implementation structure is well-designed with proper conditional compilation paths and scalar fallback. However, placing this function in global scope risks symbol conflicts in larger projects.

Apply this diff to wrap in a namespace:

+namespace mllm {
+namespace cpu {
+
 /**
  * @brief 对一个 float 数组进行 Sigmoid 计算 (支持 AVX 和 NEON 的高性能版本)
  * @param n   数组中元素的数量
  * @param y   指向输出数组的指针
  * @param x   指向输入数组的指针
  */
 void vec_sigmoid_f32(const int n, float *y, const float *x) {
     // ... implementation ...
 }
+
+}  // namespace cpu
+}  // namespace mllm

2-2: Consider using English comments for international collaboration.

The codebase contains Chinese comments (e.g., "包含 NEON 指令集的头文件"). For projects with international contributors, English comments improve accessibility.

Example:

-#include <arm_neon.h> // 包含 NEON 指令集的头文件
+#include <arm_neon.h> // Include NEON intrinsics header

This suggestion applies to all comments in the file (lines 2, 5, 10, 29-32, 37, 51, 65).

mllm/backends/cpu/compute/FeatureCheck.hpp (4)

4-6: Remove commented-out preprocessor directives.

The commented-out #if and #endif directives (lines 4 and 6) add no value and should be removed. The Chinese comment on line 5 should also be translated to English or removed for consistency.

Apply this diff:

-// #if defined(__aarch64__) && !defined(HWCAP_I8MM)
-#include <asm/hwcap.h> // 确保定义 HWCAP_I8MM
-// #endif
+#include <asm/hwcap.h> // Ensure HWCAP_I8MM is defined

27-78: Remove commented-out debug statements.

The function contains numerous commented-out debug statements (lines 27, 31, 37, 40-41, 46, 49, 52, 55, 60, 68, 74, 76). These should be removed for production code. If debugging capabilities are needed, consider using a logging framework or conditional compilation macros instead.


61-73: Consider RAII for file handling.

The C-style file handling works correctly but could benefit from C++ RAII patterns for better exception safety. Consider using std::ifstream or a custom RAII wrapper.

Example refactor using C++ streams:

#if defined(__linux__)
    std::ifstream cpuinfo("/proc/cpuinfo");
    if (cpuinfo) {
        std::string line;
        while (std::getline(cpuinfo, line)) {
            if (line.find("Features") != std::string::npos && 
                line.find("i8mm") != std::string::npos) {
                return true;
            }
        }
    }
#endif

48-51: Consider using named constants for magic numbers.

The bit positions (52) and mask (0xF) for the I8MM field, as well as the valid values (1, 2), should be named constants for better readability and maintainability.

Example:

// ARM Architecture Reference Manual: ID_AA64ISAR1_EL1.I8MM field
static constexpr uint8_t I8MM_FIELD_SHIFT = 52;
static constexpr uint64_t I8MM_FIELD_MASK = 0xF;
static constexpr uint8_t I8MM_SUPPORTED_VALUE_1 = 1;
static constexpr uint8_t I8MM_SUPPORTED_VALUE_2 = 2;

const uint8_t i8mm_field = (isar1 >> I8MM_FIELD_SHIFT) & I8MM_FIELD_MASK;
if (i8mm_field == I8MM_SUPPORTED_VALUE_1 || i8mm_field == I8MM_SUPPORTED_VALUE_2) {
    return true;
}
mllm/backends/cpu/compute/Matmul.cpp (1)

50-50: Consider simplifying the is_0 condition for clarity.

The condition src1->batch() != src0->batch() can be simplified to src0->batch() != 1 since we already know src1->batch() == 1 from the first check. This makes the intent clearer: we're checking if src0 has multiple batches that need broadcasting from src1's single batch.

-        int is_0 = (src1->batch() == 1 && src1->head() == 1 && src1->batch() != src0->batch()) ? 0 : 1;
+        int is_0 = (src1->batch() == 1 && src1->head() == 1 && src0->batch() != 1) ? 0 : 1;

Apply the same change at line 130.

Also applies to: 130-130

examples/demo_qwen3.cpp (1)

70-71: Consider simplifying KV cache management.

The cache is cleared in two places:

  • Line 63: Inside the callback when an end token is detected
  • Line 70: After generation completes (added in this PR)

When generation ends naturally (end token detected), the cache is cleared twice. While this ensures the cache is always cleared regardless of termination reason (end token vs max_new_tokens limit), you could simplify by removing the clearing at line 63 and relying solely on the post-generation clearing at line 70.

The profiling call at line 71 is a useful addition for gathering performance metrics in the demo.

If you prefer to eliminate the redundancy, apply this diff:

             auto [not_end, output_string] = tokenizer.postprocess(out_string);
             if (!not_end) {
-                model.clear_kvcache();
                 return false;
             }
mllm/backends/cpu/op/CPUBinCountFunc.hpp (2)

20-35: Input validation and perf notes (optional)

  • If non-F32 inputs are possible, either convert or reject with an ErrorCode before using dataAt/hostPtr.
  • Perf: compute max via raw pointer once to avoid dataAt overhead; optionally parallelize counting via per-thread histograms then reduce.

Example max via pointer:

-        for (int i = 0; i < size; ++i) {
-            int val = static_cast<int>(inputs[0]->dataAt<float>(0, 0, 0, i));
+        float* ptr = inputs[0]->hostPtr<float>();
+        for (int i = 0; i < size; ++i) {
+            int val = static_cast<int>(ptr[i]);
             if (val > max_val) {
                 max_val = val;
             }
         }

Optional: parallel counting with thread‑local histograms to avoid atomics.

Are non‑F32 input dtypes expected for this op?

Also applies to: 55-60


9-9: Drop unused include

<algorithm> is not used.

-#include <algorithm>
mllm/backends/cpu/compute/FlashAttention2H.hpp (3)

153-154: Make kv_group_size computation consistent and safe.

Prefill uses Q_Head / KV_Head directly; decode uses a guarded ternary. After adding asserts that Q_Head, KV_Head > 0, unify to a single formula.

Apply:

-        const int32_t kv_group_size = Q_Head / KV_Head;
+        const int32_t kv_group_size = Q_Head / KV_Head;
@@
-        const int32_t kv_group_size = (Q_Head > 0 && KV_Head > 0) ? Q_Head / KV_Head : 1;
+        const int32_t kv_group_size = Q_Head / KV_Head;

Alternatively, if you prefer defensive coding without relying on asserts:

+        const int32_t kv_group_size = (KV_Head > 0) ? (Q_Head / KV_Head) : 1;

Also applies to: 204-205, 615-616, 664-665


14-17: Architecture guard is too strict for generic x86 builds.

Only allowing AVX2 or __ARM_NEON will hard-fail SSE4-capable CPUs. Consider a scalar fallback path or a compile-time option to bypass the #error.

Introduce a scalar fallback block (or gate the #error behind a dedicated build flag like MLLM_STRICT_SIMD) to improve portability.

Also applies to: 39-41, 54-55
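
A minimal sketch of that gating, assuming a new opt-in flag named MLLM_STRICT_SIMD (not an existing project macro):

// Hard-fail only when the build explicitly demands SIMD; otherwise compile a scalar fallback.
#if defined(__AVX2__) || defined(__ARM_NEON)
#define MLLM_FA2_HAVE_SIMD 1
#elif defined(MLLM_STRICT_SIMD)
#error "FlashAttention2H requires AVX2 or NEON when MLLM_STRICT_SIMD is defined."
#else
#define MLLM_FA2_HAVE_SIMD 0 // guard the intrinsics blocks with this and add a scalar path
#endif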


89-90: Unused flag: high_precision.

The high_precision member is configured but never read. If intentional for future work, add a brief TODO; otherwise remove to avoid confusion.

Add a comment or wire it into expf/accumulation choices if it is meant to control numerics.

Also applies to: 551-552

mllm/backends/cpu/compute/Matmul.hpp (2)

11-11: Avoid using namespace in a header.

Header-wide using namespace mllm; pollutes consumers.

Prefer either qualifying names (mllm::Tensor, mllm::ErrorCode) or wrap declarations in namespace mllm { ... } and convert namespace mllm::armv8 below to namespace armv8 inside. Happy to provide a patch if you choose either route.
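
For illustration, the wrapped-namespace route could look like this (the mat_mul declaration is a stand-in, not the real signature):

// Matmul.hpp — declarations wrapped in the namespace instead of a header-wide using-directive.
namespace mllm {

ErrorCode mat_mul(Tensor *src0, Tensor *src1, Tensor *dst, int thread_count); // hypothetical

namespace armv8 { // was `namespace mllm::armv8` at file scope
// armv8-specific kernels...
} // namespace armv8

} // namespace mllm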


29-36: Doc and naming nits (low priority).

  • “accpect” -> “accept”; “enbaled” -> “enabled”.
  • GEMM comments say C(M x K); should be C(M x N).
  • Consider renaming parameter src0_ to src0 for consistency.

Also applies to: 84-104, 124-136

mllm/backends/cpu/op/CPUCat.cpp (1)

64-104: Hardcoded float copies; generalize or assert dtype.

All memcpy paths use ptrAt<float>/sizeof(float). If tensors can be F16/BF16, this is UB.

  • If CPUCat guarantees FP32, add assert(outputs[0]->dtype()==MLLM_TYPE_F32) before copies.
  • Otherwise switch to rawHostPtr() and multiply by type_size(tensor->dtype()).
    I can send a complete patch once you confirm the intended dtype contract.
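
As a rough sketch of the second option (dst_offset, src_offset, copy_count, and inputs[n] are placeholders for the existing loop bookkeeping):

// Dtype-agnostic copy: scale byte offsets by the element size instead of hardcoding sizeof(float).
const size_t elem = type_size(outputs[0]->dtype());
auto *dst = static_cast<uint8_t *>(outputs[0]->rawHostPtr()) + dst_offset * elem;
auto *src = static_cast<const uint8_t *>(inputs[n]->rawHostPtr()) + src_offset * elem;
std::memcpy(dst, src, copy_count * elem); // needs <cstring>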
examples/demo_phonelm_npu.cpp (1)

69-71: Prefer dynamic_cast over static_cast for backend downcast.

Defensive in case the global backend at MLLM_QNN isn’t a QNNBackend (misconfig causes UB).

Apply:

-    static_cast<QNNBackend *>(Backend::global_backends[MLLM_QNN].get())->saveQNNContext();
+    if (auto *qnn = dynamic_cast<QNNBackend *>(Backend::global_backends[MLLM_QNN].get())) {
+        qnn->saveQNNContext();
+    } else {
+        std::cerr << "QNN backend not initialized.\n";
+        return -1;
+    }
examples/demo_bailing_moe.cpp (1)

39-41: Guard device selection under build flags.

MLLM_OPENCL path is asserted even when built without OpenCL. Consider failing fast on invalid -d values in non-OpenCL builds.

Wrap the assertion under #ifdef USE_OPENCL or normalize device to MLLM_CPU when OpenCL isn’t compiled.
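
A minimal sketch, assuming the demo parses the -d value into device_str and stores the enum in device (illustrative names):

// Normalize the requested device when the OpenCL backend wasn't compiled in.
#ifdef USE_OPENCL
    if (device_str == "opencl") device = MLLM_OPENCL;
#else
    if (device_str == "opencl") {
        std::cerr << "This build has no OpenCL support; falling back to CPU.\n";
        device = MLLM_CPU;
    }
#endif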

examples/demo_qwen2.5_vl.cpp (2)

38-46: Guard against mismatched image/text counts.

Future edits may uncomment more images or prompts, causing out-of-range access.

     vector<string> in_imgs = {
         // "../assets/bus.png",
         "../assets/two_cats.jpg",
         // "../assets/bird_image.jpg",
     };
     vector<string> in_strs = {
         "<|vision_start|><|image_pad|><|vision_end|>Describe this image.",
     };
+
+    if (in_imgs.size() != in_strs.size()) {
+        std::cerr << "Size mismatch: in_imgs=" << in_imgs.size()
+                  << " vs in_strs=" << in_strs.size() << std::endl;
+        return 1;
+    }

27-27: Normalize "billion" casing for consistency; apply mapping to both 3B and 7B.

The current code only maps "3B" → "3b" but leaves "7B" unchanged. While QWenConfig's constructor performs lowercase conversion internally, explicitly normalizing both values in the demo improves clarity and consistency with the help text "[3B | 7B |]".

-    string model_billion = cmdParser.get<string>("billion") == "3B" ? "3b" : cmdParser.get<string>("billion");
+    string model_billion = cmdParser.get<string>("billion");
+    if (model_billion == "3B") model_billion = "3b";
+    if (model_billion == "7B") model_billion = "7b";
examples/demo_sparse_llama.cpp (1)

16-18: Avoid hardcoded second model path; expose via CLI and validate existence.

Baking "../models/ReLULlama_q4_k.mllm" makes the demo brittle across environments.

-    // cmdParser.add<string>("predictor", 'p', "specify mllm model predictor path", false, "../models/ReLULlama_predictor.mllm");
+    cmdParser.add<string>("extra", 'x', "extra model file (e.g., merged or adapter)", false, "../models/ReLULlama_q4_k.mllm");
...
-    // string predictor_path = cmdParser.get<string>("predictor");
+    string extra_path = cmdParser.get<string>("extra");
...
-    model.load_multifile({model_path, "../models/ReLULlama_q4_k.mllm"});
+    model.load_multifile({model_path, extra_path});

Optionally add a std::filesystem::exists check to emit a helpful error if the file is missing.

Also applies to: 26-33

examples/demo_showui_npu.cpp (1)

64-66: Minor: tidy log output.

-    std::cout << "num_iter" << num_iter << std::endl;
+    std::cout << "num_iter: " << num_iter << std::endl;
mllm/backends/cpu/compute/GemmFp.hpp (2)

30-33: Avoid potential macro clashes with min; prefer a distinct name or std::min.

-static inline int min(int a, int b) {
+static inline int imin(int a, int b) {
     return a < b ? a : b;
 }

And replace call sites: min(...) → imin(...)


104-152: Clarify GEMM semantics (C += AB vs C = AB) and document zeroing requirements.

Micro-kernels and fallback use +=. If callers expect overwrite, ensure C is zeroed beforehand or add a beta parameter.

Add a brief comment above gemm_fp32/gemm_fp32_fp16 documenting accumulation semantics.
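
Something along these lines above the kernels would do (wording illustrative):

// NOTE: gemm_fp32 / gemm_fp32_fp16 ACCUMULATE into C (C += A * B).
// Callers that want C = A * B must zero C first, e.g.
//     std::memset(C, 0, sizeof(float) * M * N);
// or add a `beta` parameter that scales C before accumulating.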

mllm/backends/cpu/compute/Convolution.hpp (1)

8-10: Header hygiene: fix include paths and remove 'using namespace'

  • Use project-qualified paths to match new layout.
  • Avoid 'using namespace' in headers.
-#include "Tensor.hpp"
-#include "Types.hpp"
-using namespace mllm;
+#include "mllm/Tensor.hpp"
+#include "mllm/Types.hpp"

And qualify symbols with mllm:: in declarations if needed.

mllm/backends/cpu/compute/SageQuantize.hpp (3)

4-12: Tidy includes: add <type_traits>, drop unused iostream, qualify project header

-#include <iostream>
+#include <type_traits>
...
-#include "Types.hpp"
+#include "mllm/Types.hpp"

This drops the unused <iostream> weight and fixes compilation of std::is_same_v, which requires <type_traits>.


24-37: Remove unused AVX helper

_mm256_hmax_ps is declared but never used. Drop to reduce code surface.


86-110: Avoid per-call heap allocation in quantize_new_token_to_sage_blocks

Allocating std::vector<float> smoothed_row(dim_size) for every token is costly. Compute per block in-register.

-    std::vector<float> smoothed_row(dim_size);
-    for (int d = 0; d < dim_size; ++d)
-        smoothed_row[d] = new_token_vector[d] - current_mean_data[d];
     for (int g = 0; g < num_k_blocks; ++g) {
         const int offset = g * QK8_0F;
-        const float *smoothed_block_ptr = smoothed_row.data() + offset;
-        float max_abs_val = 0.0f;
-        for (int d = 0; d < QK8_0F; ++d)
-            max_abs_val = std::max(max_abs_val, fabsf(smoothed_block_ptr[d]));
+        float max_abs_val = 0.0f;
+        float tmp[QK8_0F];
+        #pragma unroll
+        for (int d = 0; d < QK8_0F; ++d) {
+            tmp[d] = new_token_vector[offset + d] - current_mean_data[offset + d];
+            max_abs_val = std::max(max_abs_val, fabsf(tmp[d]));
+        }
         const float scale = (max_abs_val > 1e-9f) ? max_abs_val / 127.0f : 0.0f;
         out_ptr[g].scale = scale;
         const float inv_scale = (scale > 1e-9f) ? 1.0f / scale : 0.0f;
-        for (int d = 0; d < QK8_0F; ++d)
-            out_ptr[g].qs[d] =
-                static_cast<int8_t>(roundf(smoothed_block_ptr[d] * inv_scale));
+        #pragma unroll
+        for (int d = 0; d < QK8_0F; ++d) {
+            out_ptr[g].qs[d] = static_cast<int8_t>(roundf(tmp[d] * inv_scale));
+        }
     }
mllm/backends/cpu/compute/Transpose2D.hpp (1)

2-4: Trim includes and qualify project header

-#include <iostream>
-#include "Types.hpp"
+#include "mllm/Types.hpp"

<iostream> is not used; remove it.

mllm/OpDefined.hpp (1)

10-134: Prefer scoped enums to avoid global identifier collisions.

Unscoped names like DIRECT, RANGE, VIEW can clash. Consider enum class OpType : int and enum class TensorFuncType : int. Migration can be phased.
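
A phased start could scope just the enum while keeping the underlying type, so existing int conversions stay explicit (sketch; enumerators abbreviated):

enum class OpType : int {
    ADD,
    MATMUL,
    VIEW, // no longer collides with a global VIEW
    // ...
};

// Call sites then qualify explicitly:
// if (op->type() == OpType::VIEW) { ... }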

CMakeLists.txt (2)

49-51: Be cautious resetting CMAKE_CXX_STANDARD_INCLUDE_DIRECTORIES.

Altering standard include dirs can trigger STL include failures (e.g., “string not found”). Prefer not to override this unless necessary.


258-299: Kleidiai third_party enforcement: add a clearer opt-out or fetch.

Hard FATAL if missing may hinder non-AArch64 devs. Gate with an option (e.g., KLEIDIAI_ENABLE) or provide a message guiding how to obtain it.

mllm/backends/cpu/op/CPUBinaryFunc.hpp (1)

33-43: Use the per-op thread_count or remove the field.

You store thread_count but never use it; OMP uses CPUBackend::cpu_threads.

-#pragma omp parallel for collapse(3) num_threads(CPUBackend::cpu_threads)
+#pragma omp parallel for collapse(3) num_threads(thread_count)

Apply similarly across ops, or drop the member and ctor arg if centralizing on CPUBackend::cpu_threads.

Also applies to: 71-82, 109-121, 149-161, 188-202, 220-246, 269-286, 309-338, 361-384

mllm/backends/cpu/compute/Pooling.cpp (1)

126-132: Initialize max with -inf, not a magic constant.

Use numeric limits for correctness across ranges.

-                    float value = -999999;
+                    float value = -std::numeric_limits<float>::infinity();

Remember to include <limits> if it is not already present in this TU.

mllm/Op.hpp (2)

64-64: Consider verifying inputs vector is non-empty.

The code now propagates ctype from inputs[0] to outputs, but doesn't check if inputs is empty. While this may be guaranteed by calling context, adding a defensive check could prevent crashes.

Consider adding a safety check:

 virtual ErrorCode setUp(vector<shared_ptr<Tensor>> inputs, vector<shared_ptr<Tensor>> outputs) {
+    assert(!inputs.empty() && "setUp requires at least one input");
     for (auto &output : outputs) {
         output->setDtype(activation_dtype_);
         output->setCtype(inputs[0]->ctype());

135-137: Mutable reference to internal state.

Exposing traced_ via a mutable reference allows external code to directly modify the internal tracing state. While this enables fine-grained control for tracing/instrumentation, it also breaks encapsulation. Consider whether a setter method would be more appropriate:

bool traced() const { return traced_; }
void setTraced(bool traced) { traced_ = traced; }
mllm/Context.cpp (1)

15-42: Commented backend init: remove or refactor with smart pointers/registry.

Large commented code with raw new risks bitrot. Either delete it or reintroduce it behind a factory using std::unique_ptr/shared_ptr and a registry to avoid leaks and globals.
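
For reference, a registry-based reintroduction might look roughly like this (BackendRegistry and the BackendType usage are assumptions, not existing project types):

#include <functional>
#include <map>
#include <memory>

using BackendFactory = std::function<std::unique_ptr<Backend>()>;

class BackendRegistry {
public:
    void registerFactory(BackendType type, BackendFactory f) { factories_[type] = std::move(f); }
    std::unique_ptr<Backend> create(BackendType type) const {
        auto it = factories_.find(type);
        return it == factories_.end() ? nullptr : it->second();
    }
private:
    std::map<BackendType, BackendFactory> factories_;
};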

examples/demo_qwen2_vl_vtp.cpp (2)

26-27: Remove unused variables.

thread_num and param_loader are never used.

-    int thread_num = cmdParser.get<int>("thread");
     CPUBackend::cpu_threads = cmdParser.get<int>("thread");
@@
-    ParamLoader param_loader(model_path);

Also applies to: 34-34


66-66: Prefer iostream consistently.

Use std::cout << '\n' instead of printf.

-        printf("\n");
+        std::cout << '\n';
mllm/ParamLoader.hpp (1)

105-108: Tighten API: const/noexcept and consistent alias.

Make getInputStream const and use the mllm_file alias; mark trivial getters noexcept.

-    ParamMetadata getParamMetadata(const std::string &name);
-    FILE *getInputStream();
-    std::string getParamPath() const;
+    ParamMetadata getParamMetadata(const std::string &name) const noexcept;
+    mllm_file *getInputStream() const noexcept;
+    std::string getParamPath() const noexcept;

Note: if metadata lookup mutates caches, drop noexcept and const accordingly.

examples/demo_showui_vtp.cpp (3)

25-26: Remove unused variables.

thread_num and param_loader are unused.

-    int thread_num = cmdParser.get<int>("thread");
     CPUBackend::cpu_threads = cmdParser.get<int>("thread");
@@
-    ParamLoader param_loader(model_path);

Also applies to: 28-29


36-41: Ensure inputs are paired; guard for mismatches.

Protect against future edits causing OOB during batching.

-    vector<string> in_imgs = {
+    vector<string> in_imgs = {
         "../assets/uidemo2.png"};
@@
-    for (int i = 0; i < in_strs.size(); ++i) {
+    if (in_imgs.size() != in_strs.size()) {
+        std::cerr << "in_imgs and in_strs size mismatch\n";
+        return 1;
+    }
+    for (size_t i = 0; i < in_strs.size(); ++i) {

59-59: Use iostream newline.

Prefer std::cout << '\n' over printf.

-        printf("\n");
+        std::cout << '\n';
examples/demo_qwen2_vl.cpp (4)

27-27: Normalize and validate --billion mapping.

Hard-coding only "2B" -> "1.5b" is brittle (case/alias variants, unhandled "7B"). Normalize case and handle known aliases; reject unknowns early.

Apply:

-    string model_billion = cmdParser.get<string>("billion") == "2B" ? "1.5b" : cmdParser.get<string>("billion");
+    auto billion_arg = cmdParser.get<string>("billion");
+    std::transform(billion_arg.begin(), billion_arg.end(), billion_arg.begin(), ::tolower);
+    string model_billion;
+    if (billion_arg == "2b" || billion_arg == "2") {
+        model_billion = "1.5b";
+    } else if (billion_arg == "7b" || billion_arg == "7") {
+        model_billion = "7b";
+    } else {
+        std::cerr << "Unsupported --billion: " << billion_arg << std::endl;
+        return 1;
+    }

Confirm Qwen2VLConfig accepted values for model_billion ("1.5b", "7b", etc.) to avoid runtime mismatches. If needed, I can search and adapt the mapping.


19-21: Polish help text for --billion.

Help string has a dangling | and narrow options. Clarify accepted values.

-    cmdParser.add<string>("billion", 'b', "[2B | 7B |]", false, "2B");
+    cmdParser.add<string>("billion", 'b', "model size: 2B|7B (aliases: 2,7)", false, "2B");

29-31: Remove unused variable.

thread_num is never used.

-    int thread_num = cmdParser.get<int>("thread");
-    CPUBackend::cpu_threads = cmdParser.get<int>("thread");
+    CPUBackend::cpu_threads = cmdParser.get<int>("thread");

32-32: Avoid unused ParamLoader construction.

This opens the model file but is unused; wastes startup time.

-    ParamLoader param_loader(model_path);
mllm/ParamLoader.cpp (2)

95-103: Null fp_ guard in file-IO path.

If fp_ is null (e.g., failed open), fseek/fread will crash. Guard early.

-        if (offsets_.find(name) == offsets_.end()) { return false; }
+        if (!fp_) { return false; }
+        if (offsets_.find(name) == offsets_.end()) { return false; }

Optionally check fread return and propagate errors.


299-305: Avoid raw new[] ownership escape.

Returning a raw heap buffer risks leaks. Prefer std::vector<uint8_t>.

-std::tuple<uint8_t *, uint64_t> ParamLoader::load(string name) {
+std::vector<uint8_t> ParamLoader::load(string name) {
@@
-    auto *data = new uint8_t[length];
+    std::vector<uint8_t> data(length);
@@
-    auto _ = fread(data, sizeof(uint8_t), length, fp_);
-    return std::make_tuple(data, length);
+    auto _ = fread(data.data(), sizeof(uint8_t), length, fp_);
+    data.resize(_);
+    return data;

Note: update declaration in header and call sites.

mllm/Context.hpp (2)

3-6: Trim unused heavy includes in header to speed builds.

Backend.hpp seems only referenced in commented code. Prefer forward declarations or remove.

-#include "Backend.hpp"
+// #include "Backend.hpp" // avoid heavy include; keep in .cpp if needed

9-12: Consider marking Instance() as noexcept.

Singleton creation should not throw; helps callers.

-    static Context &Instance();
+    static Context &Instance() noexcept;

Requires matching change in Context.cpp.

examples/demo_qwen.cpp (1)

27-33: Platform macro and help text polish.

  • ARM macro: use compiler-defined macros.
  • “billion” help string inconsistent with defaults.

Apply:

-#if defined(ARM)
+#if defined(__aarch64__) || defined(__arm__)
     default_model_path = "../models/qwen-2.5-1.5b-instruct-kai_q4_0_lm.mllm";
     default_model_billion = "1.5b-lm";
 #endif
-    cmdParser.add<string>("billion", 'b', "[0.5B | 1.8B | 1.5B | 3B |]", false, default_model_billion);
+    cmdParser.add<string>("billion", 'b', "[0.5b | 1.5b | 3b | 1.5b-lm]", false, default_model_billion);

Also applies to: 32-32

mllm/Parallel.hpp (1)

34-40: Minor: noisy stdout in library code.

Consider guarding the "num_graph" print with DEBUG to reduce console noise in demos.

-        std::cout << "num_graph: " << num_graph << std::endl;
+#ifdef DEBUGPRINT
+        std::cout << "num_graph: " << num_graph << std::endl;
+#endif
mllm/backends/cpu/compute/Transpose3D.hpp (1)

16-18: Guard OpenMP header for non-OpenMP builds.

Unconditional <omp.h> include can break toolchains without OpenMP headers.

-#include <omp.h>
+#ifdef _OPENMP
+#include <omp.h>
+#endif

Note: the pragmas are ignored if OpenMP is off.

examples/demo_qwen2_vl_npu.cpp (1)

39-39: Remove unused variables and streamline thread setting.

  • ParamLoader is unused.
  • Use parsed thread_num to set CPU threads.
-    ParamLoader param_loader(model_path);
+    // ParamLoader not needed here; remove if unused.
@@
-    int thread_num = cmdParser.get<int>("thread");
-    CPUBackend::cpu_threads = cmdParser.get<int>("thread");
+    int thread_num = cmdParser.get<int>("thread");
+    CPUBackend::cpu_threads = thread_num;

Also applies to: 29-31

examples/demo_qwen_npu.cpp (1)

1-3: Header include path sanity-check (Context.hpp).

Static analysis flagged "Context.hpp not found". If include dirs don’t add mllm/, consider switching to "mllm/Context.hpp" (and likewise for other project headers) or fix include paths in the build. Also remove the redundant commented saveQNNContext line to avoid confusion.

-            // static_cast<QNNBackend *>(Backend::global_backends[MLLM_QNN].get())->saveQNNContext();
+            // (removed duplicate commented-out call)

Also applies to: 8-8

mllm/DataType.hpp (1)

39-48: Packed structs + vector loads: verify unaligned access path.

These blocks are packed (pragma pack(1)). NEON vld1/vld1q tolerate unaligned addresses on AArch64 but can be slower. If hotspots load these arrays via NEON, consider aligning parent allocations to 16 bytes or using memcpy to a local aligned buffer in inner loops.

Also applies to: 60-69, 103-118, 167-174
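
In a hot loop, the memcpy workaround could look like this (src_qs stands for a pointer into one packed block; requires <arm_neon.h> and <cstring>):

alignas(16) int8_t lane_buf[16];
std::memcpy(lane_buf, src_qs, sizeof(lane_buf)); // packed, possibly unaligned source
int8x16_t v = vld1q_s8(lane_buf);                // guaranteed-aligned load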

mllm/backends/cpu/compute/GemmQ2K.cpp (1)

44-61: Avoid hard‑coding OpenMP threads; respect runtime/config.

num_threads(4) hard‑codes parallelism and may fight CPUBackend::cpu_threads or OMP settings. Drop it or plumb a parameter.

-#pragma omp parallel for num_threads(4)
+#pragma omp parallel for
 ...
-#pragma omp parallel for num_threads(4)
+#pragma omp parallel for
 ...
-#pragma omp parallel for num_threads(4)
+#pragma omp parallel for

Also applies to: 226-292, 318-348

mllm/Types.hpp (2)

25-33: Header globals: clarify const vs runtime-tunable; ensure C++17.

KVCache_TYPE looks constant while KVCache_Type_eager/KVCache_batch are runtime-tunable. Consider:

  • Make KVCache_TYPE constexpr int (true constant).
  • Keep others as inline int but document thread-safety if mutated at runtime (atomic if written cross-thread).
    Also ensure the project enforces C++17 to support inline variables.
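
A sketch of the split, assuming C++17 and that the tunables really are written cross-thread (initial values illustrative):

#include <atomic>

constexpr int KVCache_TYPE = 32;                    // true compile-time constant
inline std::atomic<bool> KVCache_Type_eager{false}; // runtime-tunable, safe to flip from any thread
inline std::atomic<int>  KVCache_batch{1};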

114-125: New BHSD mapping: double-check key collisions.

Chls2Type now mixes 4D and 5D keys. That’s fine, but ensure no duplicate 4D permutations map inconsistently elsewhere. Add a brief comment documenting BHSD semantics for maintainers.

mllm/backends/cpu/compute/Split.hpp (3)

196-223: F32 path can copy as a single contiguous block (faster, simpler).

Within an outer_idx block, the [split_size * inner_loop_size] region is contiguous. Prefer one memcpy; let libc use tuned SIMD.

-                const size_t copy_bytes = split_size * inner_loop_size * sizeof(float);
-                // memcpy(dst_base, src_base, copy_bytes);
-                for (int split_idx = 0; split_idx < split_size; ++split_idx) {
-                    const float *src = src_base + split_idx * inner_loop_size;
-                    float *dst = dst_base + split_idx * inner_loop_size;
-                    int count = inner_loop_size;
-#if defined(__AVX__)
-                    for (; count >= 8; count -= 8) {
-                        __m256 data = _mm256_loadu_ps(src);
-                        _mm256_storeu_ps(dst, data);
-                        src += 8;
-                        dst += 8;
-                    }
-#elif defined(__ARM_NEON)
-                    for (; count >= 4; count -= 4) {
-                        float32x4_t data = vld1q_f32(src);
-                        vst1q_f32(dst, data);
-                        src += 4;
-                        dst += 4;
-                    }
-#endif
-                    for (; count > 0; --count) *dst++ = *src++;
-                }
+                const size_t copy_bytes = static_cast<size_t>(split_size) * inner_loop_size * sizeof(float);
+                std::memcpy(dst_base, src_base, copy_bytes);

184-186: OpenMP loop index: prefer signed int for widest OMP compatibility.

Some OpenMP toolchains are stricter with signed loop vars. Minor, but avoids warnings.

-    for (size_t i = 0; i < out.size(); ++i) {
+    for (int i = 0; i < static_cast<int>(out.size()); ++i) {

132-134: Header include path robustness.

Including "Types.hpp" from a nested folder relies on include_dirs being set. Prefer "mllm/Types.hpp" for resilience across targets, or ensure include dirs add ${PROJECT_SOURCE_DIR}/mllm.

examples/CMakeLists.txt (2)

31-40: OpenMP linking via flags is brittle; use imported target.

Passing -fopenmp/-static-openmp in target_link_libraries is non‑portable. Prefer OpenMP::OpenMP_CXX set up at configure time; handle static/dynamic elsewhere.

-        if (ARM AND NOT (CMAKE_HOST_SYSTEM_NAME STREQUAL "Darwin" AND NOT CMAKE_CROSSCOMPILING))
-            target_link_libraries(${target} PUBLIC -fopenmp -static-openmp)
-        else()
-            target_link_libraries(${target} PUBLIC -fopenmp)
-        endif()
+        target_link_libraries(${target} PUBLIC OpenMP::OpenMP_CXX)

Also replace ARM detection with a consistent check (CMAKE_SYSTEM_PROCESSOR) if you need special-casing.


52-58: mllm_llm/mllm_vlm should propagate include dirs.

Examples include headers like "Context.hpp". Ensure these libraries set PUBLIC include dirs (e.g., ${PROJECT_SOURCE_DIR} and ${PROJECT_SOURCE_DIR}/mllm) so example targets build without local include tweaks.

# in the same block after add_library(...)
target_include_directories(mllm_llm PUBLIC ${PROJECT_SOURCE_DIR} ${PROJECT_SOURCE_DIR}/mllm)
target_include_directories(mllm_vlm PUBLIC ${PROJECT_SOURCE_DIR} ${PROJECT_SOURCE_DIR}/mllm)
mllm/backends/cpu/CMakeLists.txt (1)

154-171: Avoid undefined ARM/APK vars; use imported target for OpenMP.

The ARM/APK condition uses undeclared cache vars, leading to fragile paths. Prefer a single branch:

-if(OpenMP_FOUND)
-    ...
-    if(ARM AND NOT APK)
-        ...
-    else()
-    target_link_libraries(mllm_cpu
-            PUBLIC
-            OpenMP::OpenMP_CXX
-        )
-    endif()
-endif()
+if(OpenMP_FOUND)
+    target_link_libraries(mllm_cpu PUBLIC OpenMP::OpenMP_CXX)
+endif()

If you truly need static OpenMP on specific cross-ARM, gate it via a well-defined option (e.g., MLLM_OPENMP_STATIC) and documented toolchain.

mllm/StateManager.hpp (1)

123-131: Make getters const and avoid unnecessary copies

These accessors don’t mutate state. Mark them const; optionally return a const reference for the vector to avoid copies.

-    unsigned int getLastDraftLength() {
+    unsigned int getLastDraftLength() const {
         return last_draft_length_;
     }
-    std::vector<unsigned int> getLastVerifiedPositionIds() {
-        return last_verified_position_ids_;
-    }
-    bool isUsingDraft() {
+    const std::vector<unsigned int> &getLastVerifiedPositionIds() const {
+        return last_verified_position_ids_;
+    }
+    bool isUsingDraft() const {
         return using_draft_;
     }
mllm/backends/cpu/CPUBackend.cpp (1)

383-392: DEBUGOPTIME: static counter is not thread-safe

op_count is a function-static shared across threads. If runOp can be called concurrently, this will data-race.

  • Make op_count thread_local, or
  • Protect updates with a mutex when DEBUGOPTIME is enabled.
-    static int op_count = 0;
+    thread_local static int op_count = 0;
mllm/Tensor.hpp (1)

150-161: Child/parent cleanup is not thread-safe.

Erasing from master->childTensors() without synchronization can race if tensors are used cross-thread.

Guard with a mutex on the container or document single-threaded graph construction/destruction assumptions.
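
If you take the mutex route, a sketch (children_mutex_ is a hypothetical new member; a std::vector of raw pointers is assumed for childTensors()):

#include <algorithm>
#include <mutex>

std::mutex children_mutex_; // add next to the child container in the tensor impl

// at the erase site:
{
    std::lock_guard<std::mutex> lk(master->children_mutex_);
    auto &kids = master->childTensors();
    kids.erase(std::remove(kids.begin(), kids.end(), this), kids.end());
}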

mllm/backends/cpu/compute/SageAttentionPT.hpp (1)

18-21: Duplicate AVX2 include.

Minor cleanup.

-#include <immintrin.h>
-#include <immintrin.h>
+#include <immintrin.h>

Comment thread .gitignore
.vscode/
.idea/
.cache/
.DS_Store

⚠️ Potential issue | 🟡 Minor

Remove duplicate .DS_Store entry.

Lines 5 and 40 both define .DS_Store. Keep one and remove the duplicate for clarity.

Apply this diff:

- .DS_Store
  examples/test.cpp

Remove line 40 and consolidate at line 5 (already present).

Also applies to: 40-40

🤖 Prompt for AI Agents
In .gitignore around lines 5 to 40, there is a duplicate .DS_Store entry
(present at line 5 and again at line 40); remove the duplicate at line 40 and
leave the single .DS_Store entry at line 5 so the file contains only one
.DS_Store line for clarity.

* @date 2025-07-01
*
*/
#include "Module.hpp"

⚠️ Potential issue | 🟠 Major

Fix include path or include dirs: 'Module.hpp' not found

Either adjust the include to the new layout or add include directories in CMake.

Suggested change:

-#include "Module.hpp"
+#include "mllm/Module.hpp"

If other headers moved similarly, align them too.

🧰 Tools
🪛 Clang (14.0.6)

[error] 8-8: 'Module.hpp' file not found

(clang-diagnostic-error)

🤖 Prompt for AI Agents
In examples/demo_bailing_moe_mbp.cpp around line 8, the include "#include
\"Module.hpp\"" cannot be found; update the include path to the file's new
location (for example adjust to the correct relative path like
"subdir/Module.hpp") or add the directory containing Module.hpp to the project's
include directories in CMakeLists.txt (target_include_directories or
include_directories) and apply the same fix to any other moved headers so all
includes align with the new layout.

Comment on lines +20 to +21
std::iostream::sync_with_stdio(false);
Module::alloc_mmap = false;

⚠️ Potential issue | 🟡 Minor

Use ios_base for sync_with_stdio

Prefer the canonical call to avoid ADL/scope surprises.

-    std::iostream::sync_with_stdio(false);
+    std::ios_base::sync_with_stdio(false);
🤖 Prompt for AI Agents
In examples/demo_bailing_moe_mbp.cpp around lines 20 to 21, the code calls
std::iostream::sync_with_stdio(false) which can trigger ADL/scope surprises;
replace this call with the canonical ios_base::sync_with_stdio(false) to call
the function on the standard ios_base type and avoid ambiguity.

Comment on lines +79 to +85
LlmTextGeneratorOpts opt{
.max_new_tokens = max_new_tokens > 0 ? max_new_tokens : static_cast<size_t>(tokens_limit - input_tensor.sequence()),
.do_sample = false,
.temperature = 0.3F,
.top_k = 50,
.top_p = 0.F,
};

⚠️ Potential issue | 🔴 Critical

Guard against negative max_new_tokens (size_t underflow → huge generation)

If input length ≥ tokens_limit, (tokens_limit - input_tensor.sequence()) becomes negative and, after cast, a gigantic size_t. Clamp to ≥ 0.

-        LlmTextGeneratorOpts opt{
-            .max_new_tokens = max_new_tokens > 0 ? max_new_tokens : static_cast<size_t>(tokens_limit - input_tensor.sequence()),
+        const int budget = tokens_limit - input_tensor.sequence();
+        LlmTextGeneratorOpts opt{
+            .max_new_tokens = max_new_tokens > 0 ? static_cast<size_t>(max_new_tokens)
+                                                 : static_cast<size_t>(std::max(0, budget)),
             .do_sample = false,
             .temperature = 0.3F,
             .top_k = 50,
             .top_p = 0.F,
         };
🤖 Prompt for AI Agents
In examples/demo_bailing_moe_mbp.cpp around lines 79 to 85, the expression
casting (tokens_limit - input_tensor.sequence()) to size_t can underflow when
input length >= tokens_limit, producing a huge max_new_tokens; change the logic
to clamp the computed new-token budget to zero (e.g., compute a signed
difference or check if input_tensor.sequence() >= tokens_limit and use 0,
otherwise use tokens_limit - input_tensor.sequence()) before casting, then
assign that non-negative value to max_new_tokens so generation is never set to a
massive size_t from underflow.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Confirm C++ standard for designated initializers

Designated initializers require C++20. Ensure the project sets CMAKE_CXX_STANDARD to 20 or adjust initialization.

Run:


🏁 Script executed:

#!/bin/bash
rg -nP 'CMAKE_CXX_STANDARD\s*\(?\s*([0-9]+)\s*\)?' -S -C2 -- cmake CMakeLists.txt

Length of output: 195


🏁 Script executed:

#!/bin/bash
# Verify the specific file and its content
fd -t f "demo_bailing_moe_mbp.cpp" --exec cat {} +

Length of output: 4560


🏁 Script executed:

#!/bin/bash
# Check for target-specific or per-file C++ standard settings
rg -nP '(set_target_properties|target_compile_features|set_source_files_properties)' -C3 --type=cmake

Length of output: 4422


🏁 Script executed:

#!/bin/bash
# Check CMAKE_CXX_FLAGS and compiler-related settings
rg -nP 'CMAKE_CXX_(FLAGS|STANDARD)' -C2 --type=cmake

Length of output: 2204


Fix C++ standard to 20 or use C++17 compatible initialization

The project sets CMAKE_CXX_STANDARD 17 (CMakeLists.txt:13), but the code uses designated initializers which require C++20. Either update CMAKE_CXX_STANDARD 17 to CMAKE_CXX_STANDARD 20 in CMakeLists.txt, or replace the designated initializer syntax with traditional initialization.
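
If you stay on C++17, the member-wise form is equivalent, assuming LlmTextGeneratorOpts is default-constructible (budget is the signed value from the earlier underflow fix):

LlmTextGeneratorOpts opt;
opt.max_new_tokens = max_new_tokens > 0 ? static_cast<size_t>(max_new_tokens)
                                        : static_cast<size_t>(std::max(0, budget));
opt.do_sample = false;
opt.temperature = 0.3F;
opt.top_k = 50;
opt.top_p = 0.F;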

🤖 Prompt for AI Agents
In examples/demo_bailing_moe_mbp.cpp around lines 79 to 85, the code uses C++20
designated initializers for LlmTextGeneratorOpts which conflicts with the
project C++17 setting; either update CMakeLists.txt to set CMAKE_CXX_STANDARD to
20 (or 23) so the compiler allows designated initializers, or change this
initialization to C++17-compatible syntax by constructing the struct with
positional/aggregate initialization or by default-constructing then assigning
each field (e.g., LlmTextGeneratorOpts opt; opt.max_new_tokens = ...;
opt.do_sample = ...; etc.), ensuring no C++20-only features remain.

Comment on lines +8 to +13
#include "Types.hpp"
#include "cmdline.h"
#include "models/ling/configuration_bailing_moe.hpp"
#include "models/ling/modeling_bailing_moe.hpp"
#include "models/ling/tokenization_bailing.hpp"


⚠️ Potential issue | 🔴 Critical

Resolve missing headers and include paths.

Clang: 'Types.hpp' file not found. Also this TU uses std and CPUBackend symbols.

Apply:

-#include "Types.hpp"
+#include "mllm/Types.hpp"
+#include "mllm/backends/cpu/CPUBackend.hpp"
+#include <cassert>
+#include <iostream>
+#include <string>
+#include <vector>
🧰 Tools
🪛 Clang (14.0.6)

[error] 8-8: 'Types.hpp' file not found

(clang-diagnostic-error)

🤖 Prompt for AI Agents
In examples/demo_bailing_moe.cpp around lines 8 to 13, the translation unit is
failing because Types.hpp cannot be found and the file uses std and CPUBackend
symbols without the proper headers; fix by (1) correcting the include to the
real path or adding the containing directory to the compiler's include paths so
"Types.hpp" resolves, (2) adding the missing header that declares CPUBackend
(e.g., the project's backend header that defines CPUBackend), and (3) including
or qualifying the C++ standard headers used (e.g., include <string>, <vector> or
<iostream> as needed, or prefix types with std::) to remove implicit std
references.

Comment thread mllm/Generate.hpp
Comment on lines +167 to +189
inline void _tensor_to_vec_of_multiIndices(Tensor &t, std::vector<std::vector<float>> &scores, std::vector<int> indices) {
assert(t.batch() == 1 && "Batch size of result is not 1. Which is not supported for now.");
assert(t.head() == 1 && "The 3rd dim of result should be one. e.g.:[1, 1, seq, hidden]");
int _dims = t.dimension();
// TODO: handle padding for QNN
// padding prefill for QNN
// if (is_padding) {
// if (chunk_size > 0) {
// _seq = (seq_before_padding - 1) % chunk_size;
// } else {
// _seq = seq_before_padding - 1;
// }
// }
for (int idx = 0; idx < indices.size(); ++idx) {
std::vector<float> values(t.dimension());
int _seq = indices[idx];
for (int i = 0; i < _dims; ++i) {
auto value = t.dataAt<float>(0, 0, _seq, i);
values[i] = value;
}
scores.push_back(values);
}
unsigned int generate(Tensor &t) override {
std::cerr << "Should use generate_SD" << std::endl;
assert(false);
return -1;
};
unsigned int generate_SD(Tensor &t, TracePool &tp);
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Inconsistent dtype handling across tensor conversion methods.

The newly added _tensor_to_vec_of_multiIndices method uses t.dataAt<float>() (line 184) without dtype checking, while the modified _tensor_to_vec method (lines 84-92) properly handles both FP16 and FP32. Since this PR introduces FP16 support infrastructure, FP16 tensors may be passed to this method, causing incorrect data interpretation or runtime errors.

Similarly, other existing methods (_tensor_to_vec_with_idx line 102, _tensor_to_multivec line 124, _tensor_to_multivec_with_idx line 148) also lack dtype checking and should be verified.

Apply this diff to add dtype handling to the new method:

         for (int idx = 0; idx < indices.size(); ++idx) {
             std::vector<float> values(t.dimension());
             int _seq = indices[idx];
             for (int i = 0; i < _dims; ++i) {
-                auto value = t.dataAt<float>(0, 0, _seq, i);
+                float value;
+                if (t.dtype() == MLLM_TYPE_F16) {
+                    value = MLLM_FP16_TO_FP32(t.dataAt<mllm_fp16_t>(0, 0, _seq, i));
+                } else if (t.dtype() == MLLM_TYPE_F32) {
+                    value = t.dataAt<float>(0, 0, _seq, i);
+                } else {
+                    throw std::runtime_error("Unsupported dtype for text generation.");
+                }
                 values[i] = value;
             }
             scores.push_back(values);

Verify whether other tensor-to-vector conversion methods need the same dtype handling:


🏁 Script executed:

#!/bin/bash
# Description: Find all tensor data access patterns in conversion methods to check for dtype handling

# Search for dataAt<float> calls in the file
rg -n "dataAt<float>" mllm/Generate.hpp

Length of output: 308


🏁 Script executed:

#!/bin/bash
# Extract context around each dataAt<float> line to see if dtype checking is present

# Show lines 80-110 to check _tensor_to_vec and _tensor_to_vec_with_idx
echo "=== Lines 80-110: _tensor_to_vec and _tensor_to_vec_with_idx ===" 
sed -n '80,110p' mllm/Generate.hpp

echo ""
echo "=== Lines 120-155: _tensor_to_multivec and _tensor_to_multivec_with_idx ==="
sed -n '120,155p' mllm/Generate.hpp

echo ""
echo "=== Lines 180-189: _tensor_to_vec_of_multiIndices ==="
sed -n '180,189p' mllm/Generate.hpp

echo ""
echo "=== Check for dtype handling patterns in the file ==="
rg -n "MLLM_FP16_TO_FP32|t\.dtype()" mllm/Generate.hpp

Length of output: 3577


Inconsistent dtype handling across tensor conversion methods - verified critical issue.

The verification confirms the review comment is accurate. The _tensor_to_vec method (lines 85-91) has proper dtype handling with MLLM_FP16_TO_FP32 conversion, but the newly added _tensor_to_vec_of_multiIndices (line 184) and three other existing methods (_tensor_to_vec_with_idx line 102, _tensor_to_multivec line 124, _tensor_to_multivec_with_idx line 148) all lack dtype checking and directly use dataAt<float>(). Since FP16 support is being introduced, this will cause incorrect data interpretation when FP16 tensors are passed to these methods.

At minimum, apply the suggested dtype handling to the new method (lines 180-189). The three other existing methods should also be fixed to maintain consistency.

🤖 Prompt for AI Agents
In mllm/Generate.hpp around lines 167-189, _tensor_to_vec_of_multiIndices reads
tensor values with dataAt<float>() without handling FP16; update this function
to mirror the dtype-aware logic used in _tensor_to_vec (lines ~85-91): check the
tensor dtype (or an is_fp16 flag / MLLM_FP16_TO_FP32 macro), and if FP16 convert
values to float (e.g., load as FP16 and apply MLLM_FP16_TO_FP32) before storing
into the float vector; if FP32, continue using dataAt<float>(). Also apply the
same dtype-aware fix to the other methods mentioned (_tensor_to_vec_with_idx
~102, _tensor_to_multivec ~124, _tensor_to_multivec_with_idx ~148) to ensure
consistent FP16 support across all conversions.

Comment thread mllm/TensorImpl.hpp
Comment on lines 3 to 18
#include <cstdio>
#include <iomanip>
#include <cmath>
#include <fstream>
#include <iostream>
#include <map>
#include <memory>
#include <vector>
// #include <vector>
#ifdef _WIN32
#include <direct.h>
#else
#include <sys/stat.h>
#endif
#include <Types.hpp>
#include <assert.h>
// #include <sys/stat.h>

#include "OpDefined.hpp"
#include "Backend.hpp"
#include <Types.hpp>


⚠️ Potential issue | 🟠 Major

Fix missing standard headers and include path

Required headers are missing; Types include should be project-qualified.

 #include <cstdio>
#include <iostream>
 #include <map>
 #include <memory>
-// #include <vector>
+#include <vector>
+#include <string>
+#include <limits>
+#include <cstdint>
...
-#include <Types.hpp>
+#include "mllm/Types.hpp"
🧰 Tools
🪛 Clang (14.0.6)

[error] 3-3: 'cstdio' file not found

(clang-diagnostic-error)

🤖 Prompt for AI Agents
In mllm/TensorImpl.hpp around lines 3 to 18, add the missing standard headers
(at least <vector> and <cstdint> and <cstdlib> as they are commonly required by
tensor implementations) and change the project include from <Types.hpp> to the
project-qualified path "mllm/Types.hpp" (or the correct relative project path)
so the file compiles reliably and resolves the Types header unambiguously.

Comment thread mllm/TensorImpl.hpp
Comment on lines +171 to 208
void to(Backend *target_backend) {
if (backend_ == target_backend) {
return;
}
// Path 1: migrate from any backend to the host (CPU)
if (target_backend->type() == MLLM_CPU) {
if (location_ == ON_DEVICE) { // migrate from device to host
void *new_host_ptr = nullptr;
target_backend->alloc(&new_host_ptr, cntSize() + 16, 128);
backend_->copy_to_host(new_host_ptr, device_memory_);
backend_->free_device(device_memory_);
host_ptr_ = new_host_ptr;
// cl_device_buffer_ = nullptr;
device_memory_.handle = nullptr;
location_ = ON_HOST;
allocated_ = count_;
}
host_ptr_ = nullptr;
allocated_ = 0;
}
// Path 2: migrate from the host (CPU) to a device
else if (backend_->type() == MLLM_CPU) {
if (location_ == ON_HOST) {
device_memory_.size_in_bytes = cntSize();
target_backend->alloc_device(device_memory_, dtype_);
target_backend->copy_from_host(device_memory_, host_ptr_);
if (owns_host_ptr_) {
backend_->free(host_ptr_);
}
host_ptr_ = nullptr;
location_ = ON_DEVICE;
// allocated_ = 0;// todo1418
}
} else {
std::cout << "Device -> Device migration via Host" << std::endl;
this->to(Backend::global_backends[MLLM_CPU].get());
this->to(target_backend);
}
backend_ = target_backend;
}

🛠️ Refactor suggestion | 🟠 Major

Strengthen 'to' migration preconditions and error paths

Guard null backend_ and target_backend, and handle already-on-target+location cases cleanly.

-    void to(Backend *target_backend) {
-        if (backend_ == target_backend) {
+    void to(Backend *target_backend) {
+        assert(target_backend && "target_backend must be non-null");
+        assert(backend_ && "backend_ must be non-null before migration");
+        if (backend_ == target_backend) {
             return;
         }
🤖 Prompt for AI Agents
In mllm/TensorImpl.hpp around lines 171 to 208, add defensive checks and
stronger error handling in to(Backend *target_backend): validate that both
backend_ and target_backend are non-null up-front and return or throw a clear
error if not; early-return if backend_ == target_backend and the current
location_ already matches the intended ON_HOST/ON_DEVICE state to avoid no-op
migrations; when migrating CPU<->device, check results of
alloc/alloc_device/copy_to_host/copy_from_host and handle failures by cleaning
up any partially allocated memory and not updating backend_ or other state until
the operation completes successfully; ensure owns_host_ptr_ and free paths are
used only when safe (avoid double-free by checking host_ptr_ and
owns_host_ptr_), set device_memory_.handle and host_ptr_ only after successful
allocations/copies, and finally assign backend_ = target_backend only after the
migration succeeded.

Comment thread README.md
Comment on lines 33 to 38
- [2025 August 28] 🔥🔥🔥 Support for MLLM V1 is ending soon. Before its retirement, V1 will integrate the following features: GPT-OSS and NPU QWEN2-VL. MLLM will then transition to V2, which can be viewed on the V2 branch.
    V2 will include brand-new capabilities:
    - A more Pythonic model authoring approach with eager execution
    - Compilation support and MLLM IR for easier NPU integration
    - Support for parallel execution of multiple models
    - A more refined engineering implementation

⚠️ Potential issue | 🟡 Minor

Fix nested list indentation per Markdown standards.

Lines 35–38 use 4-space indentation; Markdown expects 2-space indentation for nested lists. This affects rendering consistency across platforms.

Apply this diff to correct the indentation:

 - [2025 August 28] 🔥🔥🔥 Support for MLLM V1 is ending soon. Before its retirement, V1 will integrate the following features: GPT-OSS and NPU QWEN2-VL. MLLM will then transition to V2, which can be viewed on the V2 branch.
-    V2 will include brand-new capabilities:
-    - A more Pythonic model authoring approach with eager execution
-    - Compilation support and MLLM IR for easier NPU integration
-    - Support for parallel execution of multiple models
-    - A more refined engineering implementation
+  V2 will include brand-new capabilities:
+  - A more Pythonic model authoring approach with eager execution
+  - Compilation support and MLLM IR for easier NPU integration
+  - Support for parallel execution of multiple models
+  - A more refined engineering implementation
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

35-35: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


36-36: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


37-37: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


38-38: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)

🤖 Prompt for AI Agents
In README.md around lines 33 to 38 the nested bullet list under the V2
description is indented with 4 spaces which breaks Markdown rendering on some
platforms; change the nested list lines (35–38) to use 2-space indentation so
each sub-bullet lines up directly under the parent item (use two spaces before
each "-" and ensure same indentation for all nested bullets).

Comment thread README.md
*`NOTE:` The QNN backend is a preliminary version that can do end-to-end inference. It is still under active development for better performance and more supported models.*

We support running Qwen-1.5-1.8B-Chat using [Qualcomm QNN](https://www.qualcomm.com/developer/software/qualcomm-ai-engine-direct-sdk) to get Hexagon NPU acceleration on devices with Snapdragon 8 Gen3. The details of QNN environment set up and design is [here](./src/backends/qnn/README.md). The prefilling stage is performed by QNN & CPU, and the inference stage is performed by CPU.
We support running several Qwen family models including Qwen-2-vl using [Qualcomm QNN](https://www.qualcomm.com/developer/software/qualcomm-ai-engine-direct-sdk) to get Hexagon NPU acceleration on devices with Snapdragon 8 Gen3. The details of QNN environment set up and design is [here](./mllm/backends/qnn/README.md). The prefilling stage is performed by QNN & CPU, and the inference stage is performed by CPU.

⚠️ Potential issue | 🟡 Minor

Use descriptive link text instead of generic placeholders.

Lines 156 and 170 use generic "[here]" link text. Replace with descriptive text that indicates the target content.

Apply this diff:

- The details of QNN environment set up and design is [here](./mllm/backends/qnn/README.md).
+ The details of QNN environment set up and design is documented in the [QNN backend README](./mllm/backends/qnn/README.md).
- Details can be found in backend specific [README](./mllm/backends/qnn/README.md).
+ Details can be found in the [QNN backend README](./mllm/backends/qnn/README.md).

Also applies to: 170-170

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)

156-156: Link text should be descriptive

(MD059, descriptive-link-text)

🤖 Prompt for AI Agents
In README.md around lines 156 and 170, replace the generic "[here]" link text
with descriptive link text that explains the destination (e.g., "QNN environment
setup and design" or "QNN backend README") so readers know what the link points
to; update both occurrences to use clear, specific anchor text and keep the
existing URL paths intact.
