
V2 Release #545

Closed
oreomaker wants to merge 812 commits into main from v2
Conversation

@oreomaker
Collaborator

@oreomaker oreomaker commented Nov 23, 2025

Summary by CodeRabbit

  • New Features

    • Added lazy visual-language model implementations with optimized KV cache management for Qwen2.5VL and Qwen2VL models
    • Added multi-platform development container support for ARM, CUDA 12.4, CUDA 12.8, and QNN environments
    • Added support for modern build infrastructure and CI/CD workflows for Android NDK, macOS Apple Silicon, and documentation deployment
  • Infrastructure & Documentation

    • Restructured project configuration with expanded feature flags and build options
    • Modernized development tooling configuration (.clang-format, .clang-tidy, .editorconfig)
    • Added comprehensive API, architecture, and backend-specific documentation
    • Enhanced GitHub workflows with new issue templates and contribution guidelines
  • Build & Deployment

    • Updated CMake configuration with standardized build options and packaging support
    • Added Docker support for multiple platforms and development environments
    • Reorganized submodule structure for improved dependency management


chenghuaWang and others added 30 commits October 15, 2025 22:02
feat(cli): add mllm-llm-benchmark tool for performance testing
- Define QNN_QUANT_SCALE_NAME constant for quant scale key
- Replace all occurrences of "quant_scale" string literal
- Improve code maintainability and reduce typo risks
- Ensure consistent usage of quant scale identifier
- Simplify future modifications to quant scale key name
- Add QNNOpNamingPass to assign unique names to unnamed operations
- Traverse subgraphs and name ops using module_name.op_type.index pattern
- Handle CallGraphOp and SubGraphOp during IR traversal
- Ensure all QNN operations have unique identifiers for graph construction
- Add pass factory function and integrate with existing pass infrastructure
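
The `module_name.op_type.index` naming pattern described in the commit above can be illustrated with a minimal sketch. `OpNameAllocator` is a hypothetical helper written for this illustration, not the actual `QNNOpNamingPass`; in particular, counting the index separately for each (module, op type) pair is an assumption, since the commit message does not say how the index is scoped.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical helper (not part of mllm): hands out names of the form
// "<module_name>.<op_type>.<index>" so that every unnamed op gets a unique id.
class OpNameAllocator {
 public:
  std::string next(const std::string& module_name, const std::string& op_type) {
    assert(!op_type.empty());
    const std::string prefix = module_name + "." + op_type;
    // Assumption: the index restarts at 0 for each (module, op type) pair.
    const int index = counters_[prefix]++;
    return prefix + "." + std::to_string(index);
  }

 private:
  std::unordered_map<std::string, int> counters_;
};
```

Calling `next("model.layers.0", "Linear")` twice on the same allocator would yield two distinct names, which is the property the pass needs for QNN graph construction.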
Added new source files Nn.cc and Compile.cc to the MllmFFIExtension library
in CMakeLists.txt to extend the FFI interface.

feat(build): format MLIR installation script

Reformatted the cmake command in install_mlir.sh to a single line for better
readability and consistency in the build script.
- Add new kai_sme.cpp and kai_sme.hpp files with proper copyright headers
- Implement ARM-specific linear kernel using SME instructions
- Include necessary header guards and license information
- Remove empty KernelSelector files that were not being used
- Add QNNCastTypeOp to handle type casting with quantization
- Support both quantize and dequantize operations
- Integrate with QNN backend for graph node creation
- Handle scale propagation for int8 and int16 types
- Add pattern matching for CastType operations in IR
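
As a rough illustration of the quantize and dequantize directions mentioned above, the sketch below assumes a symmetric, per-tensor scheme with a single `scale` and no zero point. That scheme, and the function names, are assumptions made for this example; the commit message does not describe how `QNNCastTypeOp` actually maps values.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical symmetric quantization: q = clamp(round(x / scale)).
// Q is the integer storage type, e.g. std::int8_t or std::int16_t.
template <typename Q>
Q quantize(float x, float scale) {
  assert(scale > 0.0f);
  const float lo = static_cast<float>(std::numeric_limits<Q>::min());
  const float hi = static_cast<float>(std::numeric_limits<Q>::max());
  const float q = std::round(x / scale);
  // Saturate instead of wrapping when the value does not fit in Q.
  return static_cast<Q>(std::max(lo, std::min(hi, q)));
}

// Inverse mapping: x is approximately q * scale.
template <typename Q>
float dequantize(Q q, float scale) {
  return static_cast<float>(q) * scale;
}
```

The same `scale` has to be available on both sides of the cast, which is presumably why the commit calls out scale propagation.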
- Add `config_0.6B_w4a8_i8mm_kai.json` with model architecture settings
- Add `quant_cfg_0.6B_w4a8_i8mm_kai.json` with layer-wise quantization hints
- Configure KaiLinear implementation types for various modules

perf(cpu): add label support for KaiLinear implementations

- Insert labels for kai linear implementations to enable goto jumps
- Optimize forward path by switching implementations based on input shape

refactor(mllm): comment out memory cleanup temporarily

- Comment out `clearAll()` call in `shutdownContext()`
- Mark as FIXME for CUDA compatibility

style(qwen3): reformat function signature for readability

- Reformat `makeRotaryPosEmbedding` function declaration to fit within
  line limits
- Improve code style consistency

fix(qwen3): remove redundant finish token callback

- Remove unnecessary finish token callback in Qwen3Session
- Clean up post-processing logic for radix tree insertion
Adds a new devcontainer.json file for cu128 environment with comprehensive
VS Code extension setup including Python, C++, debugging, and formatting
tools.
feat(qwen3): add config and quantization files for 0.6B model
- Added `rmsnorm_fp32_inplace` and `rmsnorm_fp16_inplace` functions in ARM kernels
- Updated RMSNormOp to support inplace operations using the new kernel functions
- Modified LinearOp and related classes to support tensor redirection
- Enhanced FlashAttention2Op with updated kernel includes and input handling
- Added new test cases for FlashAttention2 with improved accuracy checks
- Fixed contiguous tensor assertions in RMSNorm and RoPE operations
- Extended Layer macros to support redirect attribute for ops
- Updated StaticCache with new methods for KV cache management
- Improved FA2 kernel tests with radix attention support and better validation
feat(cpu): add inplace rmsnorm implementations for fp32 and fp16
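
For readers unfamiliar with the operation, an in-place RMSNorm can be sketched as a scalar reference loop. This is not the ARM kernel added in the commit: the name `rmsnorm_fp32_inplace_ref`, its signature, and the placement of `eps` inside the square root are assumptions for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical scalar reference: x[i] <- x[i] / sqrt(mean(x^2) + eps) * weight[i].
// The result overwrites x, so no output buffer is allocated.
void rmsnorm_fp32_inplace_ref(float* x, const float* weight, std::size_t n, float eps) {
  assert(x != nullptr && weight != nullptr && n > 0);
  float sum_sq = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    sum_sq += x[i] * x[i];
  }
  const float inv_rms = 1.0f / std::sqrt(sum_sq / static_cast<float>(n) + eps);
  for (std::size_t i = 0; i < n; ++i) {
    x[i] = x[i] * inv_rms * weight[i];
  }
}
```

Because each output element depends only on the row-wide statistic and its own input element, writing back into `x` is safe once the sum of squares has been computed.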
- Skip tensor data printing when trace mode is enabled
- Add QNNMulOp class with reshape implementation for broadcasting
- Implement QNNMulPattern to add ElementWiseMultiply nodes to QNN graph
- Update QNNAddPattern to use standard ElementWiseAdd operator
- Add tensor shape compatibility checks for Mul operations
- Include proper error handling for tensor operations and backend access
- Add factory class for QNNMulOp creation
- Switch implementation from Conv2d to FullyConnected operator
- Reshape weights to 2D [out_channels, in_channels] format
- Convert bias to int32 type for proper quantization handling
- Remove unused biasInt32_ tensor member
- Update reshape logic to flatten input for FullyConnected
- Add keep_dims parameter for HTP support
- Remove stride and pad parameters for Conv2d
- Simplify bias conversion logic for quantized operations
- Implemented hash() method combining tensor uuid and attached views uuids
- Updated tensor IR caching to use hash instead of uuid
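
One way to combine a tensor uuid with the uuids of its attached views is sketched below. `TensorKey`, the 64-bit uuid type, and the Boost-style mixing step are all assumptions; the commit only states that `hash()` combines these uuids, not how.

```cpp
#include <cstdint>
#include <vector>

// Boost-style hash_combine widened to 64 bits; a common mixing step,
// not necessarily the one used in mllm.
inline std::uint64_t hash_combine(std::uint64_t seed, std::uint64_t value) {
  return seed ^ (value + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

// Hypothetical stand-in for a tensor with attached views.
struct TensorKey {
  std::uint64_t uuid = 0;
  std::vector<std::uint64_t> view_uuids;

  // Fold the tensor uuid first, then every view uuid in order.
  std::uint64_t hash() const {
    std::uint64_t h = hash_combine(0, uuid);
    for (std::uint64_t v : view_uuids) {
      h = hash_combine(h, v);
    }
    return h;
  }
};
```

Keying a cache on such a hash rather than on the bare uuid lets two tensors that share storage but carry different views map to different IR entries.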
- Add QNNX2XOp to handle data transfer between CPU and QNN shared buffer
- Implement forward method to perform memory copy using std::memcpy
- Create QNNX2XOpFactory for op creation in QNN backend
- Add QNNX2XPattern as a placeholder that should not appear in QNN graph
- Include OpTypes header in QNNDispatcher
- Execute X2X op setup and forward in QNN dispatcher for kX2X operations
jialilve and others added 28 commits November 20, 2025 06:24
feat: Implement Qwen NPU Decoding Support with Memory Management Fixes
    - ensure CausalMask layer is materialized on CPU before running kernel tests
    - add deterministic Prefill/Decode/Append regressions based on runScenario helper
    - exercise new coverage under build-tests/bin/Mllm-Test-CPUKernel --gtest_filter=CausalMaskOpTest.*
…aths

- Update Hexagon SDK requirement from 5.x to 6.x in documentation
- Adjust Makefile execution logic in HexagonMakeTask to use updated paths
- Update library names from 'libQnnMllmPackage' to 'libQnnLLaMAPackage'
- Modify build configuration files to reflect new package location
- Ensure proper renaming of CPU and HTP libraries after build
QNN Op Package Migrate to v2
feat: add DeepSeek-OCR support, C++ API updates, and dual-model loadi…
- Add detailed documentation for mllm's operator plugin system
- Document in-tree and out-of-tree operator registration methods
- Include examples for implementing custom operators and factories
- Add plugin descriptor and build configuration guidelines
- Update model configuration examples with GGUF quantization hints
- Document supported quantization types in mllm-quantizer
- Add guidance on selecting appropriate quantization methods
- Remove outdated backend addition guide from quick start index
Added a note about model version compatibility and recommendations.
feat(build): update threading options for Apple GCD support in build configurations
fix(docs): update links for Qwen2 and Qwen2.5 models in README
feat(docs): add mllm-params-inspector tool usage instructions to README
Add build status entries for OrangePi AI Pro (310B) and OrangePi AI Studio (310P)
with Ubuntu 22.04 in the compatibility matrix.
docs(readme): add OrangePi AI Pro and Studio build status
@coderabbitai
Contributor

coderabbitai Bot commented Nov 23, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

A comprehensive v2 restructure of the repository. It introduces a modernized CMake build with extensive feature flags, new lazy visual-language model algorithms with dynamic KV caching, C++ SDK preparation, multi-platform Docker support, refined development tooling (clang-format, clang-tidy, devcontainers), GitHub CI/CD workflows, and broad API documentation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Build & CMake Infrastructure: CMakeLists.txt, cmake/CPM.cmake, cmake/mllmConfig.cmake.in, benchmarks/CMakeLists.txt, benchmarks/cpu/CMakeLists.txt, benchmarks/ext_stl/CMakeLists.txt, algorithms/*/CMakeLists.txt | Modernized CMake v3.21+ configuration with C++20, extensive feature flags (MLLM_ENABLE_TEST, MLLM_BUILD_*_BACKEND, MLLM_EXT_ENABLE, etc.), CPM package manager integration, Git commit hash detection, Tracy profiling, and platform-specific thread vendors (OpenMP, Apple GCD). |
| Lazy VLM Algorithm Implementation: algorithms/lazy_vlm/HKVCache.{hpp,cpp}, algorithms/lazy_vlm/HKVCacheFast.{hpp,cpp}, algorithms/lazy_vlm/models/qwen2_5vl/*.{hpp,cpp}, algorithms/lazy_vlm/models/qwen2vl/*.{hpp,cpp}, algorithms/lazy_vlm/LazyVLMQwen2*.cpp, algorithms/lazy_vlm/run*.py | Complete lazy visual-language model system with hierarchical KV caching (HKVCache, HKVCacheFast), Qwen2.5VL and Qwen2VL model implementations, attention-based pruning, dynamic token selection, and Python build/deployment scripts. |
| Benchmark Implementations: benchmarks/cpu/arm_mllm_blas_sgemm.cpp, benchmarks/ext_stl/intrusive_ptr.cpp | ARM CPU BLAS benchmarking (GEMV, batched GEMV with NEON optimization) and intrusive pointer performance comparison against std::shared_ptr. |
| Code Quality & Formatting: .clang-format, .clang-tidy, .clang-tidy.ignore, .pre-commit-config.yaml, .clangd | Updated clang-format with C++20 focus, C++ column limit 128, 2-space indentation; clang-tidy enabled with warnings-as-errors and broader checks (google-*, modernize-*, performance-*); pre-commit hook for clang-format; clangd C++20 configuration. |
| Development Environment: .devcontainer/*/devcontainer.json, docker/Dockerfile.*, docker/README.md, .editorconfig, .vscode/* | Multi-platform devcontainers (ARM, CUDA 12.4, CUDA 12.8, QNN) with VSCode extensions; Dockerfiles for ARM NDK, CUDA variants, and QNN SDK; editor config for project-wide formatting. |
| GitHub Workflows & Templates: .github/workflows/*.yml, .github/ISSUE_TEMPLATE/*.yml, .github/pull_request_template.md, .github/copilot-instructions.md | CI pipelines for Android NDK, macOS Apple Silicon, documentation deployment, and pymllm nightly builds; structured issue templates (bugs, features, model support, performance, research); PR template and Copilot guidelines. |
| Project Metadata & Configuration: .gitignore, .gitmodules, CODEOWNERS, LICENSE, AUTHORS, README.md, algorithms/.gitignore | Restructured .gitignore to granular file patterns, replaced submodules (removed android/pybind11, added fmt/benchmark/kleidiai/cccl/cutlass/llvm-project/tokenizers), CODEOWNERS for code ownership, updated LICENSE year to 2025, regenerated AUTHORS, comprehensive README redesign. |
| Documentation Structure: docs/, docs/conf.py, docs/index.rst, docs/api/*, docs/arch/*, docs/cache/*, docs/compile/*, docs/cpu_backend/*, docs/qnn_backend/*, docs/contribute/*, docs/qa/* | Complete Sphinx documentation suite with C++ API reference (Tensor, Module, Layer, NN, Functional, ARGeneration), architecture guides (module/layer/dispatcher, IR levels, op plugin system, tensor layout), backend-specific docs (CPU threads/FA2/ARM/X86, QNN design), contribution guidelines, and FAQ. |
| Fancy Algorithm Skeleton: algorithms/fancy_algorithm/{.gitignore,CMakeLists.txt,README.md,main.cpp,run.py} | Minimal custom algorithm development template with CMake configuration, empty main entry point, and Android build/deployment script. |
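
The benchmark row above compares an intrusive pointer with `std::shared_ptr`. As background, a minimal single-threaded intrusive pointer might look like the sketch below. `RefCounted`, `IntrusivePtr`, and `Node` are hypothetical names for this illustration and are not taken from benchmarks/ext_stl/intrusive_ptr.cpp; a real implementation would normally use an atomic counter.

```cpp
#include <cassert>
#include <utility>

// Hypothetical base class: the reference count lives inside the object itself.
class RefCounted {
 public:
  virtual ~RefCounted() = default;
  void retain() { ++count_; }
  void release() {
    if (--count_ == 0) delete this;
  }
  int use_count() const { return count_; }

 private:
  int count_ = 0;  // not atomic: single-threaded sketch only
};

template <typename T>
class IntrusivePtr {
 public:
  IntrusivePtr() = default;
  explicit IntrusivePtr(T* p) : ptr_(p) {
    if (ptr_) ptr_->retain();
  }
  IntrusivePtr(const IntrusivePtr& o) : ptr_(o.ptr_) {
    if (ptr_) ptr_->retain();
  }
  IntrusivePtr(IntrusivePtr&& o) noexcept : ptr_(o.ptr_) { o.ptr_ = nullptr; }
  // Copy-and-swap covers both copy and move assignment.
  IntrusivePtr& operator=(IntrusivePtr o) noexcept {
    std::swap(ptr_, o.ptr_);
    return *this;
  }
  ~IntrusivePtr() {
    if (ptr_) ptr_->release();
  }

  T* get() const { return ptr_; }
  T* operator->() const {
    assert(ptr_ != nullptr);
    return ptr_;
  }

 private:
  T* ptr_ = nullptr;  // the handle is a single raw pointer
};

// Example payload type that opts in to intrusive counting.
struct Node : RefCounted {
  int value = 0;
};
```

Since the counter is embedded in the object, the handle stays one pointer wide and no separate control block is allocated, which is the usual motivation for benchmarking it against `std::shared_ptr`.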

Sequence Diagram(s)

sequenceDiagram
    participant Dev as Developer
    participant Git as Git Workflow
    participant Build as CMake Build
    participant CI as GitHub Actions
    participant Img as Container Image
    participant Test as Testing

    Dev->>Git: Push to v2 branch
    Git->>CI: Trigger workflow (PR/push)
    
    par CI Jobs
        CI->>Img: build-android (Ubuntu)
        Img->>Build: cmake + Ninja (NDK r28b)
        Build->>Test: Execute Android build
        
        CI->>Img: build-macos (macOS)
        Img->>Build: cmake + Clang (Apple Silicon)
        Build->>Test: Execute macOS build
        
        CI->>Img: docs-deploy (Ubuntu)
        Img->>Build: Sphinx + Doxygen
        Build->>Test: Build docs → Deploy Pages
        
        CI->>Img: pymllm-nightly (macOS)
        Img->>Build: Build wheel (bump version)
        Build->>Test: Upload PyPI artifact
    end
    
    Dev->>Build: Local: cmake (C++20, feature flags)
    Build->>Build: Link MllmRT, MllmCPUBackend, etc.
    Build->>Test: Run benchmarks (ARM BLAS, IntrusivePtr)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Key areas requiring attention:

  • Lazy VLM Algorithm Implementation (algorithms/lazy_vlm/HKVCache*.{hpp,cpp}, modeling_qwen2*.hpp): Dense multi-layer cache management, attention-based pruning logic, dynamic tensor slicing, and per-layer state tracking across prefill/decode phases. Critical for correctness of visual token selection and KV cache consistency.
  • CMakeLists.txt Overhaul: Extensive new CMake options (30+ flags), intricate platform-specific logic (OpenMP vs. Apple GCD), CPM package integration, and multi-backend build targets (ARM, CUDA, QNN, Ascend). Requires validation of conditional dependencies and build artifact organization.
  • Submodule Restructuring (.gitmodules): Significant replacement of tracked dependencies; impacts reproducibility and vendored library integration across multiple backend paths.
  • CI/CD Workflow Changes: Removal of main.yml and introduction of multiple specialized workflows; risk of CI coverage gaps if branch/path patterns are misconfigured.
  • Configuration Consistency: Clang-format, clang-tidy, and clangd changes must align with project standards and not conflict with existing IDE/linter expectations.
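
To make the attention-based pruning concern in the first item more concrete, the sketch below shows a generic top-k token selection over per-token attention scores. It is a simplified stand-in, not the HKVCache logic: `select_top_k_tokens`, the flat score vector, and the fixed `k` budget are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: return the positions of the k highest-scoring tokens,
// sorted by position so the surviving KV entries keep their original order.
std::vector<std::size_t> select_top_k_tokens(const std::vector<float>& scores, std::size_t k) {
  assert(k <= scores.size());
  std::vector<std::size_t> idx(scores.size());
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  // Move the k best indices to the front, ordered by descending score.
  std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k), idx.end(),
                    [&scores](std::size_t a, std::size_t b) { return scores[a] > scores[b]; });
  idx.resize(k);
  std::sort(idx.begin(), idx.end());  // restore positional order
  return idx;
}
```

Restoring positional order after selection matters for the consistency issue raised above: the pruned keys and values must stay aligned with each other and with the position ids used in later decode steps.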

Possibly related PRs

Suggested reviewers

  • yirongjie
  • chenghuaWang
  • liang1232018

Poem

🐰 A bunny hops through v2's grand maze,
With caches bright and algorithms ablaze!
From clang-format to docs so fine,
And workflows that build in parallel lines.
Lazy tokens pruned with bunny care,
MLLM now blooms—a framework beyond compare! 🌟


📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between db227d7 and 0ff1f20.

⛔ Files ignored due to path filters (18)
  • assets/australia.jpg is excluded by !**/*.jpg
  • assets/bird_audio.wav is excluded by !**/*.wav
  • assets/bird_image.jpg is excluded by !**/*.jpg
  • assets/bus.png is excluded by !**/*.png
  • assets/car_audio.wav is excluded by !**/*.wav
  • assets/car_image.jpg is excluded by !**/*.jpg
  • assets/cat.jpg is excluded by !**/*.jpg
  • assets/chat_record_demo.png is excluded by !**/*.png
  • assets/dog_audio.wav is excluded by !**/*.wav
  • assets/dog_image.jpg is excluded by !**/*.jpg
  • assets/shadow_execution.png is excluded by !**/*.png
  • assets/two_cats.jpg is excluded by !**/*.jpg
  • assets/uidemo.jpg is excluded by !**/*.jpg
  • assets/uidemo2.png is excluded by !**/*.png
  • docs/_static/img/arch.png is excluded by !**/*.png
  • docs/_static/img/qnn-trace-execute-seq.png is excluded by !**/*.png
  • docs/_static/img/tensor-storage.png is excluded by !**/*.png
  • mllm-cli/go.sum is excluded by !**/*.sum
📒 Files selected for processing (107)
  • .clang-format (1 hunks)
  • .clang-tidy (1 hunks)
  • .clang-tidy.ignore (1 hunks)
  • .clangd (1 hunks)
  • .devcontainer/arm/devcontainer.json (1 hunks)
  • .devcontainer/cu124/devcontainer.json (1 hunks)
  • .devcontainer/cu128/devcontainer.json (1 hunks)
  • .devcontainer/qnn/devcontainer.json (1 hunks)
  • .editorconfig (1 hunks)
  • .github/ISSUE_TEMPLATE/01-bugs-report.yml (1 hunks)
  • .github/ISSUE_TEMPLATE/02-feature_request.yml (1 hunks)
  • .github/ISSUE_TEMPLATE/03-model-support-request.yml (1 hunks)
  • .github/ISSUE_TEMPLATE/04-performance.yml (1 hunks)
  • .github/ISSUE_TEMPLATE/05-research-experiment.yml (1 hunks)
  • .github/copilot-instructions.md (1 hunks)
  • .github/pull_request_template.md (1 hunks)
  • .github/workflows/build-android.yml (1 hunks)
  • .github/workflows/build-osx.yml (1 hunks)
  • .github/workflows/docs-deploy.yml (1 hunks)
  • .github/workflows/main.yml (0 hunks)
  • .github/workflows/pymllm-macos-nightly.yml (1 hunks)
  • .gitignore (1 hunks)
  • .gitmodules (1 hunks)
  • .pre-commit-config.yaml (1 hunks)
  • .vscode/extensions.json (1 hunks)
  • .vscode/settings_recommended.json (1 hunks)
  • AUTHORS (1 hunks)
  • CMakeLists.txt (1 hunks)
  • CODEOWNERS (1 hunks)
  • LICENSE (2 hunks)
  • README.md (4 hunks)
  • algorithms/.gitignore (1 hunks)
  • algorithms/fancy_algorithm/.gitignore (1 hunks)
  • algorithms/fancy_algorithm/CMakeLists.txt (1 hunks)
  • algorithms/fancy_algorithm/README.md (1 hunks)
  • algorithms/fancy_algorithm/main.cpp (1 hunks)
  • algorithms/fancy_algorithm/run.py (1 hunks)
  • algorithms/lazy_vlm/.gitignore (1 hunks)
  • algorithms/lazy_vlm/CMakeLists.txt (1 hunks)
  • algorithms/lazy_vlm/HKVCache.cpp (1 hunks)
  • algorithms/lazy_vlm/HKVCache.hpp (1 hunks)
  • algorithms/lazy_vlm/HKVCacheFast.cpp (1 hunks)
  • algorithms/lazy_vlm/HKVCacheFast.hpp (1 hunks)
  • algorithms/lazy_vlm/LazyVLMQwen2VL.cpp (1 hunks)
  • algorithms/lazy_vlm/LazyVLMQwen2VLFast.cpp (1 hunks)
  • algorithms/lazy_vlm/LazyVLMQwen2_5VL.cpp (1 hunks)
  • algorithms/lazy_vlm/LazyVLMQwen2_5VLFast.cpp (1 hunks)
  • algorithms/lazy_vlm/models/qwen2_5vl/lazy_vlm_cfg.hpp (1 hunks)
  • algorithms/lazy_vlm/models/qwen2_5vl/lazy_vlm_cfg_fast.hpp (1 hunks)
  • algorithms/lazy_vlm/models/qwen2_5vl/modeling_qwen2_5vl.hpp (1 hunks)
  • algorithms/lazy_vlm/models/qwen2_5vl/modeling_qwen2_5vl_fast.hpp (1 hunks)
  • algorithms/lazy_vlm/models/qwen2vl/lazy_vlm_cfg.hpp (1 hunks)
  • algorithms/lazy_vlm/models/qwen2vl/modeling_qwen2vl.hpp (1 hunks)
  • algorithms/lazy_vlm/run.py (1 hunks)
  • algorithms/lazy_vlm/run_remote_android.py (1 hunks)
  • android (0 hunks)
  • benchmarks/CMakeLists.txt (1 hunks)
  • benchmarks/cpu/CMakeLists.txt (1 hunks)
  • benchmarks/cpu/arm_mllm_blas_sgemm.cpp (1 hunks)
  • benchmarks/ext_stl/CMakeLists.txt (1 hunks)
  • benchmarks/ext_stl/intrusive_ptr.cpp (1 hunks)
  • cmake/CPM.cmake (1 hunks)
  • cmake/mllmConfig.cmake.in (1 hunks)
  • docker/Dockerfile.arm (1 hunks)
  • docker/Dockerfile.cu124 (1 hunks)
  • docker/Dockerfile.cu128 (1 hunks)
  • docker/Dockerfile.qnn (1 hunks)
  • docker/README.md (1 hunks)
  • docs/.gitignore (1 hunks)
  • docs/Doxyfile (1 hunks)
  • docs/Makefile (1 hunks)
  • docs/algorithms/index.rst (1 hunks)
  • docs/algorithms/pruning.rst (1 hunks)
  • docs/api/argeneration.rst (1 hunks)
  • docs/api/functional.rst (1 hunks)
  • docs/api/index.rst (1 hunks)
  • docs/api/layer.rst (1 hunks)
  • docs/api/mllm.rst (1 hunks)
  • docs/api/module.rst (1 hunks)
  • docs/api/nn.rst (1 hunks)
  • docs/api/tensor.rst (1 hunks)
  • docs/arch/arch.rst (1 hunks)
  • docs/arch/index.rst (1 hunks)
  • docs/arch/op_plugin_system.rst (1 hunks)
  • docs/arch/support_ops.rst (1 hunks)
  • docs/arch/tensor.rst (1 hunks)
  • docs/cache/index.rst (1 hunks)
  • docs/compile/index.rst (1 hunks)
  • docs/compile/ir.rst (1 hunks)
  • docs/conf.py (1 hunks)
  • docs/contribute/guidelines.rst (1 hunks)
  • docs/contribute/index.rst (1 hunks)
  • docs/contribute/model_supports.rst (1 hunks)
  • docs/contribute/roadmap.rst (1 hunks)
  • docs/cpu_backend/arm/index.rst (1 hunks)
  • docs/cpu_backend/arm/mllm_blas.rst (1 hunks)
  • docs/cpu_backend/arm/multithread_behaviors.rst (1 hunks)
  • docs/cpu_backend/fa2_radix_paged.rst (1 hunks)
  • docs/cpu_backend/index.rst (1 hunks)
  • docs/cpu_backend/threads.rst (1 hunks)
  • docs/cpu_backend/x86/index.rst (1 hunks)
  • docs/index.rst (1 hunks)
  • docs/make.bat (1 hunks)
  • docs/qa/index.rst (1 hunks)
  • docs/qnn_backend/core_design.rst (1 hunks)
  • docs/qnn_backend/index.rst (1 hunks)
  • docs/qnn_backend/qnn_model_convert.rst (1 hunks)
⛔ Files not processed due to max files limit (17)
  • docs/qnn_backend/setup_env.rst
  • docs/quantization/data_types.rst
  • docs/quantization/how_to_add_new_dtype.rst
  • docs/quantization/index.rst
  • docs/quick_start/how_to_add_backend.rst
  • docs/quick_start/how_to_add_op.rst
  • docs/quick_start/how_to_async.rst
  • docs/quick_start/how_to_model.rst
  • docs/quick_start/how_to_perf.rst
  • docs/quick_start/index.rst
  • docs/requirements.txt
  • docs/service/index.rst
  • docs/service/mllm_cli.rst
  • docs/talks/index.rst
  • examples/CMakeLists.txt
  • examples/deepseek_ocr/CMakeLists.txt
  • examples/deepseek_ocr/main.cpp


@oreomaker oreomaker closed this Nov 23, 2025

7 participants