Skip to content

feat(cpu-backend): add support for SME2 and SVE2 in ARM backend configurations#533

Merged
chenghuaWang merged 2 commits intoUbiquitousLearning:v2from
chenghuaWang:v2
Nov 20, 2025
Merged

feat(cpu-backend): add support for SME2 and SVE2 in ARM backend configurations#533
chenghuaWang merged 2 commits intoUbiquitousLearning:v2from
chenghuaWang:v2

Conversation

@chenghuaWang
Copy link
Copy Markdown
Collaborator

@chenghuaWang chenghuaWang commented Nov 19, 2025

Summary by CodeRabbit

  • New Features

    • Added SME2 (Scalable Matrix Extension 2) and SVE2 (Scalable Vector Extension 2) support for ARM/Apple Silicon platforms, enabling optimized quantization and matmul operations.
    • Added six new Kai Linear implementation variants with improved quantization support.
    • Enhanced FP16 support detection to recognize IEEE-format FP16.
  • Build & Infrastructure

    • Updated build configurations across multiple platforms (Apple Silicon, Android, CLI) to enable SME2/SVE2 optimizations.
    • Modified Android deployment paths and disabled thread vendor for Android builds.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Nov 19, 2025

Walkthrough

This PR introduces SME2 (Scalable Matrix Extension 2) support for the ARM CPU backend, adds new f32-based KAI kernel variants with optimizations across NEON, dotprod, and i8mm instruction sets, updates multiple build configurations to enable SME2 and SVE2, improves ARM feature detection (IEEE FP16, SVE availability), and refactors tile-based parallelization in the KAI kernel dispatcher.

Changes

Cohort / File(s) Summary
CMake Configuration
CMakeLists.txt, mllm/backends/cpu/CMakeLists.txt
Introduces MLLM_CPU_BACKEND_USE_SME2 option (default OFF), defines SME2-specific source group MLLM_KAI_SOURCES_SME2 with assembly/C/dotprod files, conditionally includes SME2 matmul headers and sources in MllmCPUBackend target.
ARM/KAI Kernel Implementations
mllm/backends/cpu/kernels/arm/linear/gemm_aarch64.cpp, mllm/backends/cpu/kernels/arm/linear/kai.cpp, mllm/backends/cpu/kernels/arm/linear/kai_sme.cpp, mllm/backends/cpu/kernels/arm/linear/kai_sme.hpp
Adds SVE scaffolding and feature detection (mllm_cpu_has_sve()), enhances q4_0_q8_0 paths with bias handling under SVE conditions, refactors tile-based parallelization in kai.cpp using precomputed tile splits, expands kai_sme header documentation with encoding scheme details.
Type Definitions and Enums
mllm/core/aops/LinearOp.hpp
Adds six new f32-based KaiLinear enums (SME2 mopa/dot, NEON dotprod/i8mm variants) and corresponding string mappings in str2LinearImplTypes and LinearImplTypes2Str.
Data Type & Feature Support
mllm/core/DataTypes.hpp, mllm/backends/cpu/kernels/common/ggml/vec_dot.hpp
Extends ARM FP16 detection to include __ARM_FP16_FORMAT_IEEE, adds explicit arm_sve.h inclusion guard, changes mllm_cpu_get_sve_cnt() from inline to static inline with platform-specific logic for macOS (returns 128) and Linux (uses prctl with PR_SVE_VL_LEN_MASK).
Utility & Functional Updates
mllm/backends/cpu/kernels/common/ggml/matmul.cpp, mllm/nn/Functional.hpp, mllm/backends/cpu/ops/LinearOp.cpp
Removes single-line Chinese comment in matmul.cpp, replaces #pragma unroll with vector capacity pre-reservation in split/chunk functions, expands switch-case blocks in LinearOp.cpp to recognize additional KaiLinear implementation types.
Build Task Configurations
tasks/build_android.yaml, tasks/build_android_debug.yaml, tasks/build_android_opencl.yaml, tasks/build_android_qnn.yaml, tasks/build_android_qnn_debug.yaml, tasks/adb_push.yaml
Adds MLLM_KERNEL_USE_THREADS_VENDOR_MLLM=OFF across Android build variants, updates AdbPushTask destination path from /data/local/tmp/mllm/bin/ to /data/local/tmp/mllm-v2/bin/, adds mllm-qwen3-runner and mllm-qwen3-service binaries.
Apple Silicon Build Configurations
tasks/build_osx_apple_silicon.yaml, tasks/build_osx_apple_silicon_accelerate.yaml, tasks/build_osx_apple_silicon_compile_stack.yml, tasks/build_osx_apple_silicon_dbg.yaml, tasks/build_osx_apple_silicon_perf.yaml, tasks/build_osx_cli.yaml
Replaces CPU backend compile options from SME to sme2+sve2, adds -DMLLM_CPU_BACKEND_USE_SME2=ON flag across all Apple Silicon and OSX CLI build variants.
Dependency Updates
mllm/backends/cpu/vendors/kleidiai
Updates KleidaiAI submodule reference from 8ca2267... to 84796ec....

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Areas requiring extra attention:
    • mllm/backends/cpu/kernels/arm/linear/kai.cpp: Tile-based parallelization refactoring introduces precomputed tile_splits logic that differs from original nested loop structure; verify correctness of tile indexing and pointer calculations.
    • mllm/backends/cpu/CMakeLists.txt: Complex conditional logic for SME2 source group definition and include path management; ensure proper scoping and dependency ordering.
    • mllm/core/aops/LinearOp.hpp: Six new enum entries with bidirectional string mappings must remain synchronized; verify no misalignment between str2LinearImplTypes and LinearImplTypes2Str.
    • mllm/backends/cpu/kernels/common/ggml/vec_dot.hpp: Platform-specific SVE count implementation (macOS vs. Linux prctl behavior) requires verification against target platforms.

Possibly related PRs

Suggested reviewers

  • yirongjie
  • oreomaker

Poem

🐰 A hop through SME2 lands so fine,
New kernels march in perfect line—
With tiles and SVE, feat'ring bright,
ARM extensions reach new height!
Silicon gleams from dawn to eve,
Our bundled bits, you won't believe! 🚀✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main objective of the changeset: adding SME2 and SVE2 support to ARM backend configurations.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@chenghuaWang
Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Nov 20, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
mllm/backends/cpu/kernels/arm/linear/ggml/gemm_aarch64.cpp (1)

1033-1153: Fix SVE bias load instruction

The review comment correctly identifies a critical bug. The SVE implementation loads bias data using ld1b (load bytes), but the scalar fallback shows bias is FP32:

  • Line 1132 in scalar fallback: const float* bias_ptr = (const float*)bias; casts bias to float*
  • Line 1136 reads 8 FP32 values: sumf[j] = bias_ptr[x * ncols_interleaved + j]; with ncols_interleaved=8
  • SVE advance of #0x20 (32 bytes) = 8 × 4-byte floats confirms FP32 layout

Using ld1b loads only 8 individual bytes per lane and reinterprets them as floating-point values, producing garbage. The fix should use ld1w to load 8 FP32 words (32 bytes total), matching the scalar path.

-                       "ld1b { z17.b }, p0/Z, [%x[bias_ptr]]\n"
+                       "ld1w { z17.s }, p0/Z, [%x[bias_ptr]]\n"
                        "fadd z31.s, p0/M, z31.s, z17.s\n"
                        "add %x[bias_ptr], %x[bias_ptr], #0x20\n"
CMakeLists.txt (1)

89-108: Update line 93 to match new SME2/SVE2 compilation options.

Line 93 sets MLLM_CPU_BACKEND_COMPILE_OPTIONS with the old "+sme" flag for the py_mllm build path, while the new build configurations use "+sme2+sve2". This inconsistency means Python bindings on Apple will have different feature support than other builds.

Apply this diff to align py_mllm compile options with the SME2/SVE2 updates:

  if(MLLM_ENABLE_PY_MLLM)
    add_compile_definitions(MLLM_ENABLE_PY_MLLM=1)
    if(APPLE)
      set(MLLM_BUILD_ARM_BACKEND ON CACHE BOOL "Build ARM backend" FORCE)
-     set(MLLM_CPU_BACKEND_COMPILE_OPTIONS "-march=native+fp16+fp16fml+dotprod+i8mm+sme" CACHE STRING "CPU compile options" FORCE)
+     set(MLLM_CPU_BACKEND_COMPILE_OPTIONS "-march=native+fp16+fp16fml+dotprod+i8mm+sme2+sve2" CACHE STRING "CPU compile options" FORCE)
+     set(MLLM_CPU_BACKEND_USE_SME2 ON CACHE BOOL "Enable SME2" FORCE)
      set(MLLM_USE_BLAS ON CACHE BOOL "Use BLAS" FORCE)
🧹 Nitpick comments (5)
mllm/nn/Functional.hpp (3)

71-78: Pre‑reserve in split(num, x, dim) is a good micro‑optimization

ret.reserve(num); avoids repeated reallocations when pushing num tensors and is semantically safe given the existing loop.

If there’s any chance num could be negative or disagree with outputs.size(), consider guarding or asserting on num > 0 and using ret.reserve(outputs.size()) (or static_cast<size_t>(num)) for additional robustness.


94-100: Same pre‑reserve optimization for chunk(num, x, dim) looks correct

Mirroring split, ret.reserve(num); here also reduces allocations and matches the subsequent loop usage.

Same as above, a small assertion on num > 0 (and possibly matching outputs.size()) would make this API more defensive against misuse without changing normal behavior.


6-7: Add explicit includes for std::array and std::numeric_limits

This header uses std::array and std::numeric_limits but only includes <vector> and <cstdint>. Relying on transitive includes is brittle and may break with toolchain changes.

Consider explicitly adding:

-#include <vector>
-#include <cstdint>
+#include <vector>
+#include <array>
+#include <cstdint>
+#include <limits>

to make the dependencies clear and compilation more robust.

Also applies to: 35-37, 124-127

mllm/backends/cpu/kernels/arm/linear/kai_sme.hpp (1)

5-43: Good KAI naming doc; consider guarding arm_neon.h include

The added terminology block is helpful, and the SME header scaffold is fine. One small portability nit: <arm_neon.h> is included unconditionally, while in other places (e.g., vec_dot.hpp) NEON headers are guarded by #ifdef __ARM_NEON. If this header ever gets pulled into a non‑ARM build, that unconditional include can break the build or static analysis.

Consider:

-#include <arm_neon.h>
+#if defined(__ARM_NEON)
+#include <arm_neon.h>
+#endif

to mirror the pattern used elsewhere.

mllm/backends/cpu/kernels/arm/linear/kai.cpp (1)

414-440: Tile‑index–based parallelization looks correct; minor nits

Precomputing tile_splits and parallelizing over tile_idx preserves the original tiling semantics and should give better load balancing across both M and N; the run_matmul arguments and offsets remain unchanged.

Two minor cleanups you may want to consider:

  • Reserve capacity for tile_splits to avoid reallocations:
const int m_tiles = (M + m_step - 1) / m_step;
const int n_tiles = (N + n_step - 1) / n_step;
std::vector<std::pair<int, int>> tile_splits;
tile_splits.reserve(m_tiles * n_tiles);
  • Make the loop bound type explicit to match the macro’s induction type, e.g.:
const int tile_count = static_cast<int>(tile_splits.size());
MLLM_CONDITIONAL_PARALLEL_FOR(thread_count > 1, thread_count, tile_idx, 0, tile_count, 1, { ... });

to avoid any implicit size_tint conversions inside the macro.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f8fe667 and 1ab6109.

📒 Files selected for processing (25)
  • CMakeLists.txt (1 hunks)
  • mllm/backends/cpu/CMakeLists.txt (2 hunks)
  • mllm/backends/cpu/kernels/arm/linear/ggml/gemm_aarch64.cpp (2 hunks)
  • mllm/backends/cpu/kernels/arm/linear/kai.cpp (1 hunks)
  • mllm/backends/cpu/kernels/arm/linear/kai_sme.cpp (1 hunks)
  • mllm/backends/cpu/kernels/arm/linear/kai_sme.hpp (1 hunks)
  • mllm/backends/cpu/kernels/common/ggml/matmul.cpp (0 hunks)
  • mllm/backends/cpu/kernels/common/ggml/vec_dot.hpp (2 hunks)
  • mllm/backends/cpu/ops/LinearOp.cpp (1 hunks)
  • mllm/backends/cpu/vendors/kleidiai (1 hunks)
  • mllm/core/DataTypes.hpp (2 hunks)
  • mllm/core/aops/LinearOp.hpp (3 hunks)
  • mllm/nn/Functional.hpp (2 hunks)
  • tasks/adb_push.yaml (1 hunks)
  • tasks/build_android.yaml (1 hunks)
  • tasks/build_android_debug.yaml (1 hunks)
  • tasks/build_android_opencl.yaml (1 hunks)
  • tasks/build_android_qnn.yaml (1 hunks)
  • tasks/build_android_qnn_debug.yaml (1 hunks)
  • tasks/build_osx_apple_silicon.yaml (1 hunks)
  • tasks/build_osx_apple_silicon_accelerate.yaml (1 hunks)
  • tasks/build_osx_apple_silicon_compile_stack.yml (1 hunks)
  • tasks/build_osx_apple_silicon_dbg.yaml (1 hunks)
  • tasks/build_osx_apple_silicon_perf.yaml (1 hunks)
  • tasks/build_osx_cli.yaml (1 hunks)
💤 Files with no reviewable changes (1)
  • mllm/backends/cpu/kernels/common/ggml/matmul.cpp
🧰 Additional context used
🪛 Clang (14.0.6)
mllm/backends/cpu/kernels/arm/linear/kai_sme.hpp

[error] 36-36: 'arm_neon.h' file not found

(clang-diagnostic-error)

mllm/backends/cpu/kernels/common/ggml/vec_dot.hpp

[error] 31-31: 'mllm/core/DataTypes.hpp' file not found

(clang-diagnostic-error)

🔇 Additional comments (25)
tasks/adb_push.yaml (2)

10-11: Clarify when these new binaries are built and available.

The new binaries mllm-qwen3-runner and mllm-qwen3-service have been added to the push list. Are these built unconditionally for all ARM64 Android builds, or only under specific configuration flags (e.g., when building with SME2/SVE2 support)? This affects whether the ADB push task will fail if the binaries don't exist.


16-16: Verify the destination path change is intentional and won't break existing workflows.

The destination path has been changed from /data/local/tmp/mllm/bin/ to /data/local/tmp/mllm-v2/bin/. This is a breaking change affecting all deployed binaries. While this may be intentional given the PR targets the v2 branch, ensure that:

  1. All dependent CI/CD pipelines, test scripts, and automation have been updated to use the new path.
  2. Existing v1 deployments (if any) won't be disrupted or confused by the path separation.

If this is intentional versioning separation, consider adding a comment to the YAML to explain the rationale.

mllm/core/DataTypes.hpp (2)

9-10: LGTM! Broader ARM FP16 detection.

The addition of __ARM_FP16_FORMAT_IEEE appropriately extends FP16 feature detection to include IEEE format support, enabling native FP16 on a wider range of ARM configurations.


22-23: LGTM! Consistent typedef logic.

The typedef guard correctly mirrors the include guard from lines 9-10, ensuring consistent behavior when selecting between native float16_t and the half_float::half fallback.

mllm/backends/cpu/vendors/kleidiai (1)

1-1: Verify the submodule update contains the expected SME2 kernel implementations.

The codebase shows strong integration of SME2 support: the root CMakeLists.txt adds MLLM_CPU_BACKEND_USE_SME2 option, and mllm/backends/cpu/CMakeLists.txt gates SME2-specific kernel sources from the kleidiai submodule (e.g., kai_matmul_clamp_f32_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa_asm.S). This indicates the updated submodule commit should provide these SME2 assembly kernels.

However, the sandbox cannot access the submodule commits directly to confirm. Please manually verify that the new kleidiai commit (84796ece210fbf736c2c51236f4690e0b8812861) includes:

  • SME2-specific assembly implementations (files matching *sme2_* pattern)
  • SVE2 optimizations if applicable
  • KAI common SME support (kai_common_sme_asm.S)
mllm/backends/cpu/kernels/arm/linear/kai_sme.cpp (1)

4-6: SME2 scaffold TU looks fine

This is a harmless placeholder that wires the SME header into a TU; no correctness concerns for now.

tasks/build_android_opencl.yaml (1)

6-16: Android OpenCL config flag is consistent

Adding -DMLLM_KERNEL_USE_THREADS_VENDOR_MLLM=OFF aligns this task with the other Android builds in the PR and is safe from a build/config perspective.

tasks/build_android_debug.yaml (1)

6-15: Debug Android config updated consistently

Disabling MLLM_KERNEL_USE_THREADS_VENDOR_MLLM here mirrors the other Android tasks; no issues from a build/config standpoint.

tasks/build_android_qnn_debug.yaml (1)

12-21: QNN debug task threading flag is coherent

Setting MLLM_KERNEL_USE_THREADS_VENDOR_MLLM=OFF here keeps QNN debug builds aligned with the other Android variants and avoids mixing thread vendors.

mllm/backends/cpu/kernels/arm/linear/ggml/gemm_aarch64.cpp (1)

17-32: SVE include and feature helper are reasonable

The new #ifdef __ARM_FEATURE_SVE + #include <arm_sve.h> and mllm_cpu_has_sve() helper scoped in the anonymous namespace are straightforward and keep SVE feature checks local to this TU. No functional concerns here.

mllm/backends/cpu/kernels/common/ggml/vec_dot.hpp (1)

33-45: SVE detection and mllm_cpu_get_sve_cnt helper look sane

Restricting <sys/prctl.h> to __linux__ && __ARM_FEATURE_SVE and making mllm_cpu_get_sve_cnt() static inline avoids non‑Linux build breakage and ODR issues. Returning 128 on Apple Silicon and using PR_SVE_GET_VL & PR_SVE_VL_LEN_MASK on Linux aligns with typical SVE VL detection patterns.

Also applies to: 335-346

mllm/core/aops/LinearOp.hpp (3)

35-41: LGTM! Well-organized enum additions.

The new f32 KaiLinear variants are clearly grouped and follow consistent naming conventions. The comment on line 35 provides helpful context for the grouping.


83-97: LGTM! String-to-enum mappings are complete and consistent.

All six new variants have corresponding string mappings that follow the established naming pattern.


135-149: LGTM! Enum-to-string mappings are bidirectionally consistent.

The reverse mappings correctly mirror the forward mappings from lines 83-97.

tasks/build_android.yaml (1)

15-15: LGTM! Threading vendor configuration is explicit and consistent.

The addition of -DMLLM_KERNEL_USE_THREADS_VENDOR_MLLM=OFF explicitly disables the MLLM thread vendor in favor of OpenMP (line 14), making the threading configuration clear and intentional.

tasks/build_osx_apple_silicon_perf.yaml (1)

7-8: LGTM! Consistent SME2/SVE2 enablement across build variants.

The changes mirror those in build_osx_apple_silicon.yaml, maintaining consistency across build configurations. The same verification regarding hardware support applies.

tasks/build_android_qnn.yaml (1)

22-22: LGTM! Consistent threading configuration for Android QNN builds.

The threading vendor configuration matches the pattern established in other Android build configurations (build_android.yaml, etc.), ensuring consistency across Android build variants.

tasks/build_osx_apple_silicon_accelerate.yaml (1)

7-8: LGTM! SME2/SVE2 support for Accelerate framework build.

Consistent enablement of SME2/SVE2 across all OSX Apple Silicon build variants. The Accelerate framework build also benefits from these ARMv9 extensions.

tasks/build_osx_cli.yaml (1)

10-11: LGTM! SME2/SVE2 support for CLI and SDK builds.

The CLI and C SDK builds consistently enable SME2/SVE2 support, ensuring users benefit from these optimizations across all distribution formats.

tasks/build_osx_apple_silicon_dbg.yaml (1)

7-8: ✓ SME2/SVE2 enablement in debug build is consistent.

tasks/build_osx_apple_silicon_compile_stack.yml (1)

7-8: ✓ Consistent with parallel build configuration changes.

CMakeLists.txt (1)

42-43: ✓ New SME2 CMake option correctly declared and placed.

mllm/backends/cpu/CMakeLists.txt (3)

25-27: ✓ Include directories for f32_qsi8d32p_qai4c32p properly added.


81-92: ✓ SME2-specific sources are correctly gated by the feature flag.


100-100: Verify that unconditional NEON sources (lines 70–78) are intentional.

Lines 70–78 add new f32_qsi8d32p_qai4c32p NEON kernel variants that are always compiled, independent of the MLLM_CPU_BACKEND_USE_SME2 flag. Confirm this is intentional: these should be available as baseline optimizations even when SME2 is disabled.

Comment on lines +70 to +78
# SME Related
${kleidiai_SOURCE_DIR}/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p1x4_qai4c32p4x4_1x4_neon_dotprod_asm.S
${kleidiai_SOURCE_DIR}/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p1x4_qai4c32p4x4_1x4_neon_dotprod.c
${kleidiai_SOURCE_DIR}/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p1x8_qai4c32p4x8_1x4_neon_dotprod_asm.S
${kleidiai_SOURCE_DIR}/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p1x8_qai4c32p4x8_1x4_neon_dotprod.c
${kleidiai_SOURCE_DIR}/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p4x4_qai4c32p4x4_8x4_neon_dotprod_asm.S
${kleidiai_SOURCE_DIR}/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p4x4_qai4c32p4x4_8x4_neon_dotprod.c
${kleidiai_SOURCE_DIR}/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p4x8_qai4c32p4x8_8x4_neon_i8mm_asm.S
${kleidiai_SOURCE_DIR}/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p4x8_qai4c32p4x8_8x4_neon_i8mm.c
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Clarify the comment: these are NEON variants, not SME2-specific.

The comment "SME Related" at line 70 is misleading. Lines 70-78 add NEON-based kernels (dotprod and i8mm variants) for f32_qsi8d32p_qai4c32p, which are unconditionally compiled as baseline optimizations. These should not be labeled as "SME Related" since the actual SME2-specific variants are gated separately (lines 84–88).

Update the comment to clarify these are non-SME2 NEON optimizations:

-    # SME Related
+    # f32_qsi8d32p_qai4c32p NEON variants (dotprod, i8mm)
🤖 Prompt for AI Agents
mllm/backends/cpu/CMakeLists.txt lines 70-78: the comment "SME Related" is
misleading because the listed source files are NEON-based kernel variants
(dotprod and i8mm) compiled as baseline optimizations, not SME2-specific; change
the comment to something like "NEON baseline variants (non-SME2)" or
"NEON-related (baseline optimizations)" to clearly indicate these are NEON
variants and not SME/SME2 gated files.

Comment on lines +344 to +350
case aops::LinearImplTypes::kKaiLinear_f32_qai8dxp_qsi4c32p_mxk_nxk_qai8dxp1x4_qsi4c32p4x4_1x4:
case aops::LinearImplTypes::kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa:
case aops::LinearImplTypes::kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x4_qai4c32p4vlx4_1x4vl_sme2_dot:
case aops::LinearImplTypes::kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x4_qai4c32p4x4_1x4_neon_dotprod:
case aops::LinearImplTypes::kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x8_qai4c32p4x8_1x4_neon_dotprod:
case aops::LinearImplTypes::kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p4x4_qai4c32p4x4_8x4_neon_dotprod:
case aops::LinearImplTypes::kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p4x8_qai4c32p4x8_8x4_neon_i8mm: {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Critical: Missing forward() implementations for new LinearImplTypes.

The new KaiLinear f32 variants added in lines 345-350 are handled in reshape() but have no corresponding implementations in the forward() method (lines 99-314). If any of these implementation types are selected:

  • kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa
  • kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x4_qai4c32p4vlx4_1x4vl_sme2_dot
  • kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x4_qai4c32p4x4_1x4_neon_dotprod
  • kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x8_qai4c32p4x8_1x4_neon_dotprod
  • kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p4x4_qai4c32p4x4_8x4_neon_dotprod
  • kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p4x8_qai4c32p4x8_8x4_neon_i8mm

The operation will fail in forward() with "LinearImplTypes not supported" error (line 310).

You need to add corresponding case implementations in the forward() method similar to the existing KaiLinear cases (e.g., lines 166-279), or remove these enum values from the reshape() switch if the implementations are not ready.

Run the following script to check if there are kernel implementations for these new types elsewhere in the codebase:


🏁 Script executed:

#!/bin/bash
# Search for potential kernel implementations for the new SME2/SVE2 variants

echo "Searching for SME2 mopa implementations:"
rg -n "qsi8d32p1vlx4.*qai4c32p4vlx4.*1vlx4vl.*sme2.*mopa" --type=cpp --type=hpp -C 3

echo -e "\n\nSearching for SME2 dot implementations:"
rg -n "qsi8d32p1x4.*qai4c32p4vlx4.*1x4vl.*sme2.*dot" --type=cpp --type=hpp -C 3

echo -e "\n\nSearching for NEON dotprod implementations:"
rg -n "qsi8d32p.*qai4c32p.*neon.*dotprod" --type=cpp --type=hpp -C 3

echo -e "\n\nSearching for NEON i8mm implementations:"
rg -n "qsi8d32p4x8.*qai4c32p4x8.*8x4.*neon.*i8mm" --type=cpp --type=hpp -C 3

Length of output: 283


Let me verify this issue by examining the LinearOp.cpp file more closely with corrected search parameters:


🏁 Script executed:

#!/bin/bash
# Search for the forward() and reshape() methods in LinearOp.cpp
echo "=== Checking for forward() method and its case handling ==="
rg -n "void forward\(" mllm/backends/cpu/ops/LinearOp.cpp -A 250 | head -200

echo -e "\n\n=== Checking for reshape() method and its case handling ==="
rg -n "void reshape\(" mllm/backends/cpu/ops/LinearOp.cpp -A 300 | tail -100

echo -e "\n\n=== Searching for the specific new variant names in LinearOp.cpp ==="
rg -n "kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa" mllm/backends/cpu/ops/LinearOp.cpp

rg -n "kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x4_qai4c32p4vlx4_1x4vl_sme2_dot" mllm/backends/cpu/ops/LinearOp.cpp

echo -e "\n\n=== Check for default case or error handling in forward() ==="
rg -n "LinearImplTypes not supported" mllm/backends/cpu/ops/LinearOp.cpp -B 5 -A 2

Length of output: 654



🏁 Script executed:

#!/bin/bash
# View the sections around forward() and reshape() to see full context
echo "=== Lines 99-115 (start of forward() area) ==="
sed -n '99,115p' mllm/backends/cpu/ops/LinearOp.cpp

echo -e "\n=== Lines 300-320 (end of forward() area with default case) ==="
sed -n '300,320p' mllm/backends/cpu/ops/LinearOp.cpp

echo -e "\n=== Lines 340-360 (reshape() with new variants) ==="
sed -n '340,360p' mllm/backends/cpu/ops/LinearOp.cpp

echo -e "\n=== Search for all case statements in forward() area (lines 99-314) ==="
sed -n '99,314p' mllm/backends/cpu/ops/LinearOp.cpp | rg -n "case aops::LinearImplTypes::" | tail -20

echo -e "\n=== Count case statements in reshape() area (lines 315-400) ==="
sed -n '315,400p' mllm/backends/cpu/ops/LinearOp.cpp | rg -n "case aops::LinearImplTypes::" | head -15

Length of output: 5408


Critical: Missing forward() implementations for new LinearImplTypes confirmed.

The six new KaiLinear f32 variants at lines 345-350 are present in reshape() but have no corresponding case implementations in the forward() method (lines 99-314). When any of these types are selected, execution reaches the default case at line 310, triggering NYI("LinearImplTypes not supported"):

  • kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa
  • kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x4_qai4c32p4vlx4_1x4vl_sme2_dot
  • kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x4_qai4c32p4x4_1x4_neon_dotprod
  • kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x8_qai4c32p4x8_1x4_neon_dotprod
  • kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p4x4_qai4c32p4x4_8x4_neon_dotprod
  • kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p4x8_qai4c32p4x8_8x4_neon_i8mm

Add case implementations in forward() matching the pattern of existing KaiLinear cases (lines 164-182), or remove these enum values from reshape() if implementations are not ready.

Comment on lines +7 to +8
- '-DMLLM_CPU_BACKEND_COMPILE_OPTIONS="-march=native+fp16+fp16fml+dotprod+i8mm+sme2+sve2"'
- "-DMLLM_CPU_BACKEND_USE_SME2=ON"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

Verify SME2/SVE2 hardware support on target devices.

The addition of +sme2+sve2 to compile options and enabling MLLM_CPU_BACKEND_USE_SME2 assumes the target Apple Silicon processors support these ARMv9 extensions. While newer M-series chips do support these, ensure that:

  1. Minimum supported macOS version documentation is updated if needed
  2. Runtime CPU feature detection is in place to prevent illegal instruction errors on older Apple Silicon
  3. Build artifacts are tested on the oldest supported Apple Silicon hardware

Check if there's runtime CPU feature detection for SME2/SVE2:


🏁 Script executed:

#!/bin/bash
# Search for CPU feature detection related to SME2/SVE2

echo "Searching for SME2/SVE2 feature detection:"
rg -n "SME2|SVE2|__ARM_FEATURE_SME|__ARM_FEATURE_SVE" --type=cpp --type=hpp -C 3

echo -e "\n\nSearching for CPU capability checks:"
rg -n "cpuinfo|hwcaps|getauxval.*HWCAP.*SME|HWCAP.*SVE" --type=cpp --type=hpp -C 3

Length of output: 137


🏁 Script executed:

#!/bin/bash
# Search for SME2/SVE2 feature detection without file type restrictions

echo "Searching for SME2/SVE2 feature detection:"
rg -n "SME2|SVE2|__ARM_FEATURE_SME|__ARM_FEATURE_SVE" -C 3

echo -e "\n\nSearching for CPU capability checks:"
rg -n "cpuinfo|hwcaps|getauxval" -C 3

echo -e "\n\nSearching for runtime feature detection patterns:"
rg -n "cpu.*detect|feature.*detect|runtime.*check|hwcap" -i -C 2

echo -e "\n\nChecking build configuration files for minimum version:"
find . -name "*.yaml" -o -name "*.yml" -o -name "CMakeLists.txt" -o -name "*.cmake" | head -20 | xargs grep -l "apple\|silicon\|sme2\|sve2" -i 2>/dev/null | head -10

Length of output: 41464


🏁 Script executed:

#!/bin/bash
# Check for runtime CPU feature detection and conditional kernel dispatch

echo "Checking for kernel dispatch or conditional execution paths based on SME2:"
rg -n "mllm_cpu_has_sve|MLLM_HOST_FEATURE_SVE|MLLM_CPU_BACKEND_USE_SME2" -C 2 | head -100

echo -e "\n\nSearching for fallback or non-SME2 code paths:"
rg -n "if.*sme2|if.*sve2|else.*neon|fallback" -i -C 2 | head -80

echo -e "\n\nChecking for minimum macOS version specification:"
grep -r "MACOSX_DEPLOYMENT_TARGET\|deployment.target\|minimum.*version" --include="*.cmake" --include="*.yaml" --include="*.yml" 2>/dev/null | head -20

echo -e "\n\nLooking at CPUBackend initialization for runtime detection:"
rg -n "CPUBackend::" mllm/backends/cpu/CPUBackend.cpp -A 5 | head -80

Length of output: 15796


🏁 Script executed:

#!/bin/bash
# Final verification: check documentation and README for minimum version info

echo "Checking README files for Apple Silicon / macOS version info:"
find . -iname "README*" -type f | head -5 | xargs grep -i "apple\|macos\|m1\|m2\|m3\|silicon" 2>/dev/null | head -20

echo -e "\n\nChecking CHANGELOG or release notes:"
find . -iname "CHANGELOG*" -o -iname "RELEASE*" -o -iname "HISTORY*" | head -3 | xargs grep -i "apple\|m1\|sme2\|sve2" 2>/dev/null | head -20

echo -e "\n\nSearching codebase for any documentation on minimum supported Apple Silicon:"
rg -i "minimum.*macos\|minimum.*apple\|m1.*support\|deployment" --type=md --type=rst --type=txt | head -20

echo -e "\n\nChecking if there's a dispatch mechanism at runtime:"
rg -n "dispatch|select.*kernel|choose.*impl" mllm/backends/cpu/kernels/ | head -20

Length of output: 1308


Verify Apple Silicon hardware support requirements and implement runtime CPU feature detection.

The review concerns are valid. Verification confirms:

  1. No runtime CPU feature detection exists - mllm_cpu_has_sve() (gemm_aarch64.cpp:26-30) is compile-time only via __ARM_FEATURE_SVE preprocessor macro. No getauxval, hwcaps, or dynamic capability checks are present.

  2. No minimum version documentation - Build YAML files (tasks/build_osx_apple_silicon.yaml and others) unconditionally enable MLLM_CPU_BACKEND_USE_SME2=ON without version constraints. No README, CMakeLists.txt, or release notes specify minimum macOS/Apple Silicon version requirements.

  3. No fallback mechanism - The -march=native+sme2+sve2 compile flag causes the compiler to generate SME2/SVE2 instructions. Binaries built on M3/M4 will contain these instructions and crash with illegal instruction errors on M1/M2 hardware, which does not support these extensions.

Required fixes:

  • Add runtime CPU feature detection (e.g., via getauxval on compatible systems or macOS-specific APIs) to conditionally enable SME2/SVE2 kernels
  • Document minimum supported Apple Silicon version (likely M3+) in build files and README
  • Provide fallback NEON-only implementations or use conditional compilation guards
🤖 Prompt for AI Agents
In tasks/build_osx_apple_silicon.yaml around lines 7-8 the build unconditionally
enables SME2/SVE2 compile flags and the codebase lacks runtime CPU feature
detection or version documentation, which will produce illegal-instruction
crashes on older Apple Silicon (M1/M2). Remove or guard the hard-coded
'-march=...+sme2+sve2' and '-DMLLM_CPU_BACKEND_USE_SME2=ON' flags in the YAML
and implement runtime detection in the CPU backend (e.g., use macOS-specific
sysctl/host_info or getauxval/hwcaps where available) to choose SME2/SVE2
kernels only when supported; add a NEON-only fallback kernel path and
conditional build flags so builds can produce safe binaries on older chips; and
update the build YAML and README/CMakeLists to document the minimum supported
Apple Silicon family (e.g., M3+) and clearly state when SME2/SVE2 optimizations
are applied.

@chenghuaWang chenghuaWang merged commit c44ed80 into UbiquitousLearning:v2 Nov 20, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant