feat(cpu-backend): add support for SME2 and SVE2 in ARM backend configurations#533
Conversation
WalkthroughThis PR introduces SME2 (Scalable Matrix Extension 2) support for the ARM CPU backend, adds new f32-based KAI kernel variants with optimizations across NEON, dotprod, and i8mm instruction sets, updates multiple build configurations to enable SME2 and SVE2, improves ARM feature detection (IEEE FP16, SVE availability), and refactors tile-based parallelization in the KAI kernel dispatcher. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
mllm/backends/cpu/kernels/arm/linear/ggml/gemm_aarch64.cpp (1)
1033-1153: Fix SVE bias load instructionThe review comment correctly identifies a critical bug. The SVE implementation loads bias data using
ld1b(load bytes), but the scalar fallback shows bias is FP32:
- Line 1132 in scalar fallback:
const float* bias_ptr = (const float*)bias;casts bias tofloat*- Line 1136 reads 8 FP32 values:
sumf[j] = bias_ptr[x * ncols_interleaved + j];with ncols_interleaved=8- SVE advance of
#0x20(32 bytes) = 8 × 4-byte floats confirms FP32 layoutUsing
ld1bloads only 8 individual bytes per lane and reinterprets them as floating-point values, producing garbage. The fix should useld1wto load 8 FP32 words (32 bytes total), matching the scalar path.- "ld1b { z17.b }, p0/Z, [%x[bias_ptr]]\n" + "ld1w { z17.s }, p0/Z, [%x[bias_ptr]]\n" "fadd z31.s, p0/M, z31.s, z17.s\n" "add %x[bias_ptr], %x[bias_ptr], #0x20\n"CMakeLists.txt (1)
89-108: Update line 93 to match new SME2/SVE2 compilation options.Line 93 sets
MLLM_CPU_BACKEND_COMPILE_OPTIONSwith the old "+sme" flag for the py_mllm build path, while the new build configurations use "+sme2+sve2". This inconsistency means Python bindings on Apple will have different feature support than other builds.Apply this diff to align py_mllm compile options with the SME2/SVE2 updates:
if(MLLM_ENABLE_PY_MLLM) add_compile_definitions(MLLM_ENABLE_PY_MLLM=1) if(APPLE) set(MLLM_BUILD_ARM_BACKEND ON CACHE BOOL "Build ARM backend" FORCE) - set(MLLM_CPU_BACKEND_COMPILE_OPTIONS "-march=native+fp16+fp16fml+dotprod+i8mm+sme" CACHE STRING "CPU compile options" FORCE) + set(MLLM_CPU_BACKEND_COMPILE_OPTIONS "-march=native+fp16+fp16fml+dotprod+i8mm+sme2+sve2" CACHE STRING "CPU compile options" FORCE) + set(MLLM_CPU_BACKEND_USE_SME2 ON CACHE BOOL "Enable SME2" FORCE) set(MLLM_USE_BLAS ON CACHE BOOL "Use BLAS" FORCE)
🧹 Nitpick comments (5)
mllm/nn/Functional.hpp (3)
71-78: Pre‑reserve insplit(num, x, dim)is a good micro‑optimization
ret.reserve(num);avoids repeated reallocations when pushingnumtensors and is semantically safe given the existing loop.If there’s any chance
numcould be negative or disagree withoutputs.size(), consider guarding or asserting onnum > 0and usingret.reserve(outputs.size())(orstatic_cast<size_t>(num)) for additional robustness.
94-100: Same pre‑reserve optimization forchunk(num, x, dim)looks correctMirroring
split,ret.reserve(num);here also reduces allocations and matches the subsequent loop usage.Same as above, a small assertion on
num > 0(and possibly matchingoutputs.size()) would make this API more defensive against misuse without changing normal behavior.
6-7: Add explicit includes forstd::arrayandstd::numeric_limitsThis header uses
std::arrayandstd::numeric_limitsbut only includes<vector>and<cstdint>. Relying on transitive includes is brittle and may break with toolchain changes.Consider explicitly adding:
-#include <vector> -#include <cstdint> +#include <vector> +#include <array> +#include <cstdint> +#include <limits>to make the dependencies clear and compilation more robust.
Also applies to: 35-37, 124-127
mllm/backends/cpu/kernels/arm/linear/kai_sme.hpp (1)
5-43: Good KAI naming doc; consider guardingarm_neon.hincludeThe added terminology block is helpful, and the SME header scaffold is fine. One small portability nit:
<arm_neon.h>is included unconditionally, while in other places (e.g.,vec_dot.hpp) NEON headers are guarded by#ifdef __ARM_NEON. If this header ever gets pulled into a non‑ARM build, that unconditional include can break the build or static analysis.Consider:
-#include <arm_neon.h> +#if defined(__ARM_NEON) +#include <arm_neon.h> +#endifto mirror the pattern used elsewhere.
mllm/backends/cpu/kernels/arm/linear/kai.cpp (1)
414-440: Tile‑index–based parallelization looks correct; minor nitsPrecomputing
tile_splitsand parallelizing overtile_idxpreserves the original tiling semantics and should give better load balancing across both M and N; therun_matmularguments and offsets remain unchanged.Two minor cleanups you may want to consider:
- Reserve capacity for
tile_splitsto avoid reallocations:const int m_tiles = (M + m_step - 1) / m_step; const int n_tiles = (N + n_step - 1) / n_step; std::vector<std::pair<int, int>> tile_splits; tile_splits.reserve(m_tiles * n_tiles);
- Make the loop bound type explicit to match the macro’s induction type, e.g.:
const int tile_count = static_cast<int>(tile_splits.size()); MLLM_CONDITIONAL_PARALLEL_FOR(thread_count > 1, thread_count, tile_idx, 0, tile_count, 1, { ... });to avoid any implicit
size_t→intconversions inside the macro.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (25)
CMakeLists.txt(1 hunks)mllm/backends/cpu/CMakeLists.txt(2 hunks)mllm/backends/cpu/kernels/arm/linear/ggml/gemm_aarch64.cpp(2 hunks)mllm/backends/cpu/kernels/arm/linear/kai.cpp(1 hunks)mllm/backends/cpu/kernels/arm/linear/kai_sme.cpp(1 hunks)mllm/backends/cpu/kernels/arm/linear/kai_sme.hpp(1 hunks)mllm/backends/cpu/kernels/common/ggml/matmul.cpp(0 hunks)mllm/backends/cpu/kernels/common/ggml/vec_dot.hpp(2 hunks)mllm/backends/cpu/ops/LinearOp.cpp(1 hunks)mllm/backends/cpu/vendors/kleidiai(1 hunks)mllm/core/DataTypes.hpp(2 hunks)mllm/core/aops/LinearOp.hpp(3 hunks)mllm/nn/Functional.hpp(2 hunks)tasks/adb_push.yaml(1 hunks)tasks/build_android.yaml(1 hunks)tasks/build_android_debug.yaml(1 hunks)tasks/build_android_opencl.yaml(1 hunks)tasks/build_android_qnn.yaml(1 hunks)tasks/build_android_qnn_debug.yaml(1 hunks)tasks/build_osx_apple_silicon.yaml(1 hunks)tasks/build_osx_apple_silicon_accelerate.yaml(1 hunks)tasks/build_osx_apple_silicon_compile_stack.yml(1 hunks)tasks/build_osx_apple_silicon_dbg.yaml(1 hunks)tasks/build_osx_apple_silicon_perf.yaml(1 hunks)tasks/build_osx_cli.yaml(1 hunks)
💤 Files with no reviewable changes (1)
- mllm/backends/cpu/kernels/common/ggml/matmul.cpp
🧰 Additional context used
🪛 Clang (14.0.6)
mllm/backends/cpu/kernels/arm/linear/kai_sme.hpp
[error] 36-36: 'arm_neon.h' file not found
(clang-diagnostic-error)
mllm/backends/cpu/kernels/common/ggml/vec_dot.hpp
[error] 31-31: 'mllm/core/DataTypes.hpp' file not found
(clang-diagnostic-error)
🔇 Additional comments (25)
tasks/adb_push.yaml (2)
10-11: Clarify when these new binaries are built and available.The new binaries
mllm-qwen3-runnerandmllm-qwen3-servicehave been added to the push list. Are these built unconditionally for all ARM64 Android builds, or only under specific configuration flags (e.g., when building with SME2/SVE2 support)? This affects whether the ADB push task will fail if the binaries don't exist.
16-16: Verify the destination path change is intentional and won't break existing workflows.The destination path has been changed from
/data/local/tmp/mllm/bin/to/data/local/tmp/mllm-v2/bin/. This is a breaking change affecting all deployed binaries. While this may be intentional given the PR targets the v2 branch, ensure that:
- All dependent CI/CD pipelines, test scripts, and automation have been updated to use the new path.
- Existing v1 deployments (if any) won't be disrupted or confused by the path separation.
If this is intentional versioning separation, consider adding a comment to the YAML to explain the rationale.
mllm/core/DataTypes.hpp (2)
9-10: LGTM! Broader ARM FP16 detection.The addition of
__ARM_FP16_FORMAT_IEEEappropriately extends FP16 feature detection to include IEEE format support, enabling native FP16 on a wider range of ARM configurations.
22-23: LGTM! Consistent typedef logic.The typedef guard correctly mirrors the include guard from lines 9-10, ensuring consistent behavior when selecting between native
float16_tand thehalf_float::halffallback.mllm/backends/cpu/vendors/kleidiai (1)
1-1: Verify the submodule update contains the expected SME2 kernel implementations.The codebase shows strong integration of SME2 support: the root CMakeLists.txt adds
MLLM_CPU_BACKEND_USE_SME2option, andmllm/backends/cpu/CMakeLists.txtgates SME2-specific kernel sources from the kleidiai submodule (e.g.,kai_matmul_clamp_f32_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa_asm.S). This indicates the updated submodule commit should provide these SME2 assembly kernels.However, the sandbox cannot access the submodule commits directly to confirm. Please manually verify that the new kleidiai commit (84796ece210fbf736c2c51236f4690e0b8812861) includes:
- SME2-specific assembly implementations (files matching
*sme2_*pattern)- SVE2 optimizations if applicable
- KAI common SME support (
kai_common_sme_asm.S)mllm/backends/cpu/kernels/arm/linear/kai_sme.cpp (1)
4-6: SME2 scaffold TU looks fineThis is a harmless placeholder that wires the SME header into a TU; no correctness concerns for now.
tasks/build_android_opencl.yaml (1)
6-16: Android OpenCL config flag is consistentAdding
-DMLLM_KERNEL_USE_THREADS_VENDOR_MLLM=OFFaligns this task with the other Android builds in the PR and is safe from a build/config perspective.tasks/build_android_debug.yaml (1)
6-15: Debug Android config updated consistentlyDisabling
MLLM_KERNEL_USE_THREADS_VENDOR_MLLMhere mirrors the other Android tasks; no issues from a build/config standpoint.tasks/build_android_qnn_debug.yaml (1)
12-21: QNN debug task threading flag is coherentSetting
MLLM_KERNEL_USE_THREADS_VENDOR_MLLM=OFFhere keeps QNN debug builds aligned with the other Android variants and avoids mixing thread vendors.mllm/backends/cpu/kernels/arm/linear/ggml/gemm_aarch64.cpp (1)
17-32: SVE include and feature helper are reasonableThe new
#ifdef __ARM_FEATURE_SVE+#include <arm_sve.h>andmllm_cpu_has_sve()helper scoped in the anonymous namespace are straightforward and keep SVE feature checks local to this TU. No functional concerns here.mllm/backends/cpu/kernels/common/ggml/vec_dot.hpp (1)
33-45: SVE detection andmllm_cpu_get_sve_cnthelper look saneRestricting
<sys/prctl.h>to__linux__ && __ARM_FEATURE_SVEand makingmllm_cpu_get_sve_cnt()static inlineavoids non‑Linux build breakage and ODR issues. Returning 128 on Apple Silicon and usingPR_SVE_GET_VL & PR_SVE_VL_LEN_MASKon Linux aligns with typical SVE VL detection patterns.Also applies to: 335-346
mllm/core/aops/LinearOp.hpp (3)
35-41: LGTM! Well-organized enum additions.The new f32 KaiLinear variants are clearly grouped and follow consistent naming conventions. The comment on line 35 provides helpful context for the grouping.
83-97: LGTM! String-to-enum mappings are complete and consistent.All six new variants have corresponding string mappings that follow the established naming pattern.
135-149: LGTM! Enum-to-string mappings are bidirectionally consistent.The reverse mappings correctly mirror the forward mappings from lines 83-97.
tasks/build_android.yaml (1)
15-15: LGTM! Threading vendor configuration is explicit and consistent.The addition of
-DMLLM_KERNEL_USE_THREADS_VENDOR_MLLM=OFFexplicitly disables the MLLM thread vendor in favor of OpenMP (line 14), making the threading configuration clear and intentional.tasks/build_osx_apple_silicon_perf.yaml (1)
7-8: LGTM! Consistent SME2/SVE2 enablement across build variants.The changes mirror those in
build_osx_apple_silicon.yaml, maintaining consistency across build configurations. The same verification regarding hardware support applies.tasks/build_android_qnn.yaml (1)
22-22: LGTM! Consistent threading configuration for Android QNN builds.The threading vendor configuration matches the pattern established in other Android build configurations (
build_android.yaml, etc.), ensuring consistency across Android build variants.tasks/build_osx_apple_silicon_accelerate.yaml (1)
7-8: LGTM! SME2/SVE2 support for Accelerate framework build.Consistent enablement of SME2/SVE2 across all OSX Apple Silicon build variants. The Accelerate framework build also benefits from these ARMv9 extensions.
tasks/build_osx_cli.yaml (1)
10-11: LGTM! SME2/SVE2 support for CLI and SDK builds.The CLI and C SDK builds consistently enable SME2/SVE2 support, ensuring users benefit from these optimizations across all distribution formats.
tasks/build_osx_apple_silicon_dbg.yaml (1)
7-8: ✓ SME2/SVE2 enablement in debug build is consistent.tasks/build_osx_apple_silicon_compile_stack.yml (1)
7-8: ✓ Consistent with parallel build configuration changes.CMakeLists.txt (1)
42-43: ✓ New SME2 CMake option correctly declared and placed.mllm/backends/cpu/CMakeLists.txt (3)
25-27: ✓ Include directories for f32_qsi8d32p_qai4c32p properly added.
81-92: ✓ SME2-specific sources are correctly gated by the feature flag.
100-100: Verify that unconditional NEON sources (lines 70–78) are intentional.Lines 70–78 add new
f32_qsi8d32p_qai4c32pNEON kernel variants that are always compiled, independent of theMLLM_CPU_BACKEND_USE_SME2flag. Confirm this is intentional: these should be available as baseline optimizations even when SME2 is disabled.
| # SME Related | ||
| ${kleidiai_SOURCE_DIR}/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p1x4_qai4c32p4x4_1x4_neon_dotprod_asm.S | ||
| ${kleidiai_SOURCE_DIR}/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p1x4_qai4c32p4x4_1x4_neon_dotprod.c | ||
| ${kleidiai_SOURCE_DIR}/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p1x8_qai4c32p4x8_1x4_neon_dotprod_asm.S | ||
| ${kleidiai_SOURCE_DIR}/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p1x8_qai4c32p4x8_1x4_neon_dotprod.c | ||
| ${kleidiai_SOURCE_DIR}/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p4x4_qai4c32p4x4_8x4_neon_dotprod_asm.S | ||
| ${kleidiai_SOURCE_DIR}/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p4x4_qai4c32p4x4_8x4_neon_dotprod.c | ||
| ${kleidiai_SOURCE_DIR}/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p4x8_qai4c32p4x8_8x4_neon_i8mm_asm.S | ||
| ${kleidiai_SOURCE_DIR}/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qai4c32p/kai_matmul_clamp_f32_qsi8d32p4x8_qai4c32p4x8_8x4_neon_i8mm.c |
There was a problem hiding this comment.
Clarify the comment: these are NEON variants, not SME2-specific.
The comment "SME Related" at line 70 is misleading. Lines 70-78 add NEON-based kernels (dotprod and i8mm variants) for f32_qsi8d32p_qai4c32p, which are unconditionally compiled as baseline optimizations. These should not be labeled as "SME Related" since the actual SME2-specific variants are gated separately (lines 84–88).
Update the comment to clarify these are non-SME2 NEON optimizations:
- # SME Related
+ # f32_qsi8d32p_qai4c32p NEON variants (dotprod, i8mm)🤖 Prompt for AI Agents
mllm/backends/cpu/CMakeLists.txt lines 70-78: the comment "SME Related" is
misleading because the listed source files are NEON-based kernel variants
(dotprod and i8mm) compiled as baseline optimizations, not SME2-specific; change
the comment to something like "NEON baseline variants (non-SME2)" or
"NEON-related (baseline optimizations)" to clearly indicate these are NEON
variants and not SME/SME2 gated files.
| case aops::LinearImplTypes::kKaiLinear_f32_qai8dxp_qsi4c32p_mxk_nxk_qai8dxp1x4_qsi4c32p4x4_1x4: | ||
| case aops::LinearImplTypes::kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa: | ||
| case aops::LinearImplTypes::kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x4_qai4c32p4vlx4_1x4vl_sme2_dot: | ||
| case aops::LinearImplTypes::kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x4_qai4c32p4x4_1x4_neon_dotprod: | ||
| case aops::LinearImplTypes::kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x8_qai4c32p4x8_1x4_neon_dotprod: | ||
| case aops::LinearImplTypes::kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p4x4_qai4c32p4x4_8x4_neon_dotprod: | ||
| case aops::LinearImplTypes::kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p4x8_qai4c32p4x8_8x4_neon_i8mm: { |
There was a problem hiding this comment.
🧩 Analysis chain
Critical: Missing forward() implementations for new LinearImplTypes.
The new KaiLinear f32 variants added in lines 345-350 are handled in reshape() but have no corresponding implementations in the forward() method (lines 99-314). If any of these implementation types are selected:
kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopakKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x4_qai4c32p4vlx4_1x4vl_sme2_dotkKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x4_qai4c32p4x4_1x4_neon_dotprodkKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x8_qai4c32p4x8_1x4_neon_dotprodkKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p4x4_qai4c32p4x4_8x4_neon_dotprodkKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p4x8_qai4c32p4x8_8x4_neon_i8mm
The operation will fail in forward() with "LinearImplTypes not supported" error (line 310).
You need to add corresponding case implementations in the forward() method similar to the existing KaiLinear cases (e.g., lines 166-279), or remove these enum values from the reshape() switch if the implementations are not ready.
Run the following script to check if there are kernel implementations for these new types elsewhere in the codebase:
🏁 Script executed:
#!/bin/bash
# Search for potential kernel implementations for the new SME2/SVE2 variants
echo "Searching for SME2 mopa implementations:"
rg -n "qsi8d32p1vlx4.*qai4c32p4vlx4.*1vlx4vl.*sme2.*mopa" --type=cpp --type=hpp -C 3
echo -e "\n\nSearching for SME2 dot implementations:"
rg -n "qsi8d32p1x4.*qai4c32p4vlx4.*1x4vl.*sme2.*dot" --type=cpp --type=hpp -C 3
echo -e "\n\nSearching for NEON dotprod implementations:"
rg -n "qsi8d32p.*qai4c32p.*neon.*dotprod" --type=cpp --type=hpp -C 3
echo -e "\n\nSearching for NEON i8mm implementations:"
rg -n "qsi8d32p4x8.*qai4c32p4x8.*8x4.*neon.*i8mm" --type=cpp --type=hpp -C 3Length of output: 283
Let me verify this issue by examining the LinearOp.cpp file more closely with corrected search parameters:
🏁 Script executed:
#!/bin/bash
# Search for the forward() and reshape() methods in LinearOp.cpp
echo "=== Checking for forward() method and its case handling ==="
rg -n "void forward\(" mllm/backends/cpu/ops/LinearOp.cpp -A 250 | head -200
echo -e "\n\n=== Checking for reshape() method and its case handling ==="
rg -n "void reshape\(" mllm/backends/cpu/ops/LinearOp.cpp -A 300 | tail -100
echo -e "\n\n=== Searching for the specific new variant names in LinearOp.cpp ==="
rg -n "kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa" mllm/backends/cpu/ops/LinearOp.cpp
rg -n "kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x4_qai4c32p4vlx4_1x4vl_sme2_dot" mllm/backends/cpu/ops/LinearOp.cpp
echo -e "\n\n=== Check for default case or error handling in forward() ==="
rg -n "LinearImplTypes not supported" mllm/backends/cpu/ops/LinearOp.cpp -B 5 -A 2Length of output: 654
🏁 Script executed:
#!/bin/bash
# View the sections around forward() and reshape() to see full context
echo "=== Lines 99-115 (start of forward() area) ==="
sed -n '99,115p' mllm/backends/cpu/ops/LinearOp.cpp
echo -e "\n=== Lines 300-320 (end of forward() area with default case) ==="
sed -n '300,320p' mllm/backends/cpu/ops/LinearOp.cpp
echo -e "\n=== Lines 340-360 (reshape() with new variants) ==="
sed -n '340,360p' mllm/backends/cpu/ops/LinearOp.cpp
echo -e "\n=== Search for all case statements in forward() area (lines 99-314) ==="
sed -n '99,314p' mllm/backends/cpu/ops/LinearOp.cpp | rg -n "case aops::LinearImplTypes::" | tail -20
echo -e "\n=== Count case statements in reshape() area (lines 315-400) ==="
sed -n '315,400p' mllm/backends/cpu/ops/LinearOp.cpp | rg -n "case aops::LinearImplTypes::" | head -15Length of output: 5408
Critical: Missing forward() implementations for new LinearImplTypes confirmed.
The six new KaiLinear f32 variants at lines 345-350 are present in reshape() but have no corresponding case implementations in the forward() method (lines 99-314). When any of these types are selected, execution reaches the default case at line 310, triggering NYI("LinearImplTypes not supported"):
kKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopakKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x4_qai4c32p4vlx4_1x4vl_sme2_dotkKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x4_qai4c32p4x4_1x4_neon_dotprodkKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p1x8_qai4c32p4x8_1x4_neon_dotprodkKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p4x4_qai4c32p4x4_8x4_neon_dotprodkKaiLinear_f32_qsi8d32p_qai4c32p_mxk_nxk_qsi8d32p4x8_qai4c32p4x8_8x4_neon_i8mm
Add case implementations in forward() matching the pattern of existing KaiLinear cases (lines 164-182), or remove these enum values from reshape() if implementations are not ready.
| - '-DMLLM_CPU_BACKEND_COMPILE_OPTIONS="-march=native+fp16+fp16fml+dotprod+i8mm+sme2+sve2"' | ||
| - "-DMLLM_CPU_BACKEND_USE_SME2=ON" |
There was a problem hiding this comment.
🧩 Analysis chain
Verify SME2/SVE2 hardware support on target devices.
The addition of +sme2+sve2 to compile options and enabling MLLM_CPU_BACKEND_USE_SME2 assumes the target Apple Silicon processors support these ARMv9 extensions. While newer M-series chips do support these, ensure that:
- Minimum supported macOS version documentation is updated if needed
- Runtime CPU feature detection is in place to prevent illegal instruction errors on older Apple Silicon
- Build artifacts are tested on the oldest supported Apple Silicon hardware
Check if there's runtime CPU feature detection for SME2/SVE2:
🏁 Script executed:
#!/bin/bash
# Search for CPU feature detection related to SME2/SVE2
echo "Searching for SME2/SVE2 feature detection:"
rg -n "SME2|SVE2|__ARM_FEATURE_SME|__ARM_FEATURE_SVE" --type=cpp --type=hpp -C 3
echo -e "\n\nSearching for CPU capability checks:"
rg -n "cpuinfo|hwcaps|getauxval.*HWCAP.*SME|HWCAP.*SVE" --type=cpp --type=hpp -C 3Length of output: 137
🏁 Script executed:
#!/bin/bash
# Search for SME2/SVE2 feature detection without file type restrictions
echo "Searching for SME2/SVE2 feature detection:"
rg -n "SME2|SVE2|__ARM_FEATURE_SME|__ARM_FEATURE_SVE" -C 3
echo -e "\n\nSearching for CPU capability checks:"
rg -n "cpuinfo|hwcaps|getauxval" -C 3
echo -e "\n\nSearching for runtime feature detection patterns:"
rg -n "cpu.*detect|feature.*detect|runtime.*check|hwcap" -i -C 2
echo -e "\n\nChecking build configuration files for minimum version:"
find . -name "*.yaml" -o -name "*.yml" -o -name "CMakeLists.txt" -o -name "*.cmake" | head -20 | xargs grep -l "apple\|silicon\|sme2\|sve2" -i 2>/dev/null | head -10Length of output: 41464
🏁 Script executed:
#!/bin/bash
# Check for runtime CPU feature detection and conditional kernel dispatch
echo "Checking for kernel dispatch or conditional execution paths based on SME2:"
rg -n "mllm_cpu_has_sve|MLLM_HOST_FEATURE_SVE|MLLM_CPU_BACKEND_USE_SME2" -C 2 | head -100
echo -e "\n\nSearching for fallback or non-SME2 code paths:"
rg -n "if.*sme2|if.*sve2|else.*neon|fallback" -i -C 2 | head -80
echo -e "\n\nChecking for minimum macOS version specification:"
grep -r "MACOSX_DEPLOYMENT_TARGET\|deployment.target\|minimum.*version" --include="*.cmake" --include="*.yaml" --include="*.yml" 2>/dev/null | head -20
echo -e "\n\nLooking at CPUBackend initialization for runtime detection:"
rg -n "CPUBackend::" mllm/backends/cpu/CPUBackend.cpp -A 5 | head -80Length of output: 15796
🏁 Script executed:
#!/bin/bash
# Final verification: check documentation and README for minimum version info
echo "Checking README files for Apple Silicon / macOS version info:"
find . -iname "README*" -type f | head -5 | xargs grep -i "apple\|macos\|m1\|m2\|m3\|silicon" 2>/dev/null | head -20
echo -e "\n\nChecking CHANGELOG or release notes:"
find . -iname "CHANGELOG*" -o -iname "RELEASE*" -o -iname "HISTORY*" | head -3 | xargs grep -i "apple\|m1\|sme2\|sve2" 2>/dev/null | head -20
echo -e "\n\nSearching codebase for any documentation on minimum supported Apple Silicon:"
rg -i "minimum.*macos\|minimum.*apple\|m1.*support\|deployment" --type=md --type=rst --type=txt | head -20
echo -e "\n\nChecking if there's a dispatch mechanism at runtime:"
rg -n "dispatch|select.*kernel|choose.*impl" mllm/backends/cpu/kernels/ | head -20Length of output: 1308
Verify Apple Silicon hardware support requirements and implement runtime CPU feature detection.
The review concerns are valid. Verification confirms:
-
No runtime CPU feature detection exists -
mllm_cpu_has_sve()(gemm_aarch64.cpp:26-30) is compile-time only via__ARM_FEATURE_SVEpreprocessor macro. No getauxval, hwcaps, or dynamic capability checks are present. -
No minimum version documentation - Build YAML files (tasks/build_osx_apple_silicon.yaml and others) unconditionally enable
MLLM_CPU_BACKEND_USE_SME2=ONwithout version constraints. No README, CMakeLists.txt, or release notes specify minimum macOS/Apple Silicon version requirements. -
No fallback mechanism - The
-march=native+sme2+sve2compile flag causes the compiler to generate SME2/SVE2 instructions. Binaries built on M3/M4 will contain these instructions and crash with illegal instruction errors on M1/M2 hardware, which does not support these extensions.
Required fixes:
- Add runtime CPU feature detection (e.g., via
getauxvalon compatible systems or macOS-specific APIs) to conditionally enable SME2/SVE2 kernels - Document minimum supported Apple Silicon version (likely M3+) in build files and README
- Provide fallback NEON-only implementations or use conditional compilation guards
🤖 Prompt for AI Agents
In tasks/build_osx_apple_silicon.yaml around lines 7-8 the build unconditionally
enables SME2/SVE2 compile flags and the codebase lacks runtime CPU feature
detection or version documentation, which will produce illegal-instruction
crashes on older Apple Silicon (M1/M2). Remove or guard the hard-coded
'-march=...+sme2+sve2' and '-DMLLM_CPU_BACKEND_USE_SME2=ON' flags in the YAML
and implement runtime detection in the CPU backend (e.g., use macOS-specific
sysctl/host_info or getauxval/hwcaps where available) to choose SME2/SVE2
kernels only when supported; add a NEON-only fallback kernel path and
conditional build flags so builds can produce safe binaries on older chips; and
update the build YAML and README/CMakeLists to document the minimum supported
Apple Silicon family (e.g., M3+) and clearly state when SME2/SVE2 optimizations
are applied.
Summary by CodeRabbit
New Features
Build & Infrastructure
✏️ Tip: You can customize this high-level summary in your review settings.