
feat(mllm_kernel): add initial implementation of mllm-kernel with CPU and JIT utilities #634

Merged

chenghuaWang merged 12 commits into UbiquitousLearning:main from chenghuaWang:wch-main on Feb 16, 2026

Conversation

@chenghuaWang (Collaborator) commented Feb 3, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added JIT kernel compilation support for CPU and CUDA with TVM integration.
    • Introduced add_constant kernel with Highway SIMD optimization.
    • Added support for SM8845 Qualcomm chipset.
    • Kernel launch utilities for GPU execution with error handling.
  • Documentation

    • Added project documentation and build configuration guide.

@coderabbitai bot (Contributor) commented Feb 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

This PR introduces a comprehensive mllm-kernel package with JIT compilation infrastructure for CPU and CUDA kernels using TVM-FFI, Highway SIMD integration, C++ utility headers for tensor matching and device operations, CMake build configuration, and example constant-add kernels. Additionally, it adds SM8845 chipset support to the QNN backend and updates documentation references.

Changes

| Cohort / File(s) | Summary |
|---|---|
| mllm-kernel JIT Compilation Core<br>mllm-kernel/mllm_kernel/jit_utils/compile.py, mllm-kernel/mllm_kernel/jit_utils/__init__.py | Introduces JIT kernel loading utilities with caching decorator, path/include discovery, CPU architecture flags, template argument generation, and TVM-FFI wrappers for inline C++/CUDA compilation via load_cpu_jit() and load_cuda_jit(). Exports path constants and compiler flags. |
| CPU Kernel Implementation<br>mllm-kernel/mllm_kernel/cpu/__init__.py, mllm-kernel/mllm_kernel/cpu/jit/... | Implements CPU-accelerated constant-add kernels with Highway SIMD support; includes compile-time templated and runtime variants with input validation, contiguity handling, and cached JIT module factories. |
| C++ Tensor & Device Utilities<br>mllm-kernel/include/mllm_kernel/tensor.hpp, mllm-kernel/include/mllm_kernel/utils.hpp, mllm-kernel/include/mllm_kernel/utils.cuh | Provides host-side tensor metadata matching with symbolic shape/dtype/device binding, fluent TensorMatcher verification API, CUDA kernel launch wrapper with PDL support, pointer arithmetic helpers, and runtime error handling infrastructure. |
| mllm-kernel Build System<br>mllm-kernel/CMakeLists.txt, mllm-kernel/cmake/..., mllm-kernel/cmake/MllmKernelConfig.cmake.in, mllm-kernel/pyproject.toml | Configures CMake project with CPU/CUDA/Ascend build options, integrates Highway via CPM, installs kernel targets and headers, generates package config files, and defines Python build metadata with scikit-build-core backend. |
| Source Location & Formatting<br>mllm-kernel/include/mllm_kernel/source_location.hpp, mllm-kernel/.clang-format, mllm-kernel/.gitignore | Adds C++ source_location compatibility layer (fallback for older standards), C++ formatting rules, and build artifact ignores. |
| Package Initialization<br>mllm-kernel/mllm_kernel/__init__.py, mllm-kernel/mllm_kernel/__main__.py | Exports package version and submodules; provides CLI tool to generate .clangd configuration with CUDA detection and compute capability queries. |
| CPU Kernel Sources<br>mllm-kernel/mllm_kernel/cpu/csrc/add_constant.cpp | Defines TVM-FFI kernel entry points for compile-time and runtime constant addition (implementation incomplete). |
| Documentation & Examples<br>mllm-kernel/README.md, README.md, README-ZH.md, examples/qwen3_qnn_aot/config_1.7B.json | Adds mllm-kernel project README with features and usage; updates model table references in main READMEs to include W4A16-SM8650 entry; updates model config parameter. |
| QNN Backend Enhancement<br>mllm/backends/qnn/aot/QnnTargetMachine.hpp, mllm/backends/qnn/aot/QnnTargetMachineParser.cpp | Adds SM8845 chipset (value 97) to enum with parser support; retains SM8850 with documentation clarification. |

Sequence Diagram(s)

sequenceDiagram
    actor User as Python User
    participant add_constant.py as add_constant.py
    participant JIT as JIT Cache<br/>(compile.py)
    participant TVM as TVM-FFI
    participant Compiler as C++ Compiler

    User->>add_constant.py: add_constant(src_tensor, const=4)
    add_constant.py->>add_constant.py: validate(dtype=float32, constant∈[1,2,4,8,16])
    add_constant.py->>add_constant.py: ensure contiguous
    add_constant.py->>JIT: _jit_add_constant_module(const=4)
    alt Module cached
        JIT-->>add_constant.py: return cached Module
    else First load
        JIT->>TVM: load_cpu_jit(add_constant.cpp, wrapper_name=...)
        TVM->>Compiler: compile with Highway includes & flags
        Compiler-->>TVM: compiled shared object
        TVM-->>JIT: Module
        JIT-->>add_constant.py: Module
    end
    add_constant.py->>TVM: module.kernel(dst, src)
    TVM-->>add_constant.py: execution complete
    add_constant.py-->>User: return dst_tensor
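
As a concrete illustration of the flow above, a minimal caller-side sketch (the `mllm_kernel.cpu.jit.add_constant` entry point is taken from the excerpts quoted later in this review; the snippet itself is not part of the PR diff):

```python
import torch
from mllm_kernel.cpu.jit import add_constant

x = torch.randn(1024, dtype=torch.float32)
y = add_constant(x, 4)   # first call: load_cpu_jit() compiles add_constant.cpp via TVM-FFI
z = add_constant(x, 4)   # same constant: the cached module is reused, no recompilation
torch.testing.assert_close(y, x + 4)
```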

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~55 minutes

Suggested reviewers

  • oreomaker
  • liang1232018
  • yirongjie

Poem

🐰 A kernel package hops to life,
With JIT threads and SIMD might,
TVM compiles what we define,
Highway speeds through each design,
Constants add with graceful flight! 🚀

🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The pull request has no description provided by the author, while the repository template expects a detailed summary with reference to contribution guidelines. | Please add a comprehensive pull request description explaining the purpose, implementation details, and testing of the mllm-kernel module. Reference the contribution guidelines in your description. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 18.40%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main change: introducing the initial mllm-kernel module with CPU support and JIT utilities for kernel compilation and loading. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@mllm-kernel/jit_utils/compile.py`:
- Around line 30-37: The current memoization in wrapper builds a cache key from
(args, tuple(sorted(kwargs.items()))) which raises TypeError if any arg is
unhashable; either document this limitation near result_map/wrapper (noting
_resolve_kernel_path currently has no args) or make key creation robust: inside
wrapper, attempt to build the original key and, on TypeError, fall back to a
stable string-based key (e.g., canonical repr/json/pickle of args+kwargs or
hashing that) so lookups into result_map never raise; ensure you still use fn
and result_map unchanged otherwise and keep behavior identical for hashable
inputs.
- Around line 70-83: The dtype mapping MLLM_KERNEL_TEMPLATE_DTYPE_MAP is missing
an entry for torch.float64 (double) which will cause a KeyError in make_cpp_args
when double-precision tensors are used; add the mapping torch.float64: "fp64_t"
to MLLM_KERNEL_TEMPLATE_DTYPE_MAP and update any tests or usages of
make_cpp_args to ensure "fp64_t" is accepted by downstream template generation
(search for MLLM_KERNEL_TEMPLATE_DTYPE_MAP and make_cpp_args to apply the
change).
- Around line 160-171: MLLM_KERNEL_DEFAULT_INCLUDE_DIRS is defined as a
pathlib.Path (via MLLM_KERNEL_CPU_INCLUDE_DIR / MLLM_KERNEL_CUDA_INCLUDE_DIR)
but passed to load_inline and concatenated with extra_include_paths (a list),
causing a TypeError; convert the default include(s) to a list/sequence of
strings before concatenation (e.g., map Path -> str and wrap in list/tuple) so
that the call in load_inline
(extra_include_paths=MLLM_KERNEL_DEFAULT_INCLUDE_DIRS + extra_include_paths)
concatenates two compatible sequences; update the code that sets
MLLM_KERNEL_DEFAULT_INCLUDE_DIRS or transform it inline prior to the load_inline
call and keep references to MLLM_KERNEL_DEFAULT_INCLUDE_DIRS,
MLLM_KERNEL_CPU_INCLUDE_DIR, MLLM_KERNEL_CUDA_INCLUDE_DIR, and the load_inline
invocation to locate where to change.
🧹 Nitpick comments (1)
mllm-kernel/jit_utils/compile.py (1)

94-95: KeyError on unmapped torch.dtype will have an unclear error message.

If a torch.dtype not in MLLM_KERNEL_TEMPLATE_DTYPE_MAP is passed (e.g., torch.float64, torch.complex64), line 95 raises a KeyError rather than the informative TypeError on line 96-98.

♻️ Suggested fix
         if isinstance(arg, torch.dtype):
-            return MLLM_KERNEL_TEMPLATE_DTYPE_MAP[arg]
+            if arg not in MLLM_KERNEL_TEMPLATE_DTYPE_MAP:
+                raise TypeError(
+                    f"Unsupported torch.dtype for mllm kernel template: {arg}"
+                )
+            return MLLM_KERNEL_TEMPLATE_DTYPE_MAP[arg]
         raise TypeError(

Comment on lines +30 to +37
result_map = {}

@functools.wraps(fn)
def wrapper(*args, **kwargs):
    key = (args, tuple(sorted(kwargs.items(), key=lambda x: x[0])))
    if key not in result_map:
        result_map[key] = fn(*args, **kwargs)
    return result_map[key]

⚠️ Potential issue | 🟡 Minor

Potential TypeError if arguments contain unhashable types.

The cache key is built as (args, tuple(sorted(kwargs.items()))). If any argument in args is unhashable (e.g., a list or dict), this will raise a TypeError when checking key not in result_map.

Currently, _resolve_kernel_path() is the only decorated function and takes no arguments, so this is safe for now. Consider adding a comment documenting this limitation, or handle unhashable args gracefully if broader use is anticipated.

📝 Suggested documentation
 def cache_once(fn: F) -> F:
     """
     NOTE: `functools.lru_cache` is not compatible with `torch.compile`
     So we manually implement a simple cache_once decorator to replace it.
+
+    NOTE: All arguments must be hashable. This decorator will raise TypeError
+    if called with unhashable arguments (e.g., lists, dicts).
     """
🤖 Prompt for AI Agents
In `@mllm-kernel/jit_utils/compile.py` around lines 30 - 37, The current
memoization in wrapper builds a cache key from (args,
tuple(sorted(kwargs.items()))) which raises TypeError if any arg is unhashable;
either document this limitation near result_map/wrapper (noting
_resolve_kernel_path currently has no args) or make key creation robust: inside
wrapper, attempt to build the original key and, on TypeError, fall back to a
stable string-based key (e.g., canonical repr/json/pickle of args+kwargs or
hashing that) so lookups into result_map never raise; ensure you still use fn
and result_map unchanged otherwise and keep behavior identical for hashable
inputs.
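
One way the fallback described above could look, as a sketch rather than the PR's code (the repr-based key is just one possible stable fallback):

```python
import functools

def cache_once(fn):
    """Memoize fn once per argument set; unhashable arguments fall back to a repr-based key."""
    result_map = {}

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            key = (args, tuple(sorted(kwargs.items(), key=lambda x: x[0])))
            hash(key)  # force the TypeError here rather than inside the dict lookup
        except TypeError:
            key = repr((args, sorted(kwargs.items(), key=lambda x: x[0])))
        if key not in result_map:
            result_map[key] = fn(*args, **kwargs)
        return result_map[key]

    return wrapper
```

Behavior for hashable arguments is unchanged; only calls with unhashable arguments pay the repr cost.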

Comment on lines +70 to +83
MLLM_KERNEL_TEMPLATE_DTYPE_MAP: dict[torch.dtype, str] = {
    torch.float: "fp32_t",
    torch.float16: "fp16_t",
    torch.bfloat16: "bfp16_t",
    torch.int8: "int8_t",
    torch.int16: "int16_t",
    torch.int32: "int32_t",
    torch.int64: "int64_t",
    torch.uint8: "uint8_t",
    torch.uint16: "uint16_t",
    torch.uint32: "uint32_t",
    torch.uint64: "uint64_t",
    torch.bool: "bool_t",
}

⚠️ Potential issue | 🟡 Minor

Missing torch.float64 mapping.

The dtype map includes torch.float (fp32) but omits torch.float64 (double precision). If double-precision kernels are expected, this will cause a KeyError in make_cpp_args.

🔧 Suggested fix
 MLLM_KERNEL_TEMPLATE_DTYPE_MAP: dict[torch.dtype, str] = {
     torch.float: "fp32_t",
+    torch.float64: "fp64_t",
     torch.float16: "fp16_t",
     torch.bfloat16: "bfp16_t",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 MLLM_KERNEL_TEMPLATE_DTYPE_MAP: dict[torch.dtype, str] = {
     torch.float: "fp32_t",
+    torch.float64: "fp64_t",
     torch.float16: "fp16_t",
     torch.bfloat16: "bfp16_t",
     torch.int8: "int8_t",
     torch.int16: "int16_t",
     torch.int32: "int32_t",
     torch.int64: "int64_t",
     torch.uint8: "uint8_t",
     torch.uint16: "uint16_t",
     torch.uint32: "uint32_t",
     torch.uint64: "uint64_t",
     torch.bool: "bool_t",
 }
🤖 Prompt for AI Agents
In `@mllm-kernel/jit_utils/compile.py` around lines 70 - 83, The dtype mapping
MLLM_KERNEL_TEMPLATE_DTYPE_MAP is missing an entry for torch.float64 (double)
which will cause a KeyError in make_cpp_args when double-precision tensors are
used; add the mapping torch.float64: "fp64_t" to MLLM_KERNEL_TEMPLATE_DTYPE_MAP
and update any tests or usages of make_cpp_args to ensure "fp64_t" is accepted
by downstream template generation (search for MLLM_KERNEL_TEMPLATE_DTYPE_MAP and
make_cpp_args to apply the change).

Comment thread mllm-kernel/mllm_kernel/jit_utils/compile.py Outdated
…mentation and reorganize jit_utils structure
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@mllm-kernel/mllm_kernel/jit_utils/compile.py`:
- Around line 64-66: The placeholder function
_resolve_cpu_simd_features_to_cxx_flags currently contains only a pass and thus
returns None despite its annotation list[str]; replace the pass with an explicit
return of an empty list (return []) so callers can safely concatenate or iterate
the result (alternatively raise NotImplementedError if you want to make it
explicitly unimplemented), ensuring the function matches its declared return
type.

Comment thread mllm-kernel/mllm_kernel/jit_utils/compile.py Outdated
… configuration

- Added support for the SM8845 chipset in QnnTargetMachine.hpp and QnnTargetMachineParser.cpp.
- Introduced CMakeLists.txt for building the mllm-kernel project with options for CPU, CUDA, and Ascend kernels.
- Removed outdated pyproject-cpu.toml and pyproject-cuda.toml files, and added a new pyproject.toml for unified package configuration.
- Updated README.md with installation instructions and project structure.
- Added JIT compilation utilities and initial CPU kernel implementations for adding constants using Highway SIMD.
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@mllm-kernel/mllm_kernel/cpu/csrc/add_constant.cpp`:
- Around line 16-29: The functions add_constant (template<int Constant>) and
add_constant_runtime currently compute pointers and size but never write to dst;
implement element-wise addition writing to dst_ptr: iterate i from 0 to n-1 and
set dst_ptr[i] = src_ptr[i] + (Constant for add_constant or the float parameter
constant for add_constant_runtime), ensuring you use the already-obtained float*
dst_ptr, const float* src_ptr and size_t n so outputs are populated.

In `@mllm-kernel/mllm_kernel/cpu/jit/add_constant.py`:
- Around line 55-121: Both add_constant and add_constant_runtime must validate
that src is on the CPU to avoid passing GPU addresses to CPU JIT code; add a
device check at the start of each function (e.g., assert or raise TypeError if
src.device.type != "cpu") before any tensor operations or calls into
_jit_add_constant_module/_jit_add_constant_runtime_module, and ensure dst is
created on the same CPU device (torch.empty_like(src) is fine once src is
confirmed CPU). Ensure the error message mentions the function name
(add_constant or add_constant_runtime) and that only CPU tensors are accepted.

In `@mllm/backends/qnn/aot/QnnTargetMachine.hpp`:
- Around line 29-30: Add the missing Python FFI wrapper for the new enum value
by adding a static method SM8845() to the QcomChipset class that mirrors the
existing SM8850() wrapper; implement it to return
tvm_ffi.get_global_func("mllm.qualcomm.QcomChipset.SM8845")(), ensuring the
method name is exactly SM8845 and placed alongside the other chipset wrappers in
pymllm/ffi/__init__.py so the Python API matches the C++ enum.
🧹 Nitpick comments (10)
mllm-kernel/mllm_kernel/cpu/include/mllm_kernel/common.h (1)

19-28: Avoid copying tvm::ffi::Tensor and reconcile the “type checking” note.
GetDataPtr takes the tensor by value and the comment claims type checking without any guard. Consider passing by const reference and either adding a dtype check (if the TVM API supports it) or updating the comment.

🔧 Suggested tweak
 template<typename T>
-inline T* GetDataPtr(tvm::ffi::Tensor arr) {
+inline T* GetDataPtr(const tvm::ffi::Tensor& arr) {
   return static_cast<T*>(arr.data_ptr());
 }
mllm-kernel/README.md (1)

77-105: Add language specifier to the fenced code block.

The project structure code block should have a language specified for consistent rendering. Use text or plaintext for directory tree structures.

📝 Suggested fix
-```
+```text
 mllm-kernel/
 ├── cmake/
mllm-kernel/cmake/CPM.cmake (1)

14-26: Improve error message to include download failure details.

When the download fails, include the actual error message from download_status to help users diagnose network or URL issues.

📝 Suggested fix
   list(GET download_status 0 download_status_code)
+  list(GET download_status 1 download_error_message)
   if(NOT download_status_code EQUAL 0)
     # Fallback: copy from parent mllm project if available
     set(PARENT_CPM "${CMAKE_CURRENT_SOURCE_DIR}/../cmake/CPM.cmake")
     if(EXISTS ${PARENT_CPM})
       message(STATUS "Using CPM.cmake from parent project")
       file(COPY ${PARENT_CPM} DESTINATION "${CMAKE_BINARY_DIR}/cmake/")
       file(RENAME "${CMAKE_BINARY_DIR}/cmake/CPM.cmake" ${CPM_DOWNLOAD_LOCATION})
     else()
-      message(FATAL_ERROR "Failed to download CPM.cmake")
+      message(FATAL_ERROR "Failed to download CPM.cmake: ${download_error_message}")
     endif()
   endif()
mllm-kernel/cmake/MllmKernelConfig.cmake.in (1)

15-16: Highway availability flag may be inaccurate when CPU build is disabled.

MLLM_KERNEL_HAS_HIGHWAY is unconditionally set to TRUE, but Highway is only installed when MLLM_KERNEL_BUILD_CPU is enabled. Consider making this conditional via a CMake variable substitution.

📝 Suggested approach

In CMakeLists.txt, set a variable before configure_package_config_file:

if(MLLM_KERNEL_BUILD_CPU AND highway_ADDED)
  set(MLLM_KERNEL_HAS_HIGHWAY_VALUE TRUE)
else()
  set(MLLM_KERNEL_HAS_HIGHWAY_VALUE FALSE)
endif()

Then in the template:

-set(MLLM_KERNEL_HAS_HIGHWAY TRUE)
+set(MLLM_KERNEL_HAS_HIGHWAY @MLLM_KERNEL_HAS_HIGHWAY_VALUE@)
mllm-kernel/mllm_kernel/jit_utils/__init__.py (1)

21-32: Consider sorting __all__ for consistency.

The __all__ list could be sorted alphabetically for easier maintenance and to satisfy linter expectations. This is optional.

📝 Sorted version
 __all__ = [
+    "MLLM_KERNEL_CPU_CSRC_DIR",
+    "MLLM_KERNEL_CPU_INCLUDE_DIR",
+    "MLLM_KERNEL_CPU_PATH",
+    "MLLM_KERNEL_CUDA_PATH",
+    "MLLM_KERNEL_TOP_PATH",
+    "_tvm_ffi_cpp_load_inline",
     "cache_once",
+    "load_cpu_jit",
+    "load_cuda_jit",
     "make_cpp_args",
-    "load_cpu_jit",
-    "load_cuda_jit",
-    "_tvm_ffi_cpp_load_inline",
-    "MLLM_KERNEL_TOP_PATH",
-    "MLLM_KERNEL_CPU_PATH",
-    "MLLM_KERNEL_CUDA_PATH",
-    "MLLM_KERNEL_CPU_CSRC_DIR",
-    "MLLM_KERNEL_CPU_INCLUDE_DIR",
 ]
mllm-kernel/CMakeLists.txt (3)

25-38: Remove redundant Highway option settings.

The Highway options are set both via set(... FORCE) (lines 25-28) and via OPTIONS in CPMAddPackage (lines 35-37). Choose one approach to avoid confusion.

📝 Suggested fix - keep only CPMAddPackage OPTIONS
 if(MLLM_KERNEL_BUILD_CPU)
-  # Disable Highway tests and examples to speed up build
-  set(HWY_ENABLE_TESTS OFF CACHE BOOL "Disable Highway tests" FORCE)
-  set(HWY_ENABLE_EXAMPLES OFF CACHE BOOL "Disable Highway examples" FORCE)
-  set(HWY_ENABLE_CONTRIB OFF CACHE BOOL "Disable Highway contrib" FORCE)
-  set(BUILD_TESTING OFF CACHE BOOL "Disable testing" FORCE)
-
   CPMAddPackage(
     NAME highway
     GITHUB_REPOSITORY google/highway
     GIT_TAG 1.3.0
     OPTIONS
       "HWY_ENABLE_TESTS OFF"
       "HWY_ENABLE_EXAMPLES OFF"
       "HWY_ENABLE_CONTRIB OFF"
+      "BUILD_TESTING OFF"
   )

84-101: CPU kernel installation runs unconditionally.

The CPU kernel includes and sources are installed even when MLLM_KERNEL_BUILD_CPU is OFF. Consider wrapping these in a conditional block for consistency.

📝 Suggested fix
+if(MLLM_KERNEL_BUILD_CPU)
 # Install CPU kernel includes
 install(
   DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/mllm_kernel/cpu/include/
   DESTINATION include
   FILES_MATCHING
   PATTERN "*.h"
   PATTERN "*.hpp"
 )

 # Install CPU kernel sources (for JIT compilation)
 install(
   DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}/mllm_kernel/cpu/csrc/
   DESTINATION share/mllm_kernel/cpu/csrc
   FILES_MATCHING
   PATTERN "*.cpp"
   PATTERN "*.hpp"
   PATTERN "*.h"
 )
+endif()

139-142: Avoid hardcoded version string.

The Highway version is hardcoded as "1.3.0" but should reference the version from CPMAddPackage. Consider storing it in a variable.

📝 Suggested fix
+set(MLLM_KERNEL_HIGHWAY_VERSION "1.3.0")
+
 CPMAddPackage(
   NAME highway
   GITHUB_REPOSITORY google/highway
-  GIT_TAG 1.3.0
+  GIT_TAG ${MLLM_KERNEL_HIGHWAY_VERSION}
   ...
 )
 ...
 if(MLLM_KERNEL_BUILD_CPU AND highway_ADDED)
-  message(STATUS "  Highway version: 1.3.0")
+  message(STATUS "  Highway version: ${MLLM_KERNEL_HIGHWAY_VERSION}")
mllm-kernel/mllm_kernel/jit_utils/compile.py (2)

6-6: Remove unused import.

The os module is imported but never used in this file.

📝 Suggested fix
-import os

107-118: Placeholder implementation for CPU arch flags.

The function returns an empty list with a TODO comment. This means no architecture-specific optimizations (like -march=native) are applied. Consider implementing basic detection or documenting the limitation.

Would you like me to help implement basic CPU architecture detection, or should this be tracked as a follow-up issue?
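
For reference, a minimal detection sketch (helper name and flag choices are illustrative, not the PR's implementation):

```python
import platform

def _resolve_cpu_arch_flags() -> list[str]:
    """Map the host machine type to baseline compiler arch flags (sketch only)."""
    machine = platform.machine().lower()
    if machine in ("x86_64", "amd64"):
        return ["-march=native"]   # let gcc/clang pick the host's x86 feature set
    if machine in ("aarch64", "arm64"):
        return ["-mcpu=native"]    # AArch64 spelling of the same idea
    return []                      # unknown architecture: add no arch-specific flags
```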

Comment on lines +16 to +29
template<int Constant>
void add_constant(tvm::ffi::Tensor dst, tvm::ffi::Tensor src) {
  float* dst_ptr = GetDataPtr<float>(dst);
  const float* src_ptr = GetConstDataPtr<float>(src);
  size_t n = GetNumElements(src);
}

/**
 * @brief TVM FFI callable kernel class for runtime constant.
 */
void add_constant_runtime(tvm::ffi::Tensor dst, tvm::ffi::Tensor src, float constant) {
  float* dst_ptr = GetDataPtr<float>(dst);
  const float* src_ptr = GetConstDataPtr<float>(src);
  size_t n = GetNumElements(src);

⚠️ Potential issue | 🔴 Critical

Kernel is a no-op; dst is never written.
Both entry points compute pointers/size but don’t perform the add, so outputs are uninitialized.

✅ Minimal functional implementation
 template<int Constant>
 void add_constant(tvm::ffi::Tensor dst, tvm::ffi::Tensor src) {
   float* dst_ptr = GetDataPtr<float>(dst);
   const float* src_ptr = GetConstDataPtr<float>(src);
   size_t n = GetNumElements(src);
+  for (size_t i = 0; i < n; ++i) {
+    dst_ptr[i] = src_ptr[i] + static_cast<float>(Constant);
+  }
 }
@@
 void add_constant_runtime(tvm::ffi::Tensor dst, tvm::ffi::Tensor src, float constant) {
   float* dst_ptr = GetDataPtr<float>(dst);
   const float* src_ptr = GetConstDataPtr<float>(src);
   size_t n = GetNumElements(src);
+  for (size_t i = 0; i < n; ++i) {
+    dst_ptr[i] = src_ptr[i] + constant;
+  }
 }
🤖 Prompt for AI Agents
In `@mllm-kernel/mllm_kernel/cpu/csrc/add_constant.cpp` around lines 16 - 29, The
functions add_constant (template<int Constant>) and add_constant_runtime
currently compute pointers and size but never write to dst; implement
element-wise addition writing to dst_ptr: iterate i from 0 to n-1 and set
dst_ptr[i] = src_ptr[i] + (Constant for add_constant or the float parameter
constant for add_constant_runtime), ensuring you use the already-obtained float*
dst_ptr, const float* src_ptr and size_t n so outputs are populated.

Comment on lines +55 to +121
def add_constant(src: torch.Tensor, constant: int) -> torch.Tensor:
    """
    Add a compile-time constant to each element of a tensor using Highway SIMD.

    This version uses template specialization for the constant, which can enable
    additional compiler optimizations. Supported constants: 1, 2, 4, 8, 16.

    Args:
        src: Input tensor (must be float32 and contiguous)
        constant: Constant to add (must be one of: 1, 2, 4, 8, 16)

    Returns:
        Output tensor with same shape as input

    Example:
        >>> import torch
        >>> from mllm_kernel.cpu.jit import add_constant
        >>> x = torch.randn(1024)
        >>> y = add_constant(x, 16)  # y = x + 16
    """
    if constant not in (1, 2, 4, 8, 16):
        raise ValueError(
            f"Constant must be one of [1, 2, 4, 8, 16], got {constant}. "
            "Use add_constant_runtime for arbitrary constants."
        )

    if src.dtype != torch.float32:
        raise TypeError(f"Expected float32 tensor, got {src.dtype}")

    if not src.is_contiguous():
        src = src.contiguous()

    dst = torch.empty_like(src)
    module = _jit_add_constant_module(constant)
    module.add_constant(dst, src)
    return dst


def add_constant_runtime(src: torch.Tensor, constant: float) -> torch.Tensor:
    """
    Add a runtime constant to each element of a tensor using Highway SIMD.

    This version accepts any float constant at runtime, offering more flexibility
    than the template version.

    Args:
        src: Input tensor (must be float32 and contiguous)
        constant: Constant to add (any float value)

    Returns:
        Output tensor with same shape as input

    Example:
        >>> import torch
        >>> from mllm_kernel.cpu.jit import add_constant_runtime
        >>> x = torch.randn(1024)
        >>> y = add_constant_runtime(x, 3.14159)  # y = x + 3.14159
    """
    if src.dtype != torch.float32:
        raise TypeError(f"Expected float32 tensor, got {src.dtype}")

    if not src.is_contiguous():
        src = src.contiguous()

    dst = torch.empty_like(src)
    module = _jit_add_constant_runtime_module()
    module.add_constant_runtime(dst, src, float(constant))

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's understand the file structure and locate the JIT module implementations
find . -type f -name "*.py" | grep -E "(add_constant|jit)" | head -20

Repository: UbiquitousLearning/mllm

Length of output: 302


🏁 Script executed:

# Check the full file to see the JIT module implementations
wc -l mllm-kernel/mllm_kernel/cpu/jit/add_constant.py

Repository: UbiquitousLearning/mllm

Length of output: 118


🏁 Script executed:

# Read the full file to understand the JIT module setup
cat -n mllm-kernel/mllm_kernel/cpu/jit/add_constant.py

Repository: UbiquitousLearning/mllm

Length of output: 4432


🏁 Script executed:

# Check the JIT compilation utilities
cat -n mllm-kernel/mllm_kernel/jit_utils/compile.py

Repository: UbiquitousLearning/mllm

Length of output: 14992


🏁 Script executed:

# Look for the C++ implementation
find . -type f -name "add_constant.cpp" 2>/dev/null

Repository: UbiquitousLearning/mllm

Length of output: 118


🏁 Script executed:

# Search for other CPU JIT functions to see if they have device checks
find . -type f -name "*.py" -path "*/cpu/jit/*" -exec grep -l "def " {} \;

Repository: UbiquitousLearning/mllm

Length of output: 116


🏁 Script executed:

# Search for device checks in similar functions
rg "device.type.*cpu|Expected.*CPU" --type py

Repository: UbiquitousLearning/mllm

Length of output: 583


🏁 Script executed:

# Read the C++ implementation
cat -n mllm-kernel/mllm_kernel/cpu/csrc/add_constant.cpp

Repository: UbiquitousLearning/mllm

Length of output: 1180


🏁 Script executed:

# Search for similar device checks in other Python JIT wrappers
find . -type f -path "*/cpu/jit/*.py" -exec grep -l "add_constant" {} \;

Repository: UbiquitousLearning/mllm

Length of output: 162


🏁 Script executed:

# Check if there are any test cases that might test device behavior
find . -type f -name "*test*.py" -path "*cpu*jit*" -o -name "*test*.py" -path "*add_constant*" 2>/dev/null | head -10

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Search for how device validation is done elsewhere in similar kernels
rg "\.device\." --type py -A 2 | head -30

Repository: UbiquitousLearning/mllm

Length of output: 1212


🏁 Script executed:

# Look at the common.h header to understand what GetDataPtr does
find . -type f -name "common.h" | xargs cat

Repository: UbiquitousLearning/mllm

Length of output: 888


Add CPU device validation to both functions.
These functions pass tensor pointers directly to CPU JIT code via data_ptr(), which dereferences memory at the returned address. If a CUDA tensor is passed, data_ptr() returns a GPU memory address that the CPU C++ code will attempt to read, causing crashes or silent data corruption. Add an explicit CPU device check before any tensor operations.

🛡️ Suggested fix
 def add_constant(src: torch.Tensor, constant: int) -> torch.Tensor:
     if constant not in (1, 2, 4, 8, 16):
         raise ValueError(
             f"Constant must be one of [1, 2, 4, 8, 16], got {constant}. "
             "Use add_constant_runtime for arbitrary constants."
         )
 
+    if src.device.type != "cpu":
+        raise TypeError(f"Expected CPU tensor, got {src.device}")
     if src.dtype != torch.float32:
         raise TypeError(f"Expected float32 tensor, got {src.dtype}")
 
 def add_constant_runtime(src: torch.Tensor, constant: float) -> torch.Tensor:
+    if src.device.type != "cpu":
+        raise TypeError(f"Expected CPU tensor, got {src.device}")
     if src.dtype != torch.float32:
         raise TypeError(f"Expected float32 tensor, got {src.dtype}")
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 76-79: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 82-82: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 114-114: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In `@mllm-kernel/mllm_kernel/cpu/jit/add_constant.py` around lines 55 - 121, Both
add_constant and add_constant_runtime must validate that src is on the CPU to
avoid passing GPU addresses to CPU JIT code; add a device check at the start of
each function (e.g., assert or raise TypeError if src.device.type != "cpu")
before any tensor operations or calls into
_jit_add_constant_module/_jit_add_constant_runtime_module, and ensure dst is
created on the same CPU device (torch.empty_like(src) is fine once src is
confirmed CPU). Ensure the error message mentions the function name
(add_constant or add_constant_runtime) and that only CPU tensors are accepted.

Comment on lines +29 to +30
SM8850 = 87, // v81
SM8845 = 97, // v81

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find and examine the QnnTargetMachine.hpp file
find . -name "QnnTargetMachine.hpp" -type f

Repository: UbiquitousLearning/mllm

Length of output: 111


🏁 Script executed:

# Check for SM8845 references in the codebase
rg -n 'SM8845' -A2 -B2

Repository: UbiquitousLearning/mllm

Length of output: 892


🏁 Script executed:

# Check for SM8850 FFI wrapper and overall FFI structure
rg -n 'SM8850|SM8845' --type=py -A2 -B2

Repository: UbiquitousLearning/mllm

Length of output: 851


🏁 Script executed:

# Check the complete QcomChipset class in the FFI file
sed -n '450,500p' pymllm/ffi/__init__.py

Repository: UbiquitousLearning/mllm

Length of output: 1924


🏁 Script executed:

# Check file encoding and trailing whitespace for QnnTargetMachine.hpp
file ./mllm/backends/qnn/aot/QnnTargetMachine.hpp

Repository: UbiquitousLearning/mllm

Length of output: 111


🏁 Script executed:

# Check for trailing whitespace and line endings in the specific lines
hexdump -C ./mllm/backends/qnn/aot/QnnTargetMachine.hpp | grep -A2 -B2 "SM8845"

Repository: UbiquitousLearning/mllm

Length of output: 112


🏁 Script executed:

# Check file encoding and trailing whitespace using Python
python3 <<'EOF'
import sys

filepath = "./mllm/backends/qnn/aot/QnnTargetMachine.hpp"

with open(filepath, 'rb') as f:
    content = f.read()

# Check BOM
if content.startswith(b'\xef\xbb\xbf'):
    print("ERROR: File has UTF-8 BOM")
else:
    print("OK: No UTF-8 BOM detected")

# Check encoding
try:
    text = content.decode('utf-8')
    print("OK: File is valid UTF-8")
except UnicodeDecodeError as e:
    print(f"ERROR: File is not valid UTF-8: {e}")
    sys.exit(1)

# Check for control codes (C0: 0x00-0x08, 0x0B-0x0C, 0x0E-0x1F; C1: 0x7F-0x9F)
forbidden_chars = []
for i, char in enumerate(text):
    code = ord(char)
    # C0 control codes (excluding 0x09 tab and 0x0A line feed)
    if (0x00 <= code <= 0x08) or (0x0B <= code <= 0x0C) or (0x0E <= code <= 0x1F):
        forbidden_chars.append((i, code, 'C0'))
    # C1 control codes and DEL
    elif (0x7F <= code <= 0x9F):
        forbidden_chars.append((i, code, 'C1'))

if forbidden_chars:
    print(f"ERROR: Found {len(forbidden_chars)} forbidden control codes:")
    for pos, code, type_ in forbidden_chars[:5]:
        print(f"  Position {pos}: 0x{code:02X} ({type_})")
else:
    print("OK: No forbidden control codes found")

# Check for trailing whitespace (excluding newlines)
lines = text.split('\n')
trailing_ws_lines = []
for i, line in enumerate(lines, 1):
    if line.rstrip('\r') != line.rstrip():
        trailing_ws_lines.append(i)

if trailing_ws_lines:
    print(f"ERROR: Found trailing whitespace on {len(trailing_ws_lines)} lines: {trailing_ws_lines[:10]}")
else:
    print("OK: No trailing whitespace found")

# Check line endings (should be LF only, not CRLF)
if '\r\n' in text.decode('utf-8', errors='replace'):
    print("ERROR: File uses CRLF line endings (should use LF)")
elif '\r' in text.decode('utf-8', errors='replace'):
    print("ERROR: File uses CR line endings (should use LF)")
else:
    print("OK: File uses LF line endings")

# Show lines around SM8845
lines = text.split('\n')
for i, line in enumerate(lines):
    if 'SM8845' in line:
        print(f"\nLine {i+1}: {repr(line)}")
        if i > 0:
            print(f"Line {i}: {repr(lines[i-1])}")
        if i < len(lines) - 1:
            print(f"Line {i+2}: {repr(lines[i+1])}")
EOF

Repository: UbiquitousLearning/mllm

Length of output: 357


🏁 Script executed:

# Check if SM8845 FFI wrapper exists in pymllm/ffi/__init__.py
rg -n 'def SM8845' pymllm/ffi/__init__.py

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Double-check for SM8845 in the entire pymllm/ffi/__init__.py file
rg 'SM8845' pymllm/ffi/__init__.py

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Get the line count and structure of the QcomChipset class
rg -n 'class QcomChipset' -A 100 pymllm/ffi/__init__.py | head -120

Repository: UbiquitousLearning/mllm

Length of output: 4007


🏁 Script executed:

# Verify line endings in QnnTargetMachine.hpp using od
od -c ./mllm/backends/qnn/aot/QnnTargetMachine.hpp | grep -A5 "S   M   8   8   4   5"

Repository: UbiquitousLearning/mllm

Length of output: 107


Add missing SM8845() FFI wrapper to maintain API consistency.

The C++ enum addition is correct and follows all coding guidelines. However, the SM8845 chipset is missing its Python FFI wrapper in pymllm/ffi/__init__.py. Since SM8850() has a corresponding wrapper at line 471-472, SM8845() must be added to the QcomChipset class for API completeness and consistency.

Add the following after line 472 in pymllm/ffi/__init__.py:
        @staticmethod
        def SM8845() -> QcomChipset:
            return tvm_ffi.get_global_func("mllm.qualcomm.QcomChipset.SM8845")()
🤖 Prompt for AI Agents
In `@mllm/backends/qnn/aot/QnnTargetMachine.hpp` around lines 29 - 30, Add the
missing Python FFI wrapper for the new enum value by adding a static method
SM8845() to the QcomChipset class that mirrors the existing SM8850() wrapper;
implement it to return
tvm_ffi.get_global_func("mllm.qualcomm.QcomChipset.SM8845")(), ensuring the
method name is exactly SM8845 and placed alongside the other chipset wrappers in
pymllm/ffi/__init__.py so the Python API matches the C++ enum.

…ADME files

- Added CUDA standard support and enabled export of compile commands in CMakeLists.txt.
- Removed outdated README.md files for ascend, cpu, and cuda directories.
- Refactored namespace usage in add_constant.cpp for clarity.
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@mllm-kernel/CMakeLists.txt`:
- Around line 105-111: The export install block for MllmKernelTargets must be
guarded so CMake won't call INSTALL(EXPORT) with an empty export; wrap the
install(...) for MllmKernelTargets (FILE MllmKernelTargets.cmake NAMESPACE
mllm_kernel::) inside a conditional that only runs when targets were actually
created — for example check the build flag MLLM_KERNEL_BUILD_CPU or test for at
least one exported target (e.g. if(TARGET mllm_kernel::<one_of_your_targets>) or
maintain a boolean like MLLM_KERNEL_HAVE_TARGETS that's set when you create
targets and use if(MLLM_KERNEL_HAVE_TARGETS) before the install(...) call.

In `@mllm-kernel/mllm_kernel/cpu/csrc/add_constant.cpp`:
- Line 20: The printf call in add_constant.cpp is using the wrong format
specifier for the size_t variable n; locate the printf("xwk: %d\n", n) and
change it to use the %zu specifier (or cast n to unsigned long and use %lu if
you prefer consistent width across platforms) so the size_t is printed safely
and avoids undefined behavior.
- Line 6: The include directive currently references the wrong extension; change
the include in add_constant.cpp from "mllm_kernel/common.h" to the correct
header "mllm_kernel/common.hpp" so the compiler finds the actual header (update
the include line in mllm_kernel/cpu/csrc/add_constant.cpp accordingly).

Comment on lines +105 to +111
# Export targets
install(
  EXPORT MllmKernelTargets
  FILE MllmKernelTargets.cmake
  NAMESPACE mllm_kernel::
  DESTINATION lib/cmake/mllm_kernel
)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name CMakeLists.txt -path "*/mllm-kernel/*" | head -5

Repository: UbiquitousLearning/mllm

Length of output: 95


🏁 Script executed:

cat -n ./mllm-kernel/CMakeLists.txt | head -150

Repository: UbiquitousLearning/mllm

Length of output: 5244


🏁 Script executed:

rg "MllmKernelTargets" ./mllm-kernel/

Repository: UbiquitousLearning/mllm

Length of output: 405


🏁 Script executed:

cat -n ./mllm-kernel/cmake/MllmKernelConfig.cmake.in

Repository: UbiquitousLearning/mllm

Length of output: 616


🏁 Script executed:

rg "EXPORT\s+MllmKernelTargets" -A 2 -B 2 ./mllm-kernel/

Repository: UbiquitousLearning/mllm

Length of output: 813


🏁 Script executed:

rg "add_library|add_executable" ./mllm-kernel/CMakeLists.txt -A 1

Repository: UbiquitousLearning/mllm

Length of output: 174


🏁 Script executed:

web_search
CMake install EXPORT empty target set behavior

Repository: UbiquitousLearning/mllm

Length of output: 161


🌐 Web query:

CMake install EXPORT empty target set error behavior

💡 Result:

install(EXPORT <name> …) can only install an export set that exists. An export set is created implicitly when at least one install(TARGETS … EXPORT <name> …) rule contributes targets to that set (the install docs describe install(EXPORT) as installing targets “associated with the export <export-name> using the EXPORT option of install(TARGETS)”).[1]

So if no targets end up associated with <name>, CMake errors out at configure/generate time with:

  • CMake Error: INSTALL(EXPORT) given unknown export "<name>" (commonly seen when the set is effectively empty, e.g., due to component/config gating).[2]

A known example is when the only “thing” in the set is a NAMELINK_ONLY install; CMake can still emit the “unknown export” error even though you wrote install(TARGETS … EXPORT Development …), because the export set winds up empty in practice.[2]

Typical workarounds

  • Only call install(EXPORT <name> …) when you know at least one target will be added to that export set (guard it with the same options/components that control the install(TARGETS … EXPORT <name> …) calls).
  • Ensure the set contains at least one real exported target/artifact (or avoid splitting in a way that leaves a component’s export set empty; the namelink-only split is the common pitfall).[2]

Sources: CMake install() documentation for install(EXPORT) semantics[1], and Kitware issue report showing the “unknown export” error when an export set ends up empty (namelink-only case).[2]

[1] (cmake.org)
[2] (gitlab.kitware.com)



Guard export installation when no targets are exported.

If MLLM_KERNEL_BUILD_CPU is OFF or Highway fails to add, no targets are contributed to MllmKernelTargets. CMake will fail with "INSTALL(EXPORT) given unknown export" error when attempting to install an empty export set. Guard this install block to only run when at least one target is available.

🔧 Suggested guard
-install(
-  EXPORT MllmKernelTargets
-  FILE MllmKernelTargets.cmake
-  NAMESPACE mllm_kernel::
-  DESTINATION lib/cmake/mllm_kernel
-)
+if(TARGET mllm_kernel_highway)
+  install(
+    EXPORT MllmKernelTargets
+    FILE MllmKernelTargets.cmake
+    NAMESPACE mllm_kernel::
+    DESTINATION lib/cmake/mllm_kernel
+  )
+endif()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 # Export targets
-install(
-  EXPORT MllmKernelTargets
-  FILE MllmKernelTargets.cmake
-  NAMESPACE mllm_kernel::
-  DESTINATION lib/cmake/mllm_kernel
-)
+if(TARGET mllm_kernel_highway)
+  install(
+    EXPORT MllmKernelTargets
+    FILE MllmKernelTargets.cmake
+    NAMESPACE mllm_kernel::
+    DESTINATION lib/cmake/mllm_kernel
+  )
+endif()
🤖 Prompt for AI Agents
In `@mllm-kernel/CMakeLists.txt` around lines 105 - 111, The export install block
for MllmKernelTargets must be guarded so CMake won't call INSTALL(EXPORT) with
an empty export; wrap the install(...) for MllmKernelTargets (FILE
MllmKernelTargets.cmake NAMESPACE mllm_kernel::) inside a conditional that only
runs when targets were actually created — for example check the build flag
MLLM_KERNEL_BUILD_CPU or test for at least one exported target (e.g. if(TARGET
mllm_kernel::<one_of_your_targets>) or maintain a boolean like
MLLM_KERNEL_HAVE_TARGETS that's set when you create targets and use
if(MLLM_KERNEL_HAVE_TARGETS) before the install(...) call.

//
// Add constant kernel using Highway SIMD.

#include "mllm_kernel/common.h"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Locate common.h and check include roots
echo "=== Searching for common.h ==="
find . -name "common.h" -type f

echo -e "\n=== Checking file structure around add_constant.cpp ==="
find . -path "*/mllm-kernel/mllm_kernel/cpu/csrc/add_constant.cpp" -o -path "*/mllm-kernel/mllm_kernel/common.h" | head -20

echo -e "\n=== Searching for include_directories and related configurations ==="
find . -name "CMakeLists.txt" -type f | xargs grep -l "include_directories\|target_include_directories" 2>/dev/null | grep mllm-kernel

echo -e "\n=== Listing files in mllm-kernel directory ==="
find ./mllm-kernel -type f -name "*.h" | head -20

Repository: UbiquitousLearning/mllm

Length of output: 353


🏁 Script executed:

#!/bin/bash
# Examine the structure and configuration more carefully

echo "=== Directory structure of mllm-kernel ==="
find ./mllm-kernel -type d | head -20

echo -e "\n=== All files in mllm-kernel (non-hidden) ==="
find ./mllm-kernel -type f | head -30

echo -e "\n=== CMakeLists.txt contents (first 100 lines) ==="
head -100 ./mllm-kernel/CMakeLists.txt

echo -e "\n=== add_constant.cpp includes and first lines ==="
head -20 ./mllm-kernel/mllm_kernel/cpu/csrc/add_constant.cpp

echo -e "\n=== Search for any file with 'common' in its name ==="
find . -iname "*common*" -type f

Repository: UbiquitousLearning/mllm

Length of output: 5014


Include path uses incorrect file extension.

The include statement references mllm_kernel/common.h, but the actual header file is mllm_kernel/common.hpp (at ./mllm-kernel/include/mllm_kernel/common.hpp). Change the include to #include "mllm_kernel/common.hpp" to resolve the compilation error.

🧰 Tools
🪛 Clang (14.0.6)

[error] 6-6: 'mllm_kernel/common.h' file not found

(clang-diagnostic-error)

🤖 Prompt for AI Agents
In `@mllm-kernel/mllm_kernel/cpu/csrc/add_constant.cpp` at line 6, The include
directive currently references the wrong extension; change the include in
add_constant.cpp from "mllm_kernel/common.h" to the correct header
"mllm_kernel/common.hpp" so the compiler finds the actual header (update the
include line in mllm_kernel/cpu/csrc/add_constant.cpp accordingly).

  float* dst_ptr = GetDataPtr<float>(dst);
  const float* src_ptr = GetConstDataPtr<float>(src);
  size_t n = GetNumElements(src);
  printf("xwk: %d\n", n);

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n mllm-kernel/mllm_kernel/cpu/csrc/add_constant.cpp | head -30

Repository: UbiquitousLearning/mllm

Length of output: 1125


Use %zu format specifier for size_t variable.

Variable n is declared as size_t on line 19, but the printf on line 20 uses %d which is for int. This causes undefined behavior on platforms where size_t differs from int.

🔧 Suggested fix
-  printf("xwk: %d\n", n);
+  printf("xwk: %zu\n", n);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
printf("xwk: %d\n", n);
printf("xwk: %zu\n", n);
🤖 Prompt for AI Agents
In `@mllm-kernel/mllm_kernel/cpu/csrc/add_constant.cpp` at line 20, The printf
call in add_constant.cpp is using the wrong format specifier for the size_t
variable n; locate the printf("xwk: %d\n", n) and change it to use the %zu
specifier (or cast n to unsigned long and use %lu if you prefer consistent width
across platforms) so the size_t is printed safely and avoids undefined behavior.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@README.md`:
- Around line 75-78: Update the table header string "Model(v1)" in README.md to
"Model(v2)" so it matches the corresponding header in README-ZH.md; locate the
header row containing the literal "Model(v1)" and replace it with "Model(v2)"
ensuring the rest of the table markup (the pipe-separated columns and separator
row) remains unchanged.

Comment thread README.md
Comment on lines 75 to +78
| Model(v1) | CPU | Hexagon NPU <br> INT8 |
|-----------------------------------------------------------------------------|------|-----------------------|
| [Qwen3-0.6B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-0.6B-w4a32kai) | |
| [Qwen3-1.7B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-1.7B-w4a8-i8mm-kai) | |
| [Qwen3-1.7B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-1.7B-w4a8-i8mm-kai) | [W4A16-SM8650](https://modelscope.cn/models/mllmTeam/Qwen3-1.7B-Qnn-AOT-SM8650/) |

⚠️ Potential issue | 🟡 Minor

Table header says "Model(v1)" but this is the mllm v2 section.

The Chinese README (README-ZH.md line 77) was updated to Model(v2), but this file still reads Model(v1). This should be consistent across both READMEs.

-| Model(v1)                                                                   | CPU  | Hexagon NPU <br> INT8 |
+| Model(v2)                                                                   | CPU  | Hexagon NPU <br> INT8 |
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
| Model(v1) | CPU | Hexagon NPU <br> INT8 |
|-----------------------------------------------------------------------------|------|-----------------------|
| [Qwen3-0.6B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-0.6B-w4a32kai) | |
| [Qwen3-1.7B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-1.7B-w4a8-i8mm-kai) | |
| [Qwen3-1.7B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-1.7B-w4a8-i8mm-kai) | [W4A16-SM8650](https://modelscope.cn/models/mllmTeam/Qwen3-1.7B-Qnn-AOT-SM8650/) |
| Model(v2) | CPU | Hexagon NPU <br> INT8 |
|-----------------------------------------------------------------------------|------|-----------------------|
| [Qwen3-0.6B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-0.6B-w4a32kai) | |
| [Qwen3-1.7B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-1.7B-w4a8-i8mm-kai) | [W4A16-SM8650](https://modelscope.cn/models/mllmTeam/Qwen3-1.7B-Qnn-AOT-SM8650/) |
🤖 Prompt for AI Agents
In `@README.md` around lines 75 - 78, Update the table header string "Model(v1)"
in README.md to "Model(v2)" so it matches the corresponding header in
README-ZH.md; locate the header row containing the literal "Model(v1)" and
replace it with "Model(v2)" ensuring the rest of the table markup (the
pipe-separated columns and separator row) remains unchanged.

…ty files

- Removed obsolete .clangd configuration file and added a new Python script to generate it based on the project's requirements.
- Introduced new header files for source location management and utility functions for CUDA and host environments.
- Updated README.md with instructions for generating the recommended .clangd configuration.
- Added include paths for CPU, CUDA, and Ascend in the JIT utilities to streamline compilation.
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 10

🤖 Fix all issues with AI agents
In `@mllm-kernel/include/mllm_kernel/tensor.hpp`:
- Around line 173-178: The standalone attribute `[[unlikely]];` is applied to an
empty statement and has no effect; remove that empty-statement attribute and
attach the attribute to the branch check instead — i.e., in function verify(...)
(use symbols verify, has_value, value_, m_name_str, Panic) delete the lone
`[[unlikely]];` and place `[[unlikely]]` immediately on the `if (value_ !=
value)` condition so the branch prediction hint applies to the failing-size
branch that calls Panic.

In `@mllm-kernel/include/mllm_kernel/utils.cuh`:
- Around line 31-68: The file currently defines top-level namespaces like device
(and host ranges) causing unresolved symbols such as DebugInfo and panic which
live in mllm_kernel::host; wrap the existing top-level namespace blocks inside
namespace mllm_kernel (e.g., change namespace device { ... } to namespace
mllm_kernel { namespace device { ... } } ), so symbols like PDLWaitPrimary,
PDLTriggerSecondary, and pointer::offset are declared under mllm_kernel::device,
and unqualified references to DebugInfo and panic resolve to mllm_kernel::host;
alternatively, if you prefer not to nest, explicitly qualify uses of DebugInfo
and panic with mllm_kernel::host::DebugInfo / mllm_kernel::host::panic or add a
using declaration inside the device namespace to bring them into scope.
- Around line 72-77: The standalone [[unlikely]] attribute is applied to an
empty statement and has no effect; move it onto the if-statement so the branch
hint applies to the error path. Modify RuntimeDeviceCheck so the if reads: if
(error != ::cudaSuccess) [[unlikely]] { ::host::panic(location, "CUDA error: ",
::cudaGetErrorString(error)); } ensuring DebugInfo, ::host::panic and
::cudaGetErrorString are unchanged.

In `@mllm-kernel/include/mllm_kernel/utils.hpp`:
- Line 135: The helper dtype_bytes currently does integer-divide dtype.bits by 8
which yields 0 for sub-byte types; update dtype_bytes (involving DLDataType) to
compute bytes with ceiling division (e.g., use or implement div_ceil(dtype.bits,
8u)) or add a debug/assert to catch bits<8 so it never returns 0; ensure the
change returns at least 1 for any positive bit-width and update any dependent
calculations that assume full-byte sizes.
- Around line 130-133: div_ceil can invoke signed integer overflow via the
expression (a + b - 1) when a and b are large; change div_ceil to first promote
inputs to a common integral type (e.g., std::common_type_t<T,U>), then compute
the result without addition that could overflow by using quotient-plus-remainder
logic: compute q = aa / bb and r = aa % bb and return q + (r != 0); update the
function signature (div_ceil) to use the promoted variables aa/bb so mixed
signed/unsigned types behave consistently and avoid intermediate overflow.
- Around line 4-34: The conditional CUDA preprocessor block that handles
CUDA_VERSION <= 12010 includes a non-existent header "source_location.h",
causing build failures; in the branch guarded by __CUDACC__ and CUDA_VERSION <=
12010 (the block that defines/undefines consteval and manipulates
__cpp_consteval/_NODISCARD), replace the include of "source_location.h" with
"source_location.hpp" so the correct header is used for the source_location
implementation.

In `@mllm-kernel/mllm_kernel/__main__.py`:
- Around line 83-94: The PathMatch regex in the clangd_content f-string is
over-escaped and will only match paths containing a literal backslash; update
the string used in clangd_content so the pattern is ".*\\.(cu|cuh)$" (i.e., use
a single escaped backslash in the Python source) so the emitted file contains
".*\.(cu|cuh)$" and will match normal Unix .cu/.cuh files; change the pattern
inside the f-string where clangd_content is defined (the line containing
PathMatch) — keeping render_flags, base_flags and cuda_flags usage unchanged.
- Around line 54-60: The call to subprocess.run("nvidia-smi", ...) can raise
CalledProcessError or FileNotFoundError and must be wrapped in a try/except:
catch CalledProcessError and FileNotFoundError around the subprocess.run and on
exception set compute_cap = None (or similar sentinel) instead of letting the
exception propagate; only parse compute_cap into major, minor if compute_cap is
not None; build cuda_flags only when compute_cap is available and non-empty;
finally, when writing the .clangd content, conditionally emit the CUDA-related
section only if cuda_flags is non-empty so systems without nvidia-smi/GPU skip
CUDA flags gracefully.

In `@mllm-kernel/mllm_kernel/jit_utils/compile.py`:
- Line 129: The mapping for torch.bfloat16 is misspelled as "bfp16_t" and must
be corrected to "bf16_t" so generated C++ templates use the existing alias;
update the mapping entry (where torch.bfloat16 is mapped in the type-to-C++
mapping—e.g., the TYPE_TO_CTYPE / similar dict in mllm_kernel.jit_utils.compile)
to use "bf16_t" instead of "bfp16_t" so the JIT kernel compiles against the
utils.cuh alias (using bf16_t = __nv_bfloat16).

In `@mllm-kernel/README.md`:
- Line 29: The README's development install instruction uses the wrong git clone
URL string "https://github.com/mllm/mllm-kernel.git"; update that clone command
to the correct repository URL "https://github.com/UbiquitousLearning/mllm.git"
so developers clone the proper repo (replace the incorrect URL in the README
line containing the git clone command).
🧹 Nitpick comments (11)
mllm-kernel/mllm_kernel/__main__.py (2)

47-51: Nit: use iterable unpacking instead of concatenation (RUF005).

♻️ Suggested change
-    include_paths = [find_include_path(), find_dlpack_include_path()] + [
-        MLLM_KERNEL_CPU_INCLUDE_DIR,
-        MLLM_KERNEL_CUDA_INCLUDE_DIR,
-        MLLM_KERNEL_INCLUDE_DIR,
-    ]
+    include_paths = [
+        find_include_path(),
+        find_dlpack_include_path(),
+        MLLM_KERNEL_CPU_INCLUDE_DIR,
+        MLLM_KERNEL_CUDA_INCLUDE_DIR,
+        MLLM_KERNEL_INCLUDE_DIR,
+    ]

114-118: Command name show-clangd-recommend-config is misleading — it writes the file, not just shows it.

The command actually creates a .clangd file (line 99). Consider a name like generate-clangd-config or init-clangd to set the right expectation.

mllm-kernel/README.md (1)

96-96: Add a language identifier to the fenced code block.

The project structure code block lacks a language specifier. Use text or plaintext to satisfy markdownlint (MD040).

-```
+```text
mllm-kernel/include/mllm_kernel/source_location.hpp (1)

16-19: Consider adding [[nodiscard]] to accessor functions.

The standard std::source_location marks its accessors [[nodiscard]]. Adding the attribute to the fallback keeps the two paths consistent and lets compilers warn on discarded return values.

Suggested fix
-  constexpr unsigned line() const noexcept { return 0; }
-  constexpr unsigned column() const noexcept { return 0; }
-  constexpr const char* file_name() const noexcept { return ""; }
-  constexpr const char* function_name() const noexcept { return ""; }
+  [[nodiscard]] constexpr unsigned line() const noexcept { return 0; }
+  [[nodiscard]] constexpr unsigned column() const noexcept { return 0; }
+  [[nodiscard]] constexpr const char* file_name() const noexcept { return ""; }
+  [[nodiscard]] constexpr const char* function_name() const noexcept { return ""; }
mllm-kernel/include/mllm_kernel/utils.hpp (1)

118-126: Parameter pack name offset shadows the enclosing function name.

Both pointer::offset functions have a parameter pack named offset, identical to the function name. While technically valid C++, it's confusing and may trigger shadowing warnings on some compilers.

Suggested rename
 template<typename T = char, std::integral... U>
-inline auto offset(void* ptr, U... offset) -> void* {
-  return static_cast<T*>(ptr) + (... + offset);
+inline auto offset(void* ptr, U... offsets) -> void* {
+  return static_cast<T*>(ptr) + (... + offsets);
 }
 
 template<typename T = char, std::integral... U>
-inline auto offset(const void* ptr, U... offset) -> const void* {
-  return static_cast<const T*>(ptr) + (... + offset);
+inline auto offset(const void* ptr, U... offsets) -> const void* {
+  return static_cast<const T*>(ptr) + (... + offsets);
 }
mllm-kernel/include/mllm_kernel/utils.cuh (1)

56-64: Same parameter-name shadowing as utils.hpp.

The device-side pointer::offset functions also have parameter pack U... offset shadowing the function name. Consider renaming to offsets for consistency with any fix applied in utils.hpp.

mllm-kernel/include/mllm_kernel/tensor.hpp (2)

371-371: std::span over std::initializer_list — lifetime depends on single-expression chaining.

shape_ and strides_ are stored as std::span<const SizeRef>, but they're initialized from std::initializer_list whose backing array only lives until the end of the full-expression. The const&&-qualified verify method guides callers toward a single chained expression (e.g., TensorMatcher({...}).with_dtype<float>().verify(view)), which keeps the backing arrays alive.

This is fine for the intended pattern, but an auto&& m = TensorMatcher({...}) followed by a later use would dangle silently. Consider adding a brief comment at the class declaration documenting this constraint, so future maintainers don't accidentally break the pattern.

Also applies to: 373-379


31-33: kAnySize and kNullSize have the same value.

Both are static_cast<int64_t>(-1). While functionally correct (kAnySize is checked before set_value, so they serve different roles at different code paths), having two distinct semantic concepts mapped to the same sentinel is a maintenance hazard. Consider using different sentinel values, or a comment explaining why they share a value.

mllm-kernel/mllm_kernel/jit_utils/compile.py (2)

107-117: Stub _get_cpu_arch_flags() returns empty list — TODO left unimplemented.

The function has a # TODO comment and returns []. This means no architecture-specific SIMD flags (e.g., -mavx2, -march=native) are passed to the compiler, potentially resulting in sub-optimal codegen for CPU JIT kernels.

Would you like me to generate an implementation that detects CPU features and returns appropriate flags, or open an issue to track this?
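For reference, a minimal sketch of such detection, assuming a GCC/Clang-compatible host compiler (the helper name and flag choices below are illustrative, not part of this PR):

```python
import platform


def _get_cpu_arch_flags_sketch() -> list[str]:
    # Illustrative only: pick a broad "tune for the build machine" flag per ISA.
    machine = platform.machine().lower()
    if machine in ("x86_64", "amd64"):
        # GCC/Clang enable the host's SIMD extensions (AVX2/AVX-512, ...) via -march.
        return ["-march=native"]
    if machine in ("aarch64", "arm64"):
        # On AArch64, -mcpu selects both the core and its feature set.
        return ["-mcpu=native"]
    return []
```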


27-41: cache_once is not thread-safe.

The result_map dict is accessed without synchronization. If load_cpu_jit or other cached functions are called concurrently (e.g., from multiple threads in a server), there's a potential data race on the dictionary. Consider using threading.Lock if concurrent use is expected.
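A minimal lock-guarded sketch of the same caching idea, assuming hashable call arguments (cache_once_locked, _cache, and _cache_lock are illustrative names, not the module's API). Holding the lock while fn runs serializes concurrent JIT builds, which is usually acceptable for one-off compilations:

```python
import functools
import threading

_cache: dict[tuple, object] = {}
_cache_lock = threading.Lock()


def cache_once_locked(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Key on the function plus its (hashable) arguments.
        key = (fn.__qualname__, args, tuple(sorted(kwargs.items())))
        with _cache_lock:
            if key not in _cache:
                _cache[key] = fn(*args, **kwargs)
            return _cache[key]

    return wrapper
```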

mllm-kernel/mllm_kernel/jit_utils/__init__.py (1)

32-32: Private function _tvm_ffi_cpp_load_inline exported in __all__.

By Python convention, the leading underscore signals a private/internal symbol. Exporting it in __all__ is contradictory. Consider either removing the underscore (making it a public legacy API) or removing it from __all__ to keep it as an internal-only helper.
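A small sketch of the second option, assuming the public loaders are load_cpu_jit and load_cuda_jit (the exact export list is illustrative):

```python
__all__ = [
    "load_cpu_jit",
    "load_cuda_jit",
    # "_tvm_ffi_cpp_load_inline" stays importable but is intentionally not re-exported.
]
```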

Comment on lines +173 to +178
auto verify(int64_t value, const char* prefix, int64_t dim) -> void {
if (this->has_value()) {
if (value_ != value) {
[[unlikely]];
Panic("Size mismatch for ", m_name_str(prefix, dim), ": expected ", value_, " but got ", value);
}
Contributor


⚠️ Potential issue | 🟡 Minor

[[unlikely]] on empty statement (same pattern as utils.cuh).

[[unlikely]]; on Line 176 applies to the empty statement, not to the enclosing if branch. This has no effect on branch prediction.

Suggested fix
     if (this->has_value()) {
-      if (value_ != value) {
-        [[unlikely]];
-        Panic("Size mismatch for ", m_name_str(prefix, dim), ": expected ", value_, " but got ", value);
+      if (value_ != value) [[unlikely]] {
+        Panic("Size mismatch for ", m_name_str(prefix, dim), ": expected ", value_, " but got ", value);
       }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto verify(int64_t value, const char* prefix, int64_t dim) -> void {
if (this->has_value()) {
if (value_ != value) {
[[unlikely]];
Panic("Size mismatch for ", m_name_str(prefix, dim), ": expected ", value_, " but got ", value);
}
auto verify(int64_t value, const char* prefix, int64_t dim) -> void {
if (this->has_value()) {
if (value_ != value) [[unlikely]] {
Panic("Size mismatch for ", m_name_str(prefix, dim), ": expected ", value_, " but got ", value);
}
🤖 Prompt for AI Agents
In `@mllm-kernel/include/mllm_kernel/tensor.hpp` around lines 173 - 178, The
standalone attribute `[[unlikely]];` is applied to an empty statement and has no
effect; remove that empty-statement attribute and attach the attribute to the
branch check instead — i.e., in function verify(...) (use symbols verify,
has_value, value_, m_name_str, Panic) delete the lone `[[unlikely]];` and place
`[[unlikely]]` immediately on the `if (value_ != value)` condition so the branch
prediction hint applies to the failing-size branch that calls Panic.

Comment on lines +31 to +68
namespace device {

#define SGL_DEVICE __forceinline__ __device__

inline constexpr auto kWarpThreads = 32u;
inline constexpr auto kFullMask = 0xffffffffu;

template<bool kUsePDL>
SGL_DEVICE void PDLWaitPrimary() {
#ifndef USE_ROCM
if constexpr (kUsePDL) { asm volatile("griddepcontrol.wait;" ::: "memory"); }
#endif
}

template<bool kUsePDL>
SGL_DEVICE void PDLTriggerSecondary() {
#ifndef USE_ROCM
if constexpr (kUsePDL) { asm volatile("griddepcontrol.launch_dependents;" :::); }
#endif
}

namespace pointer {

// we only allow void * pointer arithmetic for safety

template<typename T = char, std::integral... U>
SGL_DEVICE auto offset(void* ptr, U... offset) -> void* {
return static_cast<T*>(ptr) + (... + offset);
}

template<typename T = char, std::integral... U>
SGL_DEVICE auto offset(const void* ptr, U... offset) -> const void* {
return static_cast<const T*>(ptr) + (... + offset);
}

} // namespace pointer

} // namespace device
Contributor


⚠️ Potential issue | 🔴 Critical

Namespace mismatch: device and host should be under mllm_kernel.

The file declares namespace device and namespace host at global scope, but DebugInfo, panic, and other utilities are defined in mllm_kernel::host (via utils.hpp). Unqualified references to DebugInfo (Line 72, 79, 84, 88, 128) and panic (Line 75) will fail to resolve, causing compilation errors when this header is included.

Suggested fix — wrap in `mllm_kernel`
+namespace mllm_kernel {
+
 namespace device {
 // ... device code unchanged ...
 }  // namespace device
 
 namespace host {
 // ... host code unchanged ...
 }  // namespace host
+
+}  // namespace mllm_kernel

Also applies to: 70-132

🤖 Prompt for AI Agents
In `@mllm-kernel/include/mllm_kernel/utils.cuh` around lines 31 - 68, The file
currently defines top-level namespaces like device (and host ranges) causing
unresolved symbols such as DebugInfo and panic which live in mllm_kernel::host;
wrap the existing top-level namespace blocks inside namespace mllm_kernel (e.g.,
change namespace device { ... } to namespace mllm_kernel { namespace device {
... } } ), so symbols like PDLWaitPrimary, PDLTriggerSecondary, and
pointer::offset are declared under mllm_kernel::device, and unqualified
references to DebugInfo and panic resolve to mllm_kernel::host; alternatively,
if you prefer not to nest, explicitly qualify uses of DebugInfo and panic with
mllm_kernel::host::DebugInfo / mllm_kernel::host::panic or add a using
declaration inside the device namespace to bring them into scope.

Comment on lines +72 to +77
inline void RuntimeDeviceCheck(::cudaError_t error, DebugInfo location = {}) {
if (error != ::cudaSuccess) {
[[unlikely]];
::host::panic(location, "CUDA error: ", ::cudaGetErrorString(error));
}
}
Contributor


⚠️ Potential issue | 🟡 Minor

[[unlikely]] on empty statement has no effect.

[[unlikely]]; on Line 74 is an attribute applied to an empty statement — it does not affect the if branch or the subsequent panic call. To hint branch prediction, attach the attribute to the if statement body.

Suggested fix
 inline void RuntimeDeviceCheck(::cudaError_t error, DebugInfo location = {}) {
-  if (error != ::cudaSuccess) {
-    [[unlikely]];
-    ::host::panic(location, "CUDA error: ", ::cudaGetErrorString(error));
+  if (error != ::cudaSuccess) [[unlikely]] {
+    panic(location, "CUDA error: ", ::cudaGetErrorString(error));
   }
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
inline void RuntimeDeviceCheck(::cudaError_t error, DebugInfo location = {}) {
if (error != ::cudaSuccess) {
[[unlikely]];
::host::panic(location, "CUDA error: ", ::cudaGetErrorString(error));
}
}
inline void RuntimeDeviceCheck(::cudaError_t error, DebugInfo location = {}) {
if (error != ::cudaSuccess) [[unlikely]] {
::host::panic(location, "CUDA error: ", ::cudaGetErrorString(error));
}
}
🤖 Prompt for AI Agents
In `@mllm-kernel/include/mllm_kernel/utils.cuh` around lines 72 - 77, The
standalone [[unlikely]] attribute is applied to an empty statement and has no
effect; move it onto the if-statement so the branch hint applies to the error
path. Modify RuntimeDeviceCheck so the if reads: if (error != ::cudaSuccess)
[[unlikely]] { ::host::panic(location, "CUDA error: ",
::cudaGetErrorString(error)); } ensuring DebugInfo, ::host::panic and
::cudaGetErrorString are unchanged.

Comment on lines +4 to +34
#ifdef __CUDACC__
#include <cuda.h>
#if CUDA_VERSION <= 12010

#pragma push_macro("__cpp_consteval")
#pragma push_macro("_NODISCARD")
#pragma push_macro("__builtin_LINE")

#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wbuiltin-macro-redefined"
#define __cpp_consteval 201811L
#pragma clang diagnostic pop

#ifdef _NODISCARD
#undef _NODISCARD
#define _NODISCARD
#endif

#define consteval constexpr

#include "source_location.h"

#undef consteval
#pragma pop_macro("__cpp_consteval")
#pragma pop_macro("_NODISCARD")
#else // __CUDACC__ && CUDA_VERSION > 12010
#include "source_location.hpp"
#endif
#else // no __CUDACC__
#include "source_location.hpp"
#endif
Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Search for source_location files in the repository
fd 'source_location'

Repository: UbiquitousLearning/mllm

Length of output: 118


🏁 Script executed:

# Check if source_location.h exists specifically
find . -name "source_location.h" -type f

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Get directory structure around include files
git ls-files | grep -E 'source_location|utils\.hpp' | head -20

Repository: UbiquitousLearning/mllm

Length of output: 160


🏁 Script executed:

# Check for any CUDA-related headers or documentation
fd 'source_location' -t f | xargs ls -la 2>/dev/null || true

Repository: UbiquitousLearning/mllm

Length of output: 166


🏁 Script executed:

cat -n mllm-kernel/include/mllm_kernel/utils.hpp | head -35

Repository: UbiquitousLearning/mllm

Length of output: 1120


🏁 Script executed:

# Check if there's any documentation or comments explaining the .h vs .hpp distinction
cat -n mllm-kernel/include/mllm_kernel/source_location.hpp

Repository: UbiquitousLearning/mllm

Length of output: 973


Fix build failure for CUDA ≤ 12010: source_location.h doesn't exist.

Line 24 includes "source_location.h", but only source_location.hpp exists in the repository. CUDA builds with version ≤ 12010 will fail with a missing header error. Change line 24 to #include "source_location.hpp" to match the existing file.

🤖 Prompt for AI Agents
In `@mllm-kernel/include/mllm_kernel/utils.hpp` around lines 4 - 34, The
conditional CUDA preprocessor block that handles CUDA_VERSION <= 12010 includes
a non-existent header "source_location.h", causing build failures; in the branch
guarded by __CUDACC__ and CUDA_VERSION <= 12010 (the block that
defines/undefines consteval and manipulates __cpp_consteval/_NODISCARD), replace
the include of "source_location.h" with "source_location.hpp" so the correct
header is used for the source_location implementation.

Comment on lines +130 to +133
template<std::integral T, std::integral U>
inline constexpr auto div_ceil(T a, U b) {
return (a + b - 1) / b;
}
Contributor


⚠️ Potential issue | 🟡 Minor

div_ceil has signed-overflow UB for large inputs.

a + b - 1 can overflow when both a and b are large positive signed integers, which is undefined behavior. A safer formulation avoids the intermediate addition.

Suggested fix
 template<std::integral T, std::integral U>
 inline constexpr auto div_ceil(T a, U b) {
-  return (a + b - 1) / b;
+  return a / b + (a % b != 0 ? 1 : 0);
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
template<std::integral T, std::integral U>
inline constexpr auto div_ceil(T a, U b) {
return (a + b - 1) / b;
}
template<std::integral T, std::integral U>
inline constexpr auto div_ceil(T a, U b) {
return a / b + (a % b != 0 ? 1 : 0);
}
🤖 Prompt for AI Agents
In `@mllm-kernel/include/mllm_kernel/utils.hpp` around lines 130 - 133, div_ceil
can invoke signed integer overflow via the expression (a + b - 1) when a and b
are large; change div_ceil to first promote inputs to a common integral type
(e.g., std::common_type_t<T,U>), then compute the result without addition that
could overflow by using quotient-plus-remainder logic: compute q = aa / bb and r
= aa % bb and return q + (r != 0); update the function signature (div_ceil) to
use the promoted variables aa/bb so mixed signed/unsigned types behave
consistently and avoid intermediate overflow.

return (a + b - 1) / b;
}

inline auto dtype_bytes(DLDataType dtype) -> std::size_t { return static_cast<std::size_t>(dtype.bits / 8); }
Contributor


⚠️ Potential issue | 🟡 Minor

dtype_bytes returns 0 for sub-byte types (e.g., 4-bit quantized).

Integer division bits / 8 truncates, so any DLDataType with bits < 8 yields 0 bytes. If sub-byte quantized types are expected in the future, this could silently produce incorrect pointer arithmetic or buffer sizes downstream.

Consider at minimum a debug assertion, or use div_ceil here:

inline auto dtype_bytes(DLDataType dtype) -> std::size_t {
  return static_cast<std::size_t>(div_ceil(dtype.bits, 8u));
}
🤖 Prompt for AI Agents
In `@mllm-kernel/include/mllm_kernel/utils.hpp` at line 135, The helper
dtype_bytes currently does integer-divide dtype.bits by 8 which yields 0 for
sub-byte types; update dtype_bytes (involving DLDataType) to compute bytes with
ceiling division (e.g., use or implement div_ceil(dtype.bits, 8u)) or add a
debug/assert to catch bits<8 so it never returns 0; ensure the change returns at
least 1 for any positive bit-width and update any dependent calculations that
assume full-byte sizes.

Comment on lines +54 to +60
status = subprocess.run(
args=["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
capture_output=True,
check=True,
)
compute_cap = status.stdout.decode("utf-8").strip().split("\n")[0]
major, minor = compute_cap.split(".")
Contributor


⚠️ Potential issue | 🔴 Critical

nvidia-smi failure crashes the entire function on systems without an NVIDIA GPU.

check=True raises CalledProcessError if nvidia-smi returns non-zero, and if nvidia-smi isn't installed at all, a FileNotFoundError is raised. Since the function already gracefully handles a missing CUDA installation (lines 77-78), it should also handle the absence of nvidia-smi/GPU — just skip CUDA flags entirely.

🐛 Proposed fix: wrap nvidia-smi in a try/except and make CUDA flags conditional
-    status = subprocess.run(
-        args=["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
-        capture_output=True,
-        check=True,
-    )
-    compute_cap = status.stdout.decode("utf-8").strip().split("\n")[0]
-    major, minor = compute_cap.split(".")
+    compute_cap = None
+    try:
+        status = subprocess.run(
+            args=["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
+            capture_output=True,
+            check=True,
+        )
+        raw = status.stdout.decode("utf-8").strip().split("\n")[0]
+        major, minor = raw.split(".")
+        compute_cap = (major, minor)
+    except (subprocess.CalledProcessError, FileNotFoundError, ValueError):
+        logger.warning("Could not query GPU compute capability; CUDA flags will be skipped.")

Then, later, make the CUDA flag block conditional on compute_cap is not None:

-    cuda_flags = [
-        "-xcuda",
-        f"--cuda-gpu-arch=sm_{major}{minor}",
-    ]
-
-    cuda_path = _find_cuda_path()
-    if cuda_path is not None:
-        cuda_flags.append(f"--cuda-path={cuda_path}")
-        logger.info(f"Detected CUDA installation at: {cuda_path}")
-    else:
-        logger.warning("CUDA installation not found; skip --cuda-path")
+    cuda_flags = []
+    if compute_cap is not None:
+        major, minor = compute_cap
+        cuda_flags = [
+            "-xcuda",
+            f"--cuda-gpu-arch=sm_{major}{minor}",
+        ]
+        cuda_path = _find_cuda_path()
+        if cuda_path is not None:
+            cuda_flags.append(f"--cuda-path={cuda_path}")
+            logger.info(f"Detected CUDA installation at: {cuda_path}")
+        else:
+            logger.warning("CUDA installation not found; skip --cuda-path")

And conditionally emit the CUDA section of the .clangd file only when cuda_flags is non-empty.

🤖 Prompt for AI Agents
In `@mllm-kernel/mllm_kernel/__main__.py` around lines 54 - 60, The call to
subprocess.run("nvidia-smi", ...) can raise CalledProcessError or
FileNotFoundError and must be wrapped in a try/except: catch CalledProcessError
and FileNotFoundError around the subprocess.run and on exception set compute_cap
= None (or similar sentinel) instead of letting the exception propagate; only
parse compute_cap into major, minor if compute_cap is not None; build cuda_flags
only when compute_cap is available and non-empty; finally, when writing the
.clangd content, conditionally emit the CUDA-related section only if cuda_flags
is non-empty so systems without nvidia-smi/GPU skip CUDA flags gracefully.

Comment on lines +83 to +94
clangd_content = f"""
CompileFlags:
Add:
{render_flags(base_flags)}

---
If:
PathMatch: .*\\\\.(cu|cuh)$
CompileFlags:
Add:
{render_flags(cuda_flags)}
"""
Contributor


⚠️ Potential issue | 🔴 Critical

PathMatch regex has an extra backslash — it won't match .cu/.cuh files on Linux.

.*\\\\.(cu|cuh)$ in the Python f-string produces .*\\.(cu|cuh)$ in the output file. In the regex, \\ matches a literal backslash character, so the pattern requires a \ before the extension — this won't match normal Unix paths like src/kernel.cu.

Use .*\\.(cu|cuh)$ in the Python source to produce .*\.(cu|cuh)$ in the file (escaped dot matching a literal .).

🐛 Proposed fix
-  PathMatch: .*\\\\.(cu|cuh)$
+  PathMatch: .*\\.(cu|cuh)$
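A two-line check makes the escaping concrete (the literals below are standalone illustrations, not the module's code):

```python
# "\\\\" in Python source is two backslash characters; "\\" is one.
print(f"PathMatch: .*\\\\.(cu|cuh)$")  # emits .*\\.(cu|cuh)$  -> requires a literal backslash in the path
print(f"PathMatch: .*\\.(cu|cuh)$")    # emits .*\.(cu|cuh)$   -> escaped dot, matches src/kernel.cu
```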
🤖 Prompt for AI Agents
In `@mllm-kernel/mllm_kernel/__main__.py` around lines 83 - 94, The PathMatch
regex in the clangd_content f-string is over-escaped and will only match paths
containing a literal backslash; update the string used in clangd_content so the
pattern is ".*\\.(cu|cuh)$" (i.e., use a single escaped backslash in the Python
source) so the emitted file contains ".*\.(cu|cuh)$" and will match normal Unix
.cu/.cuh files; change the pattern inside the f-string where clangd_content is
defined (the line containing PathMatch) — keeping render_flags, base_flags and
cuda_flags usage unchanged.

MLLM_KERNEL_TEMPLATE_DTYPE_MAP: dict[torch.dtype, str] = {
torch.float: "fp32_t",
torch.float16: "fp16_t",
torch.bfloat16: "bfp16_t",
Contributor


⚠️ Potential issue | 🔴 Critical

Typo: "bfp16_t" should be "bf16_t".

The C++ type alias defined in utils.cuh is bf16_t (Line 18: using bf16_t = __nv_bfloat16;), but the Python mapping uses "bfp16_t" (extra p). This will produce invalid C++ template arguments for any bfloat16 JIT kernel, causing a compilation failure at runtime.

Suggested fix
-    torch.bfloat16: "bfp16_t",
+    torch.bfloat16: "bf16_t",
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
torch.bfloat16: "bfp16_t",
torch.bfloat16: "bf16_t",
🤖 Prompt for AI Agents
In `@mllm-kernel/mllm_kernel/jit_utils/compile.py` at line 129, The mapping for
torch.bfloat16 is misspelled as "bfp16_t" and must be corrected to "bf16_t" so
generated C++ templates use the existing alias; update the mapping entry (where
torch.bfloat16 is mapped in the type-to-C++ mapping—e.g., the TYPE_TO_CTYPE /
similar dict in mllm_kernel.jit_utils.compile) to use "bf16_t" instead of
"bfp16_t" so the JIT kernel compiles against the utils.cuh alias (using bf16_t =
__nv_bfloat16).
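As a hypothetical illustration of why the alias must match the C++ side exactly (dtype_map and render_template_args below are stand-ins, not the PR's real helpers):

```python
import torch

dtype_map = {torch.float16: "fp16_t", torch.bfloat16: "bf16_t"}


def render_template_args(*dtypes: torch.dtype) -> str:
    return ", ".join(dtype_map[dt] for dt in dtypes)

# With the corrected alias, a bfloat16 instantiation renders as a valid C++ type:
print(f"add_constant<{render_template_args(torch.bfloat16)}>")  # add_constant<bf16_t>
```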

Comment thread mllm-kernel/README.md Outdated

```bash
# Clone and install in development mode
git clone https://github.com/mllm/mllm-kernel.git
```
Contributor


⚠️ Potential issue | 🟡 Minor

Incorrect repository URL in development installation instructions.

The clone URL https://github.com/mllm/mllm-kernel.git does not match the actual repository (UbiquitousLearning/mllm). This will confuse developers trying to set up a development environment.

Suggested fix
-git clone https://github.com/mllm/mllm-kernel.git
+git clone https://github.com/UbiquitousLearning/mllm.git
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
git clone https://github.com/mllm/mllm-kernel.git
git clone https://github.com/UbiquitousLearning/mllm.git
🤖 Prompt for AI Agents
In `@mllm-kernel/README.md` at line 29, The README's development install
instruction uses the wrong git clone URL string
"https://github.com/mllm/mllm-kernel.git"; update that clone command to the
correct repository URL "https://github.com/UbiquitousLearning/mllm.git" so
developers clone the proper repo (replace the incorrect URL in the README line
containing the git clone command).

- Added a new CUDA kernel for adding a constant to tensor elements, utilizing template specialization for optimization.
- Introduced JIT compilation support for the add_constant kernel, allowing compile-time constant specification.
- Updated README.md with instructions for generating the .clangd configuration.
- Refactored namespace usage in utils.cuh for consistency.
- Created new utility files for JIT compilation and added necessary imports in existing modules.
- Updated copyright notices to reflect the MLLM Team and changed licensing to MIT in various files.
- Enhanced README.md with current status and installation instructions for CPU and CUDA JIT paths.
- Added new utility files for CUDA tensor operations and improved the structure of existing modules.
- Introduced a new `tilelang` dependency for CUDA tensor language support.
Owner

@UbiquitousLearning UbiquitousLearning left a comment


LGTM

@chenghuaWang
Collaborator Author

python -m mllm_kernel show-config
=== mllm-kernel environment ===
[OS]
system: Linux
release: 5.15.0-124-generic
version: xxxxxxxx
platform: Linux-5.15.0-124-generic-x86_64-with-glibc2.35
machine: x86_64

[Python]
version: 3.10.19
executable: /opt/conda/bin/python

[CPU]
model: xxxxxxxx
logical_cores: xxxxxx
physical_cores: xxxxxxx

[CUDA]
CUDA_PATH/CUDA_HOME: /usr/local/cuda
nvcc_path: /usr/local/cuda/bin/nvcc
nvidia_smi_path: /usr/bin/nvidia-smi
nvcc_cuda_version: 12.8
nvidia_smi_driver_version: 570.124.06
nvidia_smi_cuda_version: 12.8
torch_installed: yes
torch_version: 2.9.1+cu128
torch_cuda_build_version: 12.8
torch_cuda_available: True
torch_cuda_device_count: 8
gpu[0]: NVIDIA xxxxxx, compute_capability=sm_90
gpu[1]: NVIDIA xxxxxx, compute_capability=sm_90
gpu[2]: NVIDIA xxxxxx, compute_capability=sm_90
gpu[3]: NVIDIA xxxxxx, compute_capability=sm_90
gpu[4]: NVIDIA xxxxxx, compute_capability=sm_90
gpu[5]: NVIDIA xxxxxx, compute_capability=sm_90
gpu[6]: NVIDIA xxxxxx, compute_capability=sm_90
gpu[7]: NVIDIA xxxxxx, compute_capability=sm_90

=== registered jit kernels ===
device | kernel               | cached
-------+----------------------+-------
cpu    | add_constant<16>     | no    
cpu    | add_constant<1>      | no    
cpu    | add_constant<2>      | no    
cpu    | add_constant<4>      | no    
cpu    | add_constant<8>      | no    
cpu    | add_constant_runtime | no    
cuda   | add_constant<16>     | yes   
cuda   | add_constant<1>      | no    
cuda   | add_constant<2>      | no    
cuda   | add_constant<4>      | no    
cuda   | add_constant<8>      | no    

…t utilities

- Introduced a new `.codespellrc` configuration file to manage spelling errors in the codebase.
- Expanded `README.md` with instructions for displaying runtime environment details and JIT registration/cache status.
- Updated licensing information in several header files to reflect the SGLang Team and changed to Apache License 2.0.
- Refactored CPU kernel implementations to improve clarity and maintainability, including the addition of new utility functions for constant addition.
- Enhanced JIT utilities to streamline kernel registration and compilation processes.
@chenghuaWang chenghuaWang merged commit c67485a into UbiquitousLearning:main Feb 16, 2026
4 checks passed