feat(thread-pool): implement HpcThreadPool for efficient CPU task management and update build configurations #531
Conversation
**Review skipped**

Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI. You can disable this status message in the CodeRabbit configuration.

**Walkthrough**

The pull request introduces a custom HPC thread pool infrastructure (`HpcThreadPool`).

**Changes**
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    participant App as Application
    participant Ctx as Context
    participant Backend as CPUBackend
    participant Pool as HpcThreadPool
    participant Worker as Worker Threads
    App->>Ctx: initializeContext()
    Ctx->>Backend: getThreadPool()/initThreadPool()
    Backend->>Pool: HpcThreadPool(num_threads)
    Pool->>Worker: spawn workers (except thread 0)
    App->>Backend: wakeupHpcThreadPool()
    Backend->>Pool: wakeup()
    Pool->>Worker: signal condition_variable
    App->>Backend: push(task, slot_idx)
    Backend->>Pool: splitTask(task)
    Pool->>Pool: assign task to slot
    Pool->>Pool: execute chunk on initiating thread
    Pool->>Worker: workers process remaining chunks
    Worker->>Worker: busy-wait on atomic_bool flags
    App->>Backend: idleHpcThreadPool()
    Backend->>Pool: idle()
    Pool->>Worker: wait on condition_variable (no work)
```
```mermaid
sequenceDiagram
    participant Code as Kernel Code
    participant Macro as MLLM_AUTO_PARALLEL_*
    participant Backend as CPUBackend
    participant Pool as HpcThreadPool
    Code->>Macro: MLLM_AUTO_PARALLEL_FOR_BEGIN(iter, start, end, step)
    Macro->>Backend: getThreadPool()
    Macro->>Backend: acquireTaskSlot()
    Macro->>Macro: create HpcThreadPoolTask with lambda
    Macro->>Macro: set task.start/end/step
    Macro->>Pool: push(task, task_slot_idx)
    Pool->>Pool: splitTask distributes work
    Code->>Macro: loop body executes per task
    Macro->>Macro: MLLM_AUTO_PARALLEL_FOR_END()
    Macro->>Backend: releaseTaskSlot(task_slot_idx)
```
**Estimated code review effort:** 🎯 4 (Complex) | ⏱️ ~60 minutes
**Pre-merge checks and finishing touches**

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 5
🧹 Nitpick comments (9)
mllm/mllm.hpp (1)
**145-164: Clarify and guard CPU thread-pool initialization semantics**

The conditional thread-pool init:

```cpp
auto host_backend = cpu::createCPUBackend();
#ifdef MLLM_KERNEL_USE_THREADS_VENDOR_MLLM
host_backend->initThreadPool(ctx.getCpuOpThreads());
#endif
ctx.registerBackend(host_backend);
```

looks structurally sound and matches the new CPUBackend API, but two small points are worth tightening:
- **Thread count validation:** consider asserting or clamping `ctx.getCpuOpThreads()` to `>= 1` before passing it to `initThreadPool` to avoid accidental zero/negative configuration.
- **Single-initialization assumption:** if `initializeContext()` can be called more than once in some flows, it may be safer for `CPUBackend::initThreadPool` to either guard against re-initialization (no-op if `thread_pool_` already set) or explicitly assert that it is only called once.

Both are minor, but they help avoid subtle lifecycle issues if configuration or context setup changes later.
mllm/models/qwen3/modeling_qwen3_service.hpp (2)
**34-81: Address "easily swappable parameters" in `makeRotaryPosEmbedding`**

The implementation of `makeRotaryPosEmbedding` looks correct, but the first two parameters:

```cpp
inline auto makeRotaryPosEmbedding(
    Tensor& position_ids,
    const Tensor& inv_freq,
    float attention_scaling = 1.0f)
```

are both `Tensor` (differing only in constness), which is what clang's `bugprone-easily-swappable-parameters` is complaining about.
Tensor(differing only in constness), which is what clang’sbugprone-easily-swappable-parametersis complaining about.To quiet that warning and make call sites harder to misuse, consider one of:
- Reordering and/or renaming to emphasize roles, e.g. `inv_freq` first, `position_ids` second, and/or:
- Wrapping one of them in a small strongly-typed struct (e.g. `struct PositionIds { Tensor value; };`) so the types differ, or
- If you're confident in the API, annotating/suppressing this specific warning for the function in your clang-tidy configuration.
Functionally this is fine; this is primarily about static-analysis noise and call-site clarity.
**395-498: Thread-pool lifecycle and generation limits in `streamGenerate`**

The new logic in `Qwen3Session::streamGenerate`:

- Wakes the CPU HPC thread pool at the start (`mllm::cpu::wakeupHpcThreadPool();`) and idles it at the end (`mllm::cpu::idleHpcThreadPool();`).
- Introduces `max_length` from the request, stores it in `args["max_length"]`, and uses it together with `package_cnt` to decide when to finish streaming.

A few focused points:
**Exception safety for pool idling**

If anything between wakeup and the final `idleHpcThreadPool()` throws, the pool may remain active longer than intended. If this matters for your resource model, consider an RAII helper or `std::unique_ptr` with a custom deleter to guarantee `idleHpcThreadPool()` on scope exit.

**Clarify semantics of `max_length` vs `package_cnt`**

Currently the termination condition is:

```cpp
if (idx == model_->cfg.eos_token_id || package_cnt + 1 >= max_length)
```

which effectively caps the number of callback invocations (packages). If `max_length` is conceptually a token limit in the underlying ARGeneration logic, this local cap might not match users' expectations. A brief comment here explaining that `max_length` is treated as a "max streamed chunks" limit would help future readers.

**Lambda capture set is tight and correct**

The extended capture `[this, &max_length, &request, &full_seq_idx, &package_cnt, &callback]` is reasonable; none of these outlive the call to `model_->streamGenerate` in the current design. If `streamGenerate` is ever changed to be asynchronous w.r.t. the stack frame, these references will need revisiting, so a short comment near the call site might be useful.

Overall the control flow is coherent; the points above are about robustness and clarity.
mllm/backends/cpu/CPUBackend.cpp (2)
**63-92: Thread-pool owner semantics are fine; consider light safeguards in `initThreadPool`**

The new lifecycle hooks:

```cpp
CPUBackend::~CPUBackend() {
  if (thread_pool_) { thread_pool_->__threadPoolDestroy(); }
}

void CPUBackend::initThreadPool(int32_t num_threads) {
  thread_pool_ = std::make_shared<HpcThreadPool>(num_threads);
  thread_pool_->activate();
  task_index_ = thread_pool_->acquireTaskSlot();
}
```

are internally consistent: the backend owns the pool and ensures destruction on backend teardown.
Two small robustness tweaks you might consider:
- Guard against repeated initialization: if `initThreadPool` is called twice, the first pool is dropped without an explicit destroy. Either assert `!thread_pool_` or call `__threadPoolDestroy()` on the existing pool before overwriting.
- Validate `num_threads`: adding a `MLLM_RT_ASSERT(num_threads > 0);` (or similar) at the top of `initThreadPool` would catch misconfigurations early.

These are low-cost and help avoid subtle lifecycle/resource issues.
**95-107: Harden `idleHpcThreadPool`/`wakeupHpcThreadPool` and rename `tp`**

The new helpers:

```cpp
void idleHpcThreadPool() {
  auto& ctx = Context::instance();
  auto host_bk = std::static_pointer_cast<::mllm::cpu::CPUBackend>(ctx.getBackend(kCPU));
  auto tp = host_bk->getThreadPool();
  if (tp) { tp->idle(); }
}

void wakeupHpcThreadPool() {
  auto& ctx = Context::instance();
  auto host_bk = std::static_pointer_cast<::mllm::cpu::CPUBackend>(ctx.getBackend(kCPU));
  auto tp = host_bk->getThreadPool();
  if (tp) { tp->wakeup(); }
}
```

work logically, but a couple of small improvements would make them safer and satisfy static analysis:
- Null backend guard: if `getBackend(kCPU)` can ever fail or return a non-CPU backend (e.g., misconfiguration), `std::static_pointer_cast` is undefined. A defensive check around `ctx.getBackend(kCPU)` (or using `dynamic_pointer_cast` with an assert) would make this more robust.
- Rename `tp`: to address the identifier-length warning and improve readability, rename `tp` to something like `thread_pool`:

```cpp
auto thread_pool = host_bk->getThreadPool();
if (thread_pool) { thread_pool->idle(); }
```

Same for `wakeupHpcThreadPool`.

mllm/core/Parallel.hpp (1)
**3-10: Header dependencies and vendor guard assumptions**

Including `Context.hpp` and `CPUBackend.hpp` here makes `Parallel.hpp` depend directly on the engine and CPU backend. That's OK if this header is only used in core/engine, but it will cause layering issues if included in low-level utility code or non-CPU builds.

Also, the top-level guard assumes exactly one of `MLLM_KERNEL_THREADS_VENDOR_APPLE_GCD`, `MLLM_KERNEL_THREADS_VENDOR_OPENMP`, or `MLLM_KERNEL_USE_THREADS_VENDOR_MLLM` is defined. If multiple vendors are enabled at once, the subsequent macro definitions will conflict.

If that's not guaranteed by the build system, consider either enforcing exclusivity with a `#error` or splitting the vendor blocks under a mutually exclusive `#elif` chain.

mllm/engine/HpcThreadPool.hpp (3)
**16-17: Use a typed constant instead of a macro for task slot limit**

`MLLM_HPC_THREAD_POOL_TASK_LIMITS` is effectively a small constant used to size vectors and loop bounds. A `constexpr int` in namespace scope is safer and clearer than a macro, and also avoids polluting the global macro namespace:

```diff
-#define MLLM_HPC_THREAD_POOL_TASK_LIMITS 2
+inline constexpr int kHpcThreadPoolTaskLimit = 2;
```

and then use `kHpcThreadPoolTaskLimit` throughout the implementation.
**65-73: Consider replacing raw `std::atomic_bool*` with value semantics**

`tasks_` stores `std::atomic_bool*` and the implementation manually allocates/frees these flags in the constructor/destructor. This works but is error-prone and makes ownership less obvious.

Given the fixed per-task, per-thread flag counts, it would be simpler to store them by value:

```diff
-  std::vector<bool> task_available_;
-  std::vector<std::pair<HpcThreadPoolTask, std::vector<std::atomic_bool*>>> tasks_;
+  std::vector<bool> task_available_;
+  std::vector<std::pair<HpcThreadPoolTask, std::vector<std::atomic_bool>>> tasks_;
```

and drop the `new`/`delete` logic in the `.cpp`. This reduces manual memory management and keeps everything RAII-managed.
**56-56: Avoid double-underscore in public method name**

The method `__threadPoolDestroy()` also uses a leading double underscore, which is reserved in C++. Renaming it to something like `threadPoolDestroy()` or making it a private helper and calling it from the destructor would avoid potential UB and better reflect its intent.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- `CMakeLists.txt` (2 hunks)
- `mllm/backends/cpu/CPUBackend.cpp` (2 hunks)
- `mllm/backends/cpu/CPUBackend.hpp` (1 hunks)
- `mllm/backends/cpu/kernels/arm/linear/kai_sme.hpp` (1 hunks)
- `mllm/core/Parallel.hpp` (2 hunks)
- `mllm/engine/HpcThreadPool.cpp` (1 hunks)
- `mllm/engine/HpcThreadPool.hpp` (1 hunks)
- `mllm/mllm.hpp` (1 hunks)
- `mllm/models/qwen3/modeling_qwen3_service.hpp` (4 hunks)
- `tasks/build_osx_apple_silicon.yaml` (1 hunks)
- `tasks/build_osx_apple_silicon_accelerate.yaml` (1 hunks)
- `tasks/build_osx_apple_silicon_compile_stack.yml` (1 hunks)
- `tasks/build_osx_apple_silicon_dbg.yaml` (1 hunks)
- `tasks/build_osx_apple_silicon_perf.yaml` (1 hunks)
- `tasks/build_osx_cli.yaml` (1 hunks)
- `tasks/build_sdk_android.yaml` (1 hunks)
- `tasks/build_sdk_osx_apple_silicon.yaml` (1 hunks)
🧰 Additional context used
🪛 Clang (14.0.6)
mllm/engine/HpcThreadPool.cpp
[error] 3-3: 'mllm/engine/HpcThreadPool.hpp' file not found
(clang-diagnostic-error)
[error] 16-16: redundant void argument list in function definition
(modernize-redundant-void-arg,-warnings-as-errors)
[error] 17-17: variable 'cpuset' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 18-18: variable 'thread' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
mllm/backends/cpu/CPUBackend.cpp
[error] 98-98: variable name 'tp' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 105-105: variable name 'tp' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
mllm/models/qwen3/modeling_qwen3_service.hpp
[error] 34-34: 2 adjacent parameters of 'makeRotaryPosEmbedding' of convertible types are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 441-441: variable 'package_cnt' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
mllm/core/Parallel.hpp
[error] 3-3: 'mllm/engine/Context.hpp' file not found
(clang-diagnostic-error)
mllm/backends/cpu/CPUBackend.hpp
[error] 12-12: class 'CPUBackend' defines a non-default destructor but does not define a copy constructor, a copy assignment operator, a move constructor or a move assignment operator
(cppcoreguidelines-special-member-functions,-warnings-as-errors)
mllm/engine/HpcThreadPool.hpp
[error] 8-8: 'mutex' file not found
(clang-diagnostic-error)
[error] 16-16: macro 'MLLM_HPC_THREAD_POOL_TASK_LIMITS' used to declare a constant; consider using a 'constexpr' constant
(cppcoreguidelines-macro-usage,-warnings-as-errors)
[error] 20-20: constructor does not initialize these fields: func, start, end
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 27-27: declaration uses identifier '__state_remap', which is a reserved identifier
(bugprone-reserved-identifier,-warnings-as-errors)
[error] 30-30: class 'HpcThreadPool' defines a non-default destructor but does not define a copy constructor, a copy assignment operator, a move constructor or a move assignment operator
(cppcoreguidelines-special-member-functions,-warnings-as-errors)
[error] 56-56: declaration uses identifier '__threadPoolDestroy', which is a reserved identifier
(bugprone-reserved-identifier,-warnings-as-errors)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-macos
- GitHub Check: build-android
🔇 Additional comments (11)
tasks/build_osx_apple_silicon_compile_stack.yml (1)
**11-13: Flag switch to MLLM thread vendor looks consistent**

The build now enables `MLLM_KERNEL_USE_THREADS_VENDOR_MLLM` with threads on and OpenMP off; this matches the new CMake option and the project-wide move to the MLLM thread pool. No issues spotted.

tasks/build_osx_apple_silicon_accelerate.yaml (1)
**8-12: Consistent adoption of MLLM thread vendor**

This config now enables `MLLM_KERNEL_USE_THREADS_VENDOR_MLLM` with OpenMP disabled, aligning with the new thread pool backend and other OSX Apple Silicon build tasks. Looks good.

tasks/build_sdk_osx_apple_silicon.yaml (1)
**10-13: SDK build correctly switched to MLLM thread vendor**

The SDK build now uses `MLLM_KERNEL_USE_THREADS_VENDOR_MLLM` while keeping OpenMP off. This is consistent with the updated CMake options and other OSX builds.

tasks/build_osx_apple_silicon.yaml (1)
**8-10: OSX Release build aligned with new thread vendor**

Using `MLLM_KERNEL_USE_THREADS_VENDOR_MLLM` here keeps the main OSX Release build consistent with the new thread-pool backend and with the other updated task files.

tasks/build_osx_apple_silicon_dbg.yaml (1)
**10-12: Debug build threading flags are consistent**

The debug configuration now uses `MLLM_KERNEL_USE_THREADS_VENDOR_MLLM` with OpenMP disabled, matching the release and perf builds. No problems seen.

mllm/backends/cpu/kernels/arm/linear/kai_sme.hpp (1)
**4-7: Non-functional documentation comment looks good**

The added "Supports" comments clarify which matmul variants this header covers, with no impact on behavior or interfaces. Change is straightforward.
CMakeLists.txt (1)
**45-49: Enforce mutual exclusivity among thread vendors at CMake configuration time**

The current defaults and CMake logic allow `MLLM_KERNEL_USE_THREADS_VENDOR_MLLM` (defaulting to `ON`) to coexist with `MLLM_KERNEL_THREADS_VENDOR_OPENMP` or `MLLM_KERNEL_THREADS_VENDOR_APPLE_GCD`. While platform-level forcing and build configs avoid overlap in practice, the defaults permit simultaneous activation.

In `mllm/core/Parallel.hpp`, the macro blocks use separate `#if` statements (not `#elif`), so when multiple vendors are defined, the last definition silently overwrites previous ones; the `MLLM_AUTO_PARALLEL_*` macros end up using whichever vendor was defined last, not the explicitly intended one.

Tighten this by adding a CMake-level validation in `CMakeLists.txt` to enforce exactly one vendor:

```cmake
set(VENDOR_COUNT 0)
if(MLLM_KERNEL_USE_THREADS_VENDOR_MLLM)
  math(EXPR VENDOR_COUNT "${VENDOR_COUNT} + 1")
endif()
if(MLLM_KERNEL_THREADS_VENDOR_OPENMP)
  math(EXPR VENDOR_COUNT "${VENDOR_COUNT} + 1")
endif()
if(MLLM_KERNEL_THREADS_VENDOR_APPLE_GCD)
  math(EXPR VENDOR_COUNT "${VENDOR_COUNT} + 1")
endif()
if(VENDOR_COUNT GREATER 1)
  message(FATAL_ERROR "Only one thread vendor may be enabled. Found ${VENDOR_COUNT}.")
endif()
```

This makes vendor selection predictable and prevents silent macro redefinition surprises.

Also applies to: lines 118-121 and related error checks in `mllm/CMakeLists.txt` and `mllm/backends/cpu/CMakeLists.txt`.

tasks/build_osx_cli.yaml (1)
**8-17: Thread vendor switch to MLLM looks correct**

The change to `-DMLLM_KERNEL_USE_THREADS_VENDOR_MLLM=ON` is consistent with the new internal thread-pool path and keeps the rest of the OSX CLI config intact. Please just ensure other OSX build YAMLs remain consistent with this choice.

If not already done, verify other OSX configs use the same vendor flag and that CMake's `MLLM_KERNEL_USE_THREADS_VENDOR_MLLM` option is respected in all relevant presets.

tasks/build_sdk_android.yaml (1)
**6-19: Android SDK threading flags align with new MLLM thread vendor**

The combination:

- `MLLM_KERNEL_USE_THREADS=ON`
- `MLLM_KERNEL_THREADS_VENDOR_OPENMP=OFF`
- `MLLM_KERNEL_USE_THREADS_VENDOR_MLLM=ON`

correctly migrates Android SDK builds to the internal thread-pool vendor. No YAML issues spotted. It would be good to run at least one perf/regression pass comparing this against the previous OpenMP configuration to catch any platform-specific regressions.
tasks/build_osx_apple_silicon_perf.yaml (1)
**5-16: Perf preset correctly opts into MLLM thread vendor**

Enabling `-DMLLM_KERNEL_USE_THREADS_VENDOR_MLLM=ON` here is consistent with the rest of the thread-vendor migration and keeps OpenMP explicitly disabled. Looks good from a config perspective; just ensure perf baselines for this preset are updated/checked under the new thread-pool backend.

mllm/engine/HpcThreadPool.cpp (1)
**161-204: Thread pool ctor/dtor lifetime assumptions**

The constructor pre-allocates per-task, per-thread completion flags and spawns worker threads for indices `[1, thread_cnt_)`, and the destructor sets `stop_`, wakes all workers, joins them, and then deletes the allocated `std::atomic_bool` objects. This is structurally sound, but it relies on:

- No tasks being in flight when the destructor begins (otherwise user code may still be running during teardown).
- External users balancing `activate()`/`deactivate()` and `acquireTaskSlot()`/`releaseTaskSlot()` so that `available_task_slots_` and `tasks_[]` are not left in an inconsistent state.

Please double-check that CPUBackend and model code never destroy the pool while tasks are outstanding and that slot management is strictly paired.
```cpp
#include "mllm/engine/HpcThreadPool.hpp"

namespace mllm::cpu {

class CPUBackend final : public Backend {
 public:
  ~CPUBackend();

  CPUBackend();

  HpcThreadPool::ptr_t getThreadPool();

  void initThreadPool(int32_t num_threads);

  int32_t taskIndex();

 private:
  int32_t task_index_ = -1;
  HpcThreadPool::ptr_t thread_pool_ = nullptr;
};
```
**Make CPUBackend explicitly non-copyable/movable**
Adding a non-trivial destructor:
```cpp
class CPUBackend final : public Backend {
 public:
  ~CPUBackend();
  CPUBackend();
  HpcThreadPool::ptr_t getThreadPool();
  void initThreadPool(int32_t num_threads);
  int32_t taskIndex();

 private:
  int32_t task_index_ = -1;
  HpcThreadPool::ptr_t thread_pool_ = nullptr;
};
```

triggers the clang guideline about special member functions. Semantically, copying or moving a backend that owns a thread pool would be problematic anyway, so it's clearer to make this explicit:
```cpp
class CPUBackend final : public Backend {
 public:
  CPUBackend();
  ~CPUBackend();

  CPUBackend(const CPUBackend&) = delete;
  CPUBackend& operator=(const CPUBackend&) = delete;
  CPUBackend(CPUBackend&&) = delete;
  CPUBackend& operator=(CPUBackend&&) = delete;

  HpcThreadPool::ptr_t getThreadPool();
  void initThreadPool(int32_t num_threads);
  int32_t taskIndex();

 private:
  int32_t task_index_ = -1;
  HpcThreadPool::ptr_t thread_pool_ = nullptr;
};
```

This both matches the intended ownership model and silences the static-analysis error.
Also applies to: 29-34
🧰 Tools
🪛 Clang (14.0.6)
[error] 12-12: class 'CPUBackend' defines a non-default destructor but does not define a copy constructor, a copy assignment operator, a move constructor or a move assignment operator
(cppcoreguidelines-special-member-functions,-warnings-as-errors)
🤖 Prompt for AI Agents
In mllm/backends/cpu/CPUBackend.hpp around lines 8-27 (and also apply same
change to lines 29-34), the class defines a non-trivial destructor but does not
declare special member functions, which allows accidental copy/move of an object
that owns a thread pool; explicitly mark the copy constructor, copy assignment,
move constructor and move assignment as deleted by adding declarations for
CPUBackend(const CPUBackend&) = delete; CPUBackend& operator=(const CPUBackend&)
= delete; CPUBackend(CPUBackend&&) = delete; CPUBackend& operator=(CPUBackend&&)
= delete; in the public section of the class (keeping the existing default
constructor and destructor) so the class is non-copyable and non-movable.
```cpp
#ifdef MLLM_KERNEL_USE_THREADS_VENDOR_MLLM
#include "mllm/engine/HpcThreadPool.hpp"

#define MLLM_AUTO_PARALLEL_BEGIN(__iter__, __num__)                                            \
  {                                                                                            \
    auto host_bk = std::static_pointer_cast<::mllm::cpu::CPUBackend>(                          \
        ::mllm::Context::instance().getBackend(::mllm::DeviceTypes::kCPU));                    \
    ::mllm::HpcThreadPoolTask task;                                                            \
    task.start = 0;                                                                            \
    task.end = __num__;                                                                        \
    task.step = 1;                                                                             \
    task.func = [&](int __iter__) {
#define MLLM_AUTO_PARALLEL_END()                                                               \
  }                                                                                            \
  ;                                                                                            \
  host_bk->getThreadPool()->push(std::move(task), host_bk->taskIndex());                       \
  }

#define MLLM_AUTO_PARALLEL_FOR_BEGIN(__iter__, __start__, __end__, __step__)                   \
  {                                                                                            \
    auto host_bk = std::static_pointer_cast<::mllm::cpu::CPUBackend>(                          \
        ::mllm::Context::instance().getBackend(::mllm::DeviceTypes::kCPU));                    \
    ::mllm::HpcThreadPoolTask task;                                                            \
    task.start = __start__;                                                                    \
    task.end = __end__;                                                                        \
    task.step = __step__;                                                                      \
    task.func = [&](int __iter__) {
#define MLLM_AUTO_PARALLEL_FOR_END()                                                           \
  }                                                                                            \
  ;                                                                                            \
  host_bk->getThreadPool()->push(std::move(task), host_bk->taskIndex());                       \
  }

#define MLLM_AUTO_PARALLEL_FOR_BEGIN_NT(__iter__, __start__, __end__, __step__, __num_threads__) \
  {                                                                                            \
    auto host_bk = std::static_pointer_cast<::mllm::cpu::CPUBackend>(                          \
        ::mllm::Context::instance().getBackend(::mllm::DeviceTypes::kCPU));                    \
    ::mllm::HpcThreadPoolTask task;                                                            \
    task.start = __start__;                                                                    \
    task.end = __end__;                                                                        \
    task.step = __step__;                                                                      \
    task.func = [&](int __iter__) {
#define MLLM_AUTO_PARALLEL_FOR_END_NT()                                                        \
  }                                                                                            \
  ;                                                                                            \
  host_bk->getThreadPool()->push(std::move(task), host_bk->taskIndex());                       \
  }

#define MLLM_SET_NUM_THREADS(num_threads) \
  do { (void)(num_threads); } while (0)
#endif
```
🧩 Analysis chain
**HPC vendor macros: clarify preconditions and avoid fragile captures / ignored parameters**

The MLLM vendor macros correctly wrap a `HpcThreadPoolTask` and push it to the CPU backend's pool, but there are a few risks:

- `host_bk` and its thread pool are assumed to be non-null and fully initialised (`Context::instance().getBackend(kCPU)`, `host_bk->getThreadPool()`). If context initialisation or `initThreadPool` are skipped in some flows, these macros will crash on first use.
- The lambda uses `[&]` capture, which needlessly captures `host_bk` and any other locals by reference. It's safe today because the lambda body doesn't use `host_bk`, but it's fragile if future changes reference captured locals after the macro's scope ends.
- `MLLM_AUTO_PARALLEL_FOR_BEGIN_NT` ignores the `__num_threads__` argument entirely, which is a behavioural change vs. the OpenMP implementation and may surprise callers who expect per-call control over thread count.
Suggested adjustments:
```diff
-    auto host_bk = \
-        std::static_pointer_cast<::mllm::cpu::CPUBackend>(::mllm::Context::instance().getBackend(::mllm::DeviceTypes::kCPU)); \
+    auto host_bk = \
+        std::static_pointer_cast<::mllm::cpu::CPUBackend>(::mllm::Context::instance().getBackend(::mllm::DeviceTypes::kCPU)); \
+    MLLM_ASSERT(host_bk && host_bk->getThreadPool()); /* or equivalent runtime check */ \
@@
-    task.func = [&](int __iter__) {
+    task.func = [&, host_bk](int __iter__) {
@@
-#define MLLM_AUTO_PARALLEL_FOR_BEGIN_NT(__iter__, __start__, __end__, __step__, __num_threads__) \
+#define MLLM_AUTO_PARALLEL_FOR_BEGIN_NT(__iter__, __start__, __end__, __step__, __num_threads__) \
 {                                                                                               \
@@
-    task.step = __step__;                                                                        \
+    task.step = __step__;                                                                        \
+    (void)__num_threads__; /* currently unused; consider threading it into pool config if needed */ \
```

(Replace `MLLM_ASSERT` with whatever assertion/logging facility you use.)
At minimum, please document that `__num_threads__` is ignored for the MLLM vendor path and ensure the context/backend initialisation order guarantees `host_bk->getThreadPool()` is always valid before these macros are used.
---
🏁 Script executed:
```shell
#!/bin/bash
# Search for usage of the three HPC vendor macros in the codebase
echo "=== Searching for MLLM_AUTO_PARALLEL_BEGIN usage ==="
rg -nP 'MLLM_AUTO_PARALLEL_BEGIN\s*\(' --type=hpp --type=cpp
echo -e "\n=== Searching for MLLM_AUTO_PARALLEL_FOR_BEGIN usage ==="
rg -nP 'MLLM_AUTO_PARALLEL_FOR_BEGIN\s*\(' --type=hpp --type=cpp
echo -e "\n=== Searching for MLLM_AUTO_PARALLEL_FOR_BEGIN_NT usage ==="
rg -nP 'MLLM_AUTO_PARALLEL_FOR_BEGIN_NT\s*\(' --type=hpp --type=cpp
echo -e "\n=== Check for Context initialization and getBackend patterns ==="
rg -nC3 'getBackend' --type=hpp --type=cpp | head -50
echo -e "\n=== Search for CPUBackend and getThreadPool ==="
rg -nC2 'getThreadPool' --type=hpp --type=cpp | head -50
```

Length of output: 426
🏁 Script executed:

```shell
#!/bin/bash
# Search for OpenMP macro definitions to compare __num_threads__ handling
echo "=== Search for OpenMP parallel pragmas and macro definitions ==="
rg -nP 'MLLM_AUTO_PARALLEL' --type=hpp --type=cpp -C2 | grep -A5 -B5 'OpenMP\|omp_set_num_threads\|num_threads'
echo -e "\n=== Look for HpcThreadPool definition and thread count handling ==="
fd -e hpp -e cpp | xargs grep -l 'HpcThreadPool' | head -5
echo -e "\n=== Search for assertion/error checking patterns in similar code ==="
rg -nP 'MLLM_ASSERT|assert\s*\(' --type=hpp --type=cpp | head -30
```

Length of output: 408
🏁 Script executed:

```shell
#!/bin/bash
# Find HpcThreadPool definition
fd -type f \( -name '*HpcThreadPool*' -o -name '*ThreadPool*' \) | head -10
# Search for where thread count is handled in thread pool
rg -nC5 'num_threads' --type=hpp --type=cpp | grep -A5 -B5 'HpcThreadPool'
```

Length of output: 198
🌐 Web query:
Does C++ lambda capture `[&]` affect the safety of captured variables after scope ends?
💡 Result:
Short answer: Yes — capturing with [&] does NOT extend the lifetime of the referenced objects. If a lambda that captured by reference is invoked after the original objects go out of scope, you get a dangling reference / undefined behavior. [1][2]
What to do instead (brief):
- Return/store lambdas that need to outlive the scope: capture by value ([=] or explicit captures), or use init-capture (e.g. x = std::move(x)). [1][3]
- For shared ownership use smart pointers (std::shared_ptr) captured by value. [2]
References:
- cppreference on lambda captures and lifetimes.
- SEI CERT C++ guideline: lambda object must not outlive its reference-captured objects.
- Practical discussion / examples (StackOverflow and tutorials).
🏁 Script executed:

```shell
#!/bin/bash
# Search for macro usage without problematic --type flag
echo "=== MLLM_AUTO_PARALLEL macros usage ==="
rg -n 'MLLM_AUTO_PARALLEL_BEGIN\(' mllm/
echo -e "\n=== MLLM_AUTO_PARALLEL_FOR_BEGIN usage ==="
rg -n 'MLLM_AUTO_PARALLEL_FOR_BEGIN\(' mllm/
echo -e "\n=== MLLM_AUTO_PARALLEL_FOR_BEGIN_NT usage ==="
rg -n 'MLLM_AUTO_PARALLEL_FOR_BEGIN_NT\(' mllm/
echo -e "\n=== Check entire Parallel.hpp structure ==="
cat mllm/core/Parallel.hpp | head -150
```

Length of output: 13414
🏁 Script executed:

```shell
#!/bin/bash
# Examine HpcThreadPool implementation
echo "=== HpcThreadPool.hpp ==="
cat mllm/engine/HpcThreadPool.hpp
echo -e "\n=== HpcThreadPool.cpp (first 100 lines) ==="
head -100 mllm/engine/HpcThreadPool.cpp
```

Length of output: 3995
🏁 Script executed:

```shell
#!/bin/bash
# Check CPUBackend for getThreadPool and initialization
echo "=== CPUBackend.hpp snippet around getThreadPool ==="
rg -n -C5 'getThreadPool' mllm/backends/cpu/CPUBackend.hpp
echo -e "\n=== CPUBackend initialization and Context ==="
rg -n -C3 'initThreadPool\|ThreadPool\(' mllm/backends/cpu/CPUBackend.cpp | head -60
echo -e "\n=== Check Context implementation ==="
rg -n -C3 'Context::instance\|getBackend' mllm/ | head -80
```

Length of output: 323
🏁 Script executed:

```shell
#!/bin/bash
# Search for OpenMP #else branch in Parallel.hpp to compare __num_threads__ handling
echo "=== Full Parallel.hpp to see OpenMP version ==="
cat mllm/core/Parallel.hpp
echo -e "\n=== Search for omp_set_num_threads or num_threads in OpenMP section ==="
rg -n 'omp_set_num_threads\|pragma omp\|__num_threads__' mllm/core/Parallel.hpp
```

Length of output: 10376
🏁 Script executed:

```shell
#!/bin/bash
# Check if task lambda is executed synchronously or asynchronously
echo "=== HpcThreadPool push method and task execution ==="
rg -n -A10 'void push\|ThreadPoolTask' mllm/engine/HpcThreadPool.hpp
```

Length of output: 52
**Fix null checks, explicit lambda captures, and num_threads handling in HPC vendor macros**
The three concerns identified are all valid:
- **Null pointer risk:** The macros lack null checks on `Context::instance().getBackend()` and `getThreadPool()`. If context initialization is incomplete or skipped, the dereferences will crash.
- **Fragile lambda capture:** The `[&]` capture in the lambda doesn't extend object lifetimes. If the thread pool executes tasks asynchronously after the macro's scope ends, `host_bk` will be a dangling reference. Even if currently safe due to synchronous execution, this is fragile for future changes.
- **`__num_threads__` ignored:** The OpenMP path respects `__num_threads__` via `num_threads(__num_threads__)`, but the MLLM vendor path accepts but completely ignores it. Callers expecting per-call thread count control will see silent no-ops.

Required fixes:

- Add explicit null checks (assertion or runtime validation) after `getBackend()` and `getThreadPool()`.
- Change lambda capture from `[&]` to `[&, host_bk]` to explicitly capture by reference only what's needed and prevent accidental future dependencies on dangling locals.
- Either implement `__num_threads__` support in the thread pool or document prominently that it is ignored on the MLLM vendor path to prevent caller confusion.
🤖 Prompt for AI Agents
In mllm/core/Parallel.hpp lines 82-132, the vendor-threadpool macros lack null
checks on Context::instance().getBackend() and getThreadPool(), use a fragile
[&] lambda capture that can dangle, and ignore the __num_threads__ parameter;
fix by (1) adding runtime null checks (or assertions) immediately after
retrieving the backend and its thread pool and bail/throw with a clear message
if null, (2) change the lambda capture to capture host_bk by value (e.g.,
capture host_bk) so the backend pointer lifetime is preserved for async
execution, and (3) propagate the __num_threads__ argument into the thread-pool
API (either by setting a field on the task or passing it into
getThreadPool()->push) or, if the thread pool cannot support per-task thread
counts, explicitly document/emit a logged warning that __num_threads__ is
ignored on this vendor path so callers are not silently misled.
```cpp
#if defined(__gnu__linux__)
#include <pthread.h>
#include <sched.h>
#endif

#include <cstdint>

namespace mllm {

#if defined(__gnu_linux__)
static cpu_set_t getNumaAffinity(void) {
  cpu_set_t cpuset;
  pthread_t thread;
  thread = pthread_self();
  CPU_ZERO(&cpuset);
  pthread_getaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
  return cpuset;
}
#else
static uint32_t getNumaAffinity() {
  return 0;  // no NUMA support
}
#endif
```
Fix Linux platform macro mismatch and clean up NUMA helper

- Lines 6-9 use `__gnu__linux__` while lines 15-23 use `__gnu_linux__`. On GNU/Linux only `__gnu_linux__` (and `__linux__`) are standard; the current mismatch can skip the `pthread.h`/`sched.h` includes while still compiling the `getNumaAffinity` body, leading to undefined types and compilation failures on Linux.
- Also, the redundant `void` parameter list and the uninitialized-variable warnings on `cpuset`/`thread` are easy to address.
Consider something like:

```diff
-#if defined(__gnu__linux__)
-#include <pthread.h>
-#include <sched.h>
-#endif
+#if defined(__gnu_linux__) || defined(__linux__)
+#include <pthread.h>
+#include <sched.h>
+#endif
@@
-#if defined(__gnu_linux__)
-static cpu_set_t getNumaAffinity(void) {
-  cpu_set_t cpuset;
-  pthread_t thread;
-  thread = pthread_self();
+#if defined(__gnu_linux__) || defined(__linux__)
+static cpu_set_t getNumaAffinity() {
+  cpu_set_t cpuset{};
+  pthread_t thread = pthread_self();
   CPU_ZERO(&cpuset);
   pthread_getaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
   return cpuset;
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
#if defined(__gnu_linux__) || defined(__linux__)
#include <pthread.h>
#include <sched.h>
#endif

#include <cstdint>

namespace mllm {

#if defined(__gnu_linux__) || defined(__linux__)
static cpu_set_t getNumaAffinity() {
  cpu_set_t cpuset{};
  pthread_t thread = pthread_self();
  CPU_ZERO(&cpuset);
  pthread_getaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
  return cpuset;
}
#else
static uint32_t getNumaAffinity() {
  return 0;  // no NUMA support
}
#endif
```
🧰 Tools
🪛 Clang (14.0.6)
[error] 16-16: redundant void argument list in function definition
(modernize-redundant-void-arg,-warnings-as-errors)
[error] 17-17: variable 'cpuset' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 18-18: variable 'thread' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
```cpp
void HpcThreadPool::splitTask(HpcThreadPoolTask&& task, int task_slot_idx) {
  // There are no task slots, use this main thread to compute!
  if (available_task_slots_ == 0) {
    for (int i = task.start; i < task.end; i += task.step) { task.func(i); }
    return;
  }
  int tiles_num = (task.end - task.start) / task.step;

  int _cnt = 0;
  std::vector<int> true_idx(tiles_num);
  for (int i = task.start; i < task.end; i += task.step) { true_idx[_cnt++] = i; }

  // When tiles_num > thread_cnt. We need to nested for loops. Which means static attach works to one thread.
  //
  // NOTE: Static dispatch may have performance issues in bit.LITTLE CPU Arch.
  //
  // NOTE: The dispatch logic is below:
  // e.g.: threads is 4, tiles_name is 12.
  // 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3
  if (tiles_num > thread_cnt_) {
    tasks_[task_slot_idx].first = {
        .start = 0,
        .end = thread_cnt_,
        .step = 1,
        .func =
            [tiles_num, &task, &true_idx, this](int thread_idx) {
              for (int v = thread_idx; v < tiles_num; v += thread_cnt_) { task.func(true_idx[v]); }
            },
    };
    tiles_num = thread_cnt_;
  } else {
    tasks_[task_slot_idx].first = {
        .start = 0,
        .end = tiles_num,
        .step = 1,
        .func = [tiles_num, &task, &true_idx, this](int thread_idx) { task.func(true_idx[thread_idx]); },
    };
  }
  {
    for (int i = 1; i < tiles_num; ++i) { *tasks_[task_slot_idx].second[i] = true; }
  }

  // Explain why we start a 0 idx function.
  //
  // FIXME: The main thread is also used for compute or NOT ?
  tasks_[task_slot_idx].first.func(true_idx[0]);

  // Wait for all threads to complete
  bool complete = true;
  do {
    complete = true;
    for (int i = 1; i < tiles_num; ++i) {
      if (*tasks_[task_slot_idx].second[i]) {
        complete = false;
        break;
      }
    }
    std::this_thread::yield();
  } while (!complete);
}
```
Incorrect main-thread dispatch index in splitTask can cause OOB access / wrong work partitioning

In `splitTask`, the main thread calls:

```cpp
tasks_[task_slot_idx].first.func(true_idx[0]);
```

but `func` expects a thread index, not a tile value:

- In the `tiles_num > thread_cnt_` branch, the lambda interprets its argument as `thread_idx` and iterates tiles `v = thread_idx; v < tiles_num; v += thread_cnt_`. Passing `true_idx[0]` (which is `task.start`) instead of `0` breaks this mapping when `task.start != 0`.
- In the `tiles_num <= thread_cnt_` branch, the lambda uses `true_idx[thread_idx]`. If `task.start > 0`, then `true_idx[0]` may be larger than or equal to `tiles_num`, causing an out-of-bounds access on `true_idx`.

The main thread should use logical thread index 0, not the first tile value:

```diff
-  // FIXME: The main thread is also used for compute or NOT ?
-  tasks_[task_slot_idx].first.func(true_idx[0]);
+  // Main thread executes the work for logical thread index 0.
+  tasks_[task_slot_idx].first.func(0);
```

Also note that this path busy-waits with `std::this_thread::yield()` on all worker flags; if this becomes a hotspot, you may want to introduce a condition-variable or barrier-based completion mechanism instead of pure spinning.
🤖 Prompt for AI Agents
mllm/engine/HpcThreadPool.cpp around lines 100-159: the main thread currently
calls tasks_[task_slot_idx].first.func(true_idx[0]) which passes a tile value
instead of a logical thread index, causing incorrect partitioning and possible
out‑of‑bounds access; change that call to tasks_[task_slot_idx].first.func(0) so
the main thread uses thread index 0; additionally, ensure any comments/reference
to why index 0 is used are updated and consider replacing the spin-wait
(std::this_thread::yield()) with a condition-variable or barrier later if
spinning becomes a hotspot.
```cpp
#define MLLM_HPC_THREAD_POOL_TASK_LIMITS 2

namespace mllm {

struct HpcThreadPoolTask {
  std::function<void(int)> func;
  int start;
  int end;
  int step = 1;

  // Things that will be modified by the thread pool
  bool __state_remap = false;
};
```
Initialisation and naming in HpcThreadPoolTask

- `HpcThreadPoolTask` leaves `start` and `end` uninitialised on default construction. The new macros always assign them before use, but any other caller that default-constructs the task and forgets to set both may hit undefined behaviour.
- The field name `__state_remap` uses a leading double underscore, which is reserved for the implementation in C++.

Consider giving the struct safe in-class initialisers and renaming the flag:
```diff
-struct HpcThreadPoolTask {
-  std::function<void(int)> func;
-  int start;
-  int end;
-  int step = 1;
-
-  // Things that will be modified by the thread pool
-  bool __state_remap = false;
-};
+struct HpcThreadPoolTask {
+  std::function<void(int)> func{};
+  int start = 0;
+  int end = 0;
+  int step = 1;
+
+  // Things that will be modified by the thread pool
+  bool state_remap = false;
+};
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
#define MLLM_HPC_THREAD_POOL_TASK_LIMITS 2

namespace mllm {

struct HpcThreadPoolTask {
  std::function<void(int)> func{};
  int start = 0;
  int end = 0;
  int step = 1;

  // Things that will be modified by the thread pool
  bool state_remap = false;
};
```
🧰 Tools
🪛 Clang (14.0.6)
[error] 16-16: macro 'MLLM_HPC_THREAD_POOL_TASK_LIMITS' used to declare a constant; consider using a 'constexpr' constant
(cppcoreguidelines-macro-usage,-warnings-as-errors)
[error] 20-20: constructor does not initialize these fields: func, start, end
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 27-27: declaration uses identifier '__state_remap', which is a reserved identifier
(bugprone-reserved-identifier,-warnings-as-errors)
🤖 Prompt for AI Agents
In mllm/engine/HpcThreadPool.hpp around lines 16 to 28, HpcThreadPoolTask leaves
start and end uninitialised and uses a reserved identifier __state_remap; add
safe defaults for all members (e.g., std::function<void(int)> func = nullptr;
int start = 0; int end = 0; int step = 1;) or provide an explicit default
constructor that initializes them, and rename the flag to a non-reserved
identifier such as state_remap or is_state_remap (update all uses accordingly)
to avoid UB and reserved-name violations.
…and initialize task parameters
Summary by CodeRabbit
New Features
Chores