feat(thread-pool): implement HpcThreadPool for efficient CPU task management and update build configurations #531
Conversation
**Review skipped**

Auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI. You can disable this status message in the CodeRabbit configuration.

**Walkthrough**

The pull request introduces a custom HPC thread pool infrastructure (`HpcThreadPool`).

**Changes**
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    participant App as Application
    participant Ctx as Context
    participant Backend as CPUBackend
    participant Pool as HpcThreadPool
    participant Worker as Worker Threads
    App->>Ctx: initializeContext()
    Ctx->>Backend: getThreadPool()/initThreadPool()
    Backend->>Pool: HpcThreadPool(num_threads)
    Pool->>Worker: spawn workers (except thread 0)
    App->>Backend: wakeupHpcThreadPool()
    Backend->>Pool: wakeup()
    Pool->>Worker: signal condition_variable
    App->>Backend: push(task, slot_idx)
    Backend->>Pool: splitTask(task)
    Pool->>Pool: assign task to slot
    Pool->>Pool: execute chunk on initiating thread
    Pool->>Worker: workers process remaining chunks
    Worker->>Worker: busy-wait on atomic_bool flags
    App->>Backend: idleHpcThreadPool()
    Backend->>Pool: idle()
    Pool->>Worker: wait on condition_variable (no work)
```
```mermaid
sequenceDiagram
    participant Code as Kernel Code
    participant Macro as MLLM_AUTO_PARALLEL_*
    participant Backend as CPUBackend
    participant Pool as HpcThreadPool
    Code->>Macro: MLLM_AUTO_PARALLEL_FOR_BEGIN(iter, start, end, step)
    Macro->>Backend: getThreadPool()
    Macro->>Backend: acquireTaskSlot()
    Macro->>Macro: create HpcThreadPoolTask with lambda
    Macro->>Macro: set task.start/end/step
    Macro->>Pool: push(task, task_slot_idx)
    Pool->>Pool: splitTask distributes work
    Code->>Macro: loop body executes per task
    Macro->>Macro: MLLM_AUTO_PARALLEL_FOR_END()
    Macro->>Backend: releaseTaskSlot(task_slot_idx)
```
**Estimated code review effort:** 🎯 4 (Complex) | ⏱️ ~60 minutes
**Pre-merge checks and finishing touches**

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 5
🧹 Nitpick comments (9)
mllm/mllm.hpp (1)
**145-164: Clarify and guard CPU thread-pool initialization semantics**

The conditional thread-pool init:

```cpp
auto host_backend = cpu::createCPUBackend();
#ifdef MLLM_KERNEL_USE_THREADS_VENDOR_MLLM
host_backend->initThreadPool(ctx.getCpuOpThreads());
#endif
ctx.registerBackend(host_backend);
```

looks structurally sound and matches the new CPUBackend API, but two small points are worth tightening:
- **Thread count validation:** consider asserting or clamping `ctx.getCpuOpThreads()` to `>= 1` before passing it to `initThreadPool` to avoid accidental zero/negative configuration.
- **Single-initialization assumption:** if `initializeContext()` can be called more than once in some flows, it may be safer for `CPUBackend::initThreadPool` to either guard against re-initialization (no-op if `thread_pool_` already set) or explicitly assert that it is only called once.

Both are minor, but they help avoid subtle lifecycle issues if configuration or context setup changes later.
mllm/models/qwen3/modeling_qwen3_service.hpp (2)
**34-81: Address "easily swappable parameters" in `makeRotaryPosEmbedding`**

The implementation of `makeRotaryPosEmbedding` looks correct, but the first two parameters:

```cpp
inline auto makeRotaryPosEmbedding(
    Tensor& position_ids,
    const Tensor& inv_freq,
    float attention_scaling = 1.0f)
```

are both `Tensor` (differing only in constness), which is what clang's `bugprone-easily-swappable-parameters` is complaining about.
Tensor(differing only in constness), which is what clang’sbugprone-easily-swappable-parametersis complaining about.To quiet that warning and make call sites harder to misuse, consider one of:
- Reordering and/or renaming to emphasize roles, e.g. `inv_freq` first, `position_ids` second, and/or:
- Wrapping one of them in a small strongly-typed struct (e.g. `struct PositionIds { Tensor value; };`) so the types differ, or
- If you're confident in the API, annotating/suppressing this specific warning for the function in your clang-tidy configuration.
Functionally this is fine; this is primarily about static-analysis noise and call-site clarity.
**395-498: Thread-pool lifecycle and generation limits in `streamGenerate`**

The new logic in `Qwen3Session::streamGenerate`:

- Wakes the CPU HPC thread pool at the start (`mllm::cpu::wakeupHpcThreadPool();`) and idles it at the end (`mllm::cpu::idleHpcThreadPool();`).
- Introduces `max_length` from the request, stores it in `args["max_length"]`, and uses it together with `package_cnt` to decide when to finish streaming.

A few focused points:
**Exception safety for pool idling**

If anything between wakeup and the final `idleHpcThreadPool()` throws, the pool may remain active longer than intended. If this matters for your resource model, consider an RAII helper or `std::unique_ptr` with a custom deleter to guarantee `idleHpcThreadPool()` on scope exit.

**Clarify semantics of `max_length` vs `package_cnt`**

Currently the termination condition is:

```cpp
if (idx == model_->cfg.eos_token_id || package_cnt + 1 >= max_length)
```

which effectively caps the number of callback invocations (packages). If `max_length` is conceptually a token limit in the underlying ARGeneration logic, this local cap might not match users' expectations. A brief comment here explaining that `max_length` is treated as a "max streamed chunks" limit would help future readers.

**Lambda capture set is tight and correct**

The extended capture `[this, &max_length, &request, &full_seq_idx, &package_cnt, &callback]` is reasonable; none of these outlive the call to `model_->streamGenerate` in the current design. If `streamGenerate` is ever changed to be asynchronous w.r.t. the stack frame, these references will need revisiting, so a short comment near the call site might be useful.

Overall the control flow is coherent; the points above are about robustness and clarity.
mllm/backends/cpu/CPUBackend.cpp (2)
**63-92: Thread-pool owner semantics are fine; consider light safeguards in `initThreadPool`**

The new lifecycle hooks:

```cpp
CPUBackend::~CPUBackend() {
  if (thread_pool_) { thread_pool_->__threadPoolDestroy(); }
}

void CPUBackend::initThreadPool(int32_t num_threads) {
  thread_pool_ = std::make_shared<HpcThreadPool>(num_threads);
  thread_pool_->activate();
  task_index_ = thread_pool_->acquireTaskSlot();
}
```

are internally consistent: the backend owns the pool and ensures destruction on backend teardown.
Two small robustness tweaks you might consider:
- Guard against repeated initialization: if `initThreadPool` is called twice, the first pool is dropped without an explicit destroy. Either assert `!thread_pool_` or call `__threadPoolDestroy()` on the existing pool before overwriting.
- Validate `num_threads`: adding a `MLLM_RT_ASSERT(num_threads > 0);` (or similar) at the top of `initThreadPool` would catch misconfigurations early.

These are low-cost and help avoid subtle lifecycle/resource issues.
**95-107: Harden `idleHpcThreadPool`/`wakeupHpcThreadPool` and rename `tp`**

The new helpers:

```cpp
void idleHpcThreadPool() {
  auto& ctx = Context::instance();
  auto host_bk = std::static_pointer_cast<::mllm::cpu::CPUBackend>(ctx.getBackend(kCPU));
  auto tp = host_bk->getThreadPool();
  if (tp) { tp->idle(); }
}

void wakeupHpcThreadPool() {
  auto& ctx = Context::instance();
  auto host_bk = std::static_pointer_cast<::mllm::cpu::CPUBackend>(ctx.getBackend(kCPU));
  auto tp = host_bk->getThreadPool();
  if (tp) { tp->wakeup(); }
}
```

work logically, but a couple of small improvements would make them safer and satisfy static analysis:
- Null backend guard: if `getBackend(kCPU)` can ever fail or return a non-CPU backend (e.g., misconfiguration), `std::static_pointer_cast` is undefined. A defensive check around `ctx.getBackend(kCPU)` (or using `dynamic_pointer_cast` with an assert) would make this more robust.
- Rename `tp`: to address the identifier-length warning and improve readability, rename `tp` to something like `thread_pool`:

```cpp
auto thread_pool = host_bk->getThreadPool();
if (thread_pool) { thread_pool->idle(); }
```

Same for `wakeupHpcThreadPool`.

mllm/core/Parallel.hpp (1)
**3-10: Header dependencies and vendor guard assumptions**

Including `Context.hpp` and `CPUBackend.hpp` here makes `Parallel.hpp` depend directly on the engine and CPU backend. That's OK if this header is only used in core/engine, but it will cause layering issues if included in low-level utility code or non-CPU builds.

Also, the top-level guard assumes exactly one of `MLLM_KERNEL_THREADS_VENDOR_APPLE_GCD`, `MLLM_KERNEL_THREADS_VENDOR_OPENMP`, or `MLLM_KERNEL_USE_THREADS_VENDOR_MLLM` is defined. If multiple vendors are enabled at once, the subsequent macro definitions will conflict.

If that's not guaranteed by the build system, consider either enforcing exclusivity with a `#error` or splitting the vendor blocks under a mutually exclusive `#elif` chain.

mllm/engine/HpcThreadPool.hpp (3)
**16-17: Use a typed constant instead of a macro for task slot limit**

`MLLM_HPC_THREAD_POOL_TASK_LIMITS` is effectively a small constant used to size vectors and loop bounds. A `constexpr int` in namespace scope is safer and clearer than a macro, and also avoids polluting the global macro namespace:

```diff
-#define MLLM_HPC_THREAD_POOL_TASK_LIMITS 2
+inline constexpr int kHpcThreadPoolTaskLimit = 2;
```

and then use `kHpcThreadPoolTaskLimit` throughout the implementation.
**65-73: Consider replacing raw `std::atomic_bool*` with value semantics**

`tasks_` stores `std::atomic_bool*` and the implementation manually allocates/frees these flags in the constructor/destructor. This works but is error-prone and makes ownership less obvious.

Given the fixed per-task, per-thread flag counts, it would be simpler to store them by value:

```diff
-  std::vector<bool> task_available_;
-  std::vector<std::pair<HpcThreadPoolTask, std::vector<std::atomic_bool*>>> tasks_;
+  std::vector<bool> task_available_;
+  std::vector<std::pair<HpcThreadPoolTask, std::vector<std::atomic_bool>>> tasks_;
```

and drop the `new`/`delete` logic in the `.cpp`. This reduces manual memory management and keeps everything RAII-managed.
**56-56: Avoid double-underscore in public method name**

The method `__threadPoolDestroy()` also uses a leading double underscore, which is reserved in C++. Renaming it to something like `threadPoolDestroy()` or making it a private helper and calling it from the destructor would avoid potential UB and better reflect its intent.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
- `CMakeLists.txt` (2 hunks)
- `mllm/backends/cpu/CPUBackend.cpp` (2 hunks)
- `mllm/backends/cpu/CPUBackend.hpp` (1 hunks)
- `mllm/backends/cpu/kernels/arm/linear/kai_sme.hpp` (1 hunks)
- `mllm/core/Parallel.hpp` (2 hunks)
- `mllm/engine/HpcThreadPool.cpp` (1 hunks)
- `mllm/engine/HpcThreadPool.hpp` (1 hunks)
- `mllm/mllm.hpp` (1 hunks)
- `mllm/models/qwen3/modeling_qwen3_service.hpp` (4 hunks)
- `tasks/build_osx_apple_silicon.yaml` (1 hunks)
- `tasks/build_osx_apple_silicon_accelerate.yaml` (1 hunks)
- `tasks/build_osx_apple_silicon_compile_stack.yml` (1 hunks)
- `tasks/build_osx_apple_silicon_dbg.yaml` (1 hunks)
- `tasks/build_osx_apple_silicon_perf.yaml` (1 hunks)
- `tasks/build_osx_cli.yaml` (1 hunks)
- `tasks/build_sdk_android.yaml` (1 hunks)
- `tasks/build_sdk_osx_apple_silicon.yaml` (1 hunks)
🧰 Additional context used
🪛 Clang (14.0.6)
mllm/engine/HpcThreadPool.cpp
[error] 3-3: 'mllm/engine/HpcThreadPool.hpp' file not found
(clang-diagnostic-error)
[error] 16-16: redundant void argument list in function definition
(modernize-redundant-void-arg,-warnings-as-errors)
[error] 17-17: variable 'cpuset' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 18-18: variable 'thread' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
mllm/backends/cpu/CPUBackend.cpp
[error] 98-98: variable name 'tp' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
[error] 105-105: variable name 'tp' is too short, expected at least 3 characters
(readability-identifier-length,-warnings-as-errors)
mllm/models/qwen3/modeling_qwen3_service.hpp
[error] 34-34: 2 adjacent parameters of 'makeRotaryPosEmbedding' of convertible types are easily swapped by mistake
(bugprone-easily-swappable-parameters,-warnings-as-errors)
[error] 441-441: variable 'package_cnt' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
mllm/core/Parallel.hpp
[error] 3-3: 'mllm/engine/Context.hpp' file not found
(clang-diagnostic-error)
mllm/backends/cpu/CPUBackend.hpp
[error] 12-12: class 'CPUBackend' defines a non-default destructor but does not define a copy constructor, a copy assignment operator, a move constructor or a move assignment operator
(cppcoreguidelines-special-member-functions,-warnings-as-errors)
mllm/engine/HpcThreadPool.hpp
[error] 8-8: 'mutex' file not found
(clang-diagnostic-error)
[error] 16-16: macro 'MLLM_HPC_THREAD_POOL_TASK_LIMITS' used to declare a constant; consider using a 'constexpr' constant
(cppcoreguidelines-macro-usage,-warnings-as-errors)
[error] 20-20: constructor does not initialize these fields: func, start, end
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 27-27: declaration uses identifier '__state_remap', which is a reserved identifier
(bugprone-reserved-identifier,-warnings-as-errors)
[error] 30-30: class 'HpcThreadPool' defines a non-default destructor but does not define a copy constructor, a copy assignment operator, a move constructor or a move assignment operator
(cppcoreguidelines-special-member-functions,-warnings-as-errors)
[error] 56-56: declaration uses identifier '__threadPoolDestroy', which is a reserved identifier
(bugprone-reserved-identifier,-warnings-as-errors)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-macos
- GitHub Check: build-android
🔇 Additional comments (11)
tasks/build_osx_apple_silicon_compile_stack.yml (1)
**11-13: Flag switch to MLLM thread vendor looks consistent**

The build now enables `MLLM_KERNEL_USE_THREADS_VENDOR_MLLM` with threads on and OpenMP off; this matches the new CMake option and the project-wide move to the MLLM thread pool. No issues spotted.

tasks/build_osx_apple_silicon_accelerate.yaml (1)
**8-12: Consistent adoption of MLLM thread vendor**

This config now enables `MLLM_KERNEL_USE_THREADS_VENDOR_MLLM` with OpenMP disabled, aligning with the new thread pool backend and other OSX Apple Silicon build tasks. Looks good.

tasks/build_sdk_osx_apple_silicon.yaml (1)
**10-13: SDK build correctly switched to MLLM thread vendor**

The SDK build now uses `MLLM_KERNEL_USE_THREADS_VENDOR_MLLM` while keeping OpenMP off. This is consistent with the updated CMake options and other OSX builds.

tasks/build_osx_apple_silicon.yaml (1)
**8-10: OSX Release build aligned with new thread vendor**

Using `MLLM_KERNEL_USE_THREADS_VENDOR_MLLM` here keeps the main OSX Release build consistent with the new thread-pool backend and with the other updated task files.

tasks/build_osx_apple_silicon_dbg.yaml (1)
**10-12: Debug build threading flags are consistent**

The debug configuration now uses `MLLM_KERNEL_USE_THREADS_VENDOR_MLLM` with OpenMP disabled, matching the release and perf builds. No problems seen.

mllm/backends/cpu/kernels/arm/linear/kai_sme.hpp (1)
**4-7: Non-functional documentation comment looks good**

The added "Supports" comments clarify which matmul variants this header covers, with no impact on behavior or interfaces. Change is straightforward.
CMakeLists.txt (1)
**45-49: Enforce mutual exclusivity among thread vendors at CMake configuration time**

The current defaults and CMake logic allow `MLLM_KERNEL_USE_THREADS_VENDOR_MLLM` (defaulting to `ON`) to coexist with `MLLM_KERNEL_THREADS_VENDOR_OPENMP` or `MLLM_KERNEL_THREADS_VENDOR_APPLE_GCD`. While platform-level forcing and build configs avoid overlap in practice, the defaults permit simultaneous activation.

In `mllm/core/Parallel.hpp`, the macro blocks use separate `#if` statements (not `#elif`), so when multiple vendors are defined, the last definition silently overwrites previous ones; the `MLLM_AUTO_PARALLEL_*` macros end up using whichever vendor was defined last, not the explicitly intended one.

Tighten this by adding a CMake-level validation in `CMakeLists.txt` to enforce exactly one vendor:

```cmake
set(VENDOR_COUNT 0)
if(MLLM_KERNEL_USE_THREADS_VENDOR_MLLM)
  math(EXPR VENDOR_COUNT "${VENDOR_COUNT} + 1")
endif()
if(MLLM_KERNEL_THREADS_VENDOR_OPENMP)
  math(EXPR VENDOR_COUNT "${VENDOR_COUNT} + 1")
endif()
if(MLLM_KERNEL_THREADS_VENDOR_APPLE_GCD)
  math(EXPR VENDOR_COUNT "${VENDOR_COUNT} + 1")
endif()
if(VENDOR_COUNT GREATER 1)
  message(FATAL_ERROR "Only one thread vendor may be enabled. Found ${VENDOR_COUNT}.")
endif()
```

This makes vendor selection predictable and prevents silent macro redefinition surprises.

Also applies to: lines 118-121 and related error checks in `mllm/CMakeLists.txt` and `mllm/backends/cpu/CMakeLists.txt`.

tasks/build_osx_cli.yaml (1)
**8-17: Thread vendor switch to MLLM looks correct**

The change to `-DMLLM_KERNEL_USE_THREADS_VENDOR_MLLM=ON` is consistent with the new internal thread-pool path and keeps the rest of the OSX CLI config intact. Please just ensure other OSX build YAMLs remain consistent with this choice.

If not already done, verify other OSX configs use the same vendor flag and that CMake's `MLLM_KERNEL_USE_THREADS_VENDOR_MLLM` option is respected in all relevant presets.

tasks/build_sdk_android.yaml (1)
**6-19: Android SDK threading flags align with new MLLM thread vendor**

The combination:

- `MLLM_KERNEL_USE_THREADS=ON`
- `MLLM_KERNEL_THREADS_VENDOR_OPENMP=OFF`
- `MLLM_KERNEL_USE_THREADS_VENDOR_MLLM=ON`

correctly migrates Android SDK builds to the internal thread-pool vendor. No YAML issues spotted. It would be good to run at least one perf/regression pass comparing this against the previous OpenMP configuration to catch any platform-specific regressions.
tasks/build_osx_apple_silicon_perf.yaml (1)
**5-16: Perf preset correctly opts into MLLM thread vendor**

Enabling `-DMLLM_KERNEL_USE_THREADS_VENDOR_MLLM=ON` here is consistent with the rest of the thread-vendor migration and keeps OpenMP explicitly disabled. Looks good from a config perspective; just ensure perf baselines for this preset are updated/checked under the new thread-pool backend.

mllm/engine/HpcThreadPool.cpp (1)
**161-204: Thread pool ctor/dtor lifetime assumptions**

The constructor pre-allocates per-task, per-thread completion flags and spawns worker threads for indices `[1, thread_cnt_)`, and the destructor sets `stop_`, wakes all workers, joins them, and then deletes the allocated `std::atomic_bool` objects. This is structurally sound, but it relies on:

- No tasks being in flight when the destructor begins (otherwise user code may still be running during teardown).
- External users balancing `activate()`/`deactivate()` and `acquireTaskSlot()`/`releaseTaskSlot()` so that `available_task_slots_` and `tasks_[]` are not left in an inconsistent state.

Please double-check that CPUBackend and model code never destroy the pool while tasks are outstanding and that slot management is strictly paired.
```cpp
#include "mllm/engine/HpcThreadPool.hpp"

namespace mllm::cpu {

class CPUBackend final : public Backend {
 public:
  ~CPUBackend();

  CPUBackend();

  HpcThreadPool::ptr_t getThreadPool();

  void initThreadPool(int32_t num_threads);

  int32_t taskIndex();

 private:
  int32_t task_index_ = -1;
  HpcThreadPool::ptr_t thread_pool_ = nullptr;
};
```
**Make CPUBackend explicitly non-copyable/movable**
Adding a non-trivial destructor:
```cpp
class CPUBackend final : public Backend {
 public:
  ~CPUBackend();
  CPUBackend();
  HpcThreadPool::ptr_t getThreadPool();
  void initThreadPool(int32_t num_threads);
  int32_t taskIndex();

 private:
  int32_t task_index_ = -1;
  HpcThreadPool::ptr_t thread_pool_ = nullptr;
};
```

triggers the clang guideline about special member functions. Semantically, copying or moving a backend that owns a thread pool would be problematic anyway, so it's clearer to make this explicit:
```cpp
class CPUBackend final : public Backend {
 public:
  CPUBackend();
  ~CPUBackend();

  CPUBackend(const CPUBackend&) = delete;
  CPUBackend& operator=(const CPUBackend&) = delete;
  CPUBackend(CPUBackend&&) = delete;
  CPUBackend& operator=(CPUBackend&&) = delete;

  HpcThreadPool::ptr_t getThreadPool();
  void initThreadPool(int32_t num_threads);
  int32_t taskIndex();

 private:
  int32_t task_index_ = -1;
  HpcThreadPool::ptr_t thread_pool_ = nullptr;
};
```

This both matches the intended ownership model and silences the static-analysis error.
Also applies to: 29-34
🧰 Tools
🪛 Clang (14.0.6)
[error] 12-12: class 'CPUBackend' defines a non-default destructor but does not define a copy constructor, a copy assignment operator, a move constructor or a move assignment operator
(cppcoreguidelines-special-member-functions,-warnings-as-errors)
🤖 Prompt for AI Agents
In mllm/backends/cpu/CPUBackend.hpp around lines 8-27 (and also apply same
change to lines 29-34), the class defines a non-trivial destructor but does not
declare special member functions, which allows accidental copy/move of an object
that owns a thread pool; explicitly mark the copy constructor, copy assignment,
move constructor and move assignment as deleted by adding declarations for
CPUBackend(const CPUBackend&) = delete; CPUBackend& operator=(const CPUBackend&)
= delete; CPUBackend(CPUBackend&&) = delete; CPUBackend& operator=(CPUBackend&&)
= delete; in the public section of the class (keeping the existing default
constructor and destructor) so the class is non-copyable and non-movable.
```cpp
#ifdef MLLM_KERNEL_USE_THREADS_VENDOR_MLLM
#include "mllm/engine/HpcThreadPool.hpp"

#define MLLM_AUTO_PARALLEL_BEGIN(__iter__, __num__)                                            \
  {                                                                                            \
    auto host_bk = std::static_pointer_cast<::mllm::cpu::CPUBackend>(                          \
        ::mllm::Context::instance().getBackend(::mllm::DeviceTypes::kCPU));                    \
    ::mllm::HpcThreadPoolTask task;                                                            \
    task.start = 0;                                                                            \
    task.end = __num__;                                                                        \
    task.step = 1;                                                                             \
    task.func = [&](int __iter__) {
#define MLLM_AUTO_PARALLEL_END()                                                               \
  }                                                                                            \
  ;                                                                                            \
  host_bk->getThreadPool()->push(std::move(task), host_bk->taskIndex());                       \
  }

#define MLLM_AUTO_PARALLEL_FOR_BEGIN(__iter__, __start__, __end__, __step__)                   \
  {                                                                                            \
    auto host_bk = std::static_pointer_cast<::mllm::cpu::CPUBackend>(                          \
        ::mllm::Context::instance().getBackend(::mllm::DeviceTypes::kCPU));                    \
    ::mllm::HpcThreadPoolTask task;                                                            \
    task.start = __start__;                                                                    \
    task.end = __end__;                                                                        \
    task.step = __step__;                                                                      \
    task.func = [&](int __iter__) {
#define MLLM_AUTO_PARALLEL_FOR_END()                                                           \
  }                                                                                            \
  ;                                                                                            \
  host_bk->getThreadPool()->push(std::move(task), host_bk->taskIndex());                       \
  }

#define MLLM_AUTO_PARALLEL_FOR_BEGIN_NT(__iter__, __start__, __end__, __step__, __num_threads__) \
  {                                                                                            \
    auto host_bk = std::static_pointer_cast<::mllm::cpu::CPUBackend>(                          \
        ::mllm::Context::instance().getBackend(::mllm::DeviceTypes::kCPU));                    \
    ::mllm::HpcThreadPoolTask task;                                                            \
    task.start = __start__;                                                                    \
    task.end = __end__;                                                                        \
    task.step = __step__;                                                                      \
    task.func = [&](int __iter__) {
#define MLLM_AUTO_PARALLEL_FOR_END_NT()                                                        \
  }                                                                                            \
  ;                                                                                            \
  host_bk->getThreadPool()->push(std::move(task), host_bk->taskIndex());                       \
  }

#define MLLM_SET_NUM_THREADS(num_threads) \
  do { (void)(num_threads); } while (0)
#endif
```
🧩 Analysis chain
**HPC vendor macros: clarify preconditions and avoid fragile captures / ignored parameters**

The MLLM vendor macros correctly wrap a `HpcThreadPoolTask` and push it to the CPU backend's pool, but there are a few risks:

- `host_bk` and its thread pool are assumed to be non-null and fully initialised (`Context::instance().getBackend(kCPU)`, `host_bk->getThreadPool()`). If context initialisation or `initThreadPool` are skipped in some flows, these macros will crash on first use.
- The lambda uses `[&]` capture, which needlessly captures `host_bk` and any other locals by reference. It's safe today because the lambda body doesn't use `host_bk`, but it's fragile if future changes reference captured locals after the macro's scope ends.
- `MLLM_AUTO_PARALLEL_FOR_BEGIN_NT` ignores the `__num_threads__` argument entirely, which is a behavioural change vs. the OpenMP implementation and may surprise callers who expect per-call control over thread count.
Suggested adjustments:
```diff
-    auto host_bk = \
-        std::static_pointer_cast<::mllm::cpu::CPUBackend>(::mllm::Context::instance().getBackend(::mllm::DeviceTypes::kCPU)); \
+    auto host_bk = \
+        std::static_pointer_cast<::mllm::cpu::CPUBackend>(::mllm::Context::instance().getBackend(::mllm::DeviceTypes::kCPU)); \
+    MLLM_ASSERT(host_bk && host_bk->getThreadPool()); /* or equivalent runtime check */ \
@@
-    task.func = [&](int __iter__) {
+    task.func = [&, host_bk](int __iter__) {
@@
-#define MLLM_AUTO_PARALLEL_FOR_BEGIN_NT(__iter__, __start__, __end__, __step__, __num_threads__) \
+#define MLLM_AUTO_PARALLEL_FOR_BEGIN_NT(__iter__, __start__, __end__, __step__, __num_threads__) \
 {                                                                                               \
@@
-    task.step = __step__;                                                                        \
+    task.step = __step__;                                                                        \
+    (void)__num_threads__; /* currently unused; consider threading it into pool config if needed */ \
```

(Replace `MLLM_ASSERT` with whatever assertion/logging facility you use.)
At minimum, please document that `__num_threads__` is ignored for the MLLM vendor path and ensure the context/backend initialisation order guarantees `host_bk->getThreadPool()` is always valid before these macros are used.
---
🏁 Script executed:
```shell
#!/bin/bash
# Search for usage of the three HPC vendor macros in the codebase
echo "=== Searching for MLLM_AUTO_PARALLEL_BEGIN usage ==="
rg -nP 'MLLM_AUTO_PARALLEL_BEGIN\s*\(' --type=hpp --type=cpp
echo -e "\n=== Searching for MLLM_AUTO_PARALLEL_FOR_BEGIN usage ==="
rg -nP 'MLLM_AUTO_PARALLEL_FOR_BEGIN\s*\(' --type=hpp --type=cpp
echo -e "\n=== Searching for MLLM_AUTO_PARALLEL_FOR_BEGIN_NT usage ==="
rg -nP 'MLLM_AUTO_PARALLEL_FOR_BEGIN_NT\s*\(' --type=hpp --type=cpp
echo -e "\n=== Check for Context initialization and getBackend patterns ==="
rg -nC3 'getBackend' --type=hpp --type=cpp | head -50
echo -e "\n=== Search for CPUBackend and getThreadPool ==="
rg -nC2 'getThreadPool' --type=hpp --type=cpp | head -50
```

Length of output: 426
🏁 Script executed:

```shell
#!/bin/bash
# Search for OpenMP macro definitions to compare __num_threads__ handling
echo "=== Search for OpenMP parallel pragmas and macro definitions ==="
rg -nP 'MLLM_AUTO_PARALLEL' --type=hpp --type=cpp -C2 | grep -A5 -B5 'OpenMP\|omp_set_num_threads\|num_threads'
echo -e "\n=== Look for HpcThreadPool definition and thread count handling ==="
fd -e hpp -e cpp | xargs grep -l 'HpcThreadPool' | head -5
echo -e "\n=== Search for assertion/error checking patterns in similar code ==="
rg -nP 'MLLM_ASSERT|assert\s*\(' --type=hpp --type=cpp | head -30
```

Length of output: 408
🏁 Script executed:

```shell
#!/bin/bash
# Find HpcThreadPool definition
fd -type f \( -name '*HpcThreadPool*' -o -name '*ThreadPool*' \) | head -10
# Search for where thread count is handled in thread pool
rg -nC5 'num_threads' --type=hpp --type=cpp | grep -A5 -B5 'HpcThreadPool'
```

Length of output: 198
🌐 Web query:
Does C++ lambda capture `[&]` affect the safety of captured variables after scope ends?
💡 Result:
Short answer: Yes — capturing with [&] does NOT extend the lifetime of the referenced objects. If a lambda that captured by reference is invoked after the original objects go out of scope, you get a dangling reference / undefined behavior. [1][2]
What to do instead (brief):
- Return/store lambdas that need to outlive the scope: capture by value ([=] or explicit captures), or use init-capture (e.g. x = std::move(x)). [1][3]
- For shared ownership use smart pointers (std::shared_ptr) captured by value. [2]
References:
- cppreference on lambda captures and lifetimes.
- SEI CERT C++ guideline: lambda object must not outlive its reference-captured objects.
- Practical discussion / examples (StackOverflow and tutorials).
🏁 Script executed:

```shell
#!/bin/bash
# Search for macro usage without problematic --type flag
echo "=== MLLM_AUTO_PARALLEL macros usage ==="
rg -n 'MLLM_AUTO_PARALLEL_BEGIN\(' mllm/
echo -e "\n=== MLLM_AUTO_PARALLEL_FOR_BEGIN usage ==="
rg -n 'MLLM_AUTO_PARALLEL_FOR_BEGIN\(' mllm/
echo -e "\n=== MLLM_AUTO_PARALLEL_FOR_BEGIN_NT usage ==="
rg -n 'MLLM_AUTO_PARALLEL_FOR_BEGIN_NT\(' mllm/
echo -e "\n=== Check entire Parallel.hpp structure ==="
cat mllm/core/Parallel.hpp | head -150
```

Length of output: 13414
🏁 Script executed:

```shell
#!/bin/bash
# Examine HpcThreadPool implementation
echo "=== HpcThreadPool.hpp ==="
cat mllm/engine/HpcThreadPool.hpp
echo -e "\n=== HpcThreadPool.cpp (first 100 lines) ==="
head -100 mllm/engine/HpcThreadPool.cpp
```

Length of output: 3995
🏁 Script executed:

```shell
#!/bin/bash
# Check CPUBackend for getThreadPool and initialization
echo "=== CPUBackend.hpp snippet around getThreadPool ==="
rg -n -C5 'getThreadPool' mllm/backends/cpu/CPUBackend.hpp
echo -e "\n=== CPUBackend initialization and Context ==="
rg -n -C3 'initThreadPool\|ThreadPool\(' mllm/backends/cpu/CPUBackend.cpp | head -60
echo -e "\n=== Check Context implementation ==="
rg -n -C3 'Context::instance\|getBackend' mllm/ | head -80
```

Length of output: 323
🏁 Script executed:

```shell
#!/bin/bash
# Search for OpenMP #else branch in Parallel.hpp to compare __num_threads__ handling
echo "=== Full Parallel.hpp to see OpenMP version ==="
cat mllm/core/Parallel.hpp
echo -e "\n=== Search for omp_set_num_threads or num_threads in OpenMP section ==="
rg -n 'omp_set_num_threads\|pragma omp\|__num_threads__' mllm/core/Parallel.hpp
```

Length of output: 10376
🏁 Script executed:

```shell
#!/bin/bash
# Check if task lambda is executed synchronously or asynchronously
echo "=== HpcThreadPool push method and task execution ==="
rg -n -A10 'void push\|ThreadPoolTask' mllm/engine/HpcThreadPool.hpp
```

Length of output: 52
**Fix null checks, explicit lambda captures, and num_threads handling in HPC vendor macros**
The three concerns identified are all valid:
- **Null pointer risk:** The macros lack null checks on `Context::instance().getBackend()` and `getThreadPool()`. If context initialization is incomplete or skipped, the dereferences will crash.
- **Fragile lambda capture:** The `[&]` capture in the lambda doesn't extend object lifetimes. If the thread pool executes tasks asynchronously after the macro's scope ends, `host_bk` will be a dangling reference. Even if currently safe due to synchronous execution, this is fragile for future changes.
- **`__num_threads__` ignored:** The OpenMP path respects `__num_threads__` via `num_threads(__num_threads__)`, but the MLLM vendor path accepts but completely ignores it. Callers expecting per-call thread count control will see silent no-ops.

Required fixes:

- Add explicit null checks (assertion or runtime validation) after `getBackend()` and `getThreadPool()`.
- Change lambda capture from `[&]` to `[&, host_bk]` to explicitly capture by reference only what's needed and prevent accidental future dependencies on dangling locals.
- Either implement `__num_threads__` support in the thread pool or document prominently that it is ignored on the MLLM vendor path to prevent caller confusion.
🤖 Prompt for AI Agents
In mllm/core/Parallel.hpp lines 82-132, the vendor-threadpool macros lack null
checks on Context::instance().getBackend() and getThreadPool(), use a fragile
[&] lambda capture that can dangle, and ignore the __num_threads__ parameter;
fix by (1) adding runtime null checks (or assertions) immediately after
retrieving the backend and its thread pool and bail/throw with a clear message
if null, (2) change the lambda capture to capture host_bk by value (e.g.,
capture host_bk) so the backend pointer lifetime is preserved for async
execution, and (3) propagate the __num_threads__ argument into the thread-pool
API (either by setting a field on the task or passing it into
getThreadPool()->push) or, if the thread pool cannot support per-task thread
counts, explicitly document/emit a logged warning that __num_threads__ is
ignored on this vendor path so callers are not silently misled.
```cpp
#if defined(__gnu__linux__)
#include <pthread.h>
#include <sched.h>
#endif

#include <cstdint>

namespace mllm {

#if defined(__gnu_linux__)
static cpu_set_t getNumaAffinity(void) {
  cpu_set_t cpuset;
  pthread_t thread;
  thread = pthread_self();
  CPU_ZERO(&cpuset);
  pthread_getaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
  return cpuset;
}
#else
static uint32_t getNumaAffinity() {
  return 0;  // no NUMA support
}
#endif
```
Fix Linux platform macro mismatch and clean up NUMA helper

- Lines 6-9 use `__gnu__linux__` while lines 15-23 use `__gnu_linux__`. On GNU/Linux only `__gnu_linux__` (and `__linux__`) are standard; the current mismatch can skip the `pthread.h`/`sched.h` includes while still compiling the `getNumaAffinity` body, leading to undefined types and compilation failures on Linux.
- Also, the redundant `void` parameter list and the uninitialized-variable warnings on `cpuset`/`thread` are easy to address.
Consider something like:

```diff
-#if defined(__gnu__linux__)
-#include <pthread.h>
-#include <sched.h>
-#endif
+#if defined(__gnu_linux__) || defined(__linux__)
+#include <pthread.h>
+#include <sched.h>
+#endif
@@
-#if defined(__gnu_linux__)
-static cpu_set_t getNumaAffinity(void) {
-  cpu_set_t cpuset;
-  pthread_t thread;
-  thread = pthread_self();
+#if defined(__gnu_linux__) || defined(__linux__)
+static cpu_set_t getNumaAffinity() {
+  cpu_set_t cpuset{};
+  pthread_t thread = pthread_self();
   CPU_ZERO(&cpuset);
   pthread_getaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
   return cpuset;
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
#if defined(__gnu_linux__) || defined(__linux__)
#include <pthread.h>
#include <sched.h>
#endif

#include <cstdint>

namespace mllm {

#if defined(__gnu_linux__) || defined(__linux__)
static cpu_set_t getNumaAffinity() {
  cpu_set_t cpuset{};
  pthread_t thread = pthread_self();
  CPU_ZERO(&cpuset);
  pthread_getaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
  return cpuset;
}
#else
static uint32_t getNumaAffinity() {
  return 0;  // no NUMA support
}
#endif
```
🧰 Tools
🪛 Clang (14.0.6)
[error] 16-16: redundant void argument list in function definition
(modernize-redundant-void-arg,-warnings-as-errors)
[error] 17-17: variable 'cpuset' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
[error] 18-18: variable 'thread' is not initialized
(cppcoreguidelines-init-variables,-warnings-as-errors)
```cpp
void HpcThreadPool::splitTask(HpcThreadPoolTask&& task, int task_slot_idx) {
  // There are no task slots, use this main thread to compute!
  if (available_task_slots_ == 0) {
    for (int i = task.start; i < task.end; i += task.step) { task.func(i); }
    return;
  }
  int tiles_num = (task.end - task.start) / task.step;

  int _cnt = 0;
  std::vector<int> true_idx(tiles_num);
  for (int i = task.start; i < task.end; i += task.step) { true_idx[_cnt++] = i; }

  // When tiles_num > thread_cnt. We need to nested for loops. Which means static attach works to one thread.
  //
  // NOTE: Static dispatch may have performance issues in bit.LITTLE CPU Arch.
  //
  // NOTE: The dispatch logic is below:
  // e.g.: threads is 4, tiles_name is 12.
  // 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3
  if (tiles_num > thread_cnt_) {
    tasks_[task_slot_idx].first = {
        .start = 0,
        .end = thread_cnt_,
        .step = 1,
        .func =
            [tiles_num, &task, &true_idx, this](int thread_idx) {
              for (int v = thread_idx; v < tiles_num; v += thread_cnt_) { task.func(true_idx[v]); }
            },
    };
    tiles_num = thread_cnt_;
  } else {
    tasks_[task_slot_idx].first = {
        .start = 0,
        .end = tiles_num,
        .step = 1,
        .func = [tiles_num, &task, &true_idx, this](int thread_idx) { task.func(true_idx[thread_idx]); },
    };
  }
  {
    for (int i = 1; i < tiles_num; ++i) { *tasks_[task_slot_idx].second[i] = true; }
  }

  // Explain why we start a 0 idx function.
  //
  // FIXME: The main thread is also used for compute or NOT ?
  tasks_[task_slot_idx].first.func(true_idx[0]);

  // Wait for all threads to complete
  bool complete = true;
  do {
    complete = true;
    for (int i = 1; i < tiles_num; ++i) {
      if (*tasks_[task_slot_idx].second[i]) {
        complete = false;
        break;
      }
    }
    std::this_thread::yield();
  } while (!complete);
}
```
Incorrect main-thread dispatch index in splitTask can cause OOB access / wrong work partitioning

In `splitTask`, the main thread calls:

```cpp
tasks_[task_slot_idx].first.func(true_idx[0]);
```

but `func` expects a thread index, not a tile value:

- In the `tiles_num > thread_cnt_` branch, the lambda interprets its argument as `thread_idx` and iterates tiles `v = thread_idx; v < tiles_num; v += thread_cnt_`. Passing `true_idx[0]` (which is `task.start`) instead of `0` breaks this mapping when `task.start != 0`.
- In the `tiles_num <= thread_cnt_` branch, the lambda uses `true_idx[thread_idx]`. If `task.start > 0`, then `true_idx[0]` may be larger than or equal to `tiles_num`, causing an out-of-bounds access on `true_idx`.

The main thread should use logical thread index 0, not the first tile value:

```diff
-  // FIXME: The main thread is also used for compute or NOT ?
-  tasks_[task_slot_idx].first.func(true_idx[0]);
+  // Main thread executes the work for logical thread index 0.
+  tasks_[task_slot_idx].first.func(0);
```

Also note that this path busy-waits with `std::this_thread::yield()` on all worker flags; if this becomes a hotspot, you may want to introduce a condition-variable or barrier-based completion mechanism instead of pure spinning.
🤖 Prompt for AI Agents
mllm/engine/HpcThreadPool.cpp around lines 100-159: the main thread currently
calls tasks_[task_slot_idx].first.func(true_idx[0]) which passes a tile value
instead of a logical thread index, causing incorrect partitioning and possible
out‑of‑bounds access; change that call to tasks_[task_slot_idx].first.func(0) so
the main thread uses thread index 0; additionally, ensure any comments/reference
to why index 0 is used are updated and consider replacing the spin-wait
(std::this_thread::yield()) with a condition-variable or barrier later if
spinning becomes a hotspot.
```cpp
#define MLLM_HPC_THREAD_POOL_TASK_LIMITS 2

namespace mllm {

struct HpcThreadPoolTask {
  std::function<void(int)> func;
  int start;
  int end;
  int step = 1;

  // Things that will be modified by the thread pool
  bool __state_remap = false;
};
```
Initialisation and naming in HpcThreadPoolTask

- `HpcThreadPoolTask` leaves `start` and `end` uninitialised on default construction. The new macros always assign them before use, but any other caller that default-constructs the task and forgets to set both may hit undefined behaviour.
- The field name `__state_remap` uses a leading double underscore, which is reserved for the implementation in C++.

Consider giving the struct safe in-class initialisers and renaming the flag:
```diff
-struct HpcThreadPoolTask {
-  std::function<void(int)> func;
-  int start;
-  int end;
-  int step = 1;
-
-  // Things that will be modified by the thread pool
-  bool __state_remap = false;
-};
+struct HpcThreadPoolTask {
+  std::function<void(int)> func{};
+  int start = 0;
+  int end = 0;
+  int step = 1;
+
+  // Things that will be modified by the thread pool
+  bool state_remap = false;
+};
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
#define MLLM_HPC_THREAD_POOL_TASK_LIMITS 2

namespace mllm {

struct HpcThreadPoolTask {
  std::function<void(int)> func{};
  int start = 0;
  int end = 0;
  int step = 1;

  // Things that will be modified by the thread pool
  bool state_remap = false;
};
```
🧰 Tools
🪛 Clang (14.0.6)
[error] 16-16: macro 'MLLM_HPC_THREAD_POOL_TASK_LIMITS' used to declare a constant; consider using a 'constexpr' constant
(cppcoreguidelines-macro-usage,-warnings-as-errors)
[error] 20-20: constructor does not initialize these fields: func, start, end
(cppcoreguidelines-pro-type-member-init,-warnings-as-errors)
[error] 27-27: declaration uses identifier '__state_remap', which is a reserved identifier
(bugprone-reserved-identifier,-warnings-as-errors)
🤖 Prompt for AI Agents
In mllm/engine/HpcThreadPool.hpp around lines 16 to 28, HpcThreadPoolTask leaves
start and end uninitialised and uses a reserved identifier __state_remap; add
safe defaults for all members (e.g., std::function<void(int)> func = nullptr;
int start = 0; int end = 0; int step = 1;) or provide an explicit default
constructor that initializes them, and rename the flag to a non-reserved
identifier such as state_remap or is_state_remap (update all uses accordingly)
to avoid UB and reserved-name violations.
…and initialize task parameters
Summary by CodeRabbit
New Features
Chores