CANN: fix multi-thread set_tensor race conditions #20151
ggerganov merged 4 commits into ggml-org:master
Conversation
Problem

ollama uses multiple threads to call ggml_backend_tensor_set, each thread writing a different chunk of the same tensor. This exposed three issues in the CANN backend.

Issue 1: Quantized Tensor Format Transform Requires Full Data

For Q4_0/Q8_0 tensors, the format transform needs the complete tensor data before upload; transforming each chunk independently produces corrupt data.

Issue 2: ND-to-NZ Format Conversion Requires Full Data

For matmul weight tensors, the ND-to-NZ conversion requires the complete tensor data on device; converting per chunk operates on incomplete data.

Issue 3: Global NZ Workspace Is Unprotected

The global g_nz_workspaces array was accessed concurrently without synchronization.
| Resource | Protection | Contention |
|---|---|---|
| tracker map (`ctx->trackers`) | `ctx->tracker_mutex` | Low: only on first/last chunk per tensor |
| individual tracker | `tracker->mtx` | Low: brief lock for memcpy offset + counter increment |
| `g_nz_workspaces[device]` | `g_nz_workspaces[device].mtx` | Very low: only one call per tensor (the last chunk's thread) |
| device memory (`tensor->data`) | No lock needed | Each thread writes to a different offset; post-processing is single-threaded |
Files to Modify

ggml/src/ggml-cann/ggml-cann.cpp:
- Add `#include <mutex>`, `#include <unordered_map>`, `#include <vector>`, `#include <memory>` (check existing includes)
- Add `TensorSetTracker` struct
- Modify `ggml_backend_cann_buffer_context`: add tracker map + mutex
- Add `std::mutex` to `ggml_cann_nz_workspace`
- Modify `weight_format_to_nz`: remove offset parameter
- Rewrite `ggml_backend_cann_buffer_set_tensor`
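A minimal sketch of the tracker bookkeeping this plan describes, assuming the member names listed above; the exact fields and helper signatures in the PR may differ:

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of the per-tensor chunk-write tracker described above.
struct TensorSetTracker {
    std::mutex        mtx;               // serializes chunk writes for this tensor
    size_t            total_bytes   = 0; // full tensor size in bytes
    size_t            bytes_written = 0; // accumulated across chunks
    std::vector<char> host_buffer;       // staging area, used for quantized tensors only
};

// Simplified shape of the buffer-context additions (the PR keys the map by the
// ggml tensor itself; a plain void * key and explicit total_bytes argument are
// used here only to keep the sketch self-contained).
struct cann_buffer_context_sketch {
    std::mutex tracker_mutex;  // protects the trackers map itself
    std::unordered_map<void *, std::unique_ptr<TensorSetTracker>> trackers;

    TensorSetTracker * get_or_create_tracker(void * key, size_t total_bytes) {
        std::lock_guard<std::mutex> lock(tracker_mutex);
        auto & slot = trackers[key];
        if (!slot) {
            slot = std::make_unique<TensorSetTracker>();
            slot->total_bytes = total_bytes;
        }
        return slot.get();
    }

    void remove_tracker(void * key) {
        std::lock_guard<std::mutex> lock(tracker_mutex);
        trackers.erase(key);
    }
};
```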
Pull request overview
Fixes CANN backend concurrency issues when ggml_backend_tensor_set is invoked from multiple threads writing different chunks of the same tensor, by deferring full-tensor post-processing until all chunks arrive and by protecting shared NZ workspace state.
Changes:
- Add `TensorSetTracker` + per-buffer tracker map to accumulate per-tensor chunk progress and defer quantized transform / ND→NZ conversion.
- Add per-device mutex to `g_nz_workspaces` to prevent concurrent workspace realloc/use races.
- Tighten ACL graph node property comparisons by including tensor `type` in equality checks.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| ggml/src/ggml-cann/ggml-cann.cpp | Introduces chunk-write tracking + deferred post-processing; adds mutex for global NZ workspaces. |
| ggml/src/ggml-cann/common.h | Extends graph node property equality with type checks and changes op_params comparison behavior. |
| ggml/src/ggml-cann/aclnn_ops.cpp | Clamps L2 norm denominator using eps from op_params to avoid tiny divisors. |
```cpp
TensorSetTracker * tracker = ctx->get_or_create_tracker(tensor);
std::unique_lock<std::mutex> lock(tracker->mtx);

if (is_quantized) {
    // Stage data in host buffer; transform requires full tensor data
    if (tracker->host_buffer.empty()) {
        tracker->host_buffer.resize(tracker->total_bytes);
    }
    memcpy(tracker->host_buffer.data() + offset, data, size);
} else {
    // NZ weight: upload chunk to device immediately, defer conversion
    ACL_CHECK(aclrtMemcpy((char *) tensor->data + offset, size, data, size, ACL_MEMCPY_HOST_TO_DEVICE));
}

tracker->bytes_written += size;

// All chunks received: perform deferred transform/conversion
if (tracker->bytes_written >= tracker->total_bytes) {
    if (is_quantized) {
        void * transform_buffer = malloc(tracker->total_bytes);
        ggml_backend_cann_transform(tensor, tracker->host_buffer.data(), transform_buffer);
        ACL_CHECK(aclrtMemcpy(tensor->data, tracker->total_bytes, transform_buffer, tracker->total_bytes, ACL_MEMCPY_HOST_TO_DEVICE));
        free(transform_buffer);
    }

    if (is_nz) {
        GGML_ASSERT(tensor->ne[2] == 1);
        GGML_ASSERT(tensor->ne[3] == 1);
        weight_format_to_nz(tensor, ctx->device);
    }

    // Unlock before removing tracker, as remove_tracker destroys the mutex
    lock.unlock();
    ctx->remove_tracker(tensor);
}
```
The tracker can be destroyed while other threads are still able to obtain/use the same tracker pointer. After lock.unlock() (1348) and before/while remove_tracker() erases the unique_ptr, another thread can acquire tracker->mtx or continue operating with a raw pointer to the tracker, causing use-after-free (and potentially destroying a mutex while other threads wait on it, which is undefined behavior). A robust fix is to manage tracker lifetime with shared ownership (e.g., store std::shared_ptr<TensorSetTracker> in the map and return a shared_ptr from get_or_create_tracker), and add a finalized flag under tracker->mtx so only one thread performs deferred transform/conversion and map removal; erase from the map can happen while the object remains alive until all in-flight callers drop their shared_ptr.
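For illustration only, a minimal sketch of the shared-ownership scheme suggested in this comment, using hypothetical names (TrackerShared, TrackerRegistry); the merged PR keeps unique_ptr ownership instead:

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <unordered_map>

// Hypothetical tracker with shared ownership and a single-finalizer flag.
struct TrackerShared {
    std::mutex mtx;
    size_t     total_bytes   = 0;
    size_t     bytes_written = 0;
    bool       finalized     = false;  // set exactly once, under mtx
};

struct TrackerRegistry {
    std::mutex map_mtx;
    std::unordered_map<void *, std::shared_ptr<TrackerShared>> map;

    std::shared_ptr<TrackerShared> get_or_create(void * key, size_t total_bytes) {
        std::lock_guard<std::mutex> lock(map_mtx);
        auto & slot = map[key];
        if (!slot) {
            slot = std::make_shared<TrackerShared>();
            slot->total_bytes = total_bytes;
        }
        return slot;  // caller holds its own reference
    }

    void erase(void * key) {
        std::lock_guard<std::mutex> lock(map_mtx);
        map.erase(key);  // object stays alive until the last shared_ptr is dropped
    }
};

// Per-chunk bookkeeping: only the thread that flips `finalized` runs the
// deferred transform/conversion and erases the map entry.
void on_chunk_written(TrackerRegistry & reg, void * key,
                      const std::shared_ptr<TrackerShared> & t, size_t size) {
    bool do_finalize = false;
    {
        std::lock_guard<std::mutex> lock(t->mtx);
        t->bytes_written += size;
        if (t->bytes_written >= t->total_bytes && !t->finalized) {
            t->finalized = true;
            do_finalize  = true;
        }
    }
    if (do_finalize) {
        // ... run the deferred transform / ND->NZ conversion here ...
        reg.erase(key);  // safe: any in-flight holder keeps the tracker alive
    }
}
```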
Thank you for the careful analysis. In practice, the race described here cannot occur: all chunk-write threads for the same tensor are serialized by tracker->mtx. Only the thread that increments bytes_written past total_bytes will enter the cleanup path, and by that point all other threads have already released the lock and exited set_tensor. No new threads will call get_or_create_tracker for this tensor since all chunks have been written. Switching to shared_ptr + finalized flag would add complexity without practical benefit given the ggml calling convention.
For NZ weights, aclrtMemcpy is performed while holding tracker->mtx, which serializes chunk uploads for the same tensor and can largely negate the benefit of multi-threaded chunked writes. Consider narrowing the critical section: copy the chunk to device outside the tracker lock (while still ensuring the tracker cannot be finalized/freed mid-copy via a lifetime mechanism such as shared_ptr, and a “finalizing/in_flight” state), then re-lock only to update progress and potentially trigger finalization.
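A rough sketch of the two-phase pattern suggested here, with the device copy stubbed out as a placeholder (the real backend would call aclrtMemcpy); names and structure are illustrative, not the merged code:

```cpp
#include <cstddef>
#include <memory>
#include <mutex>

struct NzChunkTracker {
    std::mutex mtx;
    size_t     total_bytes   = 0;
    size_t     bytes_written = 0;
    bool       finalized     = false;
};

// Placeholder for the device upload (the real backend would call aclrtMemcpy
// with ACL_MEMCPY_HOST_TO_DEVICE here).
static void upload_chunk_to_device(void * /*dst*/, const void * /*src*/, size_t /*size*/) {}

// Two-phase pattern: the chunk copy runs without holding tracker->mtx, and the
// lock is re-taken only to record progress and elect a single finalizer. The
// shared_ptr keeps the tracker alive even if another thread erases it from the
// map while this copy is still in flight.
void set_nz_chunk(const std::shared_ptr<NzChunkTracker> & tracker,
                  void * device_base, size_t offset,
                  const void * data, size_t size) {
    // Phase 1: device copy, no tracker lock held.
    upload_chunk_to_device(static_cast<char *>(device_base) + offset, data, size);

    // Phase 2: short critical section for progress accounting.
    bool do_finalize = false;
    {
        std::lock_guard<std::mutex> lock(tracker->mtx);
        tracker->bytes_written += size;
        if (tracker->bytes_written >= tracker->total_bytes && !tracker->finalized) {
            tracker->finalized = true;
            do_finalize        = true;
        }
    }
    if (do_finalize) {
        // ... run ND->NZ conversion and drop the tracker from the map here ...
    }
}
```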
Acknowledged. However, NZ weight upload only happens during model loading (not on the inference hot path), and each aclrtMemcpy for a chunk is a short DMA transfer. Narrowing the critical section would require a two-phase lock pattern with additional lifetime management, adding complexity with negligible real-world performance gain. Keeping it simple for now.
Completion is inferred purely from bytes_written += size, which assumes chunks are disjoint and written exactly once. If callers ever retry a chunk, overlap ranges, or write the same chunk twice, bytes_written can reach total_bytes early and trigger transform/conversion with a partially-populated host_buffer (quantized) or incomplete device data (NZ). A safer approach is to track received ranges (e.g., interval set / bitmap in units of a suitable block size) and only finalize when coverage reaches total_bytes, or at minimum detect/guard against double-counting by tracking (offset,size) segments.
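A sketch of the coverage-tracking alternative described here, using a hypothetical interval-merging helper; the merged PR relies on the byte counter instead:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <map>

// Hypothetical helper: records written byte ranges and reports completion only
// when the merged ranges cover [0, total), so duplicate or overlapping chunk
// writes cannot trigger finalization early.
class RangeCoverage {
  public:
    explicit RangeCoverage(size_t total) : total_(total) {}

    // Record [offset, offset + size); overlapping or adjacent ranges are merged.
    void add(size_t offset, size_t size) {
        size_t begin = offset;
        size_t end   = offset + size;
        auto   it    = ranges_.lower_bound(begin);
        if (it != ranges_.begin()) {
            auto prev = std::prev(it);
            if (prev->second >= begin) {
                it = prev;  // previous range overlaps or touches the new one
            }
        }
        while (it != ranges_.end() && it->first <= end) {
            begin = std::min(begin, it->first);
            end   = std::max(end, it->second);
            it    = ranges_.erase(it);
        }
        ranges_[begin] = end;
    }

    bool complete() const {
        return ranges_.size() == 1 && ranges_.begin()->first == 0 &&
               ranges_.begin()->second >= total_;
    }

  private:
    size_t total_;
    std::map<size_t, size_t> ranges_;  // start -> one-past-end, kept disjoint
};
```

A tracker could then hold one of these and finalize when complete() returns true, rather than comparing bytes_written against total_bytes.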
The ggml backend_tensor_set contract guarantees that callers write disjoint, non-overlapping chunks exactly once (this is how ollama's multi-threaded loader works). Adding interval tracking or bitmap coverage would be over-engineering for a scenario that cannot occur under the current API contract. If the contract changes in the future, we can revisit.
```cpp
std::mutex tracker_mutex;  ///< Protects the trackers map
std::unordered_map<void *, std::unique_ptr<TensorSetTracker>> trackers;
```
If a tensor’s chunked upload never reaches completion (e.g., caller aborts mid-load), its tracker (and potentially large host_buffer) will remain resident in trackers until buffer context destruction, which can cause unbounded host memory growth over time. Consider adding explicit cleanup paths (e.g., clearing trackers in the destructor before freeing the device pointer, and/or exposing a cancellation/cleanup hook when tensors/buffers are reset) so partially-written tensors don’t accumulate indefinitely.
The trackers map holds std::unique_ptr, which are automatically destroyed when the map is destroyed as part of ggml_backend_cann_buffer_context destruction. C++ RAII already guarantees deterministic cleanup — no explicit trackers.clear() is needed. Adding it would be redundant but harmless; omitting it keeps the code minimal.
```cpp
/**
 * @brief Destructor to free the device memory allocated for the buffer.
 */
~ggml_backend_cann_buffer_context() { ACL_CHECK(aclrtFree(dev_ptr)); }
```
Suggested change:

```cpp
/**
 * @brief Destructor to free the device memory allocated for the buffer
 *        and clear any outstanding trackers.
 */
~ggml_backend_cann_buffer_context() {
    {
        // Explicitly clear any outstanding trackers under lock to ensure
        // their host-side resources are released deterministically.
        std::lock_guard<std::mutex> lock(tracker_mutex);
        trackers.clear();
    }
    if (dev_ptr != nullptr) {
        ACL_CHECK(aclrtFree(dev_ptr));
        dev_ptr = nullptr;
    }
}
```
```diff
-        return memcmp(this->op_params, node->op_params, GGML_MAX_OP_PARAMS) == 0;
-    }
-    return true;
+    return memcmp(this->op_params, node->op_params, GGML_MAX_OP_PARAMS) == 0;
```
This changes equality semantics to always compare the full op_params buffer. If op_params is not deterministically initialized for ops that don’t use it, this can cause false inequality (hurting graph cache hits) and make behavior dependent on uninitialized bytes. The previous code only compared op_params for a subset of ops. Consider restoring conditional comparison (only for ops where op_params are meaningful) or ensuring op_params is always zero-initialized/populated deterministically for all nodes before being stored/compared.
In ggml, all tensors are zero-initialized via memset(result, 0, ...) in ggml_new_tensor_impl. For ops that don't use op_params, the buffer remains all-zeros deterministically. Unconditional comparison is therefore safe and won't cause false cache misses. The previous conditional approach was fragile — it required manually maintaining the list of ops that use op_params, and missing one (as happened with POOL_2D) caused silent correctness bugs.
```cpp
float eps;
memcpy(&eps, dst->op_params, sizeof(float));
```
If eps can be 0.0f (or negative), clamping to [eps, FLT_MAX] does not guarantee the divisor is strictly positive, so Div can still produce inf/NaN when the norm is zero. Consider enforcing eps to be at least a small positive constant (or using max(eps, tiny) as the clamp minimum) before creating acl_min.
Good point. Added eps = fmaxf(eps, FLT_MIN) as a defensive guard before creating the clamp scalar. This ensures the divisor is always strictly positive regardless of caller input, at negligible cost.
```cpp
// Clamp norm to at least eps: scale = 1/fmaxf(norm, eps)
acl_scalar_ptr acl_min = ggml_cann_create_scalar(&eps, aclDataType::ACL_FLOAT);
float flt_max = FLT_MAX;
acl_scalar_ptr acl_max = ggml_cann_create_scalar(&flt_max, aclDataType::ACL_FLOAT);
GGML_CANN_CALL_ACLNN_OP(ctx, Clamp, acl_div.get(), acl_min.get(), acl_max.get(), acl_div.get());

GGML_CANN_CALL_ACLNN_OP(ctx, Div, acl_src.get(), acl_div.get(), acl_dst.get());
```
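A standalone illustration (not code from the PR) of why guarding eps with fmaxf(eps, FLT_MIN) keeps the clamped divisor strictly positive and the resulting scale finite, even for a zero norm:

```cpp
#include <cfloat>
#include <cmath>
#include <cstdio>

int main() {
    const float eps_values[] = { 10.0f, 1e-12f, 0.0f, -1.0f };
    for (float eps : eps_values) {
        float guarded = std::fmaxf(eps, FLT_MIN);   // always strictly positive
        float norm    = 0.0f;                       // worst case: zero L2 norm
        float divisor = std::fmaxf(norm, guarded);  // what the Clamp step enforces
        // 1/FLT_MIN is about 8.5e37, still below FLT_MAX, so the scale stays finite.
        std::printf("eps=%g -> divisor=%g, scale=%g\n", eps, divisor, 1.0f / divisor);
    }
    return 0;
}
```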
When ollama calls ggml_backend_tensor_set from multiple threads (each writing a different chunk of the same tensor), the CANN backend had three concurrency issues:

1. Quantized tensors (Q4_0/Q8_0) require a full-tensor format transform before uploading to device. Per-chunk transforms produced corrupt data.
2. ND-to-NZ weight conversion requires complete tensor data on device. Per-chunk conversion operated on incomplete data.
3. The global g_nz_workspaces array had unprotected concurrent access.

Fix by introducing a TensorSetTracker that accumulates write progress per tensor. For quantized tensors, raw data is staged in a host buffer and the transform + upload is deferred until all chunks arrive. For NZ weights, chunks are uploaded directly but conversion is deferred. The tracker and its staging buffer are released immediately after post-processing completes.

Add per-device mutex to g_nz_workspaces to prevent data races.
The L2_NORM implementation was not using the eps parameter from op_params, causing incorrect results when eps is large (e.g. 10.0). The CPU reference computes scale = 1/fmaxf(norm, eps), so add a Clamp step to clamp the norm to at least eps before dividing.
When ACL graph mode is enabled, the graph LRU cache checks whether a cached graph matches the current computation graph. Previously, GGML_OP_POOL_2D was not included in the op_params comparison, so two POOL_2D nodes with different pooling parameters (kernel size, stride, padding) but identical tensor shapes and addresses could incorrectly reuse a cached graph, leading to wrong results or aclnn errors. Add GGML_OP_POOL_2D to the list of ops that require op_params matching in ggml_graph_node_properties::has_matching_properties().
cann: fix ACL graph cache matching by adding tensor type and unconditional op_params comparison

The ACL graph LRU cache was incorrectly reusing cached graphs for operations with different tensor types or op_params, causing test failures for CPY (f16 vs bf16), POOL_2D, L2_NORM, NORM_MUL_ADD, RMS_NORM_MUL_ADD, and ADD_RMS_NORM.

Changes:
- Add node_type and src_type[] fields to ggml_graph_node_properties so the cache can distinguish tensors with different types but identical ne/nb (e.g. f16 and bf16 both have 2-byte elements)
- Compare op_params unconditionally for all ops instead of only for SCALE/UNARY/GLU/ROPE/POOL_2D
LGTM! This concurrency scenario hadn't been considered before. The handling in ACL Graph is likely intended to avoid issues encountered when running accuracy tests in graph mode.
ggerganov left a comment:
Was this race not detected by the test-thread-safety test that we have? If not, can we create a test that would trigger this use case so we have coverage for other backends too?
Though I am not completely sure I understand the root cause of the race condition here.
@ggerganov The current fix isn't about the issues covered by test-thread-safety. In the CANN backend, we have two optimizations for inference speed: quantized tensor memory restructuring and ND-to-NZ format conversion.
In llama.cpp, wouldn't you have encountered the same problem when going through this path: llama.cpp/src/llama-model-loader.cpp, lines 1527 to 1580 in 5b9ce6c?
The CANN backend currently declares […] I tested on the […] So the problem exists in both the […]
When ollama calls ggml_backend_tensor_set from multiple threads (each writing a different chunk of the same tensor), the CANN backend had three concurrency issues:
Quantized tensors (Q4_0/Q8_0) require a full-tensor format transform before uploading to device. Per-chunk transforms produced corrupt data.
ND-to-NZ weight conversion requires complete tensor data on device. Per-chunk conversion operated on incomplete data.
The global g_nz_workspaces array had unprotected concurrent access.
Fix by introducing a TensorSetTracker that accumulates write progress per tensor. For quantized tensors, raw data is staged in a host buffer and the transform + upload is deferred until all chunks arrive. For NZ weights, chunks are uploaded directly but conversion is deferred. The tracker and its staging buffer are released immediately after post-processing completes.
Add per-device mutex to g_nz_workspaces to prevent data races.