CANN: fix multi-thread set_tensor race conditions #20151

Merged
ggerganov merged 4 commits into ggml-org:master from hipudding:ocp on Mar 31, 2026
Conversation

@hipudding
Contributor

When ollama calls ggml_backend_tensor_set from multiple threads (each writing a different chunk of the same tensor), the CANN backend had three concurrency issues:

  1. Quantized tensors (Q4_0/Q8_0) require a full-tensor format transform before uploading to device. Per-chunk transforms produced corrupt data.

  2. ND-to-NZ weight conversion requires complete tensor data on device. Per-chunk conversion operated on incomplete data.

  3. The global g_nz_workspaces array had unprotected concurrent access.

Fix by introducing a TensorSetTracker that accumulates write progress per tensor. For quantized tensors, raw data is staged in a host buffer and the transform + upload is deferred until all chunks arrive. For NZ weights, chunks are uploaded directly but conversion is deferred. The tracker and its staging buffer are released immediately after post-processing completes.

Add per-device mutex to g_nz_workspaces to prevent data races.


@hipudding
Contributor Author

Problem

ollama uses multiple threads to call ggml_backend_tensor_set, with each thread handling a different chunk (offset/size) of the same tensor. The current CANN backend implementation has three concurrency issues:

Issue 1: Quantized Tensor Format Transform Requires Full Data

For Q4_0/Q8_0 tensors, ggml_backend_cann_transform reorganizes the entire tensor's memory layout — separating quant values and scale factors into contiguous regions. The transform uses ggml_nelements(tensor) to compute global offsets, meaning it cannot operate on partial chunks. When called per-chunk, the output is corrupted.

Code path: ggml-cann.cpp:1242-1247 (the else branch in set_tensor)
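To make the failure mode concrete, here is a minimal sketch of such a layout split (the block struct is a simplified stand-in for ggml's block_q8_0, and transform_full is hypothetical, not the actual ggml_backend_cann_transform):

#include <cstdint>
#include <cstring>

struct block_q8_0_sketch { uint16_t d; int8_t qs[32]; };  // fp16 scale bits + 32 quants

// Split blocks into one contiguous quant region followed by one contiguous
// scale region. The scale region's start is a GLOBAL offset derived from the
// total block count, so it is only correct when n_blocks covers the whole tensor.
static void transform_full(const block_q8_0_sketch * src, uint8_t * dst, int64_t n_blocks) {
    uint8_t * quants = dst;
    uint8_t * scales = dst + 32 * n_blocks;
    for (int64_t i = 0; i < n_blocks; ++i) {
        std::memcpy(quants + 32 * i, src[i].qs, 32);
        std::memcpy(scales + sizeof(uint16_t) * i, &src[i].d, sizeof(uint16_t));
    }
}

// Calling transform_full on a half-tensor chunk derives the scale region from
// the chunk's block count, so scales land where the full tensor's quants
// belong; this is the per-chunk corruption described above.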

Issue 2: ND-to-NZ Format Conversion Requires Full Data

For matmul weight tensors, weight_format_to_nz calls aclnnTransMatmulWeight which converts the entire tensor from ND to NZ format. When called per-chunk (before all data is uploaded), the conversion operates on incomplete data.

Code path: ggml-cann.cpp:1237-1241

Issue 3: Global g_nz_workspaces Concurrent Access

g_nz_workspaces[device] is a global per-device workspace used in weight_format_to_nz. Multiple threads calling this concurrently can race on realloc() and get().

Code path: ggml-cann.cpp:1177, 1193-1206

Solution Design

Core Idea

Use a per-tensor tracker to accumulate write progress. Defer post-processing (quantized transform + device upload, or NZ conversion) until all chunks of a tensor have been written.

Data Structures

1. TensorSetTracker

Tracks how much data has been written for each tensor. Stored in the buffer context.

struct TensorSetTracker {
    std::mutex           mtx;            // protects this tracker
    size_t               bytes_written;  // accumulated bytes written so far
    size_t               total_bytes;    // ggml_nbytes(tensor), target to reach
    std::vector<uint8_t> host_buffer;    // staging buffer for quantized tensors only

    TensorSetTracker(size_t total, bool need_staging)
        : bytes_written(0), total_bytes(total) {
        if (need_staging) {
            host_buffer.resize(total);
        }
    }
};

2. Modifications to ggml_backend_cann_buffer_context

Add a map of active trackers and a mutex to protect the map itself.

struct ggml_backend_cann_buffer_context {
    int32_t device;
    void *  dev_ptr = nullptr;

    std::mutex                                                     tracker_mutex;
    std::unordered_map<ggml_tensor*, std::shared_ptr<TensorSetTracker>> trackers;

    // ... existing constructors/destructor
};

3. NZ Workspace Mutex

Add a std::mutex to ggml_cann_nz_workspace to protect per-device concurrent access.

struct ggml_cann_nz_workspace {
    std::mutex mtx;          // new: protects ptr/allocated
    void * ptr = nullptr;
    size_t allocated = 0;
    // ... existing methods, caller must hold mtx
};
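Callers then take the per-device lock around any workspace use, e.g.:

{
    // Serialize per-device workspace access around the deferred conversion
    // (mirrors the locking step in the set_tensor pseudocode below).
    std::lock_guard<std::mutex> lock(g_nz_workspaces[device].mtx);
    weight_format_to_nz(tensor, 0, device);  // the offset parameter is dropped later in this design
}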

Modified ggml_backend_cann_buffer_set_tensor Logic

function set_tensor(buffer, tensor, data, offset, size):
    ctx = buffer->context
    set_device(ctx->device)

    needs_transform = need_transform(tensor->type)
    needs_nz = weight_to_nz && is_matmul_weight(tensor)

    if !needs_transform && !needs_nz:
        // Case 1: Plain tensor, no post-processing needed
        // Direct memcpy is safe per-chunk, no tracker needed
        aclrtMemcpy(tensor->data + offset, size, data, size, HOST_TO_DEVICE)
        return

    // Case 2 & 3: Need post-processing, use tracker
    tracker = get_or_create_tracker(ctx, tensor, needs_transform)

    lock(tracker->mtx)

    if needs_transform:
        // Stage data in host buffer at the correct offset
        memcpy(tracker->host_buffer.data() + offset, data, size)
    else:
        // NZ case: upload chunk to device immediately (safe, different offsets)
        aclrtMemcpy(tensor->data + offset, size, data, size, HOST_TO_DEVICE)

    tracker->bytes_written += size
    all_done = (tracker->bytes_written >= tracker->total_bytes)

    unlock(tracker->mtx)

    if all_done:
        if needs_transform:
            // All data staged, now transform entire tensor and upload
            transform_buffer = malloc(total_bytes)
            ggml_backend_cann_transform(tensor, tracker->host_buffer.data(), transform_buffer)
            aclrtMemcpy(tensor->data, total_bytes, transform_buffer, total_bytes, HOST_TO_DEVICE)
            free(transform_buffer)

        if needs_nz:
            // All data on device, now convert entire tensor to NZ
            lock(g_nz_workspaces[device].mtx)
            weight_format_to_nz(tensor, 0, device)
            unlock(g_nz_workspaces[device].mtx)

        // Cleanup tracker
        lock(ctx->tracker_mutex)
        ctx->trackers.erase(tensor)
        unlock(ctx->tracker_mutex)

Helper: get_or_create_tracker

function get_or_create_tracker(ctx, tensor, needs_staging):
    lock(ctx->tracker_mutex)
    if tensor not in ctx->trackers:
        total = ggml_nbytes(tensor)
        ctx->trackers[tensor] = make_shared<TensorSetTracker>(total, needs_staging)
    tracker = ctx->trackers[tensor]
    unlock(ctx->tracker_mutex)
    return tracker
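A C++ sketch of this helper, assuming the shared_ptr map from the buffer-context design above (the merged code differs slightly in ownership details):

static std::shared_ptr<TensorSetTracker> get_or_create_tracker(
        ggml_backend_cann_buffer_context * ctx, ggml_tensor * tensor, bool needs_staging) {
    std::lock_guard<std::mutex> lock(ctx->tracker_mutex);
    auto it = ctx->trackers.find(tensor);
    if (it == ctx->trackers.end()) {
        // First chunk for this tensor: create the tracker (and staging buffer if needed)
        it = ctx->trackers.emplace(tensor,
                std::make_shared<TensorSetTracker>(ggml_nbytes(tensor), needs_staging)).first;
    }
    return it->second;
}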

Modified weight_format_to_nz

Remove the offset parameter since it's always called on the complete tensor now.

// Before: static void weight_format_to_nz(ggml_tensor * tensor, size_t offset, int device)
// After:  static void weight_format_to_nz(ggml_tensor * tensor, int device)
//         Always converts from offset 0 (full tensor)

Single-Thread Compatibility

When called single-threaded (one call with offset=0, size=ggml_nbytes), the tracker is created, bytes_written reaches total_bytes in the same call, post-processing executes immediately, and the tracker is cleaned up. No behavior change.
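For example, llama.cpp's usual non-chunked load path reduces to a single full-tensor call through the public ggml-backend API:

// One call covers the whole tensor: the tracker is created, bytes_written
// reaches total_bytes immediately, post-processing runs, and the tracker is erased.
ggml_backend_tensor_set(tensor, data, 0, ggml_nbytes(tensor));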

Thread Safety Summary

| Resource | Protection | Contention |
| --- | --- | --- |
| tracker map (ctx->trackers) | ctx->tracker_mutex | Low: only on first/last chunk per tensor |
| individual tracker | tracker->mtx | Low: brief lock for memcpy offset + counter increment |
| g_nz_workspaces[device] | g_nz_workspaces[device].mtx | Very low: only one call per tensor (the last chunk's thread) |
| device memory (tensor->data) | No lock needed | Each thread writes to a different offset; post-processing is single-threaded |

Files to Modify

  1. ggml/src/ggml-cann/ggml-cann.cpp:
    • Add #include <mutex>, #include <unordered_map>, #include <vector>, #include <memory> (check existing includes)
    • Add TensorSetTracker struct
    • Modify ggml_backend_cann_buffer_context: add tracker map + mutex
    • Add std::mutex to ggml_cann_nz_workspace
    • Modify weight_format_to_nz: remove offset parameter
    • Rewrite ggml_backend_cann_buffer_set_tensor

@github-actions Bot added the ggml (changes relating to the ggml tensor library for machine learning) and Ascend NPU (issues specific to Ascend NPUs) labels Mar 6, 2026
@hipudding force-pushed the ocp branch 2 times, most recently from aeee6b4 to 51a28d0 on March 26, 2026 13:23
@hipudding hipudding marked this pull request as ready for review March 26, 2026 13:27
@hipudding hipudding requested a review from a team as a code owner March 26, 2026 13:27
Copilot AI review requested due to automatic review settings March 26, 2026 13:27

Copilot AI left a comment


Pull request overview

Fixes CANN backend concurrency issues when ggml_backend_tensor_set is invoked from multiple threads writing different chunks of the same tensor, by deferring full-tensor post-processing until all chunks arrive and by protecting shared NZ workspace state.

Changes:

  • Add TensorSetTracker + per-buffer tracker map to accumulate per-tensor chunk progress and defer quantized transform / ND→NZ conversion.
  • Add per-device mutex to g_nz_workspaces to prevent concurrent workspace realloc/use races.
  • Tighten ACL graph node property comparisons by including tensor type in equality checks.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| ggml/src/ggml-cann/ggml-cann.cpp | Introduces chunk-write tracking + deferred post-processing; adds mutex for global NZ workspaces. |
| ggml/src/ggml-cann/common.h | Extends graph node property equality with type checks and changes op_params comparison behavior. |
| ggml/src/ggml-cann/aclnn_ops.cpp | Clamps L2 norm denominator using eps from op_params to avoid tiny divisors. |


Comment on lines +1316 to 1350
TensorSetTracker * tracker = ctx->get_or_create_tracker(tensor);
std::unique_lock<std::mutex> lock(tracker->mtx);

if (is_quantized) {
    // Stage data in host buffer; transform requires full tensor data
    if (tracker->host_buffer.empty()) {
        tracker->host_buffer.resize(tracker->total_bytes);
    }
    memcpy(tracker->host_buffer.data() + offset, data, size);
} else {
    // NZ weight: upload chunk to device immediately, defer conversion
    ACL_CHECK(aclrtMemcpy((char *) tensor->data + offset, size, data, size, ACL_MEMCPY_HOST_TO_DEVICE));
}

tracker->bytes_written += size;

// All chunks received: perform deferred transform/conversion
if (tracker->bytes_written >= tracker->total_bytes) {
    if (is_quantized) {
        void * transform_buffer = malloc(tracker->total_bytes);
        ggml_backend_cann_transform(tensor, tracker->host_buffer.data(), transform_buffer);
        ACL_CHECK(aclrtMemcpy(tensor->data, tracker->total_bytes, transform_buffer, tracker->total_bytes, ACL_MEMCPY_HOST_TO_DEVICE));
        free(transform_buffer);
    }

    if (is_nz) {
        GGML_ASSERT(tensor->ne[2] == 1);
        GGML_ASSERT(tensor->ne[3] == 1);
        weight_format_to_nz(tensor, ctx->device);
    }

    // Unlock before removing tracker, as remove_tracker destroys the mutex
    lock.unlock();
    ctx->remove_tracker(tensor);
}

Copilot AI Mar 26, 2026


The tracker can be destroyed while other threads are still able to obtain/use the same tracker pointer. After lock.unlock() (1348) and before/while remove_tracker() erases the unique_ptr, another thread can acquire tracker->mtx or continue operating with a raw pointer to the tracker, causing use-after-free (and potentially destroying a mutex while other threads wait on it, which is undefined behavior). A robust fix is to manage tracker lifetime with shared ownership (e.g., store std::shared_ptr<TensorSetTracker> in the map and return a shared_ptr from get_or_create_tracker), and add a finalized flag under tracker->mtx so only one thread performs deferred transform/conversion and map removal; erase from the map can happen while the object remains alive until all in-flight callers drop their shared_ptr.

Contributor Author


Thank you for the careful analysis. In practice, the race described here cannot occur: all chunk-write threads for the same tensor are serialized by tracker->mtx. Only the thread that increments bytes_written past total_bytes will enter the cleanup path, and by that point all other threads have already released the lock and exited set_tensor. No new threads will call get_or_create_tracker for this tensor since all chunks have been written. Switching to shared_ptr + finalized flag would add complexity without practical benefit given the ggml calling convention.

Comment on lines +1317 to +1328
std::unique_lock<std::mutex> lock(tracker->mtx);

if (is_quantized) {
    // Stage data in host buffer; transform requires full tensor data
    if (tracker->host_buffer.empty()) {
        tracker->host_buffer.resize(tracker->total_bytes);
    }
    memcpy(tracker->host_buffer.data() + offset, data, size);
} else {
    // NZ weight: upload chunk to device immediately, defer conversion
    ACL_CHECK(aclrtMemcpy((char *) tensor->data + offset, size, data, size, ACL_MEMCPY_HOST_TO_DEVICE));
}

Copilot AI Mar 26, 2026


For NZ weights, aclrtMemcpy is performed while holding tracker->mtx, which serializes chunk uploads for the same tensor and can largely negate the benefit of multi-threaded chunked writes. Consider narrowing the critical section: copy the chunk to device outside the tracker lock (while still ensuring the tracker cannot be finalized/freed mid-copy via a lifetime mechanism such as shared_ptr, and a “finalizing/in_flight” state), then re-lock only to update progress and potentially trigger finalization.

Contributor Author


Acknowledged. However, NZ weight upload only happens during model loading (not on the inference hot path), and each aclrtMemcpy for a chunk is a short DMA transfer. Narrowing the critical section would require a two-phase lock pattern with additional lifetime management, adding complexity with negligible real-world performance gain. Keeping it simple for now.

Comment on lines +1330 to +1333
tracker->bytes_written += size;

// All chunks received: perform deferred transform/conversion
if (tracker->bytes_written >= tracker->total_bytes) {

Copilot AI Mar 26, 2026


Completion is inferred purely from bytes_written += size, which assumes chunks are disjoint and written exactly once. If callers ever retry a chunk, overlap ranges, or write the same chunk twice, bytes_written can reach total_bytes early and trigger transform/conversion with a partially-populated host_buffer (quantized) or incomplete device data (NZ). A safer approach is to track received ranges (e.g., interval set / bitmap in units of a suitable block size) and only finalize when coverage reaches total_bytes, or at minimum detect/guard against double-counting by tracking (offset,size) segments.
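For illustration, a minimal sketch of the interval-coverage alternative suggested here (hypothetical; the PR keeps the simple byte counter, see the reply below):

#include <cstddef>
#include <iterator>
#include <map>

struct CoverageTracker {
    std::map<size_t, size_t> ranges;  // start -> end, kept disjoint and merged

    void add(size_t offset, size_t size) {
        size_t start = offset, end = offset + size;
        auto it = ranges.lower_bound(start);
        // Pull in a preceding range that overlaps or touches [start, end)
        if (it != ranges.begin() && std::prev(it)->second >= start) {
            --it;
        }
        while (it != ranges.end() && it->first <= end) {
            start = start < it->first  ? start : it->first;
            end   = end   > it->second ? end   : it->second;
            it = ranges.erase(it);
        }
        ranges[start] = end;
    }

    // Complete only when one merged range covers [0, total); double-counted
    // or overlapping chunks can no longer trigger early finalization.
    bool complete(size_t total) const {
        return ranges.size() == 1 && ranges.begin()->first == 0 && ranges.begin()->second >= total;
    }
};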

Contributor Author


The ggml backend_tensor_set contract guarantees that callers write disjoint, non-overlapping chunks exactly once (this is how ollama's multi-threaded loader works). Adding interval tracking or bitmap coverage would be over-engineering for a scenario that cannot occur under the current API contract. If the contract changes in the future, we can revisit.

Comment on lines +801 to +803
std::mutex tracker_mutex; ///< Protects the trackers map
std::unordered_map<void *, std::unique_ptr<TensorSetTracker>> trackers;


Copilot AI Mar 26, 2026


If a tensor’s chunked upload never reaches completion (e.g., caller aborts mid-load), its tracker (and potentially large host_buffer) will remain resident in trackers until buffer context destruction, which can cause unbounded host memory growth over time. Consider adding explicit cleanup paths (e.g., clearing trackers in the destructor before freeing the device pointer, and/or exposing a cancellation/cleanup hook when tensors/buffers are reset) so partially-written tensors don’t accumulate indefinitely.

Contributor Author


The trackers map holds std::unique_ptr, which are automatically destroyed when the map is destroyed as part of ggml_backend_cann_buffer_context destruction. C++ RAII already guarantees deterministic cleanup — no explicit trackers.clear() is needed. Adding it would be redundant but harmless; omitting it keeps the code minimal.

Comment on lines 813 to +816
* @brief Destructor to free the device memory allocated for the buffer.
*/
~ggml_backend_cann_buffer_context() { ACL_CHECK(aclrtFree(dev_ptr)); }


Copilot AI Mar 26, 2026


If a tensor’s chunked upload never reaches completion (e.g., caller aborts mid-load), its tracker (and potentially large host_buffer) will remain resident in trackers until buffer context destruction, which can cause unbounded host memory growth over time. Consider adding explicit cleanup paths (e.g., clearing trackers in the destructor before freeing the device pointer, and/or exposing a cancellation/cleanup hook when tensors/buffers are reset) so partially-written tensors don’t accumulate indefinitely.

Suggested change

-     * @brief Destructor to free the device memory allocated for the buffer.
-     */
-    ~ggml_backend_cann_buffer_context() { ACL_CHECK(aclrtFree(dev_ptr)); }
+     * @brief Destructor to free the device memory allocated for the buffer
+     *        and clear any outstanding trackers.
+     */
+    ~ggml_backend_cann_buffer_context() {
+        {
+            // Explicitly clear any outstanding trackers under lock to ensure
+            // their host-side resources are released deterministically.
+            std::lock_guard<std::mutex> lock(tracker_mutex);
+            trackers.clear();
+        }
+        if (dev_ptr != nullptr) {
+            ACL_CHECK(aclrtFree(dev_ptr));
+            dev_ptr = nullptr;
+        }
+    }

Contributor Author


Same as above.

-            return memcmp(this->op_params, node->op_params, GGML_MAX_OP_PARAMS) == 0;
-        }
-        return true;
+        return memcmp(this->op_params, node->op_params, GGML_MAX_OP_PARAMS) == 0;

Copilot AI Mar 26, 2026


This changes equality semantics to always compare the full op_params buffer. If op_params is not deterministically initialized for ops that don’t use it, this can cause false inequality (hurting graph cache hits) and make behavior dependent on uninitialized bytes. The previous code only compared op_params for a subset of ops. Consider restoring conditional comparison (only for ops where op_params are meaningful) or ensuring op_params is always zero-initialized/populated deterministically for all nodes before being stored/compared.

Contributor Author


In ggml, all tensors are zero-initialized via memset(result, 0, ...) in ggml_new_tensor_impl. For ops that don't use op_params, the buffer remains all-zeros deterministically. Unconditional comparison is therefore safe and won't cause false cache misses. The previous conditional approach was fragile — it required manually maintaining the list of ops that use op_params, and missing one (as happened with POOL_2D) caused silent correctness bugs.

Comment on lines +437 to +439
float eps;
memcpy(&eps, dst->op_params, sizeof(float));


Copilot AI Mar 26, 2026


If eps can be 0.0f (or negative), clamping to [eps, FLT_MAX] does not guarantee the divisor is strictly positive, so Div can still produce inf/NaN when the norm is zero. Consider enforcing eps to be at least a small positive constant (or using max(eps, tiny) as the clamp minimum) before creating acl_min.

Contributor Author


Good point. Added eps = fmaxf(eps, FLT_MIN) as a defensive guard before creating the clamp scalar. This ensures the divisor is always strictly positive regardless of caller input, at negligible cost.
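A minimal sketch of that guard (assuming eps has already been read from op_params as in the hunk above):

#include <cfloat>
#include <cmath>

// FLT_MIN is the smallest positive normalized float, so using it as the floor
// guarantees the Clamp lower bound (and hence the divisor) is strictly positive.
static float safe_eps(float eps) {
    return fmaxf(eps, FLT_MIN);
}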

Comment on lines +463 to 469
// Clamp norm to at least eps: scale = 1/fmaxf(norm, eps)
acl_scalar_ptr acl_min = ggml_cann_create_scalar(&eps, aclDataType::ACL_FLOAT);
float flt_max = FLT_MAX;
acl_scalar_ptr acl_max = ggml_cann_create_scalar(&flt_max, aclDataType::ACL_FLOAT);
GGML_CANN_CALL_ACLNN_OP(ctx, Clamp, acl_div.get(), acl_min.get(), acl_max.get(), acl_div.get());

GGML_CANN_CALL_ACLNN_OP(ctx, Div, acl_src.get(), acl_div.get(), acl_dst.get());

Copilot AI Mar 26, 2026


If eps can be 0.0f (or negative), clamping to [eps, FLT_MAX] does not guarantee the divisor is strictly positive, so Div can still produce inf/NaN when the norm is zero. Consider enforcing eps to be at least a small positive constant (or using max(eps, tiny) as the clamp minimum) before creating acl_min.

The four commits in this PR:

1. CANN: fix multi-thread set_tensor race conditions

   (Commit message matches the PR description above.)

2. CANN: fix L2_NORM ignoring eps parameter

   The L2_NORM implementation was not using the eps parameter from
   op_params, causing incorrect results when eps is large (e.g. 10.0).
   The CPU reference computes scale = 1/fmaxf(norm, eps), so add a
   Clamp step to clamp the norm to at least eps before dividing.

3. ggml/cann: compare op_params for POOL_2D in ACL graph cache matching

   When ACL graph mode is enabled, the graph LRU cache checks whether a
   cached graph matches the current computation graph. Previously,
   GGML_OP_POOL_2D was not included in the op_params comparison, so two
   POOL_2D nodes with different pooling parameters (kernel size, stride,
   padding) but identical tensor shapes and addresses could incorrectly
   reuse a cached graph, leading to wrong results or aclnn errors.

   Add GGML_OP_POOL_2D to the list of ops that require op_params matching
   in ggml_graph_node_properties::has_matching_properties().

4. cann: fix ACL graph cache matching by adding tensor type and unconditional op_params comparison

   The ACL graph LRU cache was incorrectly reusing cached graphs for
   operations with different tensor types or op_params, causing test
   failures for CPY (f16 vs bf16), POOL_2D, L2_NORM, NORM_MUL_ADD,
   RMS_NORM_MUL_ADD, and ADD_RMS_NORM.

   Changes:
   - Add node_type and src_type[] fields to ggml_graph_node_properties
     so the cache can distinguish tensors with different types but
     identical ne/nb (e.g. f16 and bf16 both have 2-byte elements)
   - Compare op_params unconditionally for all ops instead of only for
     SCALE/UNARY/GLU/ROPE/POOL_2D
@hipudding hipudding marked this pull request as ready for review March 31, 2026 01:25
@noemotiovon
Collaborator

LGTM! This concurrency scenario hadn’t been considered before. The handling in ACL Graph is likely intended to avoid issues encountered when running accuracy tests in graph mode.

Member

@ggerganov ggerganov left a comment


Was this race not detected by the test-thread-safety test that we have? If not, can we create a test that would trigger this use case so we have coverage for other backends too?

Though I am not completely sure I understand the root cause of the race condition here.

@hipudding
Contributor Author

@ggerganov The current fix isn't about the issues covered by test-thread-safety. In the CANN backend, we have two optimizations for inference speed: quantized tensor memory restructuring and ND-to-NZ format conversion.
These conversions require the full tensor content to work. llama.cpp handles this fine because it doesn't split tensors during loading. However, ollama loads tensors via multiple threads, with each thread handling only a part of the data. This causes the set_tensor calls to fail during conversion because the data is incomplete.
We’ve added a tracker to keep track of the loading progress. The conversion and device upload now only happen once the tensor is completely loaded. This problem is strictly limited to quantization and ND/NZ conversions on CANN and does not affect other backends.

@hipudding hipudding requested a review from ggerganov March 31, 2026 11:59
@ggerganov
Member

In llama.cpp, wouldn't you have encountered the same problem when going through this path:

// If upload_backend is valid load the tensor in chunks to pinned memory and upload the buffers asynchronously to the GPU.
if (upload_backend) {
    size_t offset = weight->offs;
    alignment = file->read_alignment();
    size_t aligned_offset = offset & ~(alignment - 1);
    size_t offset_from_alignment = offset - aligned_offset;
    file->seek(aligned_offset, SEEK_SET);

    // Calculate aligned read boundaries
    size_t read_start = aligned_offset;
    size_t read_end = (offset + n_size + alignment - 1) & ~(alignment - 1);

    size_t bytes_read = 0;
    size_t data_read = 0; // Actual tensor data copied (excluding padding)

    while (bytes_read < read_end - read_start) {
        size_t read_size = std::min<size_t>(buffer_size, read_end - read_start - bytes_read);

        // Align the destination pointer within the pinned buffer
        uintptr_t ptr_dest_aligned = (reinterpret_cast<uintptr_t>(host_ptrs[buffer_idx]) + alignment - 1) & ~(alignment - 1);

        // Wait for previous upload to complete before reusing buffer
        ggml_backend_event_synchronize(events[buffer_idx]);

        // Read aligned chunk from file
        file->read_raw_unsafe(reinterpret_cast<void *>(ptr_dest_aligned), read_size);

        // Calculate actual data portion (excluding alignment padding)
        uintptr_t ptr_data = ptr_dest_aligned;
        size_t data_to_copy = read_size;

        // Skip alignment padding at start of first chunk
        if (bytes_read == 0) {
            ptr_data += offset_from_alignment;
            data_to_copy -= offset_from_alignment;
        }

        // Trim alignment padding at end of last chunk
        if (aligned_offset + bytes_read + read_size > offset + n_size) {
            data_to_copy -= (read_end - (offset + n_size));
        }

        // Async upload actual data to GPU
        ggml_backend_tensor_set_async(upload_backend, cur,
                                      reinterpret_cast<void *>(ptr_data), data_read, data_to_copy);
        ggml_backend_event_record(events[buffer_idx], upload_backend);

        data_read += data_to_copy;
        bytes_read += read_size;

        ++buffer_idx;
        buffer_idx %= n_buffers;
    }
} else {

@noemotiovon
Collaborator

The CANN backend currently declares async = false in ggml_backend_cann_device_get_props, so the upload_backend chunked upload path in llama-model-loader.cpp is never reached — upload_backend is always nullptr for CANN.

I tested on the master branch by setting async = true and using --no-mmap, which does enter the chunked upload path. The same issue occurs — the ND-to-NZ conversion is triggered on incomplete data.

So the problem exists in both the ollama multi-threaded loading path and the llama.cpp async upload path, but the latter is currently gated by async = false.
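A sketch of where that gate lives (abridged; the props struct and caps.async field are from ggml-backend.h, other fields elided):

static void ggml_backend_cann_device_get_props(ggml_backend_dev_t dev, struct ggml_backend_dev_props * props) {
    // ... name/description/memory fields elided ...
    props->caps.async = false;  // llama-model-loader only builds an upload_backend for async-capable devices
}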

@ggerganov ggerganov merged commit 632219a into ggml-org:master Mar 31, 2026
85 of 86 checks passed
slartibardfast pushed a commit to slartibardfast/llama.cpp that referenced this pull request Apr 12, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026

Labels

Ascend NPU (issues specific to Ascend NPUs), ggml (changes relating to the ggml tensor library for machine learning)
