CANN: fix multi-thread set_tensor race conditions #20151

Merged
ggerganov merged 4 commits into ggml-org:master from hipudding:ocp on Mar 31, 2026
Conversation

@hipudding
Contributor

When ollama calls ggml_backend_tensor_set from multiple threads (each writing a different chunk of the same tensor), the CANN backend had three concurrency issues:

  1. Quantized tensors (Q4_0/Q8_0) require a full-tensor format transform before uploading to device. Per-chunk transforms produced corrupt data.

  2. ND-to-NZ weight conversion requires complete tensor data on device. Per-chunk conversion operated on incomplete data.

  3. The global g_nz_workspaces array had unprotected concurrent access.

Fix by introducing a TensorSetTracker that accumulates write progress per tensor. For quantized tensors, raw data is staged in a host buffer and the transform + upload is deferred until all chunks arrive. For NZ weights, chunks are uploaded directly but conversion is deferred. The tracker and its staging buffer are released immediately after post-processing completes.

Add per-device mutex to g_nz_workspaces to prevent data races.


@hipudding
Contributor Author

Problem

ollama uses multiple threads to call ggml_backend_tensor_set, with each thread handling a different chunk (offset/size) of the same tensor. The current CANN backend implementation has three concurrency issues:

Issue 1: Quantized Tensor Format Transform Requires Full Data

For Q4_0/Q8_0 tensors, ggml_backend_cann_transform reorganizes the entire tensor's memory layout — separating quant values and scale factors into contiguous regions. The transform uses ggml_nelements(tensor) to compute global offsets, meaning it cannot operate on partial chunks. When called per-chunk, the output is corrupted.

Code path: ggml-cann.cpp:1242-1247 (the else branch in set_tensor)
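To make the failure mode concrete, here is a minimal sketch of such a layout split (the block struct is a simplified stand-in for ggml's block_q8_0, and transform_full is hypothetical, not the actual ggml_backend_cann_transform):

#include <cstdint>
#include <cstring>

struct block_q8_0_sketch { uint16_t d; int8_t qs[32]; };  // fp16 scale bits + 32 quants

// Split blocks into one contiguous quant region followed by one contiguous
// scale region. The scale region's start is a GLOBAL offset derived from the
// total block count, so it is only correct when n_blocks covers the whole tensor.
static void transform_full(const block_q8_0_sketch * src, uint8_t * dst, int64_t n_blocks) {
    uint8_t * quants = dst;
    uint8_t * scales = dst + 32 * n_blocks;
    for (int64_t i = 0; i < n_blocks; ++i) {
        std::memcpy(quants + 32 * i, src[i].qs, 32);
        std::memcpy(scales + sizeof(uint16_t) * i, &src[i].d, sizeof(uint16_t));
    }
}

// Calling transform_full on a half-tensor chunk derives the scale region from
// the chunk's block count, so scales land where the full tensor's quants
// belong; this is the per-chunk corruption described above.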

Issue 2: ND-to-NZ Format Conversion Requires Full Data

For matmul weight tensors, weight_format_to_nz calls aclnnTransMatmulWeight which converts the entire tensor from ND to NZ format. When called per-chunk (before all data is uploaded), the conversion operates on incomplete data.

Code path: ggml-cann.cpp:1237-1241

Issue 3: Global g_nz_workspaces Concurrent Access

g_nz_workspaces[device] is a global per-device workspace used in weight_format_to_nz. Multiple threads calling this concurrently can race on realloc() and get().

Code path: ggml-cann.cpp:1177, 1193-1206

Solution Design

Core Idea

Use a per-tensor tracker to accumulate write progress. Defer post-processing (quantized transform + device upload, or NZ conversion) until all chunks of a tensor have been written.

Data Structures

1. TensorSetTracker

Tracks how much data has been written for each tensor. Stored in the buffer context.

struct TensorSetTracker {
    std::mutex           mtx;            // protects this tracker
    size_t               bytes_written;  // accumulated bytes written so far
    size_t               total_bytes;    // ggml_nbytes(tensor), target to reach
    std::vector<uint8_t> host_buffer;    // staging buffer for quantized tensors only

    TensorSetTracker(size_t total, bool need_staging)
        : bytes_written(0), total_bytes(total) {
        if (need_staging) {
            host_buffer.resize(total);
        }
    }
};

2. Modifications to ggml_backend_cann_buffer_context

Add a map of active trackers and a mutex to protect the map itself.

struct ggml_backend_cann_buffer_context {
    int32_t device;
    void *  dev_ptr = nullptr;

    std::mutex                                                     tracker_mutex;
    std::unordered_map<ggml_tensor*, std::shared_ptr<TensorSetTracker>> trackers;

    // ... existing constructors/destructor
};

3. NZ Workspace Mutex

Add a std::mutex to ggml_cann_nz_workspace to protect per-device concurrent access.

struct ggml_cann_nz_workspace {
    std::mutex mtx;          // new: protects ptr/allocated
    void * ptr = nullptr;
    size_t allocated = 0;
    // ... existing methods, caller must hold mtx
};
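Callers then take the per-device lock around any workspace use, e.g.:

{
    // Serialize per-device workspace access around the deferred conversion
    // (mirrors the locking step in the set_tensor pseudocode below).
    std::lock_guard<std::mutex> lock(g_nz_workspaces[device].mtx);
    weight_format_to_nz(tensor, 0, device);  // the offset parameter is dropped later in this design
}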

Modified ggml_backend_cann_buffer_set_tensor Logic

function set_tensor(buffer, tensor, data, offset, size):
    ctx = buffer->context
    set_device(ctx->device)

    needs_transform = need_transform(tensor->type)
    needs_nz = weight_to_nz && is_matmul_weight(tensor)

    if !needs_transform && !needs_nz:
        // Case 1: Plain tensor, no post-processing needed
        // Direct memcpy is safe per-chunk, no tracker needed
        aclrtMemcpy(tensor->data + offset, size, data, size, HOST_TO_DEVICE)
        return

    // Case 2 & 3: Need post-processing, use tracker
    tracker = get_or_create_tracker(ctx, tensor, needs_transform)

    lock(tracker->mtx)

    if needs_transform:
        // Stage data in host buffer at the correct offset
        memcpy(tracker->host_buffer.data() + offset, data, size)
    else:
        // NZ case: upload chunk to device immediately (safe, different offsets)
        aclrtMemcpy(tensor->data + offset, size, data, size, HOST_TO_DEVICE)

    tracker->bytes_written += size
    all_done = (tracker->bytes_written >= tracker->total_bytes)

    unlock(tracker->mtx)

    if all_done:
        if needs_transform:
            // All data staged, now transform entire tensor and upload
            transform_buffer = malloc(total_bytes)
            ggml_backend_cann_transform(tensor, tracker->host_buffer.data(), transform_buffer)
            aclrtMemcpy(tensor->data, total_bytes, transform_buffer, total_bytes, HOST_TO_DEVICE)
            free(transform_buffer)

        if needs_nz:
            // All data on device, now convert entire tensor to NZ
            lock(g_nz_workspaces[device].mtx)
            weight_format_to_nz(tensor, 0, device)
            unlock(g_nz_workspaces[device].mtx)

        // Cleanup tracker
        lock(ctx->tracker_mutex)
        ctx->trackers.erase(tensor)
        unlock(ctx->tracker_mutex)

Helper: get_or_create_tracker

function get_or_create_tracker(ctx, tensor, needs_staging):
    lock(ctx->tracker_mutex)
    if tensor not in ctx->trackers:
        total = ggml_nbytes(tensor)
        ctx->trackers[tensor] = make_shared<TensorSetTracker>(total, needs_staging)
    tracker = ctx->trackers[tensor]
    unlock(ctx->tracker_mutex)
    return tracker
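A C++ sketch of this helper, assuming the shared_ptr map from the buffer-context design above (the merged code differs slightly in ownership details):

static std::shared_ptr<TensorSetTracker> get_or_create_tracker(
        ggml_backend_cann_buffer_context * ctx, ggml_tensor * tensor, bool needs_staging) {
    std::lock_guard<std::mutex> lock(ctx->tracker_mutex);
    auto it = ctx->trackers.find(tensor);
    if (it == ctx->trackers.end()) {
        // First chunk for this tensor: create the tracker (and staging buffer if needed)
        it = ctx->trackers.emplace(tensor,
                std::make_shared<TensorSetTracker>(ggml_nbytes(tensor), needs_staging)).first;
    }
    return it->second;
}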

Modified weight_format_to_nz

Remove the offset parameter since it's always called on the complete tensor now.

// Before: static void weight_format_to_nz(ggml_tensor * tensor, size_t offset, int device)
// After:  static void weight_format_to_nz(ggml_tensor * tensor, int device)
//         Always converts from offset 0 (full tensor)

Single-Thread Compatibility

When called single-threaded (one call with offset=0, size=ggml_nbytes), the tracker is created, bytes_written reaches total_bytes in the same call, post-processing executes immediately, and the tracker is cleaned up. No behavior change.
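For example, llama.cpp's usual non-chunked load path reduces to a single full-tensor call through the public ggml-backend API:

// One call covers the whole tensor: the tracker is created, bytes_written
// reaches total_bytes immediately, post-processing runs, and the tracker is erased.
ggml_backend_tensor_set(tensor, data, 0, ggml_nbytes(tensor));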

Thread Safety Summary

| Resource | Protection | Contention |
| --- | --- | --- |
| tracker map (ctx->trackers) | ctx->tracker_mutex | Low: only on first/last chunk per tensor |
| individual tracker | tracker->mtx | Low: brief lock for memcpy offset + counter increment |
| g_nz_workspaces[device] | g_nz_workspaces[device].mtx | Very low: only one call per tensor (the last chunk's thread) |
| device memory (tensor->data) | No lock needed | Each thread writes to a different offset; post-processing is single-threaded |

Files to Modify

  1. ggml/src/ggml-cann/ggml-cann.cpp:
    • Add #include <mutex>, #include <unordered_map>, #include <vector>, #include <memory> (check existing includes)
    • Add TensorSetTracker struct
    • Modify ggml_backend_cann_buffer_context: add tracker map + mutex
    • Add std::mutex to ggml_cann_nz_workspace
    • Modify weight_format_to_nz: remove offset parameter
    • Rewrite ggml_backend_cann_buffer_set_tensor

@github-actions Bot added the ggml (changes relating to the ggml tensor library for machine learning) and Ascend NPU (issues specific to Ascend NPUs) labels Mar 6, 2026
@hipudding force-pushed the ocp branch 2 times, most recently from aeee6b4 to 51a28d0 on March 26, 2026 13:23
@hipudding hipudding marked this pull request as ready for review March 26, 2026 13:27
@hipudding hipudding requested a review from a team as a code owner March 26, 2026 13:27
Copilot AI review requested due to automatic review settings March 26, 2026 13:27

Copilot AI left a comment


Pull request overview

Fixes CANN backend concurrency issues when ggml_backend_tensor_set is invoked from multiple threads writing different chunks of the same tensor, by deferring full-tensor post-processing until all chunks arrive and by protecting shared NZ workspace state.

Changes:

  • Add TensorSetTracker + per-buffer tracker map to accumulate per-tensor chunk progress and defer quantized transform / ND→NZ conversion.
  • Add per-device mutex to g_nz_workspaces to prevent concurrent workspace realloc/use races.
  • Tighten ACL graph node property comparisons by including tensor type in equality checks.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| ggml/src/ggml-cann/ggml-cann.cpp | Introduces chunk-write tracking + deferred post-processing; adds mutex for global NZ workspaces. |
| ggml/src/ggml-cann/common.h | Extends graph node property equality with type checks and changes op_params comparison behavior. |
| ggml/src/ggml-cann/aclnn_ops.cpp | Clamps L2 norm denominator using eps from op_params to avoid tiny divisors. |


Comment on lines +1316 to 1350
TensorSetTracker * tracker = ctx->get_or_create_tracker(tensor);
std::unique_lock<std::mutex> lock(tracker->mtx);

if (is_quantized) {
    // Stage data in host buffer; transform requires full tensor data
    if (tracker->host_buffer.empty()) {
        tracker->host_buffer.resize(tracker->total_bytes);
    }
    memcpy(tracker->host_buffer.data() + offset, data, size);
} else {
    // NZ weight: upload chunk to device immediately, defer conversion
    ACL_CHECK(aclrtMemcpy((char *) tensor->data + offset, size, data, size, ACL_MEMCPY_HOST_TO_DEVICE));
}

tracker->bytes_written += size;

// All chunks received: perform deferred transform/conversion
if (tracker->bytes_written >= tracker->total_bytes) {
    if (is_quantized) {
        void * transform_buffer = malloc(tracker->total_bytes);
        ggml_backend_cann_transform(tensor, tracker->host_buffer.data(), transform_buffer);
        ACL_CHECK(aclrtMemcpy(tensor->data, tracker->total_bytes, transform_buffer, tracker->total_bytes, ACL_MEMCPY_HOST_TO_DEVICE));
        free(transform_buffer);
    }

    if (is_nz) {
        GGML_ASSERT(tensor->ne[2] == 1);
        GGML_ASSERT(tensor->ne[3] == 1);
        weight_format_to_nz(tensor, ctx->device);
    }

    // Unlock before removing tracker, as remove_tracker destroys the mutex
    lock.unlock();
    ctx->remove_tracker(tensor);
}

Copilot AI Mar 26, 2026


The tracker can be destroyed while other threads are still able to obtain/use the same tracker pointer. After lock.unlock() (1348) and before/while remove_tracker() erases the unique_ptr, another thread can acquire tracker->mtx or continue operating with a raw pointer to the tracker, causing use-after-free (and potentially destroying a mutex while other threads wait on it, which is undefined behavior). A robust fix is to manage tracker lifetime with shared ownership (e.g., store std::shared_ptr<TensorSetTracker> in the map and return a shared_ptr from get_or_create_tracker), and add a finalized flag under tracker->mtx so only one thread performs deferred transform/conversion and map removal; erase from the map can happen while the object remains alive until all in-flight callers drop their shared_ptr.

Contributor Author


Thank you for the careful analysis. In practice, the race described here cannot occur: all chunk-write threads for the same tensor are serialized by tracker->mtx. Only the thread that increments bytes_written past total_bytes will enter the cleanup path, and by that point all other threads have already released the lock and exited set_tensor. No new threads will call get_or_create_tracker for this tensor since all chunks have been written. Switching to shared_ptr + finalized flag would add complexity without practical benefit given the ggml calling convention.

Comment on lines +1317 to +1328
std::unique_lock<std::mutex> lock(tracker->mtx);

if (is_quantized) {
    // Stage data in host buffer; transform requires full tensor data
    if (tracker->host_buffer.empty()) {
        tracker->host_buffer.resize(tracker->total_bytes);
    }
    memcpy(tracker->host_buffer.data() + offset, data, size);
} else {
    // NZ weight: upload chunk to device immediately, defer conversion
    ACL_CHECK(aclrtMemcpy((char *) tensor->data + offset, size, data, size, ACL_MEMCPY_HOST_TO_DEVICE));
}

Copilot AI Mar 26, 2026


For NZ weights, aclrtMemcpy is performed while holding tracker->mtx, which serializes chunk uploads for the same tensor and can largely negate the benefit of multi-threaded chunked writes. Consider narrowing the critical section: copy the chunk to device outside the tracker lock (while still ensuring the tracker cannot be finalized/freed mid-copy via a lifetime mechanism such as shared_ptr, and a “finalizing/in_flight” state), then re-lock only to update progress and potentially trigger finalization.

Contributor Author


Acknowledged. However, NZ weight upload only happens during model loading (not on the inference hot path), and each aclrtMemcpy for a chunk is a short DMA transfer. Narrowing the critical section would require a two-phase lock pattern with additional lifetime management, adding complexity with negligible real-world performance gain. Keeping it simple for now.

Comment on lines +1330 to +1333
tracker->bytes_written += size;

// All chunks received: perform deferred transform/conversion
if (tracker->bytes_written >= tracker->total_bytes) {

Copilot AI Mar 26, 2026


Completion is inferred purely from bytes_written += size, which assumes chunks are disjoint and written exactly once. If callers ever retry a chunk, overlap ranges, or write the same chunk twice, bytes_written can reach total_bytes early and trigger transform/conversion with a partially-populated host_buffer (quantized) or incomplete device data (NZ). A safer approach is to track received ranges (e.g., interval set / bitmap in units of a suitable block size) and only finalize when coverage reaches total_bytes, or at minimum detect/guard against double-counting by tracking (offset,size) segments.
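For illustration, a minimal sketch of the interval-coverage alternative suggested here (hypothetical; the PR keeps the simple byte counter, see the reply below):

#include <cstddef>
#include <iterator>
#include <map>

struct CoverageTracker {
    std::map<size_t, size_t> ranges;  // start -> end, kept disjoint and merged

    void add(size_t offset, size_t size) {
        size_t start = offset, end = offset + size;
        auto it = ranges.lower_bound(start);
        // Pull in a preceding range that overlaps or touches [start, end)
        if (it != ranges.begin() && std::prev(it)->second >= start) {
            --it;
        }
        while (it != ranges.end() && it->first <= end) {
            start = start < it->first  ? start : it->first;
            end   = end   > it->second ? end   : it->second;
            it = ranges.erase(it);
        }
        ranges[start] = end;
    }

    // Complete only when one merged range covers [0, total); double-counted
    // or overlapping chunks can no longer trigger early finalization.
    bool complete(size_t total) const {
        return ranges.size() == 1 && ranges.begin()->first == 0 && ranges.begin()->second >= total;
    }
};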

Contributor Author


The ggml backend_tensor_set contract guarantees that callers write disjoint, non-overlapping chunks exactly once (this is how ollama's multi-threaded loader works). Adding interval tracking or bitmap coverage would be over-engineering for a scenario that cannot occur under the current API contract. If the contract changes in the future, we can revisit.

Comment on lines +801 to +803
std::mutex tracker_mutex; ///< Protects the trackers map
std::unordered_map<void *, std::unique_ptr<TensorSetTracker>> trackers;


Copilot AI Mar 26, 2026


If a tensor’s chunked upload never reaches completion (e.g., caller aborts mid-load), its tracker (and potentially large host_buffer) will remain resident in trackers until buffer context destruction, which can cause unbounded host memory growth over time. Consider adding explicit cleanup paths (e.g., clearing trackers in the destructor before freeing the device pointer, and/or exposing a cancellation/cleanup hook when tensors/buffers are reset) so partially-written tensors don’t accumulate indefinitely.

Contributor Author


The trackers map holds std::unique_ptr, which are automatically destroyed when the map is destroyed as part of ggml_backend_cann_buffer_context destruction. C++ RAII already guarantees deterministic cleanup — no explicit trackers.clear() is needed. Adding it would be redundant but harmless; omitting it keeps the code minimal.

Comment on lines 813 to +816
* @brief Destructor to free the device memory allocated for the buffer.
*/
~ggml_backend_cann_buffer_context() { ACL_CHECK(aclrtFree(dev_ptr)); }


Copilot AI Mar 26, 2026


If a tensor’s chunked upload never reaches completion (e.g., caller aborts mid-load), its tracker (and potentially large host_buffer) will remain resident in trackers until buffer context destruction, which can cause unbounded host memory growth over time. Consider adding explicit cleanup paths (e.g., clearing trackers in the destructor before freeing the device pointer, and/or exposing a cancellation/cleanup hook when tensors/buffers are reset) so partially-written tensors don’t accumulate indefinitely.

Suggested change

-     * @brief Destructor to free the device memory allocated for the buffer.
-     */
-    ~ggml_backend_cann_buffer_context() { ACL_CHECK(aclrtFree(dev_ptr)); }
+     * @brief Destructor to free the device memory allocated for the buffer
+     *        and clear any outstanding trackers.
+     */
+    ~ggml_backend_cann_buffer_context() {
+        {
+            // Explicitly clear any outstanding trackers under lock to ensure
+            // their host-side resources are released deterministically.
+            std::lock_guard<std::mutex> lock(tracker_mutex);
+            trackers.clear();
+        }
+        if (dev_ptr != nullptr) {
+            ACL_CHECK(aclrtFree(dev_ptr));
+            dev_ptr = nullptr;
+        }
+    }

Contributor Author


Same as above.

-            return memcmp(this->op_params, node->op_params, GGML_MAX_OP_PARAMS) == 0;
-        }
-        return true;
+        return memcmp(this->op_params, node->op_params, GGML_MAX_OP_PARAMS) == 0;

Copilot AI Mar 26, 2026


This changes equality semantics to always compare the full op_params buffer. If op_params is not deterministically initialized for ops that don’t use it, this can cause false inequality (hurting graph cache hits) and make behavior dependent on uninitialized bytes. The previous code only compared op_params for a subset of ops. Consider restoring conditional comparison (only for ops where op_params are meaningful) or ensuring op_params is always zero-initialized/populated deterministically for all nodes before being stored/compared.

Contributor Author


In ggml, all tensors are zero-initialized via memset(result, 0, ...) in ggml_new_tensor_impl. For ops that don't use op_params, the buffer remains all-zeros deterministically. Unconditional comparison is therefore safe and won't cause false cache misses. The previous conditional approach was fragile — it required manually maintaining the list of ops that use op_params, and missing one (as happened with POOL_2D) caused silent correctness bugs.

Comment on lines +437 to +439
float eps;
memcpy(&eps, dst->op_params, sizeof(float));


Copilot AI Mar 26, 2026


If eps can be 0.0f (or negative), clamping to [eps, FLT_MAX] does not guarantee the divisor is strictly positive, so Div can still produce inf/NaN when the norm is zero. Consider enforcing eps to be at least a small positive constant (or using max(eps, tiny) as the clamp minimum) before creating acl_min.

Contributor Author


Good point. Added eps = fmaxf(eps, FLT_MIN) as a defensive guard before creating the clamp scalar. This ensures the divisor is always strictly positive regardless of caller input, at negligible cost.
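A minimal sketch of that guard (assuming eps has already been read from op_params as in the hunk above):

#include <cfloat>
#include <cmath>

// FLT_MIN is the smallest positive normalized float, so using it as the floor
// guarantees the Clamp lower bound (and hence the divisor) is strictly positive.
static float safe_eps(float eps) {
    return fmaxf(eps, FLT_MIN);
}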

Comment on lines +463 to 469
// Clamp norm to at least eps: scale = 1/fmaxf(norm, eps)
acl_scalar_ptr acl_min = ggml_cann_create_scalar(&eps, aclDataType::ACL_FLOAT);
float flt_max = FLT_MAX;
acl_scalar_ptr acl_max = ggml_cann_create_scalar(&flt_max, aclDataType::ACL_FLOAT);
GGML_CANN_CALL_ACLNN_OP(ctx, Clamp, acl_div.get(), acl_min.get(), acl_max.get(), acl_div.get());

GGML_CANN_CALL_ACLNN_OP(ctx, Div, acl_src.get(), acl_div.get(), acl_dst.get());

Copilot AI Mar 26, 2026


If eps can be 0.0f (or negative), clamping to [eps, FLT_MAX] does not guarantee the divisor is strictly positive, so Div can still produce inf/NaN when the norm is zero. Consider enforcing eps to be at least a small positive constant (or using max(eps, tiny) as the clamp minimum) before creating acl_min.

The four commits in this PR:

1. CANN: fix multi-thread set_tensor race conditions

   (Commit message matches the PR description above.)

2. CANN: fix L2_NORM ignoring eps parameter

   The L2_NORM implementation was not using the eps parameter from
   op_params, causing incorrect results when eps is large (e.g. 10.0).
   The CPU reference computes scale = 1/fmaxf(norm, eps), so add a
   Clamp step to clamp the norm to at least eps before dividing.

3. ggml/cann: compare op_params for POOL_2D in ACL graph cache matching

   When ACL graph mode is enabled, the graph LRU cache checks whether a
   cached graph matches the current computation graph. Previously,
   GGML_OP_POOL_2D was not included in the op_params comparison, so two
   POOL_2D nodes with different pooling parameters (kernel size, stride,
   padding) but identical tensor shapes and addresses could incorrectly
   reuse a cached graph, leading to wrong results or aclnn errors.

   Add GGML_OP_POOL_2D to the list of ops that require op_params matching
   in ggml_graph_node_properties::has_matching_properties().

4. cann: fix ACL graph cache matching by adding tensor type and unconditional op_params comparison

   The ACL graph LRU cache was incorrectly reusing cached graphs for
   operations with different tensor types or op_params, causing test
   failures for CPY (f16 vs bf16), POOL_2D, L2_NORM, NORM_MUL_ADD,
   RMS_NORM_MUL_ADD, and ADD_RMS_NORM.

   Changes:
   - Add node_type and src_type[] fields to ggml_graph_node_properties
     so the cache can distinguish tensors with different types but
     identical ne/nb (e.g. f16 and bf16 both have 2-byte elements)
   - Compare op_params unconditionally for all ops instead of only for
     SCALE/UNARY/GLU/ROPE/POOL_2D
@hipudding hipudding marked this pull request as ready for review March 31, 2026 01:25
@noemotiovon
Collaborator

LGTM! This concurrency scenario hadn’t been considered before. The handling in ACL Graph is likely intended to avoid issues encountered when running accuracy tests in graph mode.

Member

@ggerganov ggerganov left a comment


Was this race not detected by the test-thread-safety test that we have? If not, can we create a test that would trigger this use case so we have coverage for other backends too?

Though I am not completely sure I understand the root cause of the race condition here.

@hipudding
Contributor Author

@ggerganov The current fix isn't about the issues covered by test-thread-safety. In the CANN backend, we have two optimizations for inference speed: quantized tensor memory restructuring and ND-to-NZ format conversion.
These conversions require the full tensor content to work. llama.cpp handles this fine because it doesn't split tensors during loading. However, ollama loads tensors via multiple threads, with each thread handling only a part of the data. This causes the set_tensor calls to fail during conversion because the data is incomplete.
We’ve added a tracker to keep track of the loading progress. The conversion and device upload now only happen once the tensor is completely loaded. This problem is strictly limited to quantization and ND/NZ conversions on CANN and does not affect other backends.

@hipudding hipudding requested a review from ggerganov March 31, 2026 11:59
@ggerganov
Member

In llama.cpp, wouldn't you have encountered the same problem when going through this path:

// If upload_backend is valid load the tensor in chunks to pinned memory and upload the buffers asynchronously to the GPU.
if (upload_backend) {
    size_t offset = weight->offs;
    alignment = file->read_alignment();
    size_t aligned_offset = offset & ~(alignment - 1);
    size_t offset_from_alignment = offset - aligned_offset;
    file->seek(aligned_offset, SEEK_SET);

    // Calculate aligned read boundaries
    size_t read_start = aligned_offset;
    size_t read_end = (offset + n_size + alignment - 1) & ~(alignment - 1);

    size_t bytes_read = 0;
    size_t data_read = 0; // Actual tensor data copied (excluding padding)

    while (bytes_read < read_end - read_start) {
        size_t read_size = std::min<size_t>(buffer_size, read_end - read_start - bytes_read);

        // Align the destination pointer within the pinned buffer
        uintptr_t ptr_dest_aligned = (reinterpret_cast<uintptr_t>(host_ptrs[buffer_idx]) + alignment - 1) & ~(alignment - 1);

        // Wait for previous upload to complete before reusing buffer
        ggml_backend_event_synchronize(events[buffer_idx]);

        // Read aligned chunk from file
        file->read_raw_unsafe(reinterpret_cast<void *>(ptr_dest_aligned), read_size);

        // Calculate actual data portion (excluding alignment padding)
        uintptr_t ptr_data = ptr_dest_aligned;
        size_t data_to_copy = read_size;

        // Skip alignment padding at start of first chunk
        if (bytes_read == 0) {
            ptr_data += offset_from_alignment;
            data_to_copy -= offset_from_alignment;
        }

        // Trim alignment padding at end of last chunk
        if (aligned_offset + bytes_read + read_size > offset + n_size) {
            data_to_copy -= (read_end - (offset + n_size));
        }

        // Async upload actual data to GPU
        ggml_backend_tensor_set_async(upload_backend, cur,
                                      reinterpret_cast<void *>(ptr_data), data_read, data_to_copy);
        ggml_backend_event_record(events[buffer_idx], upload_backend);

        data_read += data_to_copy;
        bytes_read += read_size;

        ++buffer_idx;
        buffer_idx %= n_buffers;
    }
} else {

@noemotiovon
Collaborator

The CANN backend currently declares async = false in ggml_backend_cann_device_get_props, so the upload_backend chunked upload path in llama-model-loader.cpp is never reached — upload_backend is always nullptr for CANN.

I tested on the master branch by setting async = true and using --no-mmap, which does enter the chunked upload path. The same issue occurs — the ND-to-NZ conversion is triggered on incomplete data.

So the problem exists in both the ollama multi-threaded loading path and the llama.cpp async upload path, but the latter is currently gated by async = false.
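A sketch of where that gate lives (abridged; the props struct and caps.async field are from ggml-backend.h, other fields elided):

static void ggml_backend_cann_device_get_props(ggml_backend_dev_t dev, struct ggml_backend_dev_props * props) {
    // ... name/description/memory fields elided ...
    props->caps.async = false;  // llama-model-loader only builds an upload_backend for async-capable devices
}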

@ggerganov ggerganov merged commit 632219a into ggml-org:master Mar 31, 2026
85 of 86 checks passed
slartibardfast pushed a commit to slartibardfast/llama.cpp that referenced this pull request Apr 12, 2026
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request Apr 26, 2026
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026

Labels

Ascend NPU (issues specific to Ascend NPUs), ggml (changes relating to the ggml tensor library for machine learning)
