
feat(Ascend): Support Qwen3 on Ascend Backend#671

Merged
chenghuaWang merged 8 commits into UbiquitousLearning:main from lywbarca:ascend
Apr 28, 2026

Conversation


lywbarca commented Apr 28, 2026

Please check Guidelines for Contributing.

Summary by CodeRabbit

  • New Features

    • Full Ascend backend with graph-accelerated Qwen inference, W8A8 activation quantization path, KV-cache support, and a new example runner for QA/generation.
    • Qwen Ascend model variant and tokenizer with sample config.
  • Documentation

    • New Ascend backend setup, usage, and Qwen Ascend guides added.
  • Tools / Observability

    • Larger default Ascend memory pool and runtime memory statistics/reporting.
  • Tests

    • Extensive Ascend kernel and graph tests covering ops, graphs, and W8A8 pipelines.


coderabbitai Bot commented Apr 28, 2026

📝 Walkthrough

Builds a full Ascend backend: adds ATB graph support, many new Ascend eager and graph ops (including W8A8), KV-cache, memory manager instrumentation, Qwen Ascend model and example runner, extensive tests, and CMake/ABI detection for ATB integration. (50 words)

Changes

Cohort / File(s) Summary
Build & CMake
CMakeLists.txt, examples/CMakeLists.txt, examples/qwen_ascend/CMakeLists.txt, mllm/backends/ascend/CMakeLists.txt, mllm/backends/ascend/graph/CMakeLists.txt, tests/ascend/CMakeLists.txt
Adds ATB/Ascend detection, aligns C++ ABI for ATB, globs graph sources, conditionally builds qwen_ascend example, and wires ATB include/lib paths for tests.
Documentation
docs/ascend_backend/*
Large rewrite and new pages: core design, setup/usage, Qwen Ascend guide, graph/eager execution, KV/RoPE specs, profiling env vars, and limitations.
Graph Infra
mllm/backends/ascend/graph/...
Adds AscendGraphBuilder, AscendGraphExecutor, object library target, and graph-runner primitives with persistent workspace and optional profiling.
Graph Plugin Ops
mllm/backends/ascend/graph/*
Introduces plugin operations (AttentionWithKVCache, CausalMask*, Clamp, Round, LinearW8A8, DynamicLinearW8A8, etc.) with factories, workspace handling, and subgraph logic.
Backend Core & Registration
mllm/backends/ascend/AscendBackend.*, Register.cpp
Registers many ops (transpose, embedding, gather, RoPE, fill, copy, causal mask, graph markers), increases pool sizing, and marks weights as on-device.
Common Utilities & Kernels
mllm/backends/ascend/AscendCommon.*, ops/AscendCausalMaskKernel.*
Adds dtype mapping, CPU→Ascend FP16 conversion helper, and causal-mask kernel implementation.
Memory Manager & Pool
mllm/backends/ascend/memory/*, MemoryBlock.hpp
Enlarges default pool, adds best-fit allocation, telemetry counters/histograms, block metadata, and public printStats().
Eager Operators
mllm/backends/ascend/ops/*
Adds many eager ops (Copy, Fill, Gather, Embedding, RoPE, Transpose, CausalMask, KVCache, Linear dynamic W8A8 path, etc.), updates Linear load/dispatch for INT8/W8A8, and minor timing/logging changes.
Linear Quant & Artifacts
mllm/backends/ascend/ops/AscendLinearQuant.*
Prepares W8A8 artifacts (scale validation, trans-quant params, bias int32) and helpers for INT8 weight handling.
Qwen Ascend Model & Tokenizer
mllm/models/qwen_ascend/*
Adds QwenAscend config, tokenizer, RoPE cache, graph-op factories, graph-accelerated decoder (FP16/W8A8), model wiring, and example runner APIs.
Examples
examples/qwen_ascend/*
New qwen_ascend example executable, CMake target, and model config with CLI options for smoke tests and QA generation.
Tests
tests/ascend/*
Adds broad Ascend tests: graph builder, linear/softmax graphs, causal-mask graph, aclnn kernels, broadcast, embedding, RoPE, transpose, W8A8 end-to-end pipeline, and GoogleTest wiring.

Sequence Diagram(s)

sequenceDiagram
  participant CLI as Client (example/test)
  participant Builder as AscendGraphBuilder
  participant Exec as AscendGraphExecutor
  participant MM as AscendMemoryManager
  participant ATB as ATB/ACL Runtime (device)

  CLI->>Builder: beginGraph(inputs, outputs, ops...)
  Builder->>Builder: create operations, infer shapes
  CLI->>Builder: build()
  Builder-->>CLI: atb::Operation* graph_op
  CLI->>Exec: new AscendGraphExecutor(graph_op, context)
  Exec->>ATB: graph_op->Setup(VariantPack) -> required_workspace
  ATB-->>Exec: workspace_size
  Exec->>MM: ensure workspace allocation (alloc/resize)
  CLI->>Exec: execute(inputs)
  Exec->>ATB: graph_op->Execute(VariantPack, workspace)
  ATB-->>Exec: run on device (async)
  Exec->>ATB: sync global stream
  Exec-->>CLI: outputs

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • PR #621 — Overlapping Ascend operator implementations and backend registration changes.
  • PR #629 — Related operator updates (Concat/Slice) and backend registration overlap.
  • PR #564 — Foundational Ascend backend additions (allocator/memory/registration) that intersect with these changes.

Suggested reviewers

  • liang1232018
  • yirongjie
  • chenghuaWang
  • oreomaker

"🐰
I hopped through graphs and bytes today,
W8A8 and FP16 led the way,
KV caches hummed, kernels took flight,
Qwen on Ascend—what a sight! ✨"



coderabbitai Bot left a comment


Actionable comments posted: 19

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🟡 Minor comments (9)
mllm/backends/ascend/ops/AscendGatherOps.cpp-98-100 (1)

98-100: ⚠️ Potential issue | 🟡 Minor

Resource leak on partial tensor creation failure.

If createAclTensor succeeds for some tensors but fails for others (returns nullptr), the successful tensors are not destroyed before MLLM_ERROR_EXIT is called. This can leak ACL tensor resources.

🐛 Proposed fix
   if (acl_weight == nullptr || acl_indices == nullptr || acl_output == nullptr) {
+    if (acl_weight) aclDestroyTensor(acl_weight);
+    if (acl_indices) aclDestroyTensor(acl_indices);
+    if (acl_output) aclDestroyTensor(acl_output);
     MLLM_ERROR_EXIT(ExitCode::kAscendError, "AscendGatherOp: Failed to create aclTensor");
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/backends/ascend/ops/AscendGatherOps.cpp` around lines 98 - 100, The
current check that calls MLLM_ERROR_EXIT if any of acl_weight, acl_indices, or
acl_output is nullptr can leak already-created ACL tensors; before calling
MLLM_ERROR_EXIT, explicitly destroy each tensor that was successfully created
(e.g., call the ACL tensor destroy function on acl_weight, acl_indices,
acl_output if they are non-null) so no resources remain allocated, then call
MLLM_ERROR_EXIT; update the block around the createAclTensor calls where
variables acl_weight, acl_indices, acl_output are checked and ensure each
non-null pointer is cleaned up individually prior to exiting.
mllm/models/qwen_ascend/configuration_qwen_ascend.hpp-12-30 (1)

12-30: ⚠️ Potential issue | 🟡 Minor

Add docstrings for public API struct and constructor.

The QwenAscendConfig struct and its constructor are public APIs but lack documentation explaining their purpose, parameters, and behavior. This violates the coding guideline requiring clear docstrings for public APIs, classes, and functions.

Add a docstring similar to this pattern:

/**
 * `@brief` Configuration struct for Qwen Ascend model.
 *
 * Loads model configuration from a JSON file and provides
 * access to model hyperparameters with built-in defaults.
 */
struct QwenAscendConfig : protected ConfigFile {
  /**
   * `@brief` Loads configuration from a JSON file.
   * `@param` file_path Path to the JSON configuration file.
   * `@throws` std::runtime_error if the file cannot be opened or JSON parsing fails.
   */
  explicit QwenAscendConfig(const std::string& file_path) : ConfigFile(file_path) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/models/qwen_ascend/configuration_qwen_ascend.hpp` around lines 12 - 30,
Add documentation comments for the public struct QwenAscendConfig and its
constructor: place a brief docstring above the struct describing its role as the
configuration container for the Qwen Ascend model (loading JSON config and
exposing hyperparameters), and a docstring above the explicit
QwenAscendConfig(const std::string& file_path) constructor that documents the
file_path parameter and notes that it may throw std::runtime_error on file
open/JSON parse failure; follow the existing project docstring style and
reference that QwenAscendConfig inherits from ConfigFile.
tests/ascend/AscendLinearSoftmaxGraphTest.hpp-48-116 (1)

48-116: ⚠️ Potential issue | 🟡 Minor

Destroy child operations after builder.build().

The AscendGraphBuilder does not take ownership of child operations; it only destroys its internal builder state. Only AscendGraphExecutor takes ownership of the assembled graph_op. On the success path, linear_op and softmax_op are never destroyed and leak.

After graph_op = builder.build(), explicitly destroy both child operations:

atb::DestroyOperation(linear_op);
atb::DestroyOperation(softmax_op);

The executor will handle destruction of the final graph_op when it goes out of scope.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/ascend/AscendLinearSoftmaxGraphTest.hpp` around lines 48 - 116, The
builder.build() call does not take ownership of the child ops (linear_op,
softmax_op) so on the success path you must explicitly free them after graph_op
is created; locate the builder.build() invocation and, if graph_op != nullptr,
call atb::DestroyOperation(linear_op) and atb::DestroyOperation(softmax_op)
before proceeding to create AscendGraphExecutor/execute so the executor can take
ownership of only graph_op.
mllm/backends/ascend/graph/AscendGraphExecutor.hpp-26-49 (1)

26-49: ⚠️ Potential issue | 🟡 Minor

Move deleted special members to public.

According to the Google C++ Style Guide and Chromium's C++ guidelines, deleted special member functions must be declared in the public section, not private. This makes the non-copyability and non-moveability of the class explicitly clear as part of its public interface.

Suggested cleanup
 class AscendGraphExecutor {
  public:
   // Construct executor for a graph operation and take ownership of it.
   AscendGraphExecutor(atb::Operation* graph_op, atb::Context* context);
 
   // Release graph operation and workspace resources.
   ~AscendGraphExecutor();
 
   // Execute the graph with given inputs and outputs.
   void execute(const std::vector<Tensor>& inputs,
                std::vector<Tensor>& outputs);
 
   // Get current workspace size in bytes.
   uint64_t workspaceSize() const { return workspace_size_; }
 
   // Get the graph operation pointer.
   atb::Operation* graphOp() const { return graph_op_; }
 
+  AscendGraphExecutor(const AscendGraphExecutor&) = delete;
+  AscendGraphExecutor& operator=(const AscendGraphExecutor&) = delete;
+  AscendGraphExecutor(AscendGraphExecutor&&) = delete;
+  AscendGraphExecutor& operator=(AscendGraphExecutor&&) = delete;
 
  private:
   atb::Operation* graph_op_;
   atb::Context* context_;
   void* workspace_;
   uint64_t workspace_size_;
   int workspace_block_id_;
-
-  // Disable copy construction.
-  AscendGraphExecutor(const AscendGraphExecutor&) = delete;
-
-  // Disable copy assignment.
-  AscendGraphExecutor& operator=(const AscendGraphExecutor&) = delete;
-
-  // Disable move construction.
-  AscendGraphExecutor(AscendGraphExecutor&&) = delete;
-
-  // Disable move assignment.
-  AscendGraphExecutor& operator=(AscendGraphExecutor&&) = delete;
 };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/backends/ascend/graph/AscendGraphExecutor.hpp` around lines 26 - 49, The
deleted special member functions for AscendGraphExecutor
(AscendGraphExecutor(const AscendGraphExecutor&), AscendGraphExecutor&
operator=(const AscendGraphExecutor&),
AscendGraphExecutor(AscendGraphExecutor&&), AscendGraphExecutor&
operator=(AscendGraphExecutor&&)) should be declared in the public section, not
private; move those four deleted declarations from the private block into the
public block of the class so the non-copyable/non-movable intent is part of the
public interface alongside workspaceSize() and graphOp().
mllm/backends/ascend/graph/AscendGraphExecutor.cpp-27-30 (1)

27-30: ⚠️ Potential issue | 🟡 Minor

Empty environment variable is treated as enabled.

If the environment variable is set but empty (e.g., export MLLM_PROFILE_ASCEND_GRAPH=), value[0] is '\0', and '\0' != '0' evaluates to true, causing profiling to be enabled unintentionally.

🛠️ Proposed fix
 bool isEnvEnabled(const char* name) {
   const char* value = std::getenv(name);
-  return value != nullptr && value[0] != '0';
+  return value != nullptr && value[0] != '\0' && value[0] != '0';
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/backends/ascend/graph/AscendGraphExecutor.cpp` around lines 27 - 30, The
isEnvEnabled function currently treats an empty environment variable as enabled
because it only checks value != nullptr and value[0] != '0'; update isEnvEnabled
to treat empty strings as disabled by ensuring the value is non-null AND not
empty (value[0] != '\0') and also not equal to the character '0' (value[0] !=
'0'), e.g., in isEnvEnabled(const char* name) check value != nullptr && value[0]
!= '\0' && value[0] != '0' so an exported-but-empty variable will be considered
disabled.
mllm/backends/ascend/graph/AscendDynamicLinearW8A8PluginOperation.cpp-673-675 (1)

673-675: ⚠️ Potential issue | 🟡 Minor

Incorrect error code for workspace validation.

ERROR_INVALID_TENSOR_NUM is misleading when the actual issue is a null workspace with non-zero required size. Consider using a more appropriate error code if available, or at minimum add a comment.

🛠️ Proposed fix
   if (workspace == nullptr && workspace_size > 0) {
-    return atb::ERROR_INVALID_TENSOR_NUM;
+    return atb::ERROR_INTERNAL_ERROR;  // Workspace required but not provided
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/backends/ascend/graph/AscendDynamicLinearW8A8PluginOperation.cpp` around
lines 673 - 675, The null workspace check incorrectly returns
atb::ERROR_INVALID_TENSOR_NUM; update the return to a more accurate error code
(e.g., atb::ERROR_INVALID_WORKSPACE or atb::ERROR_INVALID_MEMORY_ALLOCATION if
your codebase defines one) for the condition where workspace == nullptr &&
workspace_size > 0, and if neither meaningful code exists add a short comment
explaining that this path represents a missing/insufficient workspace and
consider adding a new error code (reference symbols: workspace, workspace_size,
atb::ERROR_INVALID_TENSOR_NUM, and the surrounding check in
AscendDynamicLinearW8A8PluginOperation.cpp).
mllm/models/qwen_ascend/tokenization_qwen_ascend.hpp-78-83 (1)

78-83: ⚠️ Potential issue | 🟡 Minor

Potential data loss in detokenize when casting wchar_t to unsigned char.

Line 81 casts wchar_t to unsigned char, which truncates values > 255. While the byte-to-unicode mapping should produce valid byte values, if bytes_2_unicode_dict_inverse_ lookup fails (key not found), the default wint_t value could produce garbage.

Consider adding validation:

🛡️ Defensive lookup
   std::wstring detokenize(int64_t pos_idx) override {
     auto str = _detokenize(pos_idx);
     std::string utf_8_str;
     for (wchar_t c : str) {
+      auto it = bytes_2_unicode_dict_inverse_.find(c);
+      if (it != bytes_2_unicode_dict_inverse_.end()) {
+        utf_8_str.push_back(static_cast<unsigned char>(it->second));
+      } else {
+        // Handle unknown character - could log warning or use replacement char
+        utf_8_str.push_back('?');
+      }
-      utf_8_str.push_back((unsigned char)(bytes_2_unicode_dict_inverse_[c]));
     }
     return {mllm::preprocessor::utf8string2WideString(utf_8_str)};
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/models/qwen_ascend/tokenization_qwen_ascend.hpp` around lines 78 - 83,
detokenize currently casts wchar_t results from _detokenize through
bytes_2_unicode_dict_inverse_ using (unsigned char) which can truncate >255 or
propagate garbage if lookup fails; update detokenize (and related _detokenize
usage) to validate lookups and bounds: for each wchar_t c, check that
bytes_2_unicode_dict_inverse_.contains(c) (or find() != end()), ensure the
returned value is within 0..255 before casting (use uint8_t or
static_cast<uint8_t>), provide a clear fallback (e.g. replacement byte like '?'
or skip) when lookup fails, and then append that validated byte to utf_8_str
before converting with utf8string2WideString.
mllm/models/qwen_ascend/tokenization_qwen_ascend.hpp-106-121 (1)

106-121: ⚠️ Potential issue | 🟡 Minor

Template replacement lacks error handling for missing placeholder.

Line 108 uses find("{{{prompt}}}") but doesn't check for std::string::npos. If the placeholder is missing (e.g., template is modified), replace at npos position has undefined behavior.

🛡️ Safer template replacement
 ARGenerationOutputPast convertMessage(const QwenAscendMessage& message) {
   auto applied_string = QwenAscendMessage::message_template;
+  constexpr const char* placeholder = "{{{prompt}}}";
-  size_t pos = applied_string.find("{{{prompt}}}");
-  applied_string.replace(pos, 12, message.prompt);
+  size_t pos = applied_string.find(placeholder);
+  if (pos != std::string::npos) {
+    applied_string.replace(pos, std::strlen(placeholder), message.prompt);
+  }
   // ... rest of method
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/models/qwen_ascend/tokenization_qwen_ascend.hpp` around lines 106 - 121,
The convertMessage function uses QwenAscendMessage::message_template and calls
applied_string.find("{{{prompt}}}") then applied_string.replace(...) without
checking for std::string::npos; add a guard after the find to detect npos, log
or throw a clear error (or fallback to appending/inserting the prompt) and only
call replace when pos != std::string::npos to avoid undefined behavior; update
references in convertMessage to handle the missing placeholder case safely
before tokenization and retain existing variable names (applied_string, pos,
message.prompt).
mllm/backends/ascend/graph/AscendAttentionWithKVCachePluginOperation.cpp-520-530 (1)

520-530: ⚠️ Potential issue | 🟡 Minor

Workspace size validation could be more precise.

The workspace plan at Lines 521-526 recalculates key_history_bytes and value_history_bytes, then computes subgraph_workspace_bytes as the remainder. If workspace_size is exactly the sum of history bytes with no remainder, subgraph_workspace_bytes becomes 0, which may cause issues if the subgraph requires workspace.

Consider adding a check that workspace_size >= plan.totalBytes() where totalBytes includes the expected subgraph workspace:

🔧 Suggested validation
+  // Verify workspace is at least as large as what Setup computed.
+  const uint64_t expected_subgraph_ws = (seq_q > 1) ? prefill_subgraph_workspace_bytes_ : decode_subgraph_workspace_bytes_;
+  if (workspace_size < plan.key_history_bytes + plan.value_history_bytes + expected_subgraph_ws) {
+    return atb::ERROR_INVALID_TENSOR_NUM;  // Or a more specific error
+  }
   plan.subgraph_workspace_bytes = workspace_size >= (plan.key_history_bytes + plan.value_history_bytes)
       ? workspace_size - (plan.key_history_bytes + plan.value_history_bytes)
       : 0;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/backends/ascend/graph/AscendAttentionWithKVCachePluginOperation.cpp`
around lines 520 - 530, The current logic only ensures workspace_size is
non-negative and subtracts history bytes to set plan.subgraph_workspace_bytes,
but doesn't validate that workspace_size actually satisfies the full required
workspace (history + any required subgraph workspace), so add an explicit
validation: compute the required_total_bytes (e.g., via a new
WorkspacePlan::totalBytes() or by summing plan.key_history_bytes +
plan.value_history_bytes + required_subgraph_min_bytes) and check workspace_size
>= required_total_bytes before proceeding; if not, return the appropriate error
(e.g., atb::ERROR_INVALID_TENSOR_NUM) so callers cannot pass a workspace that is
exactly the history bytes but insufficient for the subgraph. Ensure you
reference and update WorkspacePlan, plan.key_history_bytes,
plan.value_history_bytes, and plan.subgraph_workspace_bytes in the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CMakeLists.txt`:
- Around line 52-60: The global
add_compile_definitions(_GLIBCXX_USE_CXX11_ABI=0) should be removed and replaced
by setting a cache flag when ATB cxx_abi_0 is detected (e.g.,
set(MLLM_ASCEND_USE_CXX11_ABI_0 TRUE CACHE BOOL "Ascend ABI0")) in this
CMakeLists, then remove the global definition here and instead read that cache
variable in mllm/backends/ascend/CMakeLists.txt and
mllm/backends/ascend/graph/CMakeLists.txt and apply
target_compile_definitions(_GLIBCXX_USE_CXX11_ABI=0 PRIVATE) only to the Ascend
targets MllmAscendBackend and MllmAscendGraph so other targets
(CPU/CUDA/OpenCL/QNN) are unaffected.

In `@examples/qwen_ascend/main.cpp`:
- Around line 146-178: The current stats loop reads logits_ptr[0..vocab_size-1]
which uses the first sequence position; instead extract the final timestep
logits (shape [B,S,vocab_size] -> use the last S-1 position) before computing
min/max/sum/argmax. Replace the direct logits.ptr access with the utility that
returns the last-position slice (e.g. call getLastLogits() used in ARGeneration
or index logits with logits[{kAll, logits.shape()[1]-1, kAll}]) and then use
that pointer when handling both kFloat16 (mllm::mllm_fp16_t) and float branches;
keep using cfg.vocab_size, logits.ptr, kFloat16, and argmax_idx as before.

In `@mllm/backends/ascend/graph/AscendCausalMaskTensorPluginOperation.cpp`:
- Around line 209-216: The async memcpy using aclrtMemcpyAsync writes from the
class member host_mask_buffer_ which may be reused by a subsequent Execute(); to
avoid the race, wait for the copy to finish before returning: after calling
aclrtMemcpyAsync (and checking its return), synchronize the stream returned by
getGlobalAtbStream() (e.g., via aclrtSynchronizeStream or equivalent) before
returning the error code; alternatively, implement a pinned-host buffer pool or
double-buffering for host_mask_buffer_ and ensure the buffer used for the
in-flight transfer is not reused until the async transfer completes.

In `@mllm/backends/ascend/graph/AscendLinearW8A8PluginOperation.cpp`:
- Around line 566-585: The current flow only calls syncGlobalAtbStream() when
profile_enabled is true, which can leave aclnnCast() unsynchronized before the
subsequent linear operation uses x_int8_t; change the logic around the
aclnnCast() result (ret) so that if ret == ACL_SUCCESS you always call
syncGlobalAtbStream(), and only compute cast_exec_ms =
elapsedMs(cast_exec_start, Clock::now()) when profile_enabled is true; update
the block that currently checks (profile_enabled && ret == ACL_SUCCESS) to
instead check ret == ACL_SUCCESS first (calling syncGlobalAtbStream()), then
conditionally set cast_exec_ms if profile_enabled. Ensure this change is applied
next to the aclnnCast call and variables cast_exec_start, cast_exec_ms, ret,
profile_enabled, and x_int8_t.

In `@mllm/backends/ascend/memory/AscendMemoryPool.cpp`:
- Around line 104-115: The allocation check currently uses remain_size_ >
align_size which prevents exact-fit allocations; update the condition to allow
remain_size_ >= align_size so requests that exactly consume the remaining pool
succeed. In the allocation branch around generateBlocksId(), cur_mem_ptr_, and
used_blocks_ where MemoryBlock is constructed and total_new_alloc_count_ is
incremented (and MLLM_INFO logged), ensure the alignment adjustment and
subsequent updates to remain_size_ and cur_mem_ptr_ remain correct when
remain_size_ == align_size so an exact-fit allocation is recorded into
used_blocks_ rather than treated as OOM.
- Around line 119-129: allocateBlock currently logs allocation failure but
leaves block_id untouched and returns implicitly; initialize block_id to a clear
failure sentinel (e.g., INVALID_BLOCK_ID) at the start of allocateBlock(),
change the function signature to return a success/failure status (bool or an
error code) instead of relying on side-effects, and on the failure path
explicitly set block_id to the sentinel before returning false; update callers
of allocateBlock() to check the returned status and avoid treating a stale
block_id as valid. Ensure references to block_id, allocateBlock(), free_blocks_,
and used_blocks_ are updated consistently and keep the existing MLLM_ERROR log
for diagnostics.

In `@mllm/backends/ascend/ops/AscendCausalMaskKernel.cpp`:
- Around line 77-103: The current implementation in AscendCausalMaskKernel.cpp
copies the entire mask device→host, modifies it with
applyStandardCausalMask/applySlidingWindowCausalMask, then copies back, which is
unacceptable for latency; replace this round-trip with a device-side
implementation: remove the host_data vector and both aclrtMemcpy calls and
instead implement and call a device kernel (e.g., launchAscendCausalMaskKernel
or applyDeviceCausalMaskKernel) that reads from input.deviceData and writes the
masked result directly to output.deviceData on the Ascend device (support both
standard and sliding_window paths by dispatching the kernel variant or passing
window_size), using the Ascend SDK kernel/RT APIs for device memory and launches
so no full-device copy to host occurs; keep error handling consistent (return
atb::ERROR_RT_FAIL on kernel launch failures) and only use host memcopies for
tiny metadata if absolutely needed.
- Around line 108-143: executeAscendCausalMaskKernel must validate that the last
dim D is >= S and that window_size > 0 for sliding-window mode to avoid
out-of-bounds writes in applyStandardCausalMask/executeTypedCausalMask; before
the S == 1 short-circuit, read D (e.g. input.desc.shape.dims[3]) and if D < S
return an error (e.g. atb::ERROR_INVALID_ARGUMENT or a suitable existing error
code), and if sliding_window is true validate window_size > 0 and return an
error for non-positive window_size; update callers/paths to ensure these checks
happen in executeAscendCausalMaskKernel before calling executeTypedCausalMask or
applyStandardCausalMask.

In `@mllm/backends/ascend/ops/AscendCopyOp.cpp`:
- Around line 17-32: AscendCopyOp::forward currently reads inputs[1] without
validating arity and passes src.bytes() as the aclrtMemcpy destMax; fix it by
first asserting or validating inputs.size() >= 2 (and that outputs is
intentionally unused or document/clear it) before accessing inputs[1], then call
aclrtMemcpy with the destination allocation size (use dst.bytes() for the
destMax argument) instead of src.bytes(); also ensure the function's contract
clearly states this is an in-place 2-input op (or populate outputs if it should
produce outputs).

In `@mllm/backends/ascend/ops/AscendLinearDynamicW8A8.cpp`:
- Around line 137-201: The code currently divides x by scale_x_npu (computed
from max_abs) without guarding against zeros, causing 0/0 NaNs; before calling
runAtbElewise(..., ELEWISE_REALDIV, x, scale_x_npu, x_scaled) clamp scale_x_npu
to a small positive epsilon (or replace zero entries) so no element is <
epsilon. Implement this by inserting an ATB elementwise/clamp/op that takes
max_abs/scale_x_npu and applies max(value, eps) (or a conditional replace) into
a new tensor (e.g., scale_x_clamped) and then use scale_x_clamped in
runAtbElewise; refer to the symbols max_abs, scale_x_npu, runAtbElewise,
ELEWISE_REALDIV and reuse the same mem_mgr/ATB op pattern used to compute
scale_x_npu.

In `@mllm/backends/ascend/ops/AscendLinearQuant.cpp`:
- Around line 55-90: prepareLinearW8A8Artifacts currently only validates
scale_x; add pre-checks to fail fast when out_channels <= 0 and when any element
of scale_w is non-positive before building static_deq/dynamic_deq. Specifically:
in prepareLinearW8A8Artifacts, after converting scale_w_cpu and scale_x_cpu but
before computing sw/static_deq/dynamic_deq, validate that out_channels > 0 and
iterate over the float pointer sw = artifacts.scale_w_cpu.ptr<float>() to ensure
every sw[n] > 0; if either check fails call MLLM_ERROR_EXIT with a clear message
including layer_name and the offending value(s) so the W8A8 path cannot be
silently poisoned (refer to artifacts.scale_w_cpu, artifacts.scale_x, sw,
static_deq, dynamic_deq).

In `@mllm/backends/ascend/ops/AscendRoPEOp.cpp`:
- Around line 67-71: Check that the input tensor has rank 4 before indexing
x.shape(): add an explicit validation using x.shape().size() (or equivalent rank
accessor) at the start of the routine in AscendRoPEOp (or the function where
batch_size/seq_len/num_heads/head_dim are extracted); if the size is not 4,
return an error or throw an exception (with a clear message) instead of
proceeding, and only then assign batch_size = x.shape()[0], seq_len =
x.shape()[1], num_heads = x.shape()[2], head_dim = x.shape()[3].
- Around line 150-165: The aclrtMemcpy calls that populate cos_expanded_ptr and
sin_expanded_ptr inside the need_expand block (and the host-to-device copy that
writes position_ids) must have their return values captured and validated with
MLLM_ACL_CHECK to avoid proceeding with uninitialized device buffers; update the
two aclrtMemcpy calls that copy cos_in.ptr<void>() and sin_in.ptr<void>() to
store the aclError_t result and pass it into MLLM_ACL_CHECK (same for the
aclrtMemcpy that transfers position_ids), so failures trigger the existing error
handling instead of silently running the kernel with bad memory.

In `@mllm/backends/ascend/Register.cpp`:
- Around line 19-20: The hard-coded 8GB pool_size in Register.cpp causes startup
failures or resource starvation; make the pool size configurable and safe: read
a configurable value (environment variable like ASCEND_MEMORY_POOL_SIZE or a
CMake-provided macro) and parse it, validate it (e.g., ensure >= a minimum and
<= available system memory), and fall back to a conservative default if unset or
invalid; then pass that computed size to
ascend::getAscendMemoryManager().createMemoryPool(pool_size). Also document the
new env/CMake option and ensure the code handles parsing errors gracefully.

In `@mllm/models/qwen_ascend/qwen_ascend_graph_ops.hpp`:
- Around line 167-174: The cache key for attentionScaleTensor is missing
head_dim: update attentionScaleTensor to also check cached_head_dim_ (add
private int32_t cached_head_dim_{-1};) and only reuse attention_scale_tensor_
when cached_head_dim_ == head_dim in addition to dtype/device; when you recreate
the tensor (in the block that assigns attention_scale_tensor_) set
cached_head_dim_ = head_dim so the cache is invalidated correctly on head size
changes.
- Around line 24-205: Add concise API doc comments for each public symbol: the
inline factory functions createLinearGraphOp, createRmsNormGraphOp,
createRoPEGraphOp, createTransposeGraphOp, createSoftmaxGraphOp,
createAddGraphOp, createMulGraphOp, createSiLUGraphOp (describe purpose,
parameters like has_bias/epsilon/axis/rank/dim0/dim1, return value and that they
MLLM_ERROR_EXIT on ATB failure), the helpers isQwenAscendDecoderGraphEnabled and
getQwenAscendDecoderGraphSetupBucketSize (describe env var semantics and return
meanings), and the QwenAscendDecoderGraphRunner class plus its public methods
configure, hasExecutor, setupBucketSize, setExecutor, attentionScaleTensor,
currentSeqLenTensor, and execute (describe intended usage, parameter
types/ownership, return semantics, side effects like device allocation or
memcpy, threading assumptions, and error/exit behavior). Keep each comment brief
and follow the project docstyle for one-line summary + param/return/error notes.
- Around line 69-75: In createTransposeGraphOp, guard accesses into param.perm
before using dim0/dim1: validate that dim0 and dim1 are within [0, rank-1] and
non-negative (or otherwise handle invalid inputs) and return a safe failure
(e.g., nullptr) or throw an explanatory error if they are out of range; locate
the function createTransposeGraphOp and the vector param.perm to add an early
check (if (dim0 < 0 || dim1 < 0 || dim0 >= rank || dim1 >= rank) { /* handle */
}) before calling std::swap, so you avoid out-of-bounds indexing.
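
The guard can be factored as below; returning `std::nullopt` stands in for whatever failure path (`nullptr` op or error exit) the real factory uses:

```cpp
#include <algorithm>
#include <optional>
#include <vector>

// Build a transpose permutation over `rank` dims, swapping dim0 and dim1.
// Rejects out-of-range axes instead of indexing past param.perm's bounds.
std::optional<std::vector<int>> makeTransposePerm(int rank, int dim0, int dim1) {
  if (rank <= 0 || dim0 < 0 || dim1 < 0 || dim0 >= rank || dim1 >= rank) {
    return std::nullopt;  // caller maps this to a nullptr op / error exit
  }
  std::vector<int> perm(rank);
  for (int i = 0; i < rank; ++i) perm[i] = i;
  std::swap(perm[dim0], perm[dim1]);
  return perm;
}
```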

In `@mllm/models/qwen_ascend/qwen_ascend_rope.hpp`:
- Around line 39-58: The issue is a dtype mismatch between
makeLocalRoPEPositionIds (which creates position_ids as kInt32) and
makeRotaryPosEmbedding (which casts position_ids.ptr to int64_t*); fix by making
the types consistent: either change makeLocalRoPEPositionIds to allocate kInt64
for position_ids, or change makeRotaryPosEmbedding to use int32_t* (and treat
pos as int32_t when computing freqs). Update the pointer type and any related
uses of pos in makeRotaryPosEmbedding to match the chosen integer dtype so
indexing and multiplications are correct.

In `@tests/ascend/AscendLinearKernelTest.hpp`:
- Around line 457-458: The allocateBlock call is narrowing ws_size (uint64_t) to
uint32_t which can wrap for >4GiB; update the allocator API or its overload used
by getAscendMemoryManager().allocateBlock to accept uint64_t (or change its
signature) so you pass the full ws_size, and adjust
getAscendMemoryManager().getBlockPtr(ws_bid, ws) callers accordingly; if you
cannot change the allocator signature, add a validated/capped conversion that
checks ws_size <= 0xFFFFFFFFULL (or caps to 4GiB and logs/errors) before calling
allocateBlock. Apply the same fix for the other occurrences where ws_size is
passed (the calls near lines referenced in the review: the other
allocateBlock/getBlockPtr pairs).
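
If the allocator signature cannot change, the validated conversion could look like this (a sketch; the real code would route the `nullopt` case into the existing error/log path):

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Validated narrowing for allocator APIs still taking uint32_t sizes; returns
// nullopt instead of silently wrapping for workspaces over 4 GiB.
std::optional<uint32_t> checkedU32(uint64_t ws_size) {
  if (ws_size > std::numeric_limits<uint32_t>::max()) return std::nullopt;
  return static_cast<uint32_t>(ws_size);
}
```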

---

Minor comments:
In `@mllm/backends/ascend/graph/AscendAttentionWithKVCachePluginOperation.cpp`:
- Around line 520-530: The current logic only ensures workspace_size is
non-negative and subtracts history bytes to set plan.subgraph_workspace_bytes,
but doesn't validate that workspace_size actually satisfies the full required
workspace (history + any required subgraph workspace), so add an explicit
validation: compute the required_total_bytes (e.g., via a new
WorkspacePlan::totalBytes() or by summing plan.key_history_bytes +
plan.value_history_bytes + required_subgraph_min_bytes) and check workspace_size
>= required_total_bytes before proceeding; if not, return the appropriate error
(e.g., atb::ERROR_INVALID_TENSOR_NUM) so callers cannot pass a workspace that is
exactly the history bytes but insufficient for the subgraph. Ensure you
reference and update WorkspacePlan, plan.key_history_bytes,
plan.value_history_bytes, and plan.subgraph_workspace_bytes in the change.

In `@mllm/backends/ascend/graph/AscendDynamicLinearW8A8PluginOperation.cpp`:
- Around line 673-675: The null workspace check incorrectly returns
atb::ERROR_INVALID_TENSOR_NUM; update the return to a more accurate error code
(e.g., atb::ERROR_INVALID_WORKSPACE or atb::ERROR_INVALID_MEMORY_ALLOCATION if
your codebase defines one) for the condition where workspace == nullptr &&
workspace_size > 0, and if neither meaningful code exists add a short comment
explaining that this path represents a missing/insufficient workspace and
consider adding a new error code (reference symbols: workspace, workspace_size,
atb::ERROR_INVALID_TENSOR_NUM, and the surrounding check in
AscendDynamicLinearW8A8PluginOperation.cpp).

In `@mllm/backends/ascend/graph/AscendGraphExecutor.cpp`:
- Around line 27-30: The isEnvEnabled function currently treats an empty
environment variable as enabled because it only checks value != nullptr and
value[0] != '0'; update isEnvEnabled to treat empty strings as disabled by
ensuring the value is non-null AND not empty (value[0] != '\0') and also not
equal to the character '0' (value[0] != '0'), e.g., in isEnvEnabled(const char*
name) check value != nullptr && value[0] != '\0' && value[0] != '0' so an
exported-but-empty variable will be considered disabled.
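
The fixed predicate is a one-liner:

```cpp
#include <cstdlib>

// Unset, empty, and "0..."-prefixed values are disabled; anything else enables.
bool isEnvEnabled(const char* name) {
  const char* value = std::getenv(name);
  return value != nullptr && value[0] != '\0' && value[0] != '0';
}
```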

In `@mllm/backends/ascend/graph/AscendGraphExecutor.hpp`:
- Around line 26-49: The deleted special member functions for
AscendGraphExecutor (AscendGraphExecutor(const AscendGraphExecutor&),
AscendGraphExecutor& operator=(const AscendGraphExecutor&),
AscendGraphExecutor(AscendGraphExecutor&&), AscendGraphExecutor&
operator=(AscendGraphExecutor&&)) should be declared in the public section, not
private; move those four deleted declarations from the private block into the
public block of the class so the non-copyable/non-movable intent is part of the
public interface alongside workspaceSize() and graphOp().

In `@mllm/backends/ascend/ops/AscendGatherOps.cpp`:
- Around line 98-100: The current check that calls MLLM_ERROR_EXIT if any of
acl_weight, acl_indices, or acl_output is nullptr can leak already-created ACL
tensors; before calling MLLM_ERROR_EXIT, explicitly destroy each tensor that was
successfully created (e.g., call the ACL tensor destroy function on acl_weight,
acl_indices, acl_output if they are non-null) so no resources remain allocated,
then call MLLM_ERROR_EXIT; update the block around the createAclTensor calls
where variables acl_weight, acl_indices, acl_output are checked and ensure each
non-null pointer is cleaned up individually prior to exiting.

In `@mllm/models/qwen_ascend/configuration_qwen_ascend.hpp`:
- Around line 12-30: Add documentation comments for the public struct
QwenAscendConfig and its constructor: place a brief docstring above the struct
describing its role as the configuration container for the Qwen Ascend model
(loading JSON config and exposing hyperparameters), and a docstring above the
explicit QwenAscendConfig(const std::string& file_path) constructor that
documents the file_path parameter and notes that it may throw std::runtime_error
on file open/JSON parse failure; follow the existing project docstring style and
reference that QwenAscendConfig inherits from ConfigFile.

In `@mllm/models/qwen_ascend/tokenization_qwen_ascend.hpp`:
- Around line 78-83: detokenize currently casts wchar_t results from _detokenize
through bytes_2_unicode_dict_inverse_ using (unsigned char), which can truncate
values above 255 or propagate garbage when a lookup fails; update detokenize (and related
_detokenize usage) to validate lookups and bounds: for each wchar_t c, check
that bytes_2_unicode_dict_inverse_.contains(c) (or find() != end()), ensure the
returned value is within 0..255 before casting (use uint8_t or
static_cast<uint8_t>), provide a clear fallback (e.g. replacement byte like '?'
or skip) when lookup fails, and then append that validated byte to utf_8_str
before converting with utf8string2WideString.
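
The per-character validation might be factored like this (the map's value type and the `'?'` fallback are assumptions; the real table maps the GPT-2-style byte↔unicode alphabet):

```cpp
#include <cstdint>
#include <unordered_map>

// Map a decoded wide char back to its raw byte via the inverse byte↔unicode
// table, falling back to '?' when the lookup fails or is out of byte range.
char wcharToByte(const std::unordered_map<wchar_t, int>& inverse, wchar_t c) {
  auto it = inverse.find(c);
  if (it == inverse.end() || it->second < 0 || it->second > 255) return '?';
  return static_cast<char>(static_cast<uint8_t>(it->second));
}
```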
- Around line 106-121: The convertMessage function uses
QwenAscendMessage::message_template and calls
applied_string.find("{{{prompt}}}") then applied_string.replace(...) without
checking for std::string::npos; add a guard after the find to detect npos, log
or throw a clear error (or fallback to appending/inserting the prompt) and only
call replace when pos != std::string::npos to avoid undefined behavior; update
references in convertMessage to handle the missing placeholder case safely
before tokenization and retain existing variable names (applied_string, pos,
message.prompt).
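
A sketch of the guarded substitution, here choosing the throw-on-missing option (the template/prompt strings are illustrative):

```cpp
#include <stdexcept>
#include <string>

// Substitute {{{prompt}}} only when the placeholder is actually present;
// otherwise fail loudly instead of calling replace() with npos.
std::string applyTemplate(const std::string& tmpl, const std::string& prompt) {
  const std::string placeholder = "{{{prompt}}}";
  std::string applied_string = tmpl;
  auto pos = applied_string.find(placeholder);
  if (pos == std::string::npos) {
    throw std::runtime_error("message template missing {{{prompt}}} placeholder");
  }
  applied_string.replace(pos, placeholder.size(), prompt);
  return applied_string;
}
```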

In `@tests/ascend/AscendLinearSoftmaxGraphTest.hpp`:
- Around line 48-116: The builder.build() call does not take ownership of the
child ops (linear_op, softmax_op) so on the success path you must explicitly
free them after graph_op is created; locate the builder.build() invocation and,
if graph_op != nullptr, call atb::DestroyOperation(linear_op) and
atb::DestroyOperation(softmax_op) before proceeding to create
AscendGraphExecutor/execute so the executor can take ownership of only graph_op.

---

Nitpick comments:
In `@mllm/backends/ascend/AscendBackend.cpp`:
- Around line 46-50: Normalize spacing between template arguments in the
regOpFactory instantiation: ensure a single space follows every comma between
factory type names (e.g., between AscendAddOpFactory and AscendSubOpFactory, and
consistently for AscendConcatOpFactory, AscendSliceOpFactory, etc.) so the
template argument list for regOpFactory<AscendGraphBeginOpFactory,
AscendGraphEndOpFactory, AscendAddOpFactory, AscendSubOpFactory,
AscendMulOpFactory, AscendX2XOpFactory, AscendSiLUOpFactory,
AscendLinearOpFactory, AscendRMSNormOpFactory, AscendViewOpFactory,
AscendMatMulOpFactory, AscendSoftmaxOpFactory, AscendConcatOpFactory,
AscendSliceOpFactory, AscendTransposeOpFactory, AscendEmbeddingOpFactory,
AscendGatherOpFactory, AscendRoPEOpFactory, AscendFillOpFactory,
AscendCopyOpFactory, AscendCausalMaskOpFactory> uses consistent single-space
separation for readability.

In `@mllm/backends/ascend/AscendCommon.cpp`:
- Around line 331-335: The FP16 path is doing an unnecessary per-element
conversion via float; instead copy raw FP16 bytes directly: when
cpu_tensor.dtype() == kFloat16, obtain src via cpu_tensor.ptr<mllm_fp16_t>() and
memcpy into fp16_data using numel * sizeof(mllm_fp16_t) (ensure fp16_data and
src have the same underlying type and alignment). Replace the loop that uses
half_float::half(static_cast<float>(src[i])) with a single memcpy and keep
numel, mllm_fp16_t, cpu_tensor.ptr<mllm_fp16_t>(), kFloat16 and fp16_data
references to locate the change.

In `@mllm/backends/ascend/graph/AscendAttentionWithKVCachePluginOperation.cpp`:
- Around line 70-80: readCurrentSeqLen does a blocking aclrtMemcpy
DEVICE_TO_HOST each Setup/Execute which forces sync on the device; change the
implementation to avoid per-call synchronous copies by caching the sequence
length on the host (read once in the first Setup into a member like
cached_seq_len and return that on subsequent calls) or switch to an asynchronous
copy (aclrtMemcpy with a stream and a pinned host buffer) combined with
appropriate stream synchronization where the value is needed. Locate
readCurrentSeqLen and callers in Setup/Execute to replace the direct device
copy: either load seq_len_tensor into a host-side cached_seq_len after
validating dtype and dataSize, or use async memcpy into a pinned buffer and
enforce stream/sync only at the boundary that actually needs the value. Ensure
existing dtype/deviceData checks (seq_len_tensor.desc.dtype, dataSize,
deviceData) remain and fall back to the synchronous path on errors.

In `@mllm/backends/ascend/graph/AscendAttentionWithKVCachePluginOperation.hpp`:
- Around line 13-66: Add a concise Doxygen-style doc comment above the
AscendAttentionWithKVCachePluginOperation class declaration that summarizes the
class purpose, enumerates the expected tensor inputs and outputs (names, shapes,
and roles used by InferShape/GetInputNum/GetOutputNum), documents the meaning
and effect of setup_bucket_size_ (how it alters bucketing or workspace sizing),
and explains the distinction between prefill and decode subgraphs
(prefill_subgraph_op_ vs decode_subgraph_op_, when each is used in Setup/Execute
and their different workspace counts
prefill_subgraph_workspace_bytes_/decode_subgraph_workspace_bytes_); keep it
brief (2–6 lines) and reference key methods like Setup, Execute,
bucketedTotalSeq, and buildAttentionSubgraph so callers and maintainers can
quickly understand behavior.

In `@mllm/backends/ascend/graph/AscendCausalMaskPluginOperation.hpp`:
- Around line 17-18: The AscendCausalMaskPluginOperation constructor can be
invoked with a single bool and should be marked explicit to avoid accidental
implicit conversions; update the declaration of
AscendCausalMaskPluginOperation(bool sliding_window = false, int32_t window_size
= 0) to be explicit (i.e., explicit AscendCausalMaskPluginOperation(...)) so the
constructor prevents implicit conversions while preserving the default
parameters.

In `@mllm/backends/ascend/graph/AscendCausalMaskTensorPluginOperation.cpp`:
- Around line 181-203: Replace the magic mask literals passed into
fillMaskBuffer with named constants defined at file/namespace scope (e.g.,
kFP16MaskValue and kFP32MaskValue) and use those constants where
input.desc.dtype is compared to ACL_FLOAT16 and ACL_FLOAT; update the FP16 call
to convert the kFP16MaskValue into half_float::half when calling fillMaskBuffer
for host_mask_buffer_, keeping the conditional branches and symbol names
(input.desc.dtype, ACL_FLOAT16, ACL_FLOAT, fillMaskBuffer, host_mask_buffer_)
intact.

In `@mllm/backends/ascend/graph/AscendCausalMaskTensorPluginOperation.hpp`:
- Around line 16-17: The public constructor for
AscendCausalMaskTensorPluginOperation is currently non-explicit and allows
unintended implicit conversions; mark the constructor
AscendCausalMaskTensorPluginOperation(bool sliding_window = false, int32_t
window_size = 0) as explicit to prevent implicit conversions (i.e., change the
declaration of the AscendCausalMaskTensorPluginOperation constructor to be
explicit).

In `@mllm/backends/ascend/graph/AscendClampPluginOperation.cpp`:
- Around line 42-73: Extract the duplicated helpers into a shared utility (e.g.,
add mllm/backends/ascend/graph/AscendPluginCommon.hpp and corresponding .cpp)
and replace the local definitions in AscendClampPluginOperation.cpp and
AscendRoundPluginOperation.cpp with includes and calls to the shared functions;
move sameTensorDesc and createAclTensor (consider renaming to
createAclTensorFromAtb) into the new namespace mllm::ascend, expose their
signatures in the header, implement them in the .cpp, and update both plugin
files to `#include` "AscendPluginCommon.hpp" and call sameTensorDesc and
createAclTensorFromAtb instead of their local copies.

In `@mllm/backends/ascend/graph/AscendDynamicLinearW8A8PluginOperation.hpp`:
- Around line 34-92: The class AscendDynamicLinearW8A8PluginOperation owns
multiple raw pointers (muls_op_, div_op_, linear_op_, mul_dequant_op_,
abs_cache_, maxdim_cache_, round_cache_, clamp_cache_, cast_cache_) so implicit
copying would lead to double-free; fix by declaring the copy constructor and
copy assignment operator deleted (and optionally delete the move constructor and
move assignment if you want to forbid moves too) on
AscendDynamicLinearW8A8PluginOperation (add e.g.
AscendDynamicLinearW8A8PluginOperation(const
AscendDynamicLinearW8A8PluginOperation&) = delete;
AscendDynamicLinearW8A8PluginOperation& operator=(const
AscendDynamicLinearW8A8PluginOperation&) = delete;) so copies are disabled—place
these declarations in the class definition alongside the existing
constructor/destructor and before private members to mirror the pattern used in
AscendLinearW8A8PluginOperation.

In `@mllm/backends/ascend/graph/AscendGraphBuilder.cpp`:
- Around line 40-48: Extract the duplicated std::vector→atb::SVector conversion
into a small helper (e.g., toAtbSVector) in an anonymous namespace and replace
the manual loops in both beginGraph and addOperation with calls to that helper;
ensure the helper signature matches usage (atb::SVector<std::string>
toAtbSVector(const std::vector<std::string>&)) and update the calls that
currently build atb_input_names and atb_output_names so they simply assign from
the helper.

In `@mllm/backends/ascend/graph/AscendGraphBuilder.hpp`:
- Around line 51-55: Move the deleted copy constructor and copy assignment into
the public section of the AscendGraphBuilder class: relocate the lines
"AscendGraphBuilder(const AscendGraphBuilder&) = delete;" and
"AscendGraphBuilder& operator=(const AscendGraphBuilder&) = delete;" from their
current location into the public: area of the class definition so the deleted
special member functions are public (still deleted) to produce clearer
compiler/static-analysis diagnostics when copy operations are attempted.
- Around line 41-42: The accessor graphName() should be marked [[nodiscard]] to
prevent accidental ignoring of its return value; update the method
declaration/definition for graphName() (which returns const std::string&
current_graph_name_) by adding the [[nodiscard]] attribute before the return
type (i.e., [[nodiscard]] const std::string& graphName() const { ... }) so the
compiler warns when its result is discarded.

In `@mllm/backends/ascend/graph/AscendLinearW8A8PluginOperation.cpp`:
- Around line 56-63: The getEnvInt function uses platform-dependent long; change
it to use a fixed-width type: include <cstdint>, replace the local parsed
variable type with int64_t and parse via std::strtoll (or equivalent that
returns a 64-bit value), keep the same error checks (end == value, parsed <= 0)
and return static_cast<int>(parsed) as before; update references to the variable
in getEnvInt to use the new int64_t parsed.
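
The fixed-width version might look like this (the default-parameter shape is an assumption about the existing signature; the narrowing `static_cast<int>` mirrors the current behavior):

```cpp
#include <cstdint>
#include <cstdlib>

// Parse a positive integer env var through a fixed-width 64-bit intermediate;
// fall back to def on unset, malformed, or non-positive values.
int getEnvInt(const char* name, int def) {
  const char* value = std::getenv(name);
  if (value == nullptr) return def;
  char* end = nullptr;
  int64_t parsed = std::strtoll(value, &end, 10);
  if (end == value || parsed <= 0) return def;
  return static_cast<int>(parsed);  // narrowing kept from the original code
}
```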

In `@mllm/backends/ascend/graph/AscendLinearW8A8PluginOperation.hpp`:
- Around line 22-71: AscendLinearW8A8PluginOperation owns raw pointers
(quant_subgraph_op_, linear_op_, cast_cache_state_) so disable implicit copying
to avoid double-free: declare the copy constructor and copy assignment operator
as deleted (e.g. AscendLinearW8A8PluginOperation(const
AscendLinearW8A8PluginOperation&) = delete; AscendLinearW8A8PluginOperation&
operator=(const AscendLinearW8A8PluginOperation&) = delete;) inside the class
definition; optionally explicitly default or delete move operations as desired
to make ownership semantics clear.
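
The ownership pattern in miniature (a generic stand-in, not the real op class): a type holding raw pointers deletes its copy operations so an accidental copy, and the resulting double-free, fails at compile time:

```cpp
#include <type_traits>

// Deleting copy operations on a raw-pointer-owning class makes accidental
// copies a compile error instead of a runtime double-free.
class RawOwningOp {
 public:
  RawOwningOp() : handle_(new int(0)) {}
  ~RawOwningOp() { delete handle_; }
  RawOwningOp(const RawOwningOp&) = delete;
  RawOwningOp& operator=(const RawOwningOp&) = delete;

 private:
  int* handle_;
};

static_assert(!std::is_copy_constructible_v<RawOwningOp>);
static_assert(!std::is_copy_assignable_v<RawOwningOp>);
```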

In `@mllm/backends/ascend/graph/CMakeLists.txt`:
- Around line 7-9: Replace the file(GLOB MLLM_ASCEND_GRAPH_FILES
${CMAKE_CURRENT_SOURCE_DIR}/*.cpp) pattern with an explicit list of source
files: find the CMake symbol MLLM_ASCEND_GRAPH_FILES and enumerate each .cpp
file currently in the directory instead of using the glob, or alternatively add
a clear comment in the CMakeLists.txt near the file(GLOB ...) line explaining
that CMake must be re-run when new .cpp files are added; update
add_library/add_executable calls that use MLLM_ASCEND_GRAPH_FILES to reference
the explicit list (or the documented glob) so builds are deterministic.
- Around line 26-28: The current fragile ABI detection uses string matching on
"$ENV{ATB_HOME_PATH}" — replace it with an explicit CMake option and a robust
fallback check: add a new option (e.g., ATB_USE_CXX11_ABI or
ATB_CXX11_ABI=ON/OFF) and respect that when setting
target_compile_definitions(MllmAscendGraph PUBLIC _GLIBCXX_USE_CXX11_ABI=0); if
the option is unset, check for a clear marker file inside the ATB install (e.g.,
$ENV{ATB_HOME_PATH}/cxx_abi_0 or another agreed file) before applying the flag
so detection does not rely on directory name matching. Ensure the new option is
documented in the CMakeLists and used in the conditional that currently
references "$ENV{ATB_HOME_PATH}".

In `@mllm/backends/ascend/memory/MemoryBlock.hpp`:
- Around line 11-18: MemoryBlock has inconsistent member initialization: add
default initializers for block_id_ and block_size_ to match the others so a
default-constructed MemoryBlock won't contain uninitialized values; update the
struct fields block_id_ and block_size_ (in struct MemoryBlock) to have sensible
defaults (e.g., 0 for block_id_ and 0 for block_size_) so all members
(block_id_, block_size_, requested_size_, address_, alloc_seq_, reuse_count_)
are consistently initialized.
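
With every field carrying a default initializer, a default-constructed block exposes no indeterminate values (the field types below are assumptions, chosen to match the names):

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the struct with uniform default member initializers; types are
// illustrative guesses for the real MemoryBlock fields.
struct MemoryBlock {
  uint32_t block_id_{0};
  size_t block_size_{0};
  size_t requested_size_{0};
  void* address_{nullptr};
  uint64_t alloc_seq_{0};
  uint32_t reuse_count_{0};
};
```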

In `@mllm/backends/ascend/ops/AscendCausalMaskOp.hpp`:
- Around line 12-18: Add a brief doc comment above the public class
AscendCausalMaskOp explaining its purpose (an Ascend backend implementation of
aops::CausalMaskOp), the constructor parameter (const aops::CausalMaskOpOptions&
options) and what the overridden methods do (setup prepares outputs given
inputs; forward applies the causal mask during computation). Mention expected
input/output tensor shapes or types at a high level and any error/edge-case
behavior for setup/forward so maintainers know usage; place the comment
immediately above the AscendCausalMaskOp declaration.

In `@mllm/backends/ascend/ops/AscendConcatOp.cpp`:
- Around line 94-97: The block in AscendConcatOp::forward contains an orphaned
brace left by the commented-out ASCEND_TIME_SCOPE; fix this by restoring
consistent block structure: either re-enable the timing scope macro or remove
the extra braces and collapse to a single statement calling op->Execute(vp,
reinterpret_cast<uint8_t*>(workspace), workspaceSize, atb_ctx) and assigning to
st. Ensure similar edits are applied consistently across other Ascend ops that
used the same pattern so the scope/bracing around op->Execute and the st
assignment matches (e.g., in other files that used ASCEND_TIME_SCOPE).

In `@mllm/backends/ascend/ops/AscendCopyOp.hpp`:
- Around line 11-23: Add concise API comments above the AscendCopyOp and
AscendCopyOpFactory declarations: document AscendCopyOp's purpose (a
backend-specific implementation of aops::CopyOp for Ascend devices), describe
the constructor AscendCopyOp(const aops::CopyOpOptions& options) and what
options control, explain the semantics and ownership for forward(const
std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) (expected
input/output shapes, whether tensors are moved/copied, thread/device guarantees,
and error conditions/exception semantics), and document AscendCopyOpFactory's
role (creates shared_ptr<BaseOp> via createOpImpl and that it returns ownership
in a std::shared_ptr). Keep each comment short (one- to three-sentence summary)
and place them immediately above the class declarations so callers can
understand purpose, parameters, returns, and ownership contract.

In `@mllm/backends/ascend/ops/AscendEmbeddingOp.cpp`:
- Around line 99-169: The forward method in AscendEmbeddingOp duplicates cleanup
logic for acl tensors and workspace; wrap ACL tensor handles returned by
createAclTensor (e.g., acl_weight, acl_indices, acl_output) in an
RAII/scope-guard object (or use a small CleanupTensors struct) that calls
aclDestroyTensor for each non-null pointer on scope exit, and similarly
encapsulate workspace allocation/free (workspace_block_id via
getAscendMemoryManager().allocateBlock / freeBlock) in a RAII guard so all
early-return error paths and the success path share a single automatic cleanup
instead of manual duplicated frees in AscendEmbeddingOp::forward and the error
branches around aclnnEmbeddingGetWorkspaceSize and aclnnEmbedding.
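
A minimal scope guard of the kind suggested; the real version would wrap `aclDestroyTensor` and `freeBlock` calls in the lambda:

```cpp
#include <functional>
#include <utility>

// Runs the cleanup callable on every exit path, so early returns in forward()
// cannot leak ACL tensors or workspace blocks.
class ScopeGuard {
 public:
  explicit ScopeGuard(std::function<void()> fn) : fn_(std::move(fn)) {}
  ~ScopeGuard() {
    if (fn_) fn_();
  }
  ScopeGuard(const ScopeGuard&) = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;

 private:
  std::function<void()> fn_;
};
```

Usage: declare one guard per resource right after the successful `createAclTensor` / `allocateBlock` call; every subsequent `return`, early or not, then releases everything automatically.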

In `@mllm/backends/ascend/ops/AscendFillOp.cpp`:
- Around line 24-30: In AscendFillOp::forward validate the random range before
branching on dtype: check the input start and end values (the params used for
kRandom) and assert or return an error if start > end so the code fails fast
instead of letting invalid ranges flow into the integer/float random branches;
update the pre-dtype-split validation in the forward method (and the other
duplicate checks referenced around the other occurrences) to perform the
start<=end check and use the existing MLLM_RT_ASSERT_* or error path to reject
the invalid range.

In `@mllm/backends/ascend/ops/AscendKVCacheOp.cpp`:
- Around line 117-160: In repeatInterleaveForGQA the extracted variable B is
unused; either remove the line "int B = shape[0];" to avoid confusion or, if
batch handling will be implemented later, add a short comment above it (e.g.,
"// batch dimension B kept for future per-batch handling") to make intent
explicit; update any related comments to reflect that the current loop operates
over heads only and does not iterate per-batch.
- Around line 50-62: In AscendKVCache::updateKVCache, explicitly convert the
tensor shape value to the same integer type used by current_seq_cnt_ and
max_cache_length_ to avoid narrowing mismatches: replace auto inputs_seq_len =
k.shape()[2] with an explicit int32_t (e.g., int32_t inputs_seq_len =
static_cast<int32_t>(k.shape()[2])); and add a defensive check to handle
out-of-range shape() values before casting (return or error via MLLM_ERROR_EXIT
if shape()[2] > std::numeric_limits<int32_t>::max()) so subsequent arithmetic
with current_seq_cnt_ and max_cache_length_ is safe and well-typed.

In `@mllm/backends/ascend/ops/AscendLinearQuant.hpp`:
- Around line 12-24: The header lacks a contract describing
AscendLinearW8A8Artifacts and prepareLinearW8A8Artifacts; add a concise doc
comment above the struct and function that states which struct members are CPU
tensors (scale_w_cpu, scale_x_cpu) vs Ascend/NPU tensors (deq_scale_npu,
deq_scale_w_npu, bias_int32_npu), the expected dtypes and shapes for each (e.g.,
scale_*: float32 1D per-channel of length out_channels, deq/bias: int32/float32
with NPU layout), that scale_x is a scalar float, ownership/lifetime
expectations (caller-usable tensors are already device-allocated), and for
prepareLinearW8A8Artifacts document parameters (layer_name, out_channels,
scale_w_raw, scale_x_raw), expected input dtypes/shapes and what the returned
AscendLinearW8A8Artifacts contains on success.

In `@mllm/backends/ascend/ops/AscendMatMulOp.cpp`:
- Around line 109-123: The two independent static counters matmul_dbg_count and
matmul_ws_count cause desynchronized debug prints; consolidate them into a
single shared counter used by both metadata and workspace debug blocks inside
shouldDebugMatMulStats() so both kinds of messages are gated together (increment
the single counter once per call and use it to limit both the tensor-metadata
prints and the workspace-size prints to 12). Locate the debug sections that
reference matmul_dbg_count and matmul_ws_count in AscendMatMulOp.cpp and replace
them with one unified counter (e.g., matmul_debug_count) ensuring the increment
happens exactly once per MatMul invocation.
- Around line 29-32: shouldDebugMatMulStats currently calls std::getenv() on
every call; cache the result to avoid repeated environment lookups by computing
the boolean once and returning that cached value. Replace the body of
shouldDebugMatMulStats() with a static initialized value (e.g., static const
bool kDebug = []{ auto* d = std::getenv("MLLM_DEBUG_MATMUL_STATS"); return d !=
nullptr && d[0]=='1'; }(); ) or use std::call_once to initialize a cached flag,
and then return that flag; keep the function signature and name unchanged.
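
The static-lambda form described above, in full:

```cpp
#include <cstdlib>

// getenv is evaluated exactly once, on first call; later calls return the
// cached flag without touching the environment again.
bool shouldDebugMatMulStats() {
  static const bool kDebug = [] {
    const char* d = std::getenv("MLLM_DEBUG_MATMUL_STATS");
    return d != nullptr && d[0] == '1';
  }();
  return kDebug;
}
```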

In `@mllm/backends/ascend/ops/AscendRMSNormOp.hpp`:
- Line 16: Add a one- to two-sentence contract comment above the public override
method load(const ParameterFile::ptr_t& ploader) in AscendRMSNormOp.hpp that
explains the method's purpose (initialize the operator from persistent
parameters), describes the ploader parameter type and expectations (which
keys/fields are required versus optional), specifies how missing parameters are
handled (defaults used or error thrown), notes any side effects or conversions
performed (type/shape conversions, memory layout changes, or validation), and
lists possible error conditions or exceptions that callers should expect; refer
to the symbol load(...) and the surrounding AscendRMSNormOp class so reviewers
can locate and validate the behavior.

In `@mllm/backends/ascend/ops/AscendRoPEOp.hpp`:
- Around line 12-25: Add a concise doc comment block above the public types to
document the RoPE op contract: describe the purpose of AscendRoPEOp and
AscendRoPEOpFactory, the constructor AscendRoPEOp(const aops::RoPEOpOptions&),
and the expected behavior and side-effects of setup(...) and forward(...)
(input/output tensor shapes/format expectations, when setup is called, whether
forward mutates outputs or throws on shape mismatch, and any error conditions).
Include brief notes on ownership/lifetime of returned shared_ptr from
AscendRoPEOpFactory::createOpImpl and any thread-safety expectations.

In `@mllm/backends/ascend/ops/AscendSiLUOp.cpp`:
- Around line 96-99: The extra braces around the op->Execute call in
AscendSiLUOp::forward were left when ASCEND_TIME_SCOPE was commented out; either
remove the now-unnecessary scope braces or add a short explanatory comment.
Locate the block containing ASCEND_TIME_SCOPE (commented) and op->Execute(vp,
reinterpret_cast<uint8_t*>(workspace), workspaceSize, atb_ctx) and either delete
the enclosing { } so the call is not wrapped by an empty scope, or keep the
braces but add a one-line comment (e.g., "// timing disabled due to ...")
explaining why the timing scope is omitted.

In `@mllm/backends/ascend/ops/AscendTransposeOp.hpp`:
- Around line 12-26: Add concise header comments documenting the public API: for
class AscendTransposeOp describe its purpose (an Ascend backend implementation
of aops::TransposeOp), the contract for its constructor (expects
aops::TransposeOpOptions), and behavior/expectations of its methods setup,
reshape, and forward (inputs/outputs shapes, ownership/aliasing, and error
conditions). For AscendTransposeOpFactory document that it produces
AscendTransposeOp instances for OpTypes::kTranspose, note createOpImpl's input
(options) and returned type (std::shared_ptr<BaseOp>), and any thread-safety or
lifetime guarantees; keep comments brief and follow existing project docstring
style.

In `@mllm/backends/ascend/ops/AscendX2XOp.cpp`:
- Around line 43-44: Replace the change-history comment "Removed
syncGlobalAtbStream() for better pipeline performance" with a behavior-focused
note describing the expected synchronization contract for Host-to-Device (H2D)
copies: state that this code path intentionally returns early (the return; after
the comment) because H2D transfers here are assumed to be synchronized through
per-stream/event coordination (or another explicit mechanism) rather than via a
global ATB stream, and document what callers must guarantee (e.g., that the
transfer stream is properly synchronized with the compute stream using events or
that upstream code flushes/queues required barriers). Mention the removed
function name syncGlobalAtbStream() only to clarify what is not performed and
reference the H2D copy behavior/invariant so future readers know the runtime
expectation.

In `@mllm/models/qwen_ascend/configuration_qwen_ascend.hpp`:
- Line 9: QwenAscendConfig currently inherits from ConfigFile using protected
inheritance; change this to either public inheritance or composition depending
on intended API: if QwenAscendConfig should expose ConfigFile's interface,
replace "protected ConfigFile" with "public ConfigFile" in the struct
declaration; if ConfigFile is an implementation detail, remove the base class
inheritance and add a private member like "ConfigFile config_," then update
QwenAscendConfig constructors and any uses of base members to access them via
config_ (or forward only the needed methods). Ensure all references to inherited
members in QwenAscendConfig are updated to the chosen approach (constructor
initializers, method calls, and member accesses).

In `@mllm/models/qwen_ascend/modeling_qwen_ascend.hpp`:
- Around line 117-122: The parameter inputs[3] is intentionally unused in
modeling_qwen_ascend's forward method; add a concise inline comment
next to the (void)inputs[3] line explaining what inputs[3] represents (e.g.,
attention mask / token type / auxiliary tensor) and why it is not needed in this
attention implementation (for example, handled elsewhere or redundant because
llm_embedding_sin/llm_embedding_cos and past_kv_cache are used instead). Mention
the related symbols llm_embedding_sin, llm_embedding_cos and past_kv_cache to
clarify the rationale for ignoring inputs[3] so future maintainers understand
the decision.
- Around line 276-294: The current logic in modeling_qwen_ascend.hpp creates a
new single-element position_ids Tensor on every decode step (when seq_len == 1),
causing repeated allocations; change this to reuse a cached Tensor member (e.g.,
add a class member like cached_single_pos_ids): on first use allocate the cached
Tensor with shape {batch_size,1}, on subsequent decode steps update its single
value(s) instead of allocating, and use that cached Tensor wherever position_ids
is set for seq_len == 1; ensure the cache is lazily allocated (check for nil)
and updated via the same offsettedPtr/ptr logic so existing code paths
(position_ids, seq_len, batch_size) can remain unchanged.

In `@mllm/models/qwen_ascend/qwen_ascend_decoder_graph.hpp`:
- Around line 253-279: Extract the repeated 4D reshape lambda used in the three
builder.reshape calls ("q_linear", "k_linear", "v_linear") into a helper factory
(e.g., make4DReshapeFunc) that accepts the variable number-of-heads and head-dim
and returns a lambda matching (const atb::Dims&, atb::Dims&). Replace the inline
lambdas so q uses self_attn_.num_attention_heads_ and k/v use
self_attn_.num_key_value_heads_ with self_attn_.head_dim_ passed to the helper;
keep the builder.reshape invocations and the reshape id strings ("q_linear_4d",
"k_linear_4d", "v_linear_4d") unchanged. Ensure the helper captures values by
value so it is safe if moved to a different scope.

In `@mllm/models/qwen_ascend/qwen_ascend_rope.hpp`:
- Around line 147-170: The current nested loops over batch and seq in the
function that uses cached_sin_emb_, cached_cos_emb_, pos_ptr, sin_result and
cos_result issue many small aclrtMemcpy calls (ACL_MEMCPY_DEVICE_TO_DEVICE),
which is inefficient; replace the per-position memcpys with batched device
copies by computing contiguous source/target ranges and issuing a single larger
aclrtMemcpy (or aclrtMemcpyAsync) per batch/sequence span, or implement a
device-side gather kernel to copy multiple non-contiguous positions into
sin_result and cos_result in one kernel launch; update the loops around the
aclrtMemcpy calls to accumulate offsets and sizes and call aclrtMemcpy once per
larger block (or call the new gather kernel) and keep MLLM_ACL_CHECK(ret) after
those batched ops.

In `@mllm/models/qwen_ascend/tokenization_qwen_ascend.hpp`:
- Around line 95-104: In convert2Ids, replace the C-style casts of ids.size()
with C++ casts: change occurrences like (int32_t)ids.size() to
static_cast<int32_t>(ids.size()) in the Tensor::empty call and any other usages
inside the function (e.g., where the shape is computed or elements are assigned)
to improve type safety and clarity; the target symbol to edit is the method
convert2Ids in the qwen_ascend tokenizer (look for Tensor::empty({1,
(int32_t)ids.size()}, ...) and similar casts).

In `@tests/ascend/AscendAclnnKernelTest.hpp`:
- Around line 54-66: The duplicated makeAclTensor lambda used in
AclnnCastFloat16ToInt8Test, AclnnAbsFloat16Test and AclnnMaxDimFloat16Test
should be extracted into a single helper (either a private static method on the
test fixture or a free function in an anonymous namespace) with the same
signature (accepting const Tensor& and aclDataType, returning aclTensor*);
replace each inline lambda with calls to that helper (remove the duplicated
lambdas), ensure the helper name (e.g., makeAclTensor) is visible to all tests
and that its ownership/cleanup expectations remain unchanged, and update any
call sites to match the helper’s parameters and return type.

In `@tests/ascend/AscendEmbeddingKernelTest.hpp`:
- Around line 23-86: The test uses purely random indices and can miss edge
cases; update EmbeddingFloat16Test so at least one deterministic case forces
token IDs 0, vocab_size-1 and repeated tokens: either (A) augment the input
test_cases with a specific tuple (e.g., small batch/seq/vocab/hidden) used to
build a bespoke indices_cpu layout that places 0, vocab_size-1 and repeated IDs
at known positions, or (B) after the random fill for indices_cpu (in the block
that uses std::mt19937 and dist) add an explicit override for particular
positions (e.g., indices_cpu.ptr<int32_t>()[0]=0;
indices_cpu.ptr<int32_t>()[1]=vocab_size-1; fill some other positions with the
same token to create repeats) so that the gather call (nn::functional::gather)
is exercised with these edge IDs; keep the rest of the test flow (weight_cpu,
ref_cpu, weight_ascend, indices_ascend, y_ascend comparisons) unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

Comment thread CMakeLists.txt
Comment on lines +52 to +60
# ATB libraries with cxx_abi_0 require using the old C++ ABI globally
if(MLLM_BUILD_ASCEND_BACKEND)
  if(DEFINED ENV{ATB_HOME_PATH})
    if("$ENV{ATB_HOME_PATH}" MATCHES "cxx_abi_0")
      message(STATUS "Detected ATB cxx_abi_0, setting _GLIBCXX_USE_CXX11_ABI=0 globally for ABI compatibility")
      add_compile_definitions(_GLIBCXX_USE_CXX11_ABI=0)
    endif()
  endif()
endif()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Global ABI definitions =="
rg -n --iglob 'CMakeLists.txt' 'add_compile_definitions\([^)]*_GLIBCXX_USE_CXX11_ABI=0'

echo
echo "== Target-scoped ABI definitions =="
rg -n --iglob 'CMakeLists.txt' 'target_compile_definitions\([^)]*_GLIBCXX_USE_CXX11_ABI'

echo
echo "== Ascend/ATB-linked targets =="
rg -n --iglob 'CMakeLists.txt' 'target_link_libraries\(|MllmAscendBackend|ascendcl|atb'

Repository: UbiquitousLearning/mllm

Length of output: 12749


Scope ABI switch to Ascend targets, not globally.

Line 57 applies _GLIBCXX_USE_CXX11_ABI=0 to the entire build. This affects non-Ascend targets (CPU, CUDA, OpenCL, QNN backends) and can introduce ABI mismatches with external dependencies built with ABI=1. The definition already exists target-scoped in mllm/backends/ascend/graph/CMakeLists.txt:27, but the global definition overrides the entire build. Keep detection logic here, but propagate a flag via cache variable and apply the definition only to Ascend-linked targets (MllmAscendBackend, MllmAscendGraph).

Suggested refactor in this file
-      message(STATUS "Detected ATB cxx_abi_0, setting _GLIBCXX_USE_CXX11_ABI=0 globally for ABI compatibility")
-      add_compile_definitions(_GLIBCXX_USE_CXX11_ABI=0)
+      message(STATUS "Detected ATB cxx_abi_0; enabling ABI0 for Ascend targets")
+      set(MLLM_ASCEND_USE_GLIBCXX_ABI0 ON CACHE BOOL "Use _GLIBCXX_USE_CXX11_ABI=0 for Ascend targets")

Then in mllm/backends/ascend/CMakeLists.txt, apply conditionally to MllmAscendBackend.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CMakeLists.txt` around lines 52 - 60, The global
add_compile_definitions(_GLIBCXX_USE_CXX11_ABI=0) should be removed and replaced
by setting a cache flag when ATB cxx_abi_0 is detected (e.g.,
set(MLLM_ASCEND_USE_CXX11_ABI_0 TRUE CACHE BOOL "Ascend ABI0")) in this
CMakeLists, then remove the global definition here and instead read that cache
variable in mllm/backends/ascend/CMakeLists.txt and
mllm/backends/ascend/graph/CMakeLists.txt and apply
target_compile_definitions(_GLIBCXX_USE_CXX11_ABI=0 PRIVATE) only to the Ascend
targets MllmAscendBackend and MllmAscendGraph so other targets
(CPU/CUDA/OpenCL/QNN) are unaffected.

Comment on lines +146 to +178
// Print some output statistics
int vocab_size = cfg.vocab_size;
float min_val = 0, max_val = 0, sum = 0;
int64_t argmax_idx = 0;

// Handle FP16 data type
if (logits.dtype() == mllm::kFloat16) {
  auto logits_ptr = logits.ptr<mllm::mllm_fp16_t>();
  min_val = static_cast<float>(logits_ptr[0]);
  max_val = min_val;
  for (int i = 0; i < vocab_size; ++i) {
    float val = static_cast<float>(logits_ptr[i]);
    sum += val;
    if (val < min_val) min_val = val;
    if (val > max_val) {
      max_val = val;
      argmax_idx = i;
    }
  }
} else {
  auto logits_ptr = logits.ptr<float>();
  min_val = logits_ptr[0];
  max_val = min_val;
  for (int i = 0; i < vocab_size; ++i) {
    float val = logits_ptr[i];
    sum += val;
    if (val < min_val) min_val = val;
    if (val > max_val) {
      max_val = val;
      argmax_idx = i;
    }
  }
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check how logits are returned from the model forward pass
ast-grep --pattern $'forward($$$) {
  $$$
  return $_;
}'
rg -n -A5 'def forward|logits' --type cpp -g '*qwen_ascend*'

Repository: UbiquitousLearning/mllm

Length of output: 402


🏁 Script executed:

# Read the relevant section of the file to see full context
cat -n examples/qwen_ascend/main.cpp | sed -n '130,185p'

Repository: UbiquitousLearning/mllm

Length of output: 2586


🏁 Script executed:

# Look for the forward pass and input tensor creation
cat -n examples/qwen_ascend/main.cpp | sed -n '100,145p'

Repository: UbiquitousLearning/mllm

Length of output: 2380


🏁 Script executed:

# Check test_seq_len default value and any slicing operations
cat -n examples/qwen_ascend/main.cpp | sed -n '50,75p'

Repository: UbiquitousLearning/mllm

Length of output: 1360


🏁 Script executed:

# Search for model.forward() implementation or any slicing before statistics
rg -n 'forward\(' examples/qwen_ascend/main.cpp | head -20

Repository: UbiquitousLearning/mllm

Length of output: 117


🏁 Script executed:

# Check if there's any reshaping or slicing of logits after line 144 (where it's moved to CPU)
# and before the statistics calculation starts at line 146
cat -n examples/qwen_ascend/main.cpp | sed -n '144,180p'

Repository: UbiquitousLearning/mllm

Length of output: 1494


🏁 Script executed:

# Check if there's any mention of how the model handles sequence positions
# or if there's slicing happening in the model itself
rg -n 'view|reshape|slice|gather|squeeze' examples/qwen_ascend/main.cpp | head -20

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Check how other examples or tests handle multi-dimensional tensor access
rg -n 'logits\[' --type cpp -g '*.cpp' | head -30

Repository: UbiquitousLearning/mllm

Length of output: 122


🏁 Script executed:

# Look for ARGeneration to see how it extracts logits
cat -n mllm/models/ARGeneration.cpp | sed -n '470,490p'

Repository: UbiquitousLearning/mllm

Length of output: 757


🏁 Script executed:

# Check if the example uses getLastLogits anywhere
rg -n 'getLastLogits' examples/qwen_ascend/main.cpp

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

# Check what the forward() function signature is and what it returns
rg -n 'forward\(' mllm/models/ARGeneration.h | head -10

Repository: UbiquitousLearning/mllm

Length of output: 134


🏁 Script executed:

# Find the ARGeneration header
find . -name "*.h" -o -name "*.hpp" | xargs grep -l "class ARGeneration" | head -5

Repository: UbiquitousLearning/mllm

Length of output: 98


🏁 Script executed:

# Check the forward method in ARGeneration
rg -n 'forward' mllm/models/ARGeneration.hpp | head -20

Repository: UbiquitousLearning/mllm

Length of output: 183


🏁 Script executed:

# Check what test_seq_len default value is and if there's documentation on expected behavior
rg -n 'test_seq_len|seq_len' examples/qwen_ascend/main.cpp | head -20

Repository: UbiquitousLearning/mllm

Length of output: 620


Incorrect indexing of logits tensor; computes statistics from position 0 instead of position S-1.

The logits tensor has shape [B, S, vocab_size] (as shown in lines 134-135), but the statistics loop accesses only logits_ptr[0..vocab_size-1], which corresponds to the first sequence position [B, 0, vocab_size]. With test input of shape [1, 8], the output is [1, 8, vocab_size], so the current code analyzes position 0 instead of the final position 7.

For autoregressive generation, the relevant logits are at the last sequence position. Use logits[{kAll, logits.shape()[1] - 1, kAll}] or the existing getLastLogits() utility (see ARGeneration.cpp:471) to extract the correct position before computing statistics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/qwen_ascend/main.cpp` around lines 146 - 178, The current stats loop
reads logits_ptr[0..vocab_size-1] which uses the first sequence position;
instead extract the final timestep logits (shape [B,S,vocab_size] -> use the
last S-1 position) before computing min/max/sum/argmax. Replace the direct
logits.ptr access with the utility that returns the last-position slice (e.g.
call getLastLogits() used in ARGeneration or index logits with logits[{kAll,
logits.shape()[1]-1, kAll}]) and then use that pointer when handling both
kFloat16 (mllm::mllm_fp16_t) and float branches; keep using cfg.vocab_size,
logits.ptr, kFloat16, and argmax_idx as before.

Comment on lines +209 to +216
  auto ret = aclrtMemcpyAsync(output.deviceData,
                              total_bytes,
                              host_mask_buffer_,
                              total_bytes,
                              ACL_MEMCPY_HOST_TO_DEVICE,
                              getGlobalAtbStream());
  return ret == ACL_SUCCESS ? atb::NO_ERROR : atb::ERROR_RT_FAIL;
}

⚠️ Potential issue | 🟠 Major

Potential race condition with async memcpy and reused host buffer.

aclrtMemcpyAsync is non-blocking, but host_mask_buffer_ is a class member that may be reused in a subsequent Execute() call before the async copy completes. This could corrupt the in-flight transfer.

Either synchronize after the async copy or use a pinned buffer pool with double-buffering.

🔧 Option 1: Synchronize before returning
   auto ret = aclrtMemcpyAsync(output.deviceData,
                               total_bytes,
                               host_mask_buffer_,
                               total_bytes,
                               ACL_MEMCPY_HOST_TO_DEVICE,
                               getGlobalAtbStream());
+  if (ret == ACL_SUCCESS) {
+    ret = aclrtSynchronizeStream(getGlobalAtbStream());
+  }
   return ret == ACL_SUCCESS ? atb::NO_ERROR : atb::ERROR_RT_FAIL;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/backends/ascend/graph/AscendCausalMaskTensorPluginOperation.cpp` around
lines 209 - 216, The async memcpy using aclrtMemcpyAsync writes from the class
member host_mask_buffer_ which may be reused by a subsequent Execute(); to avoid
the race, wait for the copy to finish before returning: after calling
aclrtMemcpyAsync (and checking its return), synchronize the stream returned by
getGlobalAtbStream() (e.g., via aclrtSynchronizeStream or equivalent) before
returning the error code; alternatively, implement a pinned-host buffer pool or
double-buffering for host_mask_buffer_ and ensure the buffer used for the
in-flight transfer is not reused until the async transfer completes.

Comment on lines +566 to +585
if (cast_cache == nullptr || cast_cache->executor == nullptr ||
!sameTensorDesc(cast_cache->src_desc, x_clamped_t.desc) ||
!sameTensorDesc(cast_cache->dst_desc, x_int8_t.desc)) {
return atb::ERROR_INTERNAL_ERROR;
}
auto cast_update_st = updateCastTensorAddrs(cast_cache, x_clamped_t, x_int8_t);
if (profile_enabled) cast_query_ms = elapsedMs(cast_query_start, Clock::now());
const auto cast_exec_start = profile_enabled ? Clock::now() : Clock::time_point{};
aclError ret = (cast_update_st == atb::NO_ERROR)
? aclnnCast(cast_cache->workspace_size > 0 ? cast_ws_ptr : nullptr,
cast_cache->workspace_size,
cast_cache->executor,
getGlobalAtbStream())
: ACL_ERROR_INTERNAL_ERROR;
if (profile_enabled && ret == ACL_SUCCESS) {
syncGlobalAtbStream();
cast_exec_ms = elapsedMs(cast_exec_start, Clock::now());
}
if (ret != ACL_SUCCESS) return atb::ERROR_INTERNAL_ERROR;
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for other aclnnCast usages to see if sync is typically required
rg -n 'aclnnCast\(' --type=cpp -A 5 -B 2

Repository: UbiquitousLearning/mllm

Length of output: 5920


🏁 Script executed:

#!/bin/bash
# Check what happens after line 585 in the reviewed file
sed -n '580,610p' mllm/backends/ascend/graph/AscendLinearW8A8PluginOperation.cpp

Repository: UbiquitousLearning/mllm

Length of output: 1470


🏁 Script executed:

#!/bin/bash
# Look for the Linear operation execution immediately following the cast
# to see if it implicitly synchronizes
rg -n 'aclnnLinear\(' --type=cpp -A 3 -B 2 | head -50

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

#!/bin/bash
# Check if there are any stream synchronization utilities or patterns
# that might provide implicit sync between operations
rg -n 'syncGlobalAtbStream\|aclrtSynchronizeStream' --type=cpp | head -20

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

#!/bin/bash
# Find the linear_op_ type and its Execute method
rg -n 'linear_op_' mllm/backends/ascend/graph/AscendLinearW8A8PluginOperation.cpp | head -20

Repository: UbiquitousLearning/mllm

Length of output: 562


🏁 Script executed:

#!/bin/bash
# Look at the complete picture - check test pattern again carefully
# to understand expected behavior
grep -A 10 -B 5 'aclnnCast.*stream' tests/ascend/AscendAclnnKernelTest.hpp

Repository: UbiquitousLearning/mllm

Length of output: 595


🏁 Script executed:

#!/bin/bash
# Check if there's any documentation or comments about stream behavior
rg -n 'stream.*async\|asynchronous\|synchronous' --type=cpp -i | head -20

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

#!/bin/bash
# Find createLinearW8A8Op to understand the linear operation type
rg -n 'createLinearW8A8Op' --type=cpp -A 15 | head -50

Repository: UbiquitousLearning/mllm

Length of output: 4895


🏁 Script executed:

#!/bin/bash
# Check the ATB Linear W8A8 operation implementation
fd -t f 'AtbLinearW8A8\|LinearW8A8' --type f | head -10

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

#!/bin/bash
# Look for stream behavior in ATB operations - check if Execute implicitly syncs
rg -n 'Execute.*stream\|stream.*Execute' mllm/backends/ascend --type=cpp -A 2 | head -40

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

#!/bin/bash
# Check if ATB operations documentation mentions stream behavior or synchronization
rg -n 'OperationInfra\|Execute.*atb\|stream.*ordering' --type=cpp -B 2 -A 2 | head -60

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

#!/bin/bash
# Look for other operations that might have similar patterns - synchronization after async ops
rg -n 'aclnnRound\|aclnnClamp' --type=cpp -A 4 -B 1 | head -60

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

#!/bin/bash
# Search for all sync calls to understand the pattern
rg -n 'syncGlobalAtbStream\|aclrtSynchronizeStream\|aclrtStreamSynchronize' --type=cpp | grep -v test | head -30

Repository: UbiquitousLearning/mllm

Length of output: 49


🏁 Script executed:

#!/bin/bash
# Look at the quant_subgraph_op execution which happens before cast
# to see if there's a sync pattern there
sed -n '540,575p' mllm/backends/ascend/graph/AscendLinearW8A8PluginOperation.cpp

Repository: UbiquitousLearning/mllm

Length of output: 2051


🏁 Script executed:

#!/bin/bash
# Check how AscendLinearDynamicW8A8.cpp handles the cast/linear sequence
# since it also has aclnnCast
sed -n '240,260p' mllm/backends/ascend/ops/AscendLinearDynamicW8A8.cpp

Repository: UbiquitousLearning/mllm

Length of output: 966


🏁 Script executed:

#!/bin/bash
# Final check: confirm cast is async by looking at ACL documentation patterns
# and check if there are any comments explaining why sync is skipped
sed -n '560,595p' mllm/backends/ascend/graph/AscendLinearW8A8PluginOperation.cpp

Repository: UbiquitousLearning/mllm

Length of output: 1918


Always synchronize after aclnnCast regardless of profiling state.

The stream synchronization at lines 580–583 should not be gated on profile_enabled. Test code consistently synchronizes immediately after aclnnCast() without profiling conditions. Since x_int8_t (cast output) is consumed directly by the linear operation at line 591, the cast must complete before the linear operation begins. When profiling is disabled, the stream may remain unsynchronized, causing the linear operation to read stale or uninitialized data.

Change to:

if (profile_enabled && ret == ACL_SUCCESS) {
  syncGlobalAtbStream();
  cast_exec_ms = elapsedMs(cast_exec_start, Clock::now());
}

to:

if (ret == ACL_SUCCESS) {
  syncGlobalAtbStream();
  if (profile_enabled) cast_exec_ms = elapsedMs(cast_exec_start, Clock::now());
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/backends/ascend/graph/AscendLinearW8A8PluginOperation.cpp` around lines
566 - 585, The current flow only calls syncGlobalAtbStream() when
profile_enabled is true, which can leave aclnnCast() unsynchronized before the
subsequent linear operation uses x_int8_t; change the logic around the
aclnnCast() result (ret) so that if ret == ACL_SUCCESS you always call
syncGlobalAtbStream(), and only compute cast_exec_ms =
elapsedMs(cast_exec_start, Clock::now()) when profile_enabled is true; update
the block that currently checks (profile_enabled && ret == ACL_SUCCESS) to
instead check ret == ACL_SUCCESS first (calling syncGlobalAtbStream()), then
conditionally set cast_exec_ms if profile_enabled. Ensure this change is applied
next to the aclnnCast call and variables cast_exec_start, cast_exec_ms, ret,
profile_enabled, and x_int8_t.

Comment on lines 104 to 115
if (remain_size_ > align_size) {
  block_id = generateBlocksId();
  uint64_t cur_mem_ptr_align = (reinterpret_cast<uint64_t>(cur_mem_ptr_) + 63) & ~63;
  remain_size_ -= (cur_mem_ptr_align - reinterpret_cast<uint64_t>(cur_mem_ptr_));
  cur_mem_ptr_ = reinterpret_cast<void *>(cur_mem_ptr_align);

-  MemoryBlock block = {block_id, align_size, cur_mem_ptr_};
+  MemoryBlock block = {block_id, align_size, size, cur_mem_ptr_, total_alloc_count_, 0};
  used_blocks_.insert({block_id, block});
  remain_size_ -= align_size;
  cur_mem_ptr_ = reinterpret_cast<uint8_t *>(cur_mem_ptr_) + align_size;
  total_new_alloc_count_++;
  MLLM_INFO("allocate block id {} for size {}", block_id, align_size);

⚠️ Potential issue | 🟠 Major

Allow exact-fit allocations from the remaining pool.

Line 104 uses remain_size_ > align_size, which rejects requests that exactly consume the last available chunk of the pool. That turns a valid allocation into a false OOM.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/backends/ascend/memory/AscendMemoryPool.cpp` around lines 104 - 115, The
allocation check currently uses remain_size_ > align_size which prevents
exact-fit allocations; update the condition to allow remain_size_ >= align_size
so requests that exactly consume the remaining pool succeed. In the allocation
branch around generateBlocksId(), cur_mem_ptr_, and used_blocks_ where
MemoryBlock is constructed and total_new_alloc_count_ is incremented (and
MLLM_INFO logged), ensure the alignment adjustment and subsequent updates to
remain_size_ and cur_mem_ptr_ remain correct when remain_size_ == align_size so
an exact-fit allocation is recorded into used_blocks_ rather than treated as
OOM.

Comment on lines +24 to +205
inline atb::Operation* createLinearGraphOp(bool has_bias) {
atb::infer::LinearParam param;
param.transposeA = false;
param.transposeB = true;
param.hasBias = has_bias;
param.outDataType = ACL_DT_UNDEFINED;
param.enAccum = false;
param.matmulType = atb::infer::LinearParam::MATMUL_UNDEFINED;
param.quantMode = atb::infer::LinearParam::QUANT_UNDEFINED;

atb::Operation* op = nullptr;
auto st = atb::CreateOperation(param, &op);
if (st != atb::NO_ERROR || op == nullptr) {
MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB CreateOperation(Linear) failed, status={}", static_cast<int>(st));
}
return op;
}

inline atb::Operation* createRmsNormGraphOp(float epsilon) {
atb::infer::RmsNormParam param;
param.layerType = atb::infer::RmsNormParam::RmsNormType::RMS_NORM_NORM;
param.normParam.quantType = atb::infer::QuantType::QUANT_UNQUANT;
param.normParam.epsilon = epsilon;

atb::Operation* op = nullptr;
auto st = atb::CreateOperation(param, &op);
if (st != atb::NO_ERROR || op == nullptr) {
MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB CreateOperation(RMS_NORM) failed, status={}", static_cast<int>(st));
}
return op;
}

inline atb::Operation* createRoPEGraphOp() {
atb::infer::RopeParam param;
param.rotaryCoeff = 2;
param.cosFormat = 0;

atb::Operation* op = nullptr;
auto st = atb::CreateOperation(param, &op);
if (st != atb::NO_ERROR || op == nullptr) {
MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB CreateOperation(RoPE) failed, status={}", static_cast<int>(st));
}
return op;
}

inline atb::Operation* createTransposeGraphOp(int rank, int dim0, int dim1) {
atb::infer::TransposeParam param;
for (int i = 0; i < rank; ++i) {
param.perm.push_back(i);
}
std::swap(param.perm[dim0], param.perm[dim1]);

atb::Operation* op = nullptr;
auto st = atb::CreateOperation(param, &op);
if (st != atb::NO_ERROR || op == nullptr) {
MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB CreateOperation(Transpose) failed, status={}", static_cast<int>(st));
}
return op;
}

inline atb::Operation* createSoftmaxGraphOp(int axis) {
atb::infer::SoftmaxParam param;
param.axes.push_back(axis);

atb::Operation* op = nullptr;
auto st = atb::CreateOperation(param, &op);
if (st != atb::NO_ERROR || op == nullptr) {
MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB CreateOperation(Softmax) failed, status={}", static_cast<int>(st));
}
return op;
}

inline atb::Operation* createAddGraphOp() {
atb::infer::ElewiseParam param;
param.elewiseType = atb::infer::ElewiseParam::ELEWISE_ADD;

atb::Operation* op = nullptr;
auto st = atb::CreateOperation(param, &op);
if (st != atb::NO_ERROR || op == nullptr) {
MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB CreateOperation(ELEWISE_ADD) failed, status={}", static_cast<int>(st));
}
return op;
}

inline atb::Operation* createMulGraphOp() {
atb::infer::ElewiseParam param;
param.elewiseType = atb::infer::ElewiseParam::ELEWISE_MUL;

atb::Operation* op = nullptr;
auto st = atb::CreateOperation(param, &op);
if (st != atb::NO_ERROR || op == nullptr) {
MLLM_ERROR_EXIT(ExitCode::kAscendError, "ATB CreateOperation(ELEWISE_MUL) failed, status={}", static_cast<int>(st));
}
return op;
}

inline atb::Operation* createSiLUGraphOp() {
atb::infer::ActivationParam param;
param.activationType = atb::infer::ACTIVATION_SWISH;

atb::Operation* op = nullptr;
auto st = atb::CreateOperation(param, &op);
if (st != atb::NO_ERROR || op == nullptr) {
MLLM_ERROR_EXIT(ExitCode::kAscendError,
"ATB CreateOperation(ACTIVATION_SWISH) failed, status={}",
static_cast<int>(st));
}
return op;
}

inline bool isQwenAscendDecoderGraphEnabled() {
const char* env = std::getenv("MLLM_ASCEND_QWEN_DECODER_GRAPH");
return env == nullptr || env[0] != '0';
}

inline int32_t getQwenAscendDecoderGraphSetupBucketSize() {
const char* env = std::getenv("MLLM_ASCEND_QWEN_DECODER_GRAPH_SETUP_BUCKET");
if (env == nullptr || env[0] == '\0') return 0;
char* end = nullptr;
const long parsed = std::strtol(env, &end, 10);
if (end == env || parsed <= 0) return 0;
return static_cast<int32_t>(parsed);
}

class QwenAscendDecoderGraphRunner final {
public:
QwenAscendDecoderGraphRunner() = default;

void configure(int32_t max_cache_length) {
max_cache_length_ = max_cache_length;
setup_bucket_size_ = getQwenAscendDecoderGraphSetupBucketSize();
if (setup_bucket_size_ > max_cache_length_) {
setup_bucket_size_ = max_cache_length_;
}
}

[[nodiscard]] bool hasExecutor() const { return executor_ != nullptr; }
[[nodiscard]] int32_t setupBucketSize() const { return setup_bucket_size_; }

void setExecutor(std::unique_ptr<mllm::ascend::AscendGraphExecutor> executor) {
executor_ = std::move(executor);
}

Tensor attentionScaleTensor(int32_t head_dim, DataTypes dtype, DeviceTypes device) {
if (attention_scale_tensor_.isNil()
|| attention_scale_tensor_.dtype() != dtype
|| attention_scale_tensor_.device() != device) {
attention_scale_tensor_ =
(Tensor::ones({1, 1, 1, 1}, dtype, kCPU) * (1.f / sqrtf(head_dim))).to(device);
}
return attention_scale_tensor_;
}

Tensor currentSeqLenTensor(int32_t seq_len) {
if (current_seq_len_tensor_.isNil()) {
current_seq_len_tensor_ = Tensor::empty({1}, kInt32, kAscend).alloc();
}
int32_t host_seq_len = seq_len;
auto ret = aclrtMemcpy(current_seq_len_tensor_.ptr<void>(),
sizeof(int32_t),
&host_seq_len,
sizeof(int32_t),
ACL_MEMCPY_HOST_TO_DEVICE);
if (ret != ACL_SUCCESS) {
MLLM_ACL_CHECK(ret);
}
return current_seq_len_tensor_;
}

void execute(const std::vector<Tensor>& inputs,
std::vector<Tensor>& outputs) {
MLLM_RT_ASSERT(executor_ != nullptr);
executor_->execute(inputs, outputs);
}

private:
std::unique_ptr<mllm::ascend::AscendGraphExecutor> executor_;
Tensor attention_scale_tensor_;
Tensor current_seq_len_tensor_;
int32_t max_cache_length_{0};
int32_t setup_bucket_size_{0};
};

🛠️ Refactor suggestion | 🟠 Major

Add API docs for public helpers and runner methods.

This header introduces multiple public inline factory functions and a public runner class API without clear parameter/return/error documentation.

As per coding guidelines, "Ensure public APIs, classes, and functions have clear docstrings or comments explaining purpose, parameters, returns, and errors."

🧰 Tools
🪛 Clang (14.0.6)

[error] 143-143: consider replacing 'long' with 'int64'

(google-runtime-int,-warnings-as-errors)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/models/qwen_ascend/qwen_ascend_graph_ops.hpp` around lines 24 - 205, Add
concise API doc comments for each public symbol: the inline factory functions
createLinearGraphOp, createRmsNormGraphOp, createRoPEGraphOp,
createTransposeGraphOp, createSoftmaxGraphOp, createAddGraphOp,
createMulGraphOp, createSiLUGraphOp (describe purpose, parameters like
has_bias/epsilon/axis/rank/dim0/dim1, return value and that they MLLM_ERROR_EXIT
on ATB failure), the helpers isQwenAscendDecoderGraphEnabled and
getQwenAscendDecoderGraphSetupBucketSize (describe env var semantics and return
meanings), and the QwenAscendDecoderGraphRunner class plus its public methods
configure, hasExecutor, setupBucketSize, setExecutor, attentionScaleTensor,
currentSeqLenTensor, and execute (describe intended usage, parameter
types/ownership, return semantics, side effects like device allocation or
memcpy, threading assumptions, and error/exit behavior). Keep each comment brief
and follow the project docstyle for one-line summary + param/return/error notes.

Comment on lines +69 to +75
inline atb::Operation* createTransposeGraphOp(int rank, int dim0, int dim1) {
  atb::infer::TransposeParam param;
  for (int i = 0; i < rank; ++i) {
    param.perm.push_back(i);
  }
  std::swap(param.perm[dim0], param.perm[dim1]);


⚠️ Potential issue | 🔴 Critical

Guard transpose dims before indexing perm.

dim0/dim1 are used directly as vector indices; invalid or negative values can trigger out-of-bounds access.

[suggested fix]

Proposed patch
 inline atb::Operation* createTransposeGraphOp(int rank, int dim0, int dim1) {
+  if (rank <= 0) {
+    MLLM_ERROR_EXIT(ExitCode::kParamError, "Transpose rank must be > 0, got {}", rank);
+  }
+  auto normalize_dim = [rank](int d) { return d < 0 ? d + rank : d; };
+  dim0 = normalize_dim(dim0);
+  dim1 = normalize_dim(dim1);
+  if (dim0 < 0 || dim0 >= rank || dim1 < 0 || dim1 >= rank) {
+    MLLM_ERROR_EXIT(ExitCode::kParamError,
+                    "Transpose dims out of range: rank={}, dim0={}, dim1={}",
+                    rank,
+                    dim0,
+                    dim1);
+  }
+
   atb::infer::TransposeParam param;
   for (int i = 0; i < rank; ++i) {
     param.perm.push_back(i);
   }
   std::swap(param.perm[dim0], param.perm[dim1]);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/models/qwen_ascend/qwen_ascend_graph_ops.hpp` around lines 69 - 75, In
createTransposeGraphOp, guard accesses into param.perm before using dim0/dim1:
validate that dim0 and dim1 are within [0, rank-1] and non-negative (or
otherwise handle invalid inputs) and return a safe failure (e.g., nullptr) or
throw an explanatory error if they are out of range; locate the function
createTransposeGraphOp and the vector param.perm to add an early check (if (dim0
< 0 || dim1 < 0 || dim0 >= rank || dim1 >= rank) { /* handle */ }) before
calling std::swap, so you avoid out-of-bounds indexing.

Comment on lines +167 to +174
Tensor attentionScaleTensor(int32_t head_dim, DataTypes dtype, DeviceTypes device) {
if (attention_scale_tensor_.isNil()
|| attention_scale_tensor_.dtype() != dtype
|| attention_scale_tensor_.device() != device) {
attention_scale_tensor_ =
(Tensor::ones({1, 1, 1, 1}, dtype, kCPU) * (1.f / sqrtf(head_dim))).to(device);
}
return attention_scale_tensor_;

⚠️ Potential issue | 🟠 Major

attentionScaleTensor cache key is incomplete.

The cached tensor ignores head_dim; if head_dim changes while dtype/device stay the same, the scale value becomes stale and wrong.

Proposed patch
   Tensor attentionScaleTensor(int32_t head_dim, DataTypes dtype, DeviceTypes device) {
+    MLLM_RT_ASSERT(head_dim > 0);
     if (attention_scale_tensor_.isNil()
         || attention_scale_tensor_.dtype() != dtype
-        || attention_scale_tensor_.device() != device) {
+        || attention_scale_tensor_.device() != device
+        || cached_head_dim_ != head_dim) {
       attention_scale_tensor_ =
           (Tensor::ones({1, 1, 1, 1}, dtype, kCPU) * (1.f / sqrtf(head_dim))).to(device);
+      cached_head_dim_ = head_dim;
     }
     return attention_scale_tensor_;
   }

Also add this field in the private section:

int32_t cached_head_dim_{-1};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/models/qwen_ascend/qwen_ascend_graph_ops.hpp` around lines 167 - 174,
The cache key for attentionScaleTensor is missing head_dim: update
attentionScaleTensor to also check cached_head_dim_ (add private int32_t
cached_head_dim_{-1};) and only reuse attention_scale_tensor_ when
cached_head_dim_ == head_dim in addition to dtype/device; when you recreate the
tensor (in the block that assigns attention_scale_tensor_) set cached_head_dim_
= head_dim so the cache is invalidated correctly on head size changes.

Comment on lines +39 to +58
inline auto makeRotaryPosEmbedding(Tensor& position_ids, const Tensor& inv_freq,
float attention_scaling = 1.0f) -> std::pair<Tensor, Tensor> {
auto batch_size = position_ids.shape()[0];
auto seq_len = position_ids.shape()[1];
auto inv_freq_len = inv_freq.shape()[0];
auto dim = inv_freq_len * 2;

auto freqs = Tensor::empty({batch_size, seq_len, inv_freq_len}, kFloat32, kCPU).alloc();
auto freqs_ptr = freqs.ptr<float>();
auto position_ids_ptr = position_ids.ptr<int64_t>();
auto inv_freq_ptr = inv_freq.ptr<float>();

for (int b = 0; b < batch_size; ++b) {
for (int s = 0; s < seq_len; ++s) {
auto pos = position_ids_ptr[b * seq_len + s];
for (int d = 0; d < inv_freq_len; ++d) {
freqs_ptr[b * seq_len * inv_freq_len + s * inv_freq_len + d] = static_cast<float>(pos) * inv_freq_ptr[d];
}
}
}

⚠️ Potential issue | 🔴 Critical

Type mismatch: position_ids is cast to int64_t* but created as kInt32.

In makeLocalRoPEPositionIds (line 20), the tensor is created with kInt32, but in makeRotaryPosEmbedding (line 48), the pointer is cast to int64_t*. This will cause incorrect values and potential buffer overrun.

🐛 Fix: Use consistent types

Either change makeLocalRoPEPositionIds to use kInt64:

 inline auto makeLocalRoPEPositionIds(int batch_size, int seq_len) -> Tensor {
-  auto rope_pos_ids = Tensor::empty({batch_size, seq_len}, kInt32, kCPU).alloc();
-  auto* ptr = rope_pos_ids.ptr<int32_t>();
+  auto rope_pos_ids = Tensor::empty({batch_size, seq_len}, kInt64, kCPU).alloc();
+  auto* ptr = rope_pos_ids.ptr<int64_t>();
   for (int b = 0; b < batch_size; ++b) {

Or change makeRotaryPosEmbedding to use int32_t*:

-  auto position_ids_ptr = position_ids.ptr<int64_t>();
+  auto position_ids_ptr = position_ids.ptr<int32_t>();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/models/qwen_ascend/qwen_ascend_rope.hpp` around lines 39 - 58, The issue
is a dtype mismatch between makeLocalRoPEPositionIds (which creates position_ids
as kInt32) and makeRotaryPosEmbedding (which casts position_ids.ptr to
int64_t*); fix by making the types consistent: either change
makeLocalRoPEPositionIds to allocate kInt64 for position_ids, or change
makeRotaryPosEmbedding to use int32_t* (and treat pos as int32_t when computing
freqs). Update the pointer type and any related uses of pos in
makeRotaryPosEmbedding to match the chosen integer dtype so indexing and
multiplications are correct.

Comment on lines +457 to +458
getAscendMemoryManager().allocateBlock(static_cast<uint32_t>(ws_size), ws_bid);
getAscendMemoryManager().getBlockPtr(ws_bid, ws);

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n -C2 'allocateBlock\s*\(' .
files="$(fd -a 'AscendMemoryPool\.(hpp|cpp)$' .; fd -a 'AscendMemoryManager\.(hpp|cpp)$' .)"
rg -n -C2 'uint32_t\s+size|size_t\s+size|allocateBlock\s*\(' $files

Repository: UbiquitousLearning/mllm

Length of output: 21842


🏁 Script executed:

# Find where ws_size is declared/assigned in the test file
sed -n '450,600p' ./tests/ascend/AscendLinearKernelTest.hpp | head -100

Repository: UbiquitousLearning/mllm

Length of output: 4205


Fix workspace size parameter type mismatch in allocateBlock calls.

ws_size is declared as uint64_t and populated by ACL/ATB APIs (aclnnRoundGetWorkspaceSize, aclnnClampGetWorkspaceSize, etc.), but narrowed to uint32_t when calling allocateBlock(). Any workspace exceeding 4 GiB will wrap and under-allocate, causing memory corruption or opaque failures.

Update the allocator to accept uint64_t or ensure workspace sizes are validated and capped to 4 GiB before allocation.

Also applies to: 484-485, 514-515, 543-544, 593-594

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/ascend/AscendLinearKernelTest.hpp` around lines 457 - 458, The
allocateBlock call is narrowing ws_size (uint64_t) to uint32_t which can wrap
for >4GiB; update the allocator API or its overload used by
getAscendMemoryManager().allocateBlock to accept uint64_t (or change its
signature) so you pass the full ws_size, and adjust
getAscendMemoryManager().getBlockPtr(ws_bid, ws) callers accordingly; if you
cannot change the allocator signature, add a validated/capped conversion that
checks ws_size <= 0xFFFFFFFFULL (or caps to 4GiB and logs/errors) before calling
allocateBlock. Apply the same fix for the other occurrences where ws_size is
passed (the calls near lines referenced in the review: the other
allocateBlock/getBlockPtr pairs).

Collaborator

@chenghuaWang chenghuaWang left a comment

LGTM

@chenghuaWang chenghuaWang changed the title from "Support Qwen3 on Ascend Backend" to "feat(Ascend): Support Qwen3 on Ascend Backend" Apr 28, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
mllm/backends/ascend/ops/AscendCausalMaskKernel.cpp (1)

77-103: ⚠️ Potential issue | 🟠 Major

The host round-trip is still the hot-path bottleneck.

This still copies the full tensor device→host→device on every call, so the latency issue previously flagged remains. For graph execution, this will dominate the kernel cost and erase the Ascend-side benefit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/backends/ascend/ops/AscendCausalMaskKernel.cpp` around lines 77 - 103,
The code currently round-trips the entire tensor via aclrtMemcpy to host
(host_data) and back, causing a hot-path bottleneck; replace this with a
device-side implementation: remove the host_data vector and both aclrtMemcpy
calls in the block and instead launch an Ascend device kernel (e.g., implement
and call launchDeviceCausalMaskKernel or
deviceApplyStandardCausalMask/deviceApplySlidingWindowCausalMask) that operates
directly on input.deviceData (or writes to output.deviceData) on the device;
choose the kernel variant based on sliding_window, use aclrtLaunchKernel (or the
appropriate ACL launch API) with the current stream, pass
B,H,S,D,window_size,mask_val, and check return codes for errors (replacing
current aclrtMemcpy error checks).
🧹 Nitpick comments (2)
README.md (1)

135-135: Consider making device-status evidence traceable.

The static build-passing badge on Line 135 can drift over time. Recommend linking the badge or row to the latest CI run/job for OrangePi AI Pro(310B), so readers can verify current status quickly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` at line 135, Update the static badge for the "OrangePi AI
Pro(310B)" row so it links to the latest CI job instead of a fixed image: locate
the table row containing "OrangePi AI Pro(310B)" and the "build-passing" badge
and replace the standalone badge image with a linked badge that points to your
CI pipeline/job URL (or use your CI provider's dynamic badge URL), or wrap the
existing badge in a markdown link to the most recent run; ensure the README.md
row shows the badge as a link so readers can click through to the live CI
status.
mllm/backends/ascend/ops/AscendCausalMaskKernel.cpp (1)

108-111: Document the exported kernel entrypoint.

This public backend API has no comment explaining the tensor layout, preconditions, or error returns. As per coding guidelines, public APIs, classes, and functions must have clear docstrings or comments explaining purpose, parameters, returns, and errors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/backends/ascend/ops/AscendCausalMaskKernel.cpp` around lines 108 - 111,
Add a clear doc comment above the exported kernel entrypoint
executeAscendCausalMaskKernel describing its purpose and public contract:
explain the expected tensor layouts for input and output (shapes, dtype, memory
layout/format), semantics of sliding_window and window_size, required
preconditions (e.g., matching batch/seq dims, non-null tensors, valid
window_size > 0), what the function returns on success (atb::Status::OK) and the
kinds of error Status values it may produce (invalid-arg, shape-mismatch,
runtime-failure) and when; include any concurrency/memory-alignment assumptions
and mention whether the function mutates input or only writes output. Ensure the
comment uses the same function name executeAscendCausalMaskKernel so reviewers
can locate it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@mllm/backends/ascend/ops/AscendCausalMaskKernel.cpp`:
- Around line 53-59: The mask currently computes copy_start_idx from s so cached
prefix columns (when D > S) are never masked; fix by computing the absolute
query position query_pos = D - S + s (or equivalent) and use that to derive left
and right bounds: copy_start_idx = std::max<int64_t>(0, query_pos - window_size
+ 1) and set masked columns for d < copy_start_idx and d > query_pos (replace
uses of s in the right-hand loop). Update the loops that write into
row_ptr[row_offset + d] to use query_pos instead of s so the sliding-window mask
correctly applies for decode/prefill shapes (symbols to edit: copy_start_idx, s,
D, S, window_size, row_ptr, row_offset).

In `@README.md`:
- Around line 76-79: The table header "Hexagon NPU INT8" is inconsistent with
the cell value "W4A16-SM8650"; update the header to a relaxed label like
"Hexagon NPU" (or ensure all column entries are strictly INT8) so the column
accurately reflects the listed formats; locate and change the header text
"Hexagon NPU INT8" in the table and keep the row entries such as "W4A16-SM8650"
and "✔️ w4a8" unchanged.

---

Duplicate comments:
In `@mllm/backends/ascend/ops/AscendCausalMaskKernel.cpp`:
- Around line 77-103: The code currently round-trips the entire tensor via
aclrtMemcpy to host (host_data) and back, causing a hot-path bottleneck; replace
this with a device-side implementation: remove the host_data vector and both
aclrtMemcpy calls in the block and instead launch an Ascend device kernel (e.g.,
implement and call launchDeviceCausalMaskKernel or
deviceApplyStandardCausalMask/deviceApplySlidingWindowCausalMask) that operates
directly on input.deviceData (or writes to output.deviceData) on the device;
choose the kernel variant based on sliding_window, use aclrtLaunchKernel (or the
appropriate ACL launch API) with the current stream, pass
B,H,S,D,window_size,mask_val, and check return codes for errors (replacing
current aclrtMemcpy error checks).

---

Nitpick comments:
In `@mllm/backends/ascend/ops/AscendCausalMaskKernel.cpp`:
- Around line 108-111: Add a clear doc comment above the exported kernel
entrypoint executeAscendCausalMaskKernel describing its purpose and public
contract: explain the expected tensor layouts for input and output (shapes,
dtype, memory layout/format), semantics of sliding_window and window_size,
required preconditions (e.g., matching batch/seq dims, non-null tensors, valid
window_size > 0), what the function returns on success (atb::Status::OK) and the
kinds of error Status values it may produce (invalid-arg, shape-mismatch,
runtime-failure) and when; include any concurrency/memory-alignment assumptions
and mention whether the function mutates input or only writes output. Ensure the
comment uses the same function name executeAscendCausalMaskKernel so reviewers
can locate it.

In `@README.md`:
- Line 135: Update the static badge for the "OrangePi AI Pro(310B)" row so it
links to the latest CI job instead of a fixed image: locate the table row
containing "OrangePi AI Pro(310B)" and the "build-passing" badge and replace the
standalone badge image with a linked badge that points to your CI pipeline/job
URL (or use your CI provider's dynamic badge URL), or wrap the existing badge in
a markdown link to the most recent run; ensure the README.md row shows the badge
as a link so readers can click through to the live CI status.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 78f647ee-6517-44e0-8e0c-db5bf6c66f26

📥 Commits

Reviewing files that changed from the base of the PR and between 7304d56 and 16d3539.

📒 Files selected for processing (2)
  • README.md
  • mllm/backends/ascend/ops/AscendCausalMaskKernel.cpp

Comment on lines +53 to +59
const int64_t copy_start_idx = std::max<int64_t>(0, s - window_size + 1);
for (int64_t d = 0; d < copy_start_idx; ++d) {
row_ptr[row_offset + d] = mask_val;
}
for (int64_t d = s + 1; d < D; ++d) {
row_ptr[row_offset + d] = mask_val;
}

⚠️ Potential issue | 🟠 Major

Fix the sliding-window mask to use the absolute query position.

copy_start_idx is derived from s only, so when D > S the cached prefix is never masked and the window is effectively ignored for decode/prefill shapes. Compute the left and right bounds from D - S + s, matching the standard causal path.

🔧 Proposed fix
-        const int64_t copy_start_idx = std::max<int64_t>(0, s - window_size + 1);
+        const int64_t query_pos = D - S + s;
+        const int64_t copy_start_idx = std::max<int64_t>(0, query_pos - window_size + 1);
         for (int64_t d = 0; d < copy_start_idx; ++d) {
           row_ptr[row_offset + d] = mask_val;
         }
-        for (int64_t d = s + 1; d < D; ++d) {
+        for (int64_t d = query_pos + 1; d < D; ++d) {
           row_ptr[row_offset + d] = mask_val;
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm/backends/ascend/ops/AscendCausalMaskKernel.cpp` around lines 53 - 59,
The mask currently computes copy_start_idx from s so cached prefix columns (when
D > S) are never masked; fix by computing the absolute query position query_pos
= D - S + s (or equivalent) and use that to derive left and right bounds:
copy_start_idx = std::max<int64_t>(0, query_pos - window_size + 1) and set
masked columns for d < copy_start_idx and d > query_pos (replace uses of s in
the right-hand loop). Update the loops that write into row_ptr[row_offset + d]
to use query_pos instead of s so the sliding-window mask correctly applies for
decode/prefill shapes (symbols to edit: copy_start_idx, s, D, S, window_size,
row_ptr, row_offset).

Comment thread README.md
Comment on lines +76 to +79
| Model(v2) | CPU | Hexagon NPU <br> INT8 | Ascend NPU |
|-----------------------------------------------------------------------------|------|-----------------------|------------|
| [Qwen3-0.6B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-0.6B-w4a32kai) | | ✔️ W8A8 |
| [Qwen3-1.7B](https://github.com/QwenLM/Qwen3) | [✔️ w4a8](https://www.modelscope.cn/models/mllmTeam/Qwen3-1.7B-w4a8-i8mm-kai) | [W4A16-SM8650](https://modelscope.cn/models/mllmTeam/Qwen3-1.7B-Qnn-AOT-SM8650/) | |

⚠️ Potential issue | 🟡 Minor

Header precision label is inconsistent with row content.

Line 76 says Hexagon NPU INT8, but Line 79 lists W4A16-SM8650 in that same column. Please either relax the column header (e.g., Hexagon NPU) or keep entries strictly INT8 to avoid misleading compatibility info.

Proposed doc fix
-| Model(v2)                                                                   | CPU  | Hexagon NPU <br> INT8 | Ascend NPU |
+| Model(v2)                                                                   | CPU  | Hexagon NPU | Ascend NPU |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 76 - 79, The table header "Hexagon NPU INT8" is
inconsistent with the cell value "W4A16-SM8650"; update the header to a relaxed
label like "Hexagon NPU" (or ensure all column entries are strictly INT8) so the
column accurately reflects the listed formats; locate and change the header text
"Hexagon NPU INT8" in the table and keep the row entries such as "W4A16-SM8650"
and "✔️ w4a8" unchanged.

@chenghuaWang chenghuaWang merged commit 4942dd1 into UbiquitousLearning:main Apr 28, 2026
5 checks passed