feat(qwen3): Add configuration files and enhance Qwen3 model with layer indexing and quantization improvements. #611
Conversation
…er indexing and quantization improvements. Introduce new JSON configurations for 1.7B and 4B models, and update model architecture to support layer-specific operations and weight management.
…e graph structure, including quantization specifications and layer operations. This implementation enhances model performance and supports advanced quantization techniques.
…en3Text class to streamline model input handling.
📝 Walkthrough
The PR introduces Qwen3 4B model configurations for QNN AOT compilation and enhances the quantization pipeline with per-layer QDQ conditional logic, embedding weight synchronization for tied embeddings, and scale/zero-point recomputation with concat observer validation utilities.
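The tied-embeddings synchronization mentioned above is the one non-obvious step: `lm_head` must carry the same matrix as `embed_tokens` before calibration so both see identical quantization statistics. A minimal sketch of the idea, using a hypothetical `TinyLM` module (the real methods live on the PR's quantizer/model and may differ):

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Hypothetical stand-in for a tied-embeddings LM; not the repo's class."""
    def __init__(self, vocab: int = 128, dim: int = 16):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab, dim)
        self.lm_head = nn.Linear(dim, vocab, bias=False)

    def copy_lm_head_weight_from_embed_tokens(self) -> None:
        # With tie_word_embeddings, lm_head reuses the embedding matrix.
        with torch.no_grad():
            self.lm_head.weight.copy_(self.embed_tokens.weight)

    def freeze_embed_tokens_weight(self) -> None:
        # Freeze so calibration cannot drift the two copies apart.
        self.embed_tokens.weight.requires_grad_(False)
```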
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Trainer
    participant Model as Qwen3Model
    participant Calibrator as Calibration
    participant Quantizer as Qwen3Quantizer
    participant FQ as FakeQuantize
    participant Observer as ConcatObserver
    participant Converter as AOT Converter
    Trainer->>Quantizer: Initialize (tie_word_embeddings)
    Quantizer->>Model: copy_lm_head_weight_from_embed_tokens()
    Quantizer->>Model: freeze_qwen3_embed_tokens_weight()
    Trainer->>Calibrator: Run calibration on Model
    Calibrator->>Model: Forward pass (collect activation ranges)
    Trainer->>Quantizer: Enable fake quantization
    Trainer->>Quantizer: recompute_scale_zp()
    Quantizer->>FQ: Refresh scale/zero_point per layer
    FQ-->>Quantizer: Updated parameters
    Trainer->>Quantizer: validate_concat_observer()
    Quantizer->>Observer: Audit input observer consistency
    Observer-->>Quantizer: Per-observer metrics & mismatches
    Trainer->>Converter: Convert to QNN AOT
    Converter->>Model: Process per-layer quantized tensors
    Converter-->>Trainer: Compiled artifact
```
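Read top to bottom, the diagram amounts to the driver loop sketched below. This is illustrative only: the `quantizer`/`converter` call signatures are assumptions, while `recompute_scale_zp` and `validate_concat_observer` are the utilities this PR adds to runner.py.

```python
import torch

def run_qat_to_aot(model, quantizer, converter, calib_loader):
    # Sync and freeze tied embeddings before any statistics are gathered
    # (method names follow the diagram; exact signatures are assumed).
    quantizer.copy_lm_head_weight_from_embed_tokens(model)
    quantizer.freeze_qwen3_embed_tokens_weight(model)

    # Calibration: forward passes let observers collect activation ranges.
    model.eval()
    with torch.no_grad():
        for batch in calib_loader:
            model(batch)

    # Refresh FakeQuantize buffers from the (possibly concat-updated)
    # observers, then audit that concat inputs agree on scale/zero_point.
    model.apply(recompute_scale_zp)   # from this PR's runner.py
    validate_concat_observer(model)   # from this PR's runner.py

    # Lower the calibrated, fake-quantized graph to a QNN AOT artifact.
    return converter.convert(model)
```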
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (2)
288-289: `layer_idx_` is uninitialized in the constructor.
`layer_idx_` is declared but never initialized within `Qwen3Attention`'s constructor. It's set externally at line 347 in `Qwen3Text`. If the member is accessed before external initialization (e.g., during debugging, or if the initialization order changes), this could lead to undefined behavior. Consider initializing it in the constructor or using a default value.
🔧 Suggested fix
```diff
 public:
   Qwen3Attention() = default;
-  Qwen3Attention(const std::string& name, const Qwen3Config& cfg) : nn::Module(name) {
+  Qwen3Attention(const std::string& name, const Qwen3Config& cfg, int layer_idx = -1) : nn::Module(name), layer_idx_(layer_idx) {
     hidden_size_ = cfg.hidden_size;
```
486-486: Storing a reference to `cfg` may lead to a dangling reference.
`const Qwen3Config& cfg` is stored as a member, but if the original `Qwen3Config` object passed to the constructor is destroyed or goes out of scope, this reference becomes dangling, leading to undefined behavior when accessed in `trace()`. Consider storing by value, or using `std::shared_ptr` if the config is expensive to copy.
🔧 Suggested fix
```diff
- const Qwen3Config& cfg;
+ Qwen3Config cfg;
```
🤖 Fix all issues with AI agents
In `@pymllm/backends/qualcomm/transformers/qwen3/runner.py`:
- Around line 19-82: The debug print in recompute_scale_zp incorrectly
references module.scale; update the log to reference the FakeQuantize buffer by
using module.fake_quant.scale instead (i.e., change the f-string in the loop
over module.fake_quant.named_parameters() to print {module.fake_quant.scale});
ensure the print remains inside the loop that checks for value is
module.fake_quant.scale so it logs the correct tensor from FakeQuantize.
- Around line 100-138: The code assumes per-tensor quantization by calling
.item() on scale and zp from observer.calculate_qparams() (used in the
ConcatObserver logging and mismatch messages); make this defensive: when
collecting and printing scales_zps and when formatting mismatch messages in the
loop, detect if scale.numel() == 1 and use .item(), otherwise convert to a list
(scale.tolist(), zp.tolist()) and format accordingly (e.g., show full list or
summarized stats). Also ensure comparisons still work for multi-element qparams
by keeping torch.allclose(ref_scale, scale, ...) and torch.equal(ref_zp, zp)
(they already support multi-element tensors), and add a short comment near
input_observers / ConcatObserver indicating the code supports both per-tensor
and per-channel qparams.
🧹 Nitpick comments (3)
pymllm/backends/qualcomm/transformers/qwen3/runner.py (1)
52-57: Consider using the logging module instead of print for error handling.
The broad `except Exception` is acceptable here given the variety of observer implementations, but using `logging.warning` or `logging.debug` instead of `print(e)` would provide better control over verbosity and avoid cluttering stdout in production.
Proposed improvement

```diff
+import logging
+
+logger = logging.getLogger(__name__)
+
 # In recompute_scale_zp function:
     try:
         scale, zero_point = observer.calculate_qparams()
     except Exception as e:
         # Some special Observers (e.g., FixedQParams) may not support recomputation or behave differently, safely skip
-        print(e)
+        logger.debug("Skipping observer recomputation: %s", e)
         return
```

examples/qwen3_qnn_aot/modeling_qwen_qnn_aot.hpp (2)
20-30: Redundant initialization of `scale_name` and `zp_name`.
Lines 21-22 initialize `scale_name` and `zp_name`, but the if-else block at lines 24-30 always overwrites these values. Lines 28-29 in the else branch are identical to lines 21-22.
♻️ Suggested simplification

```diff
 Tensor QDQ(nn::Module* m, Tensor in, const std::string& qdq_name_in_pytorch) {
-  std::string scale_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.scale";
-  std::string zp_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.zero_point";
-
-  if (m->getModuleName().empty()) {
-    scale_name = qdq_name_in_pytorch + ".fake_quant.scale";
-    zp_name = qdq_name_in_pytorch + ".fake_quant.zero_point";
-  } else {
-    scale_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.scale";
-    zp_name = m->getModuleName() + "." + qdq_name_in_pytorch + ".fake_quant.zero_point";
-  }
+  const auto& modName = m->getModuleName();
+  std::string prefix = modName.empty() ? "" : modName + ".";
+  std::string scale_name = prefix + qdq_name_in_pytorch + ".fake_quant.scale";
+  std::string zp_name = prefix + qdq_name_in_pytorch + ".fake_quant.zero_point";
```
301-307: `layer_idx` is not propagated to `self_attn_`.
The constructor receives `layer_idx` and stores it in `layer_idx_`, but doesn't pass it to `self_attn_` during registration. Instead, `self_attn_.layer_idx_` is set externally in `Qwen3Text` at line 347. This creates a fragmented initialization pattern where the decoder knows its layer index but relies on an external caller to set the same index on its child attention module. Consider propagating the index during construction for encapsulation, or document this dependency clearly.
♻️ Suggested fix to propagate layer_idx
If the `Qwen3Attention` constructor is updated to accept `layer_idx`:

```diff
 Qwen3Decoder(const std::string& name, const Qwen3Config& cfg, int layer_idx) : nn::Module(name) {
   layer_idx_ = layer_idx;
-  self_attn_ = reg<Qwen3Attention>("self_attn", cfg);
+  self_attn_ = reg<Qwen3Attention>("self_attn", cfg, layer_idx);
   mlp_ = reg<Qwen3MLP>("mlp", cfg);
```

Then remove line 347 in `Qwen3Text`:

```diff
 decode_blocks_ = reg<nn::ModuleListWithIdx<Qwen3Decoder>>("layers", cfg.num_hidden_layers, cfg);
-for (auto [idx, b] : enumerate(decode_blocks_.list())) { b.self_attn_.layer_idx_ = idx; }
 norm_ = reg<nn::RMSNorm>("norm", cfg.rms_norm_eps);
```
```python
def recompute_scale_zp(module):
    """
    Callback function: Used to forcefully refresh scale and zero_point of all FakeQuantize modules after calibration.

    Problem solved:
        When using ConcatObserver, min/max may be updated during forward pass,
        but at the end of forward, the scale/zp stored in FakeQuantize's internal buffer are still computed from old min/max.
        This function forces a calculate_qparams call to sync the latest parameters to the buffer.

    Usage:
        model.apply(recompute_scale_zp)
    """
    # We mainly focus on FakeQuantize modules since they store the scale/zero_point buffers
    # Note: model.apply recursively traverses all submodules, so self.fake_quant inside ActivationQDQ will also be visited
    if isinstance(module, ActivationQDQ):
        observer = module.fake_quant.activation_post_process

        # 2. Check if observer is valid and contains statistics
        # We only care about MinMaxObserver or MovingAverageMinMaxObserver that have min_val/max_val
        if hasattr(observer, "min_val") and hasattr(observer, "max_val"):
            # 3. Check if data is initialized
            # If min_val is still the initial inf, this layer hasn't processed data, skip to avoid errors
            if observer.min_val.numel() == 0 or observer.max_val.numel() == 0:
                return
            if (
                torch.isinf(observer.min_val).any()
                or torch.isinf(observer.max_val).any()
            ):
                return

            # 4. Recompute Scale and Zero Point
            # calculate_qparams reads the current min_val/max_val from observer (may have been modified by ConcatObserver)
            try:
                scale, zero_point = observer.calculate_qparams()
            except Exception as e:
                # Some special Observers (e.g., FixedQParams) may not support recomputation or behave differently, safely skip
                print(e)
                return

            # 5. Force overwrite the computed results to FakeQuantize's Buffer
            # Use copy_ to keep reference unchanged, ensuring the new values are used during export
            if (
                hasattr(module.fake_quant, "scale")
                and module.fake_quant.scale is not None
            ):
                # Ensure dimension match (handle per-channel vs per-tensor)
                if module.fake_quant.scale.shape != scale.shape:
                    module.fake_quant.scale.resize_(scale.shape)
                module.fake_quant.scale.copy_(scale)
                # Try to get the registered name of module scale from _parameters or _buffers
                for key, value in module.fake_quant.named_parameters():
                    if value is module.fake_quant.scale:
                        print(f"{module._get_name()}.{key}: {module.scale}")
                        break

            if (
                hasattr(module.fake_quant, "zero_point")
                and module.fake_quant.zero_point is not None
            ):
                if module.fake_quant.zero_point.shape != zero_point.shape:
                    module.fake_quant.zero_point.resize_(zero_point.shape)
                module.fake_quant.zero_point.copy_(zero_point)
```
Bug: Incorrect attribute reference in debug print statement.
Line 72 prints `module.scale` but, based on the surrounding code, it should be `module.fake_quant.scale`.
Proposed fix

```diff
 for key, value in module.fake_quant.named_parameters():
     if value is module.fake_quant.scale:
-        print(f"{module._get_name()}.{key}: {module.scale}")
+        print(f"{module._get_name()}.{key}: {module.fake_quant.scale}")
         break
```

🧰 Tools
🪛 Ruff (0.14.13)
54-54: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@pymllm/backends/qualcomm/transformers/qwen3/runner.py` around lines 19 - 82,
The debug print in recompute_scale_zp incorrectly references module.scale;
update the log to reference the FakeQuantize buffer by using
module.fake_quant.scale instead (i.e., change the f-string in the loop over
module.fake_quant.named_parameters() to print {module.fake_quant.scale}); ensure
the print remains inside the loop that checks for value is
module.fake_quant.scale so it logs the correct tensor from FakeQuantize.
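The `resize_`/`copy_` pattern in step 5 of the quoted recompute_scale_zp is what keeps exporters working: it mutates the registered buffer in place instead of rebinding the attribute. A standalone illustration (toy module, not repo code):

```python
import torch
import torch.nn as nn

fq = nn.Module()
fq.register_buffer("scale", torch.tensor([1.0]))
captured = fq.scale                  # reference held elsewhere, e.g. by an exporter

fq.scale.copy_(torch.tensor([0.5]))  # in-place: same tensor object, new value
assert captured.item() == 0.5        # the captured reference sees the update

# By contrast, `fq.scale = torch.tensor([0.5])` would rebind the attribute
# and leave `captured` pointing at the stale tensor.
```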
```python
# Collect scale and zero_point from all observers
scales_zps = []
for i, observer in enumerate(input_observers):
    try:
        scale, zp = observer.calculate_qparams()
        scales_zps.append(f"[{i}] s={scale.item():.8f} zp={zp.item()}")
    except Exception:
        scales_zps.append(f"[{i}] failed")

# Print one line: scale and zp of all inputs for each concat observer
print(f"ConcatObserver [{name}]: {' | '.join(scales_zps)}")

# Original validation logic
if len(input_observers) <= 1:
    return

# Get scale and zero_point from the first observer as reference
first_observer = input_observers[0]
try:
    ref_scale, ref_zp = first_observer.calculate_qparams()
except Exception:
    return

# Check if all other observers have the same scale and zero_point
for i, observer in enumerate(input_observers[1:], start=1):
    try:
        scale, zp = observer.calculate_qparams()
    except Exception:
        results.append(f"Failed to calculate qparams for observer[{i}]")
        continue

    scale_match = torch.allclose(ref_scale, scale, rtol=1e-5, atol=1e-8)
    zp_match = torch.equal(ref_zp, zp)

    if not scale_match or not zp_match:
        results.append(
            f"observer[{i}] mismatch: ref_scale={ref_scale.item():.8f}, "
            f"scale={scale.item():.8f}, ref_zp={ref_zp.item()}, zp={zp.item()}"
        )
```
Potential issue: `.item()` calls assume per-tensor quantization.
Lines 105 and 136-137 call `.item()` on scale and zp tensors. This will fail (PyTorch raises a RuntimeError when converting a multi-element tensor to a scalar) if the observer uses per-channel quantization, where scale/zp are multi-element tensors.
Given that the ConcatObserver configuration in modeling_qwen3.py uses `per_tensor_affine`, this may be safe, but it's worth adding defensive handling or a comment clarifying the assumption.
Proposed defensive fix

```diff
 for i, observer in enumerate(input_observers):
     try:
         scale, zp = observer.calculate_qparams()
-        scales_zps.append(f"[{i}] s={scale.item():.8f} zp={zp.item()}")
+        if scale.numel() == 1:
+            scales_zps.append(f"[{i}] s={scale.item():.8f} zp={zp.item()}")
+        else:
+            scales_zps.append(f"[{i}] s={scale.tolist()} zp={zp.tolist()}")
     except Exception:
         scales_zps.append(f"[{i}] failed")
```

🧰 Tools
🪛 Ruff (0.14.13)
106-106: Do not catch blind exception: Exception
(BLE001)
120-120: Do not catch blind exception: Exception
(BLE001)
127-127: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@pymllm/backends/qualcomm/transformers/qwen3/runner.py` around lines 100 -
138, The code assumes per-tensor quantization by calling .item() on scale and zp
from observer.calculate_qparams() (used in the ConcatObserver logging and
mismatch messages); make this defensive: when collecting and printing scales_zps
and when formatting mismatch messages in the loop, detect if scale.numel() == 1
and use .item(), otherwise convert to a list (scale.tolist(), zp.tolist()) and
format accordingly (e.g., show full list or summarized stats). Also ensure
comparisons still work for multi-element qparams by keeping
torch.allclose(ref_scale, scale, ...) and torch.equal(ref_zp, zp) (they already
support multi-element tensors), and add a short comment near input_observers /
ConcatObserver indicating the code supports both per-tensor and per-channel
qparams.
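For context on why this audit matters: a backend that concatenates raw quantized tensors can only attach one scale/zero_point to the output, so branches calibrated with different scales get silently rescaled. A toy demonstration (illustrative only, not repo code):

```python
import torch

def quant(x, scale, zp):
    return torch.round(x / scale) + zp

def dequant(q, scale, zp):
    return (q - zp) * scale

a = torch.tensor([1.0, 2.0])
b = torch.tensor([1.0, 2.0])
qa = quant(a, scale=0.1, zp=0)  # branch A observer chose scale=0.1
qb = quant(b, scale=0.2, zp=0)  # branch B observer chose scale=0.2

merged = torch.cat([qa, qb])            # concat of raw quantized values
out = dequant(merged, scale=0.1, zp=0)  # a single output scale must be picked
print(out)  # tensor([1.0, 2.0, 0.5, 1.0]) -- branch B values are halved
```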