
Refactor Qwen3.5 MoE quantization to use _QuantFunctionalMixin #1170

Open
cjluo-nv wants to merge 3 commits into main from chenjiel/refactor_qwen35

Conversation

@cjluo-nv
Collaborator

@cjluo-nv cjluo-nv commented Apr 2, 2026

Summary

  • Refactors _QuantQwen35MoeExperts from a QuantModule subclass with a custom forward to _QuantFunctionalMixin, leaving the original HF forward unmodified (a single fused F.linear plus chunk, instead of two separate matmuls per expert)
  • Adds per-expert quantizer ModuleLists with expert index recovery via storage offset, preserving per-expert calibration granularity
  • Adds _export_qwen35_experts in moe_utils.py to split fused 3D params into per-expert named tensors at export time, reusing _export_quantized_weight for all quantization formats
  • Moves Qwen3_5MoeSparseMoeBlock to the fused gate_up_proj/down_proj expert linear names group in layer_utils.py
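
The interception idea behind the first bullet can be sketched as follows. This is an illustrative outline, not the actual ModelOpt implementation; the class and method names here are hypothetical:

```python
import torch
import torch.nn.functional as F

class LinearInterceptor:
    """Sketch of the _QuantFunctionalMixin idea: patch F.linear for the
    duration of the wrapped module's forward, so the unmodified HF forward
    runs while every linear call can be routed through quantizers."""

    def forward_with_interception(self, module, *args, **kwargs):
        original_linear = F.linear

        def intercepted_linear(input, weight, bias=None):
            # The real plugin would apply per-expert input/weight
            # quantizers here before delegating to the original F.linear.
            return original_linear(input, weight, bias)

        F.linear = intercepted_linear
        try:
            # Runs the original, unmodified forward of `module`.
            return module(*args, **kwargs)
        finally:
            # Always restore, even if the forward raises.
            F.linear = original_linear
```

Because nn.Linear (and the HF expert forward) looks up `F.linear` on the `torch.nn.functional` module at call time, patching the module attribute is enough to intercept the call without touching the forward itself.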

Test plan

  • Run MoE quantization unit tests: python -m pytest tests/unit/torch/quantization/plugins/test_sparse_moe.py -x
  • Run export tests: python -m pytest tests/gpu/torch/export/ -x
  • Verify exported checkpoint naming matches experts.{E}.gate_proj.weight convention
  • Verify no regression on Qwen3 MoE (non-3.5) quantization

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Corrected Qwen3.5 MoE block expert detection logic.
  • New Features

    • Added quantized export support for Qwen3.5 Mixture of Experts models with per-expert quantization buffers.
  • Improvements

    • Optimized MoE expert quantization using functional interception for improved efficiency.

cjluo-nv added 2 commits April 2, 2026 18:06
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv requested review from a team as code owners April 2, 2026 18:18
@cjluo-nv cjluo-nv requested review from meenchen and sychen52 April 2, 2026 18:18
@github-actions
Contributor

github-actions bot commented Apr 2, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1170/

Built to branch gh-pages at 2026-04-02 18:28 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@coderabbitai
Contributor

coderabbitai bot commented Apr 2, 2026

📝 Walkthrough

Walkthrough

This change implements Qwen3.5 MoE expert quantization and export by updating expert classification, adding specialized expert splitting and quantization logic, integrating new export paths in the unified export pipeline, and replacing the decomposition-based quantization approach with a functional wrapper that intercepts linear operations.

Changes

  • Expert Linear Name Detection (modelopt/torch/export/layer_utils.py): Moved Qwen3_5MoeSparseMoeBlock from the Qwen-style unfused mapping to the fused gate_up_proj/down_proj mapping, aligning its classification with GptOssMoE-type modules.
  • Expert Quantization & Splitting (modelopt/torch/export/moe_utils.py): Added an _export_qwen35_experts function to decompose fused Qwen3.5 MoE weights into per-expert submodules, export quantized weights with per-expert scales, apply fallback quantization logic for uncalibrated weights, and clean up the original fused parameters.
  • Unified Export Integration (modelopt/torch/export/unified_export_hf.py): Integrated Qwen3.5 MoE expert export into _process_quantized_modules and _export_transformers_checkpoint to call the new export function and skip redundant per-expert processing.
  • Functional Quantization Wrapper (modelopt/torch/quantization/plugins/huggingface.py): Replaced per-expert decomposition with a _QuantQwen35MoeExperts functional wrapper that intercepts torch.nn.functional.linear calls, extracts expert indices from fused weights, and applies per-expert quantization without materializing intermediate submodules.

Sequence Diagram

sequenceDiagram
    participant Exporter as Unified Exporter
    participant MoeModule as Qwen3.5 MoE Module
    participant SplitLogic as Expert Splitting Logic
    participant QuantLogic as Quantization Logic
    participant Storage as Module Storage

    Exporter->>MoeModule: Identify QuantQwen3_5MoeExperts
    Exporter->>SplitLogic: Call _export_qwen35_experts()
    
    SplitLogic->>MoeModule: Access fused gate_up_proj & down_proj
    SplitLogic->>SplitLogic: Decompose fused weights per expert
    
    loop For each expert slice
        SplitLogic->>QuantLogic: Export quantized weight & scales
        QuantLogic->>QuantLogic: Apply per-channel amax fallback
        QuantLogic->>QuantLogic: Compute amax if uncalibrated
    end
    
    SplitLogic->>Storage: Register per-expert submodules
    SplitLogic->>Storage: Remove fused parameters
    SplitLogic->>Exporter: Return with per-expert structure

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Description Check: Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: Passed. The title accurately summarizes the main change: refactoring the Qwen3.5 MoE quantization implementation to use _QuantFunctionalMixin instead of custom forward logic.
  • Docstring Coverage: Passed. No functions found in the changed files to evaluate docstring coverage, so the check was skipped.
  • Security Anti-Patterns: Passed. The pull request complies with all security coding practices outlined in SECURITY.md; no unsafe patterns detected.



Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
modelopt/torch/export/moe_utils.py (1)

105-117: Amax slicing logic is correct but inconsistent with line 130.

The proportional slicing for per-channel amax is mathematically correct. However, line 117 sets w_quantizer._amax (the internal attribute), while line 130 sets w_quantizer.amax (the property). Consider using the property setter consistently for proper validation:

-               w_quantizer._amax = amax[slice_start:slice_end].contiguous()
+               w_quantizer.amax = amax[slice_start:slice_end].contiguous()

This ensures any property-level validation in TensorQuantizer.amax.setter is applied uniformly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/moe_utils.py` around lines 105 - 117, The per-channel
amax slice currently assigns directly to the internal attribute
w_quantizer._amax (in the block that checks hasattr(w_quantizer, "_amax")),
which bypasses any validation in the TensorQuantizer.amax property; change this
to assign via the property (e.g., set w_quantizer.amax =
sliced_amax.contiguous()) instead of writing to _amax so the
TensorQuantizer.amax.setter runs consistently with the later code that uses
w_quantizer.amax.
modelopt/torch/quantization/plugins/huggingface.py (1)

805-828: Consider thread-safety implications of the toggle mechanism.

The toggle state (_down_proj_linear, _current_expert_idx) is instance-level mutable state accessed during F.linear interception. If the same module instance is used concurrently (e.g., in data-parallel training without proper synchronization), the toggle could become inconsistent across threads.

This is likely fine for typical inference/calibration workloads (single-threaded forward), but worth noting for future maintainers if concurrent usage becomes a requirement.
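
A thread-local variant of the toggle, as the comment suggests, could look roughly like this. Attribute names mirror the review's description and are illustrative only; the actual plugin keeps this state directly on the module instance:

```python
import threading

class ExpertCallState:
    """Sketch of a thread-local replacement for the instance-level toggle.
    Each thread sees its own copy of the state, so concurrent forwards
    cannot clobber each other's gate_up/down alternation."""

    def __init__(self):
        self._local = threading.local()

    @property
    def down_proj_linear(self):
        # Each thread gets its own toggle, defaulting to False.
        return getattr(self._local, "down_proj_linear", False)

    @down_proj_linear.setter
    def down_proj_linear(self, value):
        self._local.down_proj_linear = value

state = ExpertCallState()
state.down_proj_linear = True  # set in the main thread only

seen = {}
worker = threading.Thread(target=lambda: seen.update(toggle=state.down_proj_linear))
worker.start()
worker.join()
assert seen["toggle"] is False         # worker thread sees its own default
assert state.down_proj_linear is True  # main thread's value is untouched
```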

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 805 - 828,
The toggle state used in functionals_to_replace via the nested _quantized_linear
(specifically instance fields _down_proj_linear and _current_expert_idx) is
mutable and not thread-safe; replace the instance-level toggle with a
thread-local or per-call state to avoid race conditions when F.linear is
intercepted concurrently. Concretely, change _quantized_linear to use a
threading.local() or local context object (created outside or on the stack)
keyed per-thread/call to store the down-proj boolean and current expert index
(instead of _down_proj_linear/_current_expert_idx), or protect access with a
lightweight Lock around reads/writes; update uses of
_get_expert_idx_from_gate_up, gate_up_proj_input_quantizers,
gate_up_proj_weight_quantizers, down_proj_input_quantizers and
down_proj_weight_quantizers to read/write the thread-local or locked state so
concurrent forwards don’t clobber each other.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1a839b2a-9e97-4d74-a751-5dd420978867

📥 Commits

Reviewing files that changed from the base of the PR and between 665cc63 and 59d10d9.

📒 Files selected for processing (4)
  • modelopt/torch/export/layer_utils.py
  • modelopt/torch/export/moe_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/quantization/plugins/huggingface.py

@codecov

codecov bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 11.70213% with 83 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.74%. Comparing base (00c002f) to head (59d10d9).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/export/moe_utils.py 3.70% 52 Missing ⚠️
modelopt/torch/quantization/plugins/huggingface.py 18.18% 27 Missing ⚠️
modelopt/torch/export/unified_export_hf.py 33.33% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1170      +/-   ##
==========================================
+ Coverage   74.27%   75.74%   +1.47%     
==========================================
  Files         349      349              
  Lines       39846    39886      +40     
==========================================
+ Hits        29594    30212     +618     
+ Misses      10252     9674     -578     
Flag Coverage Δ
examples 43.87% <11.70%> (+4.81%) ⬆️
gpu 57.03% <9.57%> (-0.23%) ⬇️
unit 54.48% <8.51%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

Contributor

@Edwardf0t1 Edwardf0t1 left a comment


Review

Key Concerns

1. Fragile toggle-based state machine (huggingface.py)

The _quantized_linear closure uses a boolean toggle (_down_proj_linear) to distinguish gate_up vs down_proj calls:

self._down_proj_linear = not self._down_proj_linear

This assumes HF's forward calls F.linear exactly twice per expert in strict alternation. The comment acknowledges this, but:

  • If HF changes the forward to add a third linear call (e.g., a shared expert gate), this silently misassigns quantizers.
  • If an exception occurs mid-forward inside super().forward(), the toggle is reset at the next forward() call (good), but during gradient checkpointing or re-entrant autograd the toggle could get out of sync.

Consider validating with the weight shape or storage offset for both calls instead of only the first, so mismatches raise early rather than silently corrupting quantization.
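
A minimal sketch of that shape-based check, with hypothetical dimension values (the real module would read them from its config):

```python
import torch

def classify_linear_call(weight, hidden_size, intermediate_size):
    """Sketch of the shape-based dispatch the review proposes in place of
    the blind boolean toggle. Assumes hidden_size != 2 * intermediate_size,
    so the two weight shapes are unambiguous."""
    if weight.shape == (2 * intermediate_size, hidden_size):
        return "gate_up"  # fused gate/up projection
    if weight.shape == (hidden_size, intermediate_size):
        return "down"     # down projection
    # An unexpected third linear call fails loudly instead of silently
    # misassigning quantizers.
    raise ValueError(f"unexpected linear weight shape {tuple(weight.shape)}")

hidden, intermediate = 8, 16
assert classify_linear_call(torch.randn(2 * intermediate, hidden), hidden, intermediate) == "gate_up"
assert classify_linear_call(torch.randn(hidden, intermediate), hidden, intermediate) == "down"
```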

2. Expert index recovery via storage offset (_get_expert_idx_from_gate_up)

return (weight.storage_offset() - base_offset) // stride

This is clever but brittle — it breaks if:

  • The weight is .contiguous()-copied (the docstring acknowledges this)
  • FSDP2, tensor parallel, or other distributed wrappers reshard/redistribute the parameter
  • torch.compile materializes a new tensor

There's no runtime assertion that the computed index is in [0, num_experts). Adding a bounds check would catch silent corruption:

idx = (weight.storage_offset() - base_offset) // stride
assert 0 <= idx < self.num_experts, f"Invalid expert idx {idx}"
return idx
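
The recovery invariant and its failure mode can be reproduced in isolation. Shapes below are hypothetical, chosen only to illustrate why views work and copies do not:

```python
import torch

# Slices of a contiguous 3D parameter are views sharing one storage, so
# the expert index can be recovered from each view's storage offset.
num_experts, out_dim, in_dim = 4, 6, 5
gate_up_proj = torch.randn(num_experts, out_dim, in_dim)

base_offset = gate_up_proj.storage_offset()
stride = gate_up_proj.stride(0)  # elements per expert slice

for e in range(num_experts):
    w = gate_up_proj[e]  # a view, not a copy
    idx = (w.storage_offset() - base_offset) // stride
    assert 0 <= idx < num_experts, f"Invalid expert idx {idx}"
    assert idx == e

# The trick breaks on copies: clone() allocates fresh storage with a
# zero offset, so the expert index is no longer recoverable.
w_copy = gate_up_proj[2].clone()
assert w_copy.storage_offset() == 0
```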

3. break in export loop (unified_export_hf.py:726)

elif "QuantQwen3_5MoeExperts" in type(sub_module.experts).__name__:
    break  # exits the inner `for linear_name` loop; type check prevents re-entry

Using break to skip the inner loop is non-obvious, and while the comment says "type check prevents re-entry," the prevention is really just that this branch is hit on the first iteration. If the iteration order of linear_name changes or a new name is added before the Qwen3.5 one, this could silently skip processing. A continue at the outer loop, or restructuring to avoid the break, would be clearer.

4. Export: copy.deepcopy on quantizer (moe_utils.py:106)

w_quantizer = copy.deepcopy(w_quantizer_src) if is_gate_up else w_quantizer_src

deepcopy on a TensorQuantizer can be expensive and may not correctly copy all internal state (e.g., registered hooks, CUDA state). Since only _amax needs to be sliced independently, consider cloning just the amax tensor instead of the entire quantizer.

5. Amax slicing math (moe_utils.py:109-118)

The proportional slicing logic:

slice_start = fused_start * amax_dim0 // fused_total
slice_end = (fused_start + weight_slice.shape[0]) * amax_dim0 // fused_total

This integer division assumes fused_total is always divisible by amax_dim0 (validated above), but the slice indices depend on fused_start also being aligned. For gate_proj (fused_start=0) this is fine, but for up_proj (fused_start=expert_dim), if expert_dim * amax_dim0 % fused_total != 0, the slicing would be wrong without error. Consider adding a check that slice_start * fused_total == fused_start * amax_dim0.
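
The proposed alignment guard can be exercised with hypothetical numbers; the real code operates on quantizer amax tensors, but the index arithmetic is the same:

```python
def slice_amax_bounds(fused_start, slice_rows, amax_dim0, fused_total):
    """Sketch of the proportional amax slicing from the review, plus the
    alignment guard it proposes. Variable names mirror the review text."""
    slice_start = fused_start * amax_dim0 // fused_total
    slice_end = (fused_start + slice_rows) * amax_dim0 // fused_total
    # Proposed guard: the integer division must be exact, otherwise the
    # sliced amax rows no longer line up with the weight slice.
    if slice_start * fused_total != fused_start * amax_dim0:
        raise ValueError("amax slice start is misaligned with the weight slice")
    return slice_start, slice_end

# Aligned case (amax_dim0 == fused_total, i.e. one amax row per channel):
assert slice_amax_bounds(0, 64, 128, 128) == (0, 64)     # gate_proj rows
assert slice_amax_bounds(64, 64, 128, 128) == (64, 128)  # up_proj rows

# Misaligned case the review warns about: fused_start * amax_dim0 is not
# a multiple of fused_total, so the guard fires instead of silently
# producing a wrong slice.
try:
    slice_amax_bounds(5, 3, 3, 8)
except ValueError:
    pass
```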

6. Minor: intermediate_size vs intermediate_dim (moe_utils.py:48-53)

The dual-attribute check is good for cross-version compatibility, but the quantization plugin (huggingface.py) doesn't seem to have the same fallback — it references self.intermediate_dim directly. These should be consistent.


Positive Aspects

  • The core idea is sound — _QuantFunctionalMixin avoids rewriting the HF forward, which reduces maintenance burden when upstream HF code changes.
  • Per-expert ModuleList quantizers preserve calibration granularity.
  • Moving Qwen3_5MoeSparseMoeBlock to the fused expert names group in layer_utils.py is a clean fix.
  • The export function is well-structured with clear separation of concerns.

Summary

The main risk is the toggle-based state machine for distinguishing linear calls — it's an implicit contract with HF's forward that has no runtime validation. Adding defensive assertions (expert index bounds, linear call count per expert) would significantly improve robustness. The storage-offset trick is clever but should also have a bounds check. The rest of the changes are clean.

Contributor

@meenchen meenchen left a comment


1. Dependencies — Open Questions

[QUESTION] vLLM export
huggingface.py:1491 (_QuantStep3p5MoeLinear docstring) explicitly notes vLLM requires stacked 3D scaling factors and that the add_module() per-expert approach (used here in _export_qwen35_experts) is not accepted by vLLM. Is vLLM export for Qwen3.5 intentionally out of scope, or does this need a separate path?

[QUESTION] FSDP2 compatibility
Has this been tested under FSDP2? Sharded parameters may have different storage layouts, breaking _get_expert_idx_from_gate_up during calibration (before export, where fsdp2_aware_weight_update protects things).

2. Design — Robustness

[SUGGESTION] Storage-offset expert index recovery — add bounds check
_get_expert_idx_from_gate_up relies on gate_up_proj[idx] always returning a contiguous-storage view. If the invariant breaks (distributed wrappers, .contiguous() copy), the index silently goes wrong. Add a bounds assertion:

idx = (weight.storage_offset() - base_offset) // stride
assert 0 <= idx < self.num_experts, f"Recovered expert idx {idx} out of range"

[SUGGESTION] Toggle state machine tightly coupled to HF's forward
_down_proj_linear assumes F.linear is called exactly twice per expert in strict gate_up→down alternation. If a future HF release adds a third linear call, the toggle silently misaligns. Consider verifying weight shape (gate_up vs down dimensions) instead of a blind toggle as a defensive measure.

3. Issues — Code Quality

[SUGGESTION] break in _export_transformers_checkpoint — restructure control flow
The new branch at unified_export_hf.py:723 uses break inside the for linear_name loop but never uses linear_name. Cleaner to check the type before the loop:

if "QuantQwen3_5MoeExperts" in type(sub_module.experts).__name__:
    continue  # amax + export handled by _export_qwen35_experts
for linear_name in expert_linear_names:
    ...

[SUGGESTION] copy.deepcopy on TensorQuantizer may carry unwanted state
In _export_qwen35_experts, deepcopy(w_quantizer_src) clones the entire quantizer including calibrators and hooks — only _amax slicing is needed. A lighter-weight approach (new quantizer + copy only the sliced _amax) would be more explicit and safer.

[QUESTION] Removal of __len__/__iter__/__getitem__
The old implementation made the experts module iterable. The isinstance(sub_module.experts, collections.abc.Iterable) fallback in _export_transformers_checkpoint:733 would no longer match — the new type-check short-circuits first, but is there any other downstream code that iterates over sub_module.experts?
