Feature: Optional MoE Routing Disable during Quantization #2235
Conversation
…when forward_hook_last is True
…n forward_to_all_experts
…ve part of experts, remove optimization for shared experts as it could be not optimal when StopForward raised
…ed. update _masked_hook_wrapper to check hooks_paused
…, except of: dbrx_converted ernie4_5_moe gpt_oss phi3
@avtc Thanks for working on this. I will work with you to get this merged for the v5.5 feature target release.
Some quick notes. I have only casually scanned the code, not in depth yet.
Location of the moe flag in model defs:
"mlp:moe": {
"": ("gate_proj:0", "up_proj:0", "down_proj:1"),
"experts": {
"#": ("gate_proj:0", "up_proj:0", "down_proj:1"),
},
"shared_experts": ("gate_proj:0", "up_proj:0", "down_proj:1"),
},

Since the mlp layer contains more than just the moe, we should flag the actual block that is holding the moe modules. Maybe have two moe flags, moe and moe-share, to make them distinctive? I see string-name parsing for finding shared-moe, which can be eliminated.
"mlp": {
"": ("gate_proj:0", "up_proj:0", "down_proj:1"),
"experts:moe": {
"#": ("gate_proj:0", "up_proj:0", "down_proj:1"),
},
"shared_experts:moe-share": ("gate_proj:0", "up_proj:0", "down_proj:1"),
},
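
If the flags move onto the expert blocks like this, here is a minimal sketch of how the ":moe" / ":moe-share" suffixes could be split off the module-tree keys; the helper name is hypothetical and not the existing GPTQModel loader:

# Hypothetical helper, not the existing loader: split an optional
# ":moe" / ":moe-share" flag off a module-tree key, so shared experts no
# longer need string-name matching.
from typing import Optional, Tuple

MOE_FLAGS = {"moe", "moe-share"}

def split_moe_flag(key: str) -> Tuple[str, Optional[str]]:
    """Return (module_name, moe_flag) for keys like 'experts:moe'."""
    name, _, flag = key.partition(":")
    return (name, flag) if flag in MOE_FLAGS else (key, None)

assert split_moe_flag("experts:moe") == ("experts", "moe")
assert split_moe_flag("shared_experts:moe-share") == ("shared_experts", "moe-share")
assert split_moe_flag("gate_proj:0") == ("gate_proj:0", None)  # quant-order suffixes pass through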
gptqmodel/quantization/config.py (outdated):

pass_whole_dataset_to_each_expert: bool = field(
    default=False,
    metadata={"help": "Forward entire calibration dataset to all MoE experts (not just routed experts)"}
)
Rename pass_whole_dataset_to_each_expert to moe_routing_disable. The moe_ prefix allows future grouping of moe_xxx options. The new name describes exactly what the code does rather than what it achieves as a goal: the goal is to pass the whole dataset to each expert, but the action in code is actually to disable MoE routing.
How about moe_force_all_experts or moe_bypass_router, as "disable" sounds ambiguous?
gptqmodel/quantization/config.py (outdated):

wait_for_layer_completion: bool = field(
    default=False,
    metadata={"help": "Wait for all layer finalization tasks (packing, writing) to complete before proceeding to next layer"}
)
Rename wait_for_layer_completion to submodule_finalize_wait, for the same reasons as the pass_whole_dataset_to_each_expert rename.
I am using this flag not only on layer completion, but also after subset forward and after subset quantize, to defrag the VRAM before new VRAM-stressing logic.
Are you OK with stage_finalize_wait or memory_cleanup_on_stage_end? Or suggest your variant.
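
Whatever the final name, here is a minimal sketch of what the flag would gate at a stage boundary, assuming finalization tasks are tracked as futures (the function and parameter names are hypothetical):

import gc
from concurrent.futures import Future
from typing import Iterable

import torch

def end_of_stage(pending: Iterable[Future], wait_and_cleanup: bool) -> None:
    # Hypothetical sketch: if the flag is set, block on outstanding finalization
    # tasks (packing, writing) and release cached VRAM before the next
    # VRAM-heavy stage; otherwise keep the current async behavior.
    if not wait_and_cleanup:
        return
    for fut in pending:
        fut.result()          # propagate any finalization errors
    gc.collect()
    torch.cuda.empty_cache()  # release cached allocations back to the driver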
Please check if names are OK:
# Control whether to wait for layer finalization (packing, writing) before proceeding to next layer
# Default False preserves current behavior (async finalization in background while next layer starts)
vram_opt_memory_cleanup_on_stage_end: bool = field(
default=False,
metadata={"help": "Also wait for all layer finalization tasks (packing, writing) to complete before proceeding to next layer"}
)
# Control whether to exclude device 0 from forward pass and quantization
vram_opt_exclude_device_0_from_compute: bool = field(
default=False,
metadata={"help": "Exclude device 0 from forward pass and quantization to reserve memory for model weights, input/output tokens"}
)
# MoE quantization: forward whole calibration dataset to each expert instead of only routed data
# This ensures all experts receive sufficient calibration samples but increases quantization time
moe_bypass_router: bool = field(
default=False,
metadata={"help": "Forward entire calibration dataset to all MoE experts (not just routed experts)"}
)
# Works faster than data parallel with some configurations
force_subset_forward_serial: bool = field(
default=False,
metadata={"help": "Force serial forward pass for subsets instead of data parallel"}
)
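
For context, a rough sketch of what moe_bypass_router is meant to do during calibration (hypothetical function, not the PR's actual implementation): every expert sees the full calibration batch so its quantization hooks collect statistics over all samples, while the routed output still feeds the next layer.

import torch
import torch.nn as nn

def calibrate_moe_block(hidden: torch.Tensor, experts: nn.ModuleList,
                        routed_forward, moe_bypass_router: bool) -> torch.Tensor:
    # Sketch only: with moe_bypass_router=True, feed the full calibration batch
    # to every expert so their hooks see all samples; the output passed on to
    # the next layer is still the normal routed result.
    if moe_bypass_router:
        with torch.no_grad():
            for expert in experts:
                expert(hidden)
    return routed_forward(hidden)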
@Qubitium Regarding the moe-lifecycle flow: I have found an issue where it does not work properly with module replicas; working on it.
@Qubitium Regarding overriding forward on "experts": we cannot do that, as "experts" is a module list in some models (for example Minimax-M2), and forward is not called on the list but on the "mlp" module, which contains the routing logic.
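
To illustrate the point (a minimal standalone example, not code from the PR): nn.ModuleList is only a container with no forward of its own, so a hook or forward override placed on it never fires; it has to live on the parent MoE/MLP block that implements the routing.

import torch
import torch.nn as nn

experts = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])

try:
    experts(torch.randn(2, 8))  # a ModuleList has no forward(), so this raises
except NotImplementedError:
    print("nn.ModuleList is not callable in a forward pass; hook the parent MoE/MLP block instead")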
…g vram on stage end, to force forward serial, clear intermediate tensors, and fix for AttributeError when PyTorch expects prev_idx attribute
@Qubitium I have fixed all the bugs found so far and am now testing a new feature: batched expert processing (which becomes possible when moe_bypass_router is true). Would you like to see it in another PR (after this PR is merged), or should I add it to this one?

P.S. I have played with vram_strategies, adding new ones to investigate how to bypass the exclusive strategy's hard VRAM requirement of keeping a Hessian accumulator on each GPU for each weight (for GLM-4.5-Air it is ~17 GB per layer). I tested keeping a single accumulator per expert weight on only the GPU with the lowest VRAM usage while forwarding inputs in data parallel, but it was slow. Then I realized that if we have fewer experts per subset-batch, we can fit the Hessian accumulators for all experts in the batch. For example, half of 128 experts per batch means 4096*4096 * 2 (gate/up) * 4 (32 bits) * 64 ≈ 8.5 GB, which fits, and we are able to use data parallel or even do the forward pass on a single GPU.

P.P.S. I have checked quantization with
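
For reference, the arithmetic above as a small helper (names are illustrative, not from the codebase):

def hessian_accumulator_bytes(hidden_dim: int, fused_projs: int, experts_in_batch: int,
                              bytes_per_elem: int = 4) -> int:
    """One (hidden_dim x hidden_dim) FP32 accumulator per fused projection per expert in the batch."""
    return hidden_dim * hidden_dim * fused_projs * bytes_per_elem * experts_in_batch

# Half of 128 experts, gate/up fused pair, FP32: ~8.6e9 bytes, matching the ~8.5 GB estimate above.
print(hessian_accumulator_bytes(4096, 2, 64) / 1e9)  # -> 8.59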
@avtc Sorry for the delay. I will work through the PR in the next few days. 5.6.0 has been released, so this PR will go into the next 5.8.0 (even numbers are non-dev PyPI releases).
@Qubitium please review.
It looks like the LLM in Antigravity had issues understanding what needed to be changed from the git diff and tried to write the feature from scratch, so I had to babysit it and kept running out of limits every now and then; also, not many tests were added.
The model definitions with gate/up/down and w3/w1/w2 expert weights were updated, but several are out of scheme; I have left them untouched:

"#": ("gate_proj:0", "upe_proj:0", "down_proj:1"),

(upe_proj is maybe a typo, I don't know), and it has a custom forward.

I have made a test run on Qwen3-30B-A3B; the logs show all samples are passed to each expert weight, and the LLM answers normally after quantization.
Also added a flag wait_for_layer_completion to minimize VRAM usage.