Feature: Optional MoE Routing Disable during Quantization #2235
Conversation
…when forward_hook_last is True
…n forward_to_all_experts
…ve part of experts, remove optimization for shared experts as it could be not optimal when StopForward raised
…ed. update _masked_hook_wrapper to check hooks_paused
…, except of: dbrx_converted ernie4_5_moe gpt_oss phi3
@avtc Thanks for working on this. I will work with you to get this merged for the v5.5 feature target release.
Some quick notes. I have only casually scanned the code, not in depth yet.
Location of the moe flag in model defs:
"mlp:moe": {
"": ("gate_proj:0", "up_proj:0", "down_proj:1"),
"experts": {
"#": ("gate_proj:0", "up_proj:0", "down_proj:1"),
},
"shared_experts": ("gate_proj:0", "up_proj:0", "down_proj:1"),
},

Since the mlp layer contains more than just the moe, we should flag the actual block that is holding the moe modules. Maybe have two moe flags, moe and moe-share, to make them distinctive? I see string-name parsing for finding shared-moe, which can be eliminated.
"mlp": {
"": ("gate_proj:0", "up_proj:0", "down_proj:1"),
"experts:moe": {
"#": ("gate_proj:0", "up_proj:0", "down_proj:1"),
},
"shared_experts:moe-share": ("gate_proj:0", "up_proj:0", "down_proj:1"),
},
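
If the flags move onto the expert blocks like this, here is a minimal sketch of how the ":moe" / ":moe-share" suffixes could be split off the module-tree keys; the helper name is hypothetical and not the existing GPTQModel loader:

# Hypothetical helper, not the existing loader: split an optional
# ":moe" / ":moe-share" flag off a module-tree key, so shared experts no
# longer need string-name matching.
from typing import Optional, Tuple

MOE_FLAGS = {"moe", "moe-share"}

def split_moe_flag(key: str) -> Tuple[str, Optional[str]]:
    """Return (module_name, moe_flag) for keys like 'experts:moe'."""
    name, _, flag = key.partition(":")
    return (name, flag) if flag in MOE_FLAGS else (key, None)

assert split_moe_flag("experts:moe") == ("experts", "moe")
assert split_moe_flag("shared_experts:moe-share") == ("shared_experts", "moe-share")
assert split_moe_flag("gate_proj:0") == ("gate_proj:0", None)  # quant-order suffixes pass through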
gptqmodel/quantization/config.py (outdated):

pass_whole_dataset_to_each_expert: bool = field(
    default=False,
    metadata={"help": "Forward entire calibration dataset to all MoE experts (not just routed experts)"}
)
Rename pass_whole_dataset_to_each_expert to moe_routing_disable. The moe_ prefix allows future grouping of moe_xxx options. The new name describes exactly what the code does rather than what it achieves as a goal: the goal is to pass the whole dataset to each expert, but the action in code is actually to disable MoE routing.
How about moe_force_all_experts or moe_bypass_router, as "disable" sounds ambiguous?
gptqmodel/quantization/config.py (outdated):

wait_for_layer_completion: bool = field(
    default=False,
    metadata={"help": "Wait for all layer finalization tasks (packing, writing) to complete before proceeding to next layer"}
)
Rename wait_for_layer_completion to submodule_finalize_wait, for the same reasons as the pass_whole_dataset_to_each_expert rename.
I am using this flag not only on layer completion, but also after subset forward and after subset quantize, to defrag the VRAM before new VRAM-stressing logic.
Are you OK with stage_finalize_wait or memory_cleanup_on_stage_end? Or suggest your variant.
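
Whatever the final name, here is a minimal sketch of what the flag would gate at a stage boundary, assuming finalization tasks are tracked as futures (the function and parameter names are hypothetical):

import gc
from concurrent.futures import Future
from typing import Iterable

import torch

def end_of_stage(pending: Iterable[Future], wait_and_cleanup: bool) -> None:
    # Hypothetical sketch: if the flag is set, block on outstanding finalization
    # tasks (packing, writing) and release cached VRAM before the next
    # VRAM-heavy stage; otherwise keep the current async behavior.
    if not wait_and_cleanup:
        return
    for fut in pending:
        fut.result()          # propagate any finalization errors
    gc.collect()
    torch.cuda.empty_cache()  # release cached allocations back to the driver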
Please check if names are OK:
# Control whether to wait for layer finalization (packing, writing) before proceeding to next layer
# Default False preserves current behavior (async finalization in background while next layer starts)
vram_opt_memory_cleanup_on_stage_end: bool = field(
default=False,
metadata={"help": "Also wait for all layer finalization tasks (packing, writing) to complete before proceeding to next layer"}
)
# Control whether to exclude device 0 from forward pass and quantization
vram_opt_exclude_device_0_from_compute: bool = field(
default=False,
metadata={"help": "Exclude device 0 from forward pass and quantization to reserve memory for model weights, input/output tokens"}
)
# MoE quantization: forward whole calibration dataset to each expert instead of only routed data
# This ensures all experts receive sufficient calibration samples but increases quantization time
moe_bypass_router: bool = field(
default=False,
metadata={"help": "Forward entire calibration dataset to all MoE experts (not just routed experts)"}
)
# Works faster than data parallel with some configurations
force_subset_forward_serial: bool = field(
default=False,
metadata={"help": "Force serial forward pass for subsets instead of data parallel"}
)
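
For context, a rough sketch of what moe_bypass_router is meant to do during calibration (hypothetical function, not the PR's actual implementation): every expert sees the full calibration batch so its quantization hooks collect statistics over all samples, while the routed output still feeds the next layer.

import torch
import torch.nn as nn

def calibrate_moe_block(hidden: torch.Tensor, experts: nn.ModuleList,
                        routed_forward, moe_bypass_router: bool) -> torch.Tensor:
    # Sketch only: with moe_bypass_router=True, feed the full calibration batch
    # to every expert so their hooks see all samples; the output passed on to
    # the next layer is still the normal routed result.
    if moe_bypass_router:
        with torch.no_grad():
            for expert in experts:
                expert(hidden)
    return routed_forward(hidden)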
@Qubitium Regarding the moe-lifecycle flow: I have found an issue where it does not work properly with module replicas; working on it.
@Qubitium Regarding overriding forward on "experts": we cannot do that, as "experts" is a module list in some models (for example Minimax-M2), and forward is not called on the list but on the "mlp" module, which contains the routing logic.
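
To illustrate the point (a minimal standalone example, not code from the PR): nn.ModuleList is only a container with no forward of its own, so a hook or forward override placed on it never fires; it has to live on the parent MoE/MLP block that implements the routing.

import torch
import torch.nn as nn

experts = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])

try:
    experts(torch.randn(2, 8))  # a ModuleList has no forward(), so this raises
except NotImplementedError:
    print("nn.ModuleList is not callable in a forward pass; hook the parent MoE/MLP block instead")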
…g vram on stage end, to force forward serial, clear intermediate tensors, and fix for AttributeError when PyTorch expects prev_idx attribute
@Qubitium I have fixed all the bugs found so far and am now testing a new feature: batched expert processing (which becomes possible when moe_bypass_router is true). Would you like to see it in another PR (after this PR is merged), or should I add it to this one?

P.S. I have played with vram_strategies, adding new ones to investigate how to bypass the exclusive strategy's hard VRAM requirement of keeping a Hessian accumulator on each GPU for each weight (for GLM-4.5-Air it is ~17 GB per layer). I tested keeping a single accumulator per expert weight on only the GPU with the lowest VRAM usage while forwarding inputs in data parallel, but it was slow. Then I realized that if we have fewer experts per subset-batch, we can fit the Hessian accumulators for all experts in the batch. For example, half of 128 experts per batch means 4096*4096 * 2 (gate/up) * 4 (32 bits) * 64 ≈ 8.5 GB, which fits, and we are able to use data parallel or even do the forward pass on a single GPU.

P.P.S. I have checked quantization with
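
For reference, the arithmetic above as a small helper (names are illustrative, not from the codebase):

def hessian_accumulator_bytes(hidden_dim: int, fused_projs: int, experts_in_batch: int,
                              bytes_per_elem: int = 4) -> int:
    """One (hidden_dim x hidden_dim) FP32 accumulator per fused projection per expert in the batch."""
    return hidden_dim * hidden_dim * fused_projs * bytes_per_elem * experts_in_batch

# Half of 128 experts, gate/up fused pair, FP32: ~8.6e9 bytes, matching the ~8.5 GB estimate above.
print(hessian_accumulator_bytes(4096, 2, 64) / 1e9)  # -> 8.59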
@avtc Sorry for the delay. I will work through the PR in the next few days. 5.6.0 has been released, so this PR will go into the next 5.8.0 (even numbers are non-dev PyPI releases).
@Qubitium please review.
It looks like the LLM in Antigravity had issues understanding what needed to be changed from the git diff and tried to write the feature from scratch, so I had to babysit it and kept running out of limits every now and then; also, not many tests were added.
The model definitions with gate/up/down and w3/w1/w2 expert weights were updated, but several are out of scheme; I have left them untouched:

"#": ("gate_proj:0", "upe_proj:0", "down_proj:1"),

(upe_proj is maybe a typo, I don't know), and it has a custom forward.

I have made a test run on Qwen3-30B-A3B; the logs show all samples are passed to each expert weight, and the LLM answers normally after quantization.
Also added a flag wait_for_layer_completion to minimize VRAM usage.