
Conversation

@avtc avtc (Contributor) commented Dec 4, 2025

@Qubitium please review.
It looks like the LLM in Antigravity had trouble understanding what needed to change from the git diff and kept trying to write the feature from scratch, so I had to babysit it and kept running into usage limits; as a result, not many tests were added.

The model definitions with gate/up/down and w3/w1/w2 expert weights were updated, but several do not follow the common scheme, so I left them untouched:

  • dbrx_converted
  • ernie4_5_moe (it has "#": ("gate_proj:0", "upe_proj:0", "down_proj:1"); upe_proj may be a typo, and it also has a custom forward)
  • gpt_oss
  • phi3

I made a test run on Qwen3-30B-A3B: the logs show that all samples are passed to each expert weight, and the LLM answers normally after quantization.
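For context, a minimal conceptual sketch of the router-bypass idea (illustrative only, not the code in this PR; the module layout and attribute names are assumptions):

    import torch

    def calibrate_moe_block(moe_block, hidden_states):
        # Normal routed forward: this output is what downstream layers consume.
        routed_out = moe_block(hidden_states)

        # Extra forwards whose outputs are discarded: push every token through
        # every expert so per-expert calibration hooks see the whole dataset.
        flat = hidden_states.reshape(-1, hidden_states.shape[-1])
        with torch.no_grad():
            for expert in moe_block.experts:
                _ = expert(flat)

        return routed_out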

I also added a wait_for_layer_completion flag to minimize VRAM usage.

avtc added 30 commits December 1, 2025 12:26
…ve part of experts, remove optimization for shared experts as it could be not optimal when StopForward raised
…ed. update _masked_hook_wrapper to check hooks_paused
…, except of:

dbrx_converted
ernie4_5_moe
gpt_oss
phi3
@Qubitium Qubitium self-assigned this Dec 4, 2025
@Qubitium Qubitium (Collaborator) left a comment

@avtc Thanks for working on this. I will work with you to get this merged for the v5.5 feature target release.

Some quick notes. I have only casually scanned the code so far, not in depth.

Location of the moe flag in model defs

            "mlp:moe": {
                "": ("gate_proj:0", "up_proj:0", "down_proj:1"),
                "experts": {
                    "#": ("gate_proj:0", "up_proj:0", "down_proj:1"),
                },
                "shared_experts": ("gate_proj:0", "up_proj:0", "down_proj:1"),
            },

Since the mlp layer contains more than just the moe, we should flag the actual block that holds the moe modules. Maybe have two moe flags, moe and moe-share, to keep them distinct? I see string-name parsing for finding the shared moe that could then be eliminated.

            "mlp": {
                "": ("gate_proj:0", "up_proj:0", "down_proj:1"),
                "experts:moe": {
                    "#": ("gate_proj:0", "up_proj:0", "down_proj:1"),
                },
                "shared_experts:moe-share": ("gate_proj:0", "up_proj:0", "down_proj:1"),
            },

Comment on lines 267 to 270
pass_whole_dataset_to_each_expert: bool = field(
default=False,
metadata={"help": "Forward entire calibration dataset to all MoE experts (not just routed experts)"}
)
@Qubitium Qubitium (Collaborator) commented Dec 4, 2025

Rename pass_whole_dataset_to_each_expert to moe_routing_disable. The moe_ prefix allows future grouping of moe_xxx options. The new name describes exactly what the code does rather than the goal it achieves: the goal is to pass the whole dataset to each expert, but the action in the code is to disable moe routing.

avtc (Contributor, Author) replied

How about moe_force_all_experts or moe_bypass_router, as disable sounds ambiguous?

Comment on lines 274 to 277
wait_for_layer_completion: bool = field(
default=False,
metadata={"help": "Wait for all layer finalization tasks (packing, writing) to complete before proceeding to next layer"}
)
@Qubitium Qubitium (Collaborator) commented Dec 4, 2025

Rename wait_for_layer_completion to submodule_finalize_wait, for the same reasons as the pass_whole_dataset_to_each_expert rename.

@avtc avtc (Contributor, Author) commented Dec 6, 2025

I am using this flag not only on layer completion, but also after the subset forward and after the subset quantize, to defragment VRAM before the next VRAM-stressing stage.

Are you OK with stage_finalize_wait or memory_cleanup_on_stage_end? Or suggest your own variant.
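For reference, a sketch of the kind of stage-end cleanup this flag would gate (assumptions only, not the exact code in this PR):

    import gc
    import torch

    def stage_end_cleanup():
        # Wait for in-flight kernels and copies, drop Python references to
        # freed tensors, then return cached blocks to the CUDA allocator so the
        # next VRAM-heavy stage starts from a less fragmented pool.
        torch.cuda.synchronize()
        gc.collect()
        torch.cuda.empty_cache()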

avtc (Contributor, Author) replied

Please check if these names are OK:

    # Control whether to wait for layer finalization (packing, writing) before proceeding to next layer
    # Default False preserves current behavior (async finalization in background while next layer starts)
    vram_opt_memory_cleanup_on_stage_end: bool = field(
        default=False,
        metadata={"help": "Also wait for all layer finalization tasks (packing, writing) to complete before proceeding to next layer"}
    )

    # Control whether to exclude device 0 from forward pass and quantization
    vram_opt_exclude_device_0_from_compute: bool = field(
        default=False,
        metadata={"help": "Exclude device 0 from forward pass and quantization to reserve memory for model weights, input/output tokens"}
    )

    # MoE quantization: forward whole calibration dataset to each expert instead of only routed data
    # This ensures all experts receive sufficient calibration samples but increases quantization time
    moe_bypass_router: bool = field(
        default=False,
        metadata={"help": "Forward entire calibration dataset to all MoE experts (not just routed experts)"}
    )

    # Works faster than data parallel with some configurations 
    force_subset_forward_serial: bool = field(
        default=False,
        metadata={"help": "Force serial forward pass for subsets instead of data parallel"}
    )
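For reference, a hypothetical usage sketch with these flags (assuming they land on GPTQModel's QuantizeConfig as proposed in this PR; the other arguments are illustrative):

    from gptqmodel import QuantizeConfig

    cfg = QuantizeConfig(
        bits=4,
        group_size=128,
        moe_bypass_router=True,                     # every expert sees the full calibration set
        vram_opt_memory_cleanup_on_stage_end=True,  # block and free VRAM at stage boundaries
    )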

@avtc avtc (Contributor, Author) commented Dec 4, 2025

@Qubitium regarding "Maybe have two moe flags moe and moe-share to make them distinctive?": good idea, I will check.

@Qubitium Qubitium changed the title from "Feature: forward whole dataset to each expert" to "Feature: Optional MoE Routing Disable during Quantization" Dec 4, 2025
@avtc avtc (Contributor, Author) commented Dec 5, 2025

I have found an issue where the moe-lifecycle flow does not work properly with module replicas; working on it.

@avtc avtc (Contributor, Author) commented Dec 6, 2025

@Qubitium regarding "Maybe have two moe flags moe and moe-share to make them distinctive?": I have started refactoring this, and it appears the :moe flag is better placed on the module that does the routing and contains experts/shared_experts inside; that makes it easier to find the module and override its forward method. I initially thought about adding flags to experts and shared_experts, but after checking all model definitions, they all use the names experts, shared_experts, or shared_expert, so for simplicity of the definitions and logic I left them without flags.

We cannot override forward for "experts", because in some models (for example Minimax-M2) it is a module list: forward is not called on the list but on the "mlp" module that holds the routing logic.
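So the definitions end up looking roughly like this (illustrative, mirroring the snippet quoted above, with :moe on the routing module):

            "mlp:moe": {
                "": ("gate_proj:0", "up_proj:0", "down_proj:1"),
                "experts": {
                    "#": ("gate_proj:0", "up_proj:0", "down_proj:1"),
                },
                "shared_experts": ("gate_proj:0", "up_proj:0", "down_proj:1"),
            },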

avtc added 3 commits December 6, 2025 15:53
…g vram on stage end, to force forward serial, clear intermediate tensors, and fix for AttributeError when PyTorch expects prev_idx attribute
@avtc avtc (Contributor, Author) commented Dec 7, 2025

@Qubitium I have fixed all the bugs found so far and am now testing a new feature: batched expert processing (which becomes possible when moe_bypass_router is true). Would you like to see it in another PR (after this one is merged), or should I add it to this one?

P.S. I have played with vram_strategies, adding new ones to investigate how to bypass the exclusive strategy's hard VRAM requirement of keeping a Hessian accumulator on each GPU for each weight (for GLM-4.5-Air that is ~17 GB per layer). I tested keeping a single accumulator per expert weight on only the GPU with the lowest VRAM usage while forwarding inputs in data parallel, but it was slow. Then I realized that with fewer experts per subset-batch we can fit the Hessian accumulators for all experts in the batch. For example, half of 128 experts in a batch means 4096*4096 * 2 (gate/up) * 4 (32 bits) * 64 = 8.5 GB, which fits, so we can use data parallel or even do the forward pass on a single GPU.
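A quick back-of-envelope check of that figure (plain arithmetic, no library calls):

    hidden = 4096
    accum_bytes = hidden * hidden * 4   # one fp32 Hessian accumulator, ~64 MiB
    per_expert = accum_bytes * 2        # gate_proj + up_proj
    total = per_expert * 64             # 64 experts in the batch
    print(total / 1e9)                  # ~8.6 GB, matching the ~8.5 GB above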

P.P.S. I compared quantization with force_single_forward=True vs False on the default vram_strategy (exclusive), with an experts batch size of 64 (which allows forcing the forward onto a single 3090) and 2320 samples (769K tokens).
The time to quantize a single MoE layer:

  • force_single_forward=True: 15.5 minutes
  • force_single_forward=False: 31.15 minutes (the replicate step takes a very long time on 8 GPUs; will check against 4 GPUs and against deepcopy). I thought it might be possible to replicate at layer start and then modify the replicas when needed, for a speedup?

@Qubitium Qubitium (Collaborator) commented

@avtc Sorry for the delay. I will work through the PR in the next few days. 5.6.0 has been released, so this PR will go into the next 5.8.0 (even numbers are non-dev PyPI releases).
