[Feature] MXFP4 model loading #6299
Conversation
Can we extend this to handle fp8/bf16/fp16 activations (with only the weights being mxfp4)?

@fxmarty-amd rebase?
fuse_qkv_a_proj = hasattr(config, "q_lora_rank") and config.q_lora_rank is not None
if fuse_qkv_a_proj:
    self.packed_modules_mapping["fused_qkv_a_proj_with_mqa"] = ["q_a_proj", "kv_a_proj_with_mqa"]
are these additions needed solely by quark?
This is needed for https://github.com/fxmarty-amd/sglang/blob/c0e1e1eeb12bf69ca28953d1c0960f370e7ee645/python/sglang/srt/layers/quantization/quark/quark.py#L64-L66
and for sglang/python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors.py, lines 124 to 126 in 844e2f2:

if should_ignore_layer(
    prefix, ignore=self.ignore, fused_mapping=self.packed_modules_mapping
):
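For context, a rough sketch of how an ignore check can use such a fused mapping (the function below is hypothetical, not sglang's actual should_ignore_layer): the quantization config's ignore list only knows the original unfused module names, so a fused name has to be expanded back before matching.

```python
from typing import Iterable, Mapping, Sequence

# Hypothetical sketch only; sglang's should_ignore_layer may differ.
# The idea: "fused_qkv_a_proj_with_mqa" never appears in the checkpoint's
# quantization_config ignore list, but its unfused constituents
# ("q_a_proj", "kv_a_proj_with_mqa") might, so expand before matching.
def should_ignore_layer_sketch(
    prefix: str,
    ignore: Iterable[str],
    fused_mapping: Mapping[str, Sequence[str]],
) -> bool:
    parent, _, module_name = prefix.rpartition(".")
    candidates = fused_mapping.get(module_name, [module_name])
    full_names = [f"{parent}.{c}" if parent else c for c in candidates]
    return any(
        any(pattern in name for pattern in ignore) for name in full_names
    )
```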
quark_scale_names = {
    ".q_proj.output_scale": ".attn.q_scale",
    ".k_proj.output_scale": ".attn.k_scale",
    ".v_proj.output_scale": ".attn.v_scale",
    "self_attn.prob_output_scale": ".attn.prob_scale",
}
for quark_scale_name, sglang_scale_name in quark_scale_names.items():
    if name.endswith(quark_scale_name):
        return name.replace(quark_scale_name, sglang_scale_name)
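For illustration only (the wrapper function below is not from the PR), this is how a Quark checkpoint key gets remapped to the name sglang's attention layer expects:

```python
# Hypothetical wrapper around the remapping loop above, for illustration.
def remap_quark_scale_name(name: str) -> str:
    quark_scale_names = {
        ".q_proj.output_scale": ".attn.q_scale",
        ".k_proj.output_scale": ".attn.k_scale",
        ".v_proj.output_scale": ".attn.v_scale",
        "self_attn.prob_output_scale": ".attn.prob_scale",
    }
    for quark_scale_name, sglang_scale_name in quark_scale_names.items():
        if name.endswith(quark_scale_name):
            return name.replace(quark_scale_name, sglang_scale_name)
    return name

# "model.layers.0.self_attn.k_proj.output_scale"
# -> "model.layers.0.self_attn.attn.k_scale"
print(remap_quark_scale_name("model.layers.0.self_attn.k_proj.output_scale"))
```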
is this how it's done in vllm?
No, sglang uses different logic.
vllm uses https://github.com/vllm-project/vllm/blob/d637b960994119907b41c82d79f5a71c96dd419b/vllm/model_executor/layers/quantization/quark/quark.py#L344, which does not exist for other quantization schemes in sglang.
These are some good ideas; however, I'd suggest focusing the scope of this PR on just mxfp4 weights and dynamic mxfp4 activations. Folks can fill in these two spots with mxfp4 kernel calls to make it work e2e.
@zhaochenyang20 I merged main, let me know what you think!
def supports_custom_op() -> bool:
    return hasattr(torch.library, "custom_op")


def supports_mx() -> bool:
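The body of supports_mx() is cut off in this excerpt. Purely as an illustration of what such a capability check could look like (the gating condition below is a placeholder assumption, not what the PR actually does):

```python
import torch

# Placeholder sketch, not the PR's supports_mx(): MX (microscaling) formats
# are only available on recent GPU generations, so a real check would gate
# on the device architecture and software stack.
def supports_mx_sketch() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    # Placeholder threshold; the actual requirement is vendor/arch specific.
    return (major, minor) >= (9, 0)
```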
change this to mxfp_supported()
why? there are supports_custom_op, support_triton above
@fxmarty-amd let's rebase
Hi @HaiShaw, the conflicts should now be resolved, and the tests from https://github.com/fxmarty-amd/sglang/blob/mxfp4/test/srt/models/test_quark_models.py run fine for me locally.
The Quark quantizer should run just fine on Nvidia GPUs, although our CI runs on AMD Instinct GPUs.
Hi @HaiShaw @zhyncs @zhaochenyang20 hope you are doing well. I solved conflicts again and made sure the added tests pass successfully as well. Let me know if you need anything from me to get this PR merged. Edit: actually, TestR1MXFP4Accuracy is not passing now - checking.
gentle ping @zhyncs @zhaochenyang20 The added tests pass for me. I disabled
@HaiShaw please kindly take a look.
Closing as #8255 will land instead. |
This PR adds support for loading MXFP4 models in sglang, using dynamic per-group OCP MXFP4 quantization for the activations of linear layers.
The supported models are quantized using AMD Quark.
For now, the GEMM execution is simulated in fp16/bf16, but mxfp4 kernels will be added in the future.
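To make the simulated path concrete, here is a minimal sketch of dynamic per-group MXFP4 fake quantization, assuming groups of 32 values sharing a power-of-two scale and FP4 E2M1 elements as in the OCP MX spec; it is only an illustration of the scheme, not the code added by this PR:

```python
import torch

# Illustrative sketch of dynamic per-group OCP MXFP4 fake quantization,
# not the PR's implementation. Groups of 32 values share one power-of-two
# (E8M0-style) scale; each value is rounded to the nearest FP4 E2M1 level.
_E2M1_LEVELS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_fake_quant(x: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    assert x.numel() % group_size == 0
    xg = x.reshape(-1, group_size).float()

    # Shared scale per group: 2 is the exponent of the largest E2M1 value
    # (6.0 = 1.5 * 2^2), so the group's max maps near the top of the range.
    amax = xg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-30)
    scale = torch.exp2(torch.floor(torch.log2(amax)) - 2.0)

    # Round each scaled value to the nearest representable E2M1 magnitude.
    scaled = (xg / scale).clamp(-6.0, 6.0)
    levels = _E2M1_LEVELS.to(x.device)
    idx = (scaled.abs().unsqueeze(-1) - levels).abs().argmin(dim=-1)
    q = levels[idx] * scaled.sign()

    # Dequantize; this fp16/bf16 tensor is what the simulated GEMM consumes.
    return (q * scale).reshape(x.shape).to(x.dtype)

# Example: quantize-dequantize an activation tensor before a bf16 matmul.
act = torch.randn(4, 64, dtype=torch.bfloat16)
act_dq = mxfp4_fake_quant(act)
```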
I get very similar eval results to the recent vllm integration on wikitext for a Llama 2 70B mxfp4 model:
sglang (this PR):
vllm:
To do: