@fxmarty-amd commented May 14, 2025

This PR allows loading MXFP4 models in sglang, using dynamic per-group OCP MXFP4 quantization for the activations of linear layers.

The supported models are quantized using AMD Quark.

For now, the GEMM execution is simulated in fp16/bf16 (fake quantization); MXFP4 kernels will be added in the future.
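
For intuition, here is a minimal sketch of what dynamic per-group OCP MXFP4 fake quantization looks like. This is not the code from this PR; the helper name mxfp4_fake_quant and its details are illustrative. Each group of 32 values shares a power-of-two (E8M0) scale, each element is rounded to the nearest FP4 (E2M1) value, and the result is dequantized back so the GEMM can still run in fp16/bf16.

import torch

# Representable magnitudes of the OCP FP4 (E2M1) element format.
_FP4_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def mxfp4_fake_quant(x: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    # Hypothetical helper: fake-quantize x to MXFP4 per group along the last
    # dimension (assumed to be divisible by group_size), then dequantize back.
    orig_shape, orig_dtype = x.shape, x.dtype
    x = x.reshape(-1, group_size).float()

    # Shared power-of-two (E8M0) scale per group, chosen so that the group
    # maximum lands near the largest FP4 magnitude (6.0 = 1.5 * 2**2).
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-30)
    scale = torch.exp2(torch.floor(torch.log2(amax)) - 2)

    # Round each scaled element to the nearest representable FP4 magnitude,
    # keep the sign, then rescale (dequantize).
    scaled = (x / scale).clamp(-6.0, 6.0)
    fp4 = _FP4_VALUES.to(x.device)
    idx = (scaled.abs().unsqueeze(-1) - fp4).abs().argmin(dim=-1)
    dq = torch.sign(scaled) * fp4[idx] * scale

    return dq.reshape(orig_shape).to(orig_dtype)

Weights and activations fake-quantized this way can then go through a normal fp16/bf16 matmul, which is what the "simulated" execution above refers to.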

I get very similar eval results compared to the recent integration in vllm, on wikitext for a Llama 2 70B MXFP4 model:

sglang (this PR):

Tasks     Version  Filter  n-shot  Metric           Value   Stderr
wikitext  2        none    0       bits_per_byte    0.5560  ± N/A
                   none    0       byte_perplexity  1.4701  ± N/A
                   none    0       word_perplexity  7.8516  ± N/A

vllm:

Tasks     Version  Filter  n-shot  Metric           Value   Stderr
wikitext  2        none    0       bits_per_byte    0.5565  ± N/A
                   none    0       byte_perplexity  1.4707  ± N/A
                   none    0       word_perplexity  7.8674  ± N/A

To do:

  • Documentation
  • Tests
  • Support MoE models (likely in another PR)
  • Support an mxfp4 * mxfp4 GEMM kernel (likely in another PR)
  • Test eval parity with vllm for DeepSeek R1, Llama 4 and Llama 405B

HaiShaw self-assigned this May 15, 2025
@HaiShaw (Collaborator) commented May 15, 2025

Can we extend this to handle fp8/bf16/fp16 activations (with only the weights in mxfp4)?

@zhaochenyang20 (Collaborator)

@fxmarty-amd rebase?

Comment on lines 1467 to 1471

fuse_qkv_a_proj = hasattr(config, "q_lora_rank") and config.q_lora_rank is not None
if fuse_qkv_a_proj:
    self.packed_modules_mapping["fused_qkv_a_proj_with_mqa"] = ["q_a_proj", "kv_a_proj_with_mqa"]

Contributor

are these additions needed solely by quark?

Comment on lines 732 to 741

quark_scale_names = {
    ".q_proj.output_scale": ".attn.q_scale",
    ".k_proj.output_scale": ".attn.k_scale",
    ".v_proj.output_scale": ".attn.v_scale",
    "self_attn.prob_output_scale": ".attn.prob_scale",
}
for quark_scale_name, sglang_scale_name in quark_scale_names.items():
    if name.endswith(quark_scale_name):
        return name.replace(quark_scale_name, sglang_scale_name)

Contributor

Is this how it is done in vllm?

Author

No, sglang uses different logic here.

vllm uses https://github.com/vllm-project/vllm/blob/d637b960994119907b41c82d79f5a71c96dd419b/vllm/model_executor/layers/quantization/quark/quark.py#L344, which does not exist for other quantization schemes in sglang.

@BowenBao (Contributor)

> Can we extend this to handle fp8/bf16/fp16 activations (with only the weights in mxfp4)?

These are good ideas; however, I'd suggest keeping the scope of this PR to just mxfp4 weights and dynamic mxfp4 activations. Folks can fill in these two spots with an mxfp4 kernel call to make it work e2e.

fxmarty-amd requested a review from BBuf as a code owner May 19, 2025 10:34
@fxmarty-amd (Author)

@zhaochenyang20 I merged main, let me know what you think!

def supports_custom_op() -> bool:
    return hasattr(torch.library, "custom_op")

def supports_mx() -> bool:

Collaborator

change this to mxfp_supported()

@fxmarty-amd (Author) Jul 15, 2025

Why? There are supports_custom_op and support_triton above.

@HaiShaw (Collaborator) commented Jul 14, 2025

@fxmarty-amd let's rebase

fxmarty-amd requested a review from HaiShaw July 15, 2025 15:46
@fxmarty-amd (Author)

Hi @HaiShaw, the conflicts should now be resolved, and the tests from https://github.com/fxmarty-amd/sglang/blob/mxfp4/test/srt/models/test_quark_models.py run fine for me locally.

@fxmarty-amd (Author)

@xutizhou

> Hi @fxmarty-amd, thank you very much for your outstanding work! I was wondering, is Quark also compatible with NVIDIA GPUs?

The Quark quantizer should run just fine on NVIDIA GPUs, although our CI runs on AMD Instinct GPUs.

@fxmarty-amd (Author) commented Jul 23, 2025

Hi @HaiShaw @zhyncs @zhaochenyang20 hope you are doing well.

I resolved the conflicts again and made sure the added tests pass as well.

Let me know if you need anything from me to get this PR merged.

Edit: actually, TestR1MXFP4Accuracy is not passing now - checking.

@fxmarty-amd (Author)

gentle ping @zhyncs @zhaochenyang20

The added tests pass for me.

I disabled --attention-backend aiter for now, as there are ongoing issues unrelated to this PR that are being investigated by @kkHuang-amd.

@BowenBao (Contributor)

@HaiShaw please kindly take a look.

@fxmarty-amd (Author)

Closing as #8255 will land instead.
