[Feature] MXFP4 model loading #6299
Conversation
Can we extend this to handle fp8/bf16/fp16 activations (with only the weights being mxfp4)?

@fxmarty-amd rebase?
fuse_qkv_a_proj = hasattr(config, "q_lora_rank") and config.q_lora_rank is not None
if fuse_qkv_a_proj:
    self.packed_modules_mapping["fused_qkv_a_proj_with_mqa"] = ["q_a_proj", "kv_a_proj_with_mqa"]
are these additions needed solely by quark?
This is needed for https://github.com/fxmarty-amd/sglang/blob/c0e1e1eeb12bf69ca28953d1c0960f370e7ee645/python/sglang/srt/layers/quantization/quark/quark.py#L64-L66
and for sglang/python/sglang/srt/layers/quantization/compressed_tensors/compressed_tensors.py, lines 124 to 126 in 844e2f2:

if should_ignore_layer(
    prefix, ignore=self.ignore, fused_mapping=self.packed_modules_mapping
):
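For context, a rough sketch of how an ignore check can use such a fused mapping (the function below is hypothetical, not sglang's actual should_ignore_layer): the quantization config's ignore list only knows the original unfused module names, so a fused name has to be expanded back before matching.

```python
from typing import Iterable, Mapping, Sequence

# Hypothetical sketch only; sglang's should_ignore_layer may differ.
# The idea: "fused_qkv_a_proj_with_mqa" never appears in the checkpoint's
# quantization_config ignore list, but its unfused constituents
# ("q_a_proj", "kv_a_proj_with_mqa") might, so expand before matching.
def should_ignore_layer_sketch(
    prefix: str,
    ignore: Iterable[str],
    fused_mapping: Mapping[str, Sequence[str]],
) -> bool:
    parent, _, module_name = prefix.rpartition(".")
    candidates = fused_mapping.get(module_name, [module_name])
    full_names = [f"{parent}.{c}" if parent else c for c in candidates]
    return any(
        any(pattern in name for pattern in ignore) for name in full_names
    )
```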
quark_scale_names = {
    ".q_proj.output_scale": ".attn.q_scale",
    ".k_proj.output_scale": ".attn.k_scale",
    ".v_proj.output_scale": ".attn.v_scale",
    "self_attn.prob_output_scale": ".attn.prob_scale",
}
for quark_scale_name, sglang_scale_name in quark_scale_names.items():
    if name.endswith(quark_scale_name):
        return name.replace(quark_scale_name, sglang_scale_name)
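For illustration only (the wrapper function below is not from the PR), this is how a Quark checkpoint key gets remapped to the name sglang's attention layer expects:

```python
# Hypothetical wrapper around the remapping loop above, for illustration.
def remap_quark_scale_name(name: str) -> str:
    quark_scale_names = {
        ".q_proj.output_scale": ".attn.q_scale",
        ".k_proj.output_scale": ".attn.k_scale",
        ".v_proj.output_scale": ".attn.v_scale",
        "self_attn.prob_output_scale": ".attn.prob_scale",
    }
    for quark_scale_name, sglang_scale_name in quark_scale_names.items():
        if name.endswith(quark_scale_name):
            return name.replace(quark_scale_name, sglang_scale_name)
    return name

# "model.layers.0.self_attn.k_proj.output_scale"
# -> "model.layers.0.self_attn.attn.k_scale"
print(remap_quark_scale_name("model.layers.0.self_attn.k_proj.output_scale"))
```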
is this how it's done in vllm?
No, sglang uses different logic.
vllm uses https://github.com/vllm-project/vllm/blob/d637b960994119907b41c82d79f5a71c96dd419b/vllm/model_executor/layers/quantization/quark/quark.py#L344, which does not exist for other quantization schemes in sglang.
These are some good ideas; however, I'd suggest focusing the scope of this PR on just mxfp4 weights and dynamic mxfp4 activations. Folks can fill in these two spots with mxfp4 kernel calls to make it work e2e.
@zhaochenyang20 I merged main, let me know what you think!
def supports_custom_op() -> bool:
    return hasattr(torch.library, "custom_op")


def supports_mx() -> bool:
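The body of supports_mx() is cut off in this excerpt. Purely as an illustration of what such a capability check could look like (the gating condition below is a placeholder assumption, not what the PR actually does):

```python
import torch

# Placeholder sketch, not the PR's supports_mx(): MX (microscaling) formats
# are only available on recent GPU generations, so a real check would gate
# on the device architecture and software stack.
def supports_mx_sketch() -> bool:
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    # Placeholder threshold; the actual requirement is vendor/arch specific.
    return (major, minor) >= (9, 0)
```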
change this to mxfp_supported()
why? there are supports_custom_op, support_triton above
@fxmarty-amd let's rebase
Hi @HaiShaw, the conflicts should now be resolved, and the tests from https://github.com/fxmarty-amd/sglang/blob/mxfp4/test/srt/models/test_quark_models.py run fine for me locally.
The Quark quantizer should run just fine on Nvidia GPUs, although our CI runs on AMD Instinct GPUs.
Hi @HaiShaw @zhyncs @zhaochenyang20 hope you are doing well. I solved conflicts again and made sure the added tests pass successfully as well. Let me know if you need anything from me to get this PR merged. Edit: actually, TestR1MXFP4Accuracy is not passing now - checking.
gentle ping @zhyncs @zhaochenyang20 The added tests pass for me. I disabled
@HaiShaw please kindly take a look.
Closing as #8255 will land instead. |
This PR adds support for loading MXFP4 models in sglang, using dynamic per-group OCP MXFP4 quantization for the activations of linear layers.
The supported models are quantized using AMD Quark.
For now, the GEMM execution is simulated in fp16/bf16, but mxfp4 kernels will be added in the future.
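To make the simulated path concrete, here is a minimal sketch of dynamic per-group MXFP4 fake quantization, assuming groups of 32 values sharing a power-of-two scale and FP4 E2M1 elements as in the OCP MX spec; it is only an illustration of the scheme, not the code added by this PR:

```python
import torch

# Illustrative sketch of dynamic per-group OCP MXFP4 fake quantization,
# not the PR's implementation. Groups of 32 values share one power-of-two
# (E8M0-style) scale; each value is rounded to the nearest FP4 E2M1 level.
_E2M1_LEVELS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_fake_quant(x: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    assert x.numel() % group_size == 0
    xg = x.reshape(-1, group_size).float()

    # Shared scale per group: 2 is the exponent of the largest E2M1 value
    # (6.0 = 1.5 * 2^2), so the group's max maps near the top of the range.
    amax = xg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-30)
    scale = torch.exp2(torch.floor(torch.log2(amax)) - 2.0)

    # Round each scaled value to the nearest representable E2M1 magnitude.
    scaled = (xg / scale).clamp(-6.0, 6.0)
    levels = _E2M1_LEVELS.to(x.device)
    idx = (scaled.abs().unsqueeze(-1) - levels).abs().argmin(dim=-1)
    q = levels[idx] * scaled.sign()

    # Dequantize; this fp16/bf16 tensor is what the simulated GEMM consumes.
    return (q * scale).reshape(x.shape).to(x.dtype)

# Example: quantize-dequantize an activation tensor before a bf16 matmul.
act = torch.randn(4, 64, dtype=torch.bfloat16)
act_dq = mxfp4_fake_quant(act)
```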
I get very similar eval results to the recent vllm integration on wikitext for a Llama 2 70B mxfp4 model:
sglang (this PR):
vllm:
To do: