make NVFP4Tensor handle per-expert outer scale #4315
Conversation
CI status: 1 new failure as of commit be9dc1b (merge base 0c8f44b).
```diff
 def get_hp_scales(self) -> torch.Tensor:
-    """Get the scales of the NVFP4Tensor in original dtype.
+    """Get the scales of the NVFP4Tensor in float32.
```
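The docstring change above reflects that the returned scales are always float32. A minimal sketch of what such a method might compute, assuming the tensor stores E4M3 block scales plus an optional outer scale (the standalone function and the names `blockwise_scales` / `per_tensor_scale` are illustrative assumptions, not the exact torchao attributes):

```python
from typing import Optional

import torch

def get_hp_scales(blockwise_scales: torch.Tensor,
                  per_tensor_scale: Optional[torch.Tensor]) -> torch.Tensor:
    """Reconstruct high-precision scales in float32 (illustrative sketch)."""
    scales = blockwise_scales.to(torch.float32)
    if per_tensor_scale is not None:
        # The outer scale may be a scalar (per-tensor) or, per this PR,
        # per-expert with a shape like [num_experts, 1, 1] that broadcasts
        # over each expert's block scales.
        scales = scales * per_tensor_scale.to(torch.float32)
    return scales
```

With a scalar outer scale this reduces to the old per-tensor behavior; a `[num_experts, 1, 1]` outer scale applies a different factor to each expert's slice.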
not for this PR: I just checked the usage of `self._orig_dtype`; it seems duplicated with `self.dtype` now
jerryzh168 left a comment:
It would be good to add a comment for the `per_tensor_scale` attribute of `NVFP4Tensor`, I think. Also, we should probably try to restrict this to weights only, since it doesn't apply to activations?
sure, fixed!
it could apply to activations (per-token-group)
Summary:
NVFP4 MoE kernels usually use a per-expert outer scale. This PR extends `NVFP4Tensor` to support this.

Test Plan:
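The per-expert outer scale described in the summary can be sketched as follows: for a stacked MoE weight of shape `[num_experts, N, K]`, the outer scale is computed from each expert's own absolute maximum rather than the whole tensor's, giving a scale of shape `[num_experts]` instead of a scalar. This is an illustrative sketch (the function name and the scale recipe using the E4M3 and E2M1 maxima are assumptions, not the exact torchao code):

```python
import torch

F8E4M3_MAX = 448.0  # max magnitude representable in float8_e4m3fn
F4E2M1_MAX = 6.0    # max magnitude representable in fp4 e2m1

def per_expert_outer_scale(w: torch.Tensor) -> torch.Tensor:
    """Compute one outer scale per expert for a [num_experts, N, K] weight."""
    # amax over each expert's [N, K] slice, not over the full tensor
    amax = w.abs().amax(dim=(-2, -1))  # shape [num_experts]
    return amax.to(torch.float32) / (F8E4M3_MAX * F4E2M1_MAX)
```

Reshaping the result to `[num_experts, 1, 1]` lets it broadcast against each expert's block scales during quantization and dequantization.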