make NVFP4Tensor handle per-expert outer scale #4315
Conversation
CI status: 1 new failure as of commit be9dc1b (merge base 0c8f44b).
```diff
 def get_hp_scales(self) -> torch.Tensor:
-    """Get the scales of the NVFP4Tensor in original dtype.
+    """Get the scales of the NVFP4Tensor in float32.
```
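The docstring change above reflects that the returned scales are always float32. A minimal sketch of what such a method might compute, assuming the tensor stores E4M3 block scales plus an optional outer scale (the standalone function and the names `blockwise_scales` / `per_tensor_scale` are illustrative assumptions, not the exact torchao attributes):

```python
from typing import Optional

import torch

def get_hp_scales(blockwise_scales: torch.Tensor,
                  per_tensor_scale: Optional[torch.Tensor]) -> torch.Tensor:
    """Reconstruct high-precision scales in float32 (illustrative sketch)."""
    scales = blockwise_scales.to(torch.float32)
    if per_tensor_scale is not None:
        # The outer scale may be a scalar (per-tensor) or, per this PR,
        # per-expert with a shape like [num_experts, 1, 1] that broadcasts
        # over each expert's block scales.
        scales = scales * per_tensor_scale.to(torch.float32)
    return scales
```

With a scalar outer scale this reduces to the old per-tensor behavior; a `[num_experts, 1, 1]` outer scale applies a different factor to each expert's slice.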
not for this PR: I just checked the usage of `self._orig_dtype`; it seems duplicated with `self.dtype` now
jerryzh168 left a comment:
It would be good to add a comment for the `per_tensor_scale` attribute of `NVFP4Tensor`, I think. Also, we should probably try to restrict this to weights only, since it doesn't apply to activations?
sure, fixed!
it could apply to activations (per-token-group)
Summary:
NVFP4 MoE kernels usually use a per-expert outer scale. This PR extends `NVFP4Tensor` to support this.

Test Plan:
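The per-expert outer scale described in the summary can be sketched as follows: for a stacked MoE weight of shape `[num_experts, N, K]`, the outer scale is computed from each expert's own absolute maximum rather than the whole tensor's, giving a scale of shape `[num_experts]` instead of a scalar. This is an illustrative sketch (the function name and the scale recipe using the E4M3 and E2M1 maxima are assumptions, not the exact torchao code):

```python
import torch

F8E4M3_MAX = 448.0  # max magnitude representable in float8_e4m3fn
F4E2M1_MAX = 6.0    # max magnitude representable in fp4 e2m1

def per_expert_outer_scale(w: torch.Tensor) -> torch.Tensor:
    """Compute one outer scale per expert for a [num_experts, N, K] weight."""
    # amax over each expert's [N, K] slice, not over the full tensor
    amax = w.abs().amax(dim=(-2, -1))  # shape [num_experts]
    return amax.to(torch.float32) / (F8E4M3_MAX * F4E2M1_MAX)
```

Reshaping the result to `[num_experts, 1, 1]` lets it broadcast against each expert's block scales during quantization and dequantization.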