install debug expert token counters on nvfp4 moe test script #4322

Merged
vkuzo merged 53 commits into main from gh/vkuzo/258/head
Apr 24, 2026

Conversation

Contributor

@vkuzo vkuzo commented Apr 23, 2026

Summary:

Add expert token counters to the HF OLMoE model to collect per-expert token counts.
This helps us understand how much calibration data is needed to get
a reasonable number of tokens per expert for PTQ algorithms.
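
Below is a minimal sketch of one way to install such counters, using forward hooks on each MoE block's router. It assumes the HF OLMoE layout, where each sparse MoE block has a `gate` linear producing `(num_tokens, num_experts)` router logits and each token is routed to its `top_k` experts; the names here are illustrative, not necessarily what the script does.

```
# Sketch only: assumes each MoE block's router is a linear module named
# `gate` whose output is (num_tokens, num_experts) logits, and that each
# token is routed to its top_k experts.
import torch


def install_expert_token_counters(model, top_k):
    counts = {}  # router module name -> LongTensor of per-expert token counts
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            # Flatten to (num_tokens, num_experts) and take each token's top_k experts.
            logits = output.reshape(-1, output.shape[-1])
            expert_ids = logits.topk(top_k, dim=-1).indices.flatten()
            # Histogram of how many tokens hit each expert in this forward pass.
            hist = torch.bincount(expert_ids, minlength=logits.shape[-1]).cpu()
            counts[name] = counts.get(name, 0) + hist
        return hook

    for name, mod in model.named_modules():
        if name.endswith(".gate"):  # per-layer router linear in HF OLMoE
            handles.append(mod.register_forward_hook(make_hook(name)))
    return counts, handles
```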

For this specific model, it seems that calibrating on C4 quickly hits
all the experts:

```
(pt_nightly) dev@gpu-dev-fef610c2:~/ao (20260420_gptq_nvfp4)$ time python scripts/prototype/test_nvfp4_moe.py --recipe=bf16 --calibrate_on_c4 --num_calibration_samples 1
...
=== Global expert utilization summary ===
experts with <= 0 tokens: 4/1024 (0.4%)
experts with <= 10 tokens: 60/1024 (5.9%)
experts with <= 20 tokens: 136/1024 (13.3%)
experts with <= 30 tokens: 220/1024 (21.5%)
experts with <= 40 tokens: 322/1024 (31.4%)
experts with <= 50 tokens: 404/1024 (39.5%)
experts with <= 60 tokens: 507/1024 (49.5%)
experts with <= 70 tokens: 612/1024 (59.8%)
experts with <= 80 tokens: 706/1024 (68.9%)
experts with <= 90 tokens: 799/1024 (78.0%)
experts with <= 100 tokens: 878/1024 (85.7%)
experts with <= 110 tokens: 923/1024 (90.1%)
experts with <= 120 tokens: 958/1024 (93.6%)

// min number of samples for at least one token per expert
(pt_nightly) dev@gpu-dev-fef610c2:~/ao (20260420_gptq_nvfp4)$ time python scripts/prototype/test_nvfp4_moe.py --recipe=bf16 --calibrate_on_c4 --num_calibration_samples 3
...
=== Global expert utilization summary ===
experts with <= 0 tokens: 0/1024 (0.0%)
experts with <= 10 tokens: 1/1024 (0.1%)
experts with <= 20 tokens: 9/1024 (0.9%)
experts with <= 30 tokens: 22/1024 (2.1%)
experts with <= 40 tokens: 32/1024 (3.1%)
experts with <= 50 tokens: 49/1024 (4.8%)
experts with <= 60 tokens: 60/1024 (5.9%)
experts with <= 70 tokens: 69/1024 (6.7%)
experts with <= 80 tokens: 81/1024 (7.9%)
experts with <= 90 tokens: 88/1024 (8.6%)
experts with <= 100 tokens: 115/1024 (11.2%)
experts with <= 110 tokens: 130/1024 (12.7%)
experts with <= 120 tokens: 149/1024 (14.6%)

// reasonable default
(pt_nightly) dev@gpu-dev-fef610c2:~/ao (20260420_gptq_nvfp4)$ time python scripts/prototype/test_nvfp4_moe.py --recipe=bf16 --calibrate_on_c4 --num_calibration_samples 128
...
=== Global expert utilization summary ===
experts with <= 0 tokens: 0/1024 (0.0%)
experts with <= 10 tokens: 0/1024 (0.0%)
experts with <= 20 tokens: 0/1024 (0.0%)
experts with <= 30 tokens: 0/1024 (0.0%)
experts with <= 40 tokens: 0/1024 (0.0%)
experts with <= 50 tokens: 0/1024 (0.0%)
experts with <= 60 tokens: 0/1024 (0.0%)
experts with <= 70 tokens: 0/1024 (0.0%)
experts with <= 80 tokens: 0/1024 (0.0%)
experts with <= 90 tokens: 0/1024 (0.0%)
experts with <= 100 tokens: 0/1024 (0.0%)
experts with <= 110 tokens: 0/1024 (0.0%)
experts with <= 120 tokens: 0/1024 (0.0%)
```
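
For reference, the cumulative summary above is straightforward to reproduce from the counters' output; `counts` below stands for the per-router count tensors and is a hypothetical name rather than the script's actual variable.

```
import torch


def print_utilization_summary(counts):
    # counts: dict mapping router name -> LongTensor of per-expert token counts
    all_counts = torch.cat(list(counts.values()))
    total = all_counts.numel()
    print("=== Global expert utilization summary ===")
    # Cumulative histogram: how many experts saw at most t tokens.
    for t in range(0, 121, 10):
        n = int((all_counts <= t).sum())
        print(f"experts with <= {t} tokens: {n}/{total} ({100.0 * n / total:.1f}%)")
```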

90% clauded

Test Plan:

vkuzo added 30 commits April 20, 2026 20:52

@vkuzo vkuzo requested a review from jerryzh168 as a code owner April 23, 2026 17:21

pytorch-bot Bot commented Apr 23, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4322

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure

As of commit f8d1861 with merge base 67a78e5:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 23, 2026
@vkuzo vkuzo added the module: not user facing Use this tag if you don't want this PR to show up in release notes label Apr 23, 2026
vkuzo added 4 commits April 23, 2026 18:44
vkuzo added a commit that referenced this pull request Apr 23, 2026
vkuzo added 3 commits April 23, 2026 18:45
vkuzo added a commit that referenced this pull request Apr 23, 2026
vkuzo added 2 commits April 23, 2026 18:46
vkuzo added a commit that referenced this pull request Apr 23, 2026
@vkuzo vkuzo changed the base branch from gh/vkuzo/257/head to main April 23, 2026 18:47
vkuzo added a commit that referenced this pull request Apr 23, 2026
@vkuzo vkuzo merged commit 899fea2 into main Apr 24, 2026
50 of 53 checks passed