
Conversation

@willzhou-amd
Contributor

@willzhou-amd willzhou-amd commented Jul 1, 2025

Changes:

  • Fix a .assume() bug that causes M=1 cases to fail in some kernels
  • Add M = 1 test cases to all tests
  • Standardize weight shapes to (N, K) (see the sketch after this list)
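
For reference, a minimal PyTorch sketch (names hypothetical) of the (N, K) weight convention: the weight keeps K as its contiguous axis, matching the row-major activation (the TN layout named in the PR title), and the GEMM contracts over K through the weight's transpose.

    import torch

    M, N, K = 4, 8, 16
    x = torch.randn(M, K)   # activations: (M, K), K-contiguous
    w = torch.randn(N, K)   # weights stored as (N, K), also K-contiguous
    out = x @ w.T           # contract over K: (M, K) @ (K, N) -> (M, N)
    assert out.shape == (M, N)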

@willzhou-amd willzhou-amd self-assigned this Jul 1, 2025
@willzhou-amd willzhou-amd requested a review from scxiao July 2, 2025 16:49
@rahulbatra85 rahulbatra85 changed the title Fix .assume() bug that causes M=1 cases to fail in some kernels [TRITON]: Fix .assume() bug that causes M=1 cases to fail in some kernels Jul 2, 2025
@willzhou-amd willzhou-amd changed the title [TRITON]: Fix .assume() bug that causes M=1 cases to fail in some kernels [TRITON]: Standardize weight shapes to (N, K) and TN memory layout (by default) Jul 2, 2025
@willzhou-amd willzhou-amd changed the title [TRITON]: Standardize weight shapes to (N, K) and TN memory layout (by default) [TRITON]: Standardize GEMM weight shape to (N, K) and TN memory layout (by default) Jul 2, 2025
@willzhou-amd
Contributor Author

There's a gnarly issue where Triton implicitly promotes small integer arguments that it believes to be compile-time constants to tl.constexpr. When the dimensions are small (e.g., a stride of 1), typecasting such an argument inside the kernel raises a cast error:

    batch_id = batch_id.to(tl.int64)
    stride_ab = stride_ab.to(tl.int64)
                ^
AttributeError("'constexpr' object has no attribute 'to'")

Replacing .to() with tl.cast() sidesteps the problem: the argument is still promoted, but tl.cast accepts constexpr inputs, whereas a constexpr has no .to() method. I'm not sure whether this happens on other versions of Triton, but it is reliably reproducible on Triton 3.3.1.

tl;dr: if you're casting strides to int64 to prevent integer overflow, use tl.cast rather than .to().
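
A minimal sketch of the failure mode and the fix (hypothetical kernel, not from this PR; assumes Triton's JIT specializes integer arguments equal to 1 into tl.constexpr):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def copy_kernel(src_ptr, dst_ptr, stride, n, BLOCK: tl.constexpr):
        # Triton specializes integer arguments equal to 1 into tl.constexpr,
        # and a constexpr object has no .to() method:
        #   stride = stride.to(tl.int64)   # AttributeError when stride == 1
        stride = tl.cast(stride, tl.int64)  # works whether or not promoted
        offs = tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(src_ptr + offs * stride, mask=mask)
        tl.store(dst_ptr + offs * stride, x, mask=mask)

    src = torch.arange(8, device="cuda", dtype=torch.float32)
    dst = torch.empty_like(src)
    # Passing stride=1 (as happens when M == 1) triggers the promotion.
    copy_kernel[(1,)](src, dst, 1, src.numel(), BLOCK=8)

With the commented-out .to() line instead, compilation should fail with the same AttributeError as in the traceback above.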

@rahulbatra85 rahulbatra85 self-requested a review July 8, 2025 18:52
rahulbatra85 previously approved these changes Jul 8, 2025
@rahulbatra85 rahulbatra85 merged commit e7570ed into main Jul 10, 2025
13 checks passed
@rahulbatra85 rahulbatra85 deleted the willz/weight-shape-debug branch July 10, 2025 01:35
fsx950223 pushed a commit that referenced this pull request Jul 11, 2025
[TRITON]: Standardize GEMM weight shape to (N, K) and TN memory layout (by default) (#597)

* Fix .assume() bug that causes M=1 cases to fail in some kernels

* Add minimal test cases (M, N, K) = (1, 1, 1)

* Fix bug where stride_.. becomes a tl.constant and raises a cast error

* Add weight shape changes for a8w8, a8w8 blockscale, a16w16

* Add weight shape changes for the rest of the GEMMs. FP4 kernels not yet validated

* Fix tensor ops for the afp4wfp4 GEMM

* Fix tensor ops for the afp4wfp4 pre-quant GEMM

* Add weight shape changes for gemm_afp4wfp4 kernels (atomic & standard)

* Formatting changes

* Fix bug where stride_.. becomes a tl.constant and raises a cast error

* Add stride int64 casts back

* Fix bug where stride_.. becomes implicitly promoted to a tl.constant and raises a cast error. Solution involves using tl.cast

* Add cast debug comment

* Temp change to use int64 strides
cagrikymk pushed a commit that referenced this pull request Jul 30, 2025
[TRITON]: Standardize GEMM weight shape to (N, K) and TN memory layout (by default) (#597)
