Skip to content

Conversation

@caio96
Copy link
Contributor

@caio96 caio96 commented Jan 5, 2026

Motivation

BitfieldExtract code generation does not support extracting from a source that has two registers (e.g. 64-bit source) into a destination that has one register (e.g. 32-bit destination). This PR splits BitfieldExtract (BFE) expressions that cross dword boundaries into two separate BFE combined with an OR when such BFEs are created during SplitBitfieldCombine.

Technical Details

  • Calls splitBitfieldExtract when creating BFEs in SplitBitfieldCombine
  • Adds test and updates previous SplitBitfieldCombine tests

Test Result

  • Passing SplitBitfieldCombine tests

Submission Checklist

@caio96 caio96 requested a review from a team as a code owner January 5, 2026 20:30
@caio96 caio96 changed the title [rocRoller] Splits BitFieldExtract expressions that cross dword boundaries [rocRoller] Split BitFieldExtract expressions that cross dword boundaries Jan 5, 2026
@math-ci-jobs
Copy link

math-ci-jobs bot commented Jan 5, 2026

Generated Documentation

@math-ci-jobs
Copy link

math-ci-jobs bot commented Jan 5, 2026

CodeQL report

Results Summary

Full table of results
Tool Severity Code Location Line

Links

  • HTML
  • Sarif (for download and usage in conjunction with SARIF viewers)

@math-ci-jobs
Copy link

math-ci-jobs bot commented Jan 5, 2026

Performance Report for gfx12

Results

Details
@@            Significant (p-val <0.05) Performance Diffs            @@
====================================================================================================
-   0.22% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'bf8', 'type_B': 'bf8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Cooperative, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'bf8', 'type_B': 'bf8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Cooperative', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess=True, version='')
-   0.39% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'bf8', 'type_B': 'fp8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'bf8', 'type_B': 'fp8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess=True, version='')
+   0.10% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'bf8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Cooperative, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'bf8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Cooperative', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess=True, version='')
-   0.26% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'bf8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'bf8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess=True, version='')
+   0.14% | p=1.7451e-03 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess=True, version='')
-   1.86% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'half', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 16, wave_n: 16, wave_k: 16, wave_b: 1, workgroup_size_x: 64, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'half', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=16, wave_n=16, wave_k=16, wave_b=1, workgroup_size_x=64, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx1201', 'Xnack': False, 'Sramecc': False}, matchMemoryAccess=True, version='')
+   1.58% | p=3.2348e-04 
	| CodeGen(instCount: 40000, instructions: comments)| CodeGen() | CodeGen(instCount: 40000, instructions: comments)
+   1.59% | p=5.6994e-05 
	| CodeGen(instCount: 40000, instructions: complex_mi_with_coop)| CodeGen() | CodeGen(instCount: 40000, instructions: complex_mi_with_coop)
+   4.09% | p=5.6994e-05 
	| CodeGen(instCount: 40000, instructions: simple_mi)| CodeGen() | CodeGen(instCount: 40000, instructions: simple_mi)
Links

@math-ci-jobs
Copy link

math-ci-jobs bot commented Jan 5, 2026

Resource Report for gfx12

Results

Details

✔️ No Resource Usage Changes ✔️

Links

@math-ci-jobs
Copy link

math-ci-jobs bot commented Jan 5, 2026

Performance Report for gfx950-rocroller

Results

Details
@@            Significant (p-val <0.05) Performance Diffs            @@
====================================================================================================
-   0.39% | p=1.7451e-03 
	| 3. FloatsGEMM(M: 256, N: 256, K: 16384, alpha: 2, beta: 0.5, types: {'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 2, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=256, N=256, K=16384, alpha=2, beta=0.5, types={'type_A': 'float', 'type_B': 'float', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=2, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.49% | p=1.3121e-07 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.09% | p=8.6805e-07 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.66% | p=1.4484e-11 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   0.27% | p=5.7414e-04 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.54% | p=3.6202e-11 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.10% | p=1.3500e-08 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   0.10% | p=6.8974e-07 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   0.05% | p=9.4893e-03 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   0.10% | p=6.1027e-09 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [32, 8, 2], 'scaleShuffleTileB': [32, 8, 2], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale: True, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [32, 8, 2], 'scaleShuffleTileB': [32, 8, 2], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 32, 'k': 8, 'n': 32, 'l': 8}, prefetchScale=True, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   0.26% | p=6.3719e-03 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [128, 8], 'scalePreTileB': [8, 128]}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 256, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToLDS, loadScale_B: BufferToLDS, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 4, 'n': 64, 'l': 4}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [128, 8], 'scalePreTileB': [8, 128]}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=256, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToLDS', loadScale_B='BufferToLDS', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 4, 'n': 64, 'l': 4}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   0.06% | p=1.7202e-04 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   0.17% | p=5.7400e-16 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'SingleScale', 'scaleType_A': 'E8M0', 'scale_B': 'SingleScale', 'scaleType_B': 'E8M0', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 0.00999999978, scaleValue_B: 0.00999999978, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeightedSimple, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'SingleScale', 'scaleType_A': 'E8M0', 'scale_B': 'SingleScale', 'scaleType_B': 'E8M0', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=0.00999999978, scaleValue_B=0.00999999978, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeightedSimple', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   0.16% | p=1.4378e-05 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   0.12% | p=2.8685e-10 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   0.24% | p=8.3258e-03 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.08% | p=1.3660e-06 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 2], 'scaleShuffleTileB': [64, 4, 2], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 2], 'scaleShuffleTileB': [64, 4, 2], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   0.40% | p=3.7823e-08 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: False, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 4, prefetchLDSFactor: 1, prefetchMixMemOps: True, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: True, swizzleTileSize: {'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale: True, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': True, 'scaleShuffleTileA': [64, 4, 4], 'scaleShuffleTileB': [64, 4, 4], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=False, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=4, prefetchLDSFactor=1, prefetchMixMemOps=True, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=True, swizzleTileSize={'m': 64, 'k': 16, 'n': 64, 'l': 16}, prefetchScale=True, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.36% | p=5.4772e-07 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: False, loadScale_A: BufferToLDSViaVGPR, loadScale_B: BufferToLDSViaVGPR, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=False, loadScale_A='BufferToLDSViaVGPR', loadScale_B='BufferToLDSViaVGPR', swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.10% | p=5.0572e-06 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: 0, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 256, mac_k: 128, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDS, load_B: BufferToLDS, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 1, prefetchMixMemOps: False, loadScale_A: BufferToLDSViaVGPR, loadScale_B: BufferToLDSViaVGPR, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp4', 'type_B': 'fp4', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=0, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=256, mac_k=128, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDS', load_B='BufferToLDS', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=1, prefetchMixMemOps=False, loadScale_A='BufferToLDSViaVGPR', loadScale_B='BufferToLDSViaVGPR', swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   0.63% | p=2.2521e-117 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToLDSViaVGPR, loadScale_B: BufferToLDSViaVGPR, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToLDSViaVGPR', loadScale_B='BufferToLDSViaVGPR', swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.89% | p=4.3914e-98 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 32768, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 256, mac_n: 128, mac_k: 128, wave_m: 16, wave_n: 16, wave_k: 128, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToLDSViaVGPR, loadScale_B: BufferToLDSViaVGPR, swizzleScale: False, swizzleTileSize: {'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=32768, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'Separate', 'scaleType_A': 'E8M0', 'scale_B': 'Separate', 'scaleType_B': 'E8M0', 'scaleBlockSize': 32, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=256, mac_n=128, mac_k=128, wave_m=16, wave_n=16, wave_k=128, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToLDSViaVGPR', loadScale_B='BufferToLDSViaVGPR', swizzleScale=False, swizzleTileSize={'m': 64, 'k': 8, 'n': 64, 'l': 8}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   0.82% | p=1.7451e-03 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'bf8', 'type_B': 'bf8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 256, workgroup_size_y: 1, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=8192, alpha=2, beta=0.5, types={'type_A': 'bf8', 'type_B': 'bf8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=256, workgroup_size_y=1, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   1.24% | p=1.7451e-03 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 256, workgroup_size_y: 1, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=8192, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=256, workgroup_size_y=1, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   1.61% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 4096, N: 4096, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 64, wave_b: 1, workgroup_size_x: 256, workgroup_size_y: 1, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=4096, N=4096, K=8192, alpha=2, beta=0.5, types={'type_A': 'fp8', 'type_B': 'fp8', 'type_C': 'float', 'type_D': 'float', 'type_acc': 'float', 'trans_A': 'T', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=64, wave_b=1, workgroup_size_x=256, workgroup_size_y=1, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   1.44% | p=5.6994e-05 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.67% | p=1.7451e-03 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   1.70% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.56% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.53% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   2.01% | p=5.6994e-05 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.97% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   1.08% | p=5.6994e-05 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.69% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   1.29% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.40% | p=1.7451e-03 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   0.57% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.37% | p=1.7451e-03 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   0.35% | p=1.7451e-03 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   0.45% | p=1.7451e-03 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   0.95% | p=1.7451e-03 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 256, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=256, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.47% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.62% | p=1.7451e-03 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 256, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=256, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx950', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
Links

@math-ci-jobs
Copy link

math-ci-jobs bot commented Jan 5, 2026

Resource Report for gfx950-rocroller

Results

Details

✔️ No Resource Usage Changes ✔️

Links

@math-ci-jobs
Copy link

math-ci-jobs bot commented Jan 5, 2026

perfci run on commit 3f22da1

math-ci run

@math-ci-jobs
Copy link

math-ci-jobs bot commented Jan 5, 2026

Performance Report for gfx942

Results

Details
@@            Significant (p-val <0.05) Performance Diffs            @@
====================================================================================================
+   5.23% | p=1.7451e-03 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8192, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8192, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'N', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.26% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   1.98% | p=5.6994e-05 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   0.25% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   2.86% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: True, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=True, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   2.22% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+  12.91% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+  10.59% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   1.17% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 128, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=128, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   1.69% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 2, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=2, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   1.31% | p=5.6994e-05 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 32, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 304, streamKTwoTile: True, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=32, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=304, streamKTwoTile=True, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   1.86% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 7680, N: 8448, K: 8448, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 64, mac_n: 64, mac_k: 64, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: False, prefetchInFlight: 0, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: True, numWGs: 304, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=7680, N=8448, K=8448, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=64, mac_n=64, mac_k=64, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=False, prefetchInFlight=0, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=True, numWGs=304, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   2.37% | p=2.5347e-02 
	| 3. FloatsGEMM(M: 8448, N: 8448, K: 128, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Cooperative, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=8448, N=8448, K=128, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Cooperative', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
-   4.38% | p=1.7451e-03 
	| 3. FloatsGEMM(M: 8448, N: 8448, K: 128, alpha: 2, beta: 0.5, types: {'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A: 1, scaleValue_B: 1, initMode_A: DataInitMode(Bounded), initMode_B: DataInitMode(Bounded), initMode_C: DataInitMode(Bounded), workgroupMappingDim: -1, workgroupMappingValue: -1)| GEMM(mac_m: 128, mac_n: 256, mac_k: 16, wave_m: 32, wave_n: 32, wave_k: 8, wave_b: 1, workgroup_size_x: 128, workgroup_size_y: 2, workgroupRemapXCC: False, workgroupRemapXCCValue: -1, unroll_x: 0, unroll_y: 0, load_A: BufferToLDSViaVGPR, load_B: BufferToLDSViaVGPR, storeLDS_D: True, betaInFma: True, scheduler: Priority, schedulerCost: LinearWeighted, prefetch: True, prefetchInFlight: 2, prefetchLDSFactor: 0, prefetchMixMemOps: False, loadScale_A: BufferToVGPR, loadScale_B: BufferToVGPR, swizzleScale: False, swizzleTileSize: {'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale: False, pretileScale: False, streamK: False, numWGs: 0, streamKTwoTile: False, streamKTwoTileDPFirst: False, architecture: {'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess: True, version: ) | GEMMProblem(M=8448, N=8448, K=128, alpha=2, beta=0.5, types={'type_A': 'half', 'type_B': 'half', 'type_C': 'half', 'type_D': 'half', 'type_acc': 'float', 'trans_A': 'N', 'trans_B': 'T', 'scale_A': 'None', 'scaleType_A': 'None', 'scale_B': 'None', 'scaleType_B': 'None', 'scaleBlockSize': -1, 'scaleSkipPermlane': False, 'scaleShuffleTileA': [], 'scaleShuffleTileB': [], 'scalePreTileA': [], 'scalePreTileB': []}, scaleValue_A=1, scaleValue_B=1, initMode_A='DataInitMode(Bounded)', initMode_B='DataInitMode(Bounded)', initMode_C='DataInitMode(Bounded)', workgroupMappingDim=-1, workgroupMappingValue=-1)GEMMSolution(mac_m=128, mac_n=256, mac_k=16, wave_m=32, wave_n=32, wave_k=8, wave_b=1, workgroup_size_x=128, workgroup_size_y=2, workgroupRemapXCC=False, workgroupRemapXCCValue=-1, unroll_x=0, unroll_y=0, load_A='BufferToLDSViaVGPR', load_B='BufferToLDSViaVGPR', storeLDS_D=True, betaInFma=True, scheduler='Priority', schedulerCost='LinearWeighted', prefetch=True, prefetchInFlight=2, prefetchLDSFactor=0, prefetchMixMemOps=False, loadScale_A='BufferToVGPR', loadScale_B='BufferToVGPR', swizzleScale=False, swizzleTileSize={'m': 0, 'k': 0, 'n': 0, 'l': 0}, prefetchScale=False, pretileScale=False, streamK=False, numWGs=0, streamKTwoTile=False, streamKTwoTileDPFirst=False, architecture={'ArchString': 'gfx942', 'Xnack': False, 'Sramecc': True}, matchMemoryAccess=True, version='')
+   1.39% | p=5.6994e-05 
	| CodeGen(instCount: 40000, instructions: comments)| CodeGen() | CodeGen(instCount: 40000, instructions: comments)
Links

@math-ci-jobs
Copy link

math-ci-jobs bot commented Jan 5, 2026

Resource Report for gfx942

Results

Details

✔️ No Resource Usage Changes ✔️

Links

@math-ci-jobs
Copy link

math-ci-jobs bot commented Jan 5, 2026

Code Coverage Report for gfx942

Summary

Type Total Missed Master Missed Missed Change Coverage Master Coverage Coverage Change
Lines 61314 10820 10820 0 82.35% 82.35% 0%
Functions 5770 1111 1111 0 80.75% 80.74% .01%
Regions 39222 10641 10635 6 72.87% 72.87% 0%
Branches 21122 5809 5807 2 72.50% 72.50% 0%

Artifacts

Commit Hashes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants