
[sm90][sparse-prefill] optimize phase1 producer scheduling#170

Open
huangzhilin-hzl wants to merge 1 commit into deepseek-ai:main from huangzhilin-hzl:hzl/sm90-sparse-prefill-phase1-opt


@huangzhilin-hzl

Summary

This PR improves the H20 SM90 sparse prefill phase1 path by:

  • reordering producer copies to make V0R ready earlier
  • publishing the valid mask immediately after V0R
  • switching producer-side cp_async cache policy to evict_first
  • reducing warpgroup_reg_alloc from 216 to 200
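
For a sense of scale on the last item, the register arithmetic can be sketched in Python. This is a rough illustration, not the kernel's code: it assumes `warpgroup_reg_alloc` sets a per-thread register cap through the PTX `setmaxnreg` instruction (as the CUTLASS helper of the same name does) and that the cap applies to every thread of a 128-thread warpgroup. The helper `warpgroup_registers` is hypothetical, and the number of warpgroups this kernel uses is not stated here.

```python
WARPGROUP_THREADS = 128  # one SM90 warpgroup = 4 warps x 32 threads


def warpgroup_registers(regs_per_thread: int) -> int:
    """Registers held by one warpgroup at a given per-thread cap.

    Hypothetical helper for illustration only.
    """
    # setmaxnreg accepts multiples of 8 in the range [24, 256].
    if regs_per_thread % 8 != 0 or not 24 <= regs_per_thread <= 256:
        raise ValueError(f"invalid setmaxnreg value: {regs_per_thread}")
    return regs_per_thread * WARPGROUP_THREADS


before = warpgroup_registers(216)  # cap before this PR
after = warpgroup_registers(200)   # cap after this PR
print(f"before={before} after={after} freed_per_warpgroup={before - after}")
```

Under these assumptions, each warpgroup that lowers its cap from 216 to 200 gives back 16 registers per thread. Where that headroom goes depends on the kernel's warpgroup layout, which this sketch does not model.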

Test Environment

  • GPU: H20

Correctness

  • current sparse prefill test script result: 617 / 617 passed

Performance

Repository-standard v32 benchmark (tests/test_flash_mla_sparse_prefill.py):

| Group | Case | Baseline (us) | Current (us) | Speedup |
|---|---|---|---|---|
| v32 | sq256_skv8192 | 1165.2 | 1138.4 | +2.359% |
| v32 | sq256_skv32768 | 1157.5 | 1139.8 | +1.548% |
| v32 | sq256_skv65536 | 1156.4 | 1135.7 | +1.827% |
| v32 | sq256_skv98304 | 1150.8 | 1137.2 | +1.197% |
| v32 | sq256_skv131072 | 1153.0 | 1137.7 | +1.348% |
| v32 | sq1024_skv8192 | 4472.3 | 4360.7 | +2.558% |
| v32 | sq1024_skv32768 | 4447.4 | 4345.1 | +2.355% |
| v32 | sq1024_skv65536 | 4421.6 | 4350.2 | +1.643% |
| v32 | sq1024_skv98304 | 4427.3 | 4354.8 | +1.666% |
| v32 | sq1024_skv131072 | 4438.8 | 4350.7 | +2.025% |
| v32 | sq2048_skv8192 | 8791.5 | 8566.8 | +2.623% |
| v32 | sq2048_skv32768 | 8712.7 | 8556.9 | +1.821% |
| v32 | sq2048_skv65536 | 8699.5 | 8631.3 | +0.790% |
| v32 | sq2048_skv98304 | 8700.2 | 8618.7 | +0.946% |
| v32 | sq2048_skv131072 | 8692.1 | 8605.2 | +1.009% |
| v32 | sq4096_skv8192 | 17502.9 | 17144.2 | +2.092% |
| v32 | sq4096_skv32768 | 17388.1 | 17136.7 | +1.467% |
| v32 | sq4096_skv65536 | 17354.2 | 17137.2 | +1.266% |
| v32 | sq4096_skv98304 | 17457.3 | 17127.5 | +1.926% |
| v32 | sq4096_skv131072 | 17472.9 | 17126.5 | +2.022% |
| v32 | sq8192_skv8192 | 35160.0 | 34168.7 | +2.901% |
| v32 | sq8192_skv32768 | 34926.3 | 34135.1 | +2.318% |
| v32 | sq8192_skv65536 | 34863.3 | 34134.0 | +2.137% |
| v32 | sq8192_skv98304 | 34845.0 | 34122.5 | +2.117% |
| v32 | sq8192_skv131072 | 34860.1 | 34133.0 | +2.130% |
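
The speedup column in the table above appears to be `baseline / current - 1`. That formula is inferred from the numbers, not stated in the PR. The Python sketch below recomputes it for three rows copied from the table; `speedup_percent` is a hypothetical helper name. Because the published timings are rounded to 0.1 us, a recomputed value can differ from the reported one by a few thousandths of a percentage point.

```python
def speedup_percent(baseline_us: float, current_us: float) -> float:
    """Relative speedup of current over baseline, in percent (assumed formula)."""
    return (baseline_us / current_us - 1.0) * 100.0


# (baseline us, current us, reported speedup %) copied from the table above.
rows = {
    "sq256_skv8192": (1165.2, 1138.4, 2.359),
    "sq2048_skv8192": (8791.5, 8566.8, 2.623),
    "sq8192_skv8192": (35160.0, 34168.7, 2.901),
}

for case, (baseline_us, current_us, reported) in rows.items():
    recomputed = speedup_percent(baseline_us, current_us)
    print(f"{case}: recomputed {recomputed:+.3f}% vs reported {reported:+.3f}%")
```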
