
perf(amdgpu): amdgpu perf + force_inline #15

Open
deepsek wants to merge 4 commits into amd-integration from perf/deepsek/expose_perf_vars

Conversation

Collaborator

@deepsek deepsek commented Apr 25, 2026

  • kernel launch improvements
  • force_inline support into the AST

@deepsek deepsek force-pushed the perf/deepsek/expose_perf_vars branch from 7c8ba7e to 789d90d on April 26, 2026 15:00
@deepsek deepsek force-pushed the perf/deepsek/expose_perf_vars branch from 789d90d to 95b5708 on April 28, 2026 09:27
Collaborator Author

deepsek commented Apr 28, 2026

/run-ci

Collaborator

@yaoliu13 yaoliu13 left a comment

Thank you

@yaoliu13
Collaborator

/run-ci

1 similar comment
@yaoliu13
Collaborator

/run-ci

@lohiaj

lohiaj commented Apr 29, 2026

Strong evidence to land. AMDGCN dumps on gfx942 from the Genesis hot kernels show the launcher kernels are thin (≤74 VGPR, 0 scratch) but each calls an outlined function_body via s_swappc_b64 that pays a fixed callee-save prologue/epilogue:

| outlined `function_body` | VGPR | AGPR | callee-save scratch |
| --- | --- | --- | --- |
| `func_solve_body_monolith_kernel_1` | 256 | 172 | 1012 B (252 dwords) |
| `func_solve_init_kernel_11` | 248 | 32 | 284 B |
| `func_solve_init_kernel_7` | 248 | 32 | 292 B |
| `kernel_step_1_kernel_26` | 248 | 32 | 240 B |

The prologue in each is ~60 contiguous scratch_store_dword instructions plus ~32 v_accvgpr_write_b32 instructions saving v40..v207 in 8-VGPR groups, and the epilogue mirrors it: ~184 prologue/epilogue ops on every call.
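As a back-of-envelope tally (a sketch only; the 60/32 instruction counts are the approximate figures quoted from the dumps, not exact values):

```python
# Rough tally of the fixed callee-save cost paid on every outlined call.
# Counts are the approximate figures from the gfx942 AMDGCN dumps above.
scratch_stores = 60   # contiguous scratch_store_dword in the prologue
accvgpr_writes = 32   # v_accvgpr_write_b32 spills (VGPR -> AGPR)

prologue_ops = scratch_stores + accvgpr_writes
epilogue_ops = prologue_ops            # the epilogue mirrors the prologue
total_ops = prologue_ops + epilogue_ops

print(total_ops)  # ~184 save/restore ops per s_swappc_b64 call
```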

Confirmed via paired AMDGCN dumps that source-level changes don't clear this floor: e.g. a loop-fusion candidate that removed 690 asm lines from function_body left ΔVGPR/ΔAGPR/Δscratch = 0. Whatever shrinks the body, the call still saves the same registers across the boundary.

force_inline on the relevant @qd.funcs removes that boundary entirely. Happy to share the full dumps for any of the above kernels if useful.
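Conceptually, the knob just tags a function so the compiler inlines it instead of outlining it behind a call. A minimal toy sketch in plain Python (the `loop_config`/`force_inline` names here are illustrative stand-ins, not the actual qd API):

```python
# Hypothetical sketch: a decorator factory that attaches an inline hint
# as metadata, in the spirit of the force_inline knob discussed here.
# A compiler front end could read this flag when lowering the AST.
def loop_config(force_inline=False):
    def wrap(fn):
        fn.force_inline = force_inline  # metadata only; no behavior change
        return fn
    return wrap

@loop_config(force_inline=True)
def solve_body():
    pass  # body would be inlined at call sites, skipping callee saves

print(solve_body.force_inline)  # True
```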

@yaoliu13 @deepsek FYI

@deepsek
Collaborator Author

deepsek commented Apr 29, 2026

Good eye @lohiaj! That's the main reason I'm exposing this as a variable. Thanks for validating it too!
A fun little exercise would be to work out why I'm exposing it as a loop_config decorator instead of on qd.func.
In any case, waiting for this to finally land so that the other side of the spectrum can have its day in the light!

@deepsek
Collaborator Author

deepsek commented Apr 29, 2026

/run-ci

@deepsek deepsek force-pushed the perf/deepsek/expose_perf_vars branch from 9d2a6cb to caf2c1f on May 1, 2026 20:20
@deepsek
Collaborator Author

deepsek commented May 1, 2026

/run-ci

@yaoliu13
Collaborator

yaoliu13 commented May 2, 2026

1370057 and 4968

@ROCm ROCm deleted a comment from mukh1l May 2, 2026
@ROCm ROCm deleted a comment from deepsek May 2, 2026
Collaborator

@yaoliu13 yaoliu13 left a comment

LGTM

@yaoliu13
Collaborator

yaoliu13 commented May 2, 2026

Need one more approval


@lohiaj lohiaj left a comment

Reviewed and approved. force_inline removes the outlined callee-save/restore boundary (validated on the Genesis hot kernels), and the launcher hot-path cleanup looks clean.

@yaoliu13
Collaborator

yaoliu13 commented May 3, 2026

/run-ci

1 similar comment
@yaoliu13
Collaborator

yaoliu13 commented May 3, 2026

/run-ci

4 participants