perf(amdgpu): inline offloaded range-for kernel body #12

Closed
kevinjosephamd wants to merge 1 commit into amd-integration from kejoseph/perf/inline-range-for-body

Conversation

@kevinjosephamd

Mark the per-iteration loop body emitted by create_offload_range_for as alwaysinline so the AMDGPU backend inlines it into gpu_parallel_range_for instead of leaving it as a separate function call. We measured a throughput improvement on an internal end-to-end workload.
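
For readers unfamiliar with the attribute, here is a minimal host-side sketch of the idea, not the code in this PR. It assumes a GCC- or Clang-compatible compiler. `range_for_body`, `parallel_range_for_sketch`, and `run_sketch` are hypothetical stand-ins for the generated loop body and for gpu_parallel_range_for. At the LLVM IR level, the analogous step would presumably be setting the `alwaysinline` function attribute on the generated body function, for example via `llvm::Function::addFnAttr(llvm::Attribute::AlwaysInline)`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-iteration loop body. The attribute asks
// GCC/Clang to inline this function into its callers, which is the
// source-level counterpart of LLVM's `alwaysinline` function attribute.
[[gnu::always_inline]] inline void range_for_body(std::vector<std::int64_t> &out,
                                                  std::size_t i) {
  out[i] = static_cast<std::int64_t>(i) * static_cast<std::int64_t>(i);
}

// Hypothetical stand-in for the range-for driver. With the body force-inlined,
// the loop contains the body's instructions directly instead of a call.
inline void parallel_range_for_sketch(std::vector<std::int64_t> &out,
                                      std::size_t begin, std::size_t end) {
  assert(begin <= end && end <= out.size());
  for (std::size_t i = begin; i < end; ++i) {
    range_for_body(out, i);
  }
}

// Small helper that runs the driver over [0, n) and returns the results.
inline std::vector<std::int64_t> run_sketch(std::size_t n) {
  std::vector<std::int64_t> out(n, 0);
  parallel_range_for_sketch(out, 0, n);
  return out;
}
```

Calling `run_sketch(4)` fills the vector with the squares of 0 through 3. Forcing inlining changes only how the code is generated, not the results, and whether it improves throughput depends on the workload and the backend.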

@yaoliu13
Collaborator

/run-ci

Collaborator

@jamesETsmith left a comment

This looks good pending CI runs after #3 goes in

Mark the per-iteration loop body emitted by create_offload_range_for as
alwaysinline so the AMDGPU backend inlines it into gpu_parallel_range_for
instead of leaving it as a separate function call. Measured throughput
improvement on an internal end-to-end workload.
@gpinkert force-pushed the kejoseph/perf/inline-range-for-body branch from 4e5bc24 to 895dfc3 on April 25, 2026 05:50
@gpinkert

/run-ci

@yaoliu13
Collaborator

/run-ci

@kevinjosephamd
Author

Closing this PR out since we're unable to replicate the performance improvements originally observed.

@yaoliu13
Collaborator

/run-ci

5 participants