-
Notifications
You must be signed in to change notification settings - Fork 721
[XLA:GPU] Make load/store logic more general for vectorization #7980
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| llvm::Value* row_index = nullptr; | ||
| if (!launch_config_.row_vectorized) { | ||
| array_indices.emplace_back(linear_index_base, shape_, b_); | ||
| llvm::Value* linear_index = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR.
Could you please add a comment why is the Add operation needed if the added value is 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated with my comment.
// The add operation is needed even if the offset is 0, since when the
// kernel is unrolled, the following GEP instruction shares the same pointer
// and sequential indices with others, allowing the default SLP pass to
// optimize them into vectorized load/store operations.
tdanyluk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Imported from GitHub PR openxla/xla#7980 This PR is originally submited as part of openxla/xla#7940. It aims to make load/store logic more general so that it can be optimized to vector load/store pattern. It changes IR from ```llvm %linear_index_plus_base = add nuw nsw i32 %linear_index_base, %loop.indvar %linear_index1 = add nuw nsw i32 %linear_index_plus_base, 1 %linear_index2 = add nuw nsw i32 %linear_index_plus_base, 2 %linear_index3 = add nuw nsw i32 %linear_index_plus_base, 3 %21 = getelementptr inbounds float, ptr %0, i32 %linear_index_plus_base %22 = load float, ptr %21, align 4, !invariant.load !4 %26 = getelementptr inbounds float, ptr %0, i32 %linear_index1 %27 = load float, ptr %26, align 4, !invariant.load !4 %31 = getelementptr inbounds float, ptr %0, i32 %linear_index2 %32 = load float, ptr %31, align 4, !invariant.load !4 %36 = getelementptr inbounds float, ptr %0, i32 %linear_index3 %37 = load float, ptr %36, align 4, !invariant.load !4 ``` to ```llvm %linear_index_plus_base = add nuw nsw i32 %linear_index_base, %loop.indvar %21 = getelementptr float, ptr %0, i32 %linear_index_plus_base %22 = getelementptr inbounds float, ptr %21, i32 0 %23 = load float, ptr %22, align 4, !invariant.load !4 %29 = getelementptr float, ptr %0, i32 %linear_index_plus_base %30 = getelementptr inbounds float, ptr %29, i32 1 %31 = load float, ptr %30, align 4, !invariant.load !4 %37 = getelementptr float, ptr %0, i32 %linear_index_plus_base %38 = getelementptr inbounds float, ptr %37, i32 2 %39 = load float, ptr %38, align 4, !invariant.load !4 %45 = getelementptr float, ptr %0, i32 %linear_index_plus_base %46 = getelementptr inbounds float, ptr %45, i32 3 %47 = load float, ptr %46, align 4, !invariant.load !4 ``` The former one does not always work for different backends since it needs additional pass to handle GEP pattern. There are only ~20 lines of core code changes, the others are all UTs changes. Copybara import of the project: -- df1286a48461dd6337c841d4a353b694ce60bf86 by Sheng, Yang <yang.sheng@intel.com>: Make vector load/store logic more general -- 66f0782ad62e90e01fd8de7b41d20d607f974844 by Sheng, Yang <yang.sheng@intel.com>: fix IR in UTs -- b944ee015c177d4180f9cba6b564cf72e1c80bbc by Sheng, Yang <yang.sheng@intel.com>: Add comments Merging this change closes #7980 PiperOrigin-RevId: 595953368
This PR is originally submited as part of #7940.
It aims to make load/store logic more general so that it can be optimized to vector load/store pattern. It changes IR from
to
The former one does not always work for different backends since it needs additional pass to handle GEP pattern.
There are only ~20 lines of core code changes, the others are all UTs changes.