[XLA:GPU] Make load/store logic more general for vectorization #7980

Solaryee · 2023-12-21T03:38:20Z

This PR was originally submitted as part of #7940.
It aims to make the load/store logic more general so that it can be optimized into a vector load/store pattern. It changes the IR from

%linear_index_plus_base = add nuw nsw i32 %linear_index_base, %loop.indvar
%linear_index1 = add nuw nsw i32 %linear_index_plus_base, 1
%linear_index2 = add nuw nsw i32 %linear_index_plus_base, 2
%linear_index3 = add nuw nsw i32 %linear_index_plus_base, 3

%21 = getelementptr inbounds float, ptr %0, i32 %linear_index_plus_base
%22 = load float, ptr %21, align 4, !invariant.load !4
%26 = getelementptr inbounds float, ptr %0, i32 %linear_index1
%27 = load float, ptr %26, align 4, !invariant.load !4
%31 = getelementptr inbounds float, ptr %0, i32 %linear_index2
%32 = load float, ptr %31, align 4, !invariant.load !4
%36 = getelementptr inbounds float, ptr %0, i32 %linear_index3
%37 = load float, ptr %36, align 4, !invariant.load !4

to

%linear_index_plus_base = add nuw nsw i32 %linear_index_base, %loop.indvar

%21 = getelementptr float, ptr %0, i32 %linear_index_plus_base
%22 = getelementptr inbounds float, ptr %21, i32 0
%23 = load float, ptr %22, align 4, !invariant.load !4
%29 = getelementptr float, ptr %0, i32 %linear_index_plus_base
%30 = getelementptr inbounds float, ptr %29, i32 1
%31 = load float, ptr %30, align 4, !invariant.load !4
%37 = getelementptr float, ptr %0, i32 %linear_index_plus_base
%38 = getelementptr inbounds float, ptr %37, i32 2
%39 = load float, ptr %38, align 4, !invariant.load !4
%45 = getelementptr float, ptr %0, i32 %linear_index_plus_base
%46 = getelementptr inbounds float, ptr %45, i32 3
%47 = load float, ptr %46, align 4, !invariant.load !4

The former pattern does not always vectorize on different backends, since it needs an additional pass to handle the GEP pattern.
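To illustrate why the two IR forms are equivalent, here is a minimal C++ sketch that models both addressing schemes with plain pointer arithmetic. It is not XLA code: the function names `LoadWithLinearIndices` and `LoadWithBasePlusOffset` are hypothetical, and the unroll factor of 4 is taken from the IR above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Models the original IR: every element gets its own linear index
// (index, index + 1, index + 2, index + 3), and each address is
// computed from the start of the buffer.
std::array<float, 4> LoadWithLinearIndices(const float* buffer,
                                           std::size_t linear_index_plus_base) {
  assert(buffer != nullptr);
  std::array<float, 4> out{};
  for (std::size_t i = 0; i < 4; ++i) {
    const std::size_t linear_index = linear_index_plus_base + i;
    out[i] = *(buffer + linear_index);
  }
  return out;
}

// Models the new IR: one shared base pointer, then constant offsets 0..3.
// The four loads visibly share a pointer and use sequential indices.
std::array<float, 4> LoadWithBasePlusOffset(const float* buffer,
                                            std::size_t linear_index_plus_base) {
  assert(buffer != nullptr);
  const float* base = buffer + linear_index_plus_base;
  std::array<float, 4> out{};
  for (std::size_t i = 0; i < 4; ++i) {
    out[i] = *(base + i);
  }
  return out;
}
```

Both functions read the same four floats. The second form only makes the shared base and the constant offsets explicit, which is the shape the PR relies on for vectorization.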

Only ~20 lines of core code change; the rest are unit test (UT) updates.

tdanyluk · 2024-01-04T14:55:11Z

xla/service/gpu/parallel_loop_emitter.cc

  llvm::Value* row_index = nullptr;
  if (!launch_config_.row_vectorized) {
-    array_indices.emplace_back(linear_index_base, shape_, b_);
+    llvm::Value* linear_index =


Thank you for the PR.
Could you please add a comment explaining why the Add operation is needed when the added value is 0?

Updated with my comment.

// The add operation is needed even if the offset is 0, since when the
// kernel is unrolled, the following GEP instruction shares the same pointer
// and sequential indices with others, allowing the default SLP pass to
// optimize them into vectorized load/store operations.
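The point of the comment can be sketched as follows, assuming each unrolled element's address is emitted as the two-GEP pair shown in the PR description. This is a hypothetical text generator, not the emitter in `parallel_loop_emitter.cc`: the function `EmitUnrolledAddresses` and the value names `%buf`, `%baseN`, and `%addrN` are made up for illustration. Offset 0 goes through the same path as offsets 1..3, so every element ends up with an identically shaped address computation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: returns textual IR for the address computation of
// each unrolled element. Offset 0 is deliberately not special-cased.
std::vector<std::string> EmitUnrolledAddresses(int unroll_factor) {
  assert(unroll_factor > 0);
  std::vector<std::string> lines;
  for (int offset = 0; offset < unroll_factor; ++offset) {
    const std::string id = std::to_string(offset);
    // Shared base: the same pointer and index for every element.
    lines.push_back("%base" + id +
                    " = getelementptr float, ptr %buf, "
                    "i32 %linear_index_plus_base");
    // Constant, sequential offset from the shared base.
    lines.push_back("%addr" + id + " = getelementptr inbounds float, ptr %base" +
                    id + ", i32 " + id);
  }
  return lines;
}
```

Calling `EmitUnrolledAddresses(4)` yields eight lines whose second GEPs differ only in the constant index (0, 1, 2, 3), mirroring the "to" IR in the description.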

tdanyluk

LGTM

Imported from GitHub PR openxla/xla#7980
Copybara import of the project:

-- df1286a48461dd6337c841d4a353b694ce60bf86 by Sheng, Yang <yang.sheng@intel.com>:

Make vector load/store logic more general

-- 66f0782ad62e90e01fd8de7b41d20d607f974844 by Sheng, Yang <yang.sheng@intel.com>:

fix IR in UTs

-- b944ee015c177d4180f9cba6b564cf72e1c80bbc by Sheng, Yang <yang.sheng@intel.com>:

Add comments

Merging this change closes #7980

PiperOrigin-RevId: 595953368

Make vector load/store logic more general

df1286a

github-actions bot added the kokoro:force-run Forces CI to rerun label Dec 21, 2023

github-actions bot assigned kamaljeeti and xla-rotation Dec 21, 2023

kokoro-team removed the kokoro:force-run Forces CI to rerun label Dec 21, 2023

kamaljeeti requested a review from sergeykozub December 21, 2023 05:28

fix IR in UTs

66f0782

github-actions bot added the kokoro:force-run Forces CI to rerun label Dec 21, 2023

kokoro-team removed the kokoro:force-run Forces CI to rerun label Dec 21, 2023

kamaljeeti requested a review from tdanyluk January 2, 2024 06:20

tdanyluk suggested changes Jan 4, 2024


Add comments

b944ee0

github-actions bot added the kokoro:force-run Forces CI to rerun label Jan 5, 2024

kokoro-team removed the kokoro:force-run Forces CI to rerun label Jan 5, 2024

tdanyluk approved these changes Jan 5, 2024


copybara-service bot closed this in 47b68d4 Jan 5, 2024
