Conversation

Contributor

@Solaryee Solaryee commented Dec 20, 2023

This PR adds LLVM codegen changes for the SPIRV backend. It contains the SPIRV target and related intrinsics.

// for those dimensions. This helps LLVM generate vectorized code
// in those cases.
llvm::Value* row_index = nullptr;
if (!launch_config_.row_vectorized) {
Member

As far as I understand, these changes are just an NFC refactoring and are not related to the real changes being done in this PR? If so, maybe split them into their own PR?
I would say the new code is not necessarily easier to read. I agree with the changes that swap the if/else blocks and get rid of the negation, but I would prefer if the loop still started from 1 and we handled the i = 0 case outside the loop.

Contributor Author

Yes, this is an NFC refactoring; I can put it in another PR.

Contributor Author

Done. This PR only contains the SPIR target/intrinsics now.

Contributor

@cheshire cheshire left a comment

Is it possible to test those changes? How do we plan to trigger Intel codegen codepath?

Member

@penpornk penpornk left a comment

Thank you for the PR!

Comment on lines +58 to +65
return {
llvm::Intrinsic::nvvm_read_ptx_sreg_tid_x,
llvm::Intrinsic::amdgcn_workitem_id_x,
[](llvm::IRBuilder<>* b_) -> llvm::CallInst* {
return EmitDeviceFunctionCall(
"_Z32__spirv_BuiltInLocalInvocationIdi", {b_->getInt32(0)},
{U32}, U64, {b_->getContext()}, b_);
},
Member

I think this can be refactored into a templated struct because we only need the code for one device backend at a time. But I can look into that after this PR.

Contributor

+1, there's quite a bit of duplication across those EmitDeviceFunctionCall calls; maybe just extracting it into a function would be better?

Member

That would make it cleaner too :)

@penpornk
Member

Re: @cheshire:

Is it possible to test those changes? How do we plan to trigger Intel codegen codepath?

@ShengYang1 Do you already have a CI running? Are the results publicly available now? If so, could you please share the CI link here for the time being? (We can look into adding the CI link to our github check dashboard later.)

Member

@penpornk penpornk left a comment

More feedback from our code check. (For these functions, we prefer the std version.)

@Zantares

Zantares commented Dec 22, 2023

Re: @cheshire:

Is it possible to test those changes? How do we plan to trigger Intel codegen codepath?

@ShengYang1 Do you already have a CI running? Are the results publicly available now? If so, could you please share the CI link here for the time being? (We can look into adding the CI link to our github check dashboard later.)

According to the suggestion in the RFC, we have an initial CI plan here and welcome your suggestions:

  1. We plan to enable post-submit CI like AMD's, but we would also like to support more capabilities (triggering GitHub Actions on demand) if OpenXLA is interested :)
  2. The current public Intel CI can't be used to track the OpenXLA main branch because it is based on the Extension with a fixed OpenXLA commit. Considering the rebase effort, we hope to fully enable the CI after the Runtime PR (the largest and most important one) is merged; then we can directly build the Intel GPU target and check it without the Extension. Alternatively, the pending codegen PRs could wait until the Runtime PR is submitted and be tested together.
  3. We will make sure these PRs won't impact existing targets (checked by the current OpenXLA pre-submit CI)

Member

@penpornk penpornk left a comment

Thank you for the changes!
@cheshire Are you okay with us going ahead and merging the PR now?

copybara-service bot pushed a commit that referenced this pull request Jan 5, 2024
Imported from GitHub PR #7940

This PR adds LLVM codegen changes for the SPIRV backend. It contains the SPIRV target and related intrinsics.
Copybara import of the project:

--
db35320 by Sheng, Yang <yang.sheng@intel.com>:

SPIR changes

Merging this change closes #7940

FUTURE_COPYBARA_INTEGRATE_REVIEW=#7940 from Intel-tensorflow:yang/spir db35320
PiperOrigin-RevId: 595934626
@cheshire
Copy link
Contributor

cheshire commented Jan 5, 2024

I think we should not merge this just yet; without tests and an integration plan the diff by itself makes little sense.

Is there a working branch where all this code works end-to-end? How many tests are passing now? What is the desired end state? What are the CI plans?

copybara-service bot pushed a commit that referenced this pull request Jan 5, 2024
Imported from GitHub PR #7980

This PR was originally submitted as part of #7940.
It aims to make the load/store logic more general so that it can be optimized into a vector load/store pattern. It changes the IR from
```llvm
%linear_index_plus_base = add nuw nsw i32 %linear_index_base, %loop.indvar
%linear_index1 = add nuw nsw i32 %linear_index_plus_base, 1
%linear_index2 = add nuw nsw i32 %linear_index_plus_base, 2
%linear_index3 = add nuw nsw i32 %linear_index_plus_base, 3

%21 = getelementptr inbounds float, ptr %0, i32 %linear_index_plus_base
%22 = load float, ptr %21, align 4, !invariant.load !4
%26 = getelementptr inbounds float, ptr %0, i32 %linear_index1
%27 = load float, ptr %26, align 4, !invariant.load !4
%31 = getelementptr inbounds float, ptr %0, i32 %linear_index2
%32 = load float, ptr %31, align 4, !invariant.load !4
%36 = getelementptr inbounds float, ptr %0, i32 %linear_index3
%37 = load float, ptr %36, align 4, !invariant.load !4
```
to
```llvm
%linear_index_plus_base = add nuw nsw i32 %linear_index_base, %loop.indvar

%21 = getelementptr float, ptr %0, i32 %linear_index_plus_base
%22 = getelementptr inbounds float, ptr %21, i32 0
%23 = load float, ptr %22, align 4, !invariant.load !4
%29 = getelementptr float, ptr %0, i32 %linear_index_plus_base
%30 = getelementptr inbounds float, ptr %29, i32 1
%31 = load float, ptr %30, align 4, !invariant.load !4
%37 = getelementptr float, ptr %0, i32 %linear_index_plus_base
%38 = getelementptr inbounds float, ptr %37, i32 2
%39 = load float, ptr %38, align 4, !invariant.load !4
%45 = getelementptr float, ptr %0, i32 %linear_index_plus_base
%46 = getelementptr inbounds float, ptr %45, i32 3
%47 = load float, ptr %46, align 4, !invariant.load !4
```
The former does not always work across backends, since it needs an additional pass to handle the GEP pattern.

There are only ~20 lines of core code changes; the rest are unit-test changes.
Copybara import of the project:

--
df1286a by Sheng, Yang <yang.sheng@intel.com>:

Make vector load/store logic more general

--
66f0782 by Sheng, Yang <yang.sheng@intel.com>:

fix IR in UTs

--
b944ee0 by Sheng, Yang <yang.sheng@intel.com>:

Add comments

Merging this change closes #7980

COPYBARA_INTEGRATE_REVIEW=#7980 from Intel-tensorflow:yang/vec b944ee0
PiperOrigin-RevId: 595953368
copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Jan 5, 2024
Imported from GitHub PR openxla/xla#7980 (same description as the copybara import above).
Copybara import of the project:

--
df1286a48461dd6337c841d4a353b694ce60bf86 by Sheng, Yang <yang.sheng@intel.com>:

Make vector load/store logic more general

--
66f0782ad62e90e01fd8de7b41d20d607f974844 by Sheng, Yang <yang.sheng@intel.com>:

fix IR in UTs

--
b944ee015c177d4180f9cba6b564cf72e1c80bbc by Sheng, Yang <yang.sheng@intel.com>:

Add comments

Merging this change closes #7980

PiperOrigin-RevId: 595953368
@Zantares

Zantares commented Jan 8, 2024

Hi @cheshire, I hope the info below answers your questions:

  1. The integration plan is listed in the RFC: https://github.com/openxla/community/blob/49541f1e8b0bdba6ca003797a5ae01f4a2f8cbb7/rfcs/20231102-intel-gpu.md#overview. The basic functionality can be enabled with PR1 & PR3, which is why we suggest testing it after these 2 PRs are merged.

    • SPIRV target (this PR)
    • OneDNN lib call. This part will follow the third-party library call approach in the CPU RFC ([RFC] OpenXLA CPU Strategy community#96), so it will be deferred until the related CPU changes are ready
    • SYCL runtime (to be submitted)
  2. The CI plan is also roughly listed in the RFC: https://github.com/openxla/community/blob/49541f1e8b0bdba6ca003797a5ae01f4a2f8cbb7/rfcs/20231102-intel-gpu.md#engineering-impact. Furthermore, we have enabled such a public CI based on Intel® Extension for OpenXLA* and shared it in the OpenXLA community meeting. Right now it only works with a specific OpenXLA commit and the Intel Extension; once the related PRs are merged, it can work with the OpenXLA main branch.

  3. Is there a working branch where all this code works end-to-end? How many tests are passing now?
    Yes, as mentioned above, we have all these changes at a specific OpenXLA commit (working with the Intel Extension), and we pass 95 JAX native UTs from the public JAX repo and several popular models here.

  4. What is the desired end state?
    Any hardware based on generic SPIRV/SYCL can be functionally enabled after all PRs are merged. To get advanced performance, a plug-in/extension is still needed for critical operations.

@cheshire
Contributor

OK!

Contributor

@cheshire cheshire left a comment

Two nits.

Comment on lines +58 to +65
return {
llvm::Intrinsic::nvvm_read_ptx_sreg_tid_x,
llvm::Intrinsic::amdgcn_workitem_id_x,
[](llvm::IRBuilder<>* b_) -> llvm::CallInst* {
return EmitDeviceFunctionCall(
"_Z32__spirv_BuiltInLocalInvocationIdi", {b_->getInt32(0)},
{U32}, U64, {b_->getContext()}, b_);
},
Contributor

+1, there's quite a bit of duplication across those EmitDeviceFunctionCall calls; maybe just extracting it into a function would be better?

gpu_root_names.spir_root == "_Z15__spirv_ocl_pow" ||
gpu_root_names.spir_root == "_Z17__spirv_ocl_atan2" ||
gpu_root_names.spir_root == "_Z16__spirv_ocl_fmod")
return StrCat(gpu_root_names.spir_root, "ff");
Contributor

Nit: XLA-preferred style is to use braces everywhere, even for single lines.

Contributor Author

Done.

// Helper function to emit call to SPIR shfl_down intrinsic.
llvm::Value* EmitSPIRShflDown(llvm::Value* value, llvm::Value* offset,
llvm::IRBuilder<>* b) {
llvm::Module* module = b->GetInsertBlock()->getModule();
Member

Internally the build is broken due to this unused variable and the one below. Can you please remove these two lines?

Contributor Author

Done.

Member

@akuegel akuegel left a comment

Thank you for the quick fix :)

copybara-service bot pushed a commit to tensorflow/tensorflow that referenced this pull request Jan 15, 2024
Imported from GitHub PR openxla/xla#7940

This PR adds LLVM codegen changes for the SPIRV backend. It contains the SPIRV target and related intrinsics.
Copybara import of the project:

--
5e2116a277b01777cd319ecb484106b4848a920b by Sheng, Yang <yang.sheng@intel.com>:

SPIR changes

--
3695b57a08c8e2f353c670596b691a9479ed2012 by Sheng, Yang <yang.sheng@intel.com>:

Remove unused variable

Merging this change closes #7940

PiperOrigin-RevId: 598587122