
[pull] master from tensorflow:master #1635

Merged
pull[bot] merged 21 commits into makesoftwaresafe:master from tensorflow:master
May 11, 2026

Conversation


pull[bot] commented May 11, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

mdfaijul and others added 21 commits May 10, 2026 23:07
…SyclExecutor::LoadKernel

Imported from GitHub PR openxla/xla#42206

This PR fixes a bug in `SyclExecutor::LoadKernel` that has been exposed when `MlirKernelFusion` emits `CustomKernelThunk` instead of `KernelThunk` (ref: openxla/xla@c986fc2). Without this fix, many tests fail at launch time with:

```
INTERNAL: Kernel is missing a custom arguments packing function for device memory arguments array
```

Copybara import of the project:

--
fb58f0460b3437f9ff48a059e2a1ea8cc82383f5 by Faijul Amin <md.faijul.amin@intel.com>:

Fix kernel arguments packing

--
bb83029df9e79aa7dc07c305386a011d46bbf82c by Faijul Amin <md.faijul.amin@intel.com>:

Fix DWYU error

--
e523dc4e770f2411eba1cf58e316df9ae02c2b43 by Faijul Amin <md.faijul.amin@intel.com>:

Add missing dependency

Merging this change closes #42206

PiperOrigin-RevId: 913506507
PiperOrigin-RevId: 913511527
There's a possibility that the early-abort status is returned last depending on
the thread scheduling order. To address this, store a Status instead of a bool,
which also requires switching away from std::atomic since Status isn't
trivially copyable.

While here, use status matchers in the unit tests to improve error messages.

The test now passes 500/500 times.

PiperOrigin-RevId: 913536234
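The fix described above can be sketched as follows. This is a minimal illustration with hypothetical names (`Status`, `FirstErrorCollector` are stand-ins, not the actual XLA code): the first non-OK status reported by any worker thread wins, even if the early-abort status happens to arrive last. Because a rich status object is not trivially copyable, it cannot live in a `std::atomic`; a mutex guards it instead.

```cpp
#include <cassert>
#include <mutex>
#include <string>

struct Status {  // stand-in for absl::Status
  bool ok = true;
  std::string message;
};

class FirstErrorCollector {
 public:
  // Records only the first failure; later failures (including a
  // late-arriving early-abort status) do not overwrite it.
  void Update(const Status& s) {
    if (s.ok) return;
    std::lock_guard<std::mutex> lock(mu_);
    if (first_error_.ok) first_error_ = s;
  }

  Status status() const {
    std::lock_guard<std::mutex> lock(mu_);
    return first_error_;
  }

 private:
  mutable std::mutex mu_;
  Status first_error_;
};
```

Regardless of the order in which threads report, readers always observe the first recorded error rather than whichever write happened to land last.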
Use `absl::call_once` (which is a blocking call) in `TFE_Py_TapeSetNew`, `TFE_Py_VariableWatcherNew`, and `TFE_Py_ForwardAccumulatorNew` to ensure thread-safe one-time initialization of global Python type objects (`TFE_Py_Tape_Type`, `TFE_Py_VariableWatcher_Type`, `TFE_Py_ForwardAccumulator_Type`).

PiperOrigin-RevId: 913546030
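The same one-time-initialization pattern can be sketched with the standard library's `std::call_once`, which, like `absl::call_once`, blocks concurrent callers until initialization completes. The names below are illustrative stand-ins, not the actual `TFE_Py_*` code:

```cpp
#include <cassert>
#include <mutex>

static std::once_flag g_type_init_flag;
static int* g_type_object = nullptr;  // stand-in for a global PyTypeObject*

int* GetTypeObject() {
  // All threads block here until exactly one of them has run the lambda,
  // so no caller can observe a half-initialized g_type_object.
  std::call_once(g_type_init_flag, [] { g_type_object = new int(42); });
  return g_type_object;
}
```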
… in onednn_cc_test bazel functions

Imported from GitHub PR openxla/xla#42192

A [commit](openxla/xla@b49f469) recently added the `manual` tag to oneDNN unit tests because they were failing when `ENABLE_ONEDNN_ASYNC` is not defined. However, the `manual` tag also prevents running these tests via wildcard patterns like `bazel test //xla/service/cpu/tests/...`.

Instead of unconditionally adding the `manual` tag, this PR switches `onednn_cc_test` to use `if_onednn_async` so these targets only build when the oneDNN async runtime is enabled (`--define=build_with_onednn_async=true`), rather than for all oneDNN-enabled platforms.
Copybara import of the project:

--
45c886e2bf17b80628446bbdf5d786cc5b5603ed by Om Thakkar <om.thakkar@intel.com>:

use if_onednn_async instead of if_onednn in onednn_cc_* bazel functions

--
02c9d7b1ee9cbe4d6a310fedafe20f845d1bfa4c by Om Thakkar <om.thakkar@intel.com>:

use onednn_cc_test instead of xla_cc_test for onednn ln/softmax tests

--
778d02289706b8dec724b637849adf499ecc6556 by Om Thakkar <om.thakkar@intel.com>:

restrict the use of if_onednn_async to onednn_cc_test only

Merging this change closes #42192

PiperOrigin-RevId: 913555019
This refactors CreateSearchSpace into a struct that holds necessary information to build a gemm fusion - most of the code here is the same.

We then use this new, smaller HLO module to create a fusion within it. This allows us to build out a fusion op-by-op without worrying about touching the original module. Before fusing, we check that the resulting fusion would be tileable, and only then include it.

This version doesn't do any profitability checks yet and it does not hoist bitcasts.

PiperOrigin-RevId: 913563298
PiperOrigin-RevId: 913565823
…not needed

Imported from GitHub PR openxla/xla#42236

Instead of ignoring the flag in the allocator, set it to false in `CudaExecutor::Init` if the device doesn't have P2P links.
Copybara import of the project:

--
9d47013cdb1b6612917f072e921e8849fe902ab8 by Eugene Zhulenev <ezhulenev@openxla.org>:

[xla:gpu] Disable fabric handles in CudaExecutor::Init if not needed

--
97c4ab60d08253c440212537b423b7dc205c630d by Eugene Zhulenev <ezhulenev@openxla.org>:

Initialize CudaVmmAllocator::Options based on executor flags

Merging this change closes #42236

PiperOrigin-RevId: 913566399
PiperOrigin-RevId: 913572604
Updating func arg attributes one-by-one leads to the creation of a quadratic number of pointers, i.e. N^2 pointers for N results.

PiperOrigin-RevId: 913579525
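Why one-by-one updates are quadratic: when the attribute list is immutable (as MLIR attribute dictionaries are), each single-slot update copies all N entries. A toy sketch with a copy counter (`WithUpdatedSlot` is a hypothetical stand-in, not the MLIR API) shows N updates costing O(N^2) copies versus O(N) for building the final list once:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

static std::size_t g_copies = 0;

// Toy immutable list: "updating" one slot returns a full copy.
std::vector<int> WithUpdatedSlot(const std::vector<int>& attrs,
                                 std::size_t i, int value) {
  std::vector<int> copy = attrs;  // copies all N entries
  g_copies += attrs.size();
  copy[i] = value;
  return copy;
}

std::size_t CopiesForOneByOne(std::size_t n) {
  g_copies = 0;
  std::vector<int> attrs(n, 0);
  for (std::size_t i = 0; i < n; ++i)
    attrs = WithUpdatedSlot(attrs, i, 1);  // N updates, N copies each
  return g_copies;
}

std::size_t CopiesForBatch(std::size_t n) {
  g_copies = 0;
  std::vector<int> attrs(n);  // build the final list once
  for (std::size_t i = 0; i < n; ++i) attrs[i] = 1;
  g_copies += n;  // one final copy into place
  return g_copies;
}
```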
ForEach is used with flat_hash_map and flat_hash_set iterators, where std::distance has O(n) cost. Microbenchmark for flat_hash_set: 38x speedup for 100K elements, 5.75x for 10K elements.

```
Before (Using std::distance):
  -----------------------------------------------------------------------
  Benchmark                             Time             CPU   Iterations
  -----------------------------------------------------------------------
  BM_ForEachIterateCost/10000   321409909 ns    321252488 ns            2
  BM_ForEachIterateCost/100000 2.1894e+10 ns   2.1889e+10 ns            1

  After (Using scalar loop counter):
  -----------------------------------------------------------------------
  Benchmark                             Time             CPU   Iterations
  -----------------------------------------------------------------------
  BM_ForEachIterateCost/10000    56443170 ns     55800863 ns           13
  BM_ForEachIterateCost/100000  584972023 ns    574674974 ns            1
```

PiperOrigin-RevId: 913595572
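A minimal sketch of the change, under the assumption described above: `std::distance(begin, it)` is O(n) for forward iterators (which is all that hash containers provide), so computing it per element makes a traversal O(n^2), while carrying a scalar counter keeps it O(n). `ForEachIndexed` is a hypothetical stand-in for the real ForEach helper, shown here with `std::unordered_set` in place of `flat_hash_set`:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

template <typename Container, typename Fn>
void ForEachIndexed(const Container& c, Fn fn) {
  std::size_t index = 0;  // O(1) per step
  for (auto it = c.begin(); it != c.end(); ++it, ++index) {
    // The slow version would recompute std::distance(c.begin(), it) here.
    fn(index, *it);
  }
}
```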
PiperOrigin-RevId: 913597452
PiperOrigin-RevId: 913599207
PiperOrigin-RevId: 913607767
Imported from GitHub PR openxla/xla#40464

📝 Summary of Changes
This PR is the fourth PR in a series of four.

PR 1/4: openxla/xla#40460
PR 2/4: openxla/xla#40462
PR 3/4: openxla/xla#40463
PR 4/4: openxla/xla#40464 <= This PR

This PR enables the ROCm Triton backend for AllReduce (collective emitter).
To this end:
- It extends the `TritonXLAImplementExternElementWise` function so that it accepts a target input parameter and calls it within the ROCm Triton pipeline.
- It adds the missing API `RocmExecutor::CanEnablePeerAccessTo(int other_device_ordinal)` to rocm_executor (this API is required to enable the collective_emitter thunk).

🎯 Justification
This PR is part of a rework of the AllReduce triton support to facilitate the enabling of different target GPUs (starting with ROCm).
Prior to these PRs, the triton_xla.get_tid, triton_xla.atomic_write and triton_xla.atomic_spin_wait operations were lowered using PTX assembly, so the AllReduce Triton backend was only available for the CUDA target.
These PRs aim to unify the lowering of the triton_xla.atomic_write, triton_xla.atomic_spin_wait, and triton_xla.get_tid operations for CUDA and ROCm targets.

🚀 Kind of Contribution
✨ New Feature

🧪 Unit Tests:
This PR includes a LIT test checking the lowering of atomic operations.

🧪 Execution Tests:
The existing test xla/tests:all_reduce_e2e_test is relevant for testing this support.
Copybara import of the project:

--
2cf50cd06a46a7cb65a96e45b0c9849043fb8eac by Maxime France-Pillois <mfrancep@amd.com>:

Enable AMD/ROCm support for AllReduce triton backend

--
0482141a26164dc68d374214e30e8f08709134fd by Maxime France-Pillois <mfrancep@amd.com>:

Add missing headers and deps

Merging this change closes #40464

PiperOrigin-RevId: 913612232
…attempt

Imported from GitHub PR openxla/xla#34572

Added CommandBuffer support for Convolution ops
Graph capture of convolutions is disabled by default (since not all convolutions can be captured).
It can be enabled through a newly added flag: `--xla_gpu_enable_command_buffers=+convolution`.

🎯 Justification
This op was missing, which results in graph fragmentation, especially for large models: one gets several (sometimes many) execution graphs instead of just one.

🚀 Kind of Contribution
✨ New Feature

🧪 Unit Tests:
Added new subtest to xla/service/gpu/transforms/command_buffer_scheduling_test.cc and xla/backends/gpu/runtime/command_buffer_conversion_pass_test.cc

This is a fork of the original PR which was reverted due to conv problems: openxla/xla#32053
Now convolution capture is disabled by default.

@beckerhe , @ezhulenev, @dimitar-asenov  please have another look !
Copybara import of the project:

--
abe76203921cc4405758c69bc92b1664279c0c5b by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

adding convolution to command buffers

added UTs  and convolution command

test fixes

added rebase fixes

capture only those convolution targets which are explicitly

Revert "adding coll permute and convolution to command buffers"

This reverts commit 75847e67261b4589162411c9846ed9c0b9fc1ed5.

added conv to command buffers

fixing build and test

fixing build

rewritten ConvolutionCmd, adapted command_buffer_conv_pass

some cosmetics

added extra param for convolution cmd buf

small fix

fix after rebase

fixing after rebase

simplified runner cache call

WIP fixing compile errors and new conv test

updated the conv subtest and ensure command buffer update is called

clang format

fixing test

applied clang format

applied clang format

fixing test

applied clang format

fixing command buffer conv test

applying clang format

clang format

fixing the test after rebase

fixing yet another conflict

--
13b02c7619ba0483c2993222c5e7d8a2329d438f by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

fixing build

--
1fc153c6233f159c54110e13212aed7b546c437c by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

cleanup

--
3e0f6c82d601394fc2afc1197d0d7efe6507ee33 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

clang format

--
bb3d5907ba75681c3c87fe595f4645ce664b03d4 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

cleaning build deps

Merging this change closes #34572

PiperOrigin-RevId: 913612872
… logic across `MaybeConvertToV1/V2/Named`

PiperOrigin-RevId: 913631803
PiperOrigin-RevId: 913635240
pull[bot] locked and limited conversation to collaborators May 11, 2026
pull[bot] added the ⤵️ pull label May 11, 2026
pull[bot] merged commit ef5e2de into makesoftwaresafe:master May 11, 2026
0 of 4 checks passed
