
[pull] master from tensorflow:master #1635

Merged
pull[bot] merged 21 commits into makesoftwaresafe:master from tensorflow:master
May 11, 2026

Conversation


pull[bot] commented May 11, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

mdfaijul and others added 21 commits May 10, 2026 23:07
…SyclExecutor::LoadKernel

Imported from GitHub PR openxla/xla#42206

This PR fixes a bug in `SyclExecutor::LoadKernel` that has been exposed when `MlirKernelFusion` emits `CustomKernelThunk` instead of `KernelThunk` (ref: openxla/xla@c986fc2). Without this fix, many tests fail at launch time with:

```
INTERNAL: Kernel is missing a custom arguments packing function for device memory arguments array
```

Copybara import of the project:

--
fb58f0460b3437f9ff48a059e2a1ea8cc82383f5 by Faijul Amin <md.faijul.amin@intel.com>:

Fix kernel arguments packing

--
bb83029df9e79aa7dc07c305386a011d46bbf82c by Faijul Amin <md.faijul.amin@intel.com>:

Fix DWYU error

--
e523dc4e770f2411eba1cf58e316df9ae02c2b43 by Faijul Amin <md.faijul.amin@intel.com>:

Add missing dependency

Merging this change closes #42206

PiperOrigin-RevId: 913506507
PiperOrigin-RevId: 913511527
There's a possibility that the early-abort status is returned last depending on
the thread scheduling order. To address this, store a Status instead of a bool,
which also requires switching away from std::atomic since Status isn't
trivially copyable.

While here, use status matchers in the unit tests to improve error messages.

The test now passes 500/500 times.

PiperOrigin-RevId: 913536234
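The fix described above can be sketched as follows. This is a minimal illustration with hypothetical names (`Status`, `FirstErrorCollector` are stand-ins, not the actual XLA code): the first non-OK status reported by any worker thread wins, even if the early-abort status happens to arrive last. Because a rich status object is not trivially copyable, it cannot live in a `std::atomic`; a mutex guards it instead.

```cpp
#include <cassert>
#include <mutex>
#include <string>

struct Status {  // stand-in for absl::Status
  bool ok = true;
  std::string message;
};

class FirstErrorCollector {
 public:
  // Records only the first failure; later failures (including a
  // late-arriving early-abort status) do not overwrite it.
  void Update(const Status& s) {
    if (s.ok) return;
    std::lock_guard<std::mutex> lock(mu_);
    if (first_error_.ok) first_error_ = s;
  }

  Status status() const {
    std::lock_guard<std::mutex> lock(mu_);
    return first_error_;
  }

 private:
  mutable std::mutex mu_;
  Status first_error_;
};
```

Regardless of the order in which threads report, readers always observe the first recorded error rather than whichever write happened to land last.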
Use `absl::call_once` (which is a blocking call) in `TFE_Py_TapeSetNew`, `TFE_Py_VariableWatcherNew`, and `TFE_Py_ForwardAccumulatorNew` to ensure thread-safe one-time initialization of global Python type objects (`TFE_Py_Tape_Type`, `TFE_Py_VariableWatcher_Type`, `TFE_Py_ForwardAccumulator_Type`).

PiperOrigin-RevId: 913546030
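The same one-time-initialization pattern can be sketched with the standard library's `std::call_once`, which, like `absl::call_once`, blocks concurrent callers until initialization completes. The names below are illustrative stand-ins, not the actual `TFE_Py_*` code:

```cpp
#include <cassert>
#include <mutex>

static std::once_flag g_type_init_flag;
static int* g_type_object = nullptr;  // stand-in for a global PyTypeObject*

int* GetTypeObject() {
  // All threads block here until exactly one of them has run the lambda,
  // so no caller can observe a half-initialized g_type_object.
  std::call_once(g_type_init_flag, [] { g_type_object = new int(42); });
  return g_type_object;
}
```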
… in onednn_cc_test bazel functions

Imported from GitHub PR openxla/xla#42192

A [commit](openxla/xla@b49f469) recently added the `manual` tag to oneDNN unit tests because they were failing when `ENABLE_ONEDNN_ASYNC` is not defined. However, the `manual` tag also prevents running these tests via wildcard patterns like `bazel test //xla/service/cpu/tests/...`.

Instead of unconditionally adding the `manual` tag, this PR switches `onednn_cc_test` to use `if_onednn_async` so these targets only build when the oneDNN async runtime is enabled (`--define=build_with_onednn_async=true`), rather than for all oneDNN-enabled platforms.
Copybara import of the project:

--
45c886e2bf17b80628446bbdf5d786cc5b5603ed by Om Thakkar <om.thakkar@intel.com>:

use if_onednn_async instead of if_onednn in onednn_cc_* bazel functions

--
02c9d7b1ee9cbe4d6a310fedafe20f845d1bfa4c by Om Thakkar <om.thakkar@intel.com>:

use onednn_cc_test instead of xla_cc_test for onednn ln/softmax tests

--
778d02289706b8dec724b637849adf499ecc6556 by Om Thakkar <om.thakkar@intel.com>:

restrict the use of if_onednn_async to onednn_cc_test only

Merging this change closes #42192

PiperOrigin-RevId: 913555019
This refactors CreateSearchSpace into a struct that holds necessary information to build a gemm fusion - most of the code here is the same.

We then use this new, smaller HLO module to create a fusion within it. This allows us to build out a fusion op-by-op without worrying about touching the original module. Before fusing, we check that the resulting fusion would be tileable, and only then include it.

This version doesn't do any profitability checks yet and it does not hoist bitcasts.

PiperOrigin-RevId: 913563298
PiperOrigin-RevId: 913565823
…not needed

Imported from GitHub PR openxla/xla#42236

Instead of ignoring the flag in the allocator, set it to false in `CudaExecutor::Init` if the device doesn't have P2P links.
Copybara import of the project:

--
9d47013cdb1b6612917f072e921e8849fe902ab8 by Eugene Zhulenev <ezhulenev@openxla.org>:

[xla:gpu] Disable fabric handles in CudaExecutor::Init if not needed

--
97c4ab60d08253c440212537b423b7dc205c630d by Eugene Zhulenev <ezhulenev@openxla.org>:

Initialize CudaVmmAllocator::Options based on executor flags

Merging this change closes #42236

PiperOrigin-RevId: 913566399
PiperOrigin-RevId: 913572604
Updating func arg attributes one-by-one leads to the creation of a quadratic number of pointers, i.e. N^2 pointers for N results.

PiperOrigin-RevId: 913579525
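Why one-by-one updates are quadratic: when the attribute list is immutable (as MLIR attribute dictionaries are), each single-slot update copies all N entries. A toy sketch with a copy counter (`WithUpdatedSlot` is a hypothetical stand-in, not the MLIR API) shows N updates costing O(N^2) copies versus O(N) for building the final list once:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

static std::size_t g_copies = 0;

// Toy immutable list: "updating" one slot returns a full copy.
std::vector<int> WithUpdatedSlot(const std::vector<int>& attrs,
                                 std::size_t i, int value) {
  std::vector<int> copy = attrs;  // copies all N entries
  g_copies += attrs.size();
  copy[i] = value;
  return copy;
}

std::size_t CopiesForOneByOne(std::size_t n) {
  g_copies = 0;
  std::vector<int> attrs(n, 0);
  for (std::size_t i = 0; i < n; ++i)
    attrs = WithUpdatedSlot(attrs, i, 1);  // N updates, N copies each
  return g_copies;
}

std::size_t CopiesForBatch(std::size_t n) {
  g_copies = 0;
  std::vector<int> attrs(n);  // build the final list once
  for (std::size_t i = 0; i < n; ++i) attrs[i] = 1;
  g_copies += n;  // one final copy into place
  return g_copies;
}
```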
ForEach is used with flat_hash_map and flat_hash_set iterators, where std::distance has O(n) cost. Microbenchmark for flat_hash_set: 38x speedup for 100K elements, 5.75x for 10K elements.

```
Before (Using std::distance):
  -----------------------------------------------------------------------
  Benchmark                             Time             CPU   Iterations
  -----------------------------------------------------------------------
  BM_ForEachIterateCost/10000   321409909 ns    321252488 ns            2
  BM_ForEachIterateCost/100000 2.1894e+10 ns   2.1889e+10 ns            1

  After (Using scalar loop counter):
  -----------------------------------------------------------------------
  Benchmark                             Time             CPU   Iterations
  -----------------------------------------------------------------------
  BM_ForEachIterateCost/10000    56443170 ns     55800863 ns           13
  BM_ForEachIterateCost/100000  584972023 ns    574674974 ns            1
```

PiperOrigin-RevId: 913595572
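A minimal sketch of the change, under the assumption described above: `std::distance(begin, it)` is O(n) for forward iterators (which is all that hash containers provide), so computing it per element makes a traversal O(n^2), while carrying a scalar counter keeps it O(n). `ForEachIndexed` is a hypothetical stand-in for the real ForEach helper, shown here with `std::unordered_set` in place of `flat_hash_set`:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

template <typename Container, typename Fn>
void ForEachIndexed(const Container& c, Fn fn) {
  std::size_t index = 0;  // O(1) per step
  for (auto it = c.begin(); it != c.end(); ++it, ++index) {
    // The slow version would recompute std::distance(c.begin(), it) here.
    fn(index, *it);
  }
}
```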
PiperOrigin-RevId: 913597452
PiperOrigin-RevId: 913599207
PiperOrigin-RevId: 913607767
Imported from GitHub PR openxla/xla#40464

📝 Summary of Changes
This PR is the fourth PR in a series of four.

PR 1/4: openxla/xla#40460
PR 2/4: openxla/xla#40462
PR 3/4: openxla/xla#40463
PR 4/4: openxla/xla#40464 <= This PR

This PR enables the ROCm Triton backend for AllReduce (collective emitter).
To this end:
- It extends the `TritonXLAImplementExternElementWise` function so that it accepts a target input parameter and calls it within the ROCm Triton pipeline.
- It adds the missing API `RocmExecutor::CanEnablePeerAccessTo(int other_device_ordinal)` to rocm_executor (this API is required to enable the collective_emitter thunk).

🎯 Justification
This PR is part of a rework of the AllReduce triton support to facilitate the enabling of different target GPUs (starting with ROCm).
Prior to these PRs, the triton_xla.get_tid, triton_xla.atomic_write and triton_xla.atomic_spin_wait operations were lowered using PTX assembly, so the AllReduce Triton backend was only available for the CUDA target.
These PRs aim to unify the lowering of the triton_xla.atomic_write, triton_xla.atomic_spin_wait, and triton_xla.get_tid operations for CUDA and ROCm targets.

🚀 Kind of Contribution
✨ New Feature

🧪 Unit Tests:
This PR includes a LIT test checking the lowering of atomic operations.

🧪 Execution Tests:
The existing test xla/tests:all_reduce_e2e_test is relevant for testing this support.
Copybara import of the project:

--
2cf50cd06a46a7cb65a96e45b0c9849043fb8eac by Maxime France-Pillois <mfrancep@amd.com>:

Enable AMD/ROCm support for AllReduce triton backend

--
0482141a26164dc68d374214e30e8f08709134fd by Maxime France-Pillois <mfrancep@amd.com>:

Add missing headers and deps

Merging this change closes #40464

PiperOrigin-RevId: 913612232
…attempt

Imported from GitHub PR openxla/xla#34572

Added CommandBuffer support for Convolution ops
Graph capture of convolutions is disabled by default (since not all convolutions can be captured).
It can be enabled through a newly added flag: `--xla_gpu_enable_command_buffers=+convolution`.

🎯 Justification
This op was missing, which results in graph fragmentation, especially for large models: one gets several (sometimes many) execution graphs instead of just one.

🚀 Kind of Contribution
✨ New Feature

🧪 Unit Tests:
Added new subtest to xla/service/gpu/transforms/command_buffer_scheduling_test.cc and xla/backends/gpu/runtime/command_buffer_conversion_pass_test.cc

This is a fork of the original PR which was reverted due to conv problems: openxla/xla#32053
Now convolution capture is disabled by default.

@beckerhe , @ezhulenev, @dimitar-asenov  please have another look !
Copybara import of the project:

--
abe76203921cc4405758c69bc92b1664279c0c5b by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

adding convolution to command buffers

added UTs  and convolution command

test fixes

added rebase fixes

capture only those convolution targets which are explicitly

Revert "adding coll permute and convolution to command buffers"

This reverts commit 75847e67261b4589162411c9846ed9c0b9fc1ed5.

added conv to command buffers

fixing build and test

fixing build

rewritten ConvolutionCmd, adapted command_buffer_conv_pass

some cosmetics

added extra param for convolution cmd buf

small fix

fix after rebase

fixing after rebase

simplified runner cache call

WIP fixing compile errors and new conv test

updated the conv subtest and ensure command buffer update is called

clang format

fixing test

applied clang format

applied clang format

fixing test

applied clang format

fixing command buffer conv test

applying clang format

clang format

fixing the test after rebase

fixing yet another conflict

--
13b02c7619ba0483c2993222c5e7d8a2329d438f by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

fixing build

--
1fc153c6233f159c54110e13212aed7b546c437c by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

cleanup

--
3e0f6c82d601394fc2afc1197d0d7efe6507ee33 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

clang format

--
bb3d5907ba75681c3c87fe595f4645ce664b03d4 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:

cleaning build deps

Merging this change closes #34572

PiperOrigin-RevId: 913612872
… logic across `MaybeConvertToV1/V2/Named`

PiperOrigin-RevId: 913631803
PiperOrigin-RevId: 913635240
pull[bot] locked and limited conversation to collaborators May 11, 2026
pull[bot] added the ⤵️ pull label May 11, 2026
pull[bot] merged commit ef5e2de into makesoftwaresafe:master May 11, 2026
0 of 4 checks passed
