[pull] master from tensorflow:master#1635
Merged
…SyclExecutor::LoadKernel

Imported from GitHub PR openxla/xla#42206

This PR fixes a bug in `SyclExecutor::LoadKernel` that was exposed when `MlirKernelFusion` emits `CustomKernelThunk` instead of `KernelThunk` (ref: openxla/xla@c986fc2). Without this fix, many tests fail at launch time with:

```
INTERNAL: Kernel is missing a custom arguments packing function for device memory arguments array
```

Copybara import of the project:

-- fb58f0460b3437f9ff48a059e2a1ea8cc82383f5 by Faijul Amin <md.faijul.amin@intel.com>:
Fix kernel arguments packing

-- bb83029df9e79aa7dc07c305386a011d46bbf82c by Faijul Amin <md.faijul.amin@intel.com>:
Fix DWYU error

-- e523dc4e770f2411eba1cf58e316df9ae02c2b43 by Faijul Amin <md.faijul.amin@intel.com>:
Add missing dependency

Merging this change closes #42206

PiperOrigin-RevId: 913506507
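To illustrate the failure mode, here is a minimal sketch of a kernel-arguments packing hook. All names (`DeviceMemory`, `ArgsPacker`, `LoadedKernel`, `LoadKernelWithDefaultPacker`) are hypothetical stand-ins, not the actual StreamExecutor API; the point is only that a kernel loaded without a packer set cannot be launched through a custom-kernel path.

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-ins: given raw device pointers, a packer produces the
// flat argument buffer that a kernel launch expects. A CustomKernelThunk-style
// launch path goes through this hook, so a kernel loaded without one fails at
// launch time ("missing a custom arguments packing function").
using DeviceMemory = void*;
using PackedArgs = std::vector<void*>;
using ArgsPacker = std::function<PackedArgs(const std::vector<DeviceMemory>&)>;

struct LoadedKernel {
  ArgsPacker packer;  // must be non-empty before launching via the custom path
};

// Sketch of the shape of the fix: LoadKernel installs a default packer
// (identity layout, one slot per device argument) instead of leaving the
// hook empty.
LoadedKernel LoadKernelWithDefaultPacker() {
  LoadedKernel k;
  k.packer = [](const std::vector<DeviceMemory>& args) {
    return PackedArgs(args.begin(), args.end());  // one slot per device arg
  };
  return k;
}
```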
PiperOrigin-RevId: 913511527
There's a possibility that the early-abort status is returned last, depending on the thread scheduling order. To address this, store a Status instead of a bool; this also requires switching away from std::atomic, since Status isn't trivially copyable. While here, use status matchers in the unit tests to improve error messages. The test now passes 500/500 times.

PiperOrigin-RevId: 913536234
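The pattern described above can be sketched as follows. This is a minimal illustration, not the actual XLA code: `Status` is modeled as `std::optional<std::string>` and `FirstErrorCollector` is a hypothetical name. It shows why a mutex replaces `std::atomic`: `std::atomic<T>` requires `T` to be trivially copyable, which a message-carrying status type is not.

```cpp
#include <mutex>
#include <optional>
#include <string>

// Hypothetical stand-in for absl::Status: an error message, or nullopt if OK.
using Status = std::optional<std::string>;

// Records the first non-OK status seen across threads. std::atomic<Status>
// would not compile (Status is not trivially copyable), so a mutex guards
// the stored value instead of an atomic bool.
class FirstErrorCollector {
 public:
  void Record(const Status& s) {
    if (!s) return;  // OK status: nothing to record.
    std::lock_guard<std::mutex> lock(mu_);
    if (!first_error_) first_error_ = s;  // keep only the first error seen
  }

  Status first_error() const {
    std::lock_guard<std::mutex> lock(mu_);
    return first_error_;
  }

 private:
  mutable std::mutex mu_;
  Status first_error_;
};
```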
Use `absl::call_once` (which is a blocking call) in `TFE_Py_TapeSetNew`, `TFE_Py_VariableWatcherNew`, and `TFE_Py_ForwardAccumulatorNew` to ensure thread-safe one-time initialization of global Python type objects (`TFE_Py_Tape_Type`, `TFE_Py_VariableWatcher_Type`, `TFE_Py_ForwardAccumulator_Type`). PiperOrigin-RevId: 913546030
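The one-time-initialization pattern looks roughly like the sketch below, using the standard `std::call_once` (the same blocking semantics as `absl::call_once`). `PyTypeObjectStub`, `g_tape_type`, and `GetTapeType` are hypothetical names standing in for the real Python type objects.

```cpp
#include <mutex>

// Hypothetical stand-in for a Python type object that must be fully
// initialized exactly once before use (mirroring TFE_Py_Tape_Type et al.).
struct PyTypeObjectStub {
  bool ready = false;
};

PyTypeObjectStub g_tape_type;     // global, shared across threads
std::once_flag g_tape_type_once;  // guards the one-time initialization
int g_init_calls = 0;             // demonstration only: counts initializations

// Thread-safe accessor: std::call_once blocks concurrent callers until the
// first caller's initializer has finished, so no thread can observe a
// half-initialized type object.
PyTypeObjectStub* GetTapeType() {
  std::call_once(g_tape_type_once, [] {
    ++g_init_calls;             // runs exactly once, even under contention
    g_tape_type.ready = true;   // one-time setup of the global object
  });
  return &g_tape_type;
}
```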
… in onednn_cc_test bazel functions

Imported from GitHub PR openxla/xla#42192

A [commit](openxla/xla@b49f469) recently added the `manual` tag to oneDNN unit tests, since they were failing when `ENABLE_ONEDNN_ASYNC` is not defined. That does not work when one wants to run these tests via wildcard patterns like `bazel test //xla/service/cpu/tests/...`. Instead of unconditionally adding the `manual` tag, this PR switches `onednn_cc_test` to use `if_onednn_async`, so these targets only build when the oneDNN async runtime is enabled (`--define=build_with_onednn_async=true`) rather than for all oneDNN-enabled platforms.

Copybara import of the project:

-- 45c886e2bf17b80628446bbdf5d786cc5b5603ed by Om Thakkar <om.thakkar@intel.com>:
use if_onednn_async instead of if_onednn in onednn_cc_* bazel functions

-- 02c9d7b1ee9cbe4d6a310fedafe20f845d1bfa4c by Om Thakkar <om.thakkar@intel.com>:
use onednn_cc_test instead of xla_cc_test for onednn ln/softmax tests

-- 778d02289706b8dec724b637849adf499ecc6556 by Om Thakkar <om.thakkar@intel.com>:
restrict the use of if_onednn_async to onednn_cc_test only

Merging this change closes #42192

PiperOrigin-RevId: 913555019
This refactors CreateSearchSpace into a struct that holds the information needed to build a gemm fusion; most of the code here is unchanged. We then use a new, smaller HloModule to build a fusion op-by-op within it, which lets us make changes without worrying about touching the original module. Before fusing, we check that the resulting fusion would be tileable, and only then include it. This version does not yet do any profitability checks, and it does not hoist bitcasts.

PiperOrigin-RevId: 913563298
PiperOrigin-RevId: 913565823
…not needed

Imported from GitHub PR openxla/xla#42236

Instead of ignoring the flag in the allocator, set it to false in `CudaExecutor::Init` if the device doesn't have P2P links.

Copybara import of the project:

-- 9d47013cdb1b6612917f072e921e8849fe902ab8 by Eugene Zhulenev <ezhulenev@openxla.org>:
[xla:gpu] Disable fabric handles in CudaExecutor::Init if not needed

-- 97c4ab60d08253c440212537b423b7dc205c630d by Eugene Zhulenev <ezhulenev@openxla.org>:
Initialize CudaVmmAllocator::Options based on executor flags

Merging this change closes #42236

PiperOrigin-RevId: 913566399
PiperOrigin-RevId: 913572604
PiperOrigin-RevId: 913577057
Updating func arg attributes one-by-one leads to the creation of a quadratic number of pointers, i.e. N^2 pointers for N results.

PiperOrigin-RevId: 913579525
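The quadratic blow-up can be illustrated with a minimal sketch, independent of MLIR's actual attribute API (all names here are hypothetical). An immutable, copy-on-write attribute list means every single-entry "update" copies the whole N-element array, so N one-by-one updates allocate N^2 entries, while building the new list once is O(N).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Immutable attribute list, mimicking copy-on-write attribute storage:
// every "update" allocates a fresh N-element array.
using AttrList = std::vector<std::string>;

std::size_t g_elements_copied = 0;  // demonstration: total entries allocated

AttrList WithAttrReplaced(const AttrList& attrs, std::size_t i,
                          const std::string& value) {
  AttrList copy = attrs;             // copies all N entries
  g_elements_copied += attrs.size();
  copy[i] = value;
  return copy;
}

// One-by-one: N updates, each copying N entries => O(N^2) work.
AttrList UpdateOneByOne(AttrList attrs) {
  for (std::size_t i = 0; i < attrs.size(); ++i)
    attrs = WithAttrReplaced(attrs, i, "updated");
  return attrs;
}

// Batched: build the replacement list in a single pass => O(N) work.
AttrList UpdateBatched(const AttrList& attrs) {
  AttrList copy(attrs.size(), "updated");
  g_elements_copied += attrs.size();
  return copy;
}
```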
PiperOrigin-RevId: 913592196
ForEach is used with flat_hash_map & flat_hash_set iterators, where std::distance has O(n) cost. Microbenchmark for flat_hash_set: 38x speedup for 100K elements, 5.75x for 10K elements.

```
Before (Using std::distance):
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_ForEachIterateCost/10000   321409909 ns    321252488 ns            2
BM_ForEachIterateCost/100000 2.1894e+10 ns   2.1889e+10 ns            1

After (Using scalar loop counter):
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_ForEachIterateCost/10000    56443170 ns     55800863 ns           13
BM_ForEachIterateCost/100000  584972023 ns    574674974 ns            1
```

PiperOrigin-RevId: 913595572
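The counter trick can be sketched as follows (a hypothetical `ForEachChunked` over `std::unordered_set`, not the actual ForEach implementation). Hash-container iterators are forward iterators, so `std::distance(begin, it)` walks the container from the start, at O(n) cost per call; a scalar counter carried through a single pass yields the same element indices in O(1).

```cpp
#include <cstddef>
#include <unordered_set>

// Hypothetical ForEach over a forward-iterable container, processed in
// fixed-size chunks (e.g. one chunk per worker task). Instead of computing
// each element's position with std::distance (O(n) for hash containers),
// a scalar loop counter is incremented alongside the iterator, making the
// chunk index an O(1) computation per element.
template <typename Container, typename Fn>
void ForEachChunked(const Container& c, std::size_t chunk_size, Fn fn) {
  std::size_t index = 0;  // scalar loop counter: position of *it in one pass
  for (auto it = c.begin(); it != c.end(); ++it, ++index) {
    std::size_t chunk = index / chunk_size;  // no std::distance needed
    fn(chunk, *it);
  }
}
```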
PiperOrigin-RevId: 913597452
PiperOrigin-RevId: 913599207
PiperOrigin-RevId: 913607767
Imported from GitHub PR openxla/xla#40464

📝 Summary of Changes

This PR is the fourth in a series of four:
- PR 1/4: openxla/xla#40460
- PR 2/4: openxla/xla#40462
- PR 3/4: openxla/xla#40463
- PR 4/4: openxla/xla#40464 <= This PR

This PR enables the ROCm Triton backend for AllReduce (collective emitter). To this end:
- It extends the `TritonXLAImplementExternElementWise` function so that it accepts a target input parameter, and calls it within the ROCm Triton pipeline.
- It adds the missing `RocmExecutor::CanEnablePeerAccessTo(int other_device_ordinal)` API to rocm_executor (required to enable the collective_emitter thunk).

🎯 Justification

This PR is part of a rework of the AllReduce Triton support to facilitate enabling different target GPUs (starting with ROCm). Prior to these PRs, the triton_xla.get_tid, triton_xla.atomic_write and triton_xla.atomic_spin_wait operations were lowered using PTX assembly, so the AllReduce Triton backend was only available for the CUDA target. These PRs aim to unify the lowering of the triton_xla.atomic_write, triton_xla.atomic_spin_wait and triton_xla.get_tid operations for CUDA and ROCm targets.

🚀 Kind of Contribution

✨ New Feature
🧪 Unit Tests: This PR includes a LIT test checking the lowering of atomic operations.
🧪 Execution Tests: The existing test xla/tests:all_reduce_e2e_test is relevant for testing this support.

Copybara import of the project:

-- 2cf50cd06a46a7cb65a96e45b0c9849043fb8eac by Maxime France-Pillois <mfrancep@amd.com>:
Enable AMD/ROCm support for AllReduce triton backend

-- 0482141a26164dc68d374214e30e8f08709134fd by Maxime France-Pillois <mfrancep@amd.com>:
Add missing headers and deps

Merging this change closes #40464

PiperOrigin-RevId: 913612232
…attempt

Imported from GitHub PR openxla/xla#34572

Added CommandBuffer support for Convolution ops. Graph capture of convolutions is disabled by default (since not all convolutions can be captured). It can be enabled through a newly added flag: **--xla_gpu_enable_command_buffers=+convolution**

🎯 Justification

This op was missing: its absence results in graph fragmentation, especially for large models, so one gets several (sometimes many) execution graphs instead of just one.

🚀 Kind of Contribution

✨ New Feature
🧪 Unit Tests: Added new subtests to xla/service/gpu/transforms/command_buffer_scheduling_test.cc and xla/backends/gpu/runtime/command_buffer_conversion_pass_test.cc

This is a fork of the original PR, which was reverted due to conv problems: openxla/xla#32053. Now convolution capture is disabled by default. @beckerhe, @ezhulenev, @dimitar-asenov please have another look!

Copybara import of the project:

-- abe76203921cc4405758c69bc92b1664279c0c5b by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:
adding convolution to command buffers; added UTs and convolution command test fixes; added rebase fixes; capture only those convolution targets which are explicitly; Revert "adding coll permute and convolution to command buffers" (reverts commit 75847e67261b4589162411c9846ed9c0b9fc1ed5); added conv to command buffers; rewritten ConvolutionCmd, adapted command_buffer_conv_pass; some cosmetics; added extra param for convolution cmd buf; fixes after rebase; simplified runner cache call; fixing compile errors and new conv test; updated the conv subtest and ensured command buffer update is called; assorted build, test, and clang-format fixes

-- 13b02c7619ba0483c2993222c5e7d8a2329d438f by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:
fixing build

-- 1fc153c6233f159c54110e13212aed7b546c437c by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:
cleanup

-- 3e0f6c82d601394fc2afc1197d0d7efe6507ee33 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:
clang format

-- bb3d5907ba75681c3c87fe595f4645ce664b03d4 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:
cleaning build deps

Merging this change closes #34572

PiperOrigin-RevId: 913612872
…erprints. PiperOrigin-RevId: 913616997
… logic across `MaybeConvertToV1/V2/Named` PiperOrigin-RevId: 913631803
PiperOrigin-RevId: 913635240
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)