
[pull] master from tensorflow:master #1636

Merged
pull[bot] merged 10 commits into makesoftwaresafe:master from tensorflow:master
May 11, 2026

@pull pull Bot commented May 11, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

vickyliu-go4it and others added 10 commits May 11, 2026 05:18
A recent upstream change means we can no longer rely on the pattern being
applied multiple times in one go, or on dead ops being deleted for us, so we
need to delete them ourselves.
This change moves the pattern from TableGen to C++ to make that possible.
Also, make a small fix to the "interestingness" script so that it no longer
prints the result of the grep command.
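
The idea of a pattern that cleans up after itself can be illustrated with a toy model. The sketch below is not MLIR and not the pattern from this commit: the `Op` class, the `fold_double_neg` function, and the choice of a double-negation fold are all hypothetical, chosen only to show a rewrite that erases the ops it makes dead instead of relying on the rewrite driver to do so.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Op:
    """Toy SSA op (hypothetical): `name` is the result, `operands` are result names."""
    name: str
    kind: str
    operands: List[str] = field(default_factory=list)


def use_count(ops: List[Op], name: str) -> int:
    """Number of operand slots across `ops` that refer to `name`."""
    return sum(op.operands.count(name) for op in ops)


def fold_double_neg(ops: List[Op], outputs: List[str]) -> Tuple[List[Op], List[str]]:
    """Rewrite neg(neg(x)) to x and erase the ops that become dead.

    The pattern does its own cleanup: it erases the matched op and, if the
    inner op has no remaining uses, erases that too.
    """
    # Work on copies so the caller's program is left untouched.
    ops = [Op(o.name, o.kind, list(o.operands)) for o in ops]
    outputs = list(outputs)
    by_name = {op.name: op for op in ops}

    for outer in list(ops):
        if outer.kind != "neg":
            continue
        inner = by_name.get(outer.operands[0])
        if inner is None or inner.kind != "neg":
            continue
        replacement = inner.operands[0]
        # Redirect every use of the outer result to the original value.
        for user in ops:
            user.operands = [replacement if o == outer.name else o for o in user.operands]
        outputs = [replacement if o == outer.name else o for o in outputs]
        # Erase the matched op explicitly.
        ops.remove(outer)
        del by_name[outer.name]
        # Erase the inner op as well if nothing refers to it any more.
        if use_count(ops, inner.name) == 0 and inner.name not in outputs:
            ops.remove(inner)
            del by_name[inner.name]
    return ops, outputs


program = [
    Op("x", "param"),
    Op("a", "neg", ["x"]),
    Op("b", "neg", ["a"]),
    Op("y", "add", ["b", "x"]),
]
new_ops, new_outputs = fold_double_neg(program, ["y"])
print([(op.name, op.kind, op.operands) for op in new_ops])
```

After the fold, only `x` and `y` remain, and `y` adds `x` to itself; both `neg` ops were erased by the pattern rather than by a later cleanup pass.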

PiperOrigin-RevId: 913646216
Imported from GitHub PR openxla/xla#41779

• 📝 Summary of Changes

  This PR migrates `ReduceScatterCmd` to use `ReduceScatterThunk` directly as a command-buffer command, matching the existing `AllReduceThunk` command migration pattern. It removes the dedicated `ReduceScatterCmd` wrapper and appends `ReduceScatterThunk` as a borrowed command in the command-buffer emitter.

  It also adds multi-GPU command-buffer tests for `ReduceScatterThunk`, covering eager warmup via `ExecuteOnStream`, command-buffer create, command-buffer update, and output correctness.

  🎯 Justification

  This reduces duplicate command-buffer collective plumbing and keeps reduce-scatter behavior aligned with the shared `CollectiveThunk` recording path. The change benefits GPU workloads using reduce-scatter collectives captured into command buffers, especially distributed workloads that rely on command-buffer update paths.

  🚀 Kind of Contribution

  ♻️  Cleanup, 🧪 Tests

  📊 Benchmark (for Performance Improvements)

  Not applicable. This PR is a cleanup/test coverage change and does not claim a performance improvement.

  🧪 Unit Tests:

  Added/updated command-buffer recording tests in:
  `//xla/backends/gpu/runtime:all_reduce_thunk_test`

  Coverage includes:
  - `ReduceScatterThunkTest.RecordCommandBufferCreate`
  - `ReduceScatterThunkTest.RecordCommandBufferUpdate`

  🧪 Execution Tests:

  Added multi-GPU execution coverage in:
  `//xla/backends/gpu/runtime:all_reduce_thunk_multigpu_test`

  New tests:
  - `ReduceScatterThunkMultiGpuTest.RecordCommandBufferCreate`
  - `ReduceScatterThunkMultiGpuTest.RecordCommandBufferUpdate`

  These run with 2 GPUs and verify expected reduce-scatter outputs for both command-buffer create and update paths.
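
For readers unfamiliar with the collective, the expected outputs of a reduce-scatter can be sketched in plain Python. This is a reference model under stated assumptions (a sum reduction and equal contiguous shards), not the test code from this PR; `reduce_scatter_sum` is a hypothetical helper name.

```python
from typing import List


def reduce_scatter_sum(per_rank_inputs: List[List[float]]) -> List[List[float]]:
    """Reference reduce-scatter with a sum reduction (assumed for illustration).

    per_rank_inputs[r] is the full input buffer held by rank r. All buffers
    must have the same length, divisible by the number of ranks. Returns one
    output shard per rank.
    """
    num_ranks = len(per_rank_inputs)
    length = len(per_rank_inputs[0])
    if any(len(buf) != length for buf in per_rank_inputs):
        raise ValueError("all ranks must contribute buffers of equal length")
    if length % num_ranks != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")

    # Reduce: element-wise sum across ranks.
    reduced = [sum(values) for values in zip(*per_rank_inputs)]

    # Scatter: rank r keeps the r-th contiguous shard of the reduced buffer.
    shard = length // num_ranks
    return [reduced[r * shard:(r + 1) * shard] for r in range(num_ranks)]


# Two ranks, mirroring the 2-GPU setup described above.
outputs = reduce_scatter_sum([
    [1.0, 2.0, 3.0, 4.0],      # rank 0 input
    [10.0, 20.0, 30.0, 40.0],  # rank 1 input
])
print(outputs)  # [[11.0, 22.0], [33.0, 44.0]]
```

Rank 0 ends up with the first half of the summed buffer and rank 1 with the second half, which is the kind of per-rank result an output-correctness check would compare against.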

  Validated locally with:

  `bazel test --test_output=errors --test_filter='ReduceScatterThunkMultiGpuTest.*' //xla/backends/gpu/runtime:all_reduce_thunk_multigpu_test`

Copybara import of the project:

--
77960ea67396bf055ee18937c14863b082e5f1d1 by Shawn Wang <shawnw@nvidia.com>:

[xla:gpu] Migrate ReduceScatterCmd to thunk command

--
25dd889f23c30976c389923d43f1fba644c01e07 by Shawn Wang <shawnw@nvidia.com>:

[xla:gpu] Add ReduceScatterThunk multigpu tests

--
2f7b052976da7ae21a85762f0d632c9877fb1334 by Shawn Wang <shawnw@nvidia.com>:

[xla:gpu] Clean up ReduceScatterThunk command buffer deps

--
77715f319a63d5517e3a7ca8ba7173cfb10a26f0 by Shawn Wang <shawnw@nvidia.com>:

remove unused header

Merging this change closes #41779

PiperOrigin-RevId: 913656825
Imported from GitHub PR openxla/xla#42218

Add a diffing tool for clang-tidy output

Copybara import of the project:

--
14f33777161c2eac9a9eeb1fa8e6b8d37413b8d3 by Sohaib Iftikhar <sohaibiftikhar@google.com>:

[XLA:BUILD] Add a diffing tool for clang-tidy output

Adds a diffing tool that reads clang-tidy report files and reports a
diagnostic only if the line it points at was affected by the change.
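
The filtering described above can be sketched as follows. This is a minimal illustration and not the tool added in this PR: the function names are hypothetical, and it assumes a unified diff as input and clang-tidy's usual `path:line:col: severity: message [check]` diagnostic lines, with paths that match the diff's paths after stripping the `b/` prefix.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")
DIAG_RE = re.compile(r"^(?P<path>[^:\s]+):(?P<line>\d+):(?P<col>\d+): ")


def changed_lines(unified_diff: str) -> Dict[str, Set[int]]:
    """Map each file path to the new-side line numbers added by the diff.

    Limitation of this sketch: a removed or added line whose own text starts
    with "-- " or "++ " would be mistaken for a file header.
    """
    changed: Dict[str, Set[int]] = defaultdict(set)
    path = None
    new_line = 0
    for raw in unified_diff.splitlines():
        if raw.startswith("+++ "):
            target = raw[4:].strip()
            if target == "/dev/null":
                path = None  # file was deleted, nothing to report on
            else:
                path = target[2:] if target.startswith("b/") else target
            continue
        if raw.startswith("--- ") or raw.startswith("\\"):
            continue  # old-file header or "\ No newline at end of file"
        hunk = HUNK_RE.match(raw)
        if hunk:
            new_line = int(hunk.group(1))
            continue
        if path is None:
            continue
        if raw.startswith("+"):
            changed[path].add(new_line)
            new_line += 1
        elif raw.startswith("-"):
            continue  # removed lines do not exist on the new side
        else:
            new_line += 1  # context line
    return changed


def filter_diagnostics(tidy_output: str, unified_diff: str) -> List[str]:
    """Keep only diagnostics whose file and line were touched by the diff."""
    changed = changed_lines(unified_diff)
    kept = []
    for line in tidy_output.splitlines():
        m = DIAG_RE.match(line)
        if m and int(m.group("line")) in changed.get(m.group("path"), set()):
            kept.append(line)
    return kept


diff = "\n".join([
    "--- a/foo.cc",
    "+++ b/foo.cc",
    "@@ -1,3 +1,4 @@",
    " int a;",
    "+int b;",
    " int c;",
    " int d;",
])
report = "\n".join([
    "foo.cc:2:5: warning: unused variable 'b' [clang-diagnostic-unused-variable]",
    "foo.cc:4:5: warning: unused variable 'd' [clang-diagnostic-unused-variable]",
])
print(filter_diagnostics(report, diff))
```

Only the diagnostic on line 2 is kept, because that is the only line the diff adds; the pre-existing issue on line 4 is suppressed.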

Merging this change closes #42218

PiperOrigin-RevId: 913665880
Flags:
* `mixin_max_same_mnk` limits the number of mixin configs that share the same M, N, and K block sizes. This should help compensate for the cost model not differentiating such configs well.
* `mixin_only_faster` allows only mixin configs that are faster than the base set. This may reduce the compile-time hit while keeping the performance benefits.
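
One way to picture how the two flags could interact is the sketch below. It is illustrative only: the `Config` fields, the `select_mixins` function, and the use of a per-config time estimate are hypothetical, and "faster than the base set" is interpreted here as "estimated faster than the fastest base config", which the commit message does not spell out.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Config:
    """Hypothetical tiling config; field names are assumptions for this sketch."""
    block_m: int
    block_n: int
    block_k: int
    num_warps: int      # stands in for any parameter other than M, N, K
    est_time_us: float  # assumed cost-model estimate, lower is better


def select_mixins(base: List[Config], mixins: List[Config],
                  max_same_mnk: int, only_faster: bool) -> List[Config]:
    """Filter mixin configs in the spirit of the two flags above."""
    candidates = list(mixins)
    if only_faster and base:
        # Assumed interpretation: beat the best estimate in the base set.
        best_base = min(c.est_time_us for c in base)
        candidates = [c for c in candidates if c.est_time_us < best_base]

    # Keep at most `max_same_mnk` configs per (M, N, K), best estimates first.
    per_mnk: Dict[Tuple[int, int, int], int] = {}
    kept = []
    for c in sorted(candidates, key=lambda c: c.est_time_us):
        key = (c.block_m, c.block_n, c.block_k)
        if per_mnk.get(key, 0) < max_same_mnk:
            per_mnk[key] = per_mnk.get(key, 0) + 1
            kept.append(c)
    return kept


base = [Config(64, 64, 32, 4, 10.0)]
mixins = [
    Config(128, 128, 32, 4, 8.0),
    Config(128, 128, 32, 8, 9.0),
    Config(128, 128, 32, 2, 7.5),
    Config(64, 128, 32, 4, 12.0),  # slower than the base config
]
selected = select_mixins(base, mixins, max_same_mnk=2, only_faster=True)
print([c.est_time_us for c in selected])  # [7.5, 8.0]
```

Here the slower config is dropped by the faster-only filter, and the cap of two per M, N, K combination removes the third 128x128x32 variant.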
PiperOrigin-RevId: 913705752
…imation.

This old logic only worked for cases when we tile the full row anyway. Also fix the test name.

PiperOrigin-RevId: 913717643
@pull pull Bot locked and limited conversation to collaborators May 11, 2026
@pull pull Bot added the ⤵️ pull label May 11, 2026
@pull pull Bot merged commit 8653c16 into makesoftwaresafe:master May 11, 2026
8 participants