[pull] master from tensorflow:master by pull[bot] · Pull Request #1627 · makesoftwaresafe/tensorflow

pull · 2026-05-09T00:04:16Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)



… pattern

Imported from GitHub PR openxla/xla#36224

📝 Summary of Changes

This PR enables hoisting all-reduce operations out of while loops for scatter-based gradient accumulation patterns, commonly used in ZeRO-1 with gradient accumulation.

🎯 Justification

This optimization improves ZeRO-1 gradient accumulation performance by replacing all-reduce operations inside a loop with one after the loop, reducing communication overhead.

🚀 Kind of Contribution

✨ New Feature, 🧪 Tests

📊 Benchmark (for Performance Improvements)

The public HLOs in `xla/tools/benchmarks/hlo/` do not have this pattern.

🧪 Unit Tests: Added to `xla/service/while_loop_all_reduce_code_motion_test.cc`

🧪 Execution Tests: N/A

Copybara import of the project:

-- f253cd406a6419d2c57818b42ade3617ae92c761 by Sevin Varoglu <svaroglu@nvidia.com>:

Support all-reduce hoisting for scatter-based accumulation pattern

-- b3debd31dd5d72d49f1afdc02f7654f095837516 by Sevin Varoglu <svaroglu@nvidia.com>:

Incorporate review feedback

Merging this change closes #36224

PiperOrigin-RevId: 912351282
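The hoisting described above is valid because a sum all-reduce is linear: adding up per-step all-reduced gradients gives the same result as accumulating locally and all-reducing once. The following is a minimal sketch of that equivalence only. It simulates replicas with plain Python lists, `all_reduce_sum` is a hypothetical stand-in rather than an XLA API, and the scatter part of the pattern is not modeled.

```python
# Hypothetical simulation: grads[step][replica] holds one gradient value.
# all_reduce_sum is a stand-in for a sum all-reduce, not an XLA API.


def all_reduce_sum(per_replica_values):
    """Return the cross-replica sum, replicated on every replica."""
    total = sum(per_replica_values)
    return [total] * len(per_replica_values)


def accumulate_in_loop(grads):
    """All-reduce inside the loop: one collective per accumulation step."""
    num_replicas = len(grads[0])
    acc = [0] * num_replicas
    collectives = 0
    for step_grads in grads:
        reduced = all_reduce_sum(step_grads)
        collectives += 1
        acc = [a + r for a, r in zip(acc, reduced)]
    return acc, collectives


def accumulate_hoisted(grads):
    """Accumulate locally, then issue a single all-reduce after the loop."""
    num_replicas = len(grads[0])
    local = [0] * num_replicas
    for step_grads in grads:
        local = [a + g for a, g in zip(local, step_grads)]
    return all_reduce_sum(local), 1


# Three accumulation steps on two replicas. Integers keep the comparison
# exact; floating-point sums may differ slightly in rounding.
grads = [[1, 2], [3, 4], [5, 6]]
in_loop, n_in_loop = accumulate_in_loop(grads)
hoisted, n_hoisted = accumulate_hoisted(grads)
```

Both variants produce the same accumulated value, but the hoisted one issues a single collective instead of one per loop iteration.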


Imported from GitHub PR openxla/xla#42201 Use latest bant for improved DWYU checks Copybara import of the project: -- 50ca3e78144a751560a6543e052cc7e035ca6fce by Eugene Zhulenev <ezhulenev@openxla.org>: Bump DWYU bant version to 0.2.7 Merging this change closes #42201 PiperOrigin-RevId: 912369650


… devices

Imported from GitHub PR openxla/xla#41901

## Summary

- `MakeGlobalTopologyFromPjRtClient` hardcodes `addressable_devices()[0]->device_kind()` for every device in the topology, causing all devices to report device 0's marketing name in mixed-GPU systems
- Use `pjrt_device->device_kind()` for addressable devices, falling back to `addressable_devices()[0]` only for non-addressable devices

## Background

In multi-GPU systems with different architectures (e.g., gfx906 + gfx1101, or Radeon VII + Radeon PRO W7700), `jax.devices()` reports device 0's name for all devices:

```python
>>> [d.device_kind for d in jax.devices()]
['AMD Radeon VII', 'AMD Radeon VII']  # device 1 is actually Radeon PRO W7700
```

While `compute_capability` and `core_count` are correctly per-device, `device_kind` is not, because the IFRT topology builder at `xla/python/pjrt_ifrt/pjrt_client.cc:366` always reads from `addressable_devices()[0]`.

Traced and reproduced on a physical gfx906 + gfx1101 mixed-GPU system. The underlying PJRT layer (`PJRT_DeviceDescription_Kind`) returns correct per-device names. The bug is solely in the IFRT topology construction.

Fixes ROCm/rocm-jax#390 (cosmetic portion)

## Test plan

- [x] Verify `device_kind` reports correct per-device names in mixed-GPU systems
- [x] Verify homogeneous multi-GPU systems are unaffected
- [x] Verify single-GPU systems are unaffected
- [x] Verify non-addressable device fallback still works

Copybara import of the project:

-- 76b9c85c046d04ca5a114791a85cdbd84b78f22d by Luca Bruni <lucbruni@amd.com>:

Fix IFRT topology reporting device 0's device_kind for all devices.

MakeGlobalTopologyFromPjRtClient always used addressable_devices()[0]->device_kind() when populating the DeviceProto for every device in the topology. In mixed-GPU systems (e.g., different GPU architectures), this caused all devices to report device 0's marketing name instead of their own.

Use pjrt_device->device_kind() for addressable devices, falling back to addressable_devices()[0] only for non-addressable devices where no PjRtDevice is available.

Co-Authored-By: Luca Bruni <lucbruni@amd.com>
Co-Authored-By: Luca Bruni <lucaalexbruni@gmail.com>

Merging this change closes #41901

PiperOrigin-RevId: 912404916
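The lookup-with-fallback logic described in this commit can be sketched in a few lines. This is an illustration of the idea only, not the C++ code in `pjrt_client.cc`: the `Device` class, `topology_device_kinds`, and the sample device labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical stand-in for an addressable PjRt device."""
    device_id: int
    device_kind: str


def topology_device_kinds(all_device_ids, addressable):
    """Map every device id in the topology to a device_kind string.

    Addressable devices report their own kind. Devices with no local
    handle fall back to the first addressable device's kind.
    """
    by_id = {d.device_id: d for d in addressable}
    fallback = addressable[0].device_kind
    kinds = {}
    for device_id in all_device_ids:
        dev = by_id.get(device_id)
        kinds[device_id] = dev.device_kind if dev is not None else fallback
    return kinds


# Two addressable devices of different kinds plus one non-addressable id.
addressable = [Device(0, "AMD Radeon VII"), Device(1, "Radeon PRO W7700")]
kinds = topology_device_kinds([0, 1, 2], addressable)
```

Before the fix, the equivalent of `fallback` was used for every entry, which is why device 1 reported device 0's name.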


Imported from GitHub PR openxla/xla#42198 When loading an AOT result, the `xla_dump_to` path points to a location on the machine that compiled the binary. Override it with the path for the current process so that all dumps are written to the correct location. Copybara import of the project: -- 6cebc8044554fccd326bef1f82fc251be1fe005d by Eugene Zhulenev <ezhulenev@openxla.org>: [xla:gpu] Override xla_dump_to path when loading AOT result Merging this change closes #42198 PiperOrigin-RevId: 912405569


Imported from GitHub PR openxla/xla#42103 🚀 Kind of Contribution ♻️ Cleanup Copybara import of the project: -- 62edde4be67bf0154635595852e44079423997df by Aleksei Nurmukhametov <anurmukh@amd.com>: [NFC] Fix IWYU/DWYU in xla/stream_executor/device_description_test.cc Merging this change closes #42103 PiperOrigin-RevId: 912406790

Imported from GitHub PR openxla/xla#41440

- Add `DynamicSliceFusion` analysis library that extracts hero instructions, resolves parameter/result slices, and returns `DynamicSliceConfig` protos from dynamic-slice fusion computations.
- Key APIs: `FindHero()` locates the hero op inside a fusion body, `ResolveParameters()` maps hero operands back to fusion parameters with slice configs and per-dimension offsets, `ResolveResults()` does the same for DUS outputs.
- Offsets are either `ConstantOffset` (literal sunk into the fusion) or `RuntimeOffset` (fusion parameter holding the induction variable), enabling the thunk emitter to verify annotated offsets at runtime.

Copybara import of the project:

-- d97fa268c679c74f6346e3eb07418bd4358f4722 by Eugene Zhulenev <ezhulenev@openxla.org>:

[xla:gpu] Add dynamic-slice fusion analysis library

Merging this change closes #41440

PiperOrigin-RevId: 912408149


The test now starts by parsing a SymbolicMap from a string, converts it to an AffineMap, and then converts it back to a SymbolicMap to verify the round trip. This removes the need for a custom AffineMap parsing helper and therefore a TODO. I guess I could just directly remove the class, but since Adrian showed me that we still are using it in one place (simplify_affine), I will leave the file for now since it's not hurting. PiperOrigin-RevId: 912437098

The calculation of L2 bytes now takes into account the element types of the LHS and RHS operands, providing a more accurate estimate of the data loaded from L2. More info: First conclusion of b/501002656#comment2. We are now overestimating but this might be a separate problem as Nikita suggested in b/501002656#comment3. Also, the case-study of the bug is still producing the same suggestions for all the configs. But anyway, with this small change, cost-model suggestion has clearly improved. The tests had to be adjusted, probably because of the issue mentioned in b/510666436, so I added the TODO accordingly. PiperOrigin-RevId: 912437307


The GPU dot fusion cost model now checks if any transitive user of a dot instruction is a transpose. If a transpose is found in the users' graph, the dot fusion is marked as unsupported. A TODO has been left so that adding this support can be tracked in the future. PiperOrigin-RevId: 912453825
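A transitive-user check of this kind is a plain graph traversal over user edges. The sketch below shows one way to write it, assuming a dictionary-based graph. The function name, the graph encoding, and the instruction names are hypothetical and do not reflect the XLA implementation.

```python
from collections import deque


def has_transitive_user_with_opcode(users, opcodes, start, target):
    """Breadth-first search over user edges, starting from `start`.

    users:   maps an instruction name to the names of its direct users.
    opcodes: maps an instruction name to its opcode string.
    """
    seen = {start}
    queue = deque(users.get(start, ()))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        if opcodes[node] == target:
            return True
        queue.extend(users.get(node, ()))
    return False


opcodes = {"dot": "dot", "add": "add", "t": "transpose", "neg": "negate"}

# dot -> add -> {transpose, negate}: the transpose is an indirect user.
users_with_transpose = {"dot": ["add"], "add": ["t", "neg"]}
# dot -> negate only: no transpose anywhere downstream.
users_without_transpose = {"dot": ["neg"]}

found = has_transitive_user_with_opcode(
    users_with_transpose, opcodes, "dot", "transpose")
not_found = has_transitive_user_with_opcode(
    users_without_transpose, opcodes, "dot", "transpose")
```

The `seen` set keeps the traversal linear in the number of instructions even when several users share a common downstream consumer.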


This change introduces a new GitHub Actions workflow (`actions-lint.yml`) to run `actionlint` on changes to workflow files. It also fixes several errors found by actionlint. The change also includes several improvements to existing workflows, such as quoting variables in shell commands, using here-documents for writing to output files, and adding a `halt-for-connection` input to some workflows for manual triggering. Minor fixes to conditional checks and command formatting are also included. PiperOrigin-RevId: 912531462


This change updated scheduling (used by experimental emitter) to support customized traversal order. For dot, the experimental emitter now optimizes cache hits by swapping m and n traversal whenever the LHS operand is smaller than the RHS. PiperOrigin-RevId: 912556582


…eamz Adds a new field "enable_priority_queue" to the "/tensorflow/serving/batching/mixed_priority_batching_policy" streamz metric to provide more context on whether priority queueing is enabled when reporting the mixed priority batching policy. PiperOrigin-RevId: 912586091


Eventually we want to replace SymbolicTileAnalysis. In order to do that, we need a way to evaluate whether certain tile sizes are valid. As a preparation, add the necessary constraints for Concat during tiling propagation. Also, the current constraint was too strong: it would have disallowed valid cases with slice(concat). Improve the constraints to handle cases with a constant base offset as well. PiperOrigin-RevId: 912669003


tensorflower-gardener and others added 30 commits May 7, 2026 23:07

[PART 1] Externalize HLO isolation test into a new open-source API.

d3b4927

The new API provides `RunIsolationTestOnModule` which performs comparisons between the test runner and a reference runner, including TPU vs Defused TPU and TPU vs Interpreter. PiperOrigin-RevId: 912336410

Automated Code Change

3ba4b8f

PiperOrigin-RevId: 912338506

[XLA:GPU] Refactor tiled cost model to use num_warps directly.

29379df

Creating launch dimensions was an unnecessary and error-prone step. PiperOrigin-RevId: 912365353

[XLA:MSA] Cache shape sizes in HloCostAnalysis::Preprocess.

7913d58

This brings down the number of shape size computations from O(#edges) to O(#instructions). PiperOrigin-RevId: 912384006
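The difference between the two complexities can be illustrated with a toy graph. This is a sketch under stated assumptions, not the `HloCostAnalysis::Preprocess` code: `shape_size`, the 4-byte element size, and the graph encoding are hypothetical, and the counter exists only to make the number of size computations visible.

```python
def shape_size(shape, counter):
    """Byte size of a shape, assuming 4-byte elements. Counts each call."""
    counter[0] += 1
    size = 4
    for dim in shape:
        size *= dim
    return size


def operand_bytes_per_edge(shapes, operands, counter):
    """Recompute the producer's size for every (user, operand) edge."""
    return {
        user: sum(shape_size(shapes[op], counter) for op in ops)
        for user, ops in operands.items()
    }


def operand_bytes_cached(shapes, operands, counter):
    """Compute each instruction's size once, then look it up per edge."""
    cache = {name: shape_size(shape, counter) for name, shape in shapes.items()}
    return {user: sum(cache[op] for op in ops) for user, ops in operands.items()}


# Six instructions and nine operand edges.
shapes = {name: (8, 8) for name in ("p0", "p1", "add", "mul", "sub", "root")}
operands = {
    "add": ["p0", "p1"],
    "mul": ["p0", "add"],
    "sub": ["p0", "mul"],
    "root": ["add", "mul", "sub"],
}

per_edge_calls = [0]
cached_calls = [0]
per_edge = operand_bytes_per_edge(shapes, operands, per_edge_calls)
cached = operand_bytes_cached(shapes, operands, cached_calls)
```

Both functions return the same byte totals. The per-edge version calls `shape_size` once per edge, the cached version once per instruction, and the gap grows with the fan-out of the graph.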

[XLA:GPU] Migrate float_conversions_test to use HloPjRtGpuTestBase.

3a83d03

PiperOrigin-RevId: 912385690

[XLA:GPU] Rename TiledHloComputation::GetRoots() to roots().

e9e3dc5

PiperOrigin-RevId: 912387500

Automated Code Change

8cd6c87

PiperOrigin-RevId: 912388623

Automated Code Change

b8ce55b

PiperOrigin-RevId: 912389759

Fix one more bug in token literal handling in PjRt C API

dbaa6c2

PiperOrigin-RevId: 912390028

[XLA:GPU] Migrate gpu_too_many_blocks_test away from the legacy run…

ab7e2c4

…ner. PiperOrigin-RevId: 912405131

[XLA:GPU] Move EstimateRunTimeForTiledHloComputationImpl to anonymous…

65e6267

… namespace. PiperOrigin-RevId: 912406774

[XLA:GPU] Add support for experimental tiling in coalescing analysis.

ff9a120

PiperOrigin-RevId: 912424058

XLA: Support --override_module in build.py

1e1be93

This flag will take effect once we switch downstream projects (JAX, TF) to Bzlmod; until then it is a no-op. We can later remove --override_repository when the Bzlmod migration is done. PiperOrigin-RevId: 912431884

Reverts 3fdc7e9

10a7166

PiperOrigin-RevId: 912432140

Automated Code Change

cec25e0

PiperOrigin-RevId: 912439391

[XLA:GPU] Support kCollapseReshape with trivial inner dimensions.

7f86af9

PiperOrigin-RevId: 912440303

Introduce safety margin in get_memory_limit (available memory estimat…

2bf58d4

…ion) in HeapAlgorithmWithFallback for preventing device OOMs. PiperOrigin-RevId: 912451895

[XLA:GPU] Introduce TiledHloRegion and rename regions() accessor.

af208e2

To match API of experimental::TilingHloInstruction. PiperOrigin-RevId: 912455533

Avoid int32_t overflows when the embedding table is larger than 2 GiB.

82c1c9f

PiperOrigin-RevId: 912463660
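The commit does not say where the overflow occurred, so the following is only a generic illustration of why 32-bit offsets break past 2 GiB. It assumes two's-complement wraparound (signed overflow is undefined behavior in C++, so real code may fail differently), and `wrap_int32`, the row size, and the row index are hypothetical.

```python
def wrap_int32(value):
    """Reduce value to the signed 32-bit range using two's-complement wrap."""
    return (value + 2**31) % 2**32 - 2**31


def row_byte_offset_int32(row, row_bytes):
    """Byte offset of a table row, computed as if in 32-bit arithmetic."""
    return wrap_int32(row * row_bytes)


def row_byte_offset_int64(row, row_bytes):
    """Same offset with wide arithmetic. Python ints stand in for int64."""
    return row * row_bytes


row_bytes = 1024          # hypothetical embedding row size in bytes
row = 3 * 1024 * 1024     # a row that starts 3 GiB into the table

narrow = row_byte_offset_int32(row, row_bytes)
wide = row_byte_offset_int64(row, row_bytes)
```

The 32-bit result is negative, while the wide computation yields the expected 3 GiB offset.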

[XLA:GPU] load all XLA flags in hlo_to_xtileir

cc8eec0

so we don't need --use_experimental_tiling and can also set other flags if need be PiperOrigin-RevId: 912463889

PatriosTheGreat and others added 27 commits May 8, 2026 07:54

[XLA:GPU] Prevent FABRIC+POSIX_FD usage when running not on a cluster.

8adfce4

PiperOrigin-RevId: 912522529

Batch update func result shardings on shard map export on import/expo…

2961143

…rt shardings. To avoid context bloat. In MLIR, updating function result attributes requires creating a new ArrayAttr that holds the DictionaryAttr for all results. PiperOrigin-RevId: 912524956

Automated Code Change

eb719ea

PiperOrigin-RevId: 912535650

[pallas:triton] Compile Pallas Triton kernels to PTX by default

fa92ad3

PiperOrigin-RevId: 912564229

Fix ReduceWindow bugs in YNNPACK

5f5edb9

- When there is padding, the padding value should be the reduce identity value, not the reduce initial value.
- Windows of size > 1 still need to be reported as unsupported if the stride is not 1.

PiperOrigin-RevId: 912589684
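The first point can be checked with a small 1-D sum example. The sketch assumes reference semantics in which out-of-bounds window positions contribute nothing and the initial value is folded in once per window. The helper names are hypothetical and this is not the YNNPACK code.

```python
def reduce_window_sum(values, window, pad, pad_value, init):
    """Sum-reduce over stride-1 windows after materializing the padding."""
    padded = [pad_value] * pad + list(values) + [pad_value] * pad
    out = []
    for start in range(len(padded) - window + 1):
        acc = init
        for x in padded[start:start + window]:
            acc += x
        out.append(acc)
    return out


def reference_window_sum(values, window, pad, init):
    """Reference: skip out-of-bounds positions instead of padding."""
    n = len(values)
    out = []
    for start in range(-pad, n + pad - window + 1):
        acc = init
        for i in range(start, start + window):
            if 0 <= i < n:
                acc += values[i]
        out.append(acc)
    return out


values, window, pad, init = [1, 2, 3], 2, 1, 10

expected = reference_window_sum(values, window, pad, init)
# Padding with the identity of addition (0) matches the reference.
with_identity = reduce_window_sum(values, window, pad, pad_value=0, init=init)
# Padding with the initial value counts it twice in the border windows.
with_init = reduce_window_sum(values, window, pad, pad_value=init, init=init)
```

The two padding choices only coincide when the initial value happens to equal the identity, which is why the bug can go unnoticed with the common `init = 0` sum case.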

Batch update func result shardings on shard map export.

94ea166

Updating (removing or setting) N func args/results one by one in a loop creates N^2 pointers, that is, each one will create an ArrayAttr of N pointers. PiperOrigin-RevId: 912590644
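The quadratic cost can be modeled with immutable tuples standing in for MLIR's `ArrayAttr`. This is a rough sketch of the counting argument, not MLIR code: the function names and the `pointers_copied` tally are hypothetical.

```python
def update_one_by_one(attrs, updates):
    """Each single update rebuilds the whole immutable array of N entries."""
    pointers_copied = 0
    for index, value in updates.items():
        new_attrs = list(attrs)
        new_attrs[index] = value
        attrs = tuple(new_attrs)      # a fresh N-entry array per update
        pointers_copied += len(attrs)
    return attrs, pointers_copied


def update_batched(attrs, updates):
    """Apply every update to one scratch list, then build the array once."""
    new_attrs = list(attrs)
    for index, value in updates.items():
        new_attrs[index] = value
    return tuple(new_attrs), len(new_attrs)


n = 100
attrs = tuple(f"old{i}" for i in range(n))
updates = {i: f"new{i}" for i in range(n)}

slow_result, slow_cost = update_one_by_one(attrs, updates)
fast_result, fast_cost = update_batched(attrs, updates)
```

Both paths end with the same attributes, but the one-by-one path materializes N arrays of N pointers, while the batched path materializes one.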

Stop optimizing broadcast to AllocateBuffer if an unexpected instruct…

c0b589d

…ion is seen during the upward traversal. PiperOrigin-RevId: 912593212

Adding Android workflow to Github

582b956

PiperOrigin-RevId: 912629390

[XLA:GPU] Move CanonicalizeDotOperand from emitter_helper to lowering…

e30a231

…_utils PiperOrigin-RevId: 912631693

Re-enable and fix ArrayMemoryKindTest.HostBufferTokens in IFRT proxy.

2eb43ad

PiperOrigin-RevId: 912655935

[XLA:CPU] Add more benchmark coverage of "expensive" unary ops

e42c87e

PiperOrigin-RevId: 912661680

Update test condition to run only for named shardings.

1c3cd9a

PiperOrigin-RevId: 912661851

Refactor precision config in hlo_instruction by adding the missing sc…

3e1728e

…aled_dot and writing a helper method SupportsPrecisionConfig PiperOrigin-RevId: 912667637

[XLA] Migrate hlo_evaluator_slow_reduce_window_test to `tsl/platfor…

e41a5cd

…m/test`. PiperOrigin-RevId: 912668258

Remove now-redundant string copies from TSL Monitoring API call sites

9c1a46a

PiperOrigin-RevId: 912668988

Add experimental unsafe rank reduction pass to TFLite conversion pipe…

d592454

…line PiperOrigin-RevId: 912678171

Implement dynamic shape reading and slicing via a shared dynamic_shapes

fd7450c

utilities library. PiperOrigin-RevId: 912688344

Adding Android workflow to Github

ca3e18a

Reverts 582b956 PiperOrigin-RevId: 912706974

make compilation of mlir ir to llvm ir async for MlirKernelFusion

e41476b

PiperOrigin-RevId: 912716090

introducing Cuda13 build target

8d6a21d

PiperOrigin-RevId: 912723972

Add HloShardingV3 checks where not expected, update log message to …

920a456

…contact Shardy team. PiperOrigin-RevId: 912738934

[pallas:triton] Compile Pallas Triton kernels to PTX by default

20a647e

Reverts fa92ad3 PiperOrigin-RevId: 912739599

Change the transfer APIs to take PjRtDeviceEventRef instead of

a4307ab

tsl::AsyncValue. PiperOrigin-RevId: 912749857

pull Bot locked and limited conversation to collaborators May 9, 2026

pull Bot added the ⤵️ pull label May 9, 2026

pull Bot merged commit a4307ab into makesoftwaresafe:master May 9, 2026

pull[bot] merged 62 commits into makesoftwaresafe:master from tensorflow:master
