
Support Split between logical domain to allocation domain to represent padding#5184

Draft
jjsjann123 wants to merge 89 commits into jj/skip_vectorization_allocation_validation from jj/allocation_PR_0

Conversation

@jjsjann123
Collaborator

@jjsjann123 jjsjann123 commented Sep 18, 2025

Stacked PR

PR0: #5622 skip aggressive validation check on allocation domain for vectorization
PR1: #5184 Support Split between logical domain to allocation domain to represent padding <-- this one

This PR

Allows a split of an IterDomain on the logical->allocation path to represent padding logic on the allocation. Notably, we no longer require the allocation domain to be on the logical->loop path.

Motivation

A split on the allocation domain allows a clean representation for padding, e.g.

  // `out` is a 2d TensorView with logical domain as [i0, i1]
  auto&& [io, ii] = IterDomain::split(
      out->axis(1), IrBuilder::create<Val>(16L, DataType::Index), true);
  // out now has
  //   logical [i0, i1]
  //     io(i1/16), ii(16) = split(i1, 16)
  //   alloc    [i0, io(i1/16), ii(16)]
  out->setAllocationDomain({out->axis(0), io, ii}, true);

The example above simply specifies that dimension i1 is padded to a multiple of 16.

Main Code Change

In order to support this, we have to update TensorView::cacheBefore. cacheBefore replaces this in the graph with producer -> set -> consumer:

  • The old cacheBefore logic kept this->domain() on the producer and replayed from logical to loop on the consumer;
  • This was arguably incorrect, since the output tensor (consumer) shouldn't dictate the layout of the cache;
  • A split that sits only between logical and allocation wouldn't work either, since it isn't on the replay path.

Hence this PR changes the cacheBefore logic such that:

  • We replay the transformation from root to loop on the producer;
  • this->domain() is now preserved on the consumer, after reduction IDs are removed.

Technical Challenges

  1. In theory, we shouldn't need an allocation domain on the cache at all. One exception where the allocation domain is preserved on the cache is when the cache is sharded. This is because our shape inference via ExpressionEvaluator relies on the allocation domain; without a proper allocation domain, the reshape call would be made on the global tensor instead of the local tensor;
  2. Shape inference and indexing correctness are compromised with a non-divisible split. See the added example in LogicalAndAllocationSizes. Since this PR is growing in size, I'll fix it in follow-up PRs;
  3. There's a separate codegen test where a modified allocation domain on the cache leads to incorrect codegen on a vectorized store. See comment. I think this is more of a scheduler issue, which I'll continue investigating separately.

@github-actions

github-actions bot commented Sep 18, 2025

Review updated until commit 4d240a4

Description

  • Enable split operations between logical and allocation domains for padding representation

  • Refactor TensorView::cacheBefore to properly handle domain transformations and preserve allocation domains

  • Update transform replay logic to maintain parallelization types and rfactor product information

  • Add support for scatter operations in cacheBefore with proper domain handling

  • Improve allocation domain preservation during caching operations

Changes walkthrough

Relevant files

Enhancement

csrc/tensor_view.cpp: Refactor cacheBefore with improved domain handling (+139/-38)

  • Major refactor of TensorView::cacheBefore() method with a new two-step approach
  • Add scatter operation support with custom domain handling
  • Implement proper cleanup of consumer domains, removing root and reduction IDs
  • Preserve allocation domains and parallelization information during caching
  • Add device mesh handling with allocation domain mapping

csrc/transform_replay.cpp: Enhance transform replay with parallelization preservation (+42/-10)

  • Preserve parallelization types during split operations
  • Update merge operations to handle rfactor products correctly
  • Refactor fullSelfReplay to return the replay mapping for allocation domain updates
  • Add new applyFullSelfReplay helper function

csrc/transform_replay.h: Extend fullSelfReplay API with replay mapping (+9/-1)

  • Add new fullSelfReplay overload accepting a replay_map parameter
  • Update documentation to clarify replay transformation behavior

csrc/ir/internal_base_nodes.h: Add resetRFactorProduct utility method (+5/-0)

  • Add resetRFactorProduct method to IterDomain for clearing the rfactor domain flag

Bug fix

csrc/scheduler/matmul.cpp: Fix matmul scheduler ID model updates (+20/-6)

  • Update updateIdModel to handle reduction IDs eliminated by cacheBefore
  • Fix cacheBefore to properly map logical domains between consumer and producer
  • Add ValGroup traversal logic to find remaining IDs in the new id_model

Tests

tests/cpp/test_layout_op.cpp: Add allocation domain padding and vectorization tests (+65/-0)

  • Add test for logical and allocation domain sizes with padding
  • Add test for allocation domain split vectorization factor
  • Validate padding behavior and vectorization with allocation domain splits

tests/cpp/test_allocation_domain.cpp: Update allocation domain test expectations (+0/-2)

  • Remove assertions about allocation domain preservation after cacheBefore
  • Update test expectations to match new cacheBefore behavior

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Memory Management

    The new cacheBefore implementation creates multiple new IterDomains using IrBuilder::createInContainer but doesn't explicitly clean up the old domain objects. While the old domain is stored in the old_domain pointer, there's no clear deletion strategy, which could lead to memory leaks, especially in long-running applications or when cacheBefore is called multiple times.

    TensorDomain* old_domain = domain();
    ScatterOp Edge Case

    The special handling for ScatterOp creates logical and loop domains separately, but the comment suggests this is a workaround for limitations in replay. The logic for handling scatter dimensions and creating new IDs might have edge cases where the mapping isn't correct, particularly when the scatter has complex indexing patterns.

    if (definition()->isA<ScatterOp>()) {
      // scatter output's loop is not connected to its root, we cannot support it
      // in replay
      NVF_ERROR(
          !domain()->hasRoot(),
          "scatter output's with root domain is not supported in cacheBefore");
      std::vector<IterDomain*> logical;
      std::vector<IterDomain*> loop;
    
      std::ranges::transform(
          domain()->logical(), std::back_inserter(logical), [&](IterDomain* id) {
            IterDomain* cloned_id =
                IrBuilder::createInContainer<IterDomain>(container(), id);
            producer_map[id] = cloned_id;
            return cloned_id;
          });
      std::ranges::transform(
          domain()->loop(), std::back_inserter(loop), [&](IterDomain* id) {
            if (auto it = producer_map.find(id); it != producer_map.end()) {
              // reuse cloned_ids
              return it->second;
            }
            // for scatter dimension, create new ID
            return IrBuilder::createInContainer<IterDomain>(container(), id);
          });
      producer = IrBuilder::createInContainer<TensorView>(
          container(),
          IrBuilder::createInContainer<TensorDomain>(
              container(),
              logical,
              loop,
              TensorDomain::getContiguityFilledWith(logical, true),
              /*skip_loop_validation=*/true),
          getDataType().value());
    } else {
    Allocation Domain Mapping

    The new allocation domain mapping logic (lines 1272-1279) assumes that all IDs in the old allocation domain exist in the producer_map. This might not hold true in all scenarios, particularly with complex transformations or when reduction IDs are involved, potentially causing runtime crashes or incorrect memory layouts.

    if (consumer->domain()->hasAllocation()) {
      std::vector<IterDomain*> mapped_alloc;
      mapped_alloc.reserve(old_domain->allocation().size());
      for (auto* c_id : old_domain->allocation()) {
        mapped_alloc.push_back(producer_map.at(c_id));
      }
      producer->setAllocationDomain(mapped_alloc, true);
    }

    Test failures

    • (Medium, 1) Tensor numerical mismatches in nvFuser matmul tests (H100 runner)

      Test Name (H100): HopperMatmulTest.HSH_NT_UseScheduler_MultipleInstructionsPerWarpTile (Source: Link)

    out->split(1, 16);
    out->setAllocationDomain(out->getLoopDomain(), true);
    // restore loop domain
    out->merge(1);
    Collaborator

    This doesn't restore. Is this necessary?

    Collaborator Author

    Touché. It unsplits the loop domain so that it has the same size as the logical domain.
    You are right that the extent is no longer the same, so it's not a restoration.

    Schedulers expect an un-scheduled fusion. Without this merge, I'm hitting the assert here:

    NVF_ERROR(broadcast_bit_multiples.size() == ref_loop.size());

    Collaborator

    Hmm, not sure that's a good enough WAR, though this is just a test.

    I thought the schedulers can work with some scheduled loop domains (for DID parallelization), no?

    Collaborator Author

    // We always cacheBefore output at the beginning of the scheduling. And after
    // cacheBefore, the reference tensor will have all reduction IDs removed.
    ref_loop = TensorDomain::noDevices(TensorDomain::noReductions(ref_loop));

    DID-related IDs are just ignored by the scheduler, so that's too specific to multi-device.

    I'm not a fan of this either. Let me see if I can skip messing with the loop domain and play the transformation on allocation directly.

    Collaborator

    I suppose you can just modify the allocation domain with AbstractTensor. I remember there are some tests.

    Collaborator Author

    I can also directly use IterDomain::split for that.

    Anyway, it looks like if the transformation is not on the logical-to-loop path, our replay won't pick it up. Feels similar to the allocation domain replay where rfactor was missing. fyi @Priya2698

    #0  nvfuser::nvfCheckFail (func=0xaaaaac218080 "validateDomainEquivalence",
        file=0xaaaaac216938 "/opt/pytorch/nvfuser/csrc/ir/utils.cpp", line=1162,
        msg=" INTERNAL ASSERT FAILED at /opt/pytorch/nvfuser/csrc/ir/utils.cpp:1162, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. \nExpected !compare_result.dom0_has_u"...) at /opt/pytorch/nvfuser/csrc/exceptions.cpp:267
    #1  0x0000aaaaab1bbe68 in nvfuser::nvfErrorFail (func=0xaaaaac218080 "validateDomainEquivalence",
        file=0xaaaaac216938 "/opt/pytorch/nvfuser/csrc/ir/utils.cpp", line=1162,
        condMsg=0xaaaaac217fd8 " INTERNAL ASSERT FAILED at /opt/pytorch/nvfuser/csrc/ir/utils.cpp:1162, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. ",
        userMsg="Expected !compare_result.dom0_has_unreachable_ids . dom0 has unreachable IDs. dom0: iS10{i0}, iS11{i2}. dom1: iS10{i0}") at /opt/pytorch/nvfuser/csrc/exceptions.cpp:277
    #2  0x0000aaaaab60a3e8 in nvfuser::ir_utils::validateDomainEquivalence (
        dom0=std::vector of length 2, capacity 2 = {...}, dom1=std::vector of length 1, capacity 3 = {...},
        additional_ids=std::vector of length 0, capacity 0) at /opt/pytorch/nvfuser/csrc/ir/utils.cpp:1162
    #3  0x0000aaaaab4aac30 in nvfuser::TensorDomain::setAllocationDomain (this=0xaaaab20918b0,
        new_allocation_domain=std::vector of length 1, capacity 3 = {...},
        new_contiguity=std::vector of length 1, capacity 3 = {...})
        at /opt/pytorch/nvfuser/csrc/ir/nodes.cpp:4055
    #4  0x0000aaaaabc7b368 in nvfuser::TransformReplay::replayCasP (consumer=0xaaaab2088c00,
        producer=0xaaaab2091200, producer_pos=2, logical_map=..., opt=...)
        at /opt/pytorch/nvfuser/csrc/transform_replay.cpp:917
    #5  0x0000aaaaabc7b7fc in nvfuser::TransformReplay::replayCasP (consumer=0xaaaab2088c00,
        producer=0xaaaab2091200, compute_at_axis=-1, opt=...)
        at /opt/pytorch/nvfuser/csrc/transform_replay.cpp:945
    #6  0x0000aaaaabc44ccc in nvfuser::TensorView::cacheBefore (this=0xaaaab2088c00,
        op_type=nvfuser::LoadStoreOpType::Set) at /opt/pytorch/nvfuser/csrc/tensor_view.cpp:1160
    #7  0x0000aaaaabbdb250 in nvfuser::scheduler_utils::cacheAndForkOutputs (fusion=0xaaaab2084910,
        unroll=true) at /opt/pytorch/nvfuser/csrc/scheduler/utils.cpp:1357
    #8  0x0000aaaaabb067dc in nvfuser::schedulePointwise (fusion=0xaaaab2084910, pparams=0xaaaab207f880)
        at /opt/pytorch/nvfuser/csrc/scheduler/pointwise.cpp:822
    #9  0x0000aaaaabb0898c in nvfuser::PointWiseScheduler::schedule (this=0xaaaab2083460,
        fusion=0xaaaab2084910, params=0xaaaab207f880)
        at /opt/pytorch/nvfuser/csrc/scheduler/pointwise.cpp:1304
    

    Collaborator

    So, what did you decide to do? Nothing seems to have changed?

    I can also directly use IterDomain::split for that.

    Of course, but you'd need to maintain the proper ordering of the ID vector yourself.

    Collaborator

    I can also directly use IterDomain::split for that.

    Anyway, it looks like if the transformation is not on the logical-to-loop path, our replay won't pick it up. Feels similar to the allocation domain replay where rfactor was missing. fyi @Priya2698

    Yes, rfactor replay for allocation will also complain similarly if the allocation transforms are disjoint from root-to-loop.
    replayPasC also uses the loop domain as the target, so if you intend to use IterDomain::split, we will have to update that, among other things.

    Collaborator Author

    Yep, switched to selfReplay instead of replayCasP for TensorView::cacheBefore.

    }
    };

    TEST_F(LayoutOpTest, LogicalAndAllocationSizes) {
    Collaborator

    What is being tested here?

    Collaborator Author

    Without the relaxation in vectorization analysis, this test would trigger an assert.

    So the test just verifies that we now allow an allocation domain split.
    In the follow-up PR, we add more validation to this test to check that the produced tensor matches the logical sizes.

    Collaborator
    @Priya2698 Priya2698 left a comment

    The changes look good for the multidevice support part. I am not familiar enough with the requirements for LayoutOp, so I will defer to Naoya to approve the PR.
    Is there an existing issue or doc detailing the LayoutOp design?

    @jjsjann123
    Collaborator Author

    !test

    @jjsjann123
    Collaborator Author

    !test

    @jjsjann123
    Collaborator Author

    Is there an existing issue or doc detailing the LayoutOp design?

    Sorry, I don't have anything on that yet. I'll try to write one up when I have the end-to-end example working, at least in a prototype. Mostly trying to wing it at the moment.

    @jjsjann123
    Collaborator Author

    !test


    // Replay loop.
    if (self_loop != self->logical()) {
    ReplaySelf replay(self_loop, axis_map);
    Collaborator

    Just FYI: #4585 reversed this. I expect some tests to break.

    Collaborator Author

    Thanks a ton. Let me sweep through failing tests and see if there's anything easy to patch. 🧑‍💼

    @jjsjann123
    Collaborator Author

    !test

    @jjsjann123
    Collaborator Author

    !test

    fusion.addOutput(out);
    // padding output to multiple of 16 on allocation domain
    auto&& [io, ii] = IterDomain::split(
    out->axis(1), IrBuilder::create<Val>(16L, DataType::Index), true);
    Collaborator Author

    Tagging @naoyam: changed the test to only apply the split on logical -> allocation.

    Contributor
    @greptile-apps greptile-apps bot left a comment

    12 files reviewed, 3 comments


    Comment on lines +1186 to +1189
    domain()->logical() | std::views::transform([](IterDomain* id) {
    id->setDefinition(nullptr);
    return id->resetRFactorProduct();
    }),
    Contributor

    logic: mutating IterDomain objects by clearing their definitions with setDefinition(nullptr) affects the original objects that may still be referenced elsewhere in the codebase, potentially causing issues if other code paths expect these definitions to be intact

    Comment on lines +105 to +108
    // FIXME: this doesn't feel right, we have to mark contiguity on axis(0) as
    // `false` to avoid accidntal indexing collapsing, this should be figured out
    // by indexing from the ceilDiv.
    out->setAllocationDomain({out->axis(0), io, ii}, {false, true, true});
    Contributor

    logic: the FIXME indicates the contiguity workaround (false for axis(0)) is hiding an indexing bug - the indexing logic should automatically handle ceilDiv from splits without manual contiguity adjustments

    Comment on lines +117 to +127
    // FIXME: output shape inference is not correct yet.
    // output should remain the correct logical size
    // EXPECT_EQ(
    // cg_outputs[0].as<at::Tensor>().sizes(), std::vector<int64_t>({512,
    // 9}));
    // padding on the inner dimension is represented as stride on the outer
    // dimension
    EXPECT_EQ(
    cg_outputs[0].as<at::Tensor>().strides(), std::vector<int64_t>({16, 1}));
    // We need to slice because output buffer shape is not right
    EXPECT_TRUE(t0.equal(cg_outputs[0].as<at::Tensor>().slice(1, 0, k)));
    Contributor

    logic: shape inference is broken for non-divisible splits - the output tensor has wrong size requiring manual slicing to verify correctness, which means production code using this feature will produce incorrectly-shaped outputs

    @jjsjann123 jjsjann123 changed the base branch from main to jj/refactor_allocation_domain_lowering November 25, 2025 19:46
    jjsjann123 and others added 2 commits November 25, 2025 12:08
    Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
    Contributor
    @greptile-apps greptile-apps bot left a comment

    8 files reviewed, 1 comment


    Comment on lines +1186 to +1188
    domain()->logical() | std::views::transform([](IterDomain* id) {
    id->setDefinition(nullptr);
    return id->resetRFactorProduct();
    Contributor

    logic: in-place mutation of IterDomain objects via setDefinition(nullptr) modifies objects that may still be referenced elsewhere (e.g., in producer_map, graph structures, or other tensor views). The comment mentions this is needed for BestEffortReplay, but this mutation happens BEFORE the consumer domain is created and could affect code that expects these IDs to retain their definitions

    @jjsjann123
    Collaborator Author

    !test

    @jjsjann123
    Collaborator Author

    !test

    @jjsjann123
    Collaborator Author

    Looks like it's hitting a vectorization validation error now. I'm not sure if it's coming from changes in #5594.

    Taking a look at that now.

    @jjsjann123 jjsjann123 marked this pull request as draft December 2, 2025 01:13
    @jjsjann123 jjsjann123 changed the base branch from jj/refactor_allocation_domain_lowering to main December 2, 2025 17:12
    @jjsjann123
    Collaborator Author

    !test

    @jjsjann123 jjsjann123 changed the base branch from main to jj/skip_vectorization_allocation_validation December 4, 2025 00:22
    @jjsjann123
    Collaborator Author

    !test

    Contributor
    @greptile-apps greptile-apps bot left a comment

    7 files reviewed, 3 comments


    Comment on lines +1197 to +1199
    domain()->logical() | std::views::transform([](IterDomain* id) {
    id->setDefinition(nullptr);
    return id->resetRFactorProduct();
    Contributor

    logic: calling setDefinition(nullptr) mutates the original IterDomain objects from domain()->logical(). These IDs are shared and may be referenced by:

    • old_domain (stored at line 1237)
    • producer_map entries (populated at lines 1140, 1180)
    • Other tensors in the fusion graph
    • Downstream transform replay operations

    The mutation happens before the consumer domain is created, so any code expecting these IDs to retain their definitions will break. The comment mentions this is needed for BestEffortReplay, but the timing and scope of this mutation need validation.

    Comment on lines +105 to +108
    // FIXME: this doesn't feel right, we have to mark contiguity on axis(0) as
    // `false` to avoid accidntal indexing collapsing, this should be figured out
    // by indexing from the ceilDiv.
    out->setAllocationDomain({out->axis(0), io, ii}, {false, true, true});
    Contributor

    logic: FIXME indicates a workaround for broken indexing logic. Setting axis(0) contiguity to false prevents indexing from collapsing dimensions, but this should be automatically inferred from the ceilDiv in the split. This manual workaround masks an underlying indexing bug that could affect other use cases.

    Suggested change
    // FIXME: this doesn't feel right, we have to mark contiguity on axis(0) as
    // `false` to avoid accidntal indexing collapsing, this should be figured out
    // by indexing from the ceilDiv.
    out->setAllocationDomain({out->axis(0), io, ii}, {false, true, true});
    // TODO: Remove this workaround once indexing properly handles ceilDiv splits
    out->setAllocationDomain({out->axis(0), io, ii}, {false, true, true});

    Comment on lines +117 to +127
    // FIXME: output shape inference is not correct yet.
    // output should remain the correct logical size
    // EXPECT_EQ(
    // cg_outputs[0].as<at::Tensor>().sizes(), std::vector<int64_t>({512,
    // 9}));
    // padding on the inner dimension is represented as stride on the outer
    // dimension
    EXPECT_EQ(
    cg_outputs[0].as<at::Tensor>().strides(), std::vector<int64_t>({16, 1}));
    // We need to slice because output buffer shape is not right
    EXPECT_TRUE(t0.equal(cg_outputs[0].as<at::Tensor>().slice(1, 0, k)));
    Contributor

    logic: shape inference is broken - the output tensor has the wrong size and requires manual slicing to verify correctness. This means production code using non-divisible splits will produce incorrectly-shaped outputs, breaking downstream operations that depend on tensor shapes.

    The commented-out assertion at lines 119-121 shows the expected behavior isn't working. This is a critical correctness issue that needs resolution before merge.

    @jjsjann123 jjsjann123 marked this pull request as draft December 4, 2025 19:42