
demo how to schedule allocation domain and get domains to be allocated #4791

Merged
liqiangxl merged 5 commits into main from llu/get_domains_should_be_allocated on Jul 24, 2025

Conversation

Collaborator

@liqiangxl liqiangxl commented Jul 17, 2025

Follows #4792.
This PR adds a test that manually schedules the allocation domain, and uses IdModel to detect the mapping between the scheduled allocation domain and the loop domain.
Automatic scheduling is added in a follow-up PR, #4795.


github-actions bot commented Jul 17, 2025

Review updated until commit 8763303

Description

  • Added test for scheduling allocation domain with CpAsyncBulk1d

  • Enhanced getAllocationDomainsAndContiguity to use IdModel for mapping

  • Included necessary headers and cleaned up code


Changes walkthrough 📝

Relevant files
Enhancement

allocation.cpp
Use IdModel for ID mapping in allocation domain setup

csrc/device_lower/pass/allocation.cpp (+10/-0)

  • Added logic to use IdModel for mapping excluded IDs in
    getAllocationDomainsAndContiguity

test_allocation_domain.cpp
Add CpAsyncBulk1d test and clean up

tests/cpp/test_allocation_domain.cpp (+77/-0)

  • Added new test case CpAsyncBulk1d for scheduling allocation domain
  • Included necessary header for inlining
  • Cleaned up code and comments

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review

    IdModel Usage

    Ensure that the use of IdModel is appropriate and that it correctly identifies mappings between scheduled allocation domains and loop domains.

    // Fallback: use IdModel to check if any excluded ID is mapped
    if (GpuLower::current()->hasIdModel()) {
      const auto& exact_graph =
          GpuLower::current()->idModel().idGraph(IdMappingMode::EXACT);
      for (auto exclude_id : exclude_ca_ids) {
        if (exact_graph.disjointValSets().strictAreMapped(exclude_id, id)) {
          return exclude_id;
        }
      }
    }
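
The fallback above can be illustrated with a small standalone model. This is only a sketch under stated assumptions: `ToyId`, `ToyExactGraph`, `findExcluded`, and `allocatedElements` are hypothetical names invented for illustration and are not nvFuser APIs. The union-find stands in for the disjoint sets of the exact graph, and the lookup tries pointer equality first, then falls back to set membership.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in for an iteration domain: a name plus an extent.
// NOT nvFuser's IterDomain; it only models identity and size.
struct ToyId {
  std::string name;
  long extent;
};

// Minimal union-find keyed by object address. It plays the role of the
// disjoint sets that an exact-mapping graph maintains (hypothetical model).
class ToyExactGraph {
 public:
  // Record that two IDs are equivalent, e.g. outputs of identical splits.
  void mapIds(const ToyId* a, const ToyId* b) {
    const ToyId* ra = find(a);
    const ToyId* rb = find(b);
    if (ra != rb) {
      parent_[ra] = rb;
    }
  }

  // True if both IDs belong to the same equivalence class.
  bool areMapped(const ToyId* a, const ToyId* b) {
    return find(a) == find(b);
  }

 private:
  const ToyId* find(const ToyId* id) {
    auto it = parent_.find(id);
    if (it == parent_.end()) {
      parent_[id] = id;  // first time seen: singleton set
      return id;
    }
    if (it->second == id) {
      return id;
    }
    const ToyId* root = find(it->second);
    parent_[id] = root;  // path compression
    return root;
  }

  std::unordered_map<const ToyId*, const ToyId*> parent_;
};

// Return the excluded ID that corresponds to `id`, or nullptr if none does.
// Pointer equality is tried first; the graph lookup is the fallback.
const ToyId* findExcluded(
    const ToyId* id,
    const std::vector<const ToyId*>& excluded,
    ToyExactGraph& graph) {
  for (const ToyId* ex : excluded) {
    if (ex == id) {
      return ex;
    }
  }
  for (const ToyId* ex : excluded) {
    if (graph.areMapped(ex, id)) {
      return ex;
    }
  }
  return nullptr;
}

// Product of the extents of allocation IDs that match no excluded ID.
long allocatedElements(
    const std::vector<const ToyId*>& alloc,
    const std::vector<const ToyId*>& excluded,
    ToyExactGraph& graph) {
  long n = 1;
  for (const ToyId* id : alloc) {
    assert(id != nullptr);
    if (findExcluded(id, excluded, graph) == nullptr) {
      n *= id->extent;
    }
  }
  return n;
}
```

In this toy model, mapping iS11{3} to iS15{3} and excluding {iS6, iS11} from the allocation domain (iS15{3}, iS16{4}, iS6{2}, iB8{16}) leaves iS16{4} and iB8{16}, that is 4 * 16 = 64 elements.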
    Test Coverage

    Verify that the test case covers all necessary scenarios and edge cases for the allocation domain scheduling.

    TEST_F(AllocationDomainTest, CpAsyncBulk1d) {
      NVFUSER_TEST_CUDA_ARCH_GUARD(9, 0);
      auto fusion = std::make_unique<Fusion>();
      FusionGuard fg(fusion.get());
      int64_t x = 2L, y = 12L, z = 16L;
      auto tv0 = makeContigConcreteTensor({x, y, z});
      fusion->addInput(tv0);
      std::vector<IterDomain*> tv0_dom = {tv0->axis(1), tv0->axis(0), tv0->axis(2)};
      tv0->setAllocationDomain(tv0_dom, true);
      auto tv2 = add(tv0, tv0);
      fusion->addOutput(tv2);
    
      auto tv1 = tv0->cacheAfter(LoadStoreOpType::CpAsyncBulk);
      tv1->setMemoryType(MemoryType::Shared);
      tv1->axis(-1)->parallelize(ParallelType::Bulk);
    
      for (auto tv : fusion->allTvs()) {
        // [2, 3, 4, 16]
        tv->split(1, 4);
      }
    
      inlineSelectedAt({tv1}, tv1, /*reference_pos=*/2);
    
      // Before the fix, we have:

      // T2_s_float[iS6{2}, iS11{3}, iS12{4}, iB8{16}] ca_pos( 2 )
      // logical domain : (iS6{2}, iS7{12}, iB8{16})
      // allocation domain : (iS7{12}, iS6{2}, iB8{16})
      // contiguity: t t t
      //  Split: iS7{12} by factor 4 -> iS11{3}, iS12{4}
      // loop domain : (iS6{2}, iS11{3}, iS12{4}, iB8{16})

      // T2 is computed at pos 2, so we don't need to allocate domains iS6{2}
      // and iS11{3}. nvFuser tries to exclude these two domains from the
      // allocation domain. However, iS11{3} doesn't exist in the allocation
      // domain, so it is not excluded, and this is considered a failed case.

      // To fix this, we can replay the transforms on the allocation domain:
      //   1. Create an AbstractTensor from the current allocation domain.
      //   2. Apply the same split transformation to the allocation domain.
      //   3. Update the allocation domain.
      AbstractTensor alloc_tensor(tv1->getAllocationDomain());
      alloc_tensor.split(0, 4);
      tv1->setAllocationDomain(alloc_tensor.as<IterDomain*>(), true);
      // After this change to the allocation domain, we have:
      // T2_s_float[iS6{2}, iS11{3}, iS12{4}, iB8{16}] ca_pos( 2 )
      // logical domain : (iS6{2}, iS7{12}, iB8{16})
      // allocation domain : (iS15{3}, iS16{4}, iS6{2}, iB8{16})
      // contiguity: t t t t
      //  Split: iS7{12} by factor 4 -> iS15{3}, iS16{4}
      //  Split: iS7{12} by factor 4 -> iS11{3}, iS12{4}
      // loop domain : (iS6{2}, iS11{3}, iS12{4}, iB8{16})

      // Based on the loop domain and compute position, we don't need to
      // allocate iS6{2} and iS11{3}. However, the allocation-domain counterpart
      // of iS11{3} is iS15{3}. How do we map them in
      // getAllocationDomainsAndContiguity()? Use IdModel if pointer comparison
      // fails. IdModel maintains disjoint sets of mapped IDs (disjointValSets):
      // id_sets: disjoint sets{
      //   { iS3{2}; iS6{2}; iS0{2} }
      //   { iS4{12}; iS7{12}; iS1{12} }
      //   { iS13{3}; iS11{3}; iS15{3}; iS9{3} }
      //   { iS14{4}; iS12{4}; iS16{4}; iS10{4} }
      //   { iS5{16}; iB8{16}; iS2{16} }
      // }
      // where iS11{3} and iS15{3} are in the same set.
    
      auto options = at::TensorOptions().dtype(at::kFloat).device(at::kCUDA);
      // shape: (x, y, z), alloc: (y, x, z), stride: (z, x * z, 1)
      auto t0 = at::randn({x, y, z}, options).as_strided({x, y, z}, {z, x * z, 1});
      KernelExecutor ke;
      ke.compile(fusion.get(), {t0});
      auto outputs = ke.run({t0});
      testValidate(fusion.get(), outputs, {t0}, __LINE__, __FILE__);
    }
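
The stride comment in the test, (z, x * z, 1) for logical shape (x, y, z) with allocation order (y, x, z), can be checked with a short standalone calculation. This is a sketch only: `stridesForAllocationOrder` is a hypothetical helper written for illustration, not part of nvFuser or ATen, and it assumes a fully contiguous buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Given logical sizes and a permutation that lists logical axes from
// outermost to innermost in memory, return the strides (indexed by logical
// axis) of a contiguous buffer laid out in that order.
std::vector<int64_t> stridesForAllocationOrder(
    const std::vector<int64_t>& sizes,
    const std::vector<std::size_t>& alloc_order) {
  assert(sizes.size() == alloc_order.size());
  std::vector<int64_t> strides(sizes.size(), 0);
  int64_t running = 1;
  // Walk from the innermost allocated axis outward, accumulating extents.
  for (auto it = alloc_order.rbegin(); it != alloc_order.rend(); ++it) {
    assert(*it < sizes.size());
    strides[*it] = running;
    running *= sizes[*it];
  }
  return strides;
}
```

For sizes {2, 12, 16} and allocation order {1, 0, 2}, this yields strides {16, 32, 1}, which matches (z, x * z, 1) with x = 2 and z = 16.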

    @liqiangxl liqiangxl changed the base branch from main to llu/refactor_getAllocationDomainsAndContiguity July 17, 2025 14:02
    @liqiangxl (Collaborator Author)

    !test

    @liqiangxl liqiangxl force-pushed the llu/get_domains_should_be_allocated branch from de0c7fb to 1338ae0 Compare July 17, 2025 14:53
    @liqiangxl (Collaborator Author)

    !test

    @liqiangxl liqiangxl force-pushed the llu/refactor_getAllocationDomainsAndContiguity branch from 040e349 to 3a06741 Compare July 17, 2025 18:14
    @liqiangxl liqiangxl force-pushed the llu/get_domains_should_be_allocated branch from 1338ae0 to 5e7592e Compare July 17, 2025 19:18
    @liqiangxl liqiangxl marked this pull request as ready for review July 22, 2025 01:09
    Base automatically changed from llu/refactor_getAllocationDomainsAndContiguity to main July 22, 2025 12:15
    @liqiangxl (Collaborator Author)

    !test

    @liqiangxl liqiangxl requested a review from jjsjann123 July 22, 2025 15:52
    @jjsjann123 (Collaborator) left a comment

    LGTM

    // }
    // where iS11{3} and iS15{3} are in the same set.

    fusion->print();
    Collaborator


    nitpick: remove debug code.

    // Update the allocation domain
    AbstractTensor alloc_tensor(tv1->getAllocationDomain());
    alloc_tensor.split(0, 4);
    tv1->setAllocationDomain(alloc_tensor.as<IterDomain*>(), true);
    Collaborator


    IIUC, the replay on allocation domain needs to be done by the scheduler. So there's going to be another PR plumbing that?

    Collaborator Author


    Yes, it is here

    // replay loop domain transformations to allocation domain for shared memory
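
The replay discussed in this thread amounts to applying the same split to the allocation domain that was applied to the loop domain. The effect on extents can be modeled with a toy helper. This is a sketch under assumptions: `splitExtent` is a hypothetical function that operates on plain extent lists, not on nvFuser's AbstractTensor, and it only handles divisible splits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split the extent at `axis` into (outer, inner), where inner == factor.
// Toy model: assumes the extent is divisible by the factor.
std::vector<int64_t> splitExtent(
    std::vector<int64_t> extents,
    std::size_t axis,
    int64_t factor) {
  assert(axis < extents.size());
  assert(factor > 0 && extents[axis] % factor == 0);
  const int64_t outer = extents[axis] / factor;
  extents[axis] = outer;
  // Insert the inner extent immediately after the outer one.
  extents.insert(
      extents.begin() + static_cast<std::ptrdiff_t>(axis) + 1, factor);
  return extents;
}
```

With the extents from the test, splitting axis 1 of the loop domain {2, 12, 16} by 4 gives {2, 3, 4, 16}, and splitting axis 0 of the allocation domain {12, 2, 16} by 4 gives {3, 4, 2, 16}. Both splits act on the same logical axis of extent 12, which is why their outputs end up in the same disjoint sets.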

    return exclude_id;
    }
    }
    }
    Collaborator


    Looks like the existing comment for this function already contains this piece, well planned sir 😆

    @liqiangxl
    Copy link
    Collaborator Author

    !build

    @liqiangxl liqiangxl merged commit b7afdc0 into main Jul 24, 2025
    17 checks passed
    @liqiangxl liqiangxl deleted the llu/get_domains_should_be_allocated branch July 24, 2025 01:29
    nsarka pushed a commit to nsarka/Fuser that referenced this pull request Jul 28, 2025
    …loop domain (NVIDIA#4791)
    
    (1) added a test to manually schedule allocation domain.
    (2) use IdModel to detect mapping between scheduled allocation domain and loop domain.