Skip to content

Skip direct-dominating live-in optimization for loop headers#281

Open
guosran wants to merge 12 commits intomainfrom
fix/nested-loop-live-in-bug
Open

Skip direct-dominating live-in optimization for loop headers#281
guosran wants to merge 12 commits intomainfrom
fix/nested-loop-live-in-bug

Conversation

@guosran
Copy link
Collaborator

@guosran guosran commented Mar 8, 2026

Summary

This pr addresses issue #270.

Problem

In nested loop kernels, values defined in outer loop blocks and used in inner loop bodies were incorrectly classified as "direct dominating live-ins" by CanonicalizeLiveInPass. This prevented them from being promoted to block arguments, which in turn caused TransformCtrlToDataFlowPass to not create inner-rate PHI_START operations for them.

Fix

Before applying the direct-dominating optimization, check whether the using block is a loop header. If so, skip the optimization and let the value be promoted to a block argument.

Result

After the fix, the GEP pointer is correctly promoted to a block argument in ^bb4:

^bb1:
  %6 = neura.gep(%arg0, shl(i, 4))
  neura.br ..., %6 : !llvm.ptr, ... to ^bb4

^bb4(%15: i64, %16: i32, %17: !llvm.ptr, ...):  // %17 is now a block arg
  gep(%17, %15)

And TransformCtrlToDataFlowPass creates an inner-rate PHI_START for it:

%9  = neura.gep(%arg0, shl(i, 4))          // outer-rate GEP
%17 = neura.phi_start %9, %16              // inner-rate PHI_START for the pointer
%22 = neura.gep(%17, %21)                  // inner-loop body uses inner-rate pointer

TODO

Have not updated all failed checks yet, will do so once the changes are reviewed and approved. gemv has been updated.

guosran added 11 commits March 5, 2026 12:51
…ment

- findBestPlacement now tries rectangular shapes first, then non-rectangular
  connected shapes, then falls back to k-1 CGRAs (down to 1).
- Removed outdated TODO comment about MapTaskOnCgraPass not supporting
  multi-CGRA placement.
- Added assert for empty tile_shape offsets.
- Cleaned up USER COMMENT annotations.
…p comments

- findBestPlacement tries rect then non-rect shapes for requested cgra_count.
- If placement fails, caller falls back to cgra_count-1 (reject extra CGRA).
- Normalize /// to // comment style throughout MapTaskOnCgraPass.
- Remove outdated TODO comments.
- SRAM centroid now includes ALL CGRA positions of multi-CGRA tasks,
  not just placement[0].
- SSA proximity scoring uses min distance between two multi-CGRA
  placements (minDistToPlacement) instead of only comparing to
  the other task's primary position.
Copilot AI review requested due to automatic review settings March 8, 2026 05:01
@guosran
Copy link
Collaborator Author

guosran commented Mar 8, 2026

Commits from the previous unmerged pr is shown here. This pr should start from 1f60123 (the first failed commit).

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an SSA live-in canonicalization bug impacting nested loops by preventing the “direct-dominating live-in” optimization from firing on loop headers, enabling correct promotion to block arguments and subsequent inner-rate PHI_START creation during ctrl-to-data-flow lowering. It also renames/rewires mapping and task placement passes and updates affected tests/docs accordingly.

Changes:

  • Update CanonicalizeLiveInPass to skip the direct-dominating-live-in optimization for loop headers (fix for #270).
  • Rename MapToAcceleratorPassMapOperationOnTilePass and update pass registration, build files, docs, and tests.
  • Replace MapTaskOnCgraPass with AllocateCgraToTaskPass and invoke global placement from ResourceAwareTaskOptimizationPass, updating taskflow tests.

Reviewed changes

Copilot reviewed 43 out of 46 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
test/neura/steer_ctrl/loop_with_return_value.mlir Updates disabled mapping pass invocation name in test pipeline comments.
test/neura/fusion/test.mlir Updates mapping pass name and adjusts FileCheck IDs after mapping changes.
test/neura/for_loop/relu_test.mlir Updates mapping pass flag name in RUN line.
test/neura/ctrl/branch_for.mlir Updates mapping pass flag name in RUN lines.
test/multi-cgra/taskflow/resource-heavy/resource-heavy.mlir Splits long RESOPT attribute checks for readability/stability.
test/multi-cgra/taskflow/resnet/simple_resnet_tosa.mlir Splits long RESOPT attribute checks for readability/stability.
test/multi-cgra/taskflow/parallel-nested/parallel-nested.mlir Switches to --allocate-cgra-to-task and updates RESOPT/placement checks.
test/multi-cgra/taskflow/multi-nested/multi-nested.mlir Switches to --allocate-cgra-to-task and updates placement + RESOPT expectations.
test/multi-cgra/taskflow/irregular-loop/irregular-loop.mlir Switches to --allocate-cgra-to-task and splits RESOPT checks.
test/multi-cgra/kernel_mapping/relu/relu.mlir Updates mapping pass flag name in RUN line.
test/multi-cgra/kernel_mapping/loop-in-kernel/loop-in-kernel.mlir Updates mapping pass flag name in RUN line.
test/multi-cgra/kernel_mapping/fir/fir.mlir Updates mapping pass flag name in RUN line.
test/mapping_quality/tiny_loop.mlir Updates mapping pass flag name in RUN lines.
test/mapping_quality/branch_for.mlir Updates mapping pass flag name in RUN lines.
test/honor_arch/fir_removed_tiles_test.mlir Updates mapping pass flag name in RUN line.
test/e2e/relu/relu_kernel.mlir Updates mapping pass flag name in RUN line.
test/e2e/histogram/histogram_kernel.mlir Updates mapping pass flag name in RUN line.
test/e2e/gemv/gemv_kernel.mlir Updates mapping pass flag name and refreshes extensive mapping/YAML/ASM FileCheck expectations.
test/e2e/fir/fir_kernel_vec.mlir Updates mapping pass flag name in RUN line.
test/e2e/fir/fir_kernel.mlir Updates mapping pass flag name in RUN line.
test/e2e/axpy/axpy_kernel.mlir Updates mapping pass flag name in RUN line.
test/controflow_fuse/simple_loop_reduction/simple_loop_reduction.mlir Updates mapping pass flag name in RUN line.
test/controflow_fuse/simple_loop/simple_loop.mlir Updates mapping pass flag name in RUN line.
test/controflow_fuse/perfect_nested/perfect_nested.mlir Updates mapping pass flag name in RUN line.
test/code_gen/test_code_generate.mlir Updates mapping pass flag name in RUN line.
test/c2llvm2mlir/simple_loop/test.mlir Updates mapping pass flag name in RUN line.
test/c2llvm2mlir/nested_loop/test.mlir Updates mapping pass flag name in RUN line.
test/arch_spec/README.md Updates documentation to use --map-operation-on-tile.
lib/TaskflowDialect/Transforms/Optimizations/ResourceAwareTaskOptimizationPass.cpp Renames mapper references and invokes AllocateCgraToTask-based placement at convergence.
lib/TaskflowDialect/Transforms/CMakeLists.txt Swaps in AllocateCgraToTaskPass.cpp and adds Optimizations subdirectory.
lib/TaskflowDialect/Transforms/AllocateCgraToTaskPass.cpp Implements new task+SRAM placement pass with multi-CGRA placement search.
lib/TaskflowDialect/CMakeLists.txt Adjusts subdirectory structure so Optimizations is included via Transforms.
lib/NeuraDialect/Transforms/MapOperationOnTilePass.cpp Renames mapper pass (class/argument/logs) and updates factory function.
lib/NeuraDialect/Transforms/CanonicalizeLiveInPass.cpp Skips direct-dominating-live-in optimization when the using block is a loop header.
lib/NeuraDialect/Transforms/CMakeLists.txt Replaces MapToAcceleratorPass.cpp with MapOperationOnTilePass.cpp.
lib/NeuraDialect/NeuraPasses.cpp Updates conversion pipeline to run createMapOperationOnTilePass().
include/TaskflowDialect/TaskflowPasses.td Renames pass definition to allocate-cgra-to-task.
include/TaskflowDialect/TaskflowPasses.h Updates pass factory name and adds runAllocateCgraToTask() helper declaration.
include/NeuraDialect/NeuraPasses.td Renames pass definition to map-operation-on-tile and updates constructor name.
include/NeuraDialect/NeuraPasses.h Renames pass factory API to createMapOperationOnTilePass.
include/NeuraDialect/Architecture/Architecture.h Updates reference in comment to new mapper pass name.
Comments suppressed due to low confidence (4)

lib/TaskflowDialect/Transforms/AllocateCgraToTaskPass.cpp:285

  • placement.primary() returns {-1,-1} when placement.cgra_positions is empty, but the code still pushes it into task_node->placement. That can result in invalid task_mapping_info (row/col = -1) and can skew the SRAM centroid computation in later iterations. Handle the no-placement case explicitly (e.g., signal pass failure / emit an error, or keep the task unannotated and skip SRAM assignment) before committing any placement.
    lib/TaskflowDialect/Transforms/AllocateCgraToTaskPass.cpp:281
  • When a placement can’t be found for cgra_count, the pass falls back to cgra_count-1 but does not update the task’s cgra_count (or tile_shape) attribute accordingly. That can leave IR in an inconsistent state where cgra_count disagrees with the number of entries in task_mapping_info.cgra_positions, which downstream passes may rely on. Consider either failing the pass, or updating the attributes to match the chosen fallback placement (and possibly iterating fallback down to 1).
    lib/TaskflowDialect/Transforms/AllocateCgraToTaskPass.cpp:432
  • parseTileShapeOffsets() is introduced but never used, and the placement search currently ignores each task’s tile_shape attribute (even though other code paths/comments imply placement should respect it). If tile_shape is intended to constrain the physical shape, wire this helper into findBestPlacement() (read the task’s tile_shape StringAttr, validate it matches cgra_count, then call tryPlaceShape() with those offsets). Otherwise, remove the dead helper / update comments to reflect that the allocator chooses shapes autonomously.
    lib/TaskflowDialect/Transforms/AllocateCgraToTaskPass.cpp:553
  • The non-rectangular shape search uses a 64-bit bitmask (1ULL << (nr * grid_cols_ + nc)). If grid_rows * grid_cols >= 64, this shift becomes undefined and the mask cannot represent all cells. Add a guard/assert limiting the supported grid size (e.g., 4x4) or switch to a wider/dynamic bitset representation for visited states.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Values crossing from outer blocks to inner loop headers were marked as
direct-dominating live-ins, preventing block-arg promotion. This caused
missing inner-rate PHI_STARTs in the dataflow IR, starving inner-loop
operations of valid data each cycle.

Added back-edge check: if the using block has a predecessor it dominates
(i.e. it is a loop header), the value is promoted to a block argument.
@guosran guosran force-pushed the fix/nested-loop-live-in-bug branch from 1f60123 to fa5fb36 Compare March 8, 2026 05:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants