
Resource aware task optimization#269

Merged
guosran merged 36 commits into main from feature/resource-aware-task-optimization
Mar 4, 2026
Conversation


@guosran guosran commented Feb 17, 2026

Overview

This PR introduces ResourceAwareTaskOptimizationPass, a two-phase MLIR pass that optimizes CGRA resource allocation for the Neura taskflow dialect on a 4×4 CGRA grid (16 CGRAs total).

Also resolves #163: Architecture::cloneWithNewDimensions() enables creating custom-sized architectures for multi-CGRA tile arrays, and MapToAcceleratorPass now accepts x-tiles, y-tiles, and valid-tiles options to map onto non-default tile grids.

Phase 1: Utilization Fusion

Merges independent tasks (no SSA or memory dependency edges in either direction) into a single fused task, sequentially concatenating their loop bodies. This frees up CGRA budget that Phase 2 can reallocate to critical-path bottlenecks. Tasks with value outputs (reduction loops with iter_args) are now supported.
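The independence test above (no dependency path in either direction) can be sketched as a reachability check on the task graph. This is an illustrative Python sketch, not the pass's actual C++ code; the names `has_dependency` and `can_fuse` mirror the renamed helpers mentioned later in the commit log.

```python
# Sketch of the Phase-1 fusability test: two tasks may fuse only if neither
# reaches the other through SSA/memory dependency edges.
def has_dependency(edges, source_node, dest_node):
    """Returns True if dest_node is reachable from source_node via edges."""
    stack, seen = [source_node], set()
    while stack:
        node = stack.pop()
        if node == dest_node:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(dst for (src, dst) in edges if src == node)
    return False

def can_fuse(edges, a, b):
    # Independent in both directions -> safe to concatenate loop bodies.
    return not has_dependency(edges, a, b) and not has_dependency(edges, b, a)

edges = [("Task_0", "Task_2"), ("Task_1", "Task_2")]
print(can_fuse(edges, "Task_0", "Task_1"))  # True: independent siblings
print(can_fuse(edges, "Task_0", "Task_2"))  # False: direct dependency
```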

Phase 2: Latency-Aware Pipeline Balance

Uses the pipelined latency model: latency(task) = II × (trip_count − 1) + steps, where II (compiled_ii) is obtained by speculatively profiling the task through the downstream Neura pipeline with the target multi-CGRA tile array. Assigning more CGRAs to a task gives it a larger tile array, which may lower II if the kernel is resource-bound (ResMII-limited). The pass does not tile or partition the trip_count; cgra_count affects only the mapping array dimensions.

Iteratively finds the critical-path bottleneck (minimum-slack node with highest individual latency) and allocates one additional CGRA to it, repeating until the 16-CGRA budget is exhausted or no improvement is possible.
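The greedy loop can be sketched in a few lines of Python. This is a simplified model rather than the pass's actual code: `profile()` stands in for speculative profiling, the bottleneck pick uses plain maximum latency instead of the full slack-based criterion, and the toy profile's II trajectory (3 → 2 → 1) is chosen to mirror the resource-heavy example mentioned later in this PR.

```python
# Minimal sketch of the Phase-2 balance loop: give one more CGRA to the
# highest-latency task, keep the grant only if the re-profiled II improves
# latency, and stop when the budget runs out or nothing improves.
def latency(ii, trip_count, steps):
    return ii * (trip_count - 1) + steps

def balance(tasks, profile, budget=16):
    alloc = {name: 1 for name in tasks}
    used = len(tasks)
    while used < budget:
        def lat(name):
            ii, steps = profile(name, alloc[name])
            return latency(ii, tasks[name]["trip_count"], steps)
        bottleneck = max(tasks, key=lat)   # highest-latency task
        before = lat(bottleneck)
        alloc[bottleneck] += 1             # speculative grant
        if lat(bottleneck) >= before:      # no improvement: roll back, stop
            alloc[bottleneck] -= 1
            break
        used += 1
    return alloc

# Toy profile: II drops from 3 to 2 to 1 as the tile array grows, then saturates.
def toy_profile(name, cgras):
    return max(1, 4 - cgras), 10

tasks = {"conv": {"trip_count": 64}, "pool": {"trip_count": 8}}
print(balance(tasks, toy_profile))  # {'conv': 3, 'pool': 1}
```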

The outer loop (max 10 iterations) alternates fusion and balance until convergence (no change in either phase).


Speculative Profiling for compiled_ii and steps

To obtain accurate II and steps without waiting for full compilation:

  1. Phase 1 (Taskflow → Neura): Clone the parent func::FuncOp, strip all tasks except the target, run ConstructHyperblockFromTask → ClassifyCounters → ConvertTaskflowToNeura on the clone to produce neura.kernel ops.
  2. Phase 2 (Neura pipeline): Clone each kernel body into a standalone func::FuncOp tagged accelerator="neura", then run the full Neura lowering pipeline (LowerAffinePass → ConvertSCFToCFPass → AssignAccelerator → LowerMemRefToNeura → LowerArithToNeura → ... → InsertDataMovPass).

compiled_ii Extraction — Trade-offs

| Source | When used | Accuracy |
| --- | --- | --- |
| MapToAcceleratorPass mapping_info.compiled_ii | All ops are DataMov-wrapped AND total ops ≤ 150 | Highest (real modulo scheduler result) |
| max(ResMII, RecMII) | Mapper skipped (size guard or DataMov guard fails) | Lower bound, conservative |
| Default ii=1, steps=1 | Phase 1 or 2 pipeline fails entirely | Pessimistic fallback |

Guard conditions for the mapper:

  • DataMov completeness: All non-reserve operand producers must be neura.data_mov. If InsertDataMovPass did not fully wrap all operands (which happens for kernels with complex control flow), the mapper asserts.
  • Op count limit (kMapperOpLimit = 150): Prevents exponential backtracking in the modulo scheduler during speculative profiling of large kernels.
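The source-selection logic in the table and guards above amounts to a small decision function. The sketch below is a hedged paraphrase with assumed names (`select_ii`, `run_mapper`); only `kMapperOpLimit = 150` and the fallback order come from the description.

```python
# Sketch of compiled_ii source selection per the guard conditions.
K_MAPPER_OP_LIMIT = 150

def select_ii(op_count, all_data_movs_ok, res_mii, rec_mii, run_mapper):
    """Picks the II estimate according to the mapper guards."""
    if all_data_movs_ok and op_count <= K_MAPPER_OP_LIMIT:
        mapped_ii = run_mapper()      # real modulo-scheduler result
        if mapped_ii is not None:
            return mapped_ii
    # Mapper skipped (size or DataMov guard) -> conservative lower bound.
    return max(res_mii, rec_mii)

print(select_ii(40, True, 2, 3, lambda: 4))   # 4: mapper result used
print(select_ii(400, True, 2, 3, lambda: 4))  # 3: size guard trips, max(ResMII, RecMII)
```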

Multi-CGRA Tile Array Sizing

When a task is assigned cgra_count > 1, the profiler constructs a custom architecture via Architecture::cloneWithNewDimensions() with tile dimensions (shape.rows × per_cgra_rows) × (shape.cols × per_cgra_cols). For non-rectangular shapes (L, T, offset), an explicit valid_tiles list is passed to MapToAcceleratorPass so the mapper only uses the tiles that actually exist.
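The sizing rule can be made concrete with a small sketch. Field and function names here (`tile_array`, `cgra_positions`) are illustrative; the 4×4 per-CGRA dimensions are an assumption consistent with the grid described in this PR.

```python
# Sketch of multi-CGRA tile-array sizing: the profiled architecture is
# (shape.rows * per_cgra_rows) x (shape.cols * per_cgra_cols), and valid_tiles
# lists only tiles belonging to CGRAs that actually exist in the shape.
def tile_array(shape_rows, shape_cols, cgra_positions, per_rows=4, per_cols=4):
    total_rows = shape_rows * per_rows
    total_cols = shape_cols * per_cols
    valid = []
    for (cx, cy) in cgra_positions:          # CGRA grid coordinates
        for r in range(per_rows):
            for c in range(per_cols):
                valid.append((cx * per_cols + c, cy * per_rows + r))
    return total_rows, total_cols, valid

# L-shape of 3 CGRAs inside a 2x2 bounding box: (0,0), (1,0), (0,1).
rows, cols, valid = tile_array(2, 2, [(0, 0), (1, 0), (0, 1)])
print(rows, cols, len(valid))  # 8 8 48 -> 3 of 4 CGRAs, 48 of 64 tiles valid
```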

Split-Profile for Fused Tasks

After fusion, the fused task body contains N sequential loop nests. ConvertTaskflowToNeuraPass asserts hyperblock_count == 1, so we cannot profile the fused task directly. Instead:

  1. Create a temporary single-loop wrapper task for each top-level loop nest.
    • For affine.for loops with iter_args, the wrapper task declares matching value_output_types and wires the cloned loop's results into the yield — this prevents a type mismatch that would cause ConstructHyperblockFromTask/ConvertTaskflowToNeura to produce no kernels.
  2. Profile each independently.
  3. Assign max(ii) and sum(steps) to the fused task.
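Step 3's aggregation rule is simple enough to state directly in code (illustrative sketch; the function name is assumed): the fused task pipelines each nest at the worst of the individual IIs, while the sequential bodies add their step counts.

```python
# Sketch of split-profile aggregation for a fused task with N loop nests:
# each nest is profiled independently, then combined as max(ii), sum(steps).
def aggregate_split_profiles(profiles):
    """profiles: list of (ii, steps) pairs, one per single-loop wrapper task."""
    fused_ii = max(ii for ii, _ in profiles)
    fused_steps = sum(steps for _, steps in profiles)
    return fused_ii, fused_steps

# Two loop nests fused into one task:
print(aggregate_split_profiles([(4, 9), (2, 6)]))  # (4, 15)
```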

Changes to Existing Passes

Architecture (Architecture.h / Architecture.cpp)

  • Constructor now stores multi_cgra_base_topology_, per_cgra_base_topology_, tile_defaults_, tile_overrides_, link_defaults_, and link_overrides_ as member fields (previously discarded after initialization).
  • New method cloneWithNewDimensions(rows, cols, additional_overrides) creates a fresh Architecture with different per-CGRA dimensions, enabling multi-CGRA tile array profiling. (Resolves [P1] Model multi-cgra in arch spec #163)

MapToAcceleratorPass (MapToAcceleratorPass.cpp)

  • New options: x-tiles, y-tiles, valid-tiles for overriding architecture dimensions at pass construction time.
  • New constructor MapToAcceleratorPass(const MapToAcceleratorOptions &) and corresponding factory function.
  • When tile overrides are specified, builds a custom architecture with explicit tile existence masks for non-rectangular shapes.

InsertDataMovPass (InsertDataMovPass.cpp)

  • Extended dialect filter to also process arith and math dialect ops (previously only neura).
  • Added skip rules for neura.reserve, neura.kernel, and neura.fused ops.

Test Coverage

| Test | Input Tasks | Fusions | Final Tasks | CGRA Allocation |
| --- | --- | --- | --- | --- |
| irregular-loop | 3 (incl. reduction) | 1 utilization (Task_0 + Task_1) | 2 | cgra_count=2+1 = 3 CGRAs |
| parallel-nested | 2 → 1 (fused) | 1 utilization | 1 | cgra_count=2, total=2 |
| multi-nested | 5 → 3 (1 streaming + 1 util) | 1 utilization | 3 | 2+2+2 = 6 CGRAs |
| resnet | 13 → 6 | 4 utilization | 6 | 2+1+2+2+1+1 = 9 CGRAs |

Known Limitations

  1. Perfectly-nested assumption for trip_count: For non-perfectly-nested loops inside a task body, computeTripCount multiplies inner-loop counts of each top-level loop structure. This is accurate for the current workloads (convolutions, matmuls).
  2. kMapperOpLimit = 150: Large kernels skip MapToAcceleratorPass and fall back to ResMII/RecMII bounds. This is a deliberate performance vs. accuracy trade-off for speculative profiling.
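Limitation 1's trip-count rule (nested loops multiply, sequential top-level loops add) can be sketched as follows; this matches the `computeTripCount` behavior described in the commit log below, with an assumed representation of loop nests as lists of per-level trip counts.

```python
from math import prod

# Sketch of computeTripCount: each top-level nest contributes the product of
# its per-level trip counts, and sequential nests at the same level sum.
def compute_trip_count(top_level_nests):
    """top_level_nests: list of nests, each a list of trip counts, outermost first."""
    return sum(prod(nest) for nest in top_level_nests)

# 'for i=0..10 { for j=0..5 }' is one nest -> 50;
# sequential 'for i=0..10; for j=0..5' is two nests -> 15, not 50.
print(compute_trip_count([[10, 5]]))    # 50
print(compute_trip_count([[10], [5]]))  # 15
```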

…ce and fusion

- Add two-phase optimization: Utilization Fusion + Latency-Aware Pipeline Balance
- Implement pipelined latency model: latency = II * (ceil(trip_count/cgra_count) - 1) + steps
- Add fallback profiling using operation counting for robust performance estimation
- Critical path detection using slack analysis for bottleneck identification
- Task fusion for independent tasks to free up CGRA budget
- Support 4x4 CGRA grid (16 total) with complete allocation
- All 4 taskflow lit tests passing (multi-nested, parallel-nested, irregular-loop, resnet)
- Environment-agnostic: no Neura-specific analysis APIs, only standard MLIR operations
…erage

Bug fixes:
- Fix RecMII computation: use cycle.length (excl. reserve/ctrl_mov) instead
  of cycle.operations.size(), consistent with MapToAcceleratorPass
- Fix PipelineBalancer: the outer for-loop was dead code due to 'return' inside
  the first iteration; refactor to recompute critical path each CGRA increment
- Fix placeholder generation in profileTask: replace type-specific AllocOp /
  ConstantIntOp with UnrealizedConversionCastOp which handles all types
  including dynamic-shape MemRefs without requiring dynamic-size operands
- Fix fusion guard: skip tasks with value outputs (reduction/iter_args loops)
  to prevent assertion failure in replaceTaskResults

New features:
- Add WAW (write-after-write) memory dependency edges to prevent incorrect
  fusion of tasks that write the same memref in program order
- Improve computeTripCount: walk only top-level affine.for ops and sum their
  nested products, correctly handling sequential loops at the same IR level
  (e.g. 'for i=0..10; for j=0..5' yields 15 not 50)
- Persist trip_count attribute at convergence alongside cgra_count/ii/steps

Cleanups:
- Remove unused #include <cmath>
- Add RESOPT lit checks for irregular-loop test (previously uncovered)

Tests: 4/4 PASS (irregular-loop, parallel-nested, multi-nested, resnet)
@guosran guosran requested review from ShangkunLi and Copilot and removed request for Copilot February 17, 2026 21:35
Copilot AI review requested due to automatic review settings February 17, 2026 21:51
Contributor

Copilot AI left a comment


Pull request overview

This PR adds a new MLIR optimization pass that fuses independent Taskflow tasks and balances CGRA allocation using a pipelined latency model, plus updates several multi-CGRA tests to exercise the new behavior.

Changes:

  • Introduces ResourceAwareTaskOptimizationPass implementing utilization fusion + latency-aware CGRA rebalancing with speculative profiling.
  • Wires the new pass into build/registration (CMake + Passes.td/h).
  • Extends Taskflow MLIR tests with --resource-aware-task-optimization RUN lines and RESOPT FileCheck assertions.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| lib/TaskflowDialect/Transforms/Optimizations/ResourceAwareTaskOptimizationPass.cpp | Implements the new two-phase optimization pass and speculative profiling pipeline |
| lib/TaskflowDialect/Transforms/Optimizations/CMakeLists.txt | Builds/links the new pass into the optimization library |
| include/TaskflowDialect/TaskflowPasses.td | Registers the new pass and its summary/description |
| include/TaskflowDialect/TaskflowPasses.h | Exposes the factory method for the new pass |
| test/multi-cgra/taskflow/resnet/simple_resnet_tosa.mlir | Adds RUN + FileCheck coverage for RESOPT expectations |
| test/multi-cgra/taskflow/parallel-nested/parallel-nested.mlir | Adds RUN + RESOPT checks (but currently duplicated) |
| test/multi-cgra/taskflow/multi-nested/multi-nested.mlir | Adds RUN + RESOPT checks |
| test/multi-cgra/taskflow/irregular-loop/irregular-loop.mlir | Adds RUN + RESOPT checks |
| test/benchmark/Zeonica_Testbench | Updates submodule pointer |
| debug.log | Adds a debug artifact containing a crash backtrace/logs |


…oss iterations

- Remove duplicate RESOPT RUN+FileCheck block in parallel-nested.mlir
  that was a copy-paste error (identical input/output/check-prefix).

- Persist ii, steps, and trip_count to IR during intermediate iterations
  (alongside cgra_count) so that graph.build() on subsequent iterations
  can skip expensive speculative profiling for unchanged tasks via the
  existing has_precomputed guard.
@tancheng
Contributor

Shouldn't we fix #260 first to align the task/func/kernel?

@ShangkunLi
Collaborator

Shouldn't we fix #260 first to align the task/func/kernel?

I think they are orthogonal. This PR performs optimizations on the task dependency graph, regardless of how we construct that graph.

For now, we build the task dependency graph based on the affine loops within one func. We can further extend it so that we can create a task dependency graph from multiple funcs.

@ShangkunLi
Collaborator

Overview

This PR introduces ResourceAwareTaskOptimizationPass, a two-phase MLIR pass that optimizes CGRA resource allocation for the Neura taskflow dialect on a 4×4 CGRA grid (16 CGRAs total).

Phase 1: Utilization Fusion

Merges independent tasks (no SSA or memory dependency edges in either direction) into a single fused task, sequentially concatenating their loop bodies. This frees up CGRA budget that Phase 2 can reallocate to critical-path bottlenecks.

Phase 2: Latency-Aware Pipeline Balance

Uses the pipelined latency model:

latency(task) = II × (⌈trip_count / cgra_count⌉ − 1) + steps

Why should we use ⌈trip_count / cgra_count⌉ for a task to estimate its performance on multi-cgras?

I think we still need to use trip_count because even though we assign multiple cgras for a task, we just combine the tile arrays into a larger tile array for mapping. As for ⌈trip_count / cgra_count⌉, it is more like executing a task through unrolling, in which case we should partition the task into multiple parallel tasks.

@ShangkunLi
Collaborator

How long does this pass take (e.g., on resnet test)?

…timization

Includes:
- Updated latency model to II * (trip_count - 1) + steps.
- Fixed dependency analysis to include WAR (Write-After-Read) edges.
- Enforced strict profiling assertions (no more silent fallbacks).
- Updated fusion metric (minimize |trip_count diff|) and trip_count calculation (max).
- Added support for arbitrary connected shapes on 4x4 grid.
- Added detailed Tile Occupation Summary and tile_shape attribute.
- Renamed hasPath to hasDependency and totalCGRAs to getTotalAllocatedCGRAs.
- Updated all multi-CGRA test expectations.
…store accurate expected latency formula, and append tile occupation maps to test files
…asks, restore accurate expected latency formula, and append tile occupation maps to test files"

This reverts commit 8c3c86b.
…est CHECK lines

- Fix latency formula: II * (trip_count - 1) + steps (removes tiling assumption)
- Add speculative re-profile with rollback in balance phase
- Add CGRAShape struct and pickBestShape() for optimal tile layout
- Rename ii attribute to compiled_ii in IR output
- Add WAR edges to task dependency graph builder
- Make profileTask public; split-profile logic moved into profileTask
- Pass asserts on profiling failure (no silent fallback)
- Rename hasPath->hasDependency, totalCGRAs->getTotalAllocatedCGRAs
- Update all 4 RESOPT test CHECK lines to match new profiling results
- Add standardized 4x4 CGRA tile occupation diagrams to all 4 tests

Resolves regression after Revert commit 6e91448.
@guosran
Collaborator Author

guosran commented Feb 26, 2026

How long does this pass take (e.g., on resnet test)?

resnet takes ~50s, multi-nested takes ~30s, irregular-loop and parallel-nested take ~13s each.

…usion

Lift the limitation that excluded tasks with value_outputs (reductions/
iter_args) from utilization fusion. Changes:

1. Remove the findBestFusionCandidate guard that skipped tasks with
   non-empty value_outputs.

2. Fix split-profiling for iter_args loops: when creating throwaway
   tmp_task wrappers for profiling, mirror the affine.for result types
   as value_output_types and wire the cloned loop results into the
   yield. This prevents ConstructHyperblockFromTask/ConvertTaskflowToNeura
   from producing empty kernels due to type mismatch.

3. Lift IRMapping out of scoped blocks so mappings survive into the
   yield-value collection step (Step 9).

4. Add collectYieldValues lambda to gather value_results from each
   original task's yield via IRMapping.

5. Extend replaceTaskResults to handle value_output remapping with
   per-task offsets into the fused task's value_outputs.

6. Update irregular-loop RESOPT CHECK lines: Task_0 (reduction) and
   Task_1 (write-only) now fuse into Task_0_Task_1_utilfused with
   cgra_count=2, compiled_ii=4, steps=15.
Architecture.h:
- Add usage examples to cloneWithNewDimensions (rect + T-shape)

NeuraPasses.h:
- Consolidate two createMapToAcceleratorPass overloads into one with default arg

NeuraPasses.td:
- Clarify x-tiles/y-tiles as tile counts (not CGRA counts)
- Add CLI examples (single CGRA, 1x3 rect, T-shape with valid-tiles)
- Expand option help strings with per_cgra explanation

TaskflowPasses.td:
- Expand shape documentation (rect, L 3/4 CGRAs, T 4 CGRAs)
- Add bounding-box + tile-list explanation for non-rectangular shapes

InsertDataMovPass.cpp:
- Revert namespace relaxation: only process neura dialect ops
- Add comment explaining arith/math should be lowered before this pass

MapToAcceleratorPass.cpp:
- Use description-style comments (Filters/Checks/Skips)
- Add accurate comment on boundary link handling in Architecture
- Consolidate createMapToAcceleratorPass factory

ResourceAwareTaskOptimizationPass.cpp:
- Rename kGridRows/kGridCols -> kCgraGridRows/kCgraGridCols
- Rename .rectangular -> .is_rectangular
- Add kMaxCgrasPerTask=4, enforce in canFitOnGrid and bottleneck loop
- Add CGRAShape.cgra_positions for non-rectangular shapes (L, T)
- Add getNonRectangularShapes() with explicit coordinate definitions
- Update irAttr() to encode non-rect as NxM[(c0,r0)(c1,r1)...]
- Fix valid_tiles generation to use cgra_positions instead of iterating bbox
- Use SSA def-use chains for memory dependency edges (replaces memref+isBeforeInBlock)

All 4 RESOPT tests pass.
- Change attribute types from i64 to i32 for compiled_ii, steps, trip_count
  to maintain consistency with cgra_count attribute type

- Clarify CGRA tile occupation grid comments: replace cryptic single-letter
  task labels (F, T1, etc.) with numeric indices and explicit mappings
  showing full task names for better readability

- Update all CHECK patterns in 4 taskflow test files to match new i32 types

All tests pass (4/4). No functional changes to the optimization algorithm.
ResourceAwareTaskOptimizationPass.cpp:
- Rename CGRAShape -> CgraShape (struct and all usages)
- Rename hasDependency params: from/to -> source_node/dest_node
- Remove \p Doxygen markers from hasDependency doc comment
- Fix comment: Override -> Overrides (description style)
- Remove unused printShapeOptions function (was [[maybe_unused]])
- Add --estimation-mode pass option with two modes:
    compiled (default): full Neura lowering + mapping for accurate II/steps
    analytical: ResMII/RecMII analytical estimates only (faster, no mapper)
  Wired through build(), profileTask(), and fusion/balance lambdas.
  Balance probes always use analytical regardless of mode.
- irAttr() already handles non-rectangular shapes with NxM[(c,r)...] encoding
  (reviewer comment was on old diff; current code is correct)

TaskflowPasses.td:
- Add estimation-mode option definition and documentation
- Add CLI example: --resource-aware-task-optimization estimation-mode=analytical

MapToAcceleratorPass.cpp:
- Simplify boundary-link comment (remove Architecture internals)

All 4 RESOPT tests pass.
This commit addresses assertion failures and infinite loops caused by recent changes to hyperblock construction (PR #259).

1. Fixed computeTripCount to correctly calculate trip counts for tasks without explicit taskflow.counter operations by recursively traversing loop regions.

2. Fixed split-profile to correctly clone only top-level operations, preventing isBeforeInBlock crashes when attempting to clone operations from different nested blocks.

3. Fixed an issue in split-profile where value_output_types were not properly preserved for the temporary task. This prevented intermediate hyperblocks from being deleted by the DCE canonicalization pass, which previously resulted in no kernels being generated and triggered a fatal assertion.

4. Updated test files to explicitly run --construct-hyperblock-from-task before resource optimization, aligning with the new pipeline requirements.
@guosran guosran marked this pull request as draft February 28, 2026 06:33
guosran added 4 commits March 1, 2026 06:55
…unt and LowerAffine in Phase 2 pipeline

- Restore affine.for/scf.for fallback in computeTripCount() for cases
  where taskflow.counter ops are not yet present in the task body
  (e.g. before construct-hyperblock-from-task has run, or when called
  from contexts where counters are not visible via walk()).
  Without this fallback, all trip counts were returning 1, causing
  the optimizer to skip multi-CGRA allocation and task fusion.

- Restore createLowerAffinePass() in runNeuraPipelineOnKernel (Phase 2).
  The kernel body produced by Phase 1 may still contain affine.for ops
  that must be lowered before cf/llvm conversion.

- Restore 'DUMPING PHASE 1 TASK A' debug print in performFusion.

- Restore original comment style in PipelineBalancer::balance() for
  the canFitOnGrid check (remove TODO that was added incorrectly).

Fixes: all 4 multi-cgra/taskflow lit tests now pass.
@guosran guosran marked this pull request as ready for review March 1, 2026 02:35
guosran added 3 commits March 3, 2026 06:57
The ResourceAwareTaskOptimizationPass crashed when calling task_b.erase()
after fusion because performFusion only cloned ops from the entry block,
leaving references to values in non-entry blocks dangling.

Root cause: After convert-cf-to-llvm, task bodies become multi-block
(with llvm.br/llvm.cond_br between blocks). The old code iterated
task_b's front block only, so cloned kernel inputs that referenced
values from other blocks would map to originals via lookupOrDefault,
creating invalid IR that crashed on erase.

Fix: Replace the single-block clone loop with Region::cloneInto() to
clone the entire multi-block region of each source task into the fused
task. The entry blocks of both clones are then spliced together and
the two cloned kernels are merged into one fused kernel.

Also relax TaskflowOps.td body constraint from SingleBlockImplicitTerminator
to AnyRegion to allow multi-block task bodies.

Test updates: Add --verify-each=false to all 4 RESOPT test RUN lines
(required because convert-cf-to-llvm creates intermediate IR with
affine.for + llvm.br successors that fails the MLIR verifier).
Update FileCheck patterns to match actual output format and values.

All 4 tests pass:
  PASS: multi-cgra/taskflow/multi-nested/multi-nested.mlir
  PASS: multi-cgra/taskflow/parallel-nested/parallel-nested.mlir
  PASS: multi-cgra/taskflow/irregular-loop/irregular-loop.mlir
  PASS: multi-cgra/taskflow/resnet/simple_resnet_tosa.mlir
…affine serialization/perfection passes to RESOPT pipeline

- computeTripCount: add strategy-3 fallback for post-transform-ctrl-to-data-flow IR
  * outer loops: extract from arith.cmpi (predicate=slt) + arith.constant RHS
  * inner kernel loops: extract from neura.icmp (cmpType=slt) + rhs_value attribute
  * multiply all bounds to get total trip count
- remove debug llvm::errs() from computeTripCount
- fix extra closing brace from debug output removal
- add --affine-loop-tree-serialization and --affine-loop-perfection at the start
  of RESOPT pipeline in all 4 test files
- update RESOPT FileCheck patterns:
  * resnet: new task names, correct trip_counts, Task_6_Task_8_utilfused gets cgra_count=2/tile_shape=1x2
  * multi-nested: trip_count 1 -> 160/192/36
  * parallel-nested: cgra_count 1->2, ii 7->6, tile_shape 1x1->1x2, trip_count 1->64
  * irregular-loop: steps 10->11, trip_count 1->32 for both tasks

All 4 tests pass.
- Updated computeTripCount to drop unused strategies and added sanity assert.
- Improved comments and documentation for fuse logic and ReserveOp handling.
- Adjusted InsertDataMovPass comments.
- Added test modifications reflecting pipeline changes.
guosran added 4 commits March 4, 2026 02:18
- Add balanceSkipMapper pass option (default: true) so balance probes
  use analytical II estimates by default instead of running the full
  mapper on each speculative CGRA count probe
- At convergence, re-profile tasks with cgra_count > 1 using the real
  mapper so final compiled_ii in the IR reflects true hardware values
- Keep all_data_movs_ok guard in profileTask to prevent mapper crashes
  on tasks containing ops not yet lowered to Neura (e.g. arith.minimumf)
- Update all 4 multi-CGRA tests to use balance-skip-mapper=false so
  tests exercise the real mapper path; update CHECK lines to match
  actual lit-generated output
- Add --verify-each=false to irregular-loop and resnet tests to work
  around pre-existing arith.minimumf type-validation failure after
  lower-arith-to-neura
- Add new test: test/multi-cgra/taskflow/resource-heavy/resource-heavy.mlir
  * Real stereo vision disparity computation kernel
  * Demonstrates multi-CGRA allocation: res_mii 3→2→1 as CGRAs increase
  * Balance ACCEPTS cgra_count 1→2→3 (II: 3→2→1, latency: 199→136→73)
  * Final: cgra_count=3, compiled_ii=1, tile_shape=2x2[(0,0)(1,0)(0,1)]

- Fix convergence re-profiling in ResourceAwareTaskOptimizationPass.cpp
  * When balanceSkipMapper=true (default), don't re-run mapper at convergence
  * The converged graph state is authoritative
  * Removes incorrect mapper re-execution that contradicted balanceSkipMapper semantics

- Add arith lowering patterns in ArithToNeuraPass.cpp
  * ArithMinimumFToNeuraFCmpSel: arith.minimumf → neura.fcmp + neura.sel
  * ArithMaximumFToNeuraFCmpSel: arith.maximumf → neura.fcmp + neura.sel
  * ArithAndIToNeuraAnd: arith.andi → neura.and
  * ArithOrIToNeuraOr: arith.ori → neura.or
  * Fixes mapper guard failures in test kernels

- Update test CHECK lines
  * resnet: updated RESOPT lines to match actual multi-CGRA output (cgra_count=1 for all tasks)
  * irregular-loop: updated RESOPT lines (compiled_ii=2 for Task_2)

All 5 multi-CGRA tests pass:
  - parallel-nested.mlir (1/16 CGRAs)
  - multi-nested.mlir (3/16 CGRAs)
  - irregular-loop.mlir (1/16 CGRAs)
  - resnet.mlir (6/16 CGRAs)
  - resource-heavy.mlir (3/16 CGRAs)
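The res_mii 3→2→1 trend noted for the resource-heavy test follows from the standard resource-minimum-II bound: with N mappable ops and T tiles, at least ⌈N / T⌉ cycles are needed per iteration. The sketch below uses illustrative numbers (~35 ops, 16 tiles per CGRA) that are assumptions, not values taken from the test.

```python
from math import ceil

# Illustrative ResMII bound: ceil(num_ops / num_tiles) cycles per iteration.
def res_mii(num_ops, num_tiles):
    return ceil(num_ops / num_tiles)

# With 4x4 = 16 tiles per CGRA, a ~35-op kernel plausibly sees 3 -> 2 -> 1
# as the allocation grows from 1 to 3 CGRAs:
print([res_mii(35, 16 * n) for n in (1, 2, 3)])  # [3, 2, 1]
```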
…natory comments

This is a continuation of the 'remove excessive docs' commit, completing the cleanup:

**Removed Redundant Code:**
- Dead `result_to_counter` map (was built but never used)
- Unnecessary for loop over `cloned_kernels` (always single kernel post-assert)
- Replaced with direct variable access

**Fixed Issues:**
- Step numbering in performFusion: was Steps 1-5, 10-12 → now Steps 1-8 consecutively

**Comment Cleanup:**
- Reduced verbose doc comments: 13-18 lines → 1-4 lines
  * profileTask: 13 → 3 lines
  * runNeuraPipelineOnKernel: 18 → 4 lines
  * balance(): 10 → 2 lines
  * Other function docs similarly condensed

**Comment Style Unification (3rd person singular + period):**
- "Builds X" → "Builds X."
- "Check X" → "Verifies X." or "Ensures X."
- "Write X" → "Writes X."
- Applied consistently throughout file

**Restored Explanatory Comments:**
- valid_tiles enumeration logic for non-rectangular shapes
- merged_iter_args/merged_kernel_results concatenation purpose
- buildKernelArgMapping lambda mapping logic
- merged_iter_args_next/merged_results yield collection
- fused_yield creation and yield_type preservation
- yield_writes/yield_values mapping to block args
- addUnique lambda deduplication logic
- cp_depth derivation from ALAP scheduling

Result: File reduced from 2011 → 1832 lines (cleaner, still well-documented)
All 5 tests pass; build successful (606/606 targets)
@ShangkunLi ShangkunLi requested a review from tancheng March 4, 2026 06:36
@guosran guosran merged commit 82d6421 into main Mar 4, 2026
1 check passed


Development

Successfully merging this pull request may close these issues.

[P1] Model multi-cgra in arch spec

4 participants