[feat] Kernel-level fusion#276

Merged
HobbitQia merged 18 commits into coredac:main from HobbitQia:kernel_fusion
Mar 6, 2026

@HobbitQia
Collaborator

PR 251

Implements --fuse-task, a new pass at the Taskflow IR level that merges adjacent loop kernels before lowering to Neura. Two strategies are supported:

  • Producer-consumer: fuses two tasks where the first writes an intermediate buffer consumed by the second; the intermediate store/load is eliminated.
  • Sibling: fuses two tasks that share input arrays but have no data dependency.
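
As a rough illustration of how the two strategies differ, a candidate pair can be classified from its read and write sets alone. The sketch below is not the pass's actual code: `Task` and `classify_pair` are hypothetical names, tasks are reduced to sets of buffer names, and the rule that any other dependency blocks fusion is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    # Hypothetical stand-in for a taskflow.task: only its memref read/write sets.
    name: str
    reads: frozenset
    writes: frozenset


def classify_pair(first: Task, second: Task) -> Optional[str]:
    """Return 'producer-consumer', 'sibling', or None for an ordered task pair."""
    # Producer-consumer: the second task reads a buffer the first one writes.
    if first.writes & second.reads:
        return "producer-consumer"
    # Assumption: any other dependency (write-after-read, write-after-write)
    # rules the pair out in this sketch.
    if (second.writes & first.reads) or (first.writes & second.writes):
        return None
    # Sibling: no data dependency, but at least one shared input array.
    if first.reads & second.reads:
        return "sibling"
    return None


t0 = Task("Task_0", frozenset({"A", "B"}), frozenset({"C"}))
t1 = Task("Task_1", frozenset({"C"}), frozenset({"D"}))
t2 = Task("Task_2", frozenset({"A"}), frozenset({"E"}))

print(classify_pair(t0, t1))  # producer-consumer
print(classify_pair(t0, t2))  # sibling
```

Here `t0` and `t1` mirror the producer-consumer example in this description, while `t2` is an invented task that only shares the input `A` with `t0`.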

Both strategies are gated by an MII-based profitability check: the pass speculatively lowers candidate kernels through the full Taskflow → Neura pipeline on a cloned module, measures rec_mii and res_mii, and only fuses when the fused MII does not exceed the cost of running the two tasks independently.
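
The profitability gate can be pictured with a small sketch. It takes the MII of a kernel as the larger of `rec_mii` and `res_mii`, which is the usual modulo-scheduling lower bound. How the pass computes "the cost of running the two tasks independently" is not spelled out here, so the sum of the two individual MIIs used below is an assumption, and `Metrics` and `is_profitable` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    rec_mii: int  # recurrence-constrained minimum initiation interval
    res_mii: int  # resource-constrained minimum initiation interval

    @property
    def mii(self) -> int:
        # Lower bound on the initiation interval: the larger of the two.
        return max(self.rec_mii, self.res_mii)


def is_profitable(first: Metrics, second: Metrics, fused: Metrics) -> bool:
    # ASSUMPTION: the independent cost is modeled as the sum of the two MIIs,
    # i.e. the tasks run one after the other. The real pass may differ.
    independent_cost = first.mii + second.mii
    # Fuse only when the fused MII does not exceed the independent cost.
    return fused.mii <= independent_cost


# Fused kernel needs II >= 3, the two separate kernels cost 2 + 2: fuse.
print(is_profitable(Metrics(1, 2), Metrics(1, 2), Metrics(1, 3)))  # True
# Fusion creates a long recurrence (rec_mii = 5 > 4): do not fuse.
print(is_profitable(Metrics(1, 2), Metrics(1, 2), Metrics(5, 3)))  # False
```

The numbers are made up for illustration; in the pass they come from lowering the candidate kernels on a cloned module.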

Example: Producer-Consumer
Before fusion:

// Task 0: C[i] = A[i] + B[i]
%t0 = taskflow.task @Task_0 read_memrefs(%A, %B) write_memrefs(%C) : ... {
  ^bb0(%i: index):
    %a = memref.load %A[%i] : memref<64xf32>
    %b = memref.load %B[%i] : memref<64xf32>
    %s = arith.addf %a, %b : f32
    memref.store %s, %C[%i] : memref<64xf32>
}
// Task 1: D[i] = C[i] * 2.0  (consumes C from Task 0)
%t1 = taskflow.task @Task_1 read_memrefs(%t0) write_memrefs(%D) value_inputs(%cst) : ... {
  ^bb0(%i: index):
    %v = memref.load %C[%i] : memref<64xf32>
    %r = arith.mulf %v, %cst : f32
    memref.store %r, %D[%i] : memref<64xf32>
}

After fusion:

// Fused: D[i] = (A[i] + B[i]) * 2.0
%t = taskflow.task @fused_pc read_memrefs(%A, %B) write_memrefs(%C, %D) value_inputs(%cst) : ... {
  ^bb0(%i: index):
    %a = memref.load %A[%i] : memref<64xf32>
    %b = memref.load %B[%i] : memref<64xf32>
    %s = arith.addf %a, %b : f32
    %r = arith.mulf %s, %cst : f32
    memref.store %r, %D[%i] : memref<64xf32>
}
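
To see why eliminating the intermediate store/load keeps the result intact, the two versions can be modeled as plain loops. This is only a behavioral sketch in Python, with hypothetical function names `unfused` and `fused`; it says nothing about how the pass rewrites the IR.

```python
def unfused(A, B, cst):
    """Two separate loops: Task 0 writes C, Task 1 reads C and writes D."""
    n = len(A)
    C = [0.0] * n
    D = [0.0] * n
    for i in range(n):  # Task 0: C[i] = A[i] + B[i]
        C[i] = A[i] + B[i]
    for i in range(n):  # Task 1: D[i] = C[i] * cst
        D[i] = C[i] * cst
    return D


def fused(A, B, cst):
    """One loop: the sum stays in a local value, so C is never stored or loaded."""
    n = len(A)
    D = [0.0] * n
    for i in range(n):
        s = A[i] + B[i]  # replaces the store to C and the later load from C
        D[i] = s * cst
    return D


A = [float(i) for i in range(64)]
B = [0.5 * i for i in range(64)]
assert unfused(A, B, 2.0) == fused(A, B, 2.0)
```

Both functions perform the same floating-point operations in the same order per element, so the outputs match exactly.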

@HobbitQia HobbitQia requested review from ShangkunLi and tancheng and removed request for tancheng March 3, 2026 07:13
@HobbitQia
Collaborator Author

@ShangkunLi You can check the function computeRealMetrics, where I create a cloned module and use a pass manager to run the full pass pipeline on the current code to obtain the statistics.

@HobbitQia HobbitQia merged commit 3f3ad34 into coredac:main Mar 6, 2026
1 check passed