
[WIP DO NOT REVIEW] "Fold" operations to generalize Reductions#2307

Closed
jacobhinkle wants to merge 22 commits into main from fold_ops

Conversation


@jacobhinkle jacobhinkle commented May 28, 2024

This PR is an experiment in generalizing ReductionOp to handle a wider class of operations than those in BinaryOpType. The approach can be summarized as follows:

  1. Introduce a new IterType::Fold which represents dimensions that are "being reduced". These IterDomains must always be inlined with one another. The term "fold" was chosen to not conflict with the existing "reduction" terminology.
  2. Introduce new IR nodes representing begin and end of a fold operation. A "fold group" is defined as all the ops between these two nodes.
  3. When fold groups are finalized, the output tensors have IterType::Reduction dimensions.
  4. During lowering, translate these nodes into assignments using kir::Assign nodes, which allow us to reassign a variable. This lets us update accumulator tensors inside a loop, for example.
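The steps above can be illustrated with a minimal sketch. This is not nvFuser code or its API; it is a hypothetical Python analogue showing how a fold group, delimited by a begin and an end, lowers to loop-carried reassignment of an accumulator, in the spirit of the kir::Assign semantics described in step 4:

```python
def fold_sum(values):
    # "fold begin": initialize the accumulator for the fold group.
    acc = 0.0
    # The fold dimension corresponds to this loop; each iteration
    # reassigns acc (analogous to a kir::Assign node in lowering).
    for v in values:
        acc = acc + v
    # "fold end": finalize the group; acc holds the reduced output,
    # analogous to the output gaining an IterType::Reduction dimension.
    return acc

print(fold_sum([1.0, 2.0, 3.0]))  # 6.0
```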

The goals of this design are:

  1. Enable non-trivial reductions like online softmax.
  2. Enable nested reductions for cases that cannot easily be written as rfactors. The outer loop of FlashAttention1 is a good example of such a case.
  3. Avoid reinventing, as much as possible, the complicated machinery in lowering such as inlining semantics and indexing.
  4. If there is a non-awkward way to implement scan in this setting, do so, but only as a secondary consideration. The current implementation seeks to do this by allowing both scan and reduction outputs when finalizing a fold group.
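As a concrete instance of goal 1, the standard one-pass online softmax recurrence is the kind of non-trivial reduction a fold group would express: it carries two coupled accumulators (a running max and a running denominator) that must be updated together inside the loop. A plain-Python sketch for reference (again, not nvFuser code):

```python
import math

def online_softmax(xs):
    # Two loop-carried accumulators, updated jointly each iteration:
    # m is the running max, d the running (rescaled) denominator.
    m = float("-inf")
    d = 0.0
    for x in xs:
        m_new = max(m, x)
        # Rescale the partial denominator whenever the running max grows.
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # Finalization: apply the completed accumulators elementwise.
    return [math.exp(x - m) / d for x in xs]
```

The joint update of `m` and `d` is exactly what a single BinaryOpType-based ReductionOp cannot express, which motivates treating the whole loop body as one fold group.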
