Motivation
On Hopper, efficient gemm requires warp specialization, which is not currently supported by nvFuser. This doc proposes extending nvFuser to support this optimization. I believe this will benefit not only matmul, but also other cases like optimal cat/stack scheduling, horizontal fusion, etc.; see the "Potential applications" section for more detail.
Design
Notation: I will mostly use the term "task parallelism" for the new feature being added to nvFuser. "Warp specialization" is a special case of "vertical task parallelism" (described below) on the thread index.
Partition of DAG as tasks
In order to use task parallelism, we first need to partition the DAG into tasks. Tasks are non-overlapping and dense; that is, every Val in the fusion definition except fusion inputs belongs to a task (fusion inputs are special because they are given instead of computed), and one Val can belong to only one task. Initially, all Vals belong to task 0. Example partition:
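As a toy illustration of this invariant, here is a minimal Python sketch (the `Val`/`task` names are hypothetical stand-ins, not nvFuser's C++ API):

```python
# Illustrative model only; Val/task are hypothetical Python stand-ins,
# not the nvFuser C++ API.
class Val:
    def __init__(self, name, is_input=False):
        self.name = name
        self.is_input = is_input
        # Inputs are given rather than computed, so they belong to no task;
        # every other Val initially belongs to task 0.
        self.task = None if is_input else 0

def check_partition(vals):
    """The partition must be dense and non-overlapping: every non-input
    Val belongs to exactly one task."""
    for v in vals:
        if v.is_input:
            assert v.task is None, f"{v.name} is an input, not computed"
        else:
            assert isinstance(v.task, int), f"{v.name} must be in one task"

tv0 = Val("tv0", is_input=True)
tv1, tv2 = Val("tv1"), Val("tv2")
tv1.task, tv2.task = 1, 2  # move tv1 and tv2 out of the default task 0
check_partition([tv0, tv1, tv2])
```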
Grouping tasks into a hierarchical structure
Tasks are further grouped into task groups. Task groups form a hierarchical structure.
Parallelization of task groups
A task group can be parallelized, for example:

group1->parallelize(ParallelType::TIDy);
group3->parallelize(ParallelType::TIDz);

Not all task groups can be parallelized. A parallelizable group is either a "horizontal group" or a "vertical group". A "horizontal group" is a group whose members have no data dependencies on each other; for example, group 1 is a horizontal group. A "vertical group" is a group whose members are connected by data dependencies; for example, group 3 is a vertical group.
Below is an example where group 4 is neither a horizontal group nor a vertical group:
However, you can make it a horizontal group by grouping group 2 and group 3 together:
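The two parallelizability conditions can be sketched as graph checks on the task dependency structure. This is a minimal Python model with hypothetical names, not nvFuser API; the vertical-group check uses one plausible reading of "connected":

```python
# Illustrative model only. deps maps each task to the tasks whose
# outputs it consumes.

def depends(deps, a, b):
    """True if task a transitively consumes data produced by task b."""
    stack, seen = list(deps.get(a, [])), set()
    while stack:
        t = stack.pop()
        if t == b:
            return True
        if t not in seen:
            seen.add(t)
            stack.extend(deps.get(t, []))
    return False

def is_horizontal(deps, members):
    """Horizontal group: no member depends on any other member."""
    return all(not depends(deps, a, b)
               for a in members for b in members if a != b)

def is_vertical(deps, members):
    """Vertical group: members form one connected component of the
    (undirected) dependency graph restricted to the group."""
    members = set(members)
    edges = {m: set() for m in members}
    for m in members:
        for d in deps.get(m, []):
            if d in members:
                edges[m].add(d)
                edges[d].add(m)
    stack, seen = [next(iter(members))], set()
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(edges[t])
    return seen == members

# Mirroring the running example: task 5 (the cat) consumes tasks 1-4.
deps = {"task5": ["task1", "task2", "task3", "task4"]}
assert is_horizontal(deps, ["task1", "task2", "task3", "task4"])
assert is_vertical(deps, ["task1", "task5"])
```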

Expression sorting
Expression sorting must be task and task group aware. For the above example, the sorted expressions can be

group 3

which zooms into
group 1
group 2

which zooms into
task 1
task 2
task 3
task 4
task 5
// the following order is also valid, and expr sort is free to choose one from all valid orders
task 3
task 4
task 1
task 2
task 5
// There are more valid orders...
// In later context, we assume the first order is picked

which zooms into
tv1 = sin(tv0);
tv5 = cos(tv1);
tv2 = tan(tv0);
tv6 = relu(tv2);
tv3 = exp(tv0);
tv7 = sigmoid(tv3);
tv4 = log(tv0);
tv8 = tanh(tv4);
tv9 = cat([tv5, tv6, tv7, tv8]);
tv10 = neg(tv9);

Loop nest generation
When generating the loop nest, an unparallelized group simply generates its members one after another. A parallelized group generates kir::IfThenElse nodes to dispatch between its members.
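As a toy model of this dispatch logic, the following Python sketch (not nvFuser's actual kir lowering) derives the predicate and shifted loop index for each member of a horizontal group parallelized on TIDy, assuming member i owns the index range [i*size0, (i+1)*size0):

```python
# Hypothetical sketch, not nvFuser's kir lowering: derive, for each
# member of a horizontal group parallelized on TIDy, the dispatch
# predicate and the shifted loop index the branch iterates over.
def dispatch_branches(num_members, extent="size0"):
    branches = []
    for i in range(num_members):
        # Member i owns [i*extent, (i+1)*extent); shift threadIdx.y so
        # each branch sees a local index starting at 0.
        idx = "threadIdx.y" if i == 0 else f"threadIdx.y - {i}*{extent}"
        branches.append((f"{idx} >= 0 && {idx} < {extent}", idx))
    return branches

for i, (pred, idx) in enumerate(dispatch_branches(4)):
    kw = "IF" if i == 0 else "ELSE IF"
    print(f"{kw} {pred}:  // loop index: {idx}")
```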
Assuming tv0-tv8 have [BIDx, TIDy{size0}, TIDx], tv9 and tv10 have [BIDx, TIDy{size0*4}, TIDx] (the cat dim is 1; I am assuming the cat dim is untouched by the scheduler in this example), and inline most.
Then the generated loop nest structure will be
FOR blockIdx.x in BIDx:
  IF threadIdx.z == 0:
    IF threadIdx.y >= 0 && threadIdx.y < size0:
      FOR threadIdx.y in TIDy{size0}:
        FOR threadIdx.x in TIDx:
          tv1 = sin(tv0);
          tv5 = cos(tv1);
    ELSE IF threadIdx.y - size0 >= 0 && threadIdx.y - size0 < size0:
      FOR threadIdx.y - size0 in TIDy{size0}:
        FOR threadIdx.x in TIDx:
          tv2 = tan(tv0);
          tv6 = relu(tv2);
    ELSE IF threadIdx.y - 2*size0 >= 0 && threadIdx.y - 2*size0 < size0:
      FOR threadIdx.y - 2*size0 in TIDy{size0}:
        FOR threadIdx.x in TIDx:
          tv3 = exp(tv0);
          tv7 = sigmoid(tv3);
    ELSE IF threadIdx.y - 3*size0 >= 0 && threadIdx.y - 3*size0 < size0:
      FOR threadIdx.y - 3*size0 in TIDy{size0}:
        FOR threadIdx.x in TIDx:
          tv4 = log(tv0);
          tv8 = tanh(tv4);
  ELSE IF threadIdx.z == 1:
    FOR threadIdx.y in TIDy{4*size0}:
      FOR threadIdx.x in TIDx:
        tv9 = cat([tv5, tv6, tv7, tv8]);
        tv10 = neg(tv9);

Synchronization
For the parallelization of horizontal task groups, synchronization must happen before and after the dispatch. Depending on the parallel type, a block sync or grid sync might be needed. For the parallelization of vertical task groups (a.k.a. warp specialization), the tensors at the parallelization boundary (in this case tv5-tv8) must be double/circular buffered, and arrive-wait barriers are used for synchronization.
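The circular-buffering requirement at a vertical boundary can be illustrated with a toy producer/consumer model. This is a Python simulation only: real code would use shared-memory arrive-wait barriers, for which Python semaphores merely stand in:

```python
# Toy model of the handoff at a vertical-group boundary. The producer
# role corresponds to one warp group (e.g. loads), the consumer to the
# other (e.g. mma+store); semaphores stand in for arrive-wait barriers.
import threading

DEPTH = 2                 # circular-buffer depth (double buffering)
N = 8                     # number of tiles handed off
buf = [None] * DEPTH      # the boundary tensor's circular buffer
empty = threading.Semaphore(DEPTH)  # slots the producer may fill
full = threading.Semaphore(0)       # slots the consumer may read
results = []

def producer():
    for i in range(N):
        empty.acquire()           # wait until a slot is free
        buf[i % DEPTH] = i * i    # "load" tile i into its slot
        full.release()            # signal the slot is ready

def consumer():
    for i in range(N):
        full.acquire()            # wait until tile i is ready
        results.append(buf[i % DEPTH])
        empty.release()           # free the slot for reuse

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
```

With DEPTH=1 the two sides would fully serialize; DEPTH=2 lets the producer fill one slot while the consumer drains the other, which is exactly why the boundary tensors must be double/circular buffered.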
Potential applications
Efficient matmul on Hopper
Warp specialization is used: loads and mma+store are performed in different warps.
Horizontal fusion
For example, suppose we have multiple separate fusions, independently scheduled. If all fusions use only BIDx but not BIDy or BIDz, then we can trivially fuse them horizontally by partitioning each fusion as a task in the combined fusion and parallelizing these tasks horizontally on BIDx.
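A minimal sketch of the bookkeeping involved, assuming a hypothetical pass that gives each sub-fusion a contiguous range of blockIdx.x (names are illustrative, not nvFuser API):

```python
# Hypothetical sketch: horizontally fuse independent fusions that each
# use only BIDx by assigning each one a contiguous blockIdx.x range.
def assign_block_ranges(grid_sizes):
    """grid_sizes[i] = number of BIDx blocks fusion i needs.
    Returns (total blocks, list of half-open (start, end) ranges)."""
    ranges, start = [], 0
    for n in grid_sizes:
        ranges.append((start, start + n))
        start += n
    return start, ranges

def dispatch(block_ranges, bidx):
    """Which fusion runs on this block, and its local blockIdx.x."""
    for fusion_id, (lo, hi) in enumerate(block_ranges):
        if lo <= bidx < hi:
            return fusion_id, bidx - lo
    raise ValueError("blockIdx.x out of range")

# Three independent fusions needing 3, 5, and 2 blocks respectively.
total, ranges = assign_block_ranges([3, 5, 2])
assert total == 10
assert dispatch(ranges, 4) == (1, 1)  # block 4 -> fusion 1, local block 1
```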
Cat/stack schedule
For cat/stack, the size of the output tensor is naturally the sum of the sizes of the inputs, so we could parallelize the computation of the inputs the same way group 1 is parallelized in the above example.
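The dispatch pattern for cat can be modeled in plain Python: each output index falls into exactly one input's slice, mirroring the ELSE-IF chain in the loop-nest example above (illustrative only, not nvFuser code):

```python
# Toy model of cat computed branch-wise: each output index y is owned
# by exactly one input, which computes it at local index y - lo. This
# mirrors dispatching each input's task to its own TIDy range.
def cat_via_dispatch(inputs):
    extents = [len(x) for x in inputs]
    out = []
    for y in range(sum(extents)):
        lo = 0
        for inp, e in zip(inputs, extents):
            if lo <= y < lo + e:          # this input owns index y
                out.append(inp[y - lo])   # shifted local index
                break
            lo += e
    return out
```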


