Decouple loop and allocation during preseg for multi-GPU

This post summarizes why loop domains and allocation domains need to be decoupled during preseg for multi-GPU. Beyond the philosophical reason that they represent different concepts, there are practical reasons as well.

Context: This comes from several offline discussions with @Priya2698. Although we are on the same page, I think it's worth writing down our thought process for posterity. 

# First principles

At the start of segmentation,
1. Loop and allocation may differ in the following ways:
   a. Loop can be more parallelized than allocation. 
   b. Loop and allocation can be in different orders.  
2. The allocation of every segmentation-boundary tensor must be finalized.

# Rationales

1(a) is needed for stream parallelization. For example, 
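```
TensorView* tv0 = makeContigTensor(2);
TensorView* tv1 = set(tv0);
TensorView* tv2 = set(tv1);

// tv1 is produced by a host loop over axis 0 (rows).
tv1->axis(0)->parallelize(ParallelType::Stream);
// tv2 consumes tv1 in a host loop over axis 1 (columns).
tv2->axis(1)->parallelize(ParallelType::Stream);
```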
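Here, `tv1` is produced in a row-wise host loop and consumed in a column-wise host loop. Each column-wise iteration reads from every row of `tv1`, so all rows must be materialized before the consumer runs.
Unlike its loop domain, therefore, `tv1` needs to be allocated fully, i.e., its allocation domain is the same as its logical domain, without parallelization.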
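To make the access pattern concrete, here is a minimal standalone sketch in plain C++. It does not use nvFuser; `FullBuffer`, `produceRowWise`, and `consumeColumnWise` are hypothetical names introduced only for illustration, and the two outer loops stand in for the two Stream-parallelized host loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for tv1: a fully allocated, row-major rows x cols buffer.
// Hypothetical type for illustration only; not part of nvFuser.
struct FullBuffer {
  std::size_t rows;
  std::size_t cols;
  std::vector<int> data;

  FullBuffer(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0) {}

  int& at(std::size_t i, std::size_t j) {
    assert(i < rows && j < cols);
    return data[i * cols + j];
  }
};

// Producer: the outer loop runs over rows, modeling Stream on axis 0 of tv1.
// Each outer iteration writes one row of the buffer.
void produceRowWise(const std::vector<int>& in, FullBuffer& buf) {
  assert(in.size() == buf.rows * buf.cols);
  for (std::size_t i = 0; i < buf.rows; ++i) {
    for (std::size_t j = 0; j < buf.cols; ++j) {
      buf.at(i, j) = in[i * buf.cols + j];
    }
  }
}

// Consumer: the outer loop runs over columns, modeling Stream on axis 1 of tv2.
// Each outer iteration touches every row, so the whole buffer must exist.
std::vector<int> consumeColumnWise(FullBuffer& buf) {
  std::vector<int> out(buf.rows * buf.cols, 0);
  for (std::size_t j = 0; j < buf.cols; ++j) {
    for (std::size_t i = 0; i < buf.rows; ++i) {
      out[i * buf.cols + j] = buf.at(i, j);
    }
  }
  return out;
}
```

If the buffer held only the row currently being produced, the first column-wise iteration would already need data that no longer exists or does not exist yet. That is the sense in which the loop domain is more parallelized than the allocation domain.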

1(b) is needed for convenience. The allocation domain of a tensor should respect its stride order by definition. The loop domain, however, should prioritize placing Streams/DIDs in the front so that we can do things like `inlineMost` and `reorder(..., num_device_dims)`. It's better for the other, non-parallel IterDomains to follow the logical order to minimize disruption. Alternatively, we could let the host IR lowering, all `get*Heuristics` functions, and all `schedule*` functions reorder locally, but this wouldn't be as convenient.
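The two orderings can be sketched independently of nvFuser. The snippet below is an illustration under assumptions: `loopOrder` and `allocationOrder` are hypothetical helpers, axes are plain integer indices into the logical domain, and "parallel" stands for Stream/DID-parallelized axes.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Loop order: parallel (Stream/DID) axes first, remaining axes in their
// original logical order. Returns a permutation of logical axis indices.
std::vector<int> loopOrder(const std::vector<bool>& is_parallel) {
  std::vector<int> order(is_parallel.size());
  std::iota(order.begin(), order.end(), 0);
  // stable_partition keeps the relative order within each group.
  std::stable_partition(order.begin(), order.end(), [&](int axis) {
    return is_parallel[axis];
  });
  return order;
}

// Allocation order: axes sorted from largest stride (outermost) to smallest
// stride (innermost), i.e. the order dictated by the memory layout.
std::vector<int> allocationOrder(const std::vector<std::int64_t>& strides) {
  std::vector<int> order(strides.size());
  std::iota(order.begin(), order.end(), 0);
  std::stable_sort(order.begin(), order.end(), [&](int a, int b) {
    return strides[a] > strides[b];
  });
  return order;
}
```

For a 3-D tensor whose last logical axis is parallelized and whose strides are `{1, 12, 4}`, `loopOrder` gives `{2, 0, 1}` while `allocationOrder` gives `{1, 2, 0}`. The two orders are driven by different inputs, so there is no reason to expect them to agree.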

2 is needed because it's too late to change allocation during or after segmentation. Segmentation can't change allocation because it's read-only. Schedulers can't change the allocation of segmentation boundaries because each scheduler only sees one segment at a time and doesn't know the implications of changing the allocation of a boundary tensor.

# Implications

Preseg passes need to decouple loop and allocation. 

This doesn't mean loop and allocation have to be decoupled from the start. They just need to be decoupled at the **end** of the preseg stage. 
