Description
This post tries to summarize why we need to decouple loop domains and allocation domains during preseg for multi-GPU. In addition to the philosophical reason that they represent different concepts, we have practical reasons for that.
Context: This comes from several offline discussions with @Priya2698. Although we are on the same page, I think it's worth writing down our thought process for posterity.
First principles
At the start of segmentation,
1. Loop and allocation may differ in the following ways:
   a. Loop can be more parallelized than allocation.
   b. Loop and allocation can be in different orders.
2. Allocation of all segmentation boundaries must be finalized.
Rationales
1(a) is needed for stream parallelization. For example,
TensorView* tv0 = makeContigTensor(2);
TensorView* tv1 = set(tv0);
TensorView* tv2 = set(tv1);
tv1->axis(0)->parallelize(ParallelType::Stream);
tv2->axis(1)->parallelize(ParallelType::Stream);
where tv1 is produced in a row-wise host loop and consumed in a column-wise host loop.
However, unlike its loop domain, tv1's allocation must cover the whole tensor, i.e., its allocation domain is the same as its logical domain, without parallelization.
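To make the access pattern concrete, here is a minimal toy sketch in plain C++. It is not nvFuser API: `Matrix`, `produceRowWise`, and `consumeColumnWise` are hypothetical names, ordinary host loops stand in for the stream-parallelized loops, and the tensor is assumed to be rectangular. It only illustrates that a column-wise consumer touches every row of tv1, so a per-row (stream-sliced) allocation would not be enough.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only; none of these names exist in nvFuser.
using Matrix = std::vector<std::vector<int>>;

// Producer: each iteration of the outer (row-wise) loop fills one row of tv1.
Matrix produceRowWise(const Matrix& tv0) {
  assert(!tv0.empty());
  const std::size_t rows = tv0.size();
  const std::size_t cols = tv0[0].size();
  // tv1 is allocated in full (rows x cols), matching its logical shape.
  Matrix tv1(rows, std::vector<int>(cols, 0));
  for (std::size_t i = 0; i < rows; ++i) {  // stands in for the Stream loop on axis 0
    for (std::size_t j = 0; j < cols; ++j) {
      tv1[i][j] = tv0[i][j];
    }
  }
  return tv1;
}

// Consumer: each iteration of the outer (column-wise) loop reads one column
// of tv1, which spans every row the producer wrote.
Matrix consumeColumnWise(const Matrix& tv1) {
  assert(!tv1.empty());
  const std::size_t rows = tv1.size();
  const std::size_t cols = tv1[0].size();
  Matrix tv2(rows, std::vector<int>(cols, 0));
  for (std::size_t j = 0; j < cols; ++j) {  // stands in for the Stream loop on axis 1
    for (std::size_t i = 0; i < rows; ++i) {
      tv2[i][j] = tv1[i][j];
    }
  }
  return tv2;
}
```

Calling `consumeColumnWise(produceRowWise(tv0))` returns a copy of `tv0`, but only because the intermediate buffer holds all rows at once. If tv1 held just the row of the current producer iteration, no single consumer iteration could read a complete column.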
1(b) is needed for convenience. The allocation domain of a tensor should respect its stride order by definition. However, the loop domain should prioritize placing Streams/DIDs in the front so that we can do things like inlineMost and reorder(..., num_device_dims). It's better for the other, non-parallel IterDomains to follow the logical order to minimize disruption. Alternatively, we could let the host IR lowering and all get*Heuristics and schedule* functions reorder locally, but this wouldn't be as convenient.
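The two orderings can be sketched with a small toy model in plain C++. This is not nvFuser API: `ToyAxis`, `toyLoopOrder`, and `toyAllocationOrder` are hypothetical names, and "parallel" here just means an axis marked Stream or DID. The sketch assumes the allocation order is outermost-first, i.e., sorted by descending stride, and that the loop order moves parallel axes to the front while the rest keep their logical order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for an IterDomain; not an nvFuser type.
struct ToyAxis {
  std::string name;
  bool is_parallel;    // true if marked Stream/DID in this toy model
  std::size_t stride;  // stride of this axis in the allocated buffer
};

static std::vector<std::string> namesOf(const std::vector<ToyAxis>& axes) {
  std::vector<std::string> names;
  for (const ToyAxis& a : axes) {
    names.push_back(a.name);
  }
  return names;
}

// Loop order: parallel axes first; all other axes keep their logical order.
std::vector<std::string> toyLoopOrder(std::vector<ToyAxis> logical) {
  assert(!logical.empty());
  std::stable_partition(
      logical.begin(), logical.end(),
      [](const ToyAxis& a) { return a.is_parallel; });
  return namesOf(logical);
}

// Allocation order: follows the stride order, largest stride outermost.
std::vector<std::string> toyAllocationOrder(std::vector<ToyAxis> logical) {
  assert(!logical.empty());
  std::stable_sort(
      logical.begin(), logical.end(),
      [](const ToyAxis& a, const ToyAxis& b) { return a.stride > b.stride; });
  return namesOf(logical);
}
```

For a logical order `[i0, i1, d]` with strides `4, 8, 1` and only `d` marked parallel, the loop order comes out as `[d, i0, i1]` while the allocation order is `[i1, i0, d]`. The two disagree even though both are derived from the same logical domain, which is the situation 1(b) allows.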
2 is needed because it's too late to change allocation during or after segmentation. Segmentation can't change allocation because it's read-only. Schedulers can't change the allocation of segmentation boundaries because each scheduler sees only one segment at a time and doesn't know the implications of changing the allocation of a boundary tensor.
Implications
Preseg passes need to decouple loop and allocation.
This doesn't mean loop and allocation have to be decoupled from the start. They just need to be decoupled at the end of the preseg stage.