This should probably go into the FinalizeMultideviceDomainsPass. The idea is similar to our SyncMap analysis. For example:
- Given a TensorView at a segmentation boundary, if its producer and all consumers can be inlined into the same loop, stream-parallelize its allocation domain.
- Otherwise, don't stream-parallelize it because it has to be allocated outside the loop.
- The allocation of a fusion input/output can't be stream-parallelized because its producer/consumer is outside nvFuser.
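
To make the rule above concrete, here is a rough sketch of the decision as a standalone predicate. It deliberately does not use real nvFuser types: `BoundaryTensor`, its fields, and `shouldStreamParallelizeAllocation` are all hypothetical names invented for illustration, and the "which loop can this expression be inlined into" information is assumed to be computed elsewhere. The actual pass would query the TensorView and its definition/uses instead.

```cpp
#include <vector>

// Hypothetical summary of a TensorView at a segmentation boundary.
// None of these names exist in nvFuser; they only model the rule.
struct BoundaryTensor {
  bool is_fusion_input = false;
  bool is_fusion_output = false;
  // Id of the stream loop the producer can be inlined into, or -1 if
  // the producer cannot be inlined into any stream loop (assumption).
  int producer_loop = -1;
  // Same encoding, one entry per consumer.
  std::vector<int> consumer_loops;
};

// Returns true if the allocation domain may be stream-parallelized.
bool shouldStreamParallelizeAllocation(const BoundaryTensor& tv) {
  // Fusion inputs/outputs are produced or consumed outside nvFuser, so
  // their allocation cannot live inside the stream loop.
  if (tv.is_fusion_input || tv.is_fusion_output) {
    return false;
  }
  // The producer must be inlinable into some stream loop.
  if (tv.producer_loop < 0) {
    return false;
  }
  // Every consumer must be inlinable into that same loop; otherwise the
  // tensor has to be allocated outside the loop.
  for (int loop : tv.consumer_loops) {
    if (loop != tv.producer_loop) {
      return false;
    }
  }
  return true;
}
```

A tensor whose producer and consumers all map to loop 0 would be stream-parallelized; if one consumer maps to a different loop (or to none), the allocation stays outside the loop and is left unparallelized.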