Initial Building Blocks for Dynamic Transformation Support #24
Conversation

Force-pushed from aa5e675 to ae1b2b8
csarofeen left a comment

Pushing comments; only finished dynamic_transform[.cpp, .h].
```cpp
// Root and rfactor domains are updated. First mutate the
// TensorDomain and then TensorView
mutate(tv->domain());
```
Seems like it would be nice if OptOutMutator also covered the transformations between the root domain and the leaf domain of a TensorDomain.
```cpp
  }
}

TEST_F(NVFuserTest, DynamicTransform3_CUDA) {
```
@jjsjann123 This is a simple example of defining a symbolic reshape and later concretizing it with actual sizes.
Line 172 is where the symbolic fusion is analyzed with actual sizes. Line 176 is where the fusion is concretized, i.e., the symbolic reshape is converted to a static reshape. Thereafter, the fusion can be fed to the rest of the system, i.e., segmenter and GpuLower.
```cpp
patterns.push_back(ShapeInfo{
    .ref_transform = {{{3, 4}, {4, 3}}},
    .equal_transforms =
        {{{{3, 4}, {4, 3}}}, {{{2, 8}, {4, 4}}}, {{{3, 8}, {4, 6}}}},
    .different_transforms = {{{{3, 4}, {2, 6}}}}});
```
@jjsjann123 Here, this defines several patterns of reshape ops. The ref transform is used as the reference and is compared with the equal and different transforms. The equal transforms are a list of transforms that should be able to use the same concretized fusion as the reference, whereas the different transforms should result in different concretized fusions. There are TORCH_CHECKs near the end of this test that assert these hypotheses.
cc'ing @kevinstephano on this to comment on how this would affect our python cache.
I think the new workflow would be that our python cache system should store the non-concretized Fusion object. We'll need to add another layer in the caching system to map this to the concretized Fusion.
IMHO, this is still a big hammer, and I'm uncertain whether it would be easier done at a higher level on the framework/integration side.
```cpp
auto tv1 = makeSymbolicTensor(2);
fusion.addInput(tv1);
auto tv2 = makeSymbolicTensor(2);
fusion.addInput(tv2);
```
QQ: we mentioned that tv1 and tv2 have the same shape; that's just implicit because we have add(tv1, tv2) later, right?
csrc/dynamic_transform.h
Outdated
```cpp
    Fusion*,
    ExpressionEvaluator* expr_eval);

static void concretizeFusion(Fusion*, const DynamicTransformInfo& info);
```
Looks like concretizeFusion modifies the object pointed to by Fusion*, since it's not a const Fusion*?
cc'ing @kevinstephano since this likely would affect how/where Fusion* is cached on the python side.
Yes, that's what the current design does. We could change it if that's preferred; for example, we could just create a copy, concretize it, and return the new copy.
I think at this moment this is just an interface design question. We may want to have a different interface, but it still seems too early to think about concrete interfaces.
```cpp
std::vector<std::pair<std::vector<int64_t>, std::vector<int64_t>>>
    before_after_shapes = {
        {{4, 3}, {3, 4}},
        //{{4, 3}, {12, 1}}, not possible to pad a broadcast domain yet
```
I thought pad on a broadcast domain has been patched?!
jjsjann123 left a comment

Per our offline discussion, there's no blocking issue from the runtime side. I'm stamping for that.
Note that integration of dynamic reshape support should be plumbed into FusionKernelRuntime.
So FusionExecutorCache will hold a non-concretized Fusion; by the time it retrieves a FusionKernelRuntime, we would already have arguments available and be able to concretize the fusion. This will replace the existing fusion clone happening in the constructor of FusionKernelRuntime.
DynamicTransformInfo will be cached and used along with the heuristics of a kernel scheduling.
```cpp
  return set(tv->as<Val>())->as<TensorView>();
}

namespace {
```
Moved to analyze_view.h/cpp
```cpp
auto squeezed = std::any_of(
                    view_analysis.squeeze_axes.begin(),
                    view_analysis.squeeze_axes.end(),
                    [](bool s) { return s; })
    ? squeeze(x, view_analysis.squeeze_axes)
    : x;

auto view = view_analysis.transforms.empty()
    ? squeezed
    : applyViewTransforms(x, squeezed, view_analysis);

auto bcasted = std::any_of(
                   view_analysis.broadcast_axes.begin(),
                   view_analysis.broadcast_axes.end(),
                   [](bool b) { return b; })
    ? broadcast(view, view_analysis.broadcast_axes)
    : view;

return bcasted;
```
Factored out as reshape in transform_view.h
```cpp
// Validate that the root domain consists of all inputs to domain
// Uncertain if this will hold for RFactor
void validateInputDependency(
```

```cpp
  return ss.str();
}

bool AnalyzeViewResult::operator==(const AnalyzeViewResult& other) const {
```
Note: This comparison is used when comparing DynamicTransformConcretizationInfo. It is mostly the same as the equality comparison of AnalyzeViewConstraint, but here the original and new constraints, i.e., AnalyzeViewConstraint::original_constraint and AnalyzeViewConstraint::new_constraint, are not used, as I don't think that's necessary.
I think the old versions of constraints can just be removed, as their functionality is now replaced (since you added a direct == and hash on the class).
```diff
   GatherScatter,
-  VectorComponent
+  VectorComponent,
+  Symbolic
```
I'm wondering how this should be modelled generally. Is it that we don't know if it's one of the other types? Or that it's just an under-determined Iteration ID?
I think this works fine in this PR for now. It's still a little strange to me, but Symbolic really can only be one of a few other types at the moment, so it's fine to mark them as "symbolic" until they get changed later to another type.
Yes, it's a symbolic IterType, meaning the actual type is unknown when it is created.
```cpp
  return true;
}

size_t AnalyzeViewResult::hash() const {
```
You should be able to remove AnalyzeViewConstraint, which was my lazy implementation of this == and hash. It might be used in the TorchScript integration though, so you may have to fix the usage there to do this. (Related to the above comment.)
```cpp
// If it has an rfactor root domain, the IterTypes of the rfactor
// IDs may need to be updated as well. Traverse the rfactor exprs
// and mutate the IterTypes of output IDs if symbolic.
if (tv->hasRFactor()) {
```
@liqiangxl Just an example of a traversal over transformations. You might not understand everything that's going on here, but this is the typical structure of such a traversal.
```cpp
// If any of the output IDs is symbolic, all outputs should be symbolic
TORCH_INTERNAL_ASSERT(std::all_of(
    expr->outputs().begin(), expr->outputs().end(), [](Val* output) {
      return output->as<IterDomain>()->getIterType() ==
```
nit: This should be safe, but I'm generally in the habit of filtering IterDomains anyway (since it's not safe on inputs of transform exprs).
Makes sense, but I actually think filtering may not be the best choice in this case. If any output is not an IterDomain, we should not filter it out silently, as that likely means the logic here needs to be updated. So instead of filtering, I added an explicit assertion at the beginning of this loop.
```cpp
//! Dynamic version of reshape. The number of dimensions cannot
//! change, but the actual sizes of the dimensions can be symbolic.
TORCH_CUDA_CU_API TensorView* reshape(
```
I didn't realize that we have a requirement on the tensor staying the same rank here... that's an interesting restriction.
I figure we can fake it on the frontend... so that leaves me wondering why we are not doing this on the backend side?! i.e., we can have a reshape to get a size-1 dimension and then squeeze (or unsqueeze if we are expanding).
NVM. I think it's just the comment here:
> Dynamic version of reshape. The number of dimensions cannot change, but the actual sizes of the dimensions can be symbolic.

nitpick: this sounds like we are requiring the output to be the same rank as the input, which I think is not the case at all.
No, that's not the case; it's just that the output rank of view must be known (and constant relative to this cache).
Updated the comment.
!build
# Background

#24 introduces support for dynamic shape arguments for the `reshape` command. When the shape cannot be fully inferred at runtime (defined here as having non-constant scalars present), the output `IterDomain`s are marked as having `IterType::Symbolic`. These are not allowed during lowering, and one must "concretize" the `Fusion` by binding concrete integers to make the associated extents constant before scheduling/lowering/execution.

Given a compute definition for a `Fusion`, the `FusionExecutorCache` maps from inputs to `FusionKernelRuntime`s. This means that when passed inputs with new sizes or on a new device, `FusionExecutorCache` will determine whether it is safe to use a previously-determined index type (int32 or int64) and scheduler heuristics, and reuse the `FusionKernelRuntime` if possible, avoiding resegmentation/rescheduling/recompilation. With the introduction of dynamic reshapes, we must ensure that reuse occurs only when the inputs are compatible with the cached concretized transforms.

# Approach

The current PR triggers a resegmentation and full rescheduling of the entire original `Fusion` whenever the inputs lead to a new set of concretized transforms. For a small `Fusion` this is reasonable. However, for large segmented `Fusion`s this will be wasteful, as potentially only a single segment may require rescheduling/recompilation due to changing dynamic transforms. Reusing all concrete segmentation groups while rescheduling the dynamic ones in such a case would be preferable, but we must be careful, since (I think) this has the potential to require a resegmentation. That is, changing the transforms in a single segmentation group might invalidate the segment and require resegmentation of that segment, and that may in turn result in a less-than-optimal global segmentation. Instead, we currently just resegment from scratch.

---------

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
Fixes #256. This extends the concretization/dynamic Fusion machinery introduced in #24 to include `Resize` expressions, which is how ops like `pad`, `cat`, and `slice` are represented.

---------

Co-authored-by: Naoya Maruyama <naoyam@users.noreply.github.com>
Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
This PR provides:

- A dynamic version of `reshape` whose output sizes can be symbolic scalars; `IrBuilder::create<Int>()` is just enough.
- A new IterType, `Symbolic`, to mark dynamic axes and propagate the IterType; it can be concretized to either `Iteration` or `Broadcast`.
- Analysis of a symbolic `Fusion` with concrete inputs.
- Concretization of a symbolic `Fusion` with concrete inputs.

These APIs are intended to enable fusions with symbolic reshapes to be executed with the executor cache system. The actual extension of the caching system is not part of this PR.