Initial Building Blocks for Dynamic Transformation Support#24

Merged
naoyam merged 33 commits intomainfrom
dynamic_reshape
Apr 25, 2023

Conversation

Collaborator

@naoyam naoyam commented Mar 16, 2023

This PR provides:

  • Create symbolic reshape ops. There is no need to provide actual sizes to reshape; passing IrBuilder::create<Int>() is enough.
  • Analyze a fusion that has symbolic reshapes with an expression evaluator. The expression evaluator must be able to evaluate the before and after shapes of each symbolic reshape.
  • Modify the fusion with the analysis result so that all symbolic reshapes are translated to static reshapes.

These APIs are intended to enable fusions with symbolic reshapes to be executed with the executor cache system. The actual extension of the caching system is not part of this PR.

  • Allow tensors to be in an incomplete state with dynamic axes. Dynamic axes may not have a defining expression.
  • Add a new IterType, Symbolic, to mark dynamic axes and propagate the IterType. A Symbolic axis is later resolved to either Iteration or Broadcast.
  • Analysis to gather dynamic transform information from an incomplete Fusion with concrete inputs
  • Concretize an incomplete Fusion with concrete inputs
  • Hash function
  • Final testing
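The workflow described above can be sketched with toy standalone types (these are not the real nvFuser classes; the names SymbolicShape, ExpressionEvaluator, and concretize are illustrative stand-ins for the PR's machinery): build a shape whose extents are unbound symbols, bind concrete sizes in an evaluator, then rewrite every symbolic extent to a constant.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

using SymbolId = int;

// A shape where each dimension is either a known constant extent or an
// unbound symbol, identified by a SymbolId.
struct SymbolicShape {
  std::vector<std::optional<int64_t>> extents;
  std::vector<SymbolId> symbols; // symbol id per dimension (-1 if constant)
};

// Minimal stand-in for an expression evaluator: a map from symbols to
// concrete values that can be bound at "runtime".
struct ExpressionEvaluator {
  std::unordered_map<SymbolId, int64_t> bindings;
  void bind(SymbolId s, int64_t v) {
    bindings[s] = v;
  }
  std::optional<int64_t> evaluate(SymbolId s) const {
    auto it = bindings.find(s);
    return it == bindings.end() ? std::nullopt
                                : std::optional<int64_t>(it->second);
  }
};

// Concretization: every symbolic extent must now be evaluable, and is
// replaced by its concrete value.
SymbolicShape concretize(
    const SymbolicShape& shape,
    const ExpressionEvaluator& ee) {
  SymbolicShape out = shape;
  for (size_t i = 0; i < shape.extents.size(); ++i) {
    if (!out.extents[i].has_value()) {
      auto v = ee.evaluate(shape.symbols[i]);
      assert(v.has_value() && "all symbolic sizes must be bound");
      out.extents[i] = *v;
    }
  }
  return out;
}
```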

@naoyam naoyam force-pushed the dynamic_reshape branch 2 times, most recently from aa5e675 to ae1b2b8 Compare March 25, 2023 04:32
Collaborator

@csarofeen csarofeen left a comment

Pushing comments, only finished dynamic_transform[.cpp, .h]


// Root and rfactor domains are updated. First mutate the
// TensorDomain and then TensorView
mutate(tv->domain());
Collaborator

Seems like it would be nice if opt out mutator also covered transformations between root and domain of tensor domains.

Collaborator Author

@jjsjann123 Here's all the tests.

}
}

TEST_F(NVFuserTest, DynamicTransform3_CUDA) {
Collaborator Author

@jjsjann123 This is a simple example of defining a symbolic reshape and later concretizing it with actual sizes.

Line 172 is where the symbolic fusion is analyzed with actual sizes. Line 176 is where the fusion is concretized, i.e., the symbolic reshape is converted to a static reshape. Thereafter, the fusion can be fed to the rest of the system, i.e., segmenter and GpuLower.

Comment on lines +387 to +391
patterns.push_back(ShapeInfo{
.ref_transform = {{{3, 4}, {4, 3}}},
.equal_transforms =
{{{{3, 4}, {4, 3}}}, {{{2, 8}, {4, 4}}}, {{{3, 8}, {4, 6}}}},
.different_transforms = {{{{3, 4}, {2, 6}}}}});
Collaborator Author

@naoyam naoyam Mar 30, 2023

@jjsjann123 Here, this defines several patterns of reshape ops. The ref transform is used as the reference and is compared with the equal and different transforms. The equal transforms are a list of transforms that should be able to use the same concretized fusion as the reference, whereas the different transforms should result in different concretized fusions. There are TORCH_CHECKs near the end of this test that assert these hypotheses.

Collaborator

@jjsjann123 jjsjann123 left a comment

cc'ing @kevinstephano on this to comment on how this would affect our python cache.

I think the new workflow would be that our Python cache system stores the non-concretized Fusion object. We'll need to add another layer to the caching system to map this to a concretized Fusion.

IMHO, this is still a big hammer, and I'm uncertain whether it would be easier done at a higher level on the framework/integration side.
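The two-level lookup suggested above could look roughly like the following toy sketch (hypothetical names; FusionCacheEntry and getOrConcretize are not the real nvFuser cache API): the outer cache stores the non-concretized fusion once, and an inner map keyed by a concretization key, derived from the runtime input sizes, holds the concretized fusions.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Toy stand-in for a Fusion object.
struct Fusion {
  bool concretized = false;
};

struct FusionCacheEntry {
  // The symbolic (non-concretized) fusion is stored once.
  std::shared_ptr<Fusion> symbolic_fusion;
  // Inner layer: one concretized fusion per concretization key.
  std::map<std::string, std::shared_ptr<Fusion>> concretized;

  std::shared_ptr<Fusion> getOrConcretize(const std::string& key) {
    auto it = concretized.find(key);
    if (it != concretized.end()) {
      // Reuse: inputs map to an already-seen concretization.
      return it->second;
    }
    // Cache miss: clone the symbolic fusion and concretize the clone,
    // leaving the stored symbolic fusion untouched.
    auto fusion = std::make_shared<Fusion>(*symbolic_fusion);
    fusion->concretized = true;
    concretized[key] = fusion;
    return fusion;
  }
};
```

Inputs that lead to the same concretized transforms share the same entry, while new patterns trigger a fresh clone-and-concretize.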

auto tv1 = makeSymbolicTensor(2);
fusion.addInput(tv1);
auto tv2 = makeSymbolicTensor(2);
fusion.addInput(tv2);
Collaborator

QQ: we mentioned that tv1 and tv2 have the same shape; that's just implicit because we have add(tv1, tv2) later, right?

Collaborator Author

Yes

Fusion*,
ExpressionEvaluator* expr_eval);

static void concretizeFusion(Fusion*, const DynamicTransformInfo& info);
Collaborator

Looks like concretizeFusion does modify the object pointed to by Fusion*, since it's not a const Fusion*?

cc'ing @kevinstephano since this likely would affect how/where Fusion* is cached on the python side.

Collaborator Author

Yes, that's what the current design does. We could change it if that's preferred. For example, we could just create a copy, concretize it, and return the new copy.

I think this is just an interface design question at this point. We may want a different interface, but it seems still too early to think about concrete interfaces.
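The two interface options discussed above can be contrasted on a toy Fusion type (illustrative names, not the real API): an in-place mutation, which is why the current signature takes Fusion* rather than const Fusion*, versus a variant that leaves the input untouched and returns a concretized copy.

```cpp
#include <cassert>

// Toy stand-in for a Fusion object.
struct Fusion {
  bool concretized = false;
};

// Option 1: mutate in place (the current design; the caller's object
// changes, hence the non-const pointer).
void concretizeFusion(Fusion* fusion) {
  fusion->concretized = true;
}

// Option 2: copy, concretize the copy, and return it; the original
// symbolic fusion is left untouched.
Fusion concretizedCopy(const Fusion& fusion) {
  Fusion copy = fusion;
  copy.concretized = true;
  return copy;
}
```

Option 2 is friendlier to a cache that wants to keep the symbolic fusion around for future concretizations, at the cost of a clone per call.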

std::vector<std::pair<std::vector<int64_t>, std::vector<int64_t>>>
before_after_shapes = {
{{4, 3}, {3, 4}},
//{{4, 3}, {12, 1}}, not possible to do pad a broadcast domain yet
Collaborator

I thought pad on a broadcast domain has been patched?!

Collaborator Author

Not yet

Collaborator

@jjsjann123 jjsjann123 left a comment

Per our offline discussion, there's no blocking issue from the runtime side. I'm stamping for that.

Note that integration of dynamic reshape support should be plumbed into FusionKernelRuntime.

So FusionExecutorCache will hold a non-concretized Fusion; by the time it retrieves a FusionKernelRuntime, the arguments will already be available and we will be able to concretize the fusion. This will replace the existing fusion clone that happens in the constructor of FusionKernelRuntime.

DynamicTransformInfo will be cached and used along with the heuristics of a kernel scheduling.

return set(tv->as<Val>())->as<TensorView>();
}

namespace {
Collaborator Author

Moved to analyze_view.h/cpp

Comment on lines -105 to -123
auto squeezed = std::any_of(
view_analysis.squeeze_axes.begin(),
view_analysis.squeeze_axes.end(),
[](bool s) { return s; })
? squeeze(x, view_analysis.squeeze_axes)
: x;

auto view = view_analysis.transforms.empty()
? squeezed
: applyViewTransforms(x, squeezed, view_analysis);

auto bcasted = std::any_of(
view_analysis.broadcast_axes.begin(),
view_analysis.broadcast_axes.end(),
[](bool b) { return b; })
? broadcast(view, view_analysis.broadcast_axes)
: view;

return bcasted;
Collaborator Author

Factored out as reshape in transform_view.h


// Validate that the root domain consists of all inputs to domain
// Uncertain if this will hold for RFactor
void validateInputDependency(
Collaborator Author

Just factored out

@naoyam naoyam changed the title [DRAFT] Dynamic reshape support Initial Building Blocks for Dynamic Transformation Support Apr 18, 2023
return ss.str();
}

bool AnalyzeViewResult::operator==(const AnalyzeViewResult& other) const {
Collaborator Author

Note: This comparison is used when comparing DynamicTransformConcretizationInfo. It is mostly the same as the equality comparison of AnalyzeViewConstraint, but here the original and new constraints, i.e., AnalyzeViewConstraint::original_constraint and AnalyzeViewConstraint::new_constraint, are not used, as I don't think they're necessary.

Collaborator

I think the old versions of constraints can just be removed as it seems the functionality is now replaced (since you added a direct == and hash on the class).

Collaborator Author

Tracked here #217

Collaborator

@csarofeen csarofeen left a comment

LGTM, nice work.

GatherScatter,
VectorComponent
VectorComponent,
Symbolic
Collaborator

I'm wondering how this should be modelled generally. Is it that we don't know if it's one of the other types? Or that it's just an under-determined Iteration ID?

Collaborator

I think this works fine in this PR for now. It's still a little strange to me, but a symbolic ID really can only be one of a few other types at the moment, so it's fine to mark them as "symbolic" until they are later changed to another type.

Collaborator Author

Yes, it's a symbolic IterType, meaning the actual type is unknown when it is created.
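The idea in this thread can be illustrated with a toy enum (the function name resolveSymbolic is illustrative, not nvFuser's actual API): a Symbolic IterType is a placeholder whose real type is unknown at definition time, and once a concrete extent is known it resolves to Broadcast (extent 1) or Iteration (anything else).

```cpp
#include <cassert>
#include <cstdint>

// Toy version of the IterType enum with the new Symbolic placeholder.
enum class IterType { Iteration, Broadcast, Symbolic };

// During concretization, a Symbolic axis is resolved based on its
// now-known extent; already-concrete axes are left as-is.
IterType resolveSymbolic(IterType type, int64_t concrete_extent) {
  if (type != IterType::Symbolic) {
    return type; // already concrete; nothing to resolve
  }
  return concrete_extent == 1 ? IterType::Broadcast : IterType::Iteration;
}
```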

return ss.str();
}

bool AnalyzeViewResult::operator==(const AnalyzeViewResult& other) const {
Collaborator

I think the old versions of constraints can just be removed as it seems the functionality is now replaced (since you added a direct == and hash on the class).

return true;
}

size_t AnalyzeViewResult::hash() const {
Collaborator

You should be able to remove AnalyzeViewConstraint which was my lazy implementation of this == and hash. It might be used in the torchscript integration though, so you may have to fix the usage there to do this. (Related to the above comment)
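A direct == and hash on such a class is commonly implemented by combining the hashes of each field, boost-style. The sketch below uses a toy struct standing in for AnalyzeViewResult's fields (the names ViewResultLike and hashCombine are illustrative, not the PR's actual code); the key property is that equal objects must hash equally.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Boost-style hash_combine: folds a new hash value into a running seed.
inline void hashCombine(size_t& seed, size_t value) {
  seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Toy stand-in for a view-analysis result with per-axis flags.
struct ViewResultLike {
  std::vector<bool> squeeze_axes;
  std::vector<bool> broadcast_axes;

  bool operator==(const ViewResultLike& other) const {
    return squeeze_axes == other.squeeze_axes &&
        broadcast_axes == other.broadcast_axes;
  }

  // Hash every field that participates in operator== so that equal
  // objects always produce equal hashes.
  size_t hash() const {
    size_t seed = 0;
    for (bool b : squeeze_axes) {
      hashCombine(seed, std::hash<bool>{}(b));
    }
    for (bool b : broadcast_axes) {
      hashCombine(seed, std::hash<bool>{}(b));
    }
    return seed;
  }
};
```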

// If it has an rfactor root domain, the IterTypes of the rfactor
// IDs may need to be updated as well. Traverse the rfactor exprs
// and mutate the IterTypes of output IDs if symbolic.
if (tv->hasRFactor()) {
Collaborator

@liqiangxl just an example of a traversal on transformations. You might not understand what's going on here, but this is a typical structure of such a traversal.

Collaborator

Thanks for the hint.

// If any of output IDs is symbolic, all outputs should be symbolic
TORCH_INTERNAL_ASSERT(std::all_of(
expr->outputs().begin(), expr->outputs().end(), [](Val* output) {
return output->as<IterDomain>()->getIterType() ==
Collaborator

nit: This should be safe but I'm generally in the habit of filtering IterDomains anyways (since it's not safe on inputs of transform exprs)

Collaborator Author

Makes sense, but I actually think filtering may not be the best in this case. If there's any output that's not an IterDomain, we should not filter it out silently, as it likely means the logic here needs to be updated; so instead of filtering, I added an explicit assertion at the beginning of this loop.


//! Dynamic version of reshape. The number of dimensions cannot
//! change, but the actual sizes of the dimensions can be symbolic.
TORCH_CUDA_CU_API TensorView* reshape(
Collaborator

I didn't realize that we have a requirement on tensor staying the same rank here... that's an interesting restriction.

Collaborator

I figure we can fake it on the frontend... so that leaves me wondering why we are not doing this on the backend side?!

i.e. We can have a reshape to get a size-1 dimension and then squeeze. (or unsqueeze if we are expanding).

Collaborator

NVM. I think it's just the comment here.

Dynamic version of reshape. The number of dimensions cannot change, but the actual sizes of the dimensions can be symbolic.

nitpick: this sounds like we are requiring the output to be the same rank as the input, which I think is not the case at all.

Collaborator

No, that's not the case; it's just that the output rank of view must be known (and constant relative to this cache).

Collaborator Author

Updated the comment.

Collaborator Author

naoyam commented Apr 25, 2023

!build

@naoyam naoyam merged commit 8f84284 into main Apr 25, 2023
@naoyam naoyam deleted the dynamic_reshape branch April 25, 2023 18:50
@jjsjann123 jjsjann123 restored the dynamic_reshape branch April 25, 2023 19:01
@jjsjann123 jjsjann123 deleted the dynamic_reshape branch April 25, 2023 19:02
jacobhinkle added a commit that referenced this pull request Apr 28, 2023
# Background

#24 introduces support for dynamic shape arguments for the `reshape`
command. When the shape cannot be fully inferred at runtime (defined
here as having non-constant scalars present), then output `IterDomain`s
are marked as having `IterType::Symbolic`. These are not allowed during
lowering, and one must "concretize" the `Fusion` by binding some
concrete integers to make the associated extents constant, before
scheduling/lowering/execution.

Given a compute definition for a `Fusion`, the `FusionExecutorCache`
maps from inputs to `FusionKernelRuntime`s. This means that when passed
inputs with new sizes or on a new device, `FusionExecutorCache` will
determine whether it is safe to use a previously-determined index type
(int32 or int64) and scheduler heuristics, and reuse
`FusionKernelRuntime` if possible, avoiding
resegmentation/rescheduling/recompilation. With the introduction of
dynamic reshapes, we must ensure that reuse occurs only when the inputs
are compatible with the cached concretized transforms.

# Approach

The current PR triggers a resegmentation and full rescheduling of the
entire original `Fusion` whenever the inputs lead to a new set of
concretized transforms. For a small `Fusion` this is reasonable.
However, for large segmented `Fusion`s, this will be wasteful, as
potentially only a single segment may require rescheduling/recompilation
due to changing dynamic transforms. Re-using all concrete segmentation
groups while rescheduling dynamic ones in such a case would be
preferable, but we must be careful since (I think) this has the
potential to require a resegmentation. That is, changing the transforms
in a single segmentation group might invalidate the segment and require
resegmentation of that segment and that may in turn result in a less
than optimal global segmentation. Instead, we currently just resegment
starting from scratch.

---------

Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>
jacobhinkle added a commit that referenced this pull request May 16, 2023
Fixes #256.

This extends the concretization/dynamic Fusion machinery introduced in
#24 to include `Resize` expressions, which is how ops like `pad`, `cat`,
and `slice` are represented.

---------

Co-authored-by: Naoya Maruyama <naoyam@users.noreply.github.com>
Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com>