Skip to content

Support dynamic reshapes in FusionExecutorCache#215

Merged
jacobhinkle merged 55 commits intoNVIDIA:mainfrom
jacobhinkle:dynamic_reshape_fec
Apr 28, 2023
Merged

Support dynamic reshapes in FusionExecutorCache#215
jacobhinkle merged 55 commits intoNVIDIA:mainfrom
jacobhinkle:dynamic_reshape_fec

Conversation

@jacobhinkle
Copy link
Collaborator

This PR is stacked on #24.

Background

#24 introduces support for dynamic shape arguments for the reshape command. When the shape cannot be fully inferred at runtime (defined here as having non-constant scalars present), then output IterDomains are marked as having IterType::Symbolic. These are not allowed during lowering, and one must "concretize" the Fusion by binding some concrete integers to make the associated extents constant, before scheduling/lowering/execution.

Given a compute definition for a Fusion, the FusionExecutorCache maps from inputs to FusionKernelRuntimes. This means that when passed inputs with new sizes or on a new device, FusionExecutorCache will determine whether it is safe to use a previously-determined index type (int32 or int64) and scheduler heuristics, and reuse FusionKernelRuntime if possible, avoiding resegmentation/rescheduling/recompilation. With the introduction of dynamic reshapes, we must ensure that reuse occurs only when the inputs are compatible with the cached concretized transforms.

Approach

The current PR triggers a resegmentation and full rescheduling of the entire original Fusion whenever the inputs lead to a new set of concretized transforms. For a small Fusion this is reasonable. However, for large segmented Fusions, this will be wasteful, as potentially only a single segment may require rescheduling/recompilation due to changing dynamic transforms. Re-using all concrete segmentation groups while rescheduling dynamic ones in such a case would be preferable, but we must be careful since (I think) this has the potential to require a resegmentation. That is, changing the transforms in a single segmentation group might invalidate the segment and require resegmentation of that segment and that may in turn result in a less than optimal global segmentation. Instead, we currently just resegment starting from scratch.

@jacobhinkle jacobhinkle changed the base branch from main to dynamic_reshape April 24, 2023 17:29
Currently there is a problem where the managed data is continuing to be
managed after it is no longer valid (i.e. when we concretize the
fusion). That leads to a segfault as the symbolic reshaped tensorviews
are freed during concretization. The fix is to add a new facility to
Fusion which lets us "unmanage" data. I'll include that in a separate
commit on this PR, since we may want to save it for a separate PR.
// to clone them, ending in a segfault. Instead, we reset the object
// here, effectively as if it now describes a non-dynamic Fusion.
// cloned_conc_info.clear();
fusion->stopManaging(conc_info_index);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zasdfgbnm This is the use case I have in mind for stopManaging. Here I'd like to attach concretization metadata then after copying, clear it from the original as well as the copy. I should probably use a string key instead of an index for this case, but hopefully the idea is clear at least.

Comment on lines +363 to +364
template <typename T>
inline std::optional<const T> getManagedSafe(std::string key) const {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zasdfgbnm I had a little trouble trying to return an optional non-const reference, so I just added a const "safe" version here. The existing interface with the unchecked ("unsafe") versions is unmodified.

This is currently failing with a cache hit where there shouldn't be.
This fixes a bug where reshapes that depend on input scalars were
short-circuited if only the input scalars changed.
false}, // merge(1) merge(2) osplit(1, 3)
{{8, 3 * 5, 7, 9}, {8, 3, 5 * 7, 9}, false}, // merge(1) osplit(1, 3)
// test passing -1 dynamically for dimension size
//{{8, 3 * 5, 7, 9}, {8, 3, -1, 9}, false} // merge(1) osplit(1, 3)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@naoyam I have commented this case. I am unsure if we want to support -1 as a scalar input being used in a dynamic reshape or not.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't -1 be allowed in the user program? If so, unless -1 is supported, the frontend needs to analyze what -1 should be replaced with, which isn't ideal. So, I think -1 needs to be allowed.

Copy link
Collaborator Author

@jacobhinkle jacobhinkle Apr 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case is failing because kernel_runtime->getMaybeHeuristicsFor(args, forced_index_type) is inserting the -1 from args and then trying to get heuristics. This leaves the -1 in the output shape to the view op. getConcretizationInfo handles this problem by translating -1 appropriately to split/merge ops which have the proper (positive) sizes. Since the output of the dynamic reshape will be discarded at concretization anyway, perhaps we just need a method that will update any negative extents of that TensorView with the concrete size.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, in the reshape op in alias.cpp we could try and build the actual expression for each extent, handling the -1 case with a where. But that's going to be a challenge to simplify later.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so with -1, do you mean an IterDomain is concretized to have extent -1? It seems like this is an issue of concretization, not necessarily about the executor extension.

@naoyam
Copy link
Collaborator

naoyam commented Apr 27, 2023

@zasdfgbnm Can you check if the changes of the managed data makes sense?

Keeping the test commented.
Copy link
Collaborator

@zasdfgbnm zasdfgbnm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes in managed data makes sense to me.

Copy link
Collaborator

@naoyam naoyam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Make sure to create an issue for the -1 problem.

@jacobhinkle
Copy link
Collaborator Author

!build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants