
Add a new ATen Matmul IR node#2175

Merged
Priya2698 merged 34 commits into main from pm/linear_node
May 9, 2024

Conversation

Collaborator

@Priya2698 Priya2698 commented May 1, 2024

Issue #2149.

Adds a new MatmulOp IR node with the same functionality as torch.matmul.

  1. Both tensors are 1D: the dot product (sum(mul(a, b))) is returned, without creating a MatmulOp node.
  2. One of the tensors is 1D: [M, K] x [K] -> [M] / [K] x [K, N] -> [N]
  3. Both tensors are 2D: [M, K] x [K, N] -> [M, N]
  4. Both tensors are at least 1D and one of them is > 2D: [B, M, K] x [K, N] -> [B, M, N]

csrc/ops/utils/mapMatmulOpIterDomains defines the logic for mapping the input operands to the output. It is used to create the new output TensorView for the MatmulOp from the input iterdomains, and in PairwiseRootDomainMap to accurately map the MatmulOp inputs/outputs. This is required since the inputs are no longer broadcast, which affects the alignment of the inputs with the output.

Collaborator

@jacobhinkle jacobhinkle left a comment


First pass. Haven't looked at the mapping code in detail yet.

Comment on lines +252 to +257
// Adding these pragmas since gcc-12.2.1
// incorrectly reports a warning with the use of evaluate
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wfree-nonheap-object"
#endif
Collaborator


CC @protonu. Please check that this GCC check is in the right place. Specifically, should it instead be at outIterDomain()? I checked the original PR #643 but I didn't find any mention of this code so I'm not sure what the original problem was.

Collaborator Author


Based on the comment about the issues with using evaluate, it should be moved to surround the new outIterDomain function.

Collaborator Author

> I made some drive-by comments, but I'll really defer approval to others because I'm not a matmul expert and I have a bit too many PRs to review this week :)

Thanks @wujingyue, for the helpful comments.

@Priya2698 Priya2698 requested a review from jjsjann123 May 7, 2024 19:18
Collaborator

@jacobhinkle jacobhinkle left a comment


LGTM other than some minor comments

bool is_lhs,
size_t out_size);

IterDomain* newOutputIterDomain(const std::vector<IterDomain*>& ids);
Collaborator


Please add a comment above this function indicating what you just said here.

}

// Add key-value iterdomain pair to the map.
void updatePairwiseRootDomainMap(
Collaborator


You could make this a lambda inside map() so that you could capture the last three arguments instead of passing them explicitly.

std::make_tuple(Sizes({m, k}), Sizes({k})),
std::make_tuple(Sizes({k}), Sizes({b, k, n})),
std::make_tuple(Sizes({b, m, k}), Sizes({k})),
std::make_tuple(Sizes({b, 1, m, k}), Sizes({b, k, n}))));
Collaborator


Is it true that we can accept any combination of A and B where A is one of {k}, {m, k}, {b, m, k}, {b, 1, m, k} and B is one of {k}, {k, n}, {b, k, n}? If so, maybe we could parametrize each of those separately and hit all combos.

Collaborator


Out of curiosity. What would happen if k/m/n happen to be 1?

Collaborator Author


> Out of curiosity. What would happen if k/m/n happen to be 1?

Do you mean if the output shape is different? -- It will be the same as any other case.

Or how we intend to handle those cases?
@jacobhinkle pointed out that we could special-case these to increase the opportunity for fusion. (For example, for [M, 1] x [1, N], we can simply return the outer product without creating the MatmulOp node.)

Collaborator


> For [M, 1] x [1, N]

This is the K=1 case, but I think Jie is asking about M and N, which we aren't testing here. I would also add that we're not really testing this case properly either, since we are creating the input tensors with makeSymbolicTensor(a_shape.size()), which makes all dimensions IterType::Iteration. I suggested a code change that I think will address that; then we can add shape combos that have 1s in each position.

Collaborator Author


I added cases for M=1/N=1. At present, they will behave the same way as when M/N > 1.

Collaborator Author


> I would also add that we're not really testing this case properly either since we are creating the input tensors with makeSymbolicTensor(a_shape.size()) which will make all dimensions IterType::Iteration.

makeSymbolicTensor(a_shape) -> Does this mark dimensions as Broadcast if the extent is 1?


FusionExecutor fe;
fusion->aliasOutputToInput(
fusion->outputs()[0], /*input=*/nullptr, AllocationType::Evaluate);
Collaborator


This is a strange API for marking something as AllocationType::Evaluate...

Collaborator Author

@Priya2698 Priya2698 May 9, 2024


We used the existing framework for aliasing, hence some of the API names may seem odd.
I'll make a note to revisit this in a future cleanup PR.


jacobhinkle added a commit that referenced this pull request May 9, 2024
This PR restricts the accepted matmul segments for the nvfuser matmul
scheduler to only those containing pointwise epilogues. Additionally, it
rules out cases for which we cannot yet reliably determine epilogue input
vectorization due to transposes (TODO, see #2169).

Note that this check can be lifted when more epilogue cases are
supported, e.g. #2213.

Fixes #2167.

This is stacked on #2175 and its follow-up PR introducing LinearOp,
because segmentation currently fails for matmuls unless the complete
fusion can be scheduled (see #1707). The MatmulOp and LinearOp IR nodes
remove the need to inspect operand producer branches, so segmentation
should work fine once that work is merged. This PR will be marked as
draft until then.
Collaborator Author

!build

Comment on lines +448 to +449
auto tv0 = makeSymbolicTensor(a_shape.size(), DataType::Half);
auto tv1 = makeSymbolicTensor(b_shape.size(), DataType::Half);
Collaborator


Suggested change
auto tv0 = makeSymbolicTensor(a_shape.size(), DataType::Half);
auto tv1 = makeSymbolicTensor(b_shape.size(), DataType::Half);
auto tv0 = makeSymbolicTensor(a_shape, DataType::Half);
auto tv1 = makeSymbolicTensor(b_shape, DataType::Half);

This will ensure the tensors are defined in the fusion almost the way they would be using fd.from_pytorch in case the shapes contain 1s. That could be useful here since you might translate some ops to non-MatmulOp if they are trivial. To be more precise though, this will still not declare them as contiguous. For that you might want to do

  auto tv0 = TensorViewBuilder().shape(sym_shape).dtype(DataType::Half).contiguity(true).build();

where sym_shape is a vector of 1 and -1.

Collaborator Author


> To be more precise though, this will still not declare them as contiguous

Why do we need this?

Collaborator


We probably don't.

Collaborator


However, if in the future we try to evaluate this test by forcing the nvfuser matmul scheduler, we might use an inefficient kernel since we won't be able to tell the inputs are contiguous. It's not a high priority, but it would be nice to have a test utility that could take an at::Tensor and create a TensorView* that matches it, just like we do in fd.from_pytorch.


Collaborator Author

!build

@Priya2698 Priya2698 merged commit d214286 into main May 9, 2024
@Priya2698 Priya2698 deleted the pm/linear_node branch May 9, 2024 22:01
jacobhinkle added a commit that referenced this pull request May 16, 2024
This PR does the following:
1. Adds `MatmulOp` to `ir_utils::isTvOp` so that its `IterDomain`s will
be automatically propagated by `IdModel`.
2. Updates the tests to check that all non-Broadcast axes are properly
mapped by `IdModel` through the `MatmulOp`.
3. Changes the output of `MatmulOp` to have an `IterType::Reduction`
axis in the last position of its root domain to represent the `K`
dimension. This change was motivated by needing a way to have both
operand K dimensions exact mapped together, as they would be if the op
were translated to a mul+sum+cast.
4. Updates the `matmul` op to translate trivial cases where K=1 to
simple multiply+cast patterns.

Fixes #1707. In fact, that test was actually fixed by #2175 but the test
validation was failing because `isTvOp` was not picking up the matmul as
a reduction.
