[Relax][TIR] Introduce new cumsum op for gpu #16934
Conversation
```python
size = (8, 2000)
np_data = np.random.randint(0, 10, size).astype("int32")
np_cumsum = np.cumsum(np_data, axis=-1)
for target in ["cuda", "vulkan -supports_int64=1"]:
```
Nitpick: Use `@tvm.testing.parametrize_targets("cuda", "vulkan -supports_int64=1")` instead of looping over each target. This:

- Exercises each test in a separate pytest case, so a failure on one specific backend can be distinguished from a failure on every backend.
- Applies the appropriate `@tvm.testing.requires_*` marks for each target. Currently, this test would fail if a developer runs it with `set(USE_CUDA ON)` and `set(USE_VULKAN OFF)`.
```python
@tvm.testing.parametrize_targets("cuda", "vulkan -supports_int64=1")
def test_dispatch_cumsum_gpu(target, dev):
    ...
```
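For illustration, a fuller sketch of the parametrized test, combining the decorator with the setup from the diff context above. The part that builds and runs the Relax module is elided, and `tvm.testing.assert_allclose` is shown only as one way to do the comparison:

```python
import numpy as np
import tvm
import tvm.testing


@tvm.testing.parametrize_targets("cuda", "vulkan -supports_int64=1")
def test_dispatch_cumsum_gpu(target, dev):
    # pytest generates one case per target; `dev` is the matching device and the
    # appropriate @tvm.testing.requires_* mark is attached automatically.
    size = (8, 2000)
    np_data = np.random.randint(0, 10, size).astype("int32")
    np_cumsum = np.cumsum(np_data, axis=-1)
    # ... build the Relax module containing cumsum for `target`, run it on `dev`,
    # then compare the result against the NumPy reference, e.g.:
    # tvm.testing.assert_allclose(result.numpy(), np_cumsum)
```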
fixed in #16947
```python
shape = call.struct_info.shape
kwargs = {}
if (
    (axis == -1 or axis == len(shape) - 1)
```
For tensors of unknown shape, the shape field is `None`. Instead of `len(call.struct_info.shape)`, can we use `call.struct_info.ndim`? (Alternatively, since it looks like the implementation requires an explicit shape in order to apply a reshape, we could add `shape is not None` to this condition.)
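A minimal sketch of the guard being suggested, assuming the variables from the diff context above; the helper name `_cumsum_last_axis_applicable` is made up for illustration and the real pass may structure the check differently:

```python
def _cumsum_last_axis_applicable(call, axis) -> bool:
    """Hypothetical helper: decide whether the last-axis rewrite can apply."""
    sinfo = call.struct_info
    if sinfo.shape is None:
        # Unknown shape: the rewrite needs a concrete shape for its reshape.
        return False
    # ndim is known even when the shape expression itself is symbolic.
    return axis == -1 or axis == sinfo.ndim - 1
```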
Thanks for catching this. Unfortunately, the original implementation does not support unknown shapes, so I added a check in the pass.
Looks like I took too long to review. I think the changes requested should probably be made in a follow-up PR.
The current pass `ForceNarrowIndexToI32` fails to narrow the dtype for let bindings. This PR fixes the issue. BTW, this PR also addresses the comments in #16934.
New implementation for `cumsum` for `axis=-1`, which is the common case in LLM sampling and MoE inference. Tested CUDA and Vulkan on an RTX 3080, and Metal on an Apple M1 Pro. The new implementation is faster than or on par with vendor libraries for small sizes (<128k), and nearly 10x faster than the topi implementation in all cases.
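For context, a NumPy-only sketch of why `axis=-1` cumsum is the hot path in LLM sampling: top-p / multinomial sampling builds a CDF over the vocabulary (last) axis for every sequence in the batch. The shapes are illustrative and this is not the PR's kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.standard_normal((8, 2000)).astype("float32")  # (batch, vocab)

# Softmax over the vocabulary axis.
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

# The axis=-1 cumsum in question: per-row CDF over the vocabulary.
cdf = np.cumsum(probs, axis=-1)

# Inverse-CDF (multinomial) sampling: pick the first index whose CDF >= u.
u = rng.random((8, 1)).astype("float32")
sampled = np.minimum((cdf < u).sum(axis=-1), logits.shape[-1] - 1)
print(sampled)
```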