CUDA: use shared mem for ssm_conv #20128
Conversation
Added some trivial fusions also
Do you intend to add still more features, or is this ready for review?
This is it for now. Will add more in a separate PR.
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
#pragma unroll
for (size_t j = 0; j < d_conv; j++) {
    w[j] = w_block[tid * stride_w + j];
for (int idx = tid; idx < total_elems; idx += split_d_inner) {
@am17an The HIP compiler throws a bunch of warnings for this function because of this loop: tid is not known at compile time, so there is no way for the compiler to know how many iterations this loop should be unrolled to.
@ggerganov what do you think about -Werror for the HIP build? The fact that the CUDA compiler doesn't warn for these cases has been a constant source of annoyance.
Oh right, I think the tid offset should be inside the loop, like:
for (int idx = 0; idx < total_elems; idx += split_d_inner) {
    const int i = tid + idx;
    ....
}
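To spell the pattern out, here is a minimal sketch of the fix being discussed. The names (d_conv, split_d_inner, total_elems, tid) are taken from the diff, but the kernel body is illustrative, not the PR's actual ssm_conv code: #pragma unroll stays on loops whose trip count is a compile-time constant, while the strided loop starts at 0 and applies the per-thread offset inside, so its bounds no longer depend on a runtime value.

// Hedged sketch, not the PR's actual kernel.
template <int d_conv, int split_d_inner>
static __global__ void conv_sketch(const float * __restrict__ w_block, const float * __restrict__ x,
                                   float * __restrict__ y, const int total_elems, const int stride_w) {
    const int tid = threadIdx.x;

    float w[d_conv];
#pragma unroll // OK: d_conv is a template parameter, so the trip count is known at compile time
    for (int j = 0; j < d_conv; j++) {
        w[j] = w_block[tid * stride_w + j];
    }

    // The strided loop starts at 0, so its bounds do not depend on the runtime
    // value of tid; the per-thread offset is applied inside instead.
    for (int idx = 0; idx < total_elems; idx += split_d_inner) {
        const int i = tid + idx;
        if (i < total_elems) {
            float sum = 0.0f;
#pragma unroll
            for (int j = 0; j < d_conv; j++) {
                sum += x[i + j] * w[j]; // assumes x is padded with d_conv - 1 trailing elements
            }
            y[i] = sum;
        }
    }
}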
I think there are a lot of warnings we need to clean up before enabling -Werror; let me do that in a separate PR.
At least building the HIP backend usually never throws any warnings, so we could restrict -Werror to just the HIP objects and only for the CI build, or even to just the transformation warning. The goal is that a failing HIP backend CI run blocks this kind of thing from entering the codebase, which has happened a lot.
Yes, if this is the only warning for now, I'm okay with enabling -Werror for HIP, since it would help catch these bugs in the CI.
> @ggerganov what do you think about -Werror for the HIP build?
@IMbackK Sounds good
* CUDA: use shared mem for ssm_conv
* fuse silu + ssm_conv
* fuse unary + mul
* enable for fp16
* formatting

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
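As a rough illustration of what the "fuse silu + ssm_conv" item amounts to: the activation is applied inside the conv kernel instead of in a separate launch, saving a kernel launch and a round trip through global memory. This is a hypothetical epilogue, not the PR's actual code:

// Hypothetical fused epilogue: apply SiLU in-kernel rather than in a separate pass.
static __device__ __forceinline__ float silu(const float x) {
    return x / (1.0f + expf(-x));
}

// ... at the end of the conv kernel, instead of storing the raw sum:
// y[i] = silu(sum);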
Add shared mem loading to ssm_conv_long_token for some mild benefit in prompt processing (measured at ub = 2048 on a 4090).
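A minimal sketch of the shared-memory staging idea, with assumed shapes and names (one row of length n_t, a d_conv-tap filter); this is illustrative and not the actual ssm_conv_long_token kernel:

// Launch with dynamic shared mem of (n_t + d_conv - 1) * sizeof(float).
template <int d_conv>
static __global__ void ssm_conv_smem_sketch(const float * __restrict__ x, const float * __restrict__ w,
                                            float * __restrict__ y, const int n_t) {
    extern __shared__ float s_x[]; // one row of x plus (d_conv - 1) halo elements

    const int tid = threadIdx.x;

    // Cooperatively stage the row into shared memory once, so each input
    // element is read from global memory a single time instead of d_conv times.
    for (int i = tid; i < n_t + d_conv - 1; i += blockDim.x) {
        s_x[i] = x[i];
    }
    __syncthreads();

    // Each thread computes output elements from the shared-memory copy.
    for (int t = tid; t < n_t; t += blockDim.x) {
        float sum = 0.0f;
#pragma unroll
        for (int j = 0; j < d_conv; j++) {
            sum += s_x[t + j] * w[j];
        }
        y[t] = sum;
    }
}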