Hi, I took a look at softmax_kernel.cu. Is the code specialized for sequence lengths that are powers of 2, such as 128, 256, or 512? It doesn't seem to work for a sequence length of 50, right?
I also need to open a PR against the DeepSpeed-Example branch to consolidate these changes so that our bing-bert example doesn't crash!
Originally posted by @zmx19951103 in #587 (comment)