From reading this thread:
pytorch/pytorch#96099 (comment)
It seems to me that the relative positional embedding can be integrated through scaled_dot_product_attention's attn_mask argument. However, this can be slow, since supplying an attn_mask prevents PyTorch from dispatching to the FlashAttention "fast path".
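To make the idea concrete, here is a minimal sketch (not the MONAI implementation) of feeding an additive relative positional bias through scaled_dot_product_attention's attn_mask argument; the tensor names and shapes (e.g. rel_pos_bias) are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

batch, num_heads, seq_len, head_dim = 2, 4, 16, 32
q = torch.randn(batch, num_heads, seq_len, head_dim)
k = torch.randn(batch, num_heads, seq_len, head_dim)
v = torch.randn(batch, num_heads, seq_len, head_dim)

# Additive relative positional bias, broadcast over the batch dimension.
# Passing a dense float attn_mask like this makes the FlashAttention backend
# unavailable, so PyTorch falls back to the math/memory-efficient kernels.
rel_pos_bias = torch.randn(1, num_heads, seq_len, seq_len)

out = F.scaled_dot_product_attention(q, k, v, attn_mask=rel_pos_bias)
print(out.shape)  # torch.Size([2, 4, 16, 32])
```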
Do you think we can keep this option open for users who want to use flash_attention together with rel_pos_embedding?
Originally posted by @mingxin-zheng in #7977 (comment)