When I tried to reproduce the paper's results on an ARM-based Linux system, flash attention was not supported, so I replaced it with SDPA. Theoretically the two computations are equivalent and the numerical difference is negligible, but I was unable to reproduce the effect described in the paper!
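For context on the "theoretically equivalent" point: flash attention and SDPA compute the same softmax(QKᵀ/√d)V, but flash attention evaluates the softmax blockwise (the "online softmax" trick), so the floating-point operations happen in a different order and the results can differ by a tiny amount even when the math is identical. A minimal sketch in plain Python (no torch; the function names and block size are mine, for illustration only) comparing a full softmax against a blockwise one:

```python
import math

def softmax(xs):
    # Standard numerically-stable softmax over the full score vector,
    # as a non-flash SDPA kernel would compute it.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def online_softmax(xs, block=2):
    # Blockwise ("online") softmax in the style of flash attention:
    # process scores block by block, rescaling the running sum each
    # time a new maximum is seen. Mathematically identical to softmax().
    m, s = float("-inf"), 0.0
    for i in range(0, len(xs), block):
        blk = xs[i:i + block]
        new_m = max(m, max(blk))
        s = s * math.exp(m - new_m) + sum(math.exp(x - new_m) for x in blk)
        m = new_m
    return [math.exp(x - m) / s for x in xs]

scores = [0.1, 2.3, -1.7, 0.9, 3.1, -0.4]
full = softmax(scores)
blocked = online_softmax(scores)
max_diff = max(abs(a - b) for a, b in zip(full, blocked))
```

Here `max_diff` stays down at rounding-error level, which is why the swap is usually considered safe; a visible behavioral change points at something other than this numerical difference (e.g. a masking, dtype, or kernel-selection difference in the replacement path).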