Hi,
I have been exploring your implementation and came across the parameters topk_ratio and local_range. Could you please clarify the following points?
topk_ratio:
What does the topk_ratio parameter control in your model? How does it relate to the resolution of the input data?
local_range:
What is the role of local_range in the attention process? How does it constrain the attention span or locality?
Additionally, I would like to understand the Locality-Constrained Sparse Attention mechanism.
Where in the codebase is Locality-Constrained Sparse Attention implemented?