In your paper, you state that "In practice, the projection layer can transform the queries to any desired output, making the self-attention module redundant". However, self-attention contains a softmax, so it is non-linear in general, whereas a projection layer can only apply a linear (affine) transformation. Since a linear map cannot reproduce an arbitrary non-linear one, I don't understand how the projection layer could transform the queries to any desired output.
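For concreteness, here is a minimal numerical sketch of the point I am raising (my own illustration, not code from the paper): with standard scaled dot-product attention, the output fails the additivity test for linearity in the query, while a projection layer `W @ q` passes it by construction.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Standard scaled dot-product attention.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

rng = np.random.default_rng(0)
d = 4
K = rng.standard_normal((3, d))
V = rng.standard_normal((3, d))
q1 = rng.standard_normal((1, d))
q2 = rng.standard_normal((1, d))

# Linearity in the query would require f(q1 + q2) == f(q1) + f(q2);
# the softmax breaks this.
lhs = attention(q1 + q2, K, V)
rhs = attention(q1, K, V) + attention(q2, K, V)
print(np.allclose(lhs, rhs))  # False: attention is non-linear in the query
```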