I was assuming that BLIS is generally better than reference BLAS, so substituting the latter with BLIS OS packages I'm working on would always be sensible. However, I found BLIS is more than two times slower for medium-sized dgemm on x86_64/RHEL7 for a "generic" build compared with the system reference blas package (which should be built with -O2 -mtune=generic, not -O3). I can't usefully test an architecture without a tuned implementation, but I don't see any reason to think that would be much different, though I haven't looked into the gcc optimization.
Is that expected, or might it be something worth investigating?
I was assuming that BLIS is generally better than reference BLAS, so substituting the latter with BLIS OS packages I'm working on would always be sensible. However, I found BLIS is more than two times slower for medium-sized dgemm on x86_64/RHEL7 for a "generic" build compared with the system reference blas package (which should be built with -O2 -mtune=generic, not -O3). I can't usefully test an architecture without a tuned implementation, but I don't see any reason to think that would be much different, though I haven't looked into the gcc optimization.
Is that expected, or might it be something worth investigating?