Switch to using SSE by default for Intel builds, rather than AVX #28
@weefuzzy Would those defaults be activated by passing my own flags to FLUID_ARCH? I am not explicitly modifying any other parts of the build process apart from the OSX_ARCHITECTURES flag. Is there a way to verify whether I was using said SIMD ops?
They would only activate if nothing was passed in
If you're building with Ninja or Make, you can get CMake to produce a compile commands JSON file, then grep for the relevant flags. In Xcode, you can get to the raw compile commands through the report navigator (speech bubble icon).
What are the relevant flags I should grep for?
I couldn't find anything.
Those compile commands are all x64 builds. You'd probably need to delete your CMake cache or whole build tree to reset to doing ARM or both. Although, worryingly, I didn't see either AVX or SSE in there (perhaps your FLUID_ARCH is already in the cache as being blank?)
It will take a mo ⏲️ but I will iterate through all the possible combos of flags, cleaning between each :)
This does not work if I don't pass FLUID_ARCH. It whinges about not being able to build for armv7.
Presumably because it's automatically downloading HISSTools. What happens with `cmake -GNinja -DMAX_SDK_PATH=../sdk -DHISS_PATH=<some local and up-to-date HISSTools clone> ..`?
This fails:
but this works:
What's the reason though? (and I'm confused about the If the HISSTools in
I guess I'm a little lost too 🪘
Perhaps you could draw up a list of combinations that you want me to run, because I'm getting bogged down in dependencies and flags and what is meaningful in helping push this side of things forward.
Ok. One ambition is to make sure building works without passing FLUID_ARCH at all. So, in a nice world, all of the following would work:
- passing nothing: I would expect either an arm or a universal binary
- passing arm64: I would expect just arm64
- passing x86_64: I would expect just x64

If any of them fail, post the error messages and we'll take it from there
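Concretely, the three cases might look something like this (a sketch: the generator and `MAX_SDK_PATH` value are taken from commands quoted earlier in the thread, and `CMAKE_OSX_ARCHITECTURES` is the standard CMake variable for selecting macOS target architectures):

```shell
# 1. Nothing passed: expect a native arm or universal binary
cmake -GNinja -DMAX_SDK_PATH=../sdk ..
# 2. arm64 only
cmake -GNinja -DMAX_SDK_PATH=../sdk -DCMAKE_OSX_ARCHITECTURES=arm64 ..
# 3. x86_64 only
cmake -GNinja -DMAX_SDK_PATH=../sdk -DCMAKE_OSX_ARCHITECTURES=x86_64 ..
```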
And just to clarify: which SDK would this point to in each example? AFAIK you cannot make a fat binary with the old SDK.
8.2 in all cases. I'm happy enough that we can build < 8.2 on x64
error1.txt None of these work and I think it's because of the lack of FLUID_ARCH. They correspond to the commands you wrote out.
Only indirectly. For the first two, it doesn't like the flags that I plonked in (for R-PI maybe). For the third, it's still picking up ARM in the platform test, so we'll need to find a better platform test
Anyway, thanks for looking at this. I'll borrow my beloved's m1 in due course and have a crack at fixing all this properly |
Feel free to bug me with more things to try, and sorry for the confusion somewhere in the middle there 😭
In case it's helpful (and not obvious), I believe that arm64 is the correct thing to pass for ARM M1 builds (that is what is happening in the recent work I did on M1-ifying some externals through c74)
Cheers. Yes, it seemed a bit much to expect whatever I'd pasted from somewhere else for R-PI or Bela or something to Just Work 😆 Having skimmed the CMake issues list, it looks like there are possible additional complications depending on whether one is running the Intel or ARM build of CMake on Apple Silicon, and that relying on the processor type isn't going to be useful for managing this, so we'll maybe need a different set of conditions for Apple. Now I have access to a Shiny Thing, I can mash the keyboard until it works.
The machine you are compiling on is irrelevant. The only relevant thing is your version of Xcode, but older versions simply ignore the arm64 architecture flag. Given that 32-bit Max is no longer a thing, that produces acceptable results in all circumstances for the versions of Max you are targeting - no?
(assuming you are passing both flags and making a UB2)
Thanks – Xcode and CMake don't behave the same here (from what I gathered earlier); the issue isn't the machine – different, and not wholly predictable, host processors were getting reported depending on whether CMake was running under Rosetta or not. It's also not clear that one will get a UB2 by default without manually setting the flags, the upshot being that I need more intelligent probing CMake code than I currently have.
Well, my suggestion is that you manually set the flags in all cases to x86_64 arm64 (a UB2) - what is the case that you don't think this will satisfy? In other words, I am unclear as to why you need to know the processor of the machine you are building on. The default for macOS debug builds is native architecture only, so if you want to avoid building both in that circumstance you need to target configurations, but you still wouldn't need to know the architecture of the build machine.
They came from somewhere. Maybe it was Bela, not R-PI 🤷 Anyway, the happy news seems to be that I might be able to get rid of some complexity in the build by ditching this whole thing.
@jamesb93 I've now made it so that the silly things shouldn't happen on arm64. So it should work with nothing in FLUID_ARCH. I have high hopes for:
I get this error atm. I've looked at the

CMake Error at /Users/james/dev/flucoma/core/script/flucoma_simdcmd.cmake:27 (endif):
  Flow control statements are not properly nested.
Call Stack (most recent call first):
  /Users/james/dev/flucoma/core/CMakeLists.txt:184 (include)

EDIT: here was my build command:
Whoops. Extra endif()
I can build an intel 64-bit binary that works under Rosetta:
This is with the 8.0.3 Max SDK. Building a fat binary with the 8.0.3 SDK doesn't work, and when I point to the beta 8.2 I get this error: with the following command:
So I assume it's not accounting for the change in structure of the SDK (yet?).
Ok. I would expect this to work in conjunction with flucoma/flucoma-max#32 because that deals with the SDK layout changes. |
Testing against flucoma/flucoma-max#32 with:
gives:
Gah. I see what I did wrong there. Try now? |
CMake Error at /Users/james/dev/flucoma/core/script/flucoma_simdcmd.cmake:13:
-- Configuring incomplete, errors occurred!
🤦 and now? |
Works to make a fat boi! 🍬
Just testing to see if the old SDK + building an intel binary still works
yes! woo - nice work @weefuzzy
Sooooooo my list of tests for this:
Then we talked about performance loss assessment (to see if that is a real problem), and I had taken notes of that list a long time ago:
I expect a little loss and that is fine, as we agreed. I'll also try with AVX512 to see if that gives any gain for realz. Am I forgetting anything?
It might be worth noting that the speed differential between SSE and AVX may have complex links to the specifics of the architecture of the testing machine - in other words, if you know how much faster it is on one machine, that might not translate directly to another one. The gains will also depend on how much of your code is likely to make direct use of AVX. The larger vector size only guarantees that equivalent operations can be done on more data with fewer CPU instructions; it may, however, be the case that the AVX instructions take more cycles, and hence the relationship to actual performance gains is complex on these levels (before you even think about memory speed etc.)

If there are significant benefits to AVX (or AVX512) then I can talk through how you might do runtime selection in a manageable fashion, but it is not a simple issue at all, as you have to compile different files with different compiler flags and create variants of the code for all relevant sections, which also bloats code size. The selection code must also use only the lowest guaranteed instructions. The simplest way of achieving all of this, which I've considered for my externals/other projects, would be to compile variants of the entire object and then just branch to the right one at initialisation, but that's least efficient in terms of the memory used to load the code. A simpler option is to provide multiple compiles directly, but that won't play nice with the Max package manager etc.
Thanks for those useful notes. I'd certainly need a great deal of convincing to sink time into runtime branching at this point 😆 I'm not expecting huge differences, tbh. We can probably make a decent stab at predicting whether it would make much difference at all by looking at the generated assembly. For instance, notionally the most expensive bit about querying a KD tree is the distance calculations, but these may not get vectorised at all (and the cost of the calculation itself might be smaller than the time lost to cache misses). So I agree with the thrust of Alex's advice, as I read it: doing a sanity check is wholly appropriate, but there's no point reading too much into the detail of the results. For the future, there is much we could do to find out what the compiler is able to vectorise, and to change code to make its job easier.
I really wouldn't expect auto-vectorisation on things like a kd-tree - the compiler is only likely to auto-vectorise fairly simple loops dealing with consistently laid-out memory - you are unlikely to have much code of that nature, given Gerard's coding style. I have seen little in the codebase that I would expect to auto-vectorise. Looking at the assembly is IMHO unlikely to tell you much at all, unless you are willing to sit with some reference manuals that tell you how many CPU cycles various instructions take on the kind of hardware you expect to run on - that's a very time-consuming pursuit.

The biggest wins I would expect would likely be in large Eigen operations where the code is explicitly written with intrinsics (there's also some explicitly vectorised stuff in my code, but most of it affects only the transient objects, with the exception of the FFT on Windows - if you have something for which the FFT is a big cost then that would be worth testing). So, things with big matrix operations (there's a solver in the transient stuff, for instance, which is quite a slow part of the code). That might be useful in figuring out which objects are likely to be candidates for more detailed testing.
Ah sorry - the distance measure in KDTree is done using vectors in Eigen, so I would expect explicit vectorisation there. Testing it on datasets with large dimensionality, rather than low dimensionality and lots of data, is likely to expose whether that might translate into real results.
I go into this with the open-minded expectation that losing 10-20% is worth it, especially if we make compiling with local optimisations so straightforward (I'll try our instructions as part of this process). The idea of accessibility is very near to my heart, and a beginner with an old machine having to compile to get around the optimisations is not as inclusive as asking the other way round.
speedtest report:
- nmf: 2400-2600; Max job takes 2500-2700 / 2.75-2.87, SC job takes 2.52-2.72 (faster in SC and slightly slower in Max)
- kdtree: 16k items of 200 cols, fit and query same ballpark in Max (55ms and 7ms)
- UMAP: help file but 500 iterations, same ballpark (1750-1850)
- fluidtransients~: help but order+window @100 (55% CPU in AM)
- sines: help but 64 hop and 16364 fft size (46% CPU in AM)
ok, after testing the same stuff in SC I again get a little less power, but quite insignificant. So I vote for merging
WOO🪨 |
As discussed. Between folk with new M1 machines and folk with older machines, having AVX on by default creates problems out of proportion to the benefits. This change uses SSE by default instead on x64/x86, and people can still set their own preferred options using FLUID_ARCH.

@jamesb93 There are SIMD options for ARM by default, but TBH I don't know how sensible they are. Are you using these on your M1 builds?
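The default-selection logic described here might look something like this sketch (variable handling, flag choices, and the platform tests are assumptions for illustration, not the actual contents of flucoma_simdcmd.cmake):

```cmake
# If the user supplied FLUID_ARCH, respect it; otherwise default to an
# SSE baseline on Intel instead of AVX. (Sketch only.)
if(NOT DEFINED FLUID_ARCH)
  if(APPLE AND CMAKE_OSX_ARCHITECTURES MATCHES "arm64")
    set(FLUID_ARCH "") # compiler defaults are fine for arm64
  elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "x86|AMD64|amd64")
    set(FLUID_ARCH -msse -msse2) # SSE baseline rather than -mavx
  endif()
endif()
add_compile_options(${FLUID_ARCH})
```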